At the third assessment, the complex indicator category did not reach the ≥75% threshold for acceptable reproducibility, but the simple indicator category barely reached the 75% limit (Figure 3). In the first evaluation, neither of the two indicator categories, simple or complex, had an acceptable IRR. IRR values improved after two evaluations for both categories, with a 15-percentage-point improvement between the first and third evaluations for simple indicators (p = 0.475) and a 13-percentage-point improvement for complex indicators (p = 0.558).

The most basic measure of inter-rater reliability is percent agreement between raters. Health program managers need access to reliable information to identify problems, monitor progress, and make evidence-based decisions. This information is often obtained through indicator-based tools, but the reliability of those indicators is frequently unknown. By examining the IRR of the assessment tool's indicators, we identified problems in the way supervisors understood and calculated indicators. Our study suggests that targeted, multi-channel efforts, including training, tool revisions, and repeated instruction, can improve the reproducibility of indicator-based assessments. We now have a set of indicators with an average IRR score of 72%, just shy of the acceptable level, and three of the five domains achieved an acceptable IRR of ≥75%. We learned that it is preferable, where possible, to use simple binary indicators when developing an indicator-based assessment tool, and that evaluating and improving IRR should be an iterative process. Uniform standards for data reproducibility, evaluation methods, and best practices for assessing indicator IRR would allow more programs in resource-limited countries to improve data quality.

For indicators with sub-questions or sub-indicators, team agreement was assessed separately for each sub-question and then averaged across the sub-questions belonging to that indicator. To measure IRR for an indicator, we calculated the average percent agreement across all MMS teams (i.e., the proportion of teams that achieved 100% agreement).
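As a rough illustration of how an indicator-level IRR score can be computed from sub-question agreement, the Python sketch below follows the averaging approach described above; the data layout, function names, and example values are hypothetical rather than taken from the study.

```python
# A minimal sketch of the IRR-score calculation described above, under stated
# assumptions: the data layout, function names, and example values are
# hypothetical and not taken from the study. For each MMS team, an indicator
# is represented as a list of sub-question results, where True means the
# team's raters agreed on that sub-question.

from statistics import mean

def team_agreement(subquestion_agreements: list[bool]) -> float:
    """Percent agreement for one team on one indicator: agreement on each
    sub-question, averaged across the sub-questions for that indicator."""
    return 100 * mean(1.0 if agreed else 0.0 for agreed in subquestion_agreements)

def indicator_irr(teams: list[list[bool]]) -> float:
    """IRR score for an indicator: average percent agreement across all teams.
    For simple binary indicators (one sub-question per team), this equals the
    proportion of teams in 100% agreement."""
    return mean(team_agreement(subs) for subs in teams)

# Illustrative example: three teams scoring an indicator with two sub-questions.
teams = [
    [True, True],     # full agreement    -> 100%
    [True, False],    # partial agreement -> 50%
    [False, False],   # no agreement      -> 0%
]
print(f"Indicator IRR score: {indicator_irr(teams):.0f}%")  # 50%
```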

A figure illustrating the IRR score calculation for indicators, sub-indicators, and domains is provided (Additional File 3).

Kappa statistics are often used to measure inter-rater reliability. The importance of inter-rater reliability lies in the fact that it represents the extent to which the data collected in the study are correct representations of the variables measured. The measurement of the extent to which data collectors (raters) assign the same score to the same variable is called inter-rater reliability. Although there are many methods for measuring inter-rater reliability, it has traditionally been measured as percent agreement, calculated as the number of agreement scores divided by the total number of scores. In 1960, Jacob Cohen criticized the use of percent agreement because of its inability to account for chance agreement.
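To make the distinction concrete, the short Python sketch below computes both percent agreement and Cohen's kappa for two hypothetical raters on ten binary items; the ratings and variable names are invented for illustration and do not come from the study data.

```python
# A short sketch contrasting percent agreement with Cohen's kappa for two
# hypothetical raters scoring ten binary items; the ratings below are invented
# for illustration only. Kappa corrects for the agreement expected by chance,
# which percent agreement ignores.

from collections import Counter

rater_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
rater_b = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
n = len(rater_a)

# Percent agreement: number of agreeing scores divided by the total number of scores.
p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: probability that both raters assign the same category at
# random, estimated from each rater's marginal category frequencies.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))

# Cohen's kappa: observed agreement corrected for chance agreement.
kappa = (p_o - p_e) / (1 - p_e)

print(f"Percent agreement: {p_o:.0%}")  # 80%
print(f"Cohen's kappa: {kappa:.2f}")    # 0.52, noticeably lower once chance is removed
```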