Inter-rater reliability
In statistics, inter-rater reliability (also called by various similar names, such as inter-rater agreement, inter-rater concordance, inter-observer reliability, inter-coder reliability, and so on) is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.
Assessment tools that rely on ratings must exhibit good inter-rater reliability, otherwise they are not valid tests.
There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are joint-probability of agreement, such as Cohen's kappa, Scott's pi and Fleiss' kappa; or inter-rater correlation, concordance correlation coefficient, intra-class correlation, and Krippendorff's alpha.
Statistics[edit]
Joint probability of agreement[edit]
The joint-probability of agreement is the simplest and the least robust measure. It is estimated as the percentage of the time the raters agree in a nominal or categorical rating system. It does not take into account the fact that agreement may happen solely based on chance. There is some question whether or not there is a need to 'correct' for chance agreement; some suggest that, in any case, any such adjustment should be based on an explicit model of how chance and error affect raters' decisions.[3]
When the number of categories being used is small (e.g. 2 or 3), the likelihood for 2 raters to agree by pure chance increases dramatically. This is because both raters must confine themselves to the limited number of options available, which impacts the overall agreement rate, and not necessarily their propensity for "intrinsic" agreement (an agreement is considered "intrinsic" if it is not due to chance).
Therefore, the joint probability of agreement will remain high even in the absence of any "intrinsic" agreement among raters. A useful inter-rater reliability coefficient is expected (a) to be close to 0 when there is no "intrinsic" agreement and (b) to increase as the "intrinsic" agreement rate improves. Most chance-corrected agreement coefficients achieve the first objective. However, the second objective is not achieved by many known chance-corrected measures.[4]
Disagreement[edit]
For any task in which multiple raters are useful, raters are expected to disagree about the observed target. By contrast, situations involving unambiguous measurement, such as simple counting tasks (e.g. number of potential customers entering a store), often do not require more than one person performing the measurement.
Measurement involving ambiguity in characteristics of interest in the rating target are generally improved with multiple trained raters. Such measurement tasks often involve subjective judgment of quality. Examples include ratings of physician 'bedside manner', evaluation of witness credibility by a jury, and presentation skill of a speaker.
Variation across raters in the measurement procedures and variability in interpretation of measurement results are two examples of sources of error variance in rating measurements. Clearly stated guidelines for rendering ratings are necessary for reliability in ambiguous or challenging measurement scenarios.
Without scoring guidelines, ratings are increasingly affected by experimenter's bias, that is, a tendency of rating values to drift towards what is expected by the rater. During processes involving repeated measurements, correction of rater drift can be addressed through periodic retraining to ensure that raters understand guidelines and measurement goals.