Intraclass correlation

In statistics, the intraclass correlation, or the intraclass correlation coefficient (ICC),^[1] is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other. While it is viewed as a type of correlation, unlike most other correlation measures, it operates on data structured as groups rather than data structured as paired observations.

The intraclass correlation is commonly used to quantify the degree to which individuals with a fixed degree of relatedness (e.g. full siblings) resemble each other in terms of a quantitative trait (see heritability). Another prominent application is the assessment of consistency or reproducibility of quantitative measurements made by different observers measuring the same quantity.

Relationship to Pearson's correlation coefficient[edit]

In terms of its algebraic form, Fisher's original ICC is the ICC that most resembles the Pearson correlation coefficient. One key difference between the two statistics is that in the ICC, the data are centered and scaled using a pooled mean and standard deviation, whereas in the Pearson correlation, each variable is centered and scaled by its own mean and standard deviation. This pooled scaling for the ICC makes sense because all measurements are of the same quantity (albeit on units in different groups). For example, in a paired data set where each "pair" is a single measurement made for each of two units (e.g., weighing each twin in a pair of identical twins) rather than two different measurements for a single unit (e.g., measuring height and weight for each individual), the ICC is a more natural measure of association than Pearson's correlation.

An important property of the Pearson correlation is that it is invariant to application of separate linear transformations to the two variables being compared. Thus, if we are correlating X and Y, where, say, Y = 2X + 1, the Pearson correlation between X and Y is 1 — a perfect correlation. This property does not make sense for the ICC, since there is no basis for deciding which transformation is applied to each value in a group. However, if all the data in all groups are subjected to the same linear transformation, the ICC does not change.

Use in assessing conformity among observers[edit]

The ICC is used to assess the consistency, or conformity, of measurements made by multiple observers measuring the same quantity.^[11] For example, if several physicians are asked to score the results of a CT scan for signs of cancer progression, we can ask how consistent the scores are to each other. If the truth is known (for example, if the CT scans were on patients who subsequently underwent exploratory surgery), then the focus would generally be on how well the physicians' scores matched the truth. If the truth is not known, we can only consider the similarity among the scores. An important aspect of this problem is that there is both inter-observer and intra-observer variability. Inter-observer variability refers to systematic differences among the observers — for example, one physician may consistently score patients at a higher risk level than other physicians. Intra-observer variability refers to deviations of a particular observer's score on a particular patient that are not part of a systematic difference.

The ICC is constructed to be applied to exchangeable measurements — that is, grouped data in which there is no meaningful way to order the measurements within a group. In assessing conformity among observers, if the same observers rate each element being studied, then systematic differences among observers are likely to exist, which conflicts with the notion of exchangeability. If the ICC is used in a situation where systematic differences exist, the result is a composite measure of intra-observer and inter-observer variability. One situation where exchangeability might reasonably be presumed to hold would be where a specimen to be scored, say a blood specimen, is divided into multiple aliquots, and the aliquots are measured separately on the same instrument. In this case, exchangeability would hold as long as no effect due to the sequence of running the samples was present.

Since the intraclass correlation coefficient gives a composite of intra-observer and inter-observer variability, its results are sometimes considered difficult to interpret when the observers are not exchangeable. Alternative measures such as Cohen's kappa statistic, the Fleiss kappa, and the concordance correlation coefficient^[12] have been proposed as more suitable measures of agreement among non-exchangeable observers.

One-way random effects: each subject is measured by a different set of k randomly selected raters;

Two-way random: k raters are randomly selected, then, each subject is measured by the same set of k raters;

Two-way mixed: k fixed raters are defined. Each subject is measured by the k raters.

ICC is supported in the open source software package R (using the function "icc" with the packages psy or irr, or via the function "ICC" in the package psych.) The rptR package ^[13] provides methods for the estimation of ICC and repeatabilities for Gaussian, binomial and Poisson distributed data in a mixed-model framework. Notably, the package allows estimation of adjusted ICC (i.e. controlling for other variables) and computes confidence intervals based on parametric bootstrapping and significances based on the permutation of residuals. Commercial software also supports ICC, for instance Stata or SPSS^[14]

The three models are:

Number of measurements:

Consistency or absolute agreement:

The consistency ICC cannot be estimated in the one-way random effects model, as there is no way to separate the inter-rater and residual variances.

An overview and re-analysis of the three models for the single measures ICC, with an alternative recipe for their use, has also been presented by Liljequist et al. (2019).^[18]

Less than 0.40—poor.

Between 0.40 and 0.59—fair.

Between 0.60 and 0.74—good.

Between 0.75 and 1.00—excellent.

Cicchetti (1994)^[19] gives the following often quoted guidelines for interpretation for kappa or ICC inter-rater agreement measures:

A different guideline is given by Koo and Li (2016):^[20]

Correlation ratio

Design effect

Effect size#Eta-squared (η2)

A comparison of two indices for the intraclass correlation coefficient

AgreeStat 360: cloud-based inter-rater reliability analysis, Cohen's kappa, Gwet's AC1/AC2, Krippendorff's alpha, Brennan-Prediger, Fleiss generalized kappa, intraclass correlation coefficients