Pearson correlation coefficient
In statistics, the Pearson correlation coefficient (PCC)[a] is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of teenagers from a high school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation).
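As an illustration of the defining ratio, the short sketch below computes the sample coefficient directly as the covariance divided by the product of the standard deviations; the NumPy usage, variable names, and synthetic age–height data are illustrative assumptions, not taken from the text above.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation: covariance of x and y divided by
    the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))    # sample covariance (1/n form)
    return cov / (x.std() * y.std())                  # normalized; result lies in [-1, 1]

# Illustrative data: height roughly increases with age, plus noise.
rng = np.random.default_rng(0)
age = rng.uniform(13, 18, size=200)
height = 120 + 4 * age + rng.normal(0, 6, size=200)

r = pearson_r(age, height)
print(round(r, 3))                                    # well above 0 but below 1, as in the example above
print(np.isclose(r, np.corrcoef(age, height)[0, 1])) # True: matches NumPy's built-in
```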
Naming and history
It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s; the mathematical formula had been derived and published by Auguste Bravais in 1844.[b][6][7][8][9] The naming of the coefficient is thus an example of Stigler's law.
Mathematical properties
The values of both the sample and population Pearson correlation coefficients are on or between −1 and 1. Correlations equal to +1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation). The Pearson correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
A key mathematical property of the Pearson correlation coefficient is that it is invariant under separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. (This holds for both the population and sample Pearson correlation coefficients.) More general linear transformations do change the correlation: see § Decorrelation of n random variables for an application of this.
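This invariance is easy to verify numerically. The following is a minimal sketch; the constants a, b, c, d and the synthetic data are arbitrary choices for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)                   # y correlated with x

r_xy = np.corrcoef(x, y)[0, 1]

# Separate location/scale changes with positive scale factors b, d > 0.
a, b, c, d = 5.0, 3.0, -2.0, 0.5
r_transformed = np.corrcoef(a + b * x, c + d * y)[0, 1]

print(np.isclose(r_xy, r_transformed))               # True: correlation unchanged
print(np.isclose(r_xy, np.corrcoef(y, x)[0, 1]))     # True: corr(X, Y) = corr(Y, X)
print(np.isclose(np.corrcoef(x, -y)[0, 1], -r_xy))   # True: a negative scale factor flips the sign
```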
The square of the sample correlation coefficient is typically denoted r2 and is a special case of the coefficient of determination. In this case, it estimates the fraction of the variance in Y that is explained by X in a simple linear regression. So if we have the observed dataset $Y_1, \dots, Y_n$ and the fitted dataset $\hat{Y}_1, \dots, \hat{Y}_n$, then as a starting point the total variation in the $Y_i$ around their average value can be decomposed as follows

$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (Y_i - \hat{Y}_i)^2 + \sum_i (\hat{Y}_i - \bar{Y})^2,$$

where the $\hat{Y}_i$ are the fitted values from the regression analysis. This can be rearranged to give

$$1 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2} + \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}.$$
The two summands above are the fraction of variance in Y that is explained by X (right) and that is unexplained by X (left).
Next, we apply a property of least squares regression models, that the sample covariance between $\hat{Y}_i$ and $Y_i - \hat{Y}_i$ is zero. Thus, the sample correlation coefficient between the observed and fitted response values in the regression can be written (calculation is under expectation, assumes Gaussian statistics)

$$\begin{aligned}
r(Y,\hat{Y}) &= \frac{\operatorname{Cov}(Y_i, \hat{Y}_i)}{\sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(\hat{Y}_i)}} \\
&= \frac{\operatorname{Cov}(Y_i - \hat{Y}_i + \hat{Y}_i, \hat{Y}_i)}{\sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(\hat{Y}_i)}} \\
&= \frac{\operatorname{Cov}(Y_i - \hat{Y}_i, \hat{Y}_i) + \operatorname{Var}(\hat{Y}_i)}{\sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(\hat{Y}_i)}} \\
&= \frac{\operatorname{Var}(\hat{Y}_i)}{\sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(\hat{Y}_i)}} \\
&= \sqrt{\frac{\operatorname{Var}(\hat{Y}_i)}{\operatorname{Var}(Y_i)}}.
\end{aligned}$$

Thus

$$r(Y,\hat{Y})^2 = \frac{\operatorname{Var}(\hat{Y}_i)}{\operatorname{Var}(Y_i)},$$

where $r(Y,\hat{Y})^2$ is the proportion of variance in Y explained by a linear function of X.
In the derivation above, the fact that

$$\operatorname{Cov}(Y_i - \hat{Y}_i, \hat{Y}_i) = 0$$

can be proved by noticing that the partial derivatives of the residual sum of squares (RSS) over β0 and β1 are equal to 0 in the least squares model, where

$$\mathrm{RSS} = \sum_i (Y_i - \hat{Y}_i)^2.$$
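As a brief sketch of that argument, written here for the usual simple-regression fitted values $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$ (notation introduced only for this sketch): setting the partial derivatives of RSS to zero gives

$$\frac{\partial \mathrm{RSS}}{\partial \beta_0} = -2\sum_i \left(Y_i - \hat{Y}_i\right) = 0, \qquad \frac{\partial \mathrm{RSS}}{\partial \beta_1} = -2\sum_i \left(Y_i - \hat{Y}_i\right) X_i = 0,$$

so the residuals sum to zero and are orthogonal to the $X_i$. Consequently

$$\sum_i \left(Y_i - \hat{Y}_i\right)\hat{Y}_i = \hat\beta_0 \sum_i \left(Y_i - \hat{Y}_i\right) + \hat\beta_1 \sum_i \left(Y_i - \hat{Y}_i\right) X_i = 0,$$

and, because the residuals have mean zero, $\operatorname{Cov}(Y_i - \hat{Y}_i, \hat{Y}_i) \propto \sum_i (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = 0$.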
In the end, the equation can be written as

$$r(Y,\hat{Y})^2 = \frac{SS_\text{reg}}{SS_\text{tot}},$$

where

$$SS_\text{reg} = \sum_i (\hat{Y}_i - \bar{Y})^2, \qquad SS_\text{tot} = \sum_i (Y_i - \bar{Y})^2.$$

The symbol $SS_\text{reg}$ is called the regression sum of squares, also called the explained sum of squares, and $SS_\text{tot}$ is the total sum of squares (proportional to the variance of the data).
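The identities above can be checked numerically on a small example. The sketch below (NumPy only, with synthetic data and illustrative variable names) fits a least-squares line and verifies the variance decomposition together with $r^2 = SS_\text{reg}/SS_\text{tot} = \operatorname{Var}(\hat{Y})/\operatorname{Var}(Y)$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 1.5 + 0.8 * x + rng.normal(size=300)

# Ordinary least-squares fit of y on x (simple linear regression).
beta1, beta0 = np.polyfit(x, y, 1)        # polyfit returns [slope, intercept] for degree 1
y_hat = beta0 + beta1 * x

ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
ss_res = np.sum((y - y_hat) ** 2)         # residual sum of squares

r = np.corrcoef(x, y)[0, 1]

print(np.isclose(ss_tot, ss_reg + ss_res))          # True: variance decomposition holds
print(np.isclose(r**2, ss_reg / ss_tot))            # True: r^2 = SS_reg / SS_tot
print(np.isclose(r**2, np.var(y_hat) / np.var(y)))  # True: r^2 = Var(Y_hat) / Var(Y)
```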