Pearson correlation coefficient
In statistics, the Pearson correlation coefficient (PCC)[a] is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of teenagers from a high school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation).
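As an illustration of the defining ratio, the short sketch below computes the sample coefficient directly as the covariance divided by the product of the standard deviations; the NumPy usage, variable names, and synthetic age–height data are illustrative assumptions, not taken from the text above.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation: covariance of x and y divided by
    the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))    # sample covariance (1/n form)
    return cov / (x.std() * y.std())                  # normalized; result lies in [-1, 1]

# Illustrative data: height roughly increases with age, plus noise.
rng = np.random.default_rng(0)
age = rng.uniform(13, 18, size=200)
height = 120 + 4 * age + rng.normal(0, 6, size=200)

r = pearson_r(age, height)
print(round(r, 3))                                    # well above 0 but below 1, as in the example above
print(np.isclose(r, np.corrcoef(age, height)[0, 1])) # True: matches NumPy's built-in
```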
Naming and history
It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s; the mathematical formula had been derived and published by Auguste Bravais in 1844.[b][6][7][8][9] The naming of the coefficient is thus an example of Stigler's law.
Mathematical properties
The values of both the sample and population Pearson correlation coefficients are on or between −1 and 1. Correlations equal to +1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation). The Pearson correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
A key mathematical property of the Pearson correlation coefficient is that it is invariant under separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. (This holds for both the population and sample Pearson correlation coefficients.) More general linear transformations do change the correlation: see § Decorrelation of n random variables for an application of this.
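This invariance is easy to verify numerically. The following is a minimal sketch; the constants a, b, c, d and the synthetic data are arbitrary choices for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)                   # y correlated with x

r_xy = np.corrcoef(x, y)[0, 1]

# Separate location/scale changes with positive scale factors b, d > 0.
a, b, c, d = 5.0, 3.0, -2.0, 0.5
r_transformed = np.corrcoef(a + b * x, c + d * y)[0, 1]

print(np.isclose(r_xy, r_transformed))               # True: correlation unchanged
print(np.isclose(r_xy, np.corrcoef(y, x)[0, 1]))     # True: corr(X, Y) = corr(Y, X)
print(np.isclose(np.corrcoef(x, -y)[0, 1], -r_xy))   # True: a negative scale factor flips the sign
```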
The square of the sample correlation coefficient is typically denoted r2 and is a special case of the coefficient of determination. In this case, it estimates the fraction of the variance in Y that is explained by X in a simple linear regression. So if we have the observed dataset $Y_1, \dots, Y_n$ and the fitted dataset $\hat{Y}_1, \dots, \hat{Y}_n$, then as a starting point the total variation in the $Y_i$ around their average value can be decomposed as follows

$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (Y_i - \hat{Y}_i)^2 + \sum_i (\hat{Y}_i - \bar{Y})^2,$$

where the $\hat{Y}_i$ are the fitted values from the regression analysis. This can be rearranged to give

$$1 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2} + \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}.$$
The two summands above are the fraction of variance in Y that is explained by X (right) and that is unexplained by X (left).
Next, we apply a property of least squares regression models, that the sample covariance between $\hat{Y}_i$ and $Y_i - \hat{Y}_i$ is zero. Thus, the sample correlation coefficient between the observed and fitted response values in the regression can be written (calculation is under expectation, assumes Gaussian statistics)

$$\begin{aligned}
r(Y,\hat{Y}) &= \frac{\operatorname{Cov}(Y_i, \hat{Y}_i)}{\sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(\hat{Y}_i)}} \\
&= \frac{\operatorname{Cov}(Y_i - \hat{Y}_i + \hat{Y}_i, \hat{Y}_i)}{\sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(\hat{Y}_i)}} \\
&= \frac{\operatorname{Cov}(Y_i - \hat{Y}_i, \hat{Y}_i) + \operatorname{Var}(\hat{Y}_i)}{\sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(\hat{Y}_i)}} \\
&= \frac{\operatorname{Var}(\hat{Y}_i)}{\sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(\hat{Y}_i)}} \\
&= \sqrt{\frac{\operatorname{Var}(\hat{Y}_i)}{\operatorname{Var}(Y_i)}}.
\end{aligned}$$

Thus

$$r(Y,\hat{Y})^2 = \frac{\operatorname{Var}(\hat{Y}_i)}{\operatorname{Var}(Y_i)},$$

where $r(Y,\hat{Y})^2$ is the proportion of variance in Y explained by a linear function of X.
In the derivation above, the fact that

$$\operatorname{Cov}(Y_i - \hat{Y}_i, \hat{Y}_i) = 0$$

can be proved by noticing that the partial derivatives of the residual sum of squares (RSS) over β0 and β1 are equal to 0 in the least squares model, where

$$\mathrm{RSS} = \sum_i (Y_i - \hat{Y}_i)^2.$$
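As a brief sketch of that argument, written here for the usual simple-regression fitted values $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$ (notation introduced only for this sketch): setting the partial derivatives of RSS to zero gives

$$\frac{\partial \mathrm{RSS}}{\partial \beta_0} = -2\sum_i \left(Y_i - \hat{Y}_i\right) = 0, \qquad \frac{\partial \mathrm{RSS}}{\partial \beta_1} = -2\sum_i \left(Y_i - \hat{Y}_i\right) X_i = 0,$$

so the residuals sum to zero and are orthogonal to the $X_i$. Consequently

$$\sum_i \left(Y_i - \hat{Y}_i\right)\hat{Y}_i = \hat\beta_0 \sum_i \left(Y_i - \hat{Y}_i\right) + \hat\beta_1 \sum_i \left(Y_i - \hat{Y}_i\right) X_i = 0,$$

and, because the residuals have mean zero, $\operatorname{Cov}(Y_i - \hat{Y}_i, \hat{Y}_i) \propto \sum_i (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = 0$.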
In the end, the equation can be written as

$$r(Y,\hat{Y})^2 = \frac{SS_\text{reg}}{SS_\text{tot}},$$

where

$$SS_\text{reg} = \sum_i (\hat{Y}_i - \bar{Y})^2, \qquad SS_\text{tot} = \sum_i (Y_i - \bar{Y})^2.$$

The symbol $SS_\text{reg}$ is called the regression sum of squares, also called the explained sum of squares, and $SS_\text{tot}$ is the total sum of squares (proportional to the variance of the data).
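The identities above can be checked numerically on a small example. The sketch below (NumPy only, with synthetic data and illustrative variable names) fits a least-squares line and verifies the variance decomposition together with $r^2 = SS_\text{reg}/SS_\text{tot} = \operatorname{Var}(\hat{Y})/\operatorname{Var}(Y)$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 1.5 + 0.8 * x + rng.normal(size=300)

# Ordinary least-squares fit of y on x (simple linear regression).
beta1, beta0 = np.polyfit(x, y, 1)        # polyfit returns [slope, intercept] for degree 1
y_hat = beta0 + beta1 * x

ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
ss_res = np.sum((y - y_hat) ** 2)         # residual sum of squares

r = np.corrcoef(x, y)[0, 1]

print(np.isclose(ss_tot, ss_reg + ss_res))          # True: variance decomposition holds
print(np.isclose(r**2, ss_reg / ss_tot))            # True: r^2 = SS_reg / SS_tot
print(np.isclose(r**2, np.var(y_hat) / np.var(y)))  # True: r^2 = Var(Y_hat) / Var(Y)
```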