Correlation and Linear Regression Analysis

- Harvey Dent: You either die a hero or you live long enough to see yourself become the villain.

The Dark Knight

This article is about the correlation between two (or more) variables in general, but especially in the context of linear regression. We will explain two different correlation coefficients, give examples, discuss the saying “correlation does not equal causality” and show the connection to the coefficient of determination R² of linear regression.

In linear regression models, we are particularly interested in the parameters βi, which represent the slope of the regression line. However, we also ask how much additional information these parameters can (or cannot) provide. In particular, we will consider what the data behind an estimated coefficient β̂i might look like. Analyzing the data behind the regression model is important because we cannot explain all data patterns using a regression coefficient alone. There are constellations that leave room for misinterpretation. This applies in particular to the linearity (dispersion) and monotonicity (direction) of the data - both relevant aspects for linear regression.

[Figure: linear regression with outliers]

In cases like the ones above, the estimated regression coefficients β̂1, i.e. the slopes of the fitted lines, reflect the relationship between X and Y “incorrectly”. The coefficients are technically correct, as the OLS method “only” calculates the regression line with the smallest squared deviations. However, the relationships are distorted by outliers or non-linearity. Such outliers could be caused by a measurement error, for example. The actual cause must nevertheless be investigated in detail, as some outliers can also reflect “real” causal relationships.
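To make the outlier problem concrete, here is a minimal sketch in pure Python with made-up data: four points lying exactly on y = 2x, plus one hypothetical outlier. A single bad point is enough to flip the sign of the OLS slope, even though the slope formula itself is computed correctly.

```python
# Illustrative sketch with invented data: one outlier flips the OLS slope.

def ols_slope(xs, ys):
    """Slope of the least-squares line: Cov(X, Y) / Var(X)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]                      # perfectly linear: y = 2x
print(ols_slope(x, y))                # 2.0

# One added outlier, e.g. a hypothetical measurement error:
print(ols_slope(x + [5], y + [-10]))  # -2.0: the sign of the slope flips
```

The OLS machinery does exactly what it is asked to do in both cases; it is the data, not the estimator, that misleads here.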

Correlation: Linearity and Monotonicity

How can such outliers or similar phenomena such as non-linearity in the data be measured? To describe data patterns that are not captured by the regression coefficients, there are statistical measures such as Bravais-Pearson's correlation coefficient ρ (Rho) or Spearman's rank correlation. To do this, we must first deal with the two associated statistical concepts: Linearity and monotonicity.

Linearity means that there is a linear relationship between the independent variables X1, ..., XN and the dependent variable Y. This means that changes in Y can be explained by a proportional change in one or more independent variables Xi. Visually, this means that the data points in a scatter plot lie along a straight line or are distributed close to this line.

If the data is linear, we can easily find a regression line using the ordinary least squares (OLS) method that minimizes the deviations between the predicted values Ŷi and the actual values Yi. This is important because the accuracy of the predictions of a linear regression depends heavily on the assumption of linearity. If the data follows a non-linear relationship, the model could be inaccurate and allow erroneous conclusions to be drawn.

Therefore, it is crucial to check the linearity of the data before performing a linear regression. This can be done by visually inspecting the data, calculating correlations or performing hypothesis tests. An understanding of linearity not only helps with model validation, but also with the interpretation of the results, as the coefficients of the regression equation represent the average change in the dependent variable Y per unit of the independent variable Xi.

[Figure: different correlations depending on linearity]

Monotonicity of the data points is also an important aspect in the context of linear regression, as this property can also provide information on the relationship between Xi and the dependent variable Y. A function or data series is called monotonic if it is either continuously increasing or decreasing. This means: Monotonicity measures the extent to which larger xi values are followed by larger yi values. Mathematically, this can be defined as:

  • monotonically increasing, if for any two points x1, x2 ∈ D with x1 < x2 the following applies: y1 ≤ y2 (strictly monotonically increasing if y1 < y2)
    If X increases, then Y does not decrease. It either remains the same or also increases (in the strict case, it always increases).
  • monotonically decreasing, if for any two points x1, x2 ∈ D with x1 < x2 the following applies: y1 ≥ y2 (strictly monotonically decreasing if y1 > y2)
    If X increases, then Y does not increase. It either remains the same or decreases (in the strict case, it always decreases).
In the case of linear regression, this means that an increase in the independent variable Xi leads to a predictable increase or decrease in the dependent variable Y. Monotonic data can simplify the analysis and interpretation of the regression results, as the relationship between the variables is clear and consistent. If the data is monotonic, the model can make more reliable predictions and the coefficients of the regression equation have a clearer meaning.

Non-monotonic data, on the other hand, may indicate a more complex relationship that cannot be captured by a simple linear model. In such cases, it may be necessary to consider transformations of the data, additional variables, or the application of non-linear models to achieve an appropriate fit. In addition, investigating monotonicity can help to identify potential problems such as multicollinearity or the presence of outliers that could affect the accuracy and reliability of the regression model. Overall, the monotonicity of the data plays a crucial role in the construction, interpretation and validation of linear regression models.
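The definitions above translate directly into code. Here is a minimal sketch of a monotonicity check; the helper name `is_monotonic_increasing` is our own, not taken from any particular library.

```python
# Minimal sketch: checking whether a data series is (strictly) monotonic.

def is_monotonic_increasing(ys, strict=False):
    """True if each value is >= its predecessor (or > when strict)."""
    if strict:
        return all(a < b for a, b in zip(ys, ys[1:]))
    return all(a <= b for a, b in zip(ys, ys[1:]))

print(is_monotonic_increasing([1, 2, 2, 5]))               # True
print(is_monotonic_increasing([1, 2, 2, 5], strict=True))  # False: the 2 repeats
print(is_monotonic_increasing([1, 3, 2]))                  # False
```

A monotonically decreasing check works the same way with the inequalities reversed.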

The question now: How do we measure monotonicity and linearity of our data to get a better and quantifiable picture of our data patterns? Answer: With the help of correlation coefficients.

The Bravais-Pearson Correlation Coefficient

The Bravais-Pearson correlation coefficient ρ (often simply called the Pearson correlation) is a metric that indicates the degree of linear correlation between two quantitative variables X and Y. In other words, this correlation indicates how closely the data points lie on a straight line (regardless of its slope - because the slope is captured by the regression coefficient of the linear regression model!).

The (Pearson) correlation coefficient is calculated in this way:

$$
\rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \, \sigma_Y}
= \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \, \sigma_Y}
= \frac{E[XY] - E[X]\,E[Y]}{\sqrt{E[X^2]-(E[X])^2}\,\sqrt{E[Y^2]-(E[Y])^2}}
= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
= \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_i x_i^2 - n\bar{x}^2}\,\sqrt{\sum_i y_i^2 - n\bar{y}^2}}
$$

  • Cov(X,Y):   the covariance of X and Y
  • σX and σY:   the standard deviations of X and Y
  • E[X] and E[Y]:   the expected values of X and Y. In a concrete sample, we use the mean for the calculation.

The correlation coefficient measures the degree of linearity of a linear relationship between two variables X and Y and is limited to a value between -1 and 1. Correlations close to zero mean that there is no linear relationship between the variables, while correlations close to -1 or +1 indicate a strong linear relationship. Intuitively, it can be said that the “easier” it is to draw a regression line through a scatter plot of X and Y, the more strongly the variables are correlated.

Calculation of the Bravais-Pearson Correlation Coefficient

Here is an example of how to calculate the Pearson correlation with numbers.

  1. Write down the x and y values of a sample
  2. Calculate the corresponding x², y² and x·y values from the values in step 1
  3. Use the values from steps 1 and 2 to calculate the sums over all observations (i = 1, ..., n)
  4. Calculate the mean values
  5. Calculate the Pearson correlation
A table that follows these instructions step-by-step could then look like this:

i    | X   | Y     | X²    | Y²      | X·Y
1    | 0.3 | 11.1  | 0.09  | 123.21  | 3.33
2    | 0.7 | 13.5  | 0.49  | 182.25  | 9.45
3    | 1.1 | 15.0  | 1.21  | 225.00  | 16.50
4    | 1.3 | 16.4  | 1.69  | 268.96  | 21.32
5    | 2.6 | 20.1  | 6.76  | 404.01  | 52.26
Sum  | 6.0 | 76.1  | 10.24 | 1203.43 | 102.86
Mean | 1.2 | 15.22 | 2.048 | 240.686 | 20.572

Then calculate intermediate results:
E[XY] - E[X]·E[Y] = 20.572 - 1.2 × 15.22 = 2.308
E[X²] - (E[X])² = 2.048 - (1.2)² = 0.608
E[Y²] - (E[Y])² = 240.686 - (15.22)² = 9.0376
And now calculate the correlation by inserting it into the formula:

$$
\rho_{X,Y} = \frac{E[XY] - E[X]\,E[Y]}{\sqrt{E[X^2]-(E[X])^2}\,\sqrt{E[Y^2]-(E[Y])^2}} = \frac{2.308}{\sqrt{0.608} \times \sqrt{9.0376}} = 0.9846
$$

In our example, the result is a positive correlation, which is very high as it is close to the maximum value of 1.
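The worked example above can be reproduced in a few lines of pure Python, using the sample form of the Pearson formula:

```python
# Reproducing the worked Pearson example from the table above.
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - x_mean) ** 2 for x in xs))
    sy = sqrt(sum((y - y_mean) ** 2 for y in ys))
    return cov / (sx * sy)

x = [0.3, 0.7, 1.1, 1.3, 2.6]
y = [11.1, 13.5, 15.0, 16.4, 20.1]
print(round(pearson(x, y), 4))  # 0.9846
```

The hand calculation and the code agree, since both evaluate the same formula.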

The Spearman Rank Correlation

Consider the following scenario, assuming the relationship between X and Y looks like this:

[Figure: correlation with non-linear data]

The Pearson correlation coefficient is particularly suitable for measuring linear relationships between X and Y, but is not the right metric for measuring non-linear but monotonically increasing (or decreasing) data. This is where the Spearman rank correlation coefficient comes in. The Spearman rank correlation captures monotonically increasing (or decreasing) data patterns, which can be linear or non-linear. The Spearman rank correlation also ranges between -1 and 1 and is calculated as follows:
$$
\rho_{SP} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)}
$$
where di stands for the difference between the ranks of xi and yi (see example below).

The Pearson and Spearman correlations are approximately the same for linear data, but the Pearson correlation understates the strength of non-linear (yet monotonic) data patterns. The Spearman correlation coefficient, on the other hand, indicates how consistently Y increases when X increases. It measures how well an arbitrary monotonic function can reflect the relationship between X and Y. No assumption is made about the shape of the distribution - it is therefore irrelevant for the Spearman correlation whether the relationship between X and Y is linear (or quadratic, cubic, etc.). The Spearman correlation only takes into account the ranks of the individual data points and answers the question: To what extent do higher Y values follow higher X values?
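This difference is easy to demonstrate. The sketch below uses invented data following y = x³, which is strictly monotonic but clearly non-linear: Spearman reaches exactly 1, while Pearson stays below it.

```python
# Sketch: Spearman vs. Pearson on monotonic but non-linear data (y = x^3).
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - xm) ** 2 for x in xs))
    sy = sqrt(sum((y - ym) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    def ranks(vs):
        order = sorted(vs)
        return [order.index(v) + 1 for v in vs]  # assumes no tied values
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = list(range(1, 11))
y = [v ** 3 for v in x]        # strictly monotonic, clearly non-linear

print(spearman(x, y))          # 1.0
print(pearson(x, y) < 1.0)     # True: Pearson understates the relationship
```

Note that this simple rank helper assumes no ties; real implementations assign averaged ranks to tied values.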

A Spearman correlation of 1 means that a higher X value is always followed by a higher Y value:

[Figure: strictly monotonically increasing data points]

Calculation of the Spearman Rank Correlation

Here is an example of how to calculate the Spearman correlation with concrete numbers.

  1. Write down the x and y values of a sample
  2. Assign ranks to the x and y values separately. The smallest x-value (or y-value) gets rank 1, then count up.
  3. Calculate the respective rank differences di = Rank(Xi) - Rank(Yi)
  4. Square the differences and calculate their sum
  5. Calculate the Spearman rank correlation
A table that follows these instructions step-by-step could then look like this:

i   | X    | Y   | Rank(Xi) | Rank(Yi) | di | di²
1   | 0.3  | 2.0 | 2        | 4        | -2 | 4
2   | 1.5  | 0.5 | 5        | 1        | 4  | 16
3   | -0.1 | 0.9 | 1        | 2        | -1 | 1
4   | 0.8  | 1.3 | 3        | 3        | 0  | 0
5   | 1.0  | 2.6 | 4        | 5        | -1 | 1
Sum |      |     |          |          |    | 22

The Spearman rank correlation coefficient can now be calculated:

$$
\rho_{SP} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 22}{5(5^2-1)} = -0.1
$$

The Spearman correlation is weakly negative in this example.
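The rank table above can likewise be reproduced in pure Python (the simple rank helper assumes no tied values, as in this example):

```python
# Reproducing the Spearman rank example from the table above.

def spearman(xs, ys):
    def ranks(vs):
        order = sorted(vs)
        return [order.index(v) + 1 for v in vs]  # assumes no tied values
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [0.3, 1.5, -0.1, 0.8, 1.0]
y = [2.0, 0.5, 0.9, 1.3, 2.6]
print(round(spearman(x, y), 4))  # -0.1
```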

Correlation Coefficients and Linear Regression

The Pearson correlation is closely related to linear regression and regression coefficients.

The regression coefficient measures the slope of the linear relationship between two variables X and Y and can take any value between -∞ and +∞. The regression coefficient is therefore the quantity that indicates the estimated change in Y for a given change in X. Specifically, βi indicates the change in the expected value of Y that corresponds to an increase in Xi by one unit. This information cannot be derived from the correlation coefficient alone. Slopes or regression coefficients close to zero mean that the target variable Y changes only slowly when the predictor variable X changes. Slopes further away from zero (in either the negative or positive direction) mean that Y changes faster when the predictor X changes. In other words, the further the regression coefficient is from zero, the steeper the (increasing or decreasing) linear regression line.

The Pearson correlation, on the other hand, has no dimension, i.e. its value is independent of the unit of X and Y with which it was calculated. More precisely, the correlation coefficient ranges in the interval [-1, 1]. The correlation therefore provides a metric with an upper and lower bound that can be interpreted independently of the scale of the two variables X and Y. The closer the estimated correlation is to ±1, the closer the two variables are to a perfect linear relationship. The regression slope or the regression coefficient in itself does not provide this information.
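The dimensionlessness of the correlation can be demonstrated directly. In the sketch below (invented data, with hypothetical units of kilometres vs. metres), rescaling X changes the regression slope by a factor of 1000 but leaves the Pearson correlation untouched:

```python
# Sketch: the slope depends on the units of X, the correlation does not.
from math import sqrt

def ols_slope(xs, ys):
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    return cov / sum((x - xm) ** 2 for x in xs)

def pearson(xs, ys):
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - xm) ** 2 for x in xs))
    sy = sqrt(sum((y - ym) ** 2 for y in ys))
    return cov / (sx * sy)

x_km = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical distances in kilometres
x_m = [v * 1000 for v in x_km]     # the same distances in metres
y = [2.1, 3.9, 6.2, 8.0, 9.8]

print(ols_slope(x_km, y))          # approx. 1.95 (per kilometre)
print(ols_slope(x_m, y))           # 1000 times smaller (per metre)
print(abs(pearson(x_km, y) - pearson(x_m, y)) < 1e-12)  # True: unchanged
```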

[Figure: linear regression and correlation]

In summary, this means that the Pearson correlation coefficient and the regression coefficient from the linear regression must have the same sign, but generally not the same value. In a sense, both parameters provide related information - each says something about the linear relationship between X and Y. However, they answer different questions.

Correlation and Causation: Spurious Correlations

- Apollo 13 Crew: "Houston, we have a problem!"

April 13, 1970

Correlation does not imply causation. Where have we heard that before? Everywhere. Every statistical analysis. What does that mean exactly? More on that now.

Specifically, this always involves spurious correlation. A spurious correlation occurs when two variables have a statistical relationship with each other, but this relationship is not based on a causal connection. In other words: the two variables appear to be related, but in reality they are not. This can be caused by the influence of a third, unaccounted variable or simply arise by chance.

[Figure: example of a spurious correlation]

For example, an analysis could show that the consumption of ice cream correlates with the number of drowning accidents in a given period. However, it would be wrong to conclude that the consumption of ice cream causes drowning. In this case, it is a spurious correlation due to a third variable: the weather. When the weather is warm, people tend to eat more ice cream and at the same time go swimming more often, which leads to an increased number of drowning accidents:

[Figure: explanation of the spurious correlation]

There are several reasons why spurious correlations "work":

  1. Direction of causality: Even if two variables are actually causally linked, it is not always clear which variable is the cause and which is the effect. This is known as the problem of “reverse causality”. A classic example is the correlation between insomnia and stress. It could be that insomnia causes stress, but it is just as likely that stress leads to insomnia.
  2. Missing variables: Often a third variable influences both the cause and the effect, creating the correlation between the other two variables. This third variable is referred to as a “confounding variable”. An example of this would be the correlation between educational level and life expectancy. Both are strongly linked to socioeconomic class, which is a confounding variable. Whether missing variables are relevant for the linear regression model depends on its purpose. It is important to clarify whether the linear regression is used for data analysis and explanations of relationships between X and Y, or for the prediction of new data points. More on this distinction in a separate article.
  3. Random correlation: In large data sets, purely random correlations can occur that do not reflect any meaningful relationship between the variables. This is particularly problematic in cases where large amounts of data are analyzed without a specific hypothesis.
To avoid the mistake of confusing correlation with causality, careful analysis and a methodical approach are required. Some steps that can help here are:
  1. Experimental studies: One of the most reliable methods for establishing causality is to conduct controlled experiments in which researchers deliberately change one variable and observe the effects on another variable.
  2. Consideration of missing variables: Potentially missing variables should be identified and controlled for in statistical analyses. This can be done, for example, through multivariate analyses in which several variables are examined simultaneously.
  3. Longitudinal studies: In contrast to cross-sectional studies, which collect data at a single point in time, longitudinal studies track the same variables over a longer period of time. This can help to identify causal relationships more clearly.
  4. Critical thinking: You should always be skeptical of statistical results and consider alternative explanations. It is important to recognize and question the possibility of spurious correlations before jumping to conclusions.

Correlation Coefficient and Coefficient of Determination R²

In another article, we learned that accuracy and precision are two indicators of the predictive quality of a linear regression model. However, this performance depends largely on the two properties above: the linearity and monotonicity of the data points. A high linear correlation and a monotonic relationship between the dependent (Y) and independent (Xi) variables significantly influence the predictive power of the regression model. Both properties are essential for a high-performance linear regression model. Intuitively, a trained regression model can be expected to have better model quality (in the form of the coefficient of determination R²) if the underlying data have a high degree of linearity.

Technically, there is even a way to show that the squared Pearson correlation coefficient corresponds to the coefficient of determination R² of a simple linear regression:

$$
\rho^2 = \frac{\operatorname{Cov}(X_i,Y_i)^2}{\operatorname{Var}(X_i)\,\operatorname{Var}(Y_i)}
= \frac{\operatorname{Cov}(X_i,Y_i)^2 \, \operatorname{Var}(X_i)}{\operatorname{Var}(Y_i)\,\operatorname{Var}(X_i)^2}
= \left[\frac{\operatorname{Cov}(X_i,Y_i)}{\operatorname{Var}(X_i)}\right]^2 \frac{\operatorname{Var}(X_i)}{\operatorname{Var}(Y_i)}
= \beta^2 \, \frac{\operatorname{Var}(X_i)}{\operatorname{Var}(Y_i)}
= \frac{\operatorname{Var}(\hat{Y}_i)}{\operatorname{Var}(Y_i)} = R^2
$$

This illustrates the close relationship between the coefficient of determination and the Pearson correlation coefficient. In the derivation of the coefficient of determination, we showed that β² × Var(Xi) = Var(Ŷi).
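The identity can also be checked numerically. This sketch reuses the sample from the Pearson example above, fits the OLS line, computes R² from the residuals, and compares it with ρ²:

```python
# Numerical check of the identity rho^2 = R^2 for simple linear regression,
# using the sample from the Pearson example earlier in the article.
from math import sqrt

x = [0.3, 0.7, 1.1, 1.3, 2.6]
y = [11.1, 13.5, 15.0, 16.4, 20.1]

n = len(x)
xm, ym = sum(x) / n, sum(y) / n

# Pearson correlation
cov = sum((a - xm) * (b - ym) for a, b in zip(x, y))
rho = cov / (sqrt(sum((a - xm) ** 2 for a in x)) * sqrt(sum((b - ym) ** 2 for b in y)))

# OLS fit and coefficient of determination R^2 = 1 - SS_res / SS_tot
beta = cov / sum((a - xm) ** 2 for a in x)
alpha = ym - beta * xm
y_hat = [alpha + beta * a for a in x]
ss_res = sum((b - h) ** 2 for b, h in zip(y, y_hat))
ss_tot = sum((b - ym) ** 2 for b in y)
r2 = 1 - ss_res / ss_tot

print(abs(rho ** 2 - r2) < 1e-9)  # True: the squared correlation equals R^2
```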

Ready to use the linear regression calculator?

Use Regression Online and focus on what really matters: your area of expertise