Measuring Performance of Linear Regression Models

"Prediction is very difficult, especially if it's about the future!"

Niels Bohr, Physicist

TL;DR: To judge how "good" a linear regression is, you first have to ask yourself what purpose it serves.

  • If linear regression is used to evaluate data (also known as causal inference), a model is "good" if the estimated effects of X on Y are undistorted. Data analysis concentrates on the question of what happens to Y when X changes. The aim here is to calculate regression coefficients $\hat{\beta}_i$ that are as accurate as possible in order to explain or describe the relationship between X and Y.
  • If the aim of linear regression is to predict data, then the higher the accuracy and precision of a model, the better it is. Data prediction focuses on the estimation of Y for a given X. The aim here is to calculate forecasts $\hat{y}_i$ that are as accurate as possible. The precision of linear regression is measured, for example, by the coefficient of determination.

How do we know how "good" a linear regression is for our problem or question? How can we say that a regression analysis fits our data "well"? To assess the quality of a linear regression, users must first ask themselves the following question: Is the linear regression used to predict new data points or for causal explanation/evaluation of existing data points? Of course, linear regression can do both. But the purpose for which it is intended determines the benchmark for its quality.

Originally, linear regression was used to causally explain existing data patterns. In this case, a "good" linear regression means that the estimated effects of the explanatory variable X on the target Y are undistorted (technically: the estimates of the regression coefficients $\hat{\beta}_i$ are unbiased).

[Figure: interpretability of linear regression]

The graph shows that one of the main strengths of regression analysis is its interpretability, which makes it particularly suitable for data evaluation. With the help of regression analyses, the relationships and patterns in the data can be explained particularly clearly. Unfortunately, whether a causal regression model is unbiased cannot be judged from a single computable metric. Rather, it is always a question of whether the assumptions of linear regression are fulfilled: Have all variables needed to describe the "true" relationship between X and Y been taken into account? Are the errors normally distributed? And so on. More on this below.

However, linear regression can also be used to predict new data points. In this case, a "good" linear regression is not a model with unbiased estimators, but a model with the highest possible accuracy and precision. Outside of research, the prediction of data is the primary application of linear regression models. With the advent of big data, machine learning, and increasing computing capacity, statistical modeling with regression models of all kinds has experienced explosive growth. Even today, however, linear regression remains the most important baseline model, with all its benefits.

Accuracy and Precision of Linear Regression Models

"Fast is fine but accuracy is everything."

Wyatt Earp, American lawman

If the aim of linear regression is to predict new data points, then a regression model is particularly "good" if its predictions are as close as possible to reality. The challenge is this: at the time of the prediction, the reality that will occur is still unknown. The model must therefore be flexible enough to handle new, unknown data points.

Strictly speaking, there are two different criteria for measuring the quality of a linear regression for predictions: 1. accuracy and 2. precision.

  1. Accuracy indicates how close the predictions of a linear regression are to the true values. Measures that indicate the accuracy of the regression are, for example, the MAE (mean absolute error), MSE (mean squared error) or RMSE (root mean squared error).
  2. Precision refers to the variance of predictions. A precise model with low variance has particularly consistent and reproducible predictions. A typical measure of precision is the coefficient of determination R² or the adjusted coefficient of determination (adj. R²).

[Figure: quality of linear regression, accuracy vs. precision illustrated with targets]

A regression model can have high accuracy and low precision: its estimates are close to the true values on average, but widely scattered around the target. Conversely, models with high precision but low accuracy are also possible. These properties are illustrated by the targets shown above, and by the small simulation below. Of course, models with high accuracy and high precision are optimal.
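To make the difference concrete, here is a minimal simulation sketch in Python (numpy assumed; all numbers are invented for illustration): model A is precise but biased, model B is unbiased but noisy.

```python
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.normal(loc=100.0, scale=10.0, size=1000)

# Model A: precise but inaccurate. Predictions cluster tightly,
# but around values shifted away from the truth (systematic bias).
pred_a = y_true + 5.0 + rng.normal(0.0, 1.0, size=1000)

# Model B: accurate but imprecise. Unbiased on average,
# but the individual predictions scatter widely.
pred_b = y_true + rng.normal(0.0, 8.0, size=1000)

for name, pred in [("A (precise, biased)", pred_a), ("B (accurate, noisy)", pred_b)]:
    errors = pred - y_true
    print(f"Model {name}: mean error = {errors.mean():+.2f}, error std = {errors.std():.2f}")
```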

Calculating Accuracy: Prediction Errors

The accuracy of regressions can be calculated using the following formulas:

Mean absolute error:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

Mean squared error:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Root mean squared error:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

These metrics calculate the error of a forecast, i.e. how far the forecast $\hat{y}_i$ is from the actual value $y_i$. The lower these values are, the more accurate the linear regression model is.
It is important to recognize here that a predicted $\hat{y}_i$ can sometimes be above and sometimes below the actual $y_i$. To prevent errors in opposite directions from canceling out, the absolute value is used for the MAE and the square for the MSE. In the case of the MSE, the square root can then be applied in order to bring the error deviation back to the same scale as Y. The RMSE is therefore easier to interpret than the MSE.
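As a sketch of how these formulas translate into code (assuming numpy; the toy data is invented for illustration):

```python
import numpy as np

def regression_errors(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Return MAE, MSE and RMSE for a set of predictions."""
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))  # absolute value removes the sign of each error
    mse = np.mean(residuals ** 2)     # squaring removes the sign and penalizes large errors
    rmse = np.sqrt(mse)               # square root brings the error back to the scale of Y
    return {"MAE": float(mae), "MSE": float(mse), "RMSE": float(rmse)}

# Toy data, invented for illustration:
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])
print(regression_errors(y_true, y_pred))
# {'MAE': 0.625, 'MSE': 0.4375, 'RMSE': 0.6614...}
```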

Calculating Precision: Coefficient of Determination R²

Calculating the precision of a linear regression is somewhat more complex. As mentioned above, the coefficient of determination is used for this. The coefficient of determination is sometimes also referred to as R-squared; in both cases, the mathematical symbol is R².
The coefficient of determination is calculated as follows:

$$R^2 = \frac{\hat{\beta}_1^2 \operatorname{Var}(X_i)}{\operatorname{Var}(Y_i)}$$
How does this formula come about? The variance (= dispersion) of the Y values consists of two uncorrelated components: 1. the variance that is related to X and 2. the variance that is not related to X. The reason for this is that a data point $y_i$ is the sum of the prediction $\hat{y}_i$ (our regression line, which depends on X!) and the error $e_i$:

$$Y_i = \hat{Y}_i + e_i$$
If we apply the variance to both sides, we can write:

$$\operatorname{Var}(Y_i) = \operatorname{Var}(\hat{Y}_i) + \operatorname{Var}(e_i)$$

Dividing both sides by $\operatorname{Var}(Y_i)$ results in:

$$\frac{\operatorname{Var}(Y_i)}{\operatorname{Var}(Y_i)} = \frac{\operatorname{Var}(\hat{Y}_i)}{\operatorname{Var}(Y_i)} + \frac{\operatorname{Var}(e_i)}{\operatorname{Var}(Y_i)}$$

$$1 = R^2 + \frac{\operatorname{Var}(e_i)}{\operatorname{Var}(Y_i)}$$

$$R^2 = 1 - \frac{\operatorname{Var}(e_i)}{\operatorname{Var}(Y_i)}$$
Here you can see that the coefficient of determination R² corresponds to the proportion of the explained variance $\operatorname{Var}(\hat{Y}_i)$ in the total variance $\operatorname{Var}(Y_i)$. We can also simplify the above variance decomposition:
$$\operatorname{Var}(Y_i) = \operatorname{Var}(\hat{Y}_i) + \operatorname{Var}(e_i) = \operatorname{Var}(\hat{\beta}_0 + \hat{\beta}_1 X_i) + \operatorname{Var}(e_i) = \hat{\beta}_1^2 \operatorname{Var}(X_i) + \operatorname{Var}(e_i)$$

(The intercept $\hat{\beta}_0$ is a constant and therefore contributes nothing to the variance.) Solving for the error variance gives:

$$\operatorname{Var}(e_i) = \operatorname{Var}(Y_i) - \hat{\beta}_1^2 \operatorname{Var}(X_i)$$

Inserting into R² gives:

$$R^2 = 1 - \frac{\operatorname{Var}(Y_i) - \hat{\beta}_1^2 \operatorname{Var}(X_i)}{\operatorname{Var}(Y_i)} = \frac{\hat{\beta}_1^2 \operatorname{Var}(X_i)}{\operatorname{Var}(Y_i)}$$
We have thus shown how the formula for the coefficient of determination R² is derived. The higher this metric, the more precise our regression model.
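As a quick numerical sanity check, the following sketch (assuming numpy and simulated data; `np.polyfit` is just one of several ways to fit the line) computes R² both ways, via the residual variance and via $\hat{\beta}_1^2 \operatorname{Var}(X_i) / \operatorname{Var}(Y_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=500)  # simulated linear relationship

# Fit a simple linear regression by least squares.
beta1, beta0 = np.polyfit(x, y, deg=1)  # polyfit returns the slope first
y_hat = beta0 + beta1 * x
residuals = y - y_hat

# R² computed two ways, following the derivation above:
r2_from_residuals = 1.0 - np.var(residuals) / np.var(y)
r2_from_slope = beta1 ** 2 * np.var(x) / np.var(y)
print(r2_from_residuals, r2_from_slope)  # identical up to floating-point error
```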

The catch is that R² keeps increasing, i.e. the metric appears to improve, the more variables are added to the regression model. This is because variance is shifted from one component to the other, namely from $\operatorname{Var}(e_i)$ to $\operatorname{Var}(\hat{Y}_i)$. The regression model $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ gains more "power" when it is fed with additional variables. Unfortunately, the coefficient of determination R² also increases when "nonsensical" variables are added to the regression model, i.e. variables that describe the relationship between X and Y not at all or only weakly.

To counteract this, there is the adjusted coefficient of determination, adj. R², which takes the number of variables in the regression model into account. This value only increases if an added variable improves the model by more than would be expected by chance, i.e. if it is actually relevant for the relationship between X and Y. The formula for the adjusted coefficient of determination is:
$$R^2_{\text{adj}} = 1 - (1 - R^2) \cdot \frac{n-1}{n-k-1}$$

  • $R^2$ is the "normal" coefficient of determination.
  • $n$ is the number of data points.
  • $k$ is the number of predictors (explanatory variables).
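A direct translation of this formula into code might look like this (a minimal sketch; the example values for n and k are invented):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted coefficient of determination.

    r2 -- ordinary R² of the fitted model
    n  -- number of data points
    k  -- number of predictors (explanatory variables)
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Same R², but the penalty grows with the number of predictors:
print(adjusted_r2(0.80, n=50, k=2))   # ~0.791
print(adjusted_r2(0.80, n=50, k=20))  # ~0.662
```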

Interpretation of the Coefficient of Determination R²

R² is used as a measure of "fit", i.e. the precision of the model, but as described above it actually says something about the variance explained by the model. An R² of 1 (or 100%) means that the linear regression line explains 100% of the variance of the data, i.e. all of it. Visually, this means that all data points lie on the regression line. In models with a high R², the data points lie close to the regression line, while in models with a low R², the data points are more widely scattered. The closer the points are to the regression line, the higher the R² value:

[Figure: coefficient of determination for linear regression]

It should be mentioned here that R² has nothing to do with the slope of the regression line! The coefficient of determination R² is, however, closely related to the (Bravais-Pearson) correlation coefficient ρ (rho). The correlation coefficient $\rho_{X,Y}$ describes the direction and strength of the relationship between two variables X and Y and ranges between -1 and 1.
Mathematically, for a simple linear regression of Y on X, the coefficient of determination is exactly the squared Pearson correlation coefficient of the two variables; the short sketch below demonstrates this numerically.
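A minimal sketch illustrating this (simulated data with a negative relationship; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = -0.8 * x + rng.normal(scale=0.5, size=300)  # negative linear relationship

beta1, beta0 = np.polyfit(x, y, deg=1)
y_hat = beta0 + beta1 * x
r2 = 1.0 - np.var(y - y_hat) / np.var(y)

rho = np.corrcoef(x, y)[0, 1]  # Pearson correlation, negative here
print(rho, rho ** 2, r2)       # rho² matches R²; the sign of rho is lost in R²
```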

The coefficient of determination R², on the other hand, indicates how closely the data points lie to the regression line, but says nothing about the direction of the linear relationship. The same R² value applies to mirror-inverted correlations, e.g. ρ = 1 and ρ = -1. The coefficient of determination also carries over directly to multiple linear regression: R² and adj. R² apply to linear regression with one X, but also with $X_1, \ldots, X_N$. In the case of multiple linear regression with several predictors $X_1, \ldots, X_N$, the coefficient of determination R² refers to the target variable Y and the forecast $\hat{Y}$.

Linear Regression: Predicting vs. Explaining Data

- You use linear regression for causal conclusions, I use linear regression for predictions.
We are not the same.

Gus Fring, Breaking Bad

Although regression can be used both to investigate and explain causal relationships between X and Y and to predict unknown Y, there are some important differences in the way linear regression is, or should be, used in the two application areas. For predictions, bias in the coefficients is acceptable if it improves the accuracy of the predictions (the bias-variance trade-off).

Missing Variables

  • Explaining (causal inference): For causal regression analyses, it is important to have unbiased estimates of the regression coefficients $\hat{\beta}_0, \ldots, \hat{\beta}_N$. For non-experimental data, the biggest threat to this is bias due to omitted variables. In particular, we need to be concerned about variables that both affect the dependent variable (Y) and are correlated with the explanatory variables (X) currently included in the model. The absence of such variables can completely invalidate the conclusions of a data analysis.
  • Predicting: In forecasting models, the distortion of estimators due to omitted variables is less of a problem. The aim here is to obtain optimal predictions based on the available variables $X_1, \ldots, X_N$. There is no reason why we should try to obtain optimal estimates of the "true" coefficients. Missing variables are only relevant insofar as we could improve the predictions by including variables that are not currently available. However, this has nothing to do with the bias of the regression coefficients.

The Role of R²

  • Explaining (causal inference): Of course, a higher R² is better than a low R², but this metric is more important for forecasting. Even with a low R², hypotheses about the effect of a variable X on a target Y can be tested, because a low R² can be compensated for in parameter estimation and hypothesis testing by a large sample size.
  • Predicting: For the calculation of predictions, the maximization of R² or adj. R² is decisive. Technically, the more important criterion is the standard error of the prediction, which depends on both R² and $\operatorname{Var}(Y_i)$ in the population/sample. In any case, a large sample size cannot compensate for a lack of predictive power.

Multicollinearity

  • Explaining (causal inference): For the causal analysis of relationships in the data, multicollinearity is a major problem: it is difficult to obtain reliable estimates of the regression coefficients $\hat{\beta}_0, \ldots, \hat{\beta}_N$ when two or more variables are highly correlated with each other. Since the goal is to calculate accurate and unbiased coefficients, this can be devastating.
  • Predicting: When calculating predictions, we can tolerate multicollinearity well, as it is not the individual regression coefficients $\hat{\beta}_i$ that are of interest. Even if two variables $X_1$ and $X_2$ are strongly correlated, it may be worth including both in the regression model if each of them contributes significantly to the predictive power (see the sketch after this overview).

Missing Data

  • Explaining (causal inference): Over time, the options for dealing with missing data have developed considerably, including methods such as imputation, maximum likelihood, and inverse probability weighting. All of these methods focus primarily on parameter estimation and hypothesis testing for causal regression analyses.
  • Predicting: Unfortunately, the methods for estimating missing data points are not well suited to the needs of prediction models. On the one hand, the absence of an observation can itself provide useful information for the forecast. On the other hand, data is often missing not only in the "training" sample used to fit the linear regression, but also for new cases for which predictions are needed. It is useless to have optimally calculated regression coefficients if you do not have the corresponding values $x_1, \ldots, x_N$ to multiply them with.
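As a rough illustration of the multicollinearity point (a minimal sketch with simulated data; `np.linalg.lstsq` is just one way to compute the OLS fit):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1: strong multicollinearity
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# OLS fit with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)  # the individual slope estimates are poorly determined and vary strongly across samples...

y_hat = X @ beta
print(1.0 - np.var(y - y_hat) / np.var(y))  # ...but the predictive fit (R²) stays high
```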

One could argue that a correct causal regression model is likely to provide a better basis for predictions in the long run than a regression model based on a linear combination of arbitrary variables. It is plausible that correct causal models are more stable over time and across different data sets than ad hoc prediction models, because the former describe the "true" relationship in the data rather than merely minimizing prediction errors. But those who create forecasting models often cannot wait that long: forecasts are needed here and now, and you have to make the best of the data you have.

Ready to use the linear regression calculator?

Use Regression Online and focus on what really matters: your area of expertise.

  • Interactive
  • Results immediately
  • Plot included
  • Established tool