Linear Regression: Introduction


- Master Kan: "Truth is hard to understand."

- Kwai Chang Caine: "It is a fact, it is not the truth. Truth is often hidden, like a shadow in darkness."

Kung Fu, season 1, episode 14

Just want to quickly understand how linear regression works? The best way to understand regression analysis is to calculate it yourself step by step. To do this, jump directly to the example below.

For all other curious people: Welcome to the world of regression, a branch of statistics that allows us to understand relationships within data and make predictions. In linear regression analysis, we're looking for truth in the form of a linear relationship between variables - but we must rely on evidence, or data, to identify these relationships.
It's important to recognize that while the relationships found through linear regression may reveal facts, the truth may be more complex. Linear regression gives us a tool to shed light on data and outline relationships, but it doesn't always give us the whole picture. It is a simplified representation of reality, a projection that is useful but limited.
In this introduction, we'll explore how linear regression works, how it's applied, and what insights it can provide. At the same time, we'll keep in mind that our models are approximations of reality, built on the basis of available evidence, and that there is always room for interpretation and further investigation.
Experts know that linear regression is the original data science. The technical concepts behind the linear regression model are so fundamental that they lay the foundation for modern methods in machine learning and AI.

Linear Regression for Beginners: The Linear Regression Line

At its core, linear regression is a simple statistical technique. It determines the best fitting straight line through a cloud of points. The goal is to identify the linear relationship between two variables represented by that data.

[Figure: Graph of linear regression (linear regression line)]
The line that best fits through the cloud of points is called the regression line. It can be written using a standard linear equation such as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. The parameters are interpreted as follows:
  • $\hat{y}$: The estimated dependent variable (the "^" indicates that this is an estimate).
  • $\hat{\beta}_0$: The estimated y-intercept.
  • $x$: The independent variable.
  • $\hat{\beta}_1$: The estimated slope of the line (regression coefficient).

[Figure: Regression analysis plot with the linear regression line]

$\hat{\beta}_0$ is the value of $\hat{y}$ when $x = 0$ ($\hat{y} = \hat{\beta}_0 + 0 \times \hat{\beta}_1$), which is precisely the point where our "best" line (the regression line) intersects the Y-axis. $\hat{\beta}_1$ tells us how much Y increases or decreases (in the case of a negative slope) for each additional unit of X.
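
To make this concrete, here is a minimal sketch in Python/NumPy (not part of the article or its calculator) that fits such a line; it reuses the five data points from the worked example at the end of this article, and np.polyfit with degree 1 returns the slope and intercept of the best-fitting straight line:

```python
import numpy as np

# Illustration data: the five (x, y) pairs from the worked example at the end of this article
x = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)

# np.polyfit with degree 1 fits a straight line and returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

print(f"Estimated intercept (beta_0 hat): {intercept:.2f}")  # where the line crosses the Y-axis
print(f"Estimated slope     (beta_1 hat): {slope:.2f}")      # change in y per one-unit increase in x

# Points on the regression line: y_hat = beta_0_hat + beta_1_hat * x
y_hat = intercept + slope * x
```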

Linear Regression Advanced: Least Squares

- Young Caine: "I am puzzled."

- Master Po: "That is the beginning of wisdom."

Kung Fu, season 2, episode 25
The key question now is: How do we find this "best" line? And what exactly are our parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ that uniquely define the regression line? And why is it this line and not another one slightly higher - what does "best" line actually mean? In this case, the best line is the one for which the overall differences (or "distances") between the actual data points and the line are as small as possible. We try to position the line through the cloud of points so that the distances between the true values and the estimated values (i.e., the regression line) are minimal.
[Figure: Linear regression plot with prediction errors]
Mathematically, this optimization problem can also be formulated as follows:

$$\min \sum_{i=1}^{n} e_i^2 = \min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
We use the summation symbol ∑ because we consider not just one but all distances in our minimization. And to be precise, we're minimizing not just all distances, but all squared distances. This is because data points are sometimes above and sometimes below the line, and we don't want the distances to cancel each other out: the distances are positive for points above the line and negative for points below the line. We solve this problem by squaring the distances. Minimizing the squared distances keeps the overall deviation between the line and the data points as small as possible - this method is called the method of least squares (or OLS, Ordinary Least Squares).
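
A tiny numerical illustration of both points (again Python/NumPy, using the five data points from the worked example below): for the least-squares line the positive and negative distances cancel out almost exactly, while any other line, for example the same line shifted slightly upwards, produces a larger sum of squared distances:

```python
import numpy as np

x = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line
residuals = y - (intercept + slope * x)      # positive above the line, negative below it

print(residuals.sum())           # ~0: the raw distances cancel each other out
print((residuals ** 2).sum())    # sum of squared distances (SSR) of the best line

# Any other line, e.g. the same line shifted upwards by 1, produces a larger SSR
shifted_residuals = y - (intercept + 1 + slope * x)
print((shifted_residuals ** 2).sum())
```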
Next, we apply the expected value to both sides of the equation to detach our optimization problem from the specific data set with points i = 1, ..., n. In other words, we want to abstract from the sample and make inferences about the "true" parameters $\beta_0$ and $\beta_1$ of the population. We achieve this by minimizing the expected error in the population:

$$\min E\left[\sum_{i=1}^{n} e_i^2\right] = \min E\left[\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\right]$$
$$\min E[e^2] = \min E[(Y - \hat{Y})^2] = \min E[(Y - (\hat{\beta}_0 + \hat{\beta}_1 X))^2] = \min E[(Y - \hat{\beta}_0 - \hat{\beta}_1 X)^2]$$

Calculating Linear Regression Parameters

The sum of squared distances $\sum_{i=1}^{n} e_i^2$ is sometimes abbreviated as SSR (Sum of Squared Residuals) or SSE (Sum of Squared Errors). By minimizing the SSR, we also minimize the expected value of the squared error $e$. To do this, we take the partial derivatives of the SSR with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ and set them equal to zero:

(1.) Partial derivative with respect to $\hat{\beta}_0$:

$$\frac{\partial\, SSR}{\partial \hat{\beta}_0} = -2\, E[(Y - \hat{\beta}_0 - \hat{\beta}_1 X)] \overset{!}{=} 0$$

From this, it follows that:

$$E[Y] = \hat{\beta}_0 + \hat{\beta}_1 E[X] \quad\Longrightarrow\quad \hat{\beta}_0 = E[Y] - \hat{\beta}_1 E[X]$$

(2.) Partial derivative with respect to $\hat{\beta}_1$:

$$\frac{\partial\, SSR}{\partial \hat{\beta}_1} = E[2(Y - \hat{\beta}_0 - \hat{\beta}_1 X)(-X)] = -2\, E[X(Y - \hat{\beta}_0 - \hat{\beta}_1 X)] \overset{!}{=} 0$$

This results in:

$$E[XY - \hat{\beta}_0 X - \hat{\beta}_1 X^2] = 0 \quad\Longrightarrow\quad E[XY] - \hat{\beta}_0 E[X] = \hat{\beta}_1 E[X^2]$$

When we substitute $\hat{\beta}_0$ from (1.):

$$E[XY] - (E[Y] - \hat{\beta}_1 E[X])\, E[X] = \hat{\beta}_1 E[X^2]$$
$$E[XY] - E[Y]\, E[X] = \hat{\beta}_1 \left( E[X^2] - E[X]^2 \right)$$
$$\hat{\beta}_1 = \frac{E[XY] - E[Y]\, E[X]}{E[X^2] - E[X]^2} = \frac{Cov(X,Y)}{Var(X)}$$

As a reminder: $E[XY] - E[Y]\, E[X] = Cov(X,Y)$ and $E[X^2] - E[X]^2 = Var(X)$.


Why did we do all these calculations? The takeaway: We can now uniquely determine our regression line by the parameters $\hat{\beta}_0$ and $\hat{\beta}_1$. We have shown how to solve the least squares minimization problem and now know that our regression line is defined by these two estimators:
$$\hat{\beta}_0 = E[Y] - \hat{\beta}_1 E[X] \qquad \text{(in summation notation: } \hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - \hat{\beta}_1\, \frac{1}{n}\sum_{i=1}^{n} x_i \text{)}$$
and
$$\hat{\beta}_1 = \frac{Cov(X,Y)}{Var(X)} \qquad \text{(in summation notation: } \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \text{)}$$

We also expect that as we collect more observations, our estimated sample parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ will converge to the true population parameters $\beta_0$ and $\beta_1$ (this property is called consistency; under the classical assumptions the OLS estimators are also unbiased). This means that even though we are working with a random sample of data, we can still estimate a "true" regression line.
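
As a quick sanity check (a sketch, not code from the article), these two formulas can be evaluated directly with NumPy on the five data points from the worked example below:

```python
import numpy as np

# The five data points from the worked example at the end of this article
x = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)

# beta_1_hat = Cov(X, Y) / Var(X), using the means-based formulas from above
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
var_x = np.mean(x ** 2) - np.mean(x) ** 2
beta_1 = cov_xy / var_x

# beta_0_hat = mean(y) - beta_1_hat * mean(x)
beta_0 = np.mean(y) - beta_1 * np.mean(x)

print(beta_0, beta_1)   # 26.0 1.2
```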

Linear Regression for Experts: Regression Analysis with Matrices

- Kwai Chang Caine: "A worker is known by his tools. A shovel for a man who digs. An ax for a woodsman."

Kung Fu, season 1, episode 8
And we use regression.
So far, we have not used any terms or notation that go beyond basic algebra and calculus (and probability). This has forced us to limit ourselves to the simple linear regression model with one predictor variable. However, this approach becomes unwieldy when we have multiple predictor variables, i.e., when we want to examine the influence of multiple X's on Y. Fortunately, with a little (for some, a little more) linear algebra, we can abstract away from many of the 'bookkeeping' details, so that multivariate linear regression hardly seems more complicated than the simple version.
This section will not remind the reader how matrix algebra works; some prior knowledge is assumed, because now the part for experts begins. In the following, bold letters such as $\mathbf{a}$ denote matrices (and vectors), as opposed to the scalar $a$.

Our data consist of n observations of the predictor variable X and the response variable Y, i.e. $(x_1, y_1), \ldots, (x_n, y_n)$. We would like to fit the following model:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where $E[\varepsilon \mid X = x] = 0$, $Var[\varepsilon \mid X = x] = \sigma^2$, and $\varepsilon$ is uncorrelated across observations. First, we collect all our observations of the target variable in an (n×1) matrix:
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
We also summarize our coefficients in a (2×1) matrix:
$$\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$
We also want to summarize the observations of the predictor variable X, but we need something that looks a little unusual at first glance:
$$\mathbf{x} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

This is an (n×2) matrix, where the first column is always 1 and the second column contains the actual observations of X. We have this seemingly redundant first column because it helps us when we multiply $\mathbf{x}$ by $\boldsymbol{\beta}$:

$$\mathbf{x}\boldsymbol{\beta} = \begin{bmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{bmatrix}$$

$\mathbf{x}\boldsymbol{\beta}$ is therefore an (n×1) matrix that contains our predictions. The matrix $\mathbf{x}$ is sometimes also referred to as the design matrix.
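
As a small illustration (NumPy assumed as the tool, with coefficient values borrowed from the worked example below), the design matrix and the prediction vector xβ can be built like this:

```python
import numpy as np

x_obs = np.array([40, 45, 60, 80, 75], dtype=float)
beta = np.array([26.0, 1.2])   # [beta_0, beta_1], taken from the worked example below

# Design matrix: a column of ones (for the intercept) next to the observations of X
X = np.column_stack([np.ones_like(x_obs), x_obs])   # shape (n, 2)

# X @ beta is the (n x 1) vector of predictions beta_0 + beta_1 * x_i
y_hat = X @ beta
print(y_hat)   # [ 74.  80.  98. 122. 116.]
```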

The Mean Squared Error (MSE)

For each data point, the use of the linear regression model, i.e. the application of the coefficients $\boldsymbol{\beta}$, leads to a certain error in the prediction, so we have n prediction errors. These form a vector:
$$\mathbf{e}(\boldsymbol{\beta}) = \mathbf{y} - \mathbf{x}\boldsymbol{\beta}$$
Here, an (n×1) matrix is subtracted from an (n×1) matrix. In the section above, we said that we minimize the expected error by minimizing the sum of squared deviations (errors). Here we write this as follows:
$$MSE(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^{n} e_i^2(\boldsymbol{\beta})$$
But how do we formulate this expression using matrices? Like this:
$$MSE(\boldsymbol{\beta}) = \frac{1}{n}\, \mathbf{e}^T \mathbf{e}$$

To see that this works, let's take a look at what the matrix multiplication really looks like:
$$\mathbf{e}^T \mathbf{e} = \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$$

This clearly corresponds to $\sum_{i=1}^{n} e_i^2$, the form from the section above. With a little more algebra we get:

$$MSE(\boldsymbol{\beta}) = \frac{1}{n}\, \mathbf{e}^T \mathbf{e} = \frac{1}{n} (\mathbf{y} - \mathbf{x}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{x}\boldsymbol{\beta}) = \frac{1}{n} (\mathbf{y}^T - \boldsymbol{\beta}^T \mathbf{x}^T)(\mathbf{y} - \mathbf{x}\boldsymbol{\beta}) = \frac{1}{n} \left( \mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{x} \boldsymbol{\beta} - \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{y} + \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{x} \boldsymbol{\beta} \right)$$

Note that $(\mathbf{y}^T \mathbf{x} \boldsymbol{\beta})^T = \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{y}$ and that this is a (1×1) matrix, so that $\mathbf{y}^T \mathbf{x} \boldsymbol{\beta} = \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{y}$. Thus we get:

$$MSE(\boldsymbol{\beta}) = \frac{1}{n} \left( \mathbf{y}^T \mathbf{y} - 2 \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{y} + \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{x} \boldsymbol{\beta} \right)$$
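
Continuing the sketch from above (NumPy assumed, coefficients again taken from the worked example below), the matrix form of the MSE gives exactly the same number as averaging the squared errors element by element:

```python
import numpy as np

x_obs = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)
X = np.column_stack([np.ones_like(x_obs), x_obs])   # design matrix
beta = np.array([26.0, 1.2])                        # coefficients from the worked example below

e = y - X @ beta                    # vector of prediction errors e(beta) = y - x beta

mse_matrix = (e.T @ e) / len(y)     # (1/n) * e^T e
mse_elementwise = np.mean(e ** 2)   # (1/n) * sum of e_i^2

print(mse_matrix, mse_elementwise)  # both print the same value
```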

Minimizing MSE

First, we find the gradient of the MSE as a function of $\boldsymbol{\beta}$:

$$\nabla_{\boldsymbol{\beta}}\, MSE(\boldsymbol{\beta}) = \nabla_{\boldsymbol{\beta}}\, \frac{1}{n} \left( \mathbf{y}^T \mathbf{y} - 2 \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{y} + \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{x} \boldsymbol{\beta} \right) = \frac{1}{n} \left( 0 - 2 \mathbf{x}^T \mathbf{y} + 2 \mathbf{x}^T \mathbf{x} \boldsymbol{\beta} \right) = \frac{2}{n} \left( \mathbf{x}^T \mathbf{x} \boldsymbol{\beta} - \mathbf{x}^T \mathbf{y} \right)$$

At the optimum, this gradient is zero, so there is a $\hat{\boldsymbol{\beta}}$ for which the following holds:

$$\mathbf{x}^T \mathbf{x} \hat{\boldsymbol{\beta}} - \mathbf{x}^T \mathbf{y} \overset{!}{=} 0$$

This equation for the two-dimensional vector $\hat{\boldsymbol{\beta}}$ is also called the normal equation. If we solve for $\hat{\boldsymbol{\beta}}$, we get:

$$\hat{\boldsymbol{\beta}} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$$

We have now found a matrix equation with which we can calculate both coefficients of our regression line. Evaluating it reproduces our solution for the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ from the previous section:

$$\hat{\beta}_1 = \frac{c_{XY}}{s_X^2} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}$$

and

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
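
A minimal sketch (Python/NumPy, not code from the article) showing that the normal equation reproduces these estimators; it solves the linear system with np.linalg.solve instead of forming the inverse explicitly, which is numerically preferable:

```python
import numpy as np

x_obs = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)
X = np.column_stack([np.ones_like(x_obs), x_obs])   # design matrix with intercept column

# Solve the normal equation (x^T x) beta = x^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # [26.   1.2] -> the same beta_0_hat and beta_1_hat as the scalar formulas
```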

So now we know how the linear regression is derived and structured. We have definitions for $\hat{\beta}_0$, the Y-intercept, and $\hat{\beta}_1$, the linear slope(s) of our explanatory X variable(s). However, the mere functioning of the model is not enough: after applying the data analysis, the results of the model still need to be interpreted. You can see how this is done here.

Example: Calculating a Linear Regression

To understand linear regression even better, it is worth calculating the regression analysis by hand. This allows us to see how all the formulas are used and how the data are related. Here is an example of how to proceed step by step if you want to calculate a linear regression “manually” with specific numbers.

  1. Write down the x and y values of a sample
  2. From the values in step 1, calculate the corresponding x², y², and xy
  3. Compute the sums over all observations (i = 1, ..., n) of x, y, x², y², and xy
  4. Calculate the mean values from step 3
  5. Calculate the regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$

i     X     Y     X²      Y²      X·Y
1     40    80    1600    6400    3200
2     45    80    2025    6400    3600
3     60    90    3600    8100    5400
4     80    140   6400    19600   11200
5     75    100   5625    10000   7500
Sum   300   490   19250   50500   30900
Mean  60    98    3850    10100   6180
Calculate the covariance and the variance in an intermediate step:

$$Cov(X,Y) = E[XY] - E[X]\, E[Y] = 6180 - 60 \times 98 = 300$$
$$Var(X) = E[X^2] - E[X]^2 = 3850 - 60^2 = 250$$
Now calculate the regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ by substituting into the formulas:

$$\hat{\beta}_1 = \frac{Cov(X,Y)}{Var(X)} = \frac{300}{250} = 1.2$$
$$\hat{\beta}_0 = E[Y] - \hat{\beta}_1 E[X] = 98 - 1.2 \times 60 = 26$$

The regression line is therefore: $\hat{Y} = 26 + 1.2 \times X$

Thus, the Y-intercept of the regression line is 26, and Y increases by 1.2 for each increase of X by one unit. This example calculation works well for the 5 data points of the sample. For larger data sets, the calculation quickly becomes far too tedious by hand, which is why we built the online calculator for linear regression!
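
If you would like to double-check the manual calculation with a few lines of code, here is a sketch in Python/NumPy (not the article's online calculator) that reproduces every intermediate value from the table above:

```python
import numpy as np

x = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)

# Steps 3 and 4: sums and means of x, y, x^2, y^2 and xy
print(x.sum(), y.sum(), (x ** 2).sum(), (y ** 2).sum(), (x * y).sum())  # 300 490 19250 50500 30900
print(x.mean(), y.mean(), (x ** 2).mean(), (x * y).mean())              # 60 98 3850 6180

# Step 5: covariance, variance and the regression coefficients
cov_xy = (x * y).mean() - x.mean() * y.mean()   # 300
var_x = (x ** 2).mean() - x.mean() ** 2         # 250
beta_1 = cov_xy / var_x                         # 1.2
beta_0 = y.mean() - beta_1 * x.mean()           # 26
print(beta_0, beta_1)

# Predicting with the regression line Y_hat = 26 + 1.2 * X, e.g. for X = 50
print(beta_0 + beta_1 * 50)                     # 86.0
```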

Ready to use the linear regression calculator?

Use Regression Online and focus on what really matters: your area of expertise
  • Interactive
  • Results immediately
  • Plot included
  • Established tool