- Master Kan: "Truth is hard to understand."
- Kwai Chang Caine: "It is a fact, it is not the truth. Truth is often hidden, like a shadow in darkness."
Just want to understand quickly how linear regression works? The best way to understand regression analysis is to calculate it yourself, step by step. To do this, jump directly to the example below.
For all other curious people: Welcome to the world of regression, a branch of statistics that allows us to understand
relationships within data and make predictions. In linear regression analysis, we're looking for truth in the form of
a linear relationship between variables - but we must rely on evidence, or data, to identify these relationships.
It's important to recognize that while the relationships found through linear regression may reveal facts, the truth may be
more complex. Linear regression gives us a tool to shed light on data and outline relationships, but it doesn't always give
us the whole picture. It is a simplified representation of reality, a projection that is useful but limited.
In this introduction, we'll explore how linear regression works, how it's applied, and what insights it can provide. At the
same time, we'll keep in mind that our models are approximations of reality, built on the basis of available evidence, and
that there is always room for interpretation and further investigation.
Experts know that linear regression is the original data science. The technical concepts behind the linear regression model
are so fundamental that they lay the foundation for modern methods in machine learning and AI.
Linear Regression for Beginners: The Linear Regression Line
At its core, linear regression is a simple statistical technique. It determines the best-fitting straight line through a cloud of points. The goal is to identify the linear relationship between two variables represented by that data. The line that best fits through the cloud of points is called the regression line. It can be written using a standard linear equation such as

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

The parameters are interpreted as follows:
- $\hat{y}$: The estimated dependent variable (the "^" indicates that this is an estimate).
- $\hat{\beta}_0$: The estimated y-intercept.
- $x$: The independent variable.
- $\hat{\beta}_1$: The estimated slope of the line (regression coefficient).
$\hat{\beta}_0$ is the value of $\hat{y}$ when $x = 0$, which is precisely the point where our "best" line (the regression line) intersects the Y-axis. $\hat{\beta}_1$ tells us how much $\hat{y}$ increases or decreases (in the case of a negative slope) for each additional unit of $x$.
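To make the interpretation concrete, here is a minimal Python sketch that evaluates a fitted regression line; the coefficients are simply the ones from the worked example at the end of this article.

```python
# Minimal sketch: evaluating a fitted regression line y_hat = b0 + b1 * x.
# The coefficients are taken from the worked example at the end of this article.
b0 = 26.0   # estimated y-intercept: value of y_hat at x = 0
b1 = 1.2    # estimated slope: change in y_hat per one-unit increase in x

def predict(x):
    """Return the point on the regression line for a given x."""
    return b0 + b1 * x

print(predict(0))    # 26.0 -> where the line crosses the Y-axis
print(predict(10))   # 38.0
print(predict(11))   # 39.2 -> exactly b1 = 1.2 higher than at x = 10
```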
Linear Regression Advanced: Least Squares
The key question now is: How do we find this "best" line? And what exactly are our parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ that uniquely define the regression line? And why is it this line and not another one slightly higher - what does "best" line actually mean? In this case, the best line is the one for which the overall differences (or "distances") between the actual data points and the line are as small as possible. We try to position the line through the cloud of points so that the distances between the true values $y_i$ and the estimated values $\hat{y}_i$ (i.e., the regression line) are minimal.

- Young Caine: "I am puzzled."
- Master Po: "That is the beginning of wisdom."
Mathematically, this optimization problem can also be formulated as follows:

$$\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 = \min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2$$
We use the summation symbol ∑ because we consider not just one but all distances for our minimization. And to be precise, we're minimizing not just all distances, but all squared distances. This is because data points are sometimes above and sometimes below the line, and we don't want the distances to cancel each other out. The distances are positive for points above the line and negative for points below the line. We solve this problem by squaring the distances. Minimizing these squared distances also minimizes the "normal" distances - this method is called the method of least squares (or OLS, Ordinary Least Squares).
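As a small illustration, here is a Python sketch (using the five data points from the example at the end of this article) that computes the sum of squared residuals for two candidate lines; the least squares line has the smaller value:

```python
import numpy as np

# Sketch: squared residuals for two candidate lines, using the five data
# points from the worked example at the end of this article.
x = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)

def ssr(b0, b1):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)      # positive above the line, negative below
    return np.sum(residuals ** 2)      # squaring prevents cancellation

print(ssr(26.0, 1.2))   # 680.0 -> the least squares line
print(ssr(30.0, 1.2))   # 760.0 -> shifting the line up increases the SSR
```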
Next: We apply the expectation value to both sides of the equation to detach our optimization problem from the specific data set with points i = 1, ..., n. In other words, we want to abstract from the sample and make inferences about the "true" parameters $\beta_0$ and $\beta_1$ of the population. We achieve this by minimizing the expected error in the population:

$$\min_{\beta_0, \beta_1} \, E\!\left[ \left( Y - \beta_0 - \beta_1 X \right)^2 \right]$$
Calculating Linear Regression Parameters
The sum of squared distances, $\sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2$, is sometimes abbreviated as SSR (Sum of Squared Residuals) or SSE (Sum of Squared Errors). By minimizing SSR, we also minimize the expected value of the squared error. To do this, we need to take the derivatives of SSR with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ and set them equal to zero:

(1.) Partial derivative with respect to $\hat{\beta}_0$:

$$\frac{\partial \, SSR}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0$$
From this, it follows that:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
(2.) Partial derivative with respect to $\hat{\beta}_1$:

$$\frac{\partial \, SSR}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0$$
This results in:

$$\sum_{i=1}^{n} x_i y_i = \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2$$
When we use $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ from (1.):

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \, \bar{x} \, \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \, \bar{x}^2} = \frac{\overline{xy} - \bar{x} \, \bar{y}}{\overline{x^2} - \bar{x}^2}$$
As a reminder: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.
Why did we do all these calculations? The takeaway: We can now uniquely determine our regression line by the parameters $\hat{\beta}_0$ and $\hat{\beta}_1$. We have shown how to solve the least squares minimization problem and now know that our regression line is defined by these two estimators:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad \text{and} \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\overline{xy} - \bar{x} \, \bar{y}}{\overline{x^2} - \bar{x}^2}$$
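A short Python sketch (again using the sample from the example further down) shows these two estimators in action; the cross-check with NumPy's polyfit is just one convenient way to verify the result:

```python
import numpy as np

# Sketch: the closed-form least squares estimators derived above, applied to
# the sample from the worked example further down.
x = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)                    # 26.0 1.2
print(np.polyfit(x, y, deg=1))   # cross-check: [slope, intercept] = [1.2, 26.0]
```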
We also assume that as we collect more observations, our estimated sample parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ will converge to the true population parameters $\beta_0$ and $\beta_1$ (a property known as consistency; under standard assumptions the least squares estimators are also unbiased). This means that even though we are working with a random sample of data, we can still estimate a "true" regression line.
Linear Regression for Experts: Regression Analysis with Matrices
- Kwai Chang Caine: "A worker is known by his tools. A shovel for a man who digs. An ax for a woodsman."

And we use regression.
So far, we have not used any terms or notation that go beyond basic algebra and calculus (and probability). This has forced us to limit ourselves to the simple linear regression model with one predictor variable. However, this becomes unwieldy when we have multiple predictor variables, i.e. when we want to examine the influence of several X's on Y. Fortunately, with a little (for some, a little more) linear algebra, we can abstract away from many of the 'bookkeeping' details, so that multivariate linear regression hardly seems more complicated than the simple version.
This section will not remind the reader how matrix algebra works. Rather, some prior knowledge is assumed, because now the part for experts begins. In the following, bold letters such as **a** denote matrices throughout, as opposed to the scalar a.
Our data consist of $n$ observations of the predictor variable $x$ and the response variable $y$, i.e. $(x_1, y_1), \ldots, (x_n, y_n)$. We would like to fit the following model:

$$\mathbf{y} = \mathbf{x} \boldsymbol{\beta} + \mathbf{e}$$
Here, $\mathbf{x}$ is an (n×2) matrix, where the first column is always 1 and the second column contains the actual observations of $x$. We have this seemingly redundant first column because it helps us when we multiply $\mathbf{x}$ by $\boldsymbol{\beta}$:

$$\mathbf{x} \boldsymbol{\beta} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{pmatrix}$$
$\mathbf{x} \boldsymbol{\beta}$ is therefore an (n×1) matrix that contains our predictions. The matrix $\mathbf{x}$ is sometimes also referred to as a design matrix.

The Mean Squared Error MSE
For each data point, the use of the linear regression model, i.e. the application of the coefficients $\boldsymbol{\beta}$, leads to a certain error in the prediction - so we have prediction errors. These form a vector:

$$\mathbf{e} = \mathbf{y} - \mathbf{x} \boldsymbol{\beta}$$

Here, an (n×1) matrix is subtracted from an (n×1) matrix. In the section above, we said that we minimize the expected error by minimizing the sum of squared deviations / errors. Here we write this as follows:

$$MSE(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^{n} e_i^2$$

But how do we formulate this expression using matrices? Like this:

$$MSE(\boldsymbol{\beta}) = \frac{1}{n} (\mathbf{y} - \mathbf{x} \boldsymbol{\beta})^T (\mathbf{y} - \mathbf{x} \boldsymbol{\beta})$$

To see that this works, let's take a look at what this matrix multiplication really looks like:

$$\frac{1}{n} \mathbf{e}^T \mathbf{e} = \frac{1}{n} \begin{pmatrix} e_1 & e_2 & \cdots & e_n \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = \frac{1}{n} \left( e_1^2 + e_2^2 + \cdots + e_n^2 \right)$$
This clearly corresponds to $\frac{1}{n} \sum_{i=1}^{n} e_i^2$, the sum of squared errors from the section above (scaled by the constant factor $\frac{1}{n}$, which does not affect the minimization). With a little more math we get this:

$$MSE(\boldsymbol{\beta}) = \frac{1}{n} \left( \mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{x} \boldsymbol{\beta} - \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{y} + \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{x} \boldsymbol{\beta} \right)$$
Note that $\left( \mathbf{y}^T \mathbf{x} \boldsymbol{\beta} \right)^T = \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{y}$ and that this is a (1×1) matrix, such that $\mathbf{y}^T \mathbf{x} \boldsymbol{\beta} = \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{y}$. Thus we get:

$$MSE(\boldsymbol{\beta}) = \frac{1}{n} \left( \mathbf{y}^T \mathbf{y} - 2 \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{y} + \boldsymbol{\beta}^T \mathbf{x}^T \mathbf{x} \boldsymbol{\beta} \right)$$
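The following Python sketch checks numerically that the matrix form of the MSE gives the same number as averaging the squared residuals one by one; the data points and coefficients are taken from the worked example below:

```python
import numpy as np

# Sketch: the matrix expression (1/n) * (y - x*beta)^T (y - x*beta) equals the
# average of the squared residuals. Data and coefficients come from the
# worked example below.
x_obs = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)

X = np.column_stack([np.ones_like(x_obs), x_obs])   # (n x 2) design matrix
beta = np.array([26.0, 1.2])                        # [intercept, slope]
n = len(y)

e = y - X @ beta                                    # residual vector e = y - x*beta
mse_matrix = (e @ e) / n                            # matrix form (1/n) e^T e
mse_elementwise = np.mean(e ** 2)                   # (1/n) * sum of e_i^2

print(mse_matrix, mse_elementwise)                  # both are 136.0
```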
Minimizing MSE
First, we find the gradient of MSE as a function of $\boldsymbol{\beta}$:

$$\nabla_{\boldsymbol{\beta}} \, MSE(\boldsymbol{\beta}) = \frac{1}{n} \left( -2 \mathbf{x}^T \mathbf{y} + 2 \mathbf{x}^T \mathbf{x} \boldsymbol{\beta} \right) = \frac{2}{n} \left( \mathbf{x}^T \mathbf{x} \boldsymbol{\beta} - \mathbf{x}^T \mathbf{y} \right)$$
At the optimum, this gradient is zero, so there is a $\hat{\boldsymbol{\beta}}$ for which the following holds:

$$\mathbf{x}^T \mathbf{x} \hat{\boldsymbol{\beta}} = \mathbf{x}^T \mathbf{y}$$
This equation for the two-dimensional vector $\hat{\boldsymbol{\beta}}$ is also called the normal equation. If we solve for $\hat{\boldsymbol{\beta}}$, we get:

$$\hat{\boldsymbol{\beta}} = \left( \mathbf{x}^T \mathbf{x} \right)^{-1} \mathbf{x}^T \mathbf{y}$$
We have now found a matrix equation with which we can calculate both coefficients of our regression line. For the simple case with one predictor, it reproduces our solution for the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ from the previous section:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad \text{and} \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
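As a sketch of how this looks in practice (assuming NumPy and the data from the example below), we can build the design matrix with a column of ones and solve the normal equation; solving the linear system is usually preferred over explicitly inverting $\mathbf{x}^T \mathbf{x}$, but both reproduce the formula above:

```python
import numpy as np

# Sketch: solving the normal equation x^T x beta_hat = x^T y with NumPy,
# using the data from the worked example below.
x_obs = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)

X = np.column_stack([np.ones_like(x_obs), x_obs])     # first column of ones

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # preferred: solve the system
print(beta_hat)                                       # [26.   1.2] -> [intercept, slope]

beta_hat_inv = np.linalg.inv(X.T @ X) @ X.T @ y       # literal (x^T x)^{-1} x^T y
print(beta_hat_inv)                                   # same result
```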
So now we know how linear regression is derived and structured. We have definitions for $\hat{\beta}_0$, the Y-intercept, and $\hat{\beta}_1$, the linear slope(s) of our explanatory X variable(s). However, the mere functioning of the model is not enough. After the data analysis has been carried out, the results of the model still need to be interpreted. You can see how this is done here.
Example: Calculating a Linear Regression
To understand linear regression even better, it is worth calculating the regression analysis by hand. This allows us to see how all the formulas are used and how the data are related. Here is a step-by-step example of how to proceed if you want to calculate a linear regression “manually” with specific numbers.
- Write down the x and y values of a sample
- From the values in step 1, calculate the corresponding $x_i^2$, $y_i^2$, and $x_i \cdot y_i$
- Compute the sums over all observations (i = 1, ..., n) of $x_i$, $y_i$, $x_i^2$, $y_i^2$, and $x_i \cdot y_i$
- Calculate the mean values from step 3
- Calculate the regression coefficients $\hat{\beta}_1$ and $\hat{\beta}_0$
| i | X | Y | $X^2$ | $Y^2$ | $X \cdot Y$ |
|---|---|---|---|---|---|
| 1 | 40 | 80 | 1600 | 6400 | 3200 |
| 2 | 45 | 80 | 2025 | 6400 | 3600 |
| 3 | 60 | 90 | 3600 | 8100 | 5400 |
| 4 | 80 | 140 | 6400 | 19600 | 11200 |
| 5 | 75 | 100 | 5625 | 10000 | 7500 |
| Sum | 300 | 490 | 19250 | 50500 | 30900 |
| Mean | 60 | 98 | 3850 | 10100 | 6180 |
Now calculate the regression coefficients $\hat{\beta}_1$ and $\hat{\beta}_0$ by substituting into the formulas:

$$\hat{\beta}_1 = \frac{\overline{xy} - \bar{x} \, \bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{6180 - 60 \cdot 98}{3850 - 60^2} = \frac{300}{250} = 1.2$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 98 - 1.2 \cdot 60 = 26$$
The regression line is therefore:

$$\hat{y} = 26 + 1.2 \cdot x$$
Thus, the Y-intercept of the regression line is $\hat{\beta}_0 = 26$, and Y increases by +1.2 for each increase of X by +1 (one unit). This exemplary calculation worked well for the 5 data points from the sample. For more data, the manual calculation quickly becomes far too laborious, which is why we have built the online calculator for linear regression!
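If you want to double-check the manual calculation, here is a short Python sketch that reproduces the table quantities and the two coefficients:

```python
import numpy as np

# Sketch: reproducing the manual calculation with the same five data points.
x = np.array([40, 45, 60, 80, 75], dtype=float)
y = np.array([80, 80, 90, 140, 100], dtype=float)

xy_mean = np.mean(x * y)          # 6180.0, the "Mean" of the X*Y column
x2_mean = np.mean(x ** 2)         # 3850.0, the "Mean" of the X^2 column

b1 = (xy_mean - x.mean() * y.mean()) / (x2_mean - x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()

print(b1, b0)                     # 1.2 26.0, matching the result above
```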