The linear regression analysis has been calculated, the model is ready, and now we see something like this:
What does this result tell us? This section explains how to interpret the regression table and what conclusions we can draw
from the data. While we see the table from the linear regression, simply calculating and presenting it is, of course, not enough.
It would be too easy if the analysis ended here. This article focuses on the individual metrics that emerge from the data
analysis and how they should be understood.
Some statistical background knowledge is assumed, but the theoretical foundations
and background can also be reviewed here.
We know that the linear regression model looks like this:

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_n X_n$

- $\hat{Y}$ is the estimated value for our target – the variable we are interested in. In the example above, this is the price of a property.
- $\hat{\beta}_0$ is the estimated Y-intercept.
- $X_1, X_2, \dots$ represent the explanatory / independent variables (also called regressors), such as the area or condition of a property.
- $\hat{\beta}_1, \hat{\beta}_2, \dots$ represent the estimated slopes of the regression line (called the regression coefficients or beta coefficients) with respect to $X_1, X_2, \dots$
The following happens when calculating this linear regression equation: The best line that most closely matches the data points X and Y is sought. Visually, this corresponds to a line that best fits through a scatter plot of X and Y. More precisely, the squared deviations between $\hat{Y}$, the values predicted by the regression line, and Y, the actual values, are minimized. This method is called OLS (Ordinary Least Squares) or the method of least squares. A more detailed explanation and more information can be found in the introduction to linear regression.
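The least-squares idea can be sketched in a few lines of Python. The area and price values below are hypothetical numbers invented for illustration (not the article's dataset); `np.linalg.lstsq` returns the coefficients that minimize the squared deviations:

```python
import numpy as np

# Hypothetical data: areas (square meters) and prices of five properties
area = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
price = np.array([150_000.0, 180_000.0, 230_000.0, 290_000.0, 350_000.0])

# Design matrix: a column of ones for the intercept ('const'), then the predictor
X = np.column_stack([np.ones_like(area), area])

# OLS: find b that minimizes ||price - X b||^2
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1 = beta_hat  # intercept (Y-axis cut) and slope (price per extra sq m)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

The same fit could be produced by `statsmodels.api.OLS`, which additionally prints the full results table discussed in this article.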
In our example, the linear regression equation looks like this:

$\widehat{\text{Price}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Area} + \hat{\beta}_2 \cdot \text{Condition}$
Beta coefficients of a Regression Analysis
After using the linear regression calculator, you will find the variables of the data analysis in the first column.
- 'const' is the constant of the linear regression equation (the Y-intercept).
- 'Area' is the variable $X_1$ whose influence on the target we want to measure (regressor/predictor).
- 'Condition' is a control variable $X_2$ in this linear regression. There can be any number of these control variables.
The 2nd column 'coef' contains the values of the beta coefficients that belong to the corresponding variables: the constant $\hat{\beta}_0$ and the coefficients $\hat{\beta}_1, \hat{\beta}_2, \dots$ The latter are the regression coefficients of the variables included in the linear regression: $\hat{\beta}_1$ is the coefficient of $X_1$, $\hat{\beta}_2$ is the coefficient of $X_2$, and so on. It is a convention to refer to the explanatory variable (also called the predictor or exogenous variable) as $X_1$, with $\hat{\beta}_1$ as its corresponding coefficient. All subsequent variables are control variables. In this example, this is also reflected in the results table of the online linear regression.
First, 'const' appears as the constant of the linear model with $\hat{\beta}_0$, then 'Area' appears as the explanatory variable $X_1$ with its coefficient $\hat{\beta}_1$. Then comes the control variable 'Condition' as $X_2$ with its coefficient $\hat{\beta}_2$.
If $\hat{\beta}_1$ is positive, i.e. $\hat{\beta}_1 > 0$, this indicates a positive relationship between $X_1$ and Y. If $X_1$ increases by one unit, Y increases by $\hat{\beta}_1$. If $\hat{\beta}_1$ is negative, i.e. $\hat{\beta}_1 < 0$, this indicates a negative relationship between the variable $X_1$ and the target Y. The larger $X_1$ becomes, the smaller Y becomes. The same applies to $\hat{\beta}_2$, $\hat{\beta}_3$, etc.
$\hat{\beta}_0$ is the constant in the linear regression model. The target variable Y is equal to $\hat{\beta}_0$ if all other $X_i = 0$. This can be clearly seen from the regression equation. Visually, $\hat{\beta}_0$ corresponds to the Y-intercept in the regression plot.
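The "one unit more" reading of a coefficient can be made concrete with a tiny sketch. The intercept 7500 and slope 2835.37 below are made-up values for illustration, not results from the article's table:

```python
def predict(area, b0=7500.0, b1=2835.37):
    """Hypothetical fitted model: price = b0 + b1 * area."""
    return b0 + b1 * area

# Increasing the predictor by one unit moves the prediction by exactly b1
delta = predict(81.0) - predict(80.0)
print(f"price change per extra unit of area: {delta:.2f}")

# With area = 0 the prediction collapses to the constant b0 (the Y-intercept)
base = predict(0.0)
```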
The Standard Error of Linear Regression
The column "std err" stands for standard error (SE) and shows the standard errors for our regression coefficients $\hat{\beta}_i$. The standard error generally refers to the standard deviation, i.e. dispersion, of a parameter. The standard deviation itself, however, is the dispersion of the raw data points. In the case of linear regression, the standard error is therefore the standard deviation of the beta coefficients $\hat{\beta}_i$. It indicates how much each $\hat{\beta}_i$ in the sample differs on average from the true $\beta_i$ in the population.
The standard error is defined by:

$SE(\hat{\beta}_1) = \sqrt{\dfrac{\sigma^2_\varepsilon}{n \cdot \sigma^2_X}} = \dfrac{\sigma_\varepsilon}{\sigma_X \sqrt{n}}$

- $n$ as sample size,
- $\sigma^2_X$ as variance of the predictor and
- $\sigma^2_\varepsilon$ as the variance of the residuals ε from the regression.
- $\sigma$ is the corresponding square root of the variance – that is, the respective standard deviation.
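The definition above can be verified by hand. The sketch below uses hypothetical area/price data (invented for illustration) and the textbook convention of n − 2 degrees of freedom for the residual variance, which is one common choice and may differ slightly from what a given calculator uses:

```python
import numpy as np

area = np.array([50.0, 60.0, 80.0, 100.0, 120.0])   # hypothetical predictor
price = np.array([150_000.0, 180_000.0, 230_000.0, 290_000.0, 350_000.0])
n = len(area)

# Fit intercept and slope by least squares
X = np.column_stack([np.ones(n), area])
b0, b1 = np.linalg.lstsq(X, price, rcond=None)[0]

residuals = price - (b0 + b1 * area)
# Residual variance, with n - 2 degrees of freedom (intercept + slope estimated)
var_e = residuals @ residuals / (n - 2)
# Sum of squared deviations of the predictor (n * variance of X)
s_xx = ((area - area.mean()) ** 2).sum()

se_b1 = np.sqrt(var_e / s_xx)   # standard error of the slope
print(f"SE(beta_1) = {se_b1:.3f}")
```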
In addition to the "normal" standard error, there are also so-called "robust" standard errors. These robust standard errors have a key advantage in that they take into account heteroskedastic data. Heteroskedastic data is data that does not have a constant variance. This means that the dispersion of the data (along one or more dimensions) increases or decreases. When this happens, the standard error is biased. This leads to incorrect conclusions about beta coefficients, confidence intervals, etc.
In the robust standard error equation, the unequal variance is accounted for by an interaction term $(x_i - \bar{x})^2 e_i^2$. If $x_i$ deviates significantly from the mean $\bar{x}$, the multiplication with the residual $e_i$ "penalizes" the RSE more. A wider distribution of Y at the edges (with respect to X) of the data contributes more to the RSE. Thus, the RSE reflects the increased spread of Y at large (or small, depending on the case) X values. The normal standard error does not take this uneven variance into account. However, it can be seen that if the data does indeed have constant variance, i.e. is homoscedastic, then RSE ≈ SE.
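One common robust variant is the HC0 ("White") estimator; whether the article's calculator uses exactly this form is an assumption. The sketch below contrasts it with the classic SE on hypothetical data, making the $(x_i - \bar{x})^2 e_i^2$ interaction term explicit:

```python
import numpy as np

# Hypothetical data, invented for illustration
x = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
y = np.array([150_000.0, 180_000.0, 230_000.0, 290_000.0, 350_000.0])
n = len(x)

X = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - (b0 + b1 * x)    # residuals
d = x - x.mean()         # deviations of the predictor from its mean

# Classic SE of the slope: assumes constant residual variance
se = np.sqrt((e @ e / (n - 2)) / (d @ d))

# HC0 robust SE: each squared deviation is weighted by its squared residual,
# so observations with large spread at extreme x values count more
rse = np.sqrt((d**2 @ e**2) / (d @ d) ** 2)

print(f"SE = {se:.3f}, robust SE = {rse:.3f}")
```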
The Test Statistic T of the Linear Regression Model
Under the column "t" you will find the test statistics for the respective beta coefficients $\hat{\beta}_i$. The test statistic T is an aggregate value calculated from the available data sample. T is used to decide whether the effect of $X_i$ on Y is also statistically significant. Statistical significance refers to the situation where the effect of $X_i$ actually exists in principle and is not just a result of chance (see null hypothesis).
This concept is discussed in more detail in the
article on hypothesis testing.
The test statistic T measures the extent to which our data reflect the null hypothesis $H_0: \beta_i = 0$, and whether or not we can reject $H_0$. T is defined differently for each statistical parameter; in the case of linear regression and beta coefficients, T is defined as follows:

$T = \dfrac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$
Because T is calculated from a sample of data, it is also subject to random variation, which is represented by the t-distribution.
For example, if T is greater than +1.96 or less than -1.96, the regression coefficient is considered statistically significant with a 5% probability of error (also known as the significance level Alpha). This means that the influence of $X_i$ on Y is real and more than just chance. "Mere" chance is represented by the null hypothesis $H_0: \beta_i = 0$.
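With $T = \hat{\beta}/SE(\hat{\beta})$, the significance check is a single division. The coefficient below is the article's example value for 'Area'; the standard error of 250 is an assumption purely for illustration:

```python
coef = 3268.2735   # beta-hat for 'Area' (example value from the text)
std_err = 250.0    # assumed standard error, for illustration

t = coef / std_err                   # test statistic for H0: beta = 0
significant_at_5pct = abs(t) > 1.96  # two-tailed check at alpha = 5%
print(f"T = {t:.2f}, significant at 5%: {significant_at_5pct}")
```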
The values for the test statistic T vary depending on the significance level being considered or the desired probability of error.
The values for T correspond to quantiles of the t-distribution and can be found in corresponding tables. The most relevant t-values
and their associated metrics are:
| Significance level Alpha | Critical value \|t\| | Confidence interval |
|---|---|---|
| 5% | 1.96 | 95% |
| 1% | 2.576 | 99% |
| 0.1% | 3.291 | 99.9% |
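The critical values in the table are quantiles of the t-distribution, which for large samples coincide with the standard normal quantiles. Python's standard library can reproduce them without lookup tables:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: the limit of the t-distribution for large n

# Two-tailed critical value for each significance level alpha
crit = {alpha: z.inv_cdf(1 - alpha / 2) for alpha in (0.05, 0.01, 0.001)}
for alpha, c in crit.items():
    print(f"alpha = {alpha:<6} -> reject H0 if |t| > {c:.3f}")
```

For small samples, the exact t-quantiles (with n − k degrees of freedom) are somewhat larger than these normal approximations.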
The p-value of Regression Analysis
The p-value indicates the probability of obtaining a T' that is even more extreme than the calculated T from the available data sample.
Thus, p-values represent probabilities under which the null hypothesis would be (falsely) rejected.
For example, if $p < 0.05$ (or 5%), the beta coefficient is considered statistically significant and the influence of X on Y is real and not due to chance. This
statement can then be made at the 5% significance level, which means there is a 5% chance of error. Additional p-values, along with
their corresponding significance levels and T-values, are as follows:
| Significance level Alpha / p-value | Critical value \|t\| | Confidence interval |
|---|---|---|
| 5% | 1.96 | 95% |
| 1% | 2.576 | 99% |
| 0.1% | 3.291 | 99.9% |
It is common to see decimal values in a regression table like the one shown above written with an e, for example $2.5\mathrm{e}{-11}$. This is scientific notation and means that the number is multiplied by a power of ten. The exponent -11 indicates that the decimal point is moved 11 places to the left. Therefore, $2.5\mathrm{e}{-11} = 2.5 \times 10^{-11} = 0.000000000025$.
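Scientific notation can be checked directly in Python; the value `2.5e-11` is just an invented example, as it might appear in a p-value column:

```python
x = 2.5e-11                 # hypothetical value from a p-value column
# 'e-11' shifts the decimal point 11 places to the left:
assert x == 0.000000000025
print(f"{x:.13f}")          # prints the fully written-out decimal
```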
Confidence Intervals of Regression Coefficients
The last two columns show the borders of the 95% confidence intervals around $\hat{\beta}_i$. Since the beta coefficients also vary and can take different values depending on the sample, their distribution can be described by intervals. The second to last column on the left shows the lower bound, i.e. the 2.5% quantile of this distribution. The last column, the right one, represents the 97.5% quantile. If the value 0 falls outside the 95% confidence interval around a calculated $\hat{\beta}_i$, it can be said with a 5% probability of error that $\hat{\beta}_i$ is statistically significant, and the null hypothesis $H_0: \beta_i = 0$ can be rejected. In this case, it is said that the corresponding variable $X_i$ has a real effect on the target Y.
There are several types of confidence intervals; in statistical regression analysis, confidence intervals that are closed at both the upper and lower ends are usually considered. The formula for their calculation is:

$\hat{\beta}_i \pm t_{crit} \cdot SE(\hat{\beta}_i)$, e.g. $\hat{\beta}_i \pm 1.96 \cdot SE(\hat{\beta}_i)$ for the 95% interval.
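The interval bounds can be reproduced from the 'coef' and 'std err' columns. The coefficient below is the article's example value for 'Area'; the standard error of 250 is assumed for illustration:

```python
coef = 3268.2735   # beta-hat for 'Area' (example value from the text)
std_err = 250.0    # assumed standard error, for illustration

lower = coef - 1.96 * std_err   # lower bound of the 95% confidence interval
upper = coef + 1.96 * std_err   # upper bound of the 95% confidence interval

# If 0 lies outside [lower, upper], the coefficient is significant at the 5% level
contains_zero = lower <= 0.0 <= upper
print(f"95% CI: [{lower:.4f}, {upper:.4f}], contains zero: {contains_zero}")
```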
Conclusion: Linear Regression Online
We now understand each column of the online linear regression table and are ready to interpret the results of the data analysis.
We'll start with the second row, since it contains the results for the variable we're interested in: $X_1$ (area). It lists all the metrics relevant to the statistical significance of the variable $X_1$, which in this example is the area of a property. The results indicate that area does indeed have a real impact on the price of a property (who would have guessed?!) and that the observed $\hat{\beta}_1$ does not just differ from 0 by chance. Of course, it is more interesting to describe the strength of this effect. In this case, the price increases by $3268.2735 per additional square meter, since $\hat{\beta}_1 = 3268.2735$.
The following rows in the regression table, namely those for $X_2, X_3, \dots$, represent the results for the control variables, such as the condition of a property in this case. Recall that it is a convention to denote the explanatory variable (also called predictor or exogenous variable) as $X_1$, with $\hat{\beta}_1$ as the corresponding coefficient. All subsequent variables are control variables. Again, we can see which ones are statistically significant.
The top row contains the evaluation for $\hat{\beta}_0$, the constant in the linear regression model. The constant represents the value of Y when all $X_i = 0$. It is the base value from which the linear regression line starts. Visually, $\hat{\beta}_0$ corresponds to the y-intercept. This initial value, the constant in the linear regression model, should always be considered in the context of the problem. Here, it makes little sense to assume a property with an area of 0 square meters.
Bonus: Linear Regression Calculator
All of the above analyses and metrics are calculated using the online linear regression calculator. Once the data analysis is complete,
all of these results can be downloaded as a CSV or Excel file. This allows the results to be used for further analysis, such as inclusion
in a report or scientific study. A plot of the linear regression line is also provided, which can be customized with a variety of
graphical options.
The Linear Regression Online Calculator also provides additional valuable information about the data being analyzed. Further details
about the data can be found below the regression results, which can provide deeper insight into the target variable under investigation.
As a convenience, the interpretation of the results is provided directly. If certain analyses are no longer needed, they can be easily
discarded to maintain clarity without affecting other experiments. To restart the calculator, simply reload the page.