The linear regression analysis has been calculated, the model is ready, and now we see something like this:
What does this result tell us? This section explains how to interpret the regression table and what conclusions we can draw
from the data. While we see the table from the linear regression, simply calculating and presenting it is, of course, not enough.
It would be too easy if the analysis ended here. This article focuses on the individual metrics that emerge from the data
analysis and how they should be understood.
Some statistical background knowledge is assumed, but the theoretical foundations
and background can also be reviewed here.
We know that the linear regression model looks like this:

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_n X_n$

- $\hat{Y}$ is the estimated value for our target – the variable we are interested in. In the example above, this is the price of a property.
- $\hat{\beta}_0$ is the estimated Y-intercept.
- $X_1, X_2, \dots$ represent the explanatory / independent variables (also called regressors), such as the area or condition of a property.
- $\hat{\beta}_1, \hat{\beta}_2, \dots$ represent the estimated slopes of the regression line (called the regression coefficients or beta coefficients) with respect to $X_1, X_2, \dots$
The following happens when calculating this linear regression equation: The best line that most closely matches the data points X and Y is sought. Visually, this corresponds to a line that best fits through a scatter plot of X and Y. More precisely, the squared deviations between $\hat{Y}$, the values predicted by the regression line, and Y, the actual values, are minimized. This method is called OLS (Ordinary Least Squares) or the method of least squares. A more detailed explanation and more information can be found in the introduction to linear regression.
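The least-squares idea can be sketched in a few lines of Python. The area and price values below are hypothetical numbers invented for illustration (not the article's dataset); `np.linalg.lstsq` returns the coefficients that minimize the squared deviations:

```python
import numpy as np

# Hypothetical data: areas (square meters) and prices of five properties
area = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
price = np.array([150_000.0, 180_000.0, 230_000.0, 290_000.0, 350_000.0])

# Design matrix: a column of ones for the intercept ('const'), then the predictor
X = np.column_stack([np.ones_like(area), area])

# OLS: find b that minimizes ||price - X b||^2
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1 = beta_hat  # intercept (Y-axis cut) and slope (price per extra sq m)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

The same fit could be produced by `statsmodels.api.OLS`, which additionally prints the full results table discussed in this article.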
In our example, the linear regression equation looks like this:

$\widehat{\text{Price}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Area} + \hat{\beta}_2 \cdot \text{Condition}$
Beta coefficients of a Regression Analysis
After using the linear regression calculator, you will find the variables of the data analysis in the first column.
- 'const' is the constant of the linear regression equation (the Y-intercept).
- 'Area' is the variable $X_1$ whose influence on the target we want to measure (regressor/predictor).
- 'Condition' is a control variable $X_2$ in this linear regression. There can be any number of these control variables.
The 2nd column 'coef' contains the values of the beta coefficients that belong to the corresponding variables: the constant $\hat{\beta}_0$ and the coefficients $\hat{\beta}_1, \hat{\beta}_2, \dots$ The latter are the regression coefficients of the variables included in the linear regression: $\hat{\beta}_1$ is the coefficient of $X_1$, $\hat{\beta}_2$ is the coefficient of $X_2$, and so on. It is a convention to refer to the explanatory variable (also called the predictor or exogenous variable) as $X_1$, with $\hat{\beta}_1$ as its corresponding coefficient. All subsequent variables are control variables. In this example, this is also reflected in the results table of the online linear regression.
First, 'const' appears as the constant of the linear model with $\hat{\beta}_0$, then 'Area' appears as the explanatory variable $X_1$ with its coefficient $\hat{\beta}_1$. Then comes the control variable 'Condition' as $X_2$ with its coefficient $\hat{\beta}_2$.
If $\hat{\beta}_1$ is positive, i.e. $\hat{\beta}_1 > 0$, this indicates a positive relationship between $X_1$ and Y. If $X_1$ increases by one unit, Y increases by $\hat{\beta}_1$. If $\hat{\beta}_1$ is negative, i.e. $\hat{\beta}_1 < 0$, this indicates a negative relationship between the variable $X_1$ and the target Y. The larger $X_1$ becomes, the smaller Y becomes. The same applies to $\hat{\beta}_2$, $\hat{\beta}_3$, etc.
$\hat{\beta}_0$ is the constant in the linear regression model. The target variable Y is equal to $\hat{\beta}_0$ if all other $X_i = 0$. This can be clearly seen from the regression equation. Visually, $\hat{\beta}_0$ corresponds to the Y-intercept in the regression plot.
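The "one unit more" reading of a coefficient can be made concrete with a tiny sketch. The intercept 7500 and slope 2835.37 below are made-up values for illustration, not results from the article's table:

```python
def predict(area, b0=7500.0, b1=2835.37):
    """Hypothetical fitted model: price = b0 + b1 * area."""
    return b0 + b1 * area

# Increasing the predictor by one unit moves the prediction by exactly b1
delta = predict(81.0) - predict(80.0)
print(f"price change per extra unit of area: {delta:.2f}")

# With area = 0 the prediction collapses to the constant b0 (the Y-intercept)
base = predict(0.0)
```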
The Standard Error of Linear Regression
The column "std err" stands for standard error (SE) and shows the standard errors for our regression coefficients $\hat{\beta}_i$. The standard error generally refers to the standard deviation, i.e. dispersion, of a parameter. The standard deviation itself, however, is the dispersion of the raw data points. In the case of linear regression, the standard error is therefore the standard deviation of the beta coefficients $\hat{\beta}_i$. It indicates how much each $\hat{\beta}_i$ in the sample differs on average from the true $\beta_i$ in the population.
The standard error is defined by:

$SE(\hat{\beta}_1) = \sqrt{\dfrac{\sigma^2_\varepsilon}{n \cdot \sigma^2_X}} = \dfrac{\sigma_\varepsilon}{\sigma_X \sqrt{n}}$

- $n$ as sample size,
- $\sigma^2_X$ as variance of the predictor and
- $\sigma^2_\varepsilon$ as the variance of the residuals ε from the regression.
- $\sigma$ is the corresponding square root of the variance – that is, the respective standard deviation.
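The definition above can be verified by hand. The sketch below uses hypothetical area/price data (invented for illustration) and the textbook convention of n − 2 degrees of freedom for the residual variance, which is one common choice and may differ slightly from what a given calculator uses:

```python
import numpy as np

area = np.array([50.0, 60.0, 80.0, 100.0, 120.0])   # hypothetical predictor
price = np.array([150_000.0, 180_000.0, 230_000.0, 290_000.0, 350_000.0])
n = len(area)

# Fit intercept and slope by least squares
X = np.column_stack([np.ones(n), area])
b0, b1 = np.linalg.lstsq(X, price, rcond=None)[0]

residuals = price - (b0 + b1 * area)
# Residual variance, with n - 2 degrees of freedom (intercept + slope estimated)
var_e = residuals @ residuals / (n - 2)
# Sum of squared deviations of the predictor (n * variance of X)
s_xx = ((area - area.mean()) ** 2).sum()

se_b1 = np.sqrt(var_e / s_xx)   # standard error of the slope
print(f"SE(beta_1) = {se_b1:.3f}")
```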
In addition to the "normal" standard error, there are also so-called "robust" standard errors. These robust standard errors have a key advantage in that they take into account heteroskedastic data. Heteroskedastic data is data that does not have a constant variance. This means that the dispersion of the data (along one or more dimensions) increases or decreases. When this happens, the standard error is biased. This leads to incorrect conclusions about beta coefficients, confidence intervals, etc.
In the robust standard error equation, the unequal variance is accounted for by an interaction term $(x_i - \bar{x})^2 e_i^2$. If $x_i$ deviates significantly from the mean $\bar{x}$, the multiplication with the residual $e_i$ "penalizes" the RSE more. A wider distribution of Y at the edges (with respect to X) of the data contributes more to the RSE. Thus, the RSE reflects the increased spread of Y at large (or small, depending on the case) X values. The normal standard error does not take this uneven variance into account. However, it can be seen that if the data does indeed have constant variance, i.e. is homoscedastic, then RSE ≈ SE.
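One common robust variant is the HC0 ("White") estimator; whether the article's calculator uses exactly this form is an assumption. The sketch below contrasts it with the classic SE on hypothetical data, making the $(x_i - \bar{x})^2 e_i^2$ interaction term explicit:

```python
import numpy as np

# Hypothetical data, invented for illustration
x = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
y = np.array([150_000.0, 180_000.0, 230_000.0, 290_000.0, 350_000.0])
n = len(x)

X = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - (b0 + b1 * x)    # residuals
d = x - x.mean()         # deviations of the predictor from its mean

# Classic SE of the slope: assumes constant residual variance
se = np.sqrt((e @ e / (n - 2)) / (d @ d))

# HC0 robust SE: each squared deviation is weighted by its squared residual,
# so observations with large spread at extreme x values count more
rse = np.sqrt((d**2 @ e**2) / (d @ d) ** 2)

print(f"SE = {se:.3f}, robust SE = {rse:.3f}")
```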
The Test Statistic T of the Linear Regression Model
Under the column "t" you will find the test statistics for the respective beta coefficients $\hat{\beta}_i$. The test statistic T is an aggregate value calculated from the available data sample. T is used to decide whether the effect of $X_i$ on Y is also statistically significant. Statistical significance refers to the situation where the effect of $X_i$ actually exists in principle and is not just a result of chance (see null hypothesis).
This concept is discussed in more detail in the
article on hypothesis testing.
The test statistic T measures the extent to which our data reflect the null hypothesis $H_0: \beta_i = 0$, and whether or not we can reject $H_0$. T is defined differently for each statistical parameter; in the case of linear regression and beta coefficients, T is defined as follows:

$T = \dfrac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$
Because T is calculated from a sample of data, it is also subject to random variation, which is represented by the t-distribution.
For example, if T is greater than +1.96 or less than -1.96, the regression coefficient is considered statistically significant with a 5% probability of error (also known as the significance level Alpha). This means that the influence of $X_i$ on Y is real and more than just chance. "Mere" chance is represented by the null hypothesis $H_0: \beta_i = 0$.
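With $T = \hat{\beta}/SE(\hat{\beta})$, the significance check is a single division. The coefficient below is the article's example value for 'Area'; the standard error of 250 is an assumption purely for illustration:

```python
coef = 3268.2735   # beta-hat for 'Area' (example value from the text)
std_err = 250.0    # assumed standard error, for illustration

t = coef / std_err                   # test statistic for H0: beta = 0
significant_at_5pct = abs(t) > 1.96  # two-tailed check at alpha = 5%
print(f"T = {t:.2f}, significant at 5%: {significant_at_5pct}")
```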
The values for the test statistic T vary depending on the significance level being considered or the desired probability of error.
The values for T correspond to quantiles of the t-distribution and can be found in corresponding tables. The most relevant t-values
and their associated metrics are:
| Significance level Alpha | Critical value \|t\| | Confidence interval |
|---|---|---|
| 5% | 1.96 | 95% |
| 1% | 2.576 | 99% |
| 0.1% | 3.291 | 99.9% |
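The critical values in the table are quantiles of the t-distribution, which for large samples coincide with the standard normal quantiles. Python's standard library can reproduce them without lookup tables:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: the limit of the t-distribution for large n

# Two-tailed critical value for each significance level alpha
crit = {alpha: z.inv_cdf(1 - alpha / 2) for alpha in (0.05, 0.01, 0.001)}
for alpha, c in crit.items():
    print(f"alpha = {alpha:<6} -> reject H0 if |t| > {c:.3f}")
```

For small samples, the exact t-quantiles (with n − k degrees of freedom) are somewhat larger than these normal approximations.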
The p-value of Regression Analysis
The p-value indicates the probability of obtaining a T' that is even more extreme than the calculated T from the available data sample.
Thus, p-values represent probabilities under which the null hypothesis would be (falsely) rejected.
For example, if $p < 0.05$ (or 5%), the beta coefficient is considered statistically significant and the influence of X on Y is real and not due to chance. This
statement can then be made at the 5% significance level, which means there is a 5% chance of error. Additional p-values, along with
their corresponding significance levels and T-values, are as follows:
| Significance level Alpha / p-value | Critical value \|t\| | Confidence interval |
|---|---|---|
| 5% | 1.96 | 95% |
| 1% | 2.576 | 99% |
| 0.1% | 3.291 | 99.9% |
It is common to see decimal values in a regression table like the one shown above written with an e, for example $2.5\mathrm{e}{-11}$. This is scientific notation and means that the number is multiplied by a power of ten. The exponent -11 indicates that the decimal point is moved 11 places to the left. Therefore, $2.5\mathrm{e}{-11} = 2.5 \times 10^{-11} = 0.000000000025$.
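Scientific notation can be checked directly in Python; the value `2.5e-11` is just an invented example, as it might appear in a p-value column:

```python
x = 2.5e-11                 # hypothetical value from a p-value column
# 'e-11' shifts the decimal point 11 places to the left:
assert x == 0.000000000025
print(f"{x:.13f}")          # prints the fully written-out decimal
```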
Confidence Intervals of Regression Coefficients
The last two columns show the borders of the 95% confidence intervals around $\hat{\beta}_i$. Since the beta coefficients also vary and can take different values depending on the sample, their distribution can be described by intervals. The second to last column on the left shows the lower bound, i.e. the 2.5% quantile of this distribution. The last column, the right one, represents the 97.5% quantile. If the value 0 falls outside the 95% confidence interval around a calculated $\hat{\beta}_i$, it can be said with a 5% probability of error that $\hat{\beta}_i$ is statistically significant, and the null hypothesis $H_0: \beta_i = 0$ can be rejected. In this case, it is said that the corresponding variable $X_i$ has a real effect on the target Y.
There are several types of confidence intervals; in statistical regression analysis, confidence intervals that are closed at both the upper and lower ends are usually considered. The formula for their calculation is:

$\hat{\beta}_i \pm t_{crit} \cdot SE(\hat{\beta}_i)$, e.g. $\hat{\beta}_i \pm 1.96 \cdot SE(\hat{\beta}_i)$ for the 95% interval.
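The interval bounds can be reproduced from the 'coef' and 'std err' columns. The coefficient below is the article's example value for 'Area'; the standard error of 250 is assumed for illustration:

```python
coef = 3268.2735   # beta-hat for 'Area' (example value from the text)
std_err = 250.0    # assumed standard error, for illustration

lower = coef - 1.96 * std_err   # lower bound of the 95% confidence interval
upper = coef + 1.96 * std_err   # upper bound of the 95% confidence interval

# If 0 lies outside [lower, upper], the coefficient is significant at the 5% level
contains_zero = lower <= 0.0 <= upper
print(f"95% CI: [{lower:.4f}, {upper:.4f}], contains zero: {contains_zero}")
```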
Conclusion: Linear Regression Online
We now understand each column of the online linear regression table and are ready to interpret the results of the data analysis.
We'll start with the second row, since it contains the results for the variable we're interested in: $X_1$ (area). It lists all the metrics relevant to the statistical significance of the variable $X_1$, which in this example is the area of a property. The results indicate that area does indeed have a real impact on the price of a property (who would have guessed?!) and that the observed $\hat{\beta}_1$ does not just differ from 0 by chance. Of course, it is more interesting to describe the strength of this effect. In this case, the price increases by $3268.2735 per additional square meter, since $\hat{\beta}_1 = 3268.2735$.
The following rows in the regression table, namely those for $X_2, X_3, \dots$, represent the results for the control variables, such as the condition of a property in this case. Recall that it is a convention to denote the explanatory variable (also called predictor or exogenous variable) as $X_1$, with $\hat{\beta}_1$ as the corresponding coefficient. All subsequent variables are control variables. Again, we can see which ones are statistically significant.
The top row contains the evaluation for $\hat{\beta}_0$, the constant in the linear regression model. The constant represents the value of Y when all $X_i = 0$. It is the base value from which the linear regression line starts. Visually, $\hat{\beta}_0$ corresponds to the y-intercept. This initial value, the constant in the linear regression model, should always be considered in the context of the problem. Here, it makes little sense to assume a property with an area of 0 square meters.
Bonus: Linear Regression Calculator
All of the above analyses and metrics are calculated using the online linear regression calculator. Once the data analysis is complete,
all of these results can be downloaded as a CSV or Excel file. This allows the results to be used for further analysis, such as inclusion
in a report or scientific study. A plot of the linear regression line is also provided, which can be customized with a variety of
graphical options.
The Linear Regression Online Calculator also provides additional valuable information about the data being analyzed. Further details
about the data can be found below the regression results, which can provide deeper insight into the target variable under investigation.
As a convenience, the interpretation of the results is provided directly. If certain analyses are no longer needed, they can be easily
discarded to maintain clarity without affecting other experiments. To restart the calculator, simply reload the page.