The calculation of the estimators
Standard Error
The standard error of a statistic[1] is the standard deviation of the sampling distribution of that statistic.
Suppose we have a dataset that contains the incomes of people. From this dataset we repeatedly draw samples of size n and calculate the mean of each one. If we plot the distribution of these means (the sampling distribution), we’ll get an approximately normal distribution centered on the population mean, provided n is reasonably large. The standard deviation of this sampling distribution is the standard error of the statistic, which in our case is the mean.
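Standard error is given by the formula:
$$SE = \frac{\sigma}{\sqrt{n}}$$
where $\sigma$ is the standard deviation of the population and $n$ is the sample size.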
Calculating the standard error shows the sampling fluctuation. Sampling fluctuation shows the extent to which a statistic takes on different values.
Notice that the standard error is inversely related to the square root of the sample size n. This means that the larger the sample we draw from the population, the smaller the standard error will be. This also results in a tighter sampling distribution, since its standard deviation is smaller.
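To make this concrete, here is a small simulation sketch. It assumes a made-up population of incomes (the numbers are invented, not real data) and draws samples with replacement so that the standard deviation of the sample means can be compared with the population standard deviation divided by the square root of n. The function name empirical_standard_error is hypothetical.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical "population" of incomes (made-up numbers, not real data).
population = [random.gauss(50_000, 12_000) for _ in range(20_000)]
sigma = statistics.pstdev(population)  # population standard deviation


def empirical_standard_error(n, draws=2_000):
    """Standard deviation of the means of `draws` random samples of size n."""
    # Sampling with replacement, so the theoretical value is sigma / sqrt(n).
    means = [statistics.fmean(random.choices(population, k=n)) for _ in range(draws)]
    return statistics.stdev(means)


# Print the simulated and theoretical standard error side by side.
for n in (25, 100, 400):
    print(n, round(empirical_standard_error(n)), round(sigma / sqrt(n)))
```

The two printed columns should stay close to each other and shrink together as n grows.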
Standard Error of OLS Estimates
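The standard errors of the OLS estimators are given by:
$$se(\hat{\beta}_2) = \frac{\sigma}{\sqrt{\sum x_i^2}}$$
$$se(\hat{\beta}_1) = \sqrt{\frac{\sum X_i^2}{n \sum x_i^2}}\,\sigma$$
where $\hat{\beta}_1$ is the intercept estimator, $\hat{\beta}_2$ is the slope estimator, $x_i = X_i - \bar{X}$ is the deviation of $X_i$ from the sample mean, $n$ is the sample size, and $\sigma^2$ is the constant (homoscedastic) variance of the error term.
All of the terms in the equations above except $\sigma^2$ can be calculated from the sample.
Although $\sigma^2$ is unknown, it can be estimated using:
$$\hat{\sigma}^2 = \frac{\sum \hat{u}_i^2}{n - 2}$$
where $\hat{u}_i$ is the residual for observation $i$.
All of the terms in the above formula would have already been computed as a part of computing the estimators.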
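As a rough worked example, the sketch below computes these standard errors for a tiny made-up dataset. It assumes the textbook expressions se(slope) = sqrt(sigma_hat² / Σx²) and se(intercept) = sqrt(sigma_hat² · ΣX² / (n · Σx²)), with sigma_hat² estimated from the residuals using n - 2 degrees of freedom. All variable names are illustrative.

```python
from math import sqrt

# Made-up data purely for illustration.
X = [4.0, 5.0, 6.0, 6.0, 7.0, 8.0]
Y = [15.0, 18.0, 22.0, 21.0, 27.0, 31.0]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
x = [xi - x_bar for xi in X]  # deviations of X from its mean
y = [yi - y_bar for yi in Y]  # deviations of Y from its mean

# OLS estimates in deviation form.
sum_x_sq = sum(xi ** 2 for xi in x)
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum_x_sq
intercept = y_bar - slope * x_bar

# Residuals and the estimate of the error variance (n - 2 degrees of freedom).
residuals = [Yi - (intercept + slope * Xi) for Xi, Yi in zip(X, Y)]
sigma_hat_sq = sum(u ** 2 for u in residuals) / (n - 2)

# Standard errors of the two estimators.
se_slope = sqrt(sigma_hat_sq / sum_x_sq)
se_intercept = sqrt(sigma_hat_sq * sum(Xi ** 2 for Xi in X) / (n * sum_x_sq))

print(f"slope={slope:.3f} (se {se_slope:.3f}), intercept={intercept:.3f} (se {se_intercept:.3f})")
```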
How it all ties together
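As you calculate the estimators and draw the sample regression curve that passes through your data, you need some numeric measure of how well the curve fits the data, i.e. a measure of “goodness of fit”. This can be given as:
$$\hat{\sigma} = \sqrt{\frac{\sum \hat{u}_i^2}{n - 2}}$$
where $\hat{u}_i$ are the residuals and $n$ is the sample size. This is the positive square root of the estimator of the homoscedastic variance. It is the standard deviation of the Y values about the estimated regression line.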
The standard errors of the estimators
The square of this standard error, i.e. the estimated variance of the residuals, is called the mean squared error.
Let’s see some code
To start off, let’s load the Boston housing dataset.
```python
import pandas as pd
```
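The dataset is not bundled with pandas, and scikit-learn removed its load_boston loader in version 1.2. As a stand-in, the sketch below builds a synthetic DataFrame with the two columns used in the rest of this post, RM and price. The values are made up (only the row count of 506 matches the real dataset), so any numbers you get from it will differ from the ones quoted here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for repeatability

# Synthetic stand-in for the housing data: made-up values, NOT the real dataset.
rooms = rng.normal(loc=6.3, scale=0.7, size=506).round()  # rounded room counts
price = -30 + 8.5 * rooms + rng.normal(scale=6.0, size=506)  # noisy linear relation

df = pd.DataFrame({"RM": rooms, "price": price})
print(df.head())
```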
I’m choosing to use the average number of rooms as the variable to see how it affects price. I’ve rounded it off to make the calculations easier. The scatter plot of the data looks like the following:
This, of course, is the entire population data, so let’s draw a sample.
```python
sample = df.sample(n=100)
```
Now let’s bring back our function that calculates the OLS estimates:
```python
def ols(sample):
```
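A minimal sketch of such a function, assuming the usual closed-form OLS estimates in deviation form. It is not necessarily the original implementation: the helper column names x, y, xy and x_sq are hypothetical, and the toy data is made up.

```python
import pandas as pd


def ols(sample):
    """Return (intercept, slope) for Y regressed on X.

    Assumes `sample` has numeric columns 'X' and 'Y'. The helper columns
    'x', 'y', 'xy' and 'x_sq' are hypothetical names, added in place.
    """
    sample["x"] = sample["X"] - sample["X"].mean()  # deviation from mean of X
    sample["y"] = sample["Y"] - sample["Y"].mean()  # deviation from mean of Y
    sample["xy"] = sample["x"] * sample["y"]
    sample["x_sq"] = sample["x"] ** 2

    slope = sample["xy"].sum() / sample["x_sq"].sum()
    intercept = sample["Y"].mean() - slope * sample["X"].mean()
    return intercept, slope


# Tiny made-up example: Y = 1 + 2 * X exactly.
toy = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0], "Y": [3.0, 5.0, 7.0, 9.0]})
intercept, slope = ols(toy)
print(intercept, slope)
```

Keeping the intermediate columns on the DataFrame means later steps can reuse them instead of recomputing the deviations.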
And now let’s see the coefficients:
```python
sample.rename(columns={'RM': 'X', 'price': 'Y'}, inplace=True)
```
The plot of the regression curve on the sample data looks like the following:
The sample DataFrame holds our intermediate calculations, and we can use them to calculate the standard error. Let’s write a function that does that.
```python
from math import sqrt
```
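A minimal sketch of such a function, assuming the standard error of the regression is the square root of the residual sum of squares divided by n - 2. It recomputes the estimates itself so that it runs on its own, which may differ from the original, and the toy data is made up.

```python
from math import sqrt

import pandas as pd


def standard_error(sample):
    """Standard error of the regression: sqrt(sum of squared residuals / (n - 2)).

    Assumes columns 'X' and 'Y'; the estimates are recomputed here so the
    sketch stands alone.
    """
    x = sample["X"] - sample["X"].mean()
    y = sample["Y"] - sample["Y"].mean()
    slope = (x * y).sum() / (x ** 2).sum()
    intercept = sample["Y"].mean() - slope * sample["X"].mean()

    # Residuals: observed Y minus the fitted value on the regression line.
    residuals = sample["Y"] - (intercept + slope * sample["X"])
    return sqrt((residuals ** 2).sum() / (len(sample) - 2))


# Tiny made-up example.
toy = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0], "Y": [2.0, 4.0, 5.0, 8.0]})
print(standard_error(toy))
```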
Finally, let’s see how much standard deviation we have around the regression line.
```python
standard_error(sample)
```
Coefficient of Determination
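The coefficient of determination, denoted by $r^2$, is a measure of how well the sample regression line fits the data: it is the proportion of the total variation in Y that is explained by the regression model.
For bivariate linear regression,
$$r^2 = \frac{\left(\sum x_i y_i\right)^2}{\sum x_i^2 \sum y_i^2}$$
where $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ are deviations from the sample means.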
Turning this into code, we have:
```python
def coeff_of_determination(sample):
```
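A minimal sketch of such a function, assuming the two-variable form (Σxy)² / (Σx² · Σy²), where x and y are deviations from the sample means. This is one of several equivalent ways to write it and may not match the original body; the toy data is made up.

```python
import pandas as pd


def coeff_of_determination(sample):
    """r^2 for a two-variable regression of Y on X.

    Assumes columns 'X' and 'Y'. Uses r^2 = (sum xy)^2 / (sum x^2 * sum y^2),
    where x and y are deviations from the sample means.
    """
    x = sample["X"] - sample["X"].mean()
    y = sample["Y"] - sample["Y"].mean()
    return (x * y).sum() ** 2 / ((x ** 2).sum() * (y ** 2).sum())


# Tiny made-up example.
toy = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0], "Y": [2.0, 4.0, 5.0, 8.0]})
r_squared = coeff_of_determination(toy)
print(round(r_squared, 3))
```

For a regression with a single explanatory variable, this value equals the square of the sample correlation between X and Y, which gives a quick sanity check.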
For the sample we’ve drawn, $r^2$ is 0.327. This means that only 32.7% of the total variance in Y (the price) is explained by the regression model.
Finito.
[1] A “statistic” is a numerical quantity such as mean, or median.