0%

Having discussed frequentism and Bayesianism briefly, it’s time to move onto hypothesis testing. I was told by an avid reader of my blog that “it reads like a bad textbook”. So I am going to try to make it a little more entertaining.

What is hypothesis testing?

Hypothesis testing is how mathematicians settle their bets. Unlike peasants like ourselves who use a coin flip or fist fights, math nerds like coming up with more sophisticated ways to settle their differences. Let’s say two mathematicians - John and Nash - are at a bar. John feels duped by Nash after having settled a previous bet with a coin flip. He thinks the coin used by Nash was not fair. Nash insists otherwise. How might they test this hypothesis?

For a coin to be fair, the probability $p$ of getting any side is $\frac {1}{2}$. We’ll call Nash’s stance “the null hypothesis” ($H_0: p = \frac{1}{2}$) and John’s stance “the alternative hypothesis” ($H_1: p \ne \frac{1}{2}$). How do they decide who’s right?

If after flipping the coin 999 times, $T = |\hat{p}_n - \frac {1}{2}|$ is large, we’ll reject the null hypothesis and say that John has been conned because the coin is likely to favor one side other the other. Otherwise, we’ll accept the null hypothesis and say that the coin is indeed fair and John is being salty.

But really, what is hypothesis testing?

Hypothesis testing is a way to either accept or reject preconceived notions based on data. For example, if we have a sample of people’s heights, can we accept that it is drawn from a normal distribution with average height being 170cm? Let’s look at it a little more formally.

Billy has a random variable $X$ with PDF $f(x;\theta)$. $\theta$ is the true population parameter unknown to him.

A random variable is a black box that prints a number every so often. The internal mechanics of this black box are a mystery lost in the annals of time. What remains known to this day is that the black box is operated by a PDF - a probability density function. It is the PDF that decides the fate of the numbers that come out. The higher the probability of a number, the more often it’ll get printed.

Master craftsmxn of the PDF knew how to tune it carefully using delicate knobs they called “the parameters of the PDF”. The more complicated the PDF, the more parameters it has, the more powerful it is. Who made Billy’s random variable? To what value is the parameter set?

The purpose of Billy’s life is to achieve nirvana by finding out the true value of $\theta$ - a magic number which may or may not be 42. After a lot of soul searching, traveling to Denver and back, he obtains a sample of size $n$. He diligently calculates $\hat{\theta}$ - his attempt at uncovering the mystery, an estimate for $\theta$.

During his travels with Jack Kerouac, he had an epiphany that $\theta$ might be … some value $\theta^*$ - another number. He’s convinced that he’s found the answer. But how might he convince the drunks John and Nash that he’s right? He’ll need to prove statistically that $\hat{\theta}$ indicates that the sample came from a distribution where $\theta = \theta^*$.

Could an estimate calculated from a sample lead Billy to the light? Let’s go back to the example of people and their heights.

Let $X$ be a random variable representing the height of people such that $X \sim \mathcal{N}(\mu, \sigma^2)$. We’re told that the standard deviation $\sigma = 15$. From a sample of size $n = 100$, we get the average height $\bar{X} = 165 \text{cm}$. Could this have come from a population with average height $170 \text{cm}$? Let’s write this down as $H_0$ and $H_1$.

Our null hypothesis is that yes, this sample came from a population with average height $170 \text{cm}$ and alternative hypothesis is that no, this did not. To test this, we could use one of two approaches - confidence intervals or significance tests - and so can Billy.

Confidence intervals

The first method Billy could use is confidence intervals. A confidence interval is an .. interval .. calculated using the estimate $\hat{\theta}$ in which the true value of $\theta$ may lie. The calculation of the interval involves using probability distributions and is best shown with a concrete example.

In the population example $X$ is distributed normally with sample mean $\bar{X} = 165$. Since $X$ is normally distributed, we can convert it to standard normal distribution with $\mu = 0$ and $\sigma = 1$. We can construct a 95% confidence interval and check whether it contains the value of $\mu$. The standard normal variable $Z$ is calculated as

From the standard normal distribution table, we get $P(-1.96 ~ \le Z ~ \le 1.96) = 0.95$. Let’s rewrite this and come up with a formula for calculating the interval.

When we plug in our values from the population example, we get $162.06 \le \mu \le 167.94$. This interval clearly does not include $\mu = 170$ and so we’ll have to reject $H_0$.

I know a lot has been glossed over. Where did $-1.96$ and $1.96$ come from? Why a $95\%$ confidence interval and why not $99\%$? And what puts the “confidence” in confidence interval? All that and much more in the next post.

Significance tests

The second method is called significance tests. Billy already has $\hat{\theta}$ and his assumption that $\theta = \theta^*$. In this method, Billy can calculate some test statistic and check the probability of getting that statistic under $f(x; \theta^*)$. If the probability is very low (say below $5\%$ or $1\%$), he’ll have to reject $H_0$ and book flight tickets to Santa Clara. Let’s go back to the population example.

We know that $Z \sim \mathcal{N}(\mu = 0, \sigma ^ 2 = 1)$. Also, we have enough data to calculate the $Z$ statistic, and access to the only surviving copy of the standard normal distribution. We can check what the probability of getting the statistic is and either accept or reject the hypothesis. Let’s do that.

From the standard normal table, the probability of getting $-3.33$ is $0.00043$. This is very low and therefore $H_0$ has to be rejected.

Conclusion

Billy can achieve nirvana and the drunks John and Nash can settle their bets in one of two ways - confidence intervals or significance tests. I know a lot of terminology has been skipped for the sake of providing an intuitive explanation of the mechanics of the two methods. This I’ll cover in the next few posts.

What we will build towards is using both these methods in the context of linear regression. Our arduous quest lies in finding $\beta_1$ and $\beta_2$ from $\hat{\beta}_1$ and $\hat{\beta}_2$.

Finito.

When it comes to statistics, there’s two schools of thought - frequentism and Bayesianism. In the coming posts we’ll be looking at hypothesis testing and interval estimation and knowing the difference between the two schools is important. In this post I’ll go over what frequentism and Bayesianism are and how they differ.

What is ‘probability’?

The reason for these two schools of thoughts is the difference in their interpretation of probability. For frequentists, probabilities are about frequency of occurence of events. For Bayesians, probabilities are about degree of certainty of events. This fundamental divide in the definition of probability leads to vastly different methods of statistical analysis.

The two schools

The aim of both the frequentists and Bayesians is the same - to estimate some parameters of a population that are unknown.

The assumption of the frequentist approach is that the parameters of a population are fixed but unknown constants. Since these are constants, no statements of probability can be made about them. The frequentist procedures work by drawing a large number of random samples, calculating a statistic using each of these samples, and then finding the probability distribution of the statistic. This is called the sampling distribution. Statements of probability can be made about the statistic.

The assumption of the Bayesian approach is that the parameters of a population are random variables. This allows making probability statements about them. There is a notion of some true value that the parameters can take with certain probability. The Bayesian approach thus allows adding in some prior information. The cornerstone of Bayesian approach is Bayes’ theorem:

Here, $\mathcal{H}$ is the hypothesis and $\mathcal{D}$ is the data. What the theorem lets us calculate is the posterior $P(\mathcal{H} ~ | ~ \mathcal{D})$ which is the probability of the hypothesis being true given the data. The prior $P(\mathcal{H})$ is the probability that $H$ is true before the data is considered. The prior lets us encode our beliefs about the parameters of the population into the equation. $P(\mathcal{D} ~ | ~ \mathcal{H})$ is the likelihood and is the evidence about hypothesis $\mathcal{H}$ given data $\mathcal{D}$. Finally, $P(\mathcal{D})$ is the probability of getting the data regardless of the hypothesis.

What’s important here is the prior $P(\mathcal{H})$. This is our degree of certainty about the hypothesis $\mathcal{H}$ being true. This probability can itself be calculated using frequentist methods but what matters is the fact that Bayesian approach lets us factor it in.

Frequentist Bayesian
Parameters are fixed, unknown constants. No statements of probability can be made about them. Parameters are random variables. Since random variables have an underlying probability distribution, statements of probability can be made about them.
Probability is about long run frequencies. Probability is about specifying the degree of (un)certainty.
No statements of probability are made about the data or the hypothesis. Statements of probability are made about both data and hypothesis.
Makes use only of the likelihood. Makes use of both the prior and the likelihood.

The procedures

In the frequentist approach, the parameters are an unknown constant and it is the data that changes (by repeated sampling). In the Bayesian approach, the parameters are a random variable and it is the data that stays constant (the data that has been observed). In this section we will contrast the frequentist confidence interval with Bayesian credible interval.

Both confidence intervals and credible intervals are interval estimators. Interval estimators provide a range of values that the true parameters can take.

The frequentist confidence interval

Let’s assume that there’s a true parameter $\theta$, representing the mean of a population, that we are trying to estimate. From a sample of values $(x_1, x_2, \dots, x_n)$ we can then construct two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ and say that the true value $\theta$ lies between $\hat{\theta}_1$ and $\hat{\theta}_2$ with a certain level of confidence.

To be confident that $\hat{\theta}_1 \le \theta \le \hat{\theta}_2$, we need to know the sampling distribution of the estimator. If the random variable $X$ is normally distributed, then the sample mean $\bar{X}$ is also normally distributed with mean $\mu$ (true mean) and variance $\frac{\sigma^2}{n}$ i.e. $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$. In a normal distribution, 95% of the area under the curve is covered by two standard deviations. Therefore, if we want 95% of the intervals to contain the true value of $\theta$, we can construct an interval $\bar{X} \pm 2 \frac{\sigma}{\sqrt{n}}$.

What this means is that if we were to keep drawing samples and constructing these intervals, 95% of these random intervals will contain the true value of the parameter $\theta$. This can be stated more generally as $P(\hat{\theta}_1 \le \theta \le \hat{\theta}_2) = 1 - \alpha$. This means that with probability $1 - \alpha$ (with $0 < \alpha < 1$), the interval between $\hat{\theta}_1$ and $\hat{\theta}_2$ contains the true value of the parameter $\theta$. So, if we set $\alpha = 0.05$, we’re constructing a 95% confidence interval. $\alpha$ is called the level of significance and is the probability of committing type I error.

Let’s suppose we’re trying to find the average height of men. It is normally distributed with mean $\mu$ and standard deviation $\sigma = 2.5$ inches. We take a sample of $n = 100$ and find that the sample mean $\bar{X} = 67$ inches. The 95% confidence interval would be $\bar{X} \pm 2 \left( \frac{2.5}{\sqrt{100}} \right)$. This gives us the confidence interval $66.5 \le \mu \le 67.5$. This is one such interval. If we were to construct 100 such intervals, 95 of them would contain the true parameter $\mu$.

The caveat here is that for simplicity I’ve assumed the critical value to be 2 instead of 1.96 for constructing the interval.

The Bayesian credible interval

A Bayesian credible interval is an interval that has a high posterior probability, $1 - \alpha$, of containing the true value of the parameter. Compared to the frequentist confidence intervals which say that $(1 - \alpha)\%$ of all the intervals calculated will contain the true parameter $\theta$, the Bayesian credible interval says that the calculated interval has $(1 - \alpha)\%$ probability of containing the true parameter.

Let’s suppose we’re interested in the proportion of population [1] that gets 8 hours of sleep every night. The parameter $\theta$ here now represents proportion. A sample of 27 is drawn of which 11 get 8 hours of sleep every night. Therefore, the random variable $X \sim \text{Binom}(27, ~ \theta)$.

To calculate a Bayesian credible interval, we need to assume a subjective prior. Suppose the prior was $\text{Beta}(3.3, 7.2)$. The posterior would thus be $\text{Beta}(11 + 3.3, 27 - 11 + 7.2) = \text{Beta}(14.3, 23.2)$. This can be calculated using scipy as follows:

1
2
3
from scipy import stats
stats.beta.interval(0.9, 14.3, 23.2)
(0.256110060437748, 0.5138511051170076)

The 90% credible interval is (0.256, 0.514).

In summary, here are some of the frequentist approaches and their Bayesian counterparts.

Frequentist Bayesian
Max likelihood estimation (MLE) Max a posteriori (MAP)
Confidence interval Credible interval
Significance test Bayes factor

A thing that I have glossed over is handling of nuisance parameters. Bayesian procedures provide a general way of dealing with nuisance parameters. These are the parameters we do not want to make inference about, and we do not want them to interfere with the inference we are making about the main parameter. For example, if we’re trying to infer the mean of a normal distribution, the variance is a nuisance parameter.

The critique

The prior is subjective and can change from person to person. This is the frequentist critique of the Bayesian approach; it introduces subjectivity into the equation. The frequentist approach make use of only the likelihood. On the other hand, the Bayesian criticism of the frequentist approach is that it uses an implicit prior. Bayes theorem can be restated as $\text{posterior} \propto \text{likelihood} \times \text{prior}$. For the frequentist approach, which makes use of likelihood, the prior would need to be set to 1 i.e. a flat prior. The Bayesian approach makes the use of prior explicit even if it is subjective.

The Bayesian criticism of frequentist procedures is that they do not answer the question that was asked but rather skirt around it. Suppose the question posed was “in what range will the true values of the parameter lie?”. The Bayesian credible interval will give one, albeit subjective, interval. The frequentist confidence interval, however, will give many different intervals. In that sense, frequentism isn’t answering the question posed.

Another Bayesian criticism of the frequentist procedures that they rely on the possible samples that could occur but did not instead of relying on the one sample that did occur. Bayesian procedures treat this sample as the fixed data and vary the parameters around it.

The end.

So far what we’ve been doing is point estimation using OLS. Another method of estimation with more mathematical rigor is max likelihood estimation. In this post we will look at likelihood, likelihood function, max likelihood, and how all this ties together in estimating the parameters of our bivariate linear regression model.

Probability

When we think of probability, we are thinking about the chance of something happening in the future. For example, if a coin is tossed, what is the chance of getting a head? It’s $\frac{1}{2}$.

But how did we arrive at the conclusion that the probability of getting a head is $\frac{1}{2}$? This leads us to two varying schools of thoughts on probability - frequentism and bayesianism.

If the coin were tossed a 1000 times, about half the times the outcome would be a head and half the times it would be a tail. The more frequently an event happens, the more probable / certain it is. This is the frequentist definition of probability. Under this definition, probability is the ratio of the number of times we make the desired observation to the total number of observations.

In the Bayesian approach, probability is a measure of our certainty about something. In a coin toss, there’s a 50% chance of getting a head and a 50% chance of getting a tail because there’s only two possible outcomes.

This subtle difference in the definition of probability leads to vastly different statistical methods.

Under frequentism, it is all about the frequency of observations. The event that occurs more frequently is the more probable event. There is no concept of an underlying probability distribution for the event we are trying to observe.

Under Bayesianism, in contrast, each event that occurs follows an observed probability distribution which is based on some underlying true distribution. Calculating the probabilities involves estimating the parameters of the true distribution based on the observed distribution.

Likelihood

Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome[1].

For example, if a coin were tossed 10 times and out of those 10 times a person guessed the outcome correctly 8 times then what’s the probability that the person would guess equally correctly the next time a coin is tossed 10 times? This is the likelihood. Unlike probability, likelihood is about past events.

The likelihood of observing the data is represented by the likelihood function. The likelihood function is the probability density function of observing the given data. Suppose that we have $n$ observations - $y_1, y_2, \dots y_n$ - of data which are realizations of a random variable $Y$. Let the function $f(Y, \theta)$ represent the probability density function of the random variable $Y$ with parameters $\theta$. The the likelihood of observing $y_1, y_2, \dots y_n$ is given by the likelihood function as:

Max Likelihood Estimation

If we were to maximize the likelihood function, we’d end up with the highest probability of observing the data that we have observed i.e. the max likelihood. The data that we have observed follows some distribution we’d like to estimate. This would require us to estimate the parameters $\theta$ such that they maximize the likelihood function i.e. $\hat{\theta} = \mathop{argmax}\limits_{\theta} \mathcal{L}(y_1, y_2, \dots y_n; \theta)$. The estimator $\hat{\theta}$ is therefore the max likelihood estimator of the parameters $\theta$ such that it maximizes the likelihood function.

The procedure followed in maximizing the likelihood function to compute $\hat{\theta}$ is called max likelihood estimation (MLE). MLE answers the following question: given that the we’ve observed data which is drawn from some distribution with parameters $\theta$, to what value should we set these parameters in the sampling distribution so that the probability of observing this data is the maximum?

The mechanics of finding $\hat{\theta}$ is an optimizaition problem involving a little bit of differential calculus. It involves calculating the partial derivative of the likelihood function with respect to each of the parameters in $\theta$ and setting them to zero. This would yield a system of equations which can then be solved to obtain estimates of the parameters which would maximize the function.

Calculating the partial derivatives like this becomes unwieldy very quickly since the likelihood function $\mathcal{L}$ involves multiplication of lot of terms. We can instead calculate the maximum of log-likelihood since it will convert the product term to a summation term. It is defined as $\mathcal{l(y_1, y_2, \dots y_n; \theta)} = log(\mathcal{L}(y_1, y_2, \dots y_n; \theta))$.

This provides us with enough background to apply MLE to find our regression parameters.

MLE for Regression

In regression, we have our data $Y_1, Y_2 \dots Y_n$ which we assume is normally distributed with mean $\beta_1 + \beta_2 X_i$ and variance $\sigma^2$. The likelihood function for this can be written as:

In the equations above, $f(Y_i) = \frac{1} {\sigma \sqrt{2 \pi}} exp \left\lbrace -\frac{1}{2} \frac{(Y_i - \beta_1 -\beta_2 X_i)^2} {\sigma ^ 2} \right\rbrace$ because we assume $Y_i$ to be normally distributed. As mentioned previously, computing the log-likelihood is easier. The log-likelihood is given as:

To find the ML estimators, we need to differentiate the log-likelihood function partially with respect to $\beta_1$, $\beta_2$, and $\sigma ^ 2$ and set the resulting equations to zero. This gives us:

Setting the above equations to zero and letting $\tilde{\beta}_1$, $\tilde{\beta}_2$, and $\tilde{\sigma} ^ 2$ represent the ML estimators, we get:

Simplifying the first two equation above, we get:

The equations above are the normal equations for OLS estimators. What this means is that the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ are the same as the ML estimators $\tilde{\beta}_1$ and $\tilde{\beta}_2$ when we assume they’re normally distributed. The estimator for homoscedastic variance under ML is given by $\tilde{\sigma} ^ 2 = \frac{1} {n} \sum{\hat{u}_i ^ 2}$. This estimator underestimates the true $\sigma ^ 2$ i.e. it is biased downwards. However, in large samples, $\tilde{\sigma} ^ 2$ converges to the true value. This means that it is asymptotically unbiased.

Summary

In summary, MLE is an alternative to OLS. The method, however, requires that we make an explicit assumption about the probability distribution of the data. Under the normality assumption, the estimators for $\beta_1$ and $\beta_2$ are the same in both MLE and OLS. The ML estimator for $\sigma ^ 2$ is biased downwards but converges to the true value when the sample size increases.

The posts so far have covered estimators. An estimator looks at the sample data and comes up with an estimate for some true population parameter. However, how do we know how close, say $\hat{\beta}_2$, is to the true $\beta_2$? This is a question of statistical inference. To be able to make an inference, $\beta_2$ would need to follow some distribution. CLRM makes no assumptions about the distribution of data. CNLRM (Classic Normal Linear Regression Model), however, adds the assumption of normality i.e. the data and parameters are normally distributed.

In this post we will cover the normal distribution, CNLRM, and how the assumption of normality helps us.

Normal Distribution

A probability distribution is a function which provides the probabilities of the outcomes of a random phenomenon. For example, a random variable $D$ could describe the outcomes of rolling a fair dice. The probability distribution would then be a function that would return $\frac{1}{6}$ for each of the outcomes of $D$.

The normal distribution is the most commonly encoutered probability distribution. It states that the averages (or any other statistic like sum or count) calculated by drawing samples of observations from a random variable tend to converge to the normal. The distribution is defined by two parameters - the mean ($\mu$), and the variance ($\sigma^2$). If a random variable $X$ is normally distributed, it is written notationally as $X \sim \mathcal{N}(\mu, \sigma^2)$. The probability density function of normal distribution is given by:

Distribution of $u_i$, $\hat{\beta}_1$, and $\hat{\beta}_2$

For us to be able to draw any inference about the estimators $\hat{\beta}_1$ and $\hat{\beta}_2$, we will have to make some assumption about the probability distribution of $u_i$. In one of the previous posts we saw that $\hat{\beta}_2 = \frac{\sum{x_iy_i}} {\sum{x_i^2}}$. This can alternatively be written as $\hat{\beta}_2 = \frac{\sum{x_iY_i}} {\sum{x_i^2}} = \sum{k_iY_i}$ where $k = \frac{x_i}{\sum{x_i^2}}$. So we have

The $k_i$, betas, and $X_i$ are fixed and therefore $\hat{\beta}_2$ is a linear function of $u_i$. Under CNLRM, we assume $u_i$ to be to be normally distributed. A property of normally distributed random variables is that their linear functions are also normally distributed. What this means is that since $\hat{\beta}_1$ and $\hat{\beta}_2$ are linear functions of $u_i$, they are also normally distributed.
As mentioned previously, a normal distribution is described by its mean and variance. This means $u_i$ can also be described by its mean and variance as $u_i \sim \mathcal{N}(0, \sigma^2)$. What this shows us is that we assume, on average, for $u_i$ to be zero and there to be constant variance (homoscedasticity).

A further assumption we make is that $cov(u_i, u_j) = 0$ where $i \ne j$. This means that the $u_i$ and $u_j$ are uncorrelated and independently distributed because zero correlation between two normally distributed variables implies statstical independence.

Also note that since $u_i$ is normally distributed, so is $Y_i$. We can state this as $Y_i \sim \mathcal{N}(\beta_1 + \beta_2X_i, \sigma^2) $

Why do we assume normality?

We assume normality for the following reasons:

First, the $u_i$ subsume the effect of a large number of factors that affect the dependent variable but are not included in the regression model. By using the central limit theorem we can state that the distribution of the sum of these variables will be normal.

Second, because we assume $u_i$ to be normal, both $\hat{\beta}_1$ and $\hat{\beta}_2$ are normal since they are linear functions of $u_i$.

Third, the normal distribution is fairly easy to understand as it involves just two parameters, and is extensively studied.

Properties of estimators under CNLRM

  1. They are unbiased.
  2. They have minimum variance.

Combinining 1. and 2. means that the estimators are efficient - they are minimum-variance unbiased.

  1. They are consistent. This means that as the size of the sample increases, the values of the estimators tend to converge to the true population value.

  2. The estimators have the least variance in the class of all estimators - linear or otherwise. This makes the estimators best unbiased estimators (BUE) as opposed to best linear unbiased estimators (BLUE).

Misc

Earlier in the post I mentioned that $\hat{\beta}_2 = \frac{\sum{x_iY_i}} {\sum{x_i^2}}$. This section gives the derivation of this expression. We will look at the numerator for ease of calculation.

Finito.

In one of the previous posts we looked at the assumptions underlying the classic linear regression model. In this post we’ll take a look at the Gauss-Markov theorem which uses these assumptions to state that the OLS estimators are the best linear unbiased estimators, in the class of linear unbiased estimators.

Gauss-Markov Theorem

To recap briefly, some of the assumptions we made were that:

  1. On average, the stochastic error term is zero i.e. $E(u_i | X_i) = 0$.
  2. The data is homoscedastic i.e. the variance / standard deviation of the stochastic error term is a finite value regardless of $X_i$.
  3. The model is linear in parameters i.e. $\beta_1$ and $\beta_2$ have power 1.
  4. There is no autocorrelation between the disturbance terms.

The Gauss-Markov theorem states the following:

Given the assumptions underlying the CLRM, the OLS estimators, in the class of linear unbiased estimators, have minimum variance. That is, they are BLUE (Best Linear Unbiased Estimators).

We’ll go over what it means for an estimator to be best, linear, and unbiased.

Linear

The dependent variable $Y$ is a linear function of the variables in the model. The model must be linear in parameters. Under this assumption $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$ and $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i^2 + \hat{u}_i$ are both valid models since the estimators are linear in both of them.

Unbiased

This means that on average, the value of the estimator is the same as the parameter it is estimating i.e. $E(\hat{\beta}_1) = \beta_1$.

Best

This means that among the class of linear estimators, the OLS estimators will have the least variance. Consider the two normal distributions shown in the graph above for two linear, unbiased estimators $\hat{\beta}_1$ (represented by orange) and $\beta_1^*$ (represented by blue). Also let’s assume that the true value of $\beta_1$ is zero. This means that both of these are unbiased since, on average, their value is the same as the true value of the parameter. However, the OLS estimator $\hat{\beta}_1$ is better since it has lesser variance compared to the other estimator.

Finito.

The calculation of the estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ is based on sample data. As the sample drawn changes, the value of these estimators also changes. This leaves us with the question of how reliable these estimates are i.e. we’d like to determine the precision of these estimators. This can be determined by calculating the standard error or the coefficient of determination.

Standard Error

The standard error of a statistic[1] is the standard deviation of the sampling distribution of that statistic.

Suppose we have a dataset which contains incomes of people. From this dataset we start drawing samples of size n and calculating the mean. Now if we plot the distribution (the sampling distribution) of the means we calculated, we’ll get a normal distribution centered on the population mean. The standard deviation of this sampling distribution is the standard error of the statistic which in our case is mean.

Standard error is given by the formula:

Where
$\sigma$ is the standard deviation of the population.
$n$ is the sample size.

Calculating the standard error shows the sampling fluctuation. Sampling fluctuation shows the extent to which a statistic takes on different values.

Notice that the standard error has an inverse relation with the sample size n. This means that the larger the sample we draw from the population, the lesser the standard error will be. This will also result in a tighter normal distribution since the standard deviation will be less.

Standard Error of OLS Estimates

The standard error of the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ is given by:

where $\sigma$ is the square root of true but unknown constant of homoscedastic variance $\sigma^2$.

All of the terms in the equations above except $\sigma^2$ can be calculated from the sample drawn. Therefore, we will need an unbiased estimator $\hat{\sigma}^2 = \frac{\sum{\hat{u}_i^2}}{n - 2}$ . The denominator $n-2$ represents the degrees of freedom. $\sum{\hat{u}_i^2}$ is the residual sum of squares.

Although $\hat{u}_i = Y_i - \hat{Y}_i$, its summation can be computed with an alternative formula as .

All of the terms in the above formula would have already been computed as a part of computing the estimators.

How it all ties together

As you calculate the estimators and draw the sample regression curve that passes through your data, you need some numeric measure of how good the curve fits the data i.e. a measure of “goodness of fit”. This can be given as:

This is the positive square root of the estimator of homoscedastic variance. This is the standard deviation of the $Y$ values about the regerssion curve.

The standard errors of the estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ will show you how much they fluctuate as you draw the samples. The lesser their standard error, the better.

The square of the standard error is called the mean squared error.

Let’s see some code

To start off, let’s load the Boston housing dataset.

1
2
3
4
5
6
7
8
import pandas as pd
from sklearn import datasets
boston = datasets.load_boston()
data, target = boston.data, boston.target
df = pd.DataFrame(data=data, columns=boston.feature_names)
df = df[['RM']]
df['RM'] = df['RM'].apply(round)
df['price'] = target

I’m choosing to use the average number of rooms as the variable to see how it affects price. I’ve rounded it off to make the calculations easier. The scatter plot of the data looks the following:

This, ofcourse, is the entire population data so let’s draw a sample.

1
sample = df.sample(n=100)

Now let’s bring back our function that calculates the OLS estimates:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
def ols(sample):
Xbar = sample['X'].mean()
Ybar = sample['Y'].mean()

sample['x'] = sample['X'] - Xbar
sample['x_sq'] = sample['x'] ** 2
sample['y'] = sample['Y'] - Ybar
sample['xy'] = sample['x'] * sample['y']

beta2cap = sample['xy'].sum() / sample['x_sq'].sum()
beta1cap = Ybar - (beta2cap * Xbar)

sample['Ycap'] = beta1cap + beta2cap * sample['X']
sample['ucap'] = sample['Y'] - sample['Ycap']

return beta1cap, beta2cap

and now let’s see coefficients

1
2
sample.rename(columns={'RM': 'X', 'price': 'Y'}, inplace=True)
intercept, slope = ols(sample)

The plot of the regression curve on the sample data looks the following:

The sample DataFrame has our intermediate calculation and we can use that to calculate the standard error. Let’s write a function which does that.

1
2
3
from math import sqrt
def standard_error(sample):
return sqrt((sample['ucap'] ** 2).sum() / (len(sample) - 2))

Finally, let’s see how much standard deviation we have around the regression line.

1
2
standard_error(sample)
7.401174774558201

Coefficient of Determination

The coefficient of determination denoted by $r^2$ (for bivariate regression) or $R^2$ (for multivariate regression) is a ratio of the explained variance to the total variance of the data. The higher the coefficient of determination, the more accurately the regressors (average number of rooms in the house) explain the regressand (the price of the house) i.e. the better your regression model is.

For bivariate linear regression, $r^2$ value is given by:

$r^2$ can take on a value between 0 and 1.

Turning this into code, we have:

1
2
3
def coeff_of_determination(sample):
Ybar = sample['Y'].mean()
return ((sample['Ycap'] - Ybar) ** 2).sum() / ((sample['Y'] - Ybar) ** 2).sum()

For the sample we’ve drawn, the $r^2$ value comes out to be 0.327. This means that only 32.7% of the variance in the total data is explained by the regression model.

Finito.


[1] A “statistic” is a numerical quantity such as mean, or median.

In the previous post we calculated the values of $\hat{\beta}_1$ and $\hat{\beta}_2$ from a given sample and in closing I mentioned that we’ll take a look at finding out whether the sample regression line we obtained is a good fit. In other words, we need to assess how close $\hat{\beta}_1$ and $\hat{\beta}_2$ are to $\beta_1$ and $\beta_2$ or how close $\hat{Y}_i$ is to $E(Y|X_i)$. Recall that the population regression function (PRF) is given by $Y_i = \beta_1 + \beta_2X_i + u_i$. This shows that the value of $Y_i$ depends on $X_i$ and $u_i$. To be able to make any statistical inference about $Y_i$ (and $\beta_1$ and $\beta_2$) we need to make assumptions about how the values of $X_i$ and $u_i$ were generated. We’ll discuss these assumptions in this post as they’re important for the valid interpretation of the regression estimates.

Classic Linear Regression Model (CLRM)

There are 7 assumptions that Gaussian or Classic linear regression model (CLRM) makes. These are:

1. The regression model is linear in parameters

Recall that the population regression function is given by $Y_i = \beta_1 + \beta_2X_i + u_i$. This is linear in parameters since $\beta_1$ and $\beta_2$ have power 1. The explanatory variables, however, do not have to be linear.

2. Values of X are fixed in repeated sampling

In this series of posts, we’ll be looking at fixed regressor models[1] where the values of X are considered to be fixed when drawing a random sample. For example, in the previous post the values of X were fixed between 80 and 260. If another sample were to be drawn, the sample values of X would be used.

3. Zero mean value of the error term $u_i$

This assumption states that for every given $X_i$, the error term $u_i$ is zero, on average. This should be intuitive. The regression line passes through the conditional mean for every $X_i$. The error term $u_i$ are distances of the individual points $Y_i$ from the conditional mean. Some of them lie above the regression line and some of them lie below. However, they cancel each other out; the positive $u_i$ cancel out the negative $u_i$ and therefore $u_i$ is zero, on average. Notationally, $E(u_i|X_i) = 0$.

What this assumption means that all those variables that were not considered to be a part of the regression model do not systematically affect $E(Y|X_i)$ i.e. their effect on average is zero on $E(Y|X_i)$.

This assumption also means is that there is no specification error or specification bias. This happens when we chose the wrong functional form for the regression model, exclude necessary explanatory variables or include unnecessary ones.

4. Constant variance of $u_i$ (homoscedasticity)

This assumption states that the variation of $u_i$ is the same regardless of $X$.

With this assumption, we assume that the variance of $u_i$ for a given $X_i$ (i.e. conditional variance) is a positive constant value given by $\sigma^2$. This means that the variance for the various $Y$ populations is the same. What this also means is that the variance around the regression line is the same and it neither increases nor decreases with change in values of $X$.

5. No autocorrelation between disturbances

Given any two $X$ values, $X_i$ and $X_j$ ($i \neq j$), the correlation between any two $u_i$ and $u_j$ is zero. In other words, the observations are independently sampled. Notationally, $cov(u_i, u_j|X_i, X_j) = 0$. This assumption states that there is no serial correlation or autocorrelation i.e. $u_i$ and $u_j$ are uncorrelated.

To build an intuition, let’s consider the population regression function $Y_t = \beta_1 + \beta_2X_t + u_t$

Now suppose that $u_t$ and $u_{t-1}$ are positively correlated. This means that $Y_t$ depends not only on $X_t$ but also on $u_{t-1}$ because that affects $u_t$. Autocorrelation occurs in timeseries data like stock market trends where the the observation of one day depends on the observation of the previous day. What we assume is that each observation is independent as it is in case of the income-expense example we saw.

6. The sample size $n$ must be greater than the number of parameters to estimate

This is fairly simple to understand. To be able to estimate $\beta_1$ and $\beta_2$, we need atleast 2 pairs of observations.

7. Nature of X variables

There are a few things we assume about the value of $X$ variables. One, for a given sample they are not all the same i.e. $var(X)$ is a positive number. Second, there are no outliers among the values.

If we have all the values of $X$ the same in a given sample, the regression line would be a horizontal line and therefore it will be impossible to estimate $\beta_2$ and thus $\beta_1$. This will happen because in the equation $\hat{\beta}_2 = \frac{\sum{x_iy_i}}{\sum{x_i^2}}$, the denominator will become zero since $Xi$ and $\bar{X}$ will be the same.

Furthermore, we assume that there are no outliers in the $X$ values. Suppose there are a few $X$ values which are far apart from the mean value, the regression line which we get will be vastly different depending on whether the sample drawn contains these outliers or not.

These are the 7 assumptions underlying OLS. In the next section, I’d like to elucidate on assumption 3 and 5 a little more.

More on Assumption 3. and 5.

In assumption 3. we state that $E(u_i|X_i) = 0$ i.e. on average the disturbance term is zero. For the expected value of a variable $u_i$ given another variable $X_i$ be zero means the two variables are uncorrelated. We assume these two variables to be uncorrelated so that we can assess the impact each has on the predicted value of $Y_i$. If the disturbance $u_i$ and the explanatory variable $X_i$ were correlated positively or negatively then it means that $u_i$ contains some factor that affects the predicted value of $Y_i$ i.e. we’ve skipped an important explanatory variable that should’ve been included in the regression model. This means we have functionally misspecified our regression model and introduced specification bias.

Functionally misspecifying our regression model will also introduce autocorrelation in our model. Consider the two plots shown below.

In the first plot the line fits the sample points properly. In the second one, it doesn’t fit properly. The sample points seem to form a curve of some sort while we try to fit a straight line through it. What we get is runs of negative residuals followed by positive residuals which indicates autocorrelation.

Autocorrelation can also be introduced not just by functionally misspecifying the regression model but also by the nature of data itself. For example, consider timeseries data representing stock market movement. The value of $X_t$ depends on $X_{t-1}$ i.e. the two are correlated.

When there is autocorrelation, the OLS estimators are no longer the best estimators for predicting the value of $Y_i$.

Conclusion

In this post we looked at the assumptions underlying OLS. In the coming post we’ll look at the Gauss-Markov theorem and the BLUE property of OLS estimators.


[1] There are also stochastic regression models where the values of X are not fixed.

In the previous post we built the intuition behind linear regression. In this post we’ll dig deeper into the simplest form of linear regression which involves one dependant variable and one explanatory variable. Since we only have two variables involved, it’s called bivariate linear regression.

Ordinary Least Squares

We defined sample regression function as:

The term $\hat{u}_i$ is called the residual. If we re-write the sample regression function, we get $\hat{u}_i = Y_i - \hat{Y}_i$. The residual then represents how far off our estimated value $\hat{Y}_i$ is from the actual value $Y_i$.[1] A reasonable assumption to make is that we should pick $\hat{\beta}_1$ and $\hat{\beta}_2$ such that the sum of $\hat{u}_i$ is the smallest. This would mean that our estimates are close to the actual values.

This assumption, however, does not work. Let’s assume that our values for $\hat{u}_i$ are -10, 3, -3, 10. Here the sum is zero but we can see that the first and last predictions are far apart from the actual values. To mitigate this, we can add the squares of the residuals. We need to pick $\hat{\beta}_1$ and $\hat{\beta}_2$ such that the sum of the square of the residuals is the minimum i.e. $\sum{\hat{u}_i^2} = \sum{(Y_i - \hat{Y}_i)^2}$ is the least.

Again, assuming the values of $\hat{u}_i$ to be -10, 3, -3, 10, the squares are 100, 9, 9, 100. This sums to 218. This shows us that the predicted values are far from the actual values. $\sum{\hat{u}_i^2}$ is called the residual sum of squares (RSS) or the squared error term.

The method of OLS provides us with estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ such that, for a given sample set, $\sum{\hat{u}_i^2}$ is the least. These estimators are given by the formula:

where $x_i$ = $X_i - \bar{X}$, $y_i = Y_i - \bar{Y}$. These are the differences of individual values from their corresponding sample mean $\bar{X}$ or $\bar{Y}$. The estimators thus obtained are called least-squares estimators.

Now that we’ve covered significant ground, let’s solve a sum by hand.

Example

Let’s start by drawing a random sample from the original dataset. The sample I’ve drawn looks like this:

1
2
3
4
5
6
7
8
9
10
11
income (X)  expenditure (Y)
80 65
100 70
120 84
140 80
160 118
180 130
200 144
220 137
240 191
260 150

To see how this contrasts with the original dataset, here’s a plot of the original dataset and the random sample. The faint green points represent the original dataset whereas the red ones are the sample we’ve drawn. As always, our task is to use the sample dataset to come up with a regression line that’s as close as possible to the population dataset i.e. use the red dots to come up with a line that’d be similar to the line we’d get if we had all the faint green dots.

I’ll write some Python + Pandas code to come up with the intermediate calculations and the final $\hat{\beta}_1$ and $\hat{\beta}_2$ values. I highly recommend that you solve this by hand to get the feel for it.

regression.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def ols(sample):
Xbar = sample['X'].mean()
Ybar = sample['Y'].mean()

sample['x'] = sample['X'].apply(lambda Xi: Xi - Xbar)
sample['x_sq'] = sample['x'].apply(lambda x: x ** 2)
sample['y'] = sample['Y'].apply(lambda Yi: Yi - Ybar)
sample['xy'] = sample[['x', 'y']].apply(lambda row: row['x'] * row['y'], axis='columns')

beta2cap = sample['xy'].sum() / sample['x_sq'].sum()
beta1cap = Ybar - (beta2cap * Xbar)

sample['Ycap'] = sample['X'].apply(lambda X: beta1cap + (beta2cap * X))

sample['ucap'] = sample[['Y', 'Ycap']].apply(lambda row: row['Y'] - row['Ycap'], axis='columns')

return beta1cap, beta2cap

On calling the function, these are the results I get:

1
2
>>> ols(sample)
(9.696969696969703, 0.6306060606060606)

and the intermediate calculations are the following:

1
2
3
4
5
6
7
8
9
10
11
     X    Y     x    x_sq     y      xy        Ycap       ucap
0 80 65 -90.0 8100.0 -51.9 4671.0 60.145455 4.854545
1 100 70 -70.0 4900.0 -46.9 3283.0 72.757576 -2.757576
2 120 84 -50.0 2500.0 -32.9 1645.0 85.369697 -1.369697
3 140 80 -30.0 900.0 -36.9 1107.0 97.981818 -17.981818
4 160 118 -10.0 100.0 1.1 -11.0 110.593939 7.406061
5 180 130 10.0 100.0 13.1 131.0 123.206061 6.793939
6 200 144 30.0 900.0 27.1 813.0 135.818182 8.181818
7 220 137 50.0 2500.0 20.1 1005.0 148.430303 -11.430303
8 240 191 70.0 4900.0 74.1 5187.0 161.042424 29.957576
9 260 150 90.0 8100.0 33.1 2979.0 173.654545 -23.654545

Let’s plot the line we obtained as a result of this.

Interpreting the Results

Our calculations gave us the results that $\hat{\beta}_1 = 9.6969$ and $\hat{\beta}_2 = 0.630$. Having a slope ($\hat{\beta}_2$) of $0.630$ means that for every unit increase in income, there’s an increase of $0.630$ in expenditure. The intercept $\hat{\beta}_1$ is where the regression line meets the Y axis. This means that even without any income, a persion would have an expenditure of $9.6969$. In a lot of cases, however, the intercept term doesn’t really matter as much as the slope and how to interpret the intercept depends upon what the Y axis represents.

Conclusion

That’s it for this post on bivariate linear regression. We saw how to calculate $\hat{\beta}_1$ and $\hat{\beta}_2$ and worked on a sample problem. We also saw how to interpret the result of the calculations. In the coming post we’ll look at how to assess whether the regression line is a good fit.


[1] It’s important to not confuse the stochastic error term $u_i$ with the residual $\hat{u}_i$. The stochastic error term represents the inherent variability of data whereas the residual represents the difference between the predicted and actual values of $Y_i$.

The first topic we’ll look at is linear regression. As mentioned previously, regression is about understanding dependence of one variable on another. We’ll start by building an intuitive understanding of what it is before we dive into the math-y details of it. We’ll cover the theory, learn to solve some problems by hand, look at the assumptions supporting it, then look at what happens if these assumptions aren’t held. Once this is done we’ll look at how we can do the same by using gradient descent.

What is Regression Analysis?

Regression analysis is the study of the dependence of one variable (dependent variable, regressand) over one or more other variables (explanatory variables, regressors). The aim is to see how the dependent variable changes, on average, if there is a change in the explanatory variables. This is better explained with an example.

The plot shown here is of daily income and daily expenditure. The dependent variable here is the expenditure which depends on the explanatory variable income. As we can see from the plot, for every income level we have a range of expenditures. However, as the regression line shows, on average as income goes up, the expenditure goes up, too.

Why do we say that the income goes up on average? Let’s look at the raw data when daily income is 80 or 100.

1
2
3
4
5
6
7
8
9
10
11
12
income (X)  expenditure (Y)  average
80 55 65.0
80 60
80 65
80 70
80 75
100 65 77.0
100 70
100 74
100 80
100 85
100 88

The highest expenditure when income is 80 (75) is greater than the lowest expenditure when income level is 100 (65). If we look at the averages, the expenditure has gone up. There are a number of factors which may have caused the person with income 80 to expend 75 but as a whole, people with his daily income tend to expend less than those who earn 100 a day. The regression line passes through these averages.

That’s the basic idea behind regression analysis.

The Math behind Regression Analysis

The regression line passes through the average expenditure for every given income level. In other words, it passes through the conditional expected value — $E(Y|X_i)$. This is read as “the expected value of expenditure (Y) given an income level ($X_i$)”. Therefore, $E(Y|X_i)$ is a function of $X_i$ i.e. $E(Y|X_i) = f(X_i)$. The question now is: what is the function $f$?

We can start by assuming $f$ to be a straight line function which means that we’re assuming expenditure to be linearly related to income. So, $E(Y|X_i) = \beta_1 + \beta_2X_i$. This is the slope-intercept form of a line where $\beta_1$ is the intercept and $\beta_2$ is the slope of the line. Both $\beta_1$ and $\beta_2$ are called the regression coefficients (or parameters). The function $f$ is called the population regression function.

Now given that we know how to calculate the average expenditure for a given income level, how do we calculate the expenditure of each individual at that income level? We can say that the individual’s expenditure is “off the average expenditure by a certain margin”. This can be written as $Y_i = E(Y|X_i) + u_i$. The term $u_i$ denotes how far off an individual’s expenditure is from the average and is called the stochastic error term.

No matter how good our regression model is, there is always going to be some inherent variability. There are factors other than income which define a person’s expenditure. Factors like gender, age, etc. affect the expenditure. All of these factors are subsumed into the stochastic error term.

When it comes to applying regression analysis, we won’t have access to the population data. Rather, we have access to sample data. Our task is to come up with sample regression coefficients such that they’ll be as close as possible to the population regression coefficients. The sample regression function is defined as:

Here $\hat{\beta}_1$, $\hat{\beta}_2$, $\hat{u}_i$, and $\hat{Y}_i$ are estimators of $\beta_1$, $\beta_2$, $u_i$, and $E(Y|X_i)$ respectively. An estimator is simply a formula which tells us how to estimate the population parameter from the information provided by the sample data available to us.

What it means to be ‘Linear’

Linearity can be defined as being either linear in parameters or linear in explanatory variables. To be linear in parameters means that all your parameters in the regression function will have power 1. By that definition, $E(Y|X_i) = \beta_1 + \beta_2X_i^2$ is linear. In contrast, $E(Y|X_i) = \beta_1 + \beta_2^2X_i$ is linear in explanatory variables since $X_i$ has power 1. We’ll be looking at models that are linear in parameters. Henceforth, “linear” would mean “linear in parameters”.

That’s it. This is what sets the foundation for further studying linear regression.

A while ago I’d started writing a series on machine learning titled ML for Newbies which I never got around to finishing. This series of posts is a redoing of the older series with greater mathematical, and theoretical rigor. Some of the content will be ported over from that series but most of it will be new content.

The aim of the series is to summarize everything that I know about machine learning in an approachable, intuituve manner, and also be notes for myself.

What is Machine Learning?

ML is all about automated learning. It is about creating algorithms that can “learn” from the input given to them. These algorithms can then do a number of tasks such as filter out spam emails, predict the weather, etc. More formally, it’s about converting experience into expertise. The input to the algorithm become its “experience” and the task that it performs becomes its “expertise”.

Although ML is a sub-field of AI (Artificial Intelligence), they’re different. AI focuses on replicating human intelligence whereas ML focuses on complementing human intelligence by doing tasks that fall beyond human capabilities like looking for patterns in massive data sets.

How does a Machine Learn?

Like I mentioned earlier, a machine learns from the input data given to it. The way the machine learns can also be classified as supervised, unsupervised, active, or passive.

Supervised Learning

Let’s say your task is to build a spam filter. You have a large data set of emails, each of which is marked as either “spam” or “not spam”. You’ll feed this data set to your algorithm and it’ll learn from it. Later, you’ll input an email and it’ll tell you if it is spam or not. This type of learning is called supervised learning. It is called “supervised” because the algorithm was first taught what is spam and what isn’t by a data set generated manually. You had to supervise its learning process.

Unsupervised Learning

Let’s stay with the spam filter example. Let’s now suppose that the emails in the data set are not marked as “spam” or “not spam”. It’s the task of the algorithm to figure it out somehow. Such learning is called unsupervised learning.

Passive Learning

In passive learning, the algorithm only observes the data provided to it and learns from it, without influencing it in anyway. Suppose now that the spam filter starts out blank with no notion of what is and what isn’t spam. It’s up to the user to mark the incoming emails so that the algorithm can learn from it. This is passive learning.

Active Learning

In contrast, let’s say that the spam filter now looked at the email first, and if it found something suspicious, asked the user to mark it as spam or not. This is active learning.

What can ML do?

There are a number of ML algorithms, each focusing on a different task. These algorithms can do a number of things like:

Clustering

In simple terms, clustering is grouping. It’s an unsupervised form of learning in which the algorithm looks at the data and groups the similar pieces of together based on some similarity measure. In the spam filter examples above, when the algorithm, on its own, figured out which of the emails are spam, it clustered them together.

Classification

Classification is about finding out which group a piece of data belongs to. It’s a form of supervised learning because the algorithm needs to know beforehand what each of the group looks like before it can be asked to classify a piece of data. When the spam filter was given a data set with marked emails, all it had to do was to answer the question “is this email spam?”. Basically, it was asked to classify that email into a group.

Regression

Regression is about understanding relationships between variables. If one variable changes, how would it affect the other? For example, if the amount of power needed for a phone to stay on increased, how would it affect battery life?

Association Rules

Finding out association rules in data sets is also about finding out relationships between variables. Suppose you have a large data set from an online store, association rules is about finding out what items you are likely to buy based on what you’ve previously bought. For example, if you buy milk and sugar, you’re likely to buy eggs. Your rule then becomes {milk, sugar} => eggs.

For now, don’t worry about the theoretical rigor of these. We’ll get to that in subsequent posts. That’s all for now. I hope this post was motivation enough for you to start learning ML. :)

So, I am taking a break from my JVM JIT series to write this post about a table summarization algorithm that I had implemented way back in my college days. Simply put, the algorithm finds the most descriptive patterns in the table that succinctly convey the meaning of the rows contained i.e. summarize it. This post will focus on what the algorithm is and how it’s implemented in JavaScript. Take this post with a grain of salt because the implementation was a part of my college assignment and hasn’t been thoroughly checked for correctness.

What is TSum?

TSum is the algorithm published by Google Research [link to the pdf]. To quote the abstract:

Given a table where rows correspond to records and columns correspond to attributes, we want to find a small number of patterns that succinctly summarize the dataset. … TSum, a method that provides a sequence of patterns ordered by their “representativeness.” It can decide both which these patterns are, as well as how many are necessary to properly summarize the data.

An algorithm like this is useful in situations where the volume of data is so large that it’d be impossible for a human to deduce anything from it simply by looking at the data. For example, given patient records, a description like “most patients are middle-aged men with cholestrol, followed by child patients with chickenpox” provides a very human-readable summary of the data.

The algorithm views finding the best pattern as a compression problem — the best pattern will give the best compression.

Definitions

Pattern and Pattern Size

A pattern $P$ is a tuple $(A_1 = u_1, …, A_D = u_D)$ where $u_i$ can can be either a specific value or a “don’t care” value represented by $\ast$. The number of matching attributes in a pattern is called the size of the pattern and is denoted by $size(P)$. An attribute is considered matching if and only if its value doesn’t equal “don’t care” value.

Pattern List and Pattern Set

A pattern list is an ordered sequence of patterns $\mathcal{P} = [P_1, P_2 …]$ while a pattern set is a set of patterns $\mathcal{P} = \{P_1, P_2 … \}$ with no order.

Compression Saving

Compression saving of a pattern $P$ on a table $\mathcal{T}$, denoted as $Saving(P,\mathcal{T})$ is the amount of compression it can achieve. Specifically, it is the difference of bits between he uncompressed and compressed representations of the data records covered by the pattern $P$. Let $N$ be the number of records covered by pattern $P$ and $D$ be the number of attributes in $\mathcal{T}$. Then,

where

and

$W_i$ is the average number of bits in the ith attribute.
$D$ is the number of attributes.
$log^*(N) = 2 \ast \lceil log_2(N+2) \rceil$

Residue Data Table

Given a table $\mathcal{T}$ and a pattern collection $\mathcal{P}$, the residue data table contains the records that are not covered by any of the patterns in $\mathcal{P}$.

Pattern Marhsalling

Given a set of patterns $\mathcal{S}$, the pattern marshalling algorithm picks a pattern from $\mathcal{S}$ which has the highest compression saving. After every iteration, the records in the table which have been covered by the patterns chosen so far will be removed from consideration.

Generating Patterns - Local Expansion

Local expansion is a pattern generation strategy that tries to “grow” the patterns to increase the compression savings on the data record. In this approach, the algorithm will start with single attributes first and and find a single-condition pattern $P = \{ (A=a) \}$ that has the best compression saving, and then expand the pattern by adding other conditions until the compression cost cannot be improved. To find the next pattern, the same procedure is repeated, but only on the residue data table - the part of the table not covered by any of the patterns found so far.

Code

We’ll start with simpler code first. Let’s start by writing a function to calculate the benefit.

benefit.js
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
"use strict";
const _ = require("lodash");
/**
* The benefit of using this pattern to represent the table
* @param T the table containing rows
* @param pattern the pattern to be applied
*/
function benefit(pattern, T){
if(pattern == null) return 0;

// a function for internal use.
// find out the bits needed to encode the value
function bitsFor(value) {
if( typeof(value) === "string" || value instanceof String){
// JS uses UTF-16
return value.length * 16;
}
if( typeof(value) === "number" || value instanceof Number){
// 64-bit for each number
return 64;
}
return 0;
}

let N = _.filter(T, pattern).length;
let W = 0;
let attributes = Object.keys(pattern);

attributes.forEach(attribute => {
W += bitsFor(pattern[attribute]);
});

return (N-1) * W;
}
module.exports = benefit;

The actual translation of formula to code happens from line #25. We start by finding N which is the number of records in the table that match the pattern. Im using lodash‘s filter function to avoid the boilerplate of having to find the matching records myself. W is the accumulator in which I will sum the number of bits each attribute in the pattern take. attributes are the attributes in the pattern. Then for each attribute, we use bitsFor and add it up with the value of W. Finally, we return the value according to the formula.

Simply put, benefit is $N - 1$ times the total number of bits each attribute in the pattern would take.

Next, let’s write code to find $log^\ast(N)$

log.js
1
2
3
4
5
6
7
8
9
10
"use strict";
/**
* Calculates log*(N).
* @param N number of rows in the table
* @returns {number}
*/
function log(N){
return 2 * Math.log2( N+2 );
}
module.exports = log;

This is a fairly straightforward translation of the formula to code and needs no explanation.

We now have all the code we need to calculate overhead so let’s do that.

overhead.js
1
2
3
4
5
6
7
8
9
10
11
12
13
14
"use strict";
const _ = require("lodash");
const log = require("./log");
/**
* The overhead of using this pattern to represent this table
* @param T the table containing rows
* @param pattern the pattern to be applied
*/
function overhead(pattern, T){
let N = _.filter(T, pattern).length;
let D = Object.keys(T[0]).length; // number of attributes
return D + log(N);
}
module.exports = overhead;

Notice that I start by require-ing the log.js module we just saw. I am using filter to find the number of rows in the table that match the pattern. On the next line I find the number of attributes in a given record of the table. Since the algorithm assumes no null / empty values for any attribute, we can safely pick up the 0th record and see how many attributes it contains.

Now that we have benefit and overhead in place, let’s calculate saving

saving.js
1
2
3
4
5
6
7
8
9
10
11
12
13
"use strict";
const benefit = require("./benefit");
const overhead = require("./overhead");
/**
* The compression saving of a pattern P on a data table T,
* denoted by saving(P,T) is the amount of compression it can achieve.
* @param T the table containing rows
* @param pattern the pattern to be applied
*/
function saving(pattern, T){
return benefit(pattern, T) - overhead(pattern, T);
}
module.exports = saving;

Now let’s write code for pattern marshalling.

pattern_marshalling.js
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
"use strict";
const _ = require("lodash");
const saving = require("./saving");
const coverage = require("./coverage");

function patternMarshalling(S, T){
// chosen patterns
let patterns = []; // empty = don't care for every attribute
let remainingPatterns = _.cloneDeep(S);

while( remainingPatterns.length > 0 ){
// select the pattern with the top incremental compression saving
let bTop = Number.NEGATIVE_INFINITY;
let pTop;
let residualTable = residue(patterns, T);

remainingPatterns.forEach(pattern => {
let compression = saving(pattern, residualTable);
if( compression > bTop ) {
bTop = compression;
pTop = pattern;
}
});

if( bTop > 0 ){
patterns.push({
"pattern" : pTop,
"saving" : bTop,
"coverage" : coverage(pTop,T)
});
}

remainingPatterns = _.difference(remainingPatterns, _.filter(remainingPatterns, pTop));
}

return patterns;
}

function residue(patterns, T){
patterns.forEach(pattern => {
T = _.difference(T, _.filter(T, pattern.pattern));
});
return T;
}

module.exports = patternMarshalling;

Given a set of patterns $\mathcal{S}$ and a table $\mathcal{T}$, we start by making a copy of the patterns passed to us. The aim is to select a pattern which gives us non-negative compression. If we find such a pattern, we add it to our patterns list and remove the chosen pattern from the current list of patterns. We continue the next itreation on the residue table. We repeat this until we have no more patterns to consider.

The next piece of the puzzle is to write code for local expansion.

local_expansion.js
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
"use strict";
const _ = require("lodash");
const expand = require("./expand");

/**
* Local expansion pattern generation strategy that directly looks
* for patterns that could minimize the compression cost
* @param T an array of JSON objects
* @returns {Array} of patterns that best summarize the data
*/
function localExpansion(T){
let patterns = []; // final list of patterns to return

// while we still have rows
while( T.length > 0 ){
// expand from an empty pattern (Algorithm 4)
let pattern = expand(T, {});
// stop if we cannot achieve more compression saving
if( _.isEqual(pattern, {}) ){
break;
}
// found a new pattern
patterns.push( pattern );
// remove all the rows that match the pattern i.e.residual table for the pattern
T = _.difference(T, _.filter(T, pattern));
}

return patterns;
}

module.exports = localExpansion;

As mentioned in the definitions, local expansion grows a pattern, starting with an empty pattern. The expansion is done in expand function.

expand.js
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
const _ = require("lodash");
const saving = require("./saving");

/**
* Expands a pattern to improve compression saving. See algorithm 4.
* @param T
* @param pattern
* @returns {*}
*/
function expand(T, pattern){
// find the attributes not included in the pattern
let allAttributes = Object.keys(T[0]);
let matchedAttributes = Object.keys(pattern);
let attributes = _.difference(allAttributes, matchedAttributes);

let bestPattern;
let highestCompressionSaving = 0;

attributes.forEach( attribute => {
// find all the unique values for the current attribute
let values = _.map(T, attribute);
values = _.uniq(values);

values.forEach( value => {
// an expanded pattern created by appending the current
// attribute and its value to the existing pattern
let newPattern = _.cloneDeep(pattern);
newPattern[attribute] = value;

let compressionSaving = saving(newPattern, T);

if(compressionSaving > highestCompressionSaving){
highestCompressionSaving = compressionSaving;
bestPattern = newPattern;
}
});
});

if( saving(bestPattern, T) > saving(pattern, T) ){
return expand(T, bestPattern);
}else{
return pattern;
}
}

module.exports = expand;

The final piece of code we need to look at is to calculate the coverage i.e. how much of the data is described by a given pattern.

coverage.js
1
2
3
4
5
6
7
8
9
10
"use strict";

const _ = require("lodash");
function coverage(pattern,T){
let matchingRows = _.filter(T,pattern);
let coverage = matchingRows.length / T.length;
return coverage * 100; // % of coverage
}

module.exports = coverage;

Now that we have all the machinery in place, let’s write a simple test. We’ll take the patients example given in the paper and turn it into JSON objects. Then we’ll write a simple script to run our code using this data and check whether the results make sense.

Here’s the data:

patients.json
1
2
3
4
5
6
7
8
9
10
11
12
13
14
[
{"gender":"M","age":"adult","blood_pressure":"normal"},
{"gender":"M","age":"adult","blood_pressure":"low"},
{"gender":"M","age":"adult","blood_pressure":"normal"},
{"gender":"M","age":"adult","blood_pressure":"high"},
{"gender":"M","age":"adult","blood_pressure":"low"},

{"gender":"F","age":"child","blood_pressure":"low"},
{"gender":"M","age":"child","blood_pressure":"low"},
{"gender":"F","age":"child","blood_pressure":"low"},

{"gender":"M","age":"teen","blood_pressure":"high"},
{"gender":"F","age":"child","blood_pressure":"normal"}
]

Here’s the test:

test1.js
1
2
3
4
5
6
7
8
9
10
11
"use strict";

const tsum = require("../index");
const table = require("./data/patients.json");
const _ = require("lodash");

let patterns = tsum.localExpansion( table );
let sorted = tsum.patternMarshalling(patterns,table);

patterns = _.shuffle(patterns);
console.log( sorted );

Now let’s run this:

1
2
3
4
5
6
7
$ node test/test1.js
[ { pattern: { age: 'adult', gender: 'M' },
saving: 375.3852901558848,
coverage: 50 },
{ pattern: { age: 'child', blood_pressure: 'low' },
saving: 248.35614381022526,
coverage: 30 } ]

The output says that the most descriptive pattern is “adult male” which makes up 50% of the rows followed by “children with low blood pressure” which make up 30% of the rows. if we look at the sample data, out of the 10 rows, 5 are “adult male” and 3 are “children with low blood pressure”. So the output of the algorithm checks out.

Finito.

In the previous post we looked at how JVM generates assembly code and how we can take a look at it. In this post we will look at yet another optimization technique called loop unrolling.

What is loop unrolling?

To quote The Java HotSpot Performance Engine Architecture,

the Server VM features loop unrolling, a standard compiler optimization that enables faster loop execution. Loop unrolling increases the loop body size while simultaneously decreasing the number of iterations.

Okay so it is an optimization that makes our loops faster by increasing the body size, and reducing the iterations. This is much better explained by example.

A loop like this:

1
2
3
for(int i = 0; i < N; i++) {
S(i);
}

can be unrolled into this:

1
2
3
4
5
6
for(int i = 0; i < N; i += 4) {
S(i);
S(i+1);
S(i+2);
S(i+3);
}

So the size of the loop has increased because instead of calling method S just once per iteration, we call it 4x. The iterations have reduced because we now have a stride of 4 (i += 4). This is a space-time tradeoff because you gain speed by increasing the size of the program. If this for loop were to be a part of a hot method that got compiled to assembly, more code cache would be consumed because of unrolling.

Loop unrolling, however, wins at a different front — speed. By increasing the stride of the loop, we’re reducing the number of jumps that the CPU has to make. Jumps are costly and a reduction in jumps is a huge performance boost.

Loop unrolling in action

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintAssembly"})
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {
@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void testMethod() {
int[] nums = new int[10];

for (int i = 0; i < nums.length; i++) {
nums[i] = 0x42;
}
}
}

The code above is a JMH benchmark which will fork 1 JVM (@Fork), warm it for 10 iterations (@Warmup), then run our benchmark for 5 iterations (@Measurement). The generated assembly looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
0x00000001296ac845: cmp    %r10d,%ebp
0x00000001296ac848: jae 0x00000001296ac8b4
0x00000001296ac84a: movl $0x42,0x10(%rbx,%rbp,4)
0x00000001296ac852: inc %ebp
0x00000001296ac854: cmp %r11d,%ebp
0x00000001296ac857: jl 0x00000001296ac845

...

0x00000001296ac880: vmovdqu %xmm0,0x10(%rbx,%rbp,4)
0x00000001296ac886: add $0x4,%ebp
0x00000001296ac889: cmp %r8d,%ebp
0x00000001296ac88c: jl 0x00000001296ac880

So the first block of code is moving 0x42 to some memory location (line #3), incrementing the counter %ebp (line #4), and jumping back to the start of the loop for the next iteration (line #6). Then there is the second block which increments the counter by 4 (line #11) and uses vmovdqu (line #10).

In the second block, vmovdqu moves %xmm0 (a 128-bit register) to some memory location (line #10) i.e. we’re writing 4 32-bit numbers in one go. This is equivalent of writing nums[0 .. 3] in one iteration.

This means that our entire loop which writes 10 array elements will not run for 10 iterations at all. It will run for 4 — 2 iterations to write 8 elements in groups of 4 each (block 2), and 2 more to write the remaining 2 elements (block 1).

What we saw above was loop unrolling to enable vectorization. Since this happened without any effort on our part, it’s called auto-vectorization.

What is vectorization?

There are certain CPU instructions which are capable of operating on multiple data elements simultaneously. Such instructions are called SIMD instructions — Single Instruction Multiple Data or vectorized instructions. In the example above, all 4 elements were packed into a single register XMM0 and then moved to their memory location with vmovdqu. This results in faster processing by saving jumps and clock cycles.

1
2
3
4
+--------|--------|--------|--------+
| 0x42 | 0x42 | 0x42 | 0x42 | xmm0
+--------|--------|--------|--------+
↓ ↓ ↓ ↓

Auto-vectorization is when the compiler converts the scalar instructions (which operate on a single data element at a time) to vector instructions (which operate on multiple data elements at a time) without any effort on the part of the programmer.

Vectorizing the code enables superword level parallelism (SLP). SLP is a type of SIMD parallelism in which source and result operands are packed in a storage location.

An example of SLP

The example shows statement packing in which the operands have been packed into registers and the scalar addition and multiplication have been replaced by their vectorized counterparts.

How does loop unrolling enable SLP?

Consider this loop:

1
2
3
for (i=0; i<16; i++) {
localdiff = ref[i] - curr[i];
}

This is a loop containing isomorphic statements — statements that contain same operations in the same order. Such statements are easy to unroll. So the loop above can be transformed as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
for (i=0; i<16; i+=4) {
localdiff = ref[i+0] - curr[i+0];
diff += abs(localdiff);

localdiff = ref[i+1] - curr[i+1];
diff += abs(localdiff);

localdiff = ref[i+2] - curr[i+2];
diff += abs(localdiff);

localdiff = ref[i+3] - curr[i+3];
diff += abs(localdiff);
}

This can be further transformed as:

1
2
3
4
5
6
7
8
9
10
11
for (i=0; i<16; i+=4) {
localdiff0 = ref[i+0] - curr[i+0];
localdiff1 = ref[i+1] - curr[i+1];
localdiff2 = ref[i+2] - curr[i+2];
localdiff3 = ref[i+3] - curr[i+3];

diff += abs(localdiff0);
diff += abs(localdiff1);
diff += abs(localdiff2);
diff += abs(localdiff3);
}

Now it becomes easy to see how SLP can be used to speed up the execution of the loop. Loop unrolling sets the stage for SLP.

Which flags control loop unrolling and SLP?

There are a couple of flags - -XX:LoopUnrollLimit and -XX:+UseSuperWord. LoopUnrollLimit controls how many times your loop will be unrolled and UseSuperWord controls the transformation of scalar operations into vectorized operations.

Conclusion

Loop unrolling makes your code faster by doing more per iteration and increasing the stride. It’s a space-time trade off where you make your code larger so that the execution takes less time. Loop unrolling can be taken a step further to make scalar instructions into vector instructions. These instructions operate on packed data and achieve more per instruction than their scalar counterparts.

The conversion from scalar instructions to vector instructions is called vectorization. Since this happens transparently to the programmer, it is called auto-vectorization.

There are a few blog posts that I recommend you read.

A fundamental introduction to x86 assembly programming
This post provides a brief introduction to x86 assembly programming. Having studied assembly programming in university a long time ago, the post is quick refresher on the basic concepts of assembly programming like addresing modes, registers, etc. I keep it handy and refer to it when I have to read the assembly code generated by JVM.

JMH - Java Microbenchmark Harness
This post provides a brief introduction to JMH (Java Mirobenchmark Harness). What does JMH do?

JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targetting the JVM.

The post will show you how to write a simple benchmark, how to avoid common pitfalls, etc. It is a good read regardless of this series for two reasons: one, microbenchmarking is coming to Java 9 out-of-the-box and two, microbenchmarking is hard to get right.

Exploiting Superword Level Parallelism with Multimedia Instruction Sets
This is the research paper on superword level parallelism. I’ve borrowed examples and diagrams from this paper. The paper will introduce you to what SLP is, and also provide an algorithm on how to find parts of code which are amenable to SLP.