So far, we have been doing point estimation using OLS. Another method of estimation, one that starts from an explicit probability model for the data, is maximum likelihood estimation (MLE). In this post we will look at likelihood, the likelihood function, maximum likelihood, and how these tie together in estimating the parameters of our bivariate linear regression model.

## Probability

When we think of probability, we are thinking about the chance of something happening in the future. For example, if a fair coin is tossed, what is the chance of getting a head? It’s $\frac{1}{2}$.

But how did we arrive at the conclusion that the probability of getting a head is $\frac{1}{2}$?

If the coin were tossed 1000 times, about half of the time the outcome would be a head and about half of the time it would be a tail. The more frequently an event happens, the more probable it is. This is the frequentist definition of probability. Under this definition, probability is the long-run ratio of the number of times we make the desired observation to the total number of observations.
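The frequentist idea can be illustrated with a small simulation. The sketch below is not part of the original discussion: it assumes a fair coin, uses a hypothetical helper `head_frequency`, and fixes the random seed so the run is repeatable.

```python
import random


def head_frequency(num_tosses: int, seed: int = 0) -> float:
    """Return the fraction of simulated fair-coin tosses that land heads."""
    rng = random.Random(seed)  # seeded generator so results are repeatable
    # Assumption: the coin is fair, so each toss is a head with probability 0.5.
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses


# The observed ratio should settle near 0.5 as the number of tosses grows.
for n in (10, 1000, 100000):
    print(n, head_frequency(n))
```

With only a few tosses the ratio can be far from one half; it is the long-run ratio that the frequentist definition refers to.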

In the Bayesian approach, probability is a measure of our certainty about something. In a coin toss, we assign a 50% chance to a head and a 50% chance to a tail because there are only two possible outcomes and we have no reason to favor either one.

This subtle difference in the definition of probability leads to vastly different statistical methods.

Under frequentism, it is all about the frequency of observations. The event that occurs more frequently is the more probable event. The parameters of the distribution that generates the data are treated as fixed but unknown constants, so probability statements are made about the data and not about the parameters.

Under Bayesianism, in contrast, the unknown parameters are themselves described by a probability distribution that expresses our uncertainty about them. Calculating the probabilities involves starting from a prior distribution over the parameters and updating it with the observed data.

## Likelihood

Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome^{[1]}.

For example, if a coin were tossed 10 times and a person guessed the outcome correctly 8 of those times, what is the probability that the person would guess equally well the next time a coin is tossed 10 times? This is the likelihood. Unlike probability, likelihood is about events that have already occurred.
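One way to make the guessing example concrete is to treat the 10 guesses as independent, each correct with the same unknown probability `p`. This binomial model is an assumption added here for illustration; the helper name `binomial_likelihood` is hypothetical. The sketch evaluates how likely "8 correct out of 10" is for different values of `p` and picks the value on a grid that makes the observed outcome most likely.

```python
from math import comb


def binomial_likelihood(p: float, successes: int = 8, trials: int = 10) -> float:
    """Likelihood of p given `successes` correct guesses in `trials` attempts.

    Assumption: guesses are independent and each is correct with probability p.
    """
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)


# Evaluate the likelihood on a grid of candidate values 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=binomial_likelihood)

print(binomial_likelihood(0.5))  # likelihood if the person is guessing at random
print(best_p)                    # grid value with the highest likelihood: 0.8
```

The data stay fixed (8 out of 10) while `p` varies, which is what distinguishes a likelihood from an ordinary probability calculation.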

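The likelihood of observing the data is represented by the **likelihood function**. The likelihood function is the joint probability density function of the observed data, viewed as a function of the unknown parameters. Suppose that we have $n$ independent observations $y_1, y_2, \ldots, y_n$, each drawn from a distribution with density $f(y \mid \theta)$, where $\theta$ denotes the unknown parameters. The likelihood function is then

$$L(\theta) = \prod_{i=1}^{n} f(y_i \mid \theta)$$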

## Max Likelihood Estimation

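If we were to maximize the likelihood function, we’d end up with the highest probability of observing the data that we have observed, i.e. the maximum likelihood. The data that we have observed follows some distribution we’d like to estimate. This would require us to estimate the parameters $\theta$ of that distribution such that they maximize the likelihood function $L(\theta)$:

$$\hat{\theta} = \underset{\theta}{\arg\max}\; L(\theta)$$

The procedure followed in maximizing the likelihood function to compute $\hat{\theta}$ is called maximum likelihood estimation (MLE).

The mechanics of finding $\hat{\theta}$ amount to an optimization problem: differentiate the likelihood function partially with respect to each parameter, set the derivatives to zero, and solve the resulting equations.

Calculating the partial derivatives like this becomes unwieldy very quickly since the likelihood function is a product of $n$ terms, one per observation. It is easier to work with the **log-likelihood** since it will convert the product term to a summation term. It is defined as

$$\ln L(\theta) = \sum_{i=1}^{n} \ln f(y_i \mid \theta)$$

where $f(y_i \mid \theta)$ is the density of the $i$-th observation. Because the logarithm is a monotonically increasing function, the value of $\theta$ that maximizes the log-likelihood also maximizes the likelihood.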

This provides us with enough background to apply MLE to find our regression parameters.
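Before moving to regression, a small numerical check may help. The sketch below assumes, purely for illustration, that the data are independent draws from a normal distribution with unknown mean and variance; the sample itself is simulated and the function name `normal_log_likelihood` is hypothetical. For this model the maximizers of the log-likelihood are known in closed form (the sample mean and the average squared deviation from it), and the code confirms that moving away from them lowers the log-likelihood.

```python
import math
import random


def normal_log_likelihood(data, mu, sigma2):
    """Log-likelihood of i.i.d. normal data with mean mu and variance sigma2."""
    n = len(data)
    squared_dev = sum((y - mu) ** 2 for y in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_dev / (2 * sigma2)


# Hypothetical sample: 500 draws with mean 5 and standard deviation 2.
rng = random.Random(42)
data = [rng.gauss(5.0, 2.0) for _ in range(500)]

# Closed-form ML estimates for the normal model.
mu_hat = sum(data) / len(data)
sigma2_hat = sum((y - mu_hat) ** 2 for y in data) / len(data)  # divides by n, not n - 1

best = normal_log_likelihood(data, mu_hat, sigma2_hat)

# Log-likelihood at a few nearby parameter values, for comparison.
perturbed = [
    normal_log_likelihood(data, mu_hat + d_mu, sigma2_hat + d_s2)
    for d_mu, d_s2 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.3), (0.0, -0.3)]
]

print(mu_hat, sigma2_hat)
print(all(value < best for value in perturbed))
```

Note that the sum inside `normal_log_likelihood` is the summation that replaces the product once logarithms are taken.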

## MLE for Regression

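In regression, we have our data $(X_i, Y_i)$, $i = 1, \ldots, n$, which we assume follows the bivariate linear model with normally distributed errors:

$$Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad u_i \sim N(0, \sigma^2)$$

Each $Y_i$ is therefore normally distributed with mean $\beta_0 + \beta_1 X_i$ and variance $\sigma^2$. If the errors are independent, the log-likelihood function is

$$\ln L(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$

In the equations above, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\sigma^2$ is the error variance. All three are unknown parameters that we need to estimate.

To find the ML estimators, we need to differentiate the log-likelihood function partially with respect to $\beta_0$, $\beta_1$, and $\sigma^2$:

$$\frac{\partial \ln L}{\partial \beta_0} = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_i\right)$$

$$\frac{\partial \ln L}{\partial \beta_1} = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_i\right)X_i$$

$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$

Setting the above equations to zero and letting $\tilde{\beta}_0$, $\tilde{\beta}_1$, and $\tilde{\sigma}^2$ denote the ML estimators, we get:

$$\frac{1}{\tilde{\sigma}^2}\sum_{i=1}^{n}\left(Y_i - \tilde{\beta}_0 - \tilde{\beta}_1 X_i\right) = 0$$

$$\frac{1}{\tilde{\sigma}^2}\sum_{i=1}^{n}\left(Y_i - \tilde{\beta}_0 - \tilde{\beta}_1 X_i\right)X_i = 0$$

$$-\frac{n}{2\tilde{\sigma}^2} + \frac{1}{2\tilde{\sigma}^4}\sum_{i=1}^{n}\left(Y_i - \tilde{\beta}_0 - \tilde{\beta}_1 X_i\right)^2 = 0$$

Simplifying the first two equations above, we get:

$$\sum_{i=1}^{n} Y_i = n\tilde{\beta}_0 + \tilde{\beta}_1\sum_{i=1}^{n} X_i$$

$$\sum_{i=1}^{n} X_i Y_i = \tilde{\beta}_0\sum_{i=1}^{n} X_i + \tilde{\beta}_1\sum_{i=1}^{n} X_i^2$$

The equations above are the normal equations for OLS estimators. What this means is that the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are the same as the ML estimators $\tilde{\beta}_0$ and $\tilde{\beta}_1$ when the errors are assumed to be normally distributed.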
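The link between the two methods can be checked numerically. The sketch below is an illustration under stated assumptions: the data are simulated from a straight line with normal errors (the values 1.5, 0.8, and 1.0 are made up), `b0` and `b1` stand for the intercept and slope, and `score` is a hypothetical helper that returns the three partial derivatives of the normal log-likelihood. It computes the OLS estimates in closed form and then evaluates those derivatives at the OLS estimates, using the average squared residual as the variance.

```python
import random

# Hypothetical data: a line with intercept 1.5 and slope 0.8 plus normal noise.
rng = random.Random(0)
n = 200
x = [rng.uniform(0, 10) for _ in range(n)]
y = [1.5 + 0.8 * xi + rng.gauss(0, 1.0) for xi in x]

# OLS estimates from the usual closed-form expressions.
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# Variance estimate that divides the residual sum of squares by n.
residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
sigma2 = sum(r * r for r in residuals) / n


def score(b0, b1, sigma2):
    """Partial derivatives of the normal log-likelihood w.r.t. b0, b1, sigma2."""
    res = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    d_b0 = sum(res) / sigma2
    d_b1 = sum(r * xi for r, xi in zip(res, x)) / sigma2
    d_sigma2 = -n / (2 * sigma2) + sum(r * r for r in res) / (2 * sigma2**2)
    return d_b0, d_b1, d_sigma2


grad = score(b0, b1, sigma2)
print(b0, b1, sigma2)
print(grad)  # each component should be zero up to floating-point error
```

If the derivatives vanish at the OLS estimates, those estimates satisfy the first-order conditions of the likelihood problem, which is the numerical counterpart of the normal equations matching.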

## Summary

In summary, MLE is an alternative to OLS. The method, however, requires that we make an explicit assumption about the probability distribution of the data. Under the normality assumption, the estimators for the intercept and slope of the regression line are the same in both MLE and OLS.