So far, what we’ve been doing is point estimation using OLS. Another method of estimation, with more mathematical rigor, is maximum likelihood estimation (MLE). In this post we will look at likelihood, the likelihood function, maximum likelihood, and how all this ties together in estimating the parameters of our bivariate linear regression model.
When we think of probability, we are thinking about the chance of something happening in the future. For example, if a coin is tossed, what is the chance of getting a head? It’s 1/2.
But how did we arrive at the conclusion that the probability of getting a head is 1/2?
If the coin were tossed 1,000 times, about half the time the outcome would be a head and half the time it would be a tail. The more frequently an event happens, the more probable it is. This is the frequentist definition of probability. Under this definition, probability is the ratio of the number of times we make the desired observation to the total number of observations, taken over a long run of repetitions.
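The frequentist ratio described above can be sketched with a short simulation. This is only an illustration: the function name `head_frequency`, the fixed seed, and the assumption of a fair coin are choices made here, not part of the original discussion.

```python
import random

def head_frequency(n_tosses, seed=0):
    """Return the fraction of heads in n_tosses simulated tosses of a fair coin."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The ratio of heads to total tosses settles near 1/2 as the number of tosses grows.
for n in (10, 1_000, 100_000):
    print(n, head_frequency(n))
```

With only 10 tosses the ratio can be far from 1/2; with many tosses it should sit close to it, which is the long-run behavior the frequentist definition relies on.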
In the Bayesian approach, probability is a measure of our certainty about something. In a coin toss, we assign a 50% chance to getting a head and a 50% chance to getting a tail because there are only two possible outcomes and we have no reason to favor one over the other.
This subtle difference in the definition of probability leads to vastly different statistical methods.
Under frequentism, it is all about the frequency of observations. The event that occurs more frequently is the more probable event. The parameters of the underlying distribution are treated as fixed but unknown quantities, and they are not given a probability distribution of their own.
Under Bayesianism, in contrast, the unknown parameters are themselves described by a probability distribution that expresses our uncertainty about them. Calculating the probabilities involves updating this distribution using the observed data.
Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome.
For example, if a coin were tossed 10 times and out of those 10 times a person guessed the outcome correctly 8 times then what’s the probability that the person would guess equally correctly the next time a coin is tossed 10 times? This is the likelihood. Unlike probability, likelihood is about past events.
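One common way to make this example concrete is to treat the person's chance of guessing a single toss correctly as an unknown value and ask how probable the observed record of 8 correct guesses out of 10 would be for different candidate values. The sketch below does that with the binomial formula. The symbol `p`, the function name `binomial_likelihood`, and the assumption that the guesses are independent are introduced here for illustration.

```python
import math

def binomial_likelihood(p, successes, trials):
    """Probability of `successes` correct guesses in `trials` independent tries,
    if each guess is correct with probability p."""
    return math.comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)

# Likelihood of the observed record (8 correct out of 10) under two candidate values of p.
print(binomial_likelihood(0.5, 8, 10))  # pure guessing: about 0.044
print(binomial_likelihood(0.8, 8, 10))  # a skilled guesser: about 0.302
```

The data are held fixed and only `p` changes, so the two numbers compare how well each candidate value explains what has already been observed.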
The likelihood of observing the data is represented by the likelihood function. The likelihood function is the joint probability density of the observed data, viewed as a function of the unknown parameters. Suppose that we have
If we were to maximize the likelihood function, we’d end up with the highest probability of observing the data that we have observed, i.e., the maximum likelihood. The data that we have observed follow some distribution we’d like to estimate. This would require us to estimate the parameters
The procedure followed in maximizing the likelihood function to compute
The mechanics of finding
Calculating the partial derivatives like this becomes unwieldy very quickly since the likelihood function
This provides us with enough background to apply MLE to find our regression parameters.
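As a quick check of the mechanics before turning to regression, here is a minimal sketch that maximizes a log-likelihood numerically for a simple case: 8 successes in 10 independent trials. It uses a plain grid search rather than derivatives. The names `log_likelihood` and `grid_mle`, the symbol `p`, and the grid resolution are assumptions made for this illustration.

```python
import math

def log_likelihood(p, successes, trials):
    """Log-likelihood of success probability p, dropping the constant
    binomial coefficient, which does not depend on p."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)

def grid_mle(successes, trials, steps=999):
    """Return the grid value of p in (0, 1) with the highest log-likelihood."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]  # 0.001, 0.002, ..., 0.999
    return max(grid, key=lambda p: log_likelihood(p, successes, trials))

p_hat = grid_mle(8, 10)
print(p_hat)  # lands on the sample proportion 8/10
```

Working with the logarithm turns a product of probabilities into a sum, and because the logarithm is increasing, the value of `p` that maximizes the log-likelihood also maximizes the likelihood itself.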
In regression, we have our data
In the equations above,
To find the ML estimators, we need to differentiate the log-likelihood function partially with respect to
Setting the above equations to zero and letting
Simplifying the first two equations above, we get:
The equations above are the normal equations for OLS estimators. What this means is that the OLS estimators
In summary, MLE is an alternative to OLS. The method, however, requires that we make an explicit assumption about the probability distribution of the data. Under the normality assumption, the estimators for
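The link between the two methods can be checked numerically. The sketch below computes the OLS intercept and slope in closed form, then verifies that the partial derivatives of the normal log-likelihood with respect to those two coefficients are zero at the OLS values. The data are made up for illustration, and the names `intercept`, `slope`, `sigma2`, `ols`, `log_likelihood`, and `score` are assumptions introduced here. The model assumed is a bivariate linear regression with independent, normally distributed errors of constant variance.

```python
import math

def ols(x, y):
    """Closed-form OLS intercept and slope for a bivariate regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def log_likelihood(intercept, slope, sigma2, x, y):
    """Normal log-likelihood of the regression parameters given the data."""
    n = len(x)
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

def score(intercept, slope, sigma2, x, y):
    """Partial derivatives of the log-likelihood w.r.t. intercept and slope."""
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    d_intercept = sum(resid) / sigma2
    d_slope = sum(r * xi for r, xi in zip(resid, x)) / sigma2
    return d_intercept, d_slope

# Made-up illustrative data, not taken from the post.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = ols(x, y)
# ML estimate of the error variance: residual sum of squares divided by n.
sigma2 = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)) / len(x)

print(intercept, slope)
print(score(intercept, slope, sigma2, x, y))  # both entries are zero up to rounding error

# Moving either coefficient away from its OLS value lowers the log-likelihood.
best = log_likelihood(intercept, slope, sigma2, x, y)
print(best > log_likelihood(intercept + 0.1, slope, sigma2, x, y))
print(best > log_likelihood(intercept, slope - 0.1, sigma2, x, y))
```

One difference is worth noting: the ML estimate of the error variance divides the residual sum of squares by n, whereas the usual unbiased estimate used alongside OLS divides by n - 2 in the bivariate case.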