# Maximum Likelihood Estimation

So far we've had two ideas for building an estimator for a statistical functional $T$: one is to plug the empirical distribution $\widehat{\nu}$ into $T$, and the other, kernel density estimation, is closely related (we just smear the probability mass out around each observed data point before substituting into $T$). In this section, we'll learn another approach which has some compelling properties and is suitable for choosing from a parametric family of densities or mass functions.

Let's revisit the example from the first section where we looked for the Gaussian distribution which best fits a given set of measurements of the heights of 50 adults. This time, we'll include a goodness score for each choice of $\mu$ and $\sigma$, so we don't have to select a best fit subjectively.

The goodness function we'll use is called the **log likelihood** function, which we define to be the log of the product of the density function evaluated at each of the observed data points. This function rewards density functions which have larger values at the observed data points and penalizes functions which have very small values at some of the points. This is a rigorous way of capturing the idea that a given density function is consonant with the observed data.
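As a rough illustration of this scoring idea, the sketch below computes the Gaussian log likelihood of a small data set for two parameter choices. The function name `gaussian_log_likelihood` and the height values are hypothetical, chosen only for illustration; they are not the measurements used in this section.

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of the log N(mu, sigma^2) density values over the observations."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi)
            - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

# Hypothetical height measurements in inches (illustrative values only).
heights = [64.2, 66.1, 67.5, 68.0, 69.3, 70.4, 71.8, 65.7]

# A mean near the data scores higher than a mean far from the data.
good_fit = gaussian_log_likelihood(heights, mu=68.0, sigma=2.5)
poor_fit = gaussian_log_likelihood(heights, mu=60.0, sigma=2.5)
print(good_fit, poor_fit)
```

The score for `mu=68.0` comes out higher because the density takes larger values at the observed points when the mean sits near the bulk of the data.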

### Definitions

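Consider a parametric family $\{f_\theta : \theta \in \Theta\}$ of density or mass functions, where $\Theta \subset \mathbb{R}^d$ is the set of allowed parameter values.

Given $\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$, the **likelihood** $\mathcal{L}_{\mathbf{x}} : \Theta \to \mathbb{R}$ is defined by

$$\mathcal{L}_{\mathbf{x}}(\theta) = f_\theta(x_1)\, f_\theta(x_2) \cdots f_\theta(x_n).$$

The idea is that if $\mathbf{X}$ is a vector of independent observations drawn from a member of the family, then $\mathcal{L}_{\mathbf{X}}(\theta)$ is small or zero when $\theta$ is not in concert with the observed data.

Because the likelihood is defined to be a product of many factors, its values are often extremely small, and we may encounter numerical underflow when computing it. For this reason we usually work with the log likelihood instead, which turns the product into a sum:

$$\log \mathcal{L}_{\mathbf{x}}(\theta) = \sum_{i=1}^n \log f_\theta(x_i).$$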

Maximizing the likelihood is the same as maximizing the log likelihood because the natural logarithm is a monotonically increasing function.
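The numerical motivation for taking logs can be seen in a minimal sketch. It assumes a standard normal density and 2000 hypothetical observations that are all equal to 1.0, chosen purely to keep the arithmetic transparent.

```python
import math

def standard_normal_pdf(x):
    """Density of N(0, 1) at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# 2000 hypothetical observations, all equal to 1.0 for simplicity.
data = [1.0] * 2000

# Multiplying the density values directly: each factor is about 0.242,
# so the running product quickly drops below the smallest positive double.
likelihood = 1.0
for x in data:
    likelihood *= standard_normal_pdf(x)

# Summing the logs instead keeps the computation well within range.
log_likelihood = sum(math.log(standard_normal_pdf(x)) for x in data)

print(likelihood)       # underflows to 0.0
print(log_likelihood)   # a finite negative number
```

Once the product has underflowed to zero, every parameter value looks equally bad, whereas the log likelihood still distinguishes between them.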

**Example**

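Suppose $f_\theta$ is the density of a uniform random variable on $[0, \theta]$. We observe four samples drawn from this distribution: , and . Find , , and .

*Solution.* The likelihood at 5 is zero, since at least one of the observations exceeds 5 and the uniform density on $[0, 5]$ vanishes there. The likelihood at is very small, since . The likelihood at 7 is larger: $(1/7)^4 = 1/2401$.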

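As illustrated in this example, likelihood has the property of being zero or small at implausible values of $\theta$, and larger at more reasonable values. Thus we propose the **maximum likelihood estimator** $\widehat{\theta}_{\mathrm{MLE}}$, defined as the value of $\theta$ at which the likelihood of the observed data is largest:

$$\widehat{\theta}_{\mathrm{MLE}} = \operatorname{argmax}_{\theta} \mathcal{L}_{\mathbf{X}}(\theta).$$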
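To make the idea concrete, here is a small sketch that maximizes a uniform likelihood over a grid of candidate parameter values. It assumes the family is uniform on $[0, \theta]$; the samples are hypothetical stand-ins rather than the values from the example, and the helper name `uniform_likelihood` is an illustrative choice.

```python
def uniform_likelihood(theta, samples):
    """Likelihood under Uniform[0, theta]: product of density values at the samples."""
    if theta <= 0:
        return 0.0
    product = 1.0
    for x in samples:
        density = 1.0 / theta if 0.0 <= x <= theta else 0.0
        product *= density
    return product

# Hypothetical samples (not the ones from the example).
samples = [1.0, 2.5, 6.0, 4.5]

# Candidate parameter values 0.1, 0.2, ..., 20.0.
grid = [k / 10 for k in range(1, 201)]

# The grid point with the largest likelihood approximates the MLE.
theta_hat = max(grid, key=lambda t: uniform_likelihood(t, samples))
print(theta_hat)
```

Any candidate below the largest sample has likelihood zero, and the likelihood shrinks as the candidate grows beyond it, so the grid search settles on the largest sample.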

**Example**

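Suppose that $x \mapsto f(x; \mu, \sigma^2)$ is the normal density with mean $\mu$ and variance $\sigma^2$. Find the maximum likelihood estimator for $\mu$ and $\sigma^2$.

*Solution.* The maximum likelihood estimator is the maximizer of the logarithm of the likelihood function, which works out to

$$-\frac{n}{2}\log 2\pi - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2,$$

since $\log f(x; \mu, \sigma^2) = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log \sigma^2 - \frac{(x-\mu)^2}{2\sigma^2}$, for each $x$.

Setting the derivatives with respect to $\sigma^2$ and $\mu$ equal to zero, we find

$$-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (X_i - \mu)^2 = 0, \qquad \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu) = 0,$$

which implies (from solving the second equation) $\mu = \overline{X}$ as well as

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2.$$

So we may conclude that the maximum likelihood estimator agrees with the plug-in estimator for $\mu$ and $\sigma^2$.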
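The closed-form answer can be sanity-checked numerically. The sketch below, using hypothetical data and illustrative function names, computes the sample mean and the plug-in variance (dividing by $n$) and confirms that perturbing either parameter does not raise the Gaussian log likelihood.

```python
import math

def gaussian_log_likelihood(data, mu, var):
    """Log likelihood of N(mu, var) for the observations in data."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(var)
            - sum((x - mu) ** 2 for x in data) / (2 * var))

def gaussian_mle(data):
    """Sample mean and plug-in variance (divides by n, not n - 1)."""
    n = len(data)
    mu_hat = sum(data) / n
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n
    return mu_hat, var_hat

# Hypothetical observations.
data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.6]

mu_hat, var_hat = gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_hat, var_hat)

# Shift the mean and rescale the variance; none of these should score higher.
perturbed = [gaussian_log_likelihood(data, mu_hat + dm, var_hat * scale)
             for dm in (-0.5, -0.1, 0.1, 0.5)
             for scale in (0.5, 0.9, 1.1, 2.0)]

print(mu_hat, var_hat, best >= max(perturbed))
```

A check like this does not prove the result, but it is a cheap way to catch algebra mistakes in a derivation.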

**Exercise**

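Consider a Poisson random variable $X$ with parameter $\lambda$, so that $\mathbb{P}(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$ for $x = 0, 1, 2, \ldots$

Verify that

$$\frac{\partial}{\partial \lambda} \log \mathbb{P}(X = x) = \frac{x}{\lambda} - 1.$$

Show that it follows that the maximum likelihood estimator $\widehat{\lambda}$ is equal to the sample mean $\overline{X}$.

*Solution.* When we take the derivative with respect to $\lambda$ of the log likelihood and set it equal to zero, we get

$$\sum_{i=1}^n \left(\frac{X_i}{\lambda} - 1\right) = 0,$$

which gives us $\widehat{\lambda} = \frac{1}{n}\sum_{i=1}^n X_i = \overline{X}$.

Taking a second derivative gives $-\frac{1}{\lambda^2}\sum_{i=1}^n X_i$, which is negative whenever at least one observation is positive, so the critical point is indeed a maximum.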
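A numerical check of the Poisson case is sketched below: it maximizes the Poisson log likelihood over a grid and compares the result with the sample mean. The counts are hypothetical and the function name is an illustrative choice.

```python
import math

def poisson_log_likelihood(lam, counts):
    """Sum of log Poisson(lam) mass function values; lgamma(x + 1) equals log(x!)."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in counts)

# Hypothetical count data.
counts = [3, 0, 2, 4, 1, 2, 5, 3]
sample_mean = sum(counts) / len(counts)

# Candidate rates 0.01, 0.02, ..., 10.00.
grid = [k / 100 for k in range(1, 1001)]
lam_hat = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))

print(sample_mean, lam_hat)
```

Because the log likelihood is concave in the rate, the grid maximizer lands on the grid point closest to the sample mean.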

**Example**

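Suppose $Y_i = \beta x_i + \epsilon_i$ for $i = 1, \ldots, n$, where the $\epsilon_i$ are independent $\mathcal{N}(0, \sigma^2)$ random variables and $\sigma$ is known.

Show that the least squares estimator for $\beta$ is the same as the maximum likelihood estimator for $\beta$.

*Solution.* The log likelihood is

$$-\frac{n}{2}\log 2\pi - n \log \sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \beta x_i)^2.$$

The only term that depends on $\beta$ is the last one, and it is a negative multiple of the sum of squared residuals. Maximizing the log likelihood is therefore the same as minimizing the sum of squares.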
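The equivalence can be illustrated with a short sketch. It assumes a one-parameter linear model $y_i = \beta x_i + \epsilon_i$ with Gaussian noise of known standard deviation; the data values, the grid, and the function names are hypothetical choices made for illustration.

```python
import math

def sum_of_squares(beta, xs, ys):
    """Sum of squared residuals for the line y = beta * x."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys))

def log_likelihood(beta, xs, ys, sigma=1.0):
    """Gaussian log likelihood of the residuals, with sigma treated as known."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - n * math.log(sigma)
            - sum_of_squares(beta, xs, ys) / (2 * sigma ** 2))

# Hypothetical data lying roughly along y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Candidate slopes 0.000, 0.001, ..., 4.000.
grid = [k / 1000 for k in range(0, 4001)]

beta_mle = max(grid, key=lambda b: log_likelihood(b, xs, ys))
beta_ls = min(grid, key=lambda b: sum_of_squares(b, xs, ys))

# Closed-form least squares slope for a line through the origin.
closed_form = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(beta_mle, beta_ls, closed_form)
```

The two grid searches pick the same slope, since the log likelihood is a constant minus a positive multiple of the sum of squares.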

**Exercise**

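(a) Consider the family of distributions which are uniform on $[0, b]$, where $b \in (0, \infty)$. Show that the MLE for $b$ is the sample maximum.

(b) Show that the MLE for a Bernoulli distribution with parameter $p$ is the empirical success proportion.

*Solution.* (a) The likelihood associated with any value of $b$ smaller than the sample maximum is zero, since the density vanishes at the largest observation in that case. For $b$ at least as large as the sample maximum, the likelihood equals $(1/b)^n$, which is decreasing in $b$. Therefore the likelihood is maximized at the sample maximum.

(b) The derivative of the log likelihood function is

$$\frac{s}{p} - \frac{n - s}{1 - p},$$

where $s$ is the number of successes among the $n$ observations. Setting the derivative equal to zero and solving for $p$ gives $\widehat{p} = s/n$.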
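The uniform family in part (a) also lends itself to a quick simulation. The sketch below averages the sample maximum over many simulated data sets; the true endpoint, the sample size, the trial count, and the seed are all hypothetical choices. Since no observation can exceed the true endpoint, the average falls short of it.

```python
import random

def average_sample_maximum(b, n, trials=2000, seed=1):
    """Average of the sample maximum over repeated Uniform[0, b] samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.uniform(0.0, b) for _ in range(n))
    return total / trials

b = 10.0
avg_max = average_sample_maximum(b, n=5)

# Theory: the expected sample maximum is b * n / (n + 1), which is below b.
print(avg_max, b * 5 / 6)
```

The gap between the average and the true endpoint shrinks as the sample size grows, but it never disappears for a fixed sample size.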

### Properties of the Maximum Likelihood Estimator

MLE enjoys several nice properties: under certain regularity conditions, we have

**Consistency**: $\mathbb{E}[(\widehat{\theta}_{\mathrm{MLE}} - \theta)^2] \to 0$ as the number of samples goes to $\infty$. In other words, the average squared difference between the maximum likelihood estimator and the parameter it's estimating converges to zero.

**Asymptotic normality**: $(\widehat{\theta}_{\mathrm{MLE}} - \theta)/\sqrt{\operatorname{Var} \widehat{\theta}_{\mathrm{MLE}}}$ converges to $\mathcal{N}(0,1)$ as the number of samples goes to $\infty$. This means that we can calculate good confidence intervals for the maximum likelihood estimator, assuming we can accurately approximate its mean and variance.

**Asymptotic optimality**: the MSE of the MLE converges to 0 approximately as fast as the MSE of any other consistent estimator. Thus the MLE is not wasteful in its use of data to produce an estimate.

**Equivariance**: Suppose $\widehat{\theta}$ is the MLE of $\theta$ for $f(\theta)$. Then the MLE for $g(\theta)$ is $g(\widehat{\theta})$. This is a useful property: a transformation of the parameter of interest (say, shifting the mean of a normal distribution by a number, or taking the square of the standard deviation) is not an inconvenience, because we can simply apply the same transformation to the MLE.

**Example**

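Show that the plug-in variance estimator for a sequence of $n$ independent observations from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ converges to $\sigma^2$ as $n \to \infty$.

*Solution.* We've seen that the plug-in variance estimator is the maximum likelihood estimator for variance. Therefore, it converges to $\sigma^2$ by the consistency of the MLE.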
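Consistency of the plug-in variance estimator can be observed in simulation. The following sketch estimates its mean squared error at several sample sizes; the true standard deviation, the trial count, and the seed are hypothetical choices, and the function names are illustrative.

```python
import random

def variance_mle(sample):
    """Plug-in variance: average squared deviation from the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

def mse_of_variance_mle(n, true_sigma=2.0, trials=500, seed=0):
    """Monte Carlo estimate of E[(variance_mle - sigma^2)^2] for Gaussian samples of size n."""
    rng = random.Random(seed)
    true_var = true_sigma ** 2
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, true_sigma) for _ in range(n)]
        total += (variance_mle(sample) - true_var) ** 2
    return total / trials

# Estimated MSE at increasing sample sizes.
results = {n: mse_of_variance_mle(n) for n in (10, 100, 1000)}
for n, mse in results.items():
    print(n, mse)
```

The estimated MSE drops by roughly a factor of ten each time the sample size grows tenfold, in line with convergence to zero.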

**Exercise**

Show that it is not possible to estimate the mean of a distribution in a way that converges to the true mean at a rate asymptotically faster than $1/\sqrt{n}$, where $n$ is the number of observations.

*Solution.* For the Gaussian family, the sample mean is the maximum likelihood estimator, and it converges to the mean at a rate proportional to the inverse square root of the number of observations. By asymptotic optimality, there is not another estimator which converges with an asymptotic rate faster than that.
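The inverse square root rate can be checked empirically. In this sketch, which assumes standard Gaussian data and uses a hypothetical trial count and seed, quadrupling the sample size should roughly halve the root mean squared error of the sample mean.

```python
import math
import random

def rmse_of_sample_mean(n, sigma=1.0, trials=1000, seed=42):
    """Monte Carlo estimate of the root mean squared error of the sample mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        total += sample_mean ** 2   # the true mean is 0
    return math.sqrt(total / trials)

rmse_100 = rmse_of_sample_mean(100)
rmse_400 = rmse_of_sample_mean(400)
ratio = rmse_100 / rmse_400

# Theory predicts a ratio near sqrt(400 / 100) = 2.
print(rmse_100, rmse_400, ratio)
```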

### Drawbacks of maximum likelihood estimation

The maximum likelihood estimator is not a panacea. We've already seen that the maximum likelihood estimator can be biased (the sample maximum for the family of uniform distributions on $[0, b]$ never exceeds $b$ and so underestimates it on average). Several other issues can arise:

**Computational difficulties**. It might be difficult to work out where the maximum of the likelihood occurs, either analytically or numerically. This is a particular concern in high dimensions (that is, if we have many parameters) and if the likelihood function is not concave, since numerical optimizers may then stall at local maxima.

**Misspecification**. The MLE may be inaccurate if the distribution of the observations is not in the specified parametric family. For example, if we assume the underlying distribution is Gaussian when in fact its shape is not even close to that of a Gaussian, we may well get unreasonable results.

**Unbounded likelihood**. If the likelihood function is not bounded, then $\widehat{\theta}_{\mathrm{MLE}}$ is not even defined:

**Exercise**

Consider the family of distributions on

where

*Solution.* We identify the largest value in our data set and choose

One further disadvantage of the maximum likelihood estimator is that it doesn't provide a smooth mechanism to account for prior knowledge. For example, if we flip a coin twice and see heads both times, the maximum likelihood estimate of the heads probability is 100%, yet our (real-world) beliefs about the coin's heads probability would still be that it's about 50%. Only once we saw quite a few heads in a row would we begin to use that as evidence to move the needle on our strong prior belief that coins encountered in daily life are not heavily weighted to one side or the other.

*Bayesian* statistics provides an alternative framework which addresses this shortcoming of maximum likelihood estimation.