Suppose that an experiment consists of n = 5 independent Bernoulli trials, each having probability of success p. Let X be the total number of successes in the trials, so that $X \sim \text{Bin}(5, p)$. If the outcome is X = 3, the likelihood is

\[
L(p; x) = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x} = \frac{5!}{3!(5-3)!}\, p^3 (1-p)^{5-3} \propto p^3 (1-p)^2
\]

where the constant at the beginning is ignored. A graph of $L(p; x) = p^3(1-p)^2$ over the unit interval p ∈ (0, 1) shows a single peak.

Notice that this function reaches its maximum value at p = .6. Intuitively, if we observe 3 successes in 5 trials, a reasonable estimate of the long-run proportion of successes p would be 3/5 = .6.
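
A minimal sketch of that graph (assuming NumPy and Matplotlib are available; the choice of tools is mine, not the course's) evaluates the likelihood on a grid and confirms that the peak sits at p = .6:

    import numpy as np
    import matplotlib.pyplot as plt

    # Likelihood for X = 3 successes in n = 5 trials, constant factor dropped.
    def likelihood(p):
        return p**3 * (1 - p)**2

    grid = np.linspace(0, 1, 1001)        # fine grid over the unit interval
    values = likelihood(grid)

    print("maximizer on the grid:", grid[np.argmax(values)])   # 0.6

    plt.plot(grid, values)
    plt.xlabel("p")
    plt.ylabel("L(p; x)")
    plt.title("Likelihood of 3 successes in 5 Bernoulli trials")
    plt.show()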

This example suggests that it may be reasonable to estimate an unknown parameter θ by the value for which the likelihood function L(θ ; x) is largest. This approach is called maximum-likelihood (ML) estimation. We will denote the value of θ that maximizes the likelihood function by $\hat{\theta}$, read “theta hat.” $\hat{\theta}$ is called the maximum-likelihood estimate (MLE) of θ.

Finding MLE’s usually involves techniques of differential calculus. To maximize L(θ ; x) with respect to θ (a worked example follows the list below):

  • first calculate the derivative of L(θ ; x) with respect to θ,
  • set the derivative equal to zero, and
  • solve the resulting equation for θ.
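
For instance, applying these three steps to the likelihood from the opening example, $L(p; x) = p^3(1-p)^2$, gives

\[
\frac{d}{dp}\, p^3(1-p)^2 = 3p^2(1-p)^2 - 2p^3(1-p) = p^2(1-p)(3 - 5p),
\]

which equals zero at p = 0, p = 1, and p = 3/5. The likelihood vanishes at the two endpoints, so the maximum is attained at $\hat{p} = 3/5 = .6$, in agreement with the graph.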

These computations can often be simplified by maximizing the loglikelihood function,

\[
l(\theta; x) = \log L(\theta; x),
\]

where “log” means natural log (logarithm to the base e). Because the natural log is an increasing function, maximizing the loglikelihood is the same as maximizing the likelihood. The loglikelihood often has a much simpler form than the likelihood and is usually easier to differentiate.
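
For the opening example, a quick numerical check (again a sketch; SciPy is my choice of optimizer, not something the notes prescribe) shows that maximizing the loglikelihood returns the same $\hat{p}$ as maximizing the likelihood itself:

    import numpy as np
    from scipy.optimize import minimize_scalar

    x, n = 3, 5   # observed successes and number of trials

    def likelihood(p):
        return p**x * (1 - p)**(n - x)

    def log_likelihood(p):
        # Loglikelihood up to an additive constant.
        return x * np.log(p) + (n - x) * np.log(1 - p)

    # Maximize each objective by minimizing its negative over (0, 1).
    mle_lik = minimize_scalar(lambda p: -likelihood(p),
                              bounds=(0, 1), method="bounded").x
    mle_loglik = minimize_scalar(lambda p: -log_likelihood(p),
                                 bounds=(1e-9, 1 - 1e-9), method="bounded").x

    print(mle_lik, mle_loglik)   # both approximately 0.6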

In Stat 504 you will not be asked to derive MLE’s by yourself. In most of the probability models that we will use later in the course (logistic regression, loglinear models, etc.) no explicit formulas for MLE’s are available, and we will have to rely on computer packages to calculate the MLE’s for us. For the simple probability models we have seen thus far, however, explicit formulas for MLE’s are available and are given next.

ML for Bernoulli trials

If our experiment is a single Bernoulli trial and we observe X = 1 (success), then the likelihood function is L(p ; x) = p. This function reaches its maximum at $\hat{p} = 1$. If we observe X = 0 (failure), then the likelihood is L(p ; x) = 1 − p, which reaches its maximum at $\hat{p} = 0$. Of course, it is somewhat silly for us to try to make formal inferences about p on the basis of a single Bernoulli trial; usually multiple trials are available.

Suppose that X = (X1, X2, . . ., Xn) represents the outcomes of n independent Bernoulli trials, each with success probability p . The likelihood for p based on X is defined as the joint probability distribution of X1, X2, . . . , Xn. Since X1, X2, . . . , Xn are iid random variables, the joint distribution is

\[
L(p; x) = f(x; p) = \prod_{i=1}^n f(x_i; p) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i}
\]

Differentiating the log of L(p ; x) with respect to p and setting the derivative to zero shows that this function achieves a maximum at $\hat{p} = \sum_{i=1}^n x_i / n$. Since $\sum_{i=1}^n x_i$ is the total number of successes observed in the n trials, $\hat{p}$ is the observed proportion of successes in the n trials. We often call $\hat{p}$ the sample proportion to distinguish it from p, the “true” or “population” proportion. Note that some textbooks use π instead of p. For repeated Bernoulli trials, the MLE $\hat{p}$ is the sample proportion of successes.
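
A small simulation sketch (the true p, the number of trials, and the seed below are arbitrary illustrative choices) shows the MLE being computed as the sample proportion:

    import numpy as np

    rng = np.random.default_rng(seed=504)   # seed fixed only for reproducibility

    p_true = 0.3    # hypothetical "population" proportion
    n = 200         # number of Bernoulli trials

    # Simulate n independent Bernoulli(p_true) outcomes: 1 = success, 0 = failure.
    x = rng.binomial(n=1, p=p_true, size=n)

    p_hat = x.sum() / n      # MLE: observed proportion of successes
    print(p_hat)             # close to 0.3 for a sample of this size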

ML for Binomial

Suppose that X is an observation from a binomial distribution, X ∼ Bin(n, p ), where n is known and p is to be estimated. The likelihood function is

\[
L(p; x) = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x}
\]

which, except for the factor $\frac{n!}{x!(n-x)!}$, is identical to the likelihood from n independent Bernoulli trials with $x = \sum_{i=1}^n x_i$. But since the likelihood function is regarded as a function only of the parameter p, the factor $\frac{n!}{x!(n-x)!}$ is a fixed constant and does not affect the MLE. Thus the MLE is again $\hat{p} = x/n$, the sample proportion of successes.

You get the same value by maximizing the binomial loglikelihood function

\[
l(p; x) = k + x \log p + (n - x) \log(1 - p)
\]

where k is a constant that does not involve the parameter p. In the future we will omit the constant, because it's statistically irrelevant.
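
To see this explicitly, differentiate the loglikelihood with respect to p and set the derivative to zero:

\[
\frac{\partial l}{\partial p} = \frac{x}{p} - \frac{n - x}{1 - p} = 0
\quad\Longrightarrow\quad
x(1 - p) = (n - x)\,p
\quad\Longrightarrow\quad
\hat{p} = \frac{x}{n}.
\]

The second derivative, $-x/p^2 - (n - x)/(1 - p)^2$, is negative, so this stationary point is indeed a maximum.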

The fact that the MLE based on n independent Bernoulli random variables and the MLE based on a single binomial random variable are the same is not surprising, since the binomial is the result of n independent Bernoulli trials anyway. In general, whenever we have repeated, independent Bernoulli trials with the same probability of success p for each trial, the MLE will always be the sample proportion of successes. This is true regardless of whether we know the outcomes of the individual trials X1, X2, . . . , Xn, or just the total number of successes for all trials $X = \sum_{i=1}^n X_i$.

Suppose now that we have a sample of iid binomial random variables. For example, suppose that X1, X2, . . . , X10 are an iid sample from a binomial distribution with n = 5 and p unknown. Since each Xi is actually the total number of successes in 5 independent Bernoulli trials, and since the Xi’s are independent of one another, their sum $X = \sum_{i=1}^{10} X_i$ is the total number of successes in 50 independent Bernoulli trials. Thus $X \sim \text{Bin}(50, p)$ and the MLE is $\hat{p} = x/50$, the observed proportion of successes across all 50 trials. Whenever we have independent binomial random variables with a common p, we can always add them together to get a single binomial random variable.
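
A sketch of this pooling (the true p and the seed are illustrative choices of mine, not part of the example) computes $\hat{p}$ from the pooled total:

    import numpy as np

    rng = np.random.default_rng(seed=28)    # seed fixed only for reproducibility
    p_true = 0.4                            # hypothetical common success probability

    xs = rng.binomial(n=5, p=p_true, size=10)   # X1, ..., X10, each Bin(5, p)

    total_successes = xs.sum()              # X = X1 + ... + X10  ~  Bin(50, p)
    total_trials = 5 * 10

    p_hat = total_successes / total_trials  # MLE based on all 50 Bernoulli trials
    print(xs, p_hat)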

Adding the binomial random variables together produces no loss of information about p if the model is true. But collapsing the data in this way may limit our ability to diagnose model failure, i.e. to check whether the binomial model is really appropriate.

ML for Poisson

Suppose that X = (X1, X2, . . . , Xn) are iid observations from a Poisson distribution with unknown parameter λ. The likelihood function is:

\[
L(\lambda; x) = \prod_{i=1}^n f(x_i; \lambda) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{\sum_{i=1}^n x_i}\, e^{-n\lambda}}{x_1!\, x_2! \cdots x_n!}
\]

By differentiating the log of this function with respect to λ, that is by differentiating the Poisson loglikelihood function

\[
l(\lambda; x) = \sum_{i=1}^n x_i \log \lambda - n\lambda,
\]

ignoring the constant terms that do not depend on λ, one can show that the maximum is achieved at $\hat{\lambda} = \sum_{i=1}^n x_i / n$. Thus, for a Poisson sample, the MLE for λ is just the sample mean.
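
Explicitly, setting the derivative of the loglikelihood to zero gives

\[
\frac{\partial l}{\partial \lambda} = \frac{\sum_{i=1}^n x_i}{\lambda} - n = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}.
\]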

Next: Likelihood-based confidence intervals and tests.

from: https://onlinecourses.science.psu.edu/stat504/node/28
