## Thursday, November 1, 2012

### Maximum Likelihood and Information

Maximum likelihood methods can seem complex and daunting, and many of their details certainly are.  However, the general idea behind maximum likelihood is intuitively appealing, and an understanding of that idea is sufficient for the many people who use maximum likelihood procedures without knowing the formulas behind them.

Maximum likelihood methods use the theoretical probability distribution of outcomes to solve for the parameter estimates that maximize the probability of observing the outcome that was actually observed.  Let's see this in action.

Imagine a test with 8 possible outcomes for a person (100, 200, 300, 400, 500, 600, 700, 800).  Each outcome has a conditional probability of occurring that depends on a characteristic of the person (theta).  We can observe the person's total test score but we cannot observe theta.

The probability of each outcome occurring can be read from the following table.
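Table 1:

| Theta \ Score | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | Total |
|---|---|---|---|---|---|---|---|---|---|
| -4 | 0.60 | 0.25 | 0.10 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1 |
| -3 | 0.30 | 0.50 | 0.15 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1 |
| -2 | 0.20 | 0.30 | 0.40 | 0.05 | 0.05 | 0.00 | 0.00 | 0.00 | 1 |
| -1 | 0.10 | 0.20 | 0.30 | 0.20 | 0.10 | 0.05 | 0.05 | 0.00 | 1 |
| 0 | 0.05 | 0.10 | 0.15 | 0.20 | 0.20 | 0.15 | 0.10 | 0.05 | 1 |
| 1 | 0.00 | 0.05 | 0.10 | 0.25 | 0.30 | 0.15 | 0.10 | 0.05 | 1 |
| 2 | 0.00 | 0.00 | 0.05 | 0.15 | 0.20 | 0.30 | 0.15 | 0.15 | 1 |
| 3 | 0.00 | 0.00 | 0.05 | 0.05 | 0.15 | 0.20 | 0.30 | 0.25 | 1 |
| 4 | 0.00 | 0.00 | 0.00 | 0.05 | 0.10 | 0.25 | 0.30 | 0.30 | 1 |
| Total | 1.25 | 1.4 | 1.3 | 1.05 | 1.1 | 1.1 | 1 | 0.8 | |
| Probability | 0.14 | 0.16 | 0.14 | 0.12 | 0.12 | 0.12 | 0.11 | 0.09 | |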

We read this conditional probability table horizontally.  That is, P(T=100|theta=-4) = 60% and P(T=500|theta=4) = 10%.  Horizontally the probabilities must sum to 1, but vertically they need not.  We can interpret the vertical sums as a measure of the relative likelihood of observing each score if the thetas are distributed uniformly, that is, P(theta=THETA) = 1/9 for each THETA in {-4,-3,...,3,4}.  What I mean by this notation is that a random draw of theta equals any particular value with probability 1/9.

Thus, given that ability is drawn uniformly, the bottom row of the table (each column total divided by 9) is the probability of observing each score.
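To make the Total and Probability rows concrete, here is a minimal Python sketch that recomputes them from the body of Table 1.  The names `TABLE1` and `marginal_score_probs` are my own, and the uniform 1/9 prior is the assumption stated in the text.

```python
# Table 1: rows are theta = -4..4, columns are scores 100..800.
THETAS = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
SCORES = [100, 200, 300, 400, 500, 600, 700, 800]
TABLE1 = [
    [0.60, 0.25, 0.10, 0.05, 0.00, 0.00, 0.00, 0.00],
    [0.30, 0.50, 0.15, 0.05, 0.00, 0.00, 0.00, 0.00],
    [0.20, 0.30, 0.40, 0.05, 0.05, 0.00, 0.00, 0.00],
    [0.10, 0.20, 0.30, 0.20, 0.10, 0.05, 0.05, 0.00],
    [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05],
    [0.00, 0.05, 0.10, 0.25, 0.30, 0.15, 0.10, 0.05],
    [0.00, 0.00, 0.05, 0.15, 0.20, 0.30, 0.15, 0.15],
    [0.00, 0.00, 0.05, 0.05, 0.15, 0.20, 0.30, 0.25],
    [0.00, 0.00, 0.00, 0.05, 0.10, 0.25, 0.30, 0.30],
]


def marginal_score_probs(table, prior):
    """P(T=t) = sum over theta of P(T=t | theta) * P(theta)."""
    n_scores = len(table[0])
    return [sum(row[j] * p for row, p in zip(table, prior)) for j in range(n_scores)]


# Assumption from the text: every theta is equally likely.
uniform_prior = [1 / len(THETAS)] * len(THETAS)

row_sums = [sum(row) for row in TABLE1]  # each should be 1
column_totals = [sum(row[j] for row in TABLE1) for j in range(len(SCORES))]
score_probs = marginal_score_probs(TABLE1, uniform_prior)

for score, total, prob in zip(SCORES, column_totals, score_probs):
    print(f"T={score}: column total {total:.2f}, P(T) {prob:.2f}")
```

With a non-uniform prior the column totals would no longer be proportional to P(T), which is why the function takes the prior as an explicit argument.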

So what does this have to do with maximum likelihood?  Imagine that we know the information from Table 1 and we see a particular outcome T.  Can we calculate the probability that the person has a particular theta value?  Yes!

Imagine that T=100.  From the table we should be able to see that the most likely theta value is -4.  But what is the exact probability?  It is the probability of the outcome occurring if theta is -4 divided by the sum, across all theta values, of the probability of the outcome occurring.  This is Bayes' theorem; because every theta has the same prior probability of 1/9, the prior cancels out: P(theta=-4|T=100) = P(T=100|theta=-4) / sum(across all THETA of P(T=100|theta=THETA)).

Thus:

P(theta=-4|T=100)= .6/1.25 = 48%

In other words, given an observed score of 100, the probability that the person has theta=-4 is 48%.
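The same calculation can be sketched in a few lines of Python.  The helper name `posterior` is hypothetical, and the likelihoods are simply the T=100 column of Table 1.  The prior is kept as an explicit argument so it is visible where the uniform 1/9 cancels.

```python
# P(T=100 | theta) for theta = -4..4, copied from the T=100 column of Table 1.
THETAS = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
LIK_T100 = [0.60, 0.30, 0.20, 0.10, 0.05, 0.00, 0.00, 0.00, 0.00]


def posterior(likelihoods, prior):
    """Bayes' theorem: P(theta | T) = P(T | theta) P(theta) / sum_k P(T | theta_k) P(theta_k)."""
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    evidence = sum(joint)  # P(T), the denominator
    return [j / evidence for j in joint]


uniform_prior = [1 / 9] * 9  # assumption from the text
post_t100 = dict(zip(THETAS, posterior(LIK_T100, uniform_prior)))
print(f"P(theta=-4 | T=100) = {post_t100[-4]:.2f}")
```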

We can construct a new table that instead gives the conditional probability of each theta value given the observed score.

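Table 2:

| Theta \ Score | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | Total |
|---|---|---|---|---|---|---|---|---|---|
| -4 | 0.48 | 0.18 | 0.08 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.78 |
| -3 | 0.24 | 0.36 | 0.12 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.76 |
| -2 | 0.16 | 0.21 | 0.31 | 0.05 | 0.05 | 0.00 | 0.00 | 0.00 | 0.78 |
| -1 | 0.08 | 0.14 | 0.23 | 0.19 | 0.09 | 0.05 | 0.05 | 0.00 | 0.83 |
| 0 | 0.04 | 0.07 | 0.12 | 0.19 | 0.18 | 0.14 | 0.10 | 0.06 | 0.90 |
| 1 | 0.00 | 0.04 | 0.08 | 0.24 | 0.27 | 0.14 | 0.10 | 0.06 | 0.92 |
| 2 | 0.00 | 0.00 | 0.04 | 0.14 | 0.18 | 0.27 | 0.15 | 0.19 | 0.97 |
| 3 | 0.00 | 0.00 | 0.04 | 0.05 | 0.14 | 0.18 | 0.30 | 0.31 | 1.02 |
| 4 | 0.00 | 0.00 | 0.00 | 0.05 | 0.09 | 0.23 | 0.30 | 0.38 | 1.04 |
| Total | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |

We read Table 2 vertically: each column now sums to 1, while the rows need not.  We can see that Table 2 is somewhat adjusted from Table 1, but generally not hugely.  This is not a rule.  If there were many more categories of theta then it is likely the adjustment would be more dramatic.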
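As a rough sketch of how Table 2 follows from Table 1, the Python below divides every entry by its column total, which is what Bayes' theorem reduces to under the uniform prior assumed in the text.  The function name `normalize_columns` is my own.

```python
# Table 1: rows are theta = -4..4, columns are scores 100..800.
THETAS = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
SCORES = [100, 200, 300, 400, 500, 600, 700, 800]
TABLE1 = [
    [0.60, 0.25, 0.10, 0.05, 0.00, 0.00, 0.00, 0.00],
    [0.30, 0.50, 0.15, 0.05, 0.00, 0.00, 0.00, 0.00],
    [0.20, 0.30, 0.40, 0.05, 0.05, 0.00, 0.00, 0.00],
    [0.10, 0.20, 0.30, 0.20, 0.10, 0.05, 0.05, 0.00],
    [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05],
    [0.00, 0.05, 0.10, 0.25, 0.30, 0.15, 0.10, 0.05],
    [0.00, 0.00, 0.05, 0.15, 0.20, 0.30, 0.15, 0.15],
    [0.00, 0.00, 0.05, 0.05, 0.15, 0.20, 0.30, 0.25],
    [0.00, 0.00, 0.00, 0.05, 0.10, 0.25, 0.30, 0.30],
]


def normalize_columns(table):
    """Turn P(T | theta) into P(theta | T), assuming a uniform prior over theta."""
    n_cols = len(table[0])
    totals = [sum(row[j] for row in table) for j in range(n_cols)]
    return [[row[j] / totals[j] for j in range(n_cols)] for row in table]


table2 = normalize_columns(TABLE1)

print("theta  " + "  ".join(f"{s:>4}" for s in SCORES) + "  total")
for theta, row in zip(THETAS, table2):
    cells = "  ".join(f"{p:4.2f}" for p in row)
    print(f"{theta:>5}  {cells}  {sum(row):5.2f}")
```

The row totals printed at the end are not probabilities of anything; they only show that rows of Table 2 no longer sum to 1.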

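So, in this example, a maximum likelihood estimator picks one theta value for each of the eight possible scores.  Let's define M as the solution to the maximum likelihood problem.  From Table 2 all we need to do is read off the theta with the highest probability in each column.  Because the prior is uniform, each column of Table 2 is just the matching column of Table 1 rescaled, so the largest entry sits in the same row in both tables.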

M(T=100) = -4
M(T=200) = -3
M(T=300) = -2

The maximum likelihood estimate need not land on every potential theta value.  In this case the estimate jumps from -2 to 1, skipping -1 and 0.  This is somewhat an artifact of the discrete nature of this setup.  If theta and the score were continuous then it is less likely that some values of theta would be skipped.

M(T=400) = 1
M(T=500) = 1
M(T=600) = 2
M(T=700) = 3 or 4

Most maximization algorithms need the function being maximized to have a single peak.  Here theta=3 and theta=4 tie at T=700, and a table like this would be hard for many maximization algorithms to maximize across.  This is not really a problem because this table is somewhat contrived.

M(T=800) = 4
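Reading off the column peaks can be sketched as below.  This is only an illustration: `most_likely_thetas` is a hypothetical helper, and it works on Table 1 directly, since dividing a column by its total does not move its maximum.  Ties are returned together rather than broken arbitrarily.

```python
import math

# Table 1: rows are theta = -4..4, columns are scores 100..800.
THETAS = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
SCORES = [100, 200, 300, 400, 500, 600, 700, 800]
TABLE1 = [
    [0.60, 0.25, 0.10, 0.05, 0.00, 0.00, 0.00, 0.00],
    [0.30, 0.50, 0.15, 0.05, 0.00, 0.00, 0.00, 0.00],
    [0.20, 0.30, 0.40, 0.05, 0.05, 0.00, 0.00, 0.00],
    [0.10, 0.20, 0.30, 0.20, 0.10, 0.05, 0.05, 0.00],
    [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05],
    [0.00, 0.05, 0.10, 0.25, 0.30, 0.15, 0.10, 0.05],
    [0.00, 0.00, 0.05, 0.15, 0.20, 0.30, 0.15, 0.15],
    [0.00, 0.00, 0.05, 0.05, 0.15, 0.20, 0.30, 0.25],
    [0.00, 0.00, 0.00, 0.05, 0.10, 0.25, 0.30, 0.30],
]


def most_likely_thetas(table, thetas, col):
    """Return every theta whose P(T | theta) ties for the maximum in column `col`."""
    column = [row[col] for row in table]
    peak = max(column)
    return [t for t, p in zip(thetas, column) if math.isclose(p, peak)]


mle = {score: most_likely_thetas(TABLE1, THETAS, j) for j, score in enumerate(SCORES)}
for score, best in mle.items():
    print(f"M(T={score}) = {' or '.join(str(t) for t in best)}")
```

Returning a list makes the flat spot at T=700 explicit instead of hiding it behind whichever value a plain argmax happens to pick first.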

Thus, this table illustrates some of the common problems with maximum likelihood.  Some values of the parameter are hard to identify (e.g. theta=0, which is never the most likely value for any score), while some problems have "flat" spots in the function being maximized that can keep the algorithm from converging.

When looking at Table 2, think not just about the peaks but also about the "information" that you gain by observing a particular test score.  In other words, how much information do you get from knowing a particular test score?  If, for instance, you knew that T=100, then you would know that the most likely theta is -4 and that theta has a 96% chance (48+24+16+8) of being between -4 and -1.  This can be thought of as a 96% confidence interval (strictly speaking, a Bayesian credible interval, since it is a probability statement about theta itself).  If, however, T=400, you know that the most likely theta is 1 but only that there is a 95% chance that theta is between -3 and 4.  That is a pretty wide interval for your estimate.  Thus we can see that some test scores carry more "information" than others.
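The two interval probabilities quoted above can be checked with a short sketch.  The helper `interval_mass` is a name I made up, the likelihoods are the T=100 and T=400 columns of Table 1, and the uniform prior over theta is again assumed.

```python
THETAS = [-4, -3, -2, -1, 0, 1, 2, 3, 4]

# P(T | theta) for theta = -4..4, copied from the T=100 and T=400 columns of Table 1.
LIK = {
    100: [0.60, 0.30, 0.20, 0.10, 0.05, 0.00, 0.00, 0.00, 0.00],
    400: [0.05, 0.05, 0.05, 0.20, 0.20, 0.25, 0.15, 0.05, 0.05],
}


def interval_mass(likelihoods, thetas, low, high):
    """P(low <= theta <= high | T), assuming a uniform prior over thetas."""
    total = sum(likelihoods)
    inside = sum(p for t, p in zip(thetas, likelihoods) if low <= t <= high)
    return inside / total


mass_t100 = interval_mass(LIK[100], THETAS, -4, -1)
mass_t400 = interval_mass(LIK[400], THETAS, -3, 4)
print(f"T=100: P(-4 <= theta <= -1) = {mass_t100:.2f}")
print(f"T=400: P(-3 <= theta <=  4) = {mass_t400:.2f}")
```

Roughly the same probability mass needs four theta values at T=100 but eight at T=400, which is one way to put a number on how much less informative a score of 400 is.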

Let's imagine testing a hypothesis using Table 2:
H0: theta=-4 alpha=.05
Observe:
T=100 fail to reject
T=200 fail to reject
T=300 fail to reject
T=400 reject
T>400 reject

Thus we have enough information from this test to potentially reject the null when H0: theta=-4.  If, however, the null were H0: theta=0, then only T=100 would let us reject at the 5% level; at the 10% level, T=200 and T=800 would as well (their Table 2 entries for theta=0 are 0.07 and 0.06).
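Here is a small sketch of the decision rule as I read it from the list above: reject H0 when the Table 2 entry for the hypothesized theta falls below alpha.  That rule is my interpretation of how the reject and fail-to-reject calls were made, not a classical p-value test, and `rejecting_scores` is a hypothetical name.

```python
SCORES = [100, 200, 300, 400, 500, 600, 700, 800]

# Column totals from the Total row of Table 1.
COLUMN_TOTALS = [1.25, 1.40, 1.30, 1.05, 1.10, 1.10, 1.00, 0.80]

# Rows of Table 1 for the two null hypotheses discussed in the text.
LIK_BY_THETA = {
    -4: [0.60, 0.25, 0.10, 0.05, 0.00, 0.00, 0.00, 0.00],
    0: [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05],
}


def rejecting_scores(null_theta, alpha):
    """Scores whose Table 2 entry P(theta = null_theta | T) is strictly below alpha."""
    posts = [p / total for p, total in zip(LIK_BY_THETA[null_theta], COLUMN_TOTALS)]
    return [score for score, post in zip(SCORES, posts) if post < alpha]


print("H0: theta=-4, alpha=0.05 ->", rejecting_scores(-4, 0.05))
print("H0: theta=0,  alpha=0.05 ->", rejecting_scores(0, 0.05))
print("H0: theta=0,  alpha=0.10 ->", rejecting_scores(0, 0.10))
```

Under this rule the 10% threshold for theta=0 flags T=200 as well as T=800, since the Table 2 entry for T=200 is about 0.07.  T=700 sits exactly at 0.10 and is not rejected because the comparison is strict.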

I hope this discussion is useful.  I certainly found it useful to think through.