# Interestingly, one of the founders of Item Response Theory (Frederic Lord) developed his own concept of information. His approach was both unique and interesting, but it ultimately led back to the same Fisher Information.
# Fisher Information is a quantity that measures, for any particular value of a parameter, how much information the data provide about that parameter. It is a difficult subject that I struggle with.
# It might be useful to look at how it can be found.
# First keep in mind that for Fisher Information we are primarily concerned with Maximum Likelihood.
# For maximum likelihood we are primarily concerned with maximizing ln(pdf(theta)) by choosing theta.
# Define L = ln(pdf(theta))
# The score is equal to the first derivative of L with respect to theta. Define Score = S = dL/dtheta = d ln(pdf(theta))/d theta
# We know that if the MLE maximization routine has worked properly then the score, evaluated at the estimate, is equal to zero.
# Now the Fisher Information is equal to the expected value of the score squared (the second moment) given theta:
# I = E(S^2|theta)
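# As a quick sanity check (this Bernoulli example is my own, not part of the original
# discussion): for a single Bernoulli(p) observation the score is S = x/p - (1-x)/(1-p),
# and the Fisher Information works out to 1/(p*(1-p)). We can verify E(S^2|p) numerically:
set.seed(42)
p = .3
x = rbinom(10^5, 1, p)
S = x/p - (1-x)/(1-p)
mean(S^2)   # close to 1/(p*(1-p)) = 4.76
mean(S)     # the expected score is approximately zero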
# Now, the tricky part is thinking how to make sense of this thing we call information.
# First let's think how it is used.
# For one thing, the (asymptotic) standard error of the MLE is se = 1/I^.5
# In a similar vein, the Cramér-Rao bound says that the variance of any unbiased estimator f of theta has a lower limit of 1/I: Var(f) >= 1/I
# Thus there is an inverse relationship between information and the variance of our estimator: the more information we have, the more precisely we can estimate theta.
# But how do we make intuitive sense of this quantity?
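# To make the bound concrete, here is a small simulation of my own (not in the original
# text). For n Bernoulli(p) draws the total information is I = n/(p*(1-p)), so the bound
# is Var >= p*(1-p)/n. The sample mean, which is the MLE and unbiased, attains it:
set.seed(101)
n = 50; p = .3; reps = 10^4
phat = rbinom(reps, n, p)/n
var(phat)             # empirical variance of the estimator
p*(1-p)/n             # 1/I, the Cramér-Rao lower bound
1/sqrt(n/(p*(1-p)))   # se = 1/I^.5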
# Item Response theory has some very nice applications of information that I believe shed light on other uses of information.
# For the Rasch Model the item information for one item is: I(theta) = p(theta)*(1-p(theta))
# Where p is the probability of getting the item correct.
# Let's map this out for a range of thetas assuming the difficulty parameter is one.
theta = seq(-4,6,.1)
# Now let's define the Rasch function:
rasch = function(theta,b) exp(theta-b)/(1+exp(theta-b))
# Let's also define the information value.
rasch.info = function(theta,b) rasch(theta,b)*(1-rasch(theta,b))
plot(theta, rasch.info(theta,1), ylab = "Information", main="The most information is gained at theta=1 (1PL)", type="l")
# We can see that information peaks at theta=1. What does this mean for practical purposes? If you want to learn the most about a student's ability, give them a question whose difficulty is at their ability level. If instead you give them a question that is far too easy or far too hard, then whether they do well or poorly on it you have not learned much about their ability level on the theta scale.
# Why? Because anyone nearby on the theta scale would be almost equally likely to answer that question the same way, so the response does little to distinguish between abilities.
# We can see this by looking at the item characteristic curve:
plot(theta, rasch(theta,1), main="Item Characteristic Curve", ylab="Probability of Correct Response", type="l")
# We can see that the change in the probability of getting the item correct is largest at theta=1. However, as theta gets very large or very small there is very little change in the probabilities (within each respective range) of getting the item correct as a result of a small change in theta.
# Another way of thinking about this: if we had two students of about the same ability and we wanted to know which was stronger, we would want to give them the question whose difficulty was most closely aligned with their ability.
# Let's look at a few more different IRT models.
# The 2 parameter logistic (2PL) model is very similar to the one parameter model:
# I(theta) = a^2*p(theta)*(1-p(theta))
# The only difference is that the two parameter model allows there to be more or less information generated by each item as a result of the "discrimination" power (a) of the item. It is worth noting, though, that a high-a item does not strictly dominate a low-a item in terms of information for all values of theta. This is because p*(1-p) is also a function of theta. Let's see this in action:
PL2 = function(theta,a, b) exp(a*(theta-b))/(1+exp(a*(theta-b)))
PL2.info = function(theta, a, b) a^2*PL2(theta,a,b)*(1-PL2(theta,a,b))
plot(theta, PL2.info(theta,2,1), ylab = "Information", main="Larger a creates a larger peak but shallower tails", type="l")
# Compared against the a=1 curve from before, we greatly prefer a=2 for theta values roughly between -1 and 3.
# However, the item with less discriminatory power may have more information in the tails. This is somewhat difficult to understand. As an example, imagine two different test questions. One is an arithmetic question for which students must demonstrate knowledge of long division. The other is a word problem in which students must piece together components and understand the concepts behind the math. The first question may be better for identifying whether a student does or does not know long division. The second question may be worse at identifying mastery of one specific skill, but it demonstrates the student's ability to pull together various math concepts into a coherent answer. Thus we might not be able to infer much about arithmetic ability from a correct answer, but any student answering the question correctly, whatever their ability level, tells us something about their ability.
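# To see the tail trade-off concretely, here is a small check of my own using the
# functions above: compare the a=2 and a=1 information curves at b=1 and find the
# interval of theta where the high-a item dominates.
hi = PL2.info(theta, 2, 1)
lo = PL2.info(theta, 1, 1)
range(theta[hi > lo])   # about -0.66 to 2.66 analytically; on this grid: -0.6 2.6
# Outside this interval the a=1 item actually carries more information.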
# We will also look briefly at the 3 parameter Logistic Model:
# I(theta) = a^2 *(p-c)^2/(1-c)^2 * (1-p)/p
# It can be shown that as c increases, the information function monotonically decreases. This makes sense in that the guessing parameter c works against information: as c gets larger, the likelihood of the student getting the question right by pure chance also gets larger.
PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))
PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)
plot(theta, PL3.info(theta,1.2,1,.4), ylab = "Information", main="3PL information a=1.2, b=1, c=.4", type="l")
# We can see that the 3 parameter logistic model has an asymmetric information curve. This is the result of the guessing parameter. The problem with respondents being able to guess correctly is that we end up with less information on low ability respondents than we would like.
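# As one more check of my own using PL3.info: increasing c lowers the information at
# every theta. For example, at theta = b = 1 with a = 1.2, information falls from .36
# to .24 to about .154 as c rises from 0 to .2 to .4:
sapply(c(0, .2, .4), function(c) PL3.info(1, 1.2, 1, c))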