Thursday, September 13, 2012

Relating Classical Test Theory Parameters to IRT Parameters

* Relating Classical Test Theory Parameters to IRT Parameters

* This post follows some of the discussion in chapter 2 of Mark Reckase's Multidimensional Item Response Theory.

* First let's relate the classical test theory idea of difficulty to the IRT difficulty parameter b.

* We will use the three-parameter logistic (3PL) model as our underlying true data generating process (DGP).

* (2.13) P(U=1) = c + (1-c) * exp(a(t - b))/(1 + exp(a(t - b)))

* That is, the probability of answering the item correctly is a function of c (the guessing parameter of the item), a (the discrimination parameter of the item), b (the difficulty parameter of the item), and t (the ability of the test taker).
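
* A quick hand check of (2.13), using made-up illustrative values c = .2 and a = 1 (my assumptions, not values from the book): when ability equals difficulty (t = b) the logistic term is 1/2, so P = c + (1-c)/2.
display .2 + (1 - .2) * exp(1*(0 - 0)) / (1 + exp(1*(0 - 0)))
* This displays .6, halfway between the guessing floor c = .2 and 1.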

* We want to relate this to the classical test theory idea of difficulty.  In classical test theory the difficulty of an item is the probability, in the population being tested, of getting the item correct.

* Let us imagine a heterogeneous group of 1000 students.

* Start from an empty data set so that set obs does not fail on data already in memory
clear
set obs 1000

* Create a student ID
gen stud_id = _n

* Draw each student's ability from a standard normal distribution
gen t = rnormal()

* Now let's see how we can compare classical difficulties (CD) to the IRT difficulty parameter b.

* Let's imagine that all of our students take the same test with 200 items that range in b value, where b is drawn independently of the parameters a and c.

* Let's have all of our data listed vertically.

* Create 200 test items for each student
expand 200

* Give each item a different ID
bysort stud_id: gen item_id = _n

* There are many ways to make sure that a given item has the same parameters for every student.  I will use a forvalues loop.  Generating a separate data set of item parameters and merging it in would be another good way.  Drawing the parameters from distributions for every row and then taking the average across all rows with the same item ID would probably be the most compact code, but it would make it difficult to specify exact distributional parameters.
gen a = .
gen b = .
gen c = .

qui forv i = 1/200 {
  * This will only draw one random variable for each local macro
  local a = runiform()/2+.4
  local b = rnormal()*2
  local c = runiform()/4

  * This will assign that draw to the item `i'
  replace a = `a' if item_id==`i'
  replace b = `b' if item_id==`i'
  replace c = `c' if item_id==`i'
}

* Now let's generate the probability of getting that problem correct given the parameter values and the student t scores.

gen P = c + (1-c) * exp(a*(t - b))/(1 + exp(a*(t - b)))
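
* A minimal sanity check (my addition, assuming the variables c and P created above): under the 3PL model every probability must lie between the guessing floor c and 1.
assert P >= c & P <= 1
summarize P c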

* If we try to do a direct scatterplot then we are overwhelmed.

* Instead we want to know the probability of a correct answer for each item (given the population being tested).

* Let us first preserve the current state of our data.
preserve

* So we collapse the data set to item level.

* The default of the collapse command is to take the mean.
collapse a b c P t, by(item_id)
* I included the t value just as a debugging check.  The mean of t should be constant across all items.

label var b "IRT b"
label var P "Difficulty (probability of a correct answer)"

scatter P b

* Return to the student-by-item data, which the later steps need.
restore

* Things start fanning out as b gets large because of the guessing parameter c.

* Even when b is so large that the probability of getting the answer correct based on knowledge is close to zero, there is still the chance of guessing the correct answer.
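
* To illustrate that floor with hypothetical values (a = .65, c = .2 and an average student with t = 0; none of these come from the simulation): as b grows, P approaches c rather than zero.
foreach b of numlist 0 2 4 8 {
  display "b = `b'   P = " .2 + (1 - .2) * invlogit(.65 * (0 - `b'))
}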


* In order to compare the discrimination parameter to classical test theory we will look at the point-biserial correlation, which is the correlation between the responses to an item on the test and the total test score.

* First we need to draw actual item responses

gen u = rbinomial(1, P)

* Now let's generate total test scores

bysort stud_id: egen total_score = total(u)
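
* A hedged aside: total_score includes the item being examined, which pushes the item-total correlation upward.  A common variant (not used in the rest of this post) correlates the item with the score on the remaining items.  rest_score is a name made up for this sketch.
gen rest_score = total_score - u
corr u total_score if item_id == 1
corr u rest_score if item_id == 1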

* Now let's generate the point-biserial values for our items

gen pbiserial = .

qui forv i = 1/200 {
  corr u total_score if item_id == `i'
  replace pbiserial = r(rho) if item_id == `i'
}

* Collapse to item level for plotting, keeping the student-by-item data for later.
preserve
collapse a pbiserial, by(item_id)

label var a "IRT discrimination parameter (a)"
label var pbiserial "Classical test theory point-biserial correlation"
twoway (lfitci pbiserial a)  (scatter pbiserial a)
restore


* In order to approximate c using just classical test scores we will look at the lowest 10% of students in terms of total scores and see how they perform on each question on average.

xtile score_pct = total_score, nquantiles(10)
* This should have created 10 groups ranked according to total_score

bysort item_id: egen c_hat = mean(u) if score_pct == 1

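* Collapse to item level for plotting, keeping the student-by-item data for later.
preserve
collapse c c_hat, by(item_id)

label var c "IRT guessing parameter (c)"
label var c_hat "A guess at the guessing parameter"
twoway (lfitci c_hat c)  (scatter c_hat c)
restore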


* We can see there is some relationship between our guess at the guessing parameter using the item responses for the lowest 10% of students and the true guessing parameter.

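* The problem with this graph is that items have different difficulties, so on easy items even the lowest-scoring students answer correctly more often than guessing alone would allow.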

* Finally let's look at how student ability t relates to the total test score.

collapse total_score t, by(stud_id)

label var total_score  "Total test score"
label var t "IRT unidimensional ability (t)"
twoway (lfitci total_score t)  (scatter total_score t)


* It seems that total test score provides a reasonable linear approximation for student ability even when responses are drawn using an IRT data generating process.
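
* As a rough numerical check on the graph (my addition, assuming the student-level data with total_score and t are in memory after the collapse above): the Pearson correlation measures the linear relationship and the Spearman rank correlation measures agreement in ordering.
corr total_score t
spearman total_score t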

* The largest advantage of IRT relative to classical test theory total-score estimators is the external applicability of the estimates.  IRT is supposed to predict future performance on different tests, while classical test theory only predicts performance on the same or similar tests (if I understand this properly).
