## Saturday, September 29, 2012

### Simulating 3 parameter IRT data

# Some Easy Item Response Theory Data Generating Commands

# The following code will draw random test items from several common item response theory forms.

# The command mat.binom will be useful in drawing a matrix of binomial results. I will use a command I programmed in a previous post. (http://www.econometricsbysimulation.com/2012/09/item-response-theory-estimation.html)
mat.binom <- function(n,p) {
bin.mat <- p*0
for (i in 1:nrow(p)) {
for (ii in 1:ncol(p)) {
# This will draw a random binomial for each of the probabilities.
bin.mat[i,ii] <- rbinom(1,n,p[i,ii])
}
}
return(bin.mat)
}

# Define a function for the 3 parameter item response model. D is automatically set to 1.7 to create equivalent responses to that of using the normal distribution.
rirt3 <- function(theta=0,a=1,b=0,c=0, D=1.7) {
nitems = max(length(a),length(b),length(c))
# Note theta, a, b, or c can be vectors.  The length of the a,b,c vector(s) tells R how many items to draw.

# This creates a matrix of theta values
m.theta <- matrix(theta, nrow=length(theta), ncol=nitems)

# This creates a matrices of item parameter values for parameters a,b,c called m.a,m.b, and m.c
for (v in c("a","b","c")) assign(paste("m.",v,sep=""),
t(matrix(get(v), ncol=length(theta), nrow=nitems)))

# This code does the same thing as the following code.
# m.a <- t(matrix(a, ncol=length(theta), nrow=nitems))
# m.b <- t(matrix(b, ncol=length(theta), nrow=nitems))
# m.c <- t(matrix(c, ncol=length(theta), nrow=nitems))

# Now we calculate probabilities of getting the problem right:
m.p = m.c+(1-m.c)/(1+exp(-D*m.a*(theta-m.b)))

# We just need to pass to the probability matrix into the mat.binom function.
Y = mat.binom(1, m.p)

# Now let's return the items that we have calculated.
return(Y)
}

# Let's see this command in action
theta=seq(-8,8,1)
a=1
b=seq(-10,10,5)
c=seq(.1,.5,.1)
# b will be matched with c such that -10 is with .1, -5 with .2, 0 with .3, etc.
# We can see how these will be matched by observing
cbind(a,b,c)

rirt3(theta,a,b,c)

# We can see that though the last item is very difficult there are a few who have guessed it.  That is because the guessing parameter is very high.

# Of course the one parameter and the two IRT models are easily gotten from the three parameter if a=1 and c=0.  The one parameter IRT model is also known as the Rasch model.

# Let's now define the Generalized Graded Response model.

rgrm <- function(theta=0,a=cbind(rep(1,5),rep(1,5)),b=cbind(rep(0,5),rep(1,5)), D=1.7) {
# b is now an input matrix with each row representing a different item and each column representing a different grade for that item.  The number of columns should be equal to the maximum number of grades for all of the items.  Items may have less than the full number of grades indicated by NA values at upper levels.

# We will use the grm Cululative Grade Function defined in a previous post as what we will use to generate our probabilities.
grm <- function(theta=1, a=1, b = c(0,1,2), cplot=T, pplot=T, stackpoly=F ) {
CGF = matrix(NA, ncol=ngrade  , nrow=length(theta))
for (i in 1:length(theta)) CGF[i,] = exp(a*(theta[i]-b))/(1+exp(a*(theta[i]-b)))
return(CGF)
}

# This will be the number of items to generate.
if (sum(dim(a)!=dim(b))>0) warning()
nitems = nrow(b)

# This matrix will hold the results of the items.
Y <- matrix(NA, nrow=length(theta), ncol=nitems)

print(nitems)
for (i in 1:nitems) {
a.sub = !is.na(a[i,])
b.sub = !is.na(b[i,])
print(grm(theta=theta, a=a[i,a.sub], b=b[i,b.sub]))
}
}

rgrm()
rgrm(theta=c(1,2,9,2,3,4))
rgrm(theta=c(1,2,9,2,3,4), a=cbind(rep(1,6),rep(1,6),rep(c(NA,1),3)),b=cbind(rep(1,6),rep(2,6),rep(c(NA,3),3)))

## Friday, September 28, 2012

### Introduction to Multi-Agent Simulations

Multi-agent simulations are simulations in which two or more agents take sequential actions frequently with respect to each other within a constructed environment.  This type of simulation is powerful in that it can demonstrate a hypothesis or generate results to situations which are frequently too complicated to solve by hand.  These types of simulations are frequently used to support hypotheses that researchers would otherwise have difficulty showing with real data.  They are also useful at exploring environments that are inherently difficult or impossible to run experiments in.

The simulations by Robert Axelrod and biologist W.D. Hamilton attempting to demonstrate the evolution of cooperation [1] use multi-agent simulations in an attempt to show that within an environment where agents can either cooperate or cheat with repeated outcomes the best strategy under many circumstances is to cooperate.  This type of simulation is uniquely powerful in that it evaluates many competing strategies ultimately identifying one that works under a well-defined scenario.

Multi-agent simulations are appealing in that they allow researchers to generate data and evaluate scenarios that are rich and complex and potentially more similar to that of the “real world”.  Within a multi-agent simulation each agent acts with its own strategy.  That strategy can either be a common objective or an individual objective.

These types of simulations might be effectively deployed to explain the behaviors of ants.  Each individual ant has a relatively simple set of commands.  By simulating many competing commands and evaluating the effectiveness of different types of ants at building colonies within a rich environment it could be informative.

Within applied economics multi-agent simulations often have a spatial component that is used to identify placement and relation of agents within the environment.  Other relational scales may be appropriate as well.  If looking at social network analysis it may be useful to have both a spatial as well as shared public spaces networks in which agents can interact.

One of the most challenging aspect of multi-agent simulations is the inherent complexity of the environment and the interactions that agents are making.  Often times, though you have defined the environment and the strategies of agents it is not clear exactly how those strategies interact with the environment to produce the results that they do.

In a way this is the point.  Multi-agent simulations can demonstrate results that are otherwise difficult to demonstrate purely through math or argumentation.  But, as such it is often difficult to be sure what mechanisms are driving the results that are being produced: because of the agents’ actions, the environmental variables, or some other artifact not previously considered.

As such, care in interpretation and rigorous error checking are extremely routines when designing and evaluating simulations.

## Thursday, September 27, 2012

### Using Random Treatment to Control for Self-Selection Bias

* Imagine that you have some data on an unemployed population.  You would like to give job training to some of these workers however you are concerned that those who volunteer to be part of the job training program are fundamentally different than those who did not sign up for the job training.

* We have data on 2000 individuals
clear
set obs 2000

* Educational attainment is known
gen education = rnormal()

gen ability = rnormal()
gen motivation = rnormal()
* Ability and motivation levels are unobservable.

* Thoe who have greater motivation are more likely to sign up for the job training program.
gen p_sign_up = normal(motivation)

* The decision to seek trianing or not is a random one.
gen sign_up = rbinomial(1,p_sign_up)

* If we give training to everybody that signs up then
gen train = sign_up

* Once, workers have recieved training the training has some benefit on expected earnings.

gen u = rnormal()

gen exp_earn = ability + 2*education + motivation + train*.5 + u*2

* Let's see how well we can estimate the returns to training.

reg exp_earn train education

* We can see that because ability (part of the error term) is correlated with assignment to training selection bias into training is causing it to look more effective than it actually is.

******

* Alternatively, when choosing who to give training to we could instead choose to give it only to a random selection of those who signed up for training.

gen train2 = rbinomial(1,.5) if sign_u==1
replace train2 = 0 if train2 == .

gen exp_earn2 = ability + 2*education + motivation + train2*.5 + u*2

* If we stop there then we actually have not helped anything.
reg exp_earn2 train2 education

* However, if we restrict our estimation to only people who signed up for the training program we can use the randomness of the program to estimate unbiased results.
reg exp_earn2 train2 education if sign_up==1

* The only catch is that now we cannot argue that we have an unbiased estimate of the effect of the training program for the entire population.  Only for those that chose to sign up.  This is not entirely a bad thing.  If we make the program voluntary then it will be based on sign up anyways.  Thus selecting based on signing up is not an unreasonable restriction.

## Wednesday, September 26, 2012

### Item Response Theory Estimation

# Item response theory (IRT) is very tricky in a sense because it requires both estimation of the regressors (student ability) as well as estimation of the parameters of the items.

# IRT, I have heard that this somehwhat complicated task can be accomplished though a series of maximum likelihood estimations.  First by specifying specific student abilities then by estimating the item parameters.  Then estimating the student abilities by the item parameters, ect.

# Let's start by simulating some data.

nitem = 40

nstud = 100

# To begin with let's start with a single parameter IRT model

# Probability of getting the item i right for person j is is R(x_i=1)=exp(theta_j-b_i)/(1+exp(theta_j-b_i))

# We will range our item difficulties from -4 to 4

vec.b = seq(-4,4,length.out=nitem)

# We will also range our student ability from -4 to 4

vec.theta = seq(-4,4,length.out=nstud)

# Each student will have a probability p of getting each question right.

# Let's first make matrices from theta and the b parameter sets.
b = t(matrix(vec.b,nrow=nitem,ncol=nstud))
theta = matrix(vec.theta,nrow=nstud,ncol=nitem)
# Inputing vectors into the matrix command will cause them to be duplicated to fill out the values of the matrices.  The only trick is making sure that they are read in the right direction.  The command writes the input vector row by row.  Thus the b matrix becuase the rows are constant must be transposed while the theta matrix does not.

# Each column is an item and each row is a different student.

# Now let's calculate the probability of any person answering any of the questions correctly.
p = exp(theta-b)/(1+exp(theta-b))

# The next graph requires the plotrix package.
require(plotrix)
stackpoly(t(p),col=gray(seq(0,1,length=nstud)),
border=gray(seq(.3,1,length=nstud)-.2),
xlab="b - Difficulty Parameter",
ylab= "Probability of Correct Response",
main="Mapping of Probability of Correct Responses for Homogenous Items")

# Each line represents the response probability by a singe student.  All students have lower probability of correct response as the items get more difficult.

# Create a function that will draw a matrix of binomial responses
mat.binom <- function(n,p) {
bin.mat <- p*0
for (i in 1:nrow(p)) {
for (ii in 1:ncol(p)) {
# This
bin.mat[i,ii] <- rbinom(1,n,p[i,ii])
}
}
return(bin.mat)
}

# Now let's generate the actual responses to the probabilities
y = mat.binom(1,p)

total.score <- apply(y,1,sum)

plot(vec.theta, total.score, xlab=~ theta, ylab="Total Score",
main="Total Score as a Function of Ability")

bhat <- rep(NA,nitem)
for (i in 1:nitem) {
bhat[i] <- (glm(y[,i] ~ 1 , family=binomial("logit")))[[1]]
}

plot(vec.b,-bhat, xlab= "b - Item Difficulty", ylab= "Estimated Difficulty")
# Actually our graph is looking pretty good.  The scale is off but that is really unimportant since latent trait scales are arbitrary anyways.
# To scale the b I first make it so that the min is zero and the range is 1.  Then I multiply by the desired range (8) and add 4 to get the mean to 0.
b.scaled <- (bhat-min(bhat))/(max(bhat)-min(bhat))*-8+4

plot(vec.b,b.scaled, xlab= "b - Item Difficulty", ylab= "Estimated Difficulty", main="Estimates of Item Difficulty")

fit.line = lm(b.scaled~vec.b)

abline(fit.line, col="red")
# Interestingly, it is actually unneeded to do anything more to estimate student ability since the one parameter Rasch model is fully identified as a function of just the number of items each student got correct.

# Let's see it in action.

theta.hat <- rep(NA,nstud)
for (i in 1:nstud) {
theta.hat[i] <- print(glm(y[i,] ~ 1 , family=binomial("logit"))[[1]])
print(paste(i))
}

plot(vec.theta, theta.hat , xlab= ~ theta, ylab= "Estimated Student Ability", main="Estimated Student Ability Unscaled")
# Actually our graph is looking pretty good.  The scale is off but that is really unimportant since latent trait scales are arbitrary anyways.

theta.scaled <- (theta.hat-min(theta.hat))/(max(theta.hat)-min(theta.hat))*8+4

plot(vec.theta, theta.scaled, xlab= ~ theta, ylab= "Estimated Student Ability", main="Estimated Student Ability Scaled")

fit.line = lm(theta.scaled ~vec.theta)

abline(fit.line, col="red")

# This shows us that even a simple estimation technique, (two series of logit regressions) can we quite effective when we are estimating only one parameter for both students and one for the items.

## Tuesday, September 25, 2012

### Maximum Likelihood Using R

# The short tutorial by Professor Marco Steenbergen is very helpful in learning to program maximum likelihood in R.

# http://www.unc.edu/~monogan/computing/r/MLE_in_R.pdf

# I will attempt my first R MLE with a probit.

# Let's first generate our data

# Declare the number of observations to create.
nobs=1000

x1 = rnorm(nobs)*.15^.5
x2 = rnorm(nobs)*.35^.5
u = rnorm(nobs)*(.5)^.5

# I have generated the variables so that the inside of the normal CDF has a standard deviation of 1.  This should help the coefficients be more easily recovered.

inside = 2*x1+2*x2+u

# Using basic
# Var(inside) = (.15*2)^2*var(N(0,1)) + (.35*2)^2*var(N(0,1)) + (.5^.5)^2*var(N(0,1)) = .15+.35+.5 =1

# The probability of have a success is equal the cumulative density function at the inside value.
p = pnorm(inside)

y = rbinom(nobs,1,p)

mydata = data.frame(y=y,x1=x1,x2=x2)

# The estimates we would like to duplicate are:

# First we must specify the function to maximize
# All we need to do is to specify a function that takes choice values and maximizes over

myprobit <- function(theta) {
beta0 <- theta[1]
beta1 <- theta[2]
beta2 <- theta[3]

# Calculate the log likelihood values for a set of betas
log.lik <- sum(log(pnorm((-1)^(1-mydata$y)*(beta0 + beta1*mydata$x1 + beta2*mydatax2)))) # This might look quite confusing. However, it basically says that if y=1 then count beta0+beta1*x1+beta2*x2 as positively contributing to the probability. However, if y=0 then we need to count not the contribution but in effect the negative contribution of the inside term. That is when y is positive count the inside term as predicting a positive outcome. When it is negative count the inside term as falsely predicting a positive outcome. I am not sure how much sense this all makes. I am still trying to wrap my head around it all. # Return that log likelihood value. return(log.lik) } # Let's test out our likelihood function using a generic set of parameters myprobit(c(1,1,1)) # Let's see what the probit log likelihoods look like as we vary B1 from -1 to 2 b.range <- seq(-1,2.5,.1) b0.loglik <- b1.loglik <- b2.loglik <- 0*b.range for (i in 1:length(b.range)) { b0.loglik[i] <- myprobit(c(b.range[i],1,1)) b1.loglik[i] <- myprobit(c(1,b.range[i],1)) b2.loglik[i] <- myprobit(c(1,1,b.range[i])) } # I will define a quick min max function to help setting up the range of plots minmax <- function(x) c(min(x),max(x)) plot(minmax(b.range), minmax(c(b0.loglik, b1.loglik, b2.loglik)), type="n", main="Log-Likelihood Around the point (1,1,1)", ylab = "log likelihoods", xlab= ~ beta ) lines(c(1,1),minmax(c(b0.loglik, b1.loglik, b2.loglik)), col="green", lwd=2) lines(b.range, b0.loglik, col="black", lwd=2) lines(b.range, b1.loglik, col="blue", lwd=2) lines(b.range, b2.loglik, col="purple", lwd=2) # The graph displays how the log likelyhood changes with respect to different coefficients around the point (1,1,1). If we were using the Newton-Ralphsony algorithm then B0 (green) would step left, while B1 and B2 would step right. This is because the alorithm use the determinate at the point to approximate the next place to step. # Looking good. By specifying the control fnscale as =-1 we turn the optimization problem from the default minimize to maximize. optim(c(1,1,1), myprobit, control=list(fnscale = -1)) # Great! Though, it is worth noting that the coefficients are not identical. This is not neccessarily a problem since we would expect different optimization algorithms to produce different results. # The vector c(1,1,1) is the starting values of for the coefficients to be estimated. # Notice that this has worked fine for this specific model prob(p)=pnorm(B0+B1*x1+B2*x2). # Clearly, a more sophisticated level of programming would be neccessary to duplicate the flexibility of the probit command. That is, a model that allows for more variables to be specified. ## Monday, September 24, 2012 ### Generalized Graded Response Model # Graded Response Model # The graded response model (grm) by Fumiko Samejima is an extension of the dichotomous item response model developed by Fredrick Lord. # The graded response model allows for answers to items (test questions) to have values that are between the full value for the answer and no value. An example of this would be an algebra question that gave points for steps successfully completed even if the final answer was not correct. # This post will define a function that generates the probability each score value or lower as well as the probability of getting given a specific ability level (this is very closely related to CDFs and PDFs). grm <- function(b = c(0,1,2), a=1, theta=1, cplot=T, pplot=T, stackpoly=F ) { # a is the discrimination parameter for the levels of the item. If a is a constant then the item is "homogenous". Otherwise it is heterogenous and each grade level has its own specified a. # Let's first check if the b are arranged from lowest to highest. if (sum(rank(b)!=1:length(b))>0) warning ("b must be ranked from lowest to highest") # The rank function will rank b from 1 to the number of observations in b # If that ranking does not align with the vector from 1 to length(b) then we have a problem. # c=T will cause the cumulative grade graph to be plotted. # p=T will cause the probability graph graph to be plotted. # Count the number of grades. ngrade = max(length(b),length(a)) # Expand a to or b to be the same length if ((length(a)==1)&(length(b)>1)) a<-rep(a,length(b)) if ((length(b)==1)&(length(a)>1)) b<-rep(b,length(a)) # First let us define a matrix that will return results CGF = matrix(NA, ncol=ngrade , nrow=length(theta)) PG = matrix(NA, ncol=ngrade+1, nrow=length(theta)) # Each column will be for a different grade while each row will be for a different theta (if input) for (i in 1:length(theta)) { # The inverse cumulative grade function (this function returns the probability that grade of a particular value or higher will be returned. CGF[i,] = exp(a*(theta[i]-b))/(1+exp(a*(theta[i]-b))) # The probability of getting a 0 on an item is the probability of not getting any points. PG[i,1] <- 1 - CGF[i,1] # The probability of getting the highest grade is the same as the probability of getting the highest grade or more. PG[i,ngrade+1] <- CGF[i,ngrade] # For the grades in between max and min the values are more dynamically generated. for (ii in 1:(ngrade-1)) { PG[i,ii+1] <- CGF[i,ii] - CGF[i,ii+1] } } # Plot the graphs of interest. # First let's set the number of plot frames to 1 as default par(mfrow=c(1,1)) if ((cplot==T)&(pplot==T)) par(mfrow=c(2,1)) plottitle <- paste("G1", " a=",a[1]," b=",b[1], sep="") for (i in 2:ngrade) { plottitle = paste(plottitle, "; G", i, sep="") if (sd(a)!=0) plottitle = paste(plottitle, " a=",a[i], sep="") if (sd(b)!=0) plottitle = paste(plottitle, " b=",b[i], sep="") } # If cumulative grade plot is to be graphed but stackpoly is not (the shaded graph) then do the following if ((cplot==T) & (stackpoly==F)) { # Plot an empty graph plot(c(min(theta),max(theta)),c(0,1), type="n", ylim=c(0,1), ylab = "Probability", xlab= ~ theta, main=plottitle) # Plot the individual graded probabilities for (ii in 1:ngrade) { lines (theta,CGF[,ii]) } } # If stack poly is set to on we will use the stack poly setting to make a graph with shades if ((cplot==T) & (stackpoly==T)) { require(plotrix) stackpoly(matrix(theta,nrow=length(theta),ncol=ngrade),CGF[,ncol(CGF):1], col=gray(seq(0.1,0.9,length=ngrade)), border="black", ylab="Probability", xlab= ~ theta, main=plottitle) } # If the plot densities is set to on then this will plot the densities in their own graph. if (pplot==T) { plot(c(min(theta),max(theta)),c(0,1), type="n", ylim=c(0,1), ylab = "Probability", xlab= ~ theta, main=plottitle ) for (ii in 1:(ngrade+1)) { lines (theta,PG[,ii]) } } # This will return both the probability of each grade value for each theta value # as well as the probability of that grade or higher in the C graph. return(list(PG=PG,CGF=CGF)) } test1 = grm(b=c(-2,-1,0,1,2), a=2, theta=seq(-4,4,.1)) # A homogenous item with a somewhat low discrimination parameter leading to relatively small probabilities of any partial outcome if the item taking population has theta parameters uniformly distributed between -4 and 4 test2 = grm(b=c(-2,-1,0,1,2), a=2, theta=seq(-4,4,.1), stackpoly=T ) # Setting stackpoly on will generate better looking graphs. # However this requires that the package plotrix is installed test3 = grm(b=c(-2,-1,0,1,2), a=4, theta=seq(-4,4,.1), stackpoly=T ) # Increasing the discrimination parameter will increase the steepness of the IRT curves and thus the peak size of the graded responses. test4 = grm(b=c(-2,-1,1,2), a=4, theta=seq(-4,4,.1), stackpoly=T ) # Removing the middle scoring will cause the probability of the new middle outcome to double. test5 = grm(b=c(-3,-1,0,1,3), a=c(4,2,3,4,5), theta=seq(-4,4,.1), stackpoly=T ) # Creating a heterogenious item will cause the curves to no longer be symetric and the peaks to vary in size. ## Sunday, September 23, 2012 ### Program Your own Bootstrap Command #R# # A similar bootstrap command is written in Stata: see http://www.econometricsbysimulation.com/2012/07/write-your-own-bootstrap-command.html # The bootstrap command is an extremely powerful command that resamples from you sampling distribution in order to estimate a standard error for an estimator. Often, using the bootstrap resampling technique can yeild estimates of standard errors on estimators that are otherwise extremely difficult to estimate (using the Delta method, the predominant method for estimating standard errors for non-linear functions). # This post will develop a simple function that takes a data source and an estimator and generates standard errors. # First let's simulate some data. nobs = 200 x1 = rnorm(nobs) x2 = rnorm(nobs)/2 u = rnorm(nobs) y = 10-x1+2*x2+u*2 # Now that we have some vectors let's join those vectors together as a data.frame mydata = data.frame(y=y, x1=x1, x2=x2) # A simple OLS regression will generate good standard errors. summary(lm(y~x1+x2, data=mydata)) # We can see that x2 has a standard deviation of about twice that of x1, this is due to x2 having half the standard deviation of x1. # As a check, let's say we would like to estimate the standard deviation of our coefficients using bootstrapping instead. # Let's define our bootstrapping function data.boot <- function(Input.data, command, reps, est=F, hist=F) { # data is the source data that will be passed into the boot strapping command # command is the function that will do the operation that we desire and return a vector # reps is the number of repetitions in the simulation # Let us first get our base estimates original.est <- get(command)(Input.data) # Calculate the number of estimates needed nest = length(original.est) # Make a holder matrix for our bootstrap results holder <- matrix(NA, ncol = nest , nrow = reps) # Note, the length of our original.est is the number of coefficients we will estimate standard errors for. # Calculate the number of observations to resample from nobs <- nrow(Input.data) # Now let's start our bootstrapping resampling loop for (i in 1:reps) { # Draw nobs uniform integers draws from 1 to nobs posdraws <- ceiling(runif(nobs)*nobs) # Sample a random sample from our original data set draw <- Input.data[posdraws,] # Now we should have a new data frame called draw which has nobs draws. Observing draw it should be easy to see there there are some observations which are repeated. # Apply our command to the new data set and save it as a row in holder holder[i,] <- get(command)(draw) } # We have completed our bootstrap rountine. Now we need only perform whatever statistics we want on the results. # Calculate the standard deviation for each column sds <- apply(holder,2,sd) if (hist==T) { mfrow=c(1,1) frame() if (nest<= 3) par(mfrow=c(nest,1)) if ((nest> 3)&(nest<= 6)) par(mfrow=c(3,2)) for (j in 1:nest) hist(holder[,j], main=paste("Estimates of ", names(original.est)[j]), xlab="") } print(rbind(original.est,sds)) return(list(estimates=holder, sds=sds)) } # We will define our first function that we would like to bootstrap the standard errors example1 <- function(BSdata) lm(y~x1+x2, data=BSdata)[[1]] # The key is that we need the function to return a vector to the boostrap command. Thus the subscript [[1]] restricts the returned values to only be the estimates. It my take some manipulation to get a single vector out of an estimate. example1.res <- data.boot(mydata, "example1", 200, hist=T) # This will run a bootstrap of the estimates (200 times) and make a histogram # We can see that the standard deviation of our bootstrap estimates are similar to those of our linear model. summary(lm(y~x1+x2, data=mydata)) # Let's generate some more complex data to test our bootstrapping on x1 = runif(nobs) x2 = rnorm(nobs) x3 = exp(rnorm(nobs)) x4 = log(runif(nobs)) u = rnorm(nobs) y = -10+2*x1+2*x2+.2*x3+1.4*x4+u*3 mydata2 <- data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4) # Let's try something a bit more complex, first we will be estimating more coefficients example2 <- function(BSdata) { # First save the results of the OLS regression into a a<- lm(y~x1+x2+x3+x4, data=BSdata) # Next save the coefficients coef <- a[[1]] # Save the r squared r2 <- summary(a)r.squared

# Specify a vector to return
return(c(coef,r2=r2))
}
example2(mydata2)

example2.results <- data.boot(mydata, "example2", 200, hist=T)

We can see that we are able to estimate not only standard coefficients but also the standard deviation of any estimator generated from sampling.  The extreme draws in r2 is probably due to some of the sampling distributions having very low values (or high values) of the error term by chance in those extreme draws.

## Saturday, September 22, 2012

### OLS w correlation between x and u

* Stata and # R Simulation

set more off
* Turn the scroll lock off (I have it set to permenently off on my computer)

clear
* Clear the old data

set obs 1000
* Tell stata you want 1000 observations available to be used for data generation.

gen x = rnormal()
* This is some random explanatory variable

sort x
* Now the data is ordered from the smallest x to the largest x

gen id = _n
* This will count from 1 to 1000 so that each observation has a unique id

gen u = rnormal()
* u is the unobserved error in the model

sort u
* Now the data is ordered from the smallest u to the largest u

gen x2 = .
* We are going to match up the smallest u with the smallest x.

forv i=1/1000 {
replace x2 = x[i'] if id[i']==_n
}

drop x
* Get rid of the original x variable
rename x2 x

corr x u
/*           |        x        u
-------------+------------------
x |   1.0000
u |   0.9980   1.0000  */

gen y = 5 + 2*x + u*5

reg y x

/*

Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =       .
Model |  50827.8493     1  50827.8493           Prob > F      =  0.0000
Residual |  55.8351723   998  .055947066           R-squared     =  0.9989
Total |  50883.6844   999  50.9346191           Root MSE      =  .23653

------------------------------------------------------------------------------
y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x |   7.145123   .0074963   953.15   0.000     7.130412    7.159833
_cons |   4.858391   .0074869   648.92   0.000     4.843699    4.873083
------------------------------------------------------------------------------
*/

# Now the same thing in R

x = sort(rnorm(1000))
u = sort(rnorm(1000))

y = 5 + 2*x + u*5

summary(lm(y~x))
# This simulation turns out to be extremely easy in R

## Thursday, September 20, 2012

### Stata is to Accounting as R is to Tetris

Both Stata and R handle many of the same data computation needs.

However, researchers must subset data within them very differently.

For Stata subsetting can be extremely easy.

If you want to restrict your data you can simply post after most commands an if statement.

* Stata code
clear
set obs 100
gen y = rnormal()
gen x1 = rnormal()
gen x2 = rnormal()
gen u = rbinomial(1,.5)
reg y x1 x2 if s==1

Thus the OLS regression of y on x1 and x2 will only occure if s=1.  In R this operation can be a little more tricky. Imagine you have a data set called mydata which has four variables y x1 x2 s.  The easiest way to restrict the data would probably be to create a new subset data set.

# R code
mydata = data.frame(y=rnorm(100), x1 =rnorm(100), x2 =rnorm(100), s = rbinom(100,1,.5))
# Create a sub-set of your data by specifying the subset mydata[mydata$s==1,] lm(y~x1+x2, data=mydata[mydata$s==1,])

Thus same operation as above.

Those of you with less experience in R are probably wondering how using brackets accomplished the same task.

In R like in Stata you use brackets to indicate subcripts.

For instance:  The vector "letters" is a built in environmental vector in R that contains all of the letters from a to z.

Thus:

letters[1] # Displays "a"
# Vectors can also be subscripted by vectors (with repetition)
v = c(1,2,3,2,1,5)
letters[v]
# Displays "a" "b" "c" "b" "a" "e"

# Vectors can also be subsetting using logical operators.
vv = rep(c(TRUE,FALSE),13)
# Creates a vector 26 elements long alternating between TRUE and FALSE
letters[vv]
# Will display every other letter starting with a.

# This brings us back to how we subsetted a dataframe.

# Let's make a new data frame called mysample
mysample = data.frame(a = letters, b = 1:26, c = rnorm(26))

# We can subset the data frame by using two subscripts now
mysample[4,2] # Displays 4
mysample[3,]  # Displays an entire row of the data frame.
mysample[vv,] # Will display every other row
# Replacing subsets is notated in a similar manner as subsetting
mysample[vv,1] <- "z" # Will replace every other letter with "z"

# Thus mydata[mydata$s==1,] is telling R to use any row in which variable s of data frame mydata is equal to 1. At this point you are probably thinking that R is overly complicated and that Stata handles data much better. This is not true. Stata handles data in a manner similar to that of an accountant. If you want your accountant to add within rows different values there is no problem. You can even use subscript to move values from one row to another. R on the other hand takes data and transforms it and combines it into new forms often much easier than Stata but with more complex notation. ## Wednesday, September 19, 2012 ### Loops # Loops are an essential building block for most computer algorithms. # They allow for complex tasks to be broken into individual repetitive tasks that save coding, reduce error, and add flexibility. # For simulation design, understanding how and when to use loops is an essential tool. # Like most languages R has several forms of loops that do slightly different routines. # The most common loop I use is the for loop # This loop takes a vector and loops over that vector assigning values to the loop variable based on the position in the vector. # This kind of loop is highly flexible. for (i in 1:10) print(i) # Will print a list from 1 to 10. This is because the argument 1:10 creates a count vector from 1 to 10. 1:10 # An increment of 1 may not be as useful in all circumstances: for (i in seq(-.1,-1,-.1)) print(i) # Though it is somewhat unimportant what the increment is since a count vector can always be transformed into any other linear combination. # Thus also counts from -.1 to -1. for (i in 1:10) print(i/(-10)) # For loops can loop though any arbitrary vector of arguments. for (i in letters[1:10]) print(i) # The vector that will be looped through does not need to be defined within the for loop. names <- c("John", "Bob", "Sussy", "Pieter", "Xin-Xin", "Gonzales") for (i in names)) print(i) # Loops can be used to loop through more than one command. for (i in names) { print(i) print(i) } # Loops can be nested. Meaning that within one loop there can exist another loop which completes its internal loop for every repetition of the outer loop. for (i in names) { for (ii in 1:10) { print (paste(i,ii)) } } # Another common loop form is the while loop which i = 0 while (i<10 p="p"> print(i+1) i = i+1 } # The while loop will continue to repeate the command so long as the condition is true. # While loops are frequently used in instances when the number of loops required to complete a task is uncertain. # Say you wanted to know how many times you would need to apply a function before a value converged or exceeded a particular value. # For instance: f <- function(x) (abs(sin(x))+.1)*(x+.1)*3 x = 0 while (x<100 p="p"> x=f(x) print(x) } # Notice that the entire block of code continues until the end denoted with a } at which point the command checks if the condition is still true. # Obviously this is a very weird function which probably has no real world applications. # However, there are many complex functions such as sorting algorithims for which it is impossible to know in advance the number and range of iterations neccessary to accomplish a specific task. # Another type of loop similar to the while loop is the repeat function. # This function acts similar to the do while loop in other languages. # The do while loop will not run the code even once if the condition has already been met. # However the repeat command (or at least the structure that I have defined) will execute the code up until the point at which the break command is triggered. x = 1000 repeat { x=f(x) print(x) if (x>=100) break } # One repetition of the loop was run through before the "break" command tells R to stop repeating. # The repeat command can also be broken mid way through a repetition of code. x = 0 repeat { x=f(x) if (x>=100) break print(x) } # Thus only prints 71.83 rather than the final value 110.77 # The repeat form of the code acts very similar to that of the while. The break command works for while loops as well. ## Tuesday, September 18, 2012 ### Why Simulate? Simulations are a powerful tool available to researchers in all kinds of work from economics, to engineering, to psychological measurement. They provide researchers with a unique tool that can be used to explore detailed thought experiments. Researchers can simulate anything from the relatively straightforward testing of the inherently unobservable effectiveness of new or existing econometric/statistical methods to more complex simulations of human or non-human movement and behavior such as simulations of the transmission of malaria parasites or zombie hordes. Simulations are frequently used to support theories developed a proiri to the simulation and advanced by authors who use simulations to complement their arguments. The literature is replete with examples of new methods or theories that use simulations to complement their theoretical results. Simulations can also be useful tools in exploring new problems and solutions. An ideal exploratory simulation will have an environment rich in reasonable and representative environmental information. A perfect example of this is an engineering simulation I observed in which the automobiles with various numbers of wheels, wheel sizes, and wheel positioning were created in the simulation and the computer reran the model assigning different parameter values to each repetition searching the best design. However, when simulating global environments it becomes all the more important that simulated agents are representational of real agents because nearly any result is possible if one controls the environment of the simulation. Of course there are other reasons for do simulations besides the ones listed above. Sometimes simulations are seen as a mechanism for uncovering relationships that are too difficult or complex to identify outside of a constructed framework. Typical examples of this are simulation that test violations of assumptions in statistical/econometric methods. Looking at instrumental variables for instance, what happens when an instrumental variable does not have a very strong predictive power on the endogenous variable? And what happens when it is also slightly endogenous? Yet other simulations are used to dismiss or undermine criticisms. Often times it is seen as the fatal flaw in an argument if the estimator is shown to be biased theoretically. However, simulations can be used to look at the scope of reasonable bias expected and see how it would be expected to affect estimates. It is not uncommon that biased, even inconsistent estimators turn out to be reasonable and effective even when the assumptions are not perfectly met. As hopefully is clear, there are many good reasons for developing simulations. This blog will hopefully guide you through of some of the most common types of simulations as well as through more obscure types as well. ## Monday, September 17, 2012 ### My First Bayesian (Markov Chain Monte Carlo) Simulation # My First Bayesian (Markov Chain Monte Carlo) Simulation # I know very little about Baysian methods and this post will probably not reveal much information information. However, it should hopefully lead to many more fruitful posts in the future. # You will need the package MCMCglmm (Markov Chain Monte Carlo) require(MCMCglmm) # Let's set the random seed set.seed(121) nobs = 100 x=rnorm(nobs) e=rnorm(nobs) y=10+2*x+e*20 d <- data.frame(x=x,e=e,y=y) mc <- MCMCglmm(y~x, data=d) plot(mc$Sol)

# This will give the estimates of the coefficients that we are typically looking for.
posterior.mode(mc$Sol) # Probability that the intercept is greater than 0 table(mc$Sol[,1] > 0)/1000
# 100% of the estimates of the intercept indicate that it is greater than zero.

# Probability that the coefficient on x is greater than 0
table(mc$Sol[,2] > 0)/1000 # 888% of estimate results indicate that the coefficient on x is greater than 0. # The 95% confidence interval of the MCMC estimates can be found: HPDinterval(mc$Sol, 0.95)

summary(lm(y~x,data=d))
# We can see that the standard linear model fails to reject the null for the coefficient on x even though it is large.

# To read a small tutorial on the use of MCMC written by the package's author J.D. Hatfield
# http://cran.uvigo.es/web/packages/MCMCglmm/vignettes/Tutorial.pdf

## Sunday, September 16, 2012

### The True Test Score World

* We have results ten different tests administered to 10000 respondents, each taking 10 different tests, which each experience some symmetric homogenous measurement error E(e)=0.

clear
set obs 1000

* The true score of each respondent is tt
gen true_score = rnormal()*2
label var true_score "True Score - Intelligence test"

gen age_group = mod(_n-1,5)
label var age_group "Age grouping"

* This is creating
expand 10

* Measurement error
gen e = rnormal()

* The observed score is a result of the true score as well as a difference in ability resulting from age as well as some error in measurement.
gen obs_score = true_score+age_group+e*(1+age_group/2.5)
* Performance of students harder to measure as they get older (thus the 1+e/15) term.

corr obs_score true_score
* We can see that our observed score is not a very good measure of true intelligence.

* This is of course primarily because our students get uniformily better at the designated task the older they are.

* If we looked within age group we should be able to see that our estimate is working better.

bysort age_group: corr obs_score true_score
* Now we can see our correlations are looking pretty good.

* However, we want to figure out which of our students are the brightest among all age groups and put those students in special programs to encourage their developement.

* So we want to measure of intelligence that is independent of age.

* One method would be to subract the mean of the observed score for each age.
bysort age_group: egen meaned_score = mean(obs_score)

gen score_demean = obs_score-meaned_score

corr score_demean true_score

* We know that intelligence is distributed equally across all grades.  Might we use this knowledge to rescale our observed scores to get a better estimate?

bysort age_group: egen sd_score = sd(obs_score)

gen standardized_score = score_demean/sd_score

corr standardized_score true_score

* We can see interestingly, that standardizing the score by age group gives us the best estimate of intelligence accross all age groups.  This is exploiting the knowledge (assumption normally) that the underlying distribution of intelligence is the same for all age groups.

## Saturday, September 15, 2012

### Strictly Parallel Tests Classical Test Theory Results

* Where x1 is the observed score on test 1 and x2 is the observed score on test 2.

* t1 = t2 which is the true score for students (which is equal because they are strictly positive)

* e1, e2, are uncorrelated with each other or t1 and t2 and have the same standard deviation.

* Observed scores x1 and x2 are calculated as x1 = t1 + e1, x2 = t2 + e2

* Classical test theory tells us in strictly parralell test corr(x1,t1)^2 = corr(x2,t2)^2 = corr(x1,x2)

* corr(x1,t1)^2 = {E[(x1-E(x1))(t1-E(t1))]/sd(x1)sd(t1)}^2

* corr(x1,t1)^2 = {E[(t1+e1-E(t1+e1))(t1-E(t1))]/sd(x1)sd(t1)}^2

* corr(x1,t1)^2 = {E[(t1+e1-E(t1))(t1-E(t1))]/sd(x1)sd(t1)}^2

* corr(x1,t1)^2 = {E[(t1+E(t1))(t1-E(t1)) + e1(t1-E(t1))]/sd(x1)sd(t1)}^2

* corr(x1,t1)^2 = {(sd(t1) + 0)/sd(x1)sd(t1)}^2

* corr(x1,t1)^2 = {sd(t1)/sd(x1)sd(t1)}^2

* Since the tests are parrellel sd(e1)=sd(e2) & sd(x1)=sd(x2) along with corr(e1,t1)=0 and corr(e2,t2)=0 implies.

* corr(x1,t1)^2 = {sd(t1)/sd(x1)sd(t1)}^2 = {sd(t2)/sd(x2)sd(t2)}^2 = corr(x2,t2)^2

* Let's see what this looks like:

clear
set obs 10000
* Imagine we have test results back for 1000 students

gen t1 = rnormal()
gen e1 = rnormal()
gen x1 = t1+e1

gen t2 = t1
gen e2 = rnormal()
gen x2 = t2+e2

qui corr x1 t1
di r(rho)^2

qui corr x2 t2
di r(rho)^2

* Good, our simulation confirms our algebra.

* Other results are:

qui corr x1 x2
di r(rho) " = corr x2 t2"

local corr_x1x2 = r(rho)

sum t1
local var_t1 = r(sd)^2

sum x1
local var_x1 = r(sd)^2

di "var(t) = var_t1' = var(x)*corr(x1x2) = =var_x1'*corr_x1x2''"

* And

sum e1
local var_e1 = r(sd)^2

di "var(e) = var_e1' = var(x)*(1-corr(x1x2)) = =var_x1'*(1-corr_x1x2')'"

* Note, all of these equalities are approximate because we are use a random simulated sample estimates of our parameters.

## Friday, September 14, 2012

### Playing around with IRT Graphs

# It is often useful to plot item response theory (IRT) graphs

# I will use the three parameter IRT model:

# ie. P(X=1) = c + (1-c)*(exp(a*(theta-b))/(1+(exp(a*(theta-b)))

# Let's first declare it as a function

irt = function(theta,a,b,c) c + (1-c)*(exp(a*(theta-b))/(1+(exp(a*(theta-b)))))

# Thus we can calculate the probability that a person with a theta = 1 will get a problem right that has parameters a=.3,b=2,c=.2.

irt(theta=1,a=.3,b=2,c=.2)

# R is convient in that functions can take either vectors or scalars.  Thus if irt is given a scalar and a vector it will reuse the scalar multiple times to be of length equal to the vector.

# So to map out an ability range that we are interested in, let's first define a vector for the relevant range of theta.
theta_map = seq(-4,4,.1)

# Let's see how the IRT function will return a probability response vector of length equal to theta:
irt(theta=theta_map,a=.3,b=2,c=.2)

# Let's see it plotted:
plot(theta_map,irt(theta=theta_map,a=.3,b=2,c=.2), type="l", ylim=c(0,1))

# This would probably not be a very useful item becuase there is not much variation in response probability based on ability level of respondents.

##########################    VARYING a      ############################
# Let's map a few competing items. First let's start with an empty plot:
plot(theta_map, 0*theta_map, type="n", ylim=c(0,1),
xaxs = "i", yaxs = "i",
ylab = "Probability of Correct Response",
xlab = ~ theta,
main="The Effect of Varying a")

# In many IRT models the a parameter is the "discrimination parameter"
# Let's see what happens when we vary it.
for (i in seq(0,2,.1)) {
lines(theta_map,irt(theta=theta_map,a=i,b=2,c=.2), type="l", ylim=c(0,1))
}

# We can see that varying a from 0 to 2 leads the item to have more ability to "discriminate" between respondents with ability less than b and those with ability more than b.

# The lowest a can go and still make sense is 0 while there is no upper limit.  As a approaches infinity, the function becomes a step function at the point b.
##########################    VARYING b      ############################
plot(theta_map, 0*theta_map, type="n", ylim=c(0,1),
xaxs = "i", yaxs = "i",
ylab = "Probability of Correct Response",
xlab = ~ theta,
main="The Effect of Varying b")

# In many IRT models the b parameter is the "difficulty parameter"
# Any student with ability above that parameter should find the chance of answering it greater than  c+(1-c)/2 = (1+c)/2.  That is the probability of guessing at the right answer plus half the remaining probability needed to be spanned.
for (i in seq(-3,3,.3)) {
lines(theta_map,irt(theta=theta_map,a=1,b=i,c=.2), type="l", ylim=c(0,1))
}

# By varying b we map out the ranges of abilities that students posses.  Thus varying difficulty of problems allows us to guess at were on the theta spectrum any student may lie.

##########################    VARYING c      ############################
plot(theta_map, 0*theta_map, type="n", ylim=c(0,1),
xaxs = "i", yaxs = "i",
ylab = "Probability of Correct Response",
xlab = ~ theta,
main="The Effect of Varying c")

# In many IRT models the b parameter is the "difficulty parameter"
# Any student with ability above that parameter should find the chance of answering it greater than  c+(1-c)/2 = (1+c)/2.  That is the probability of guessing at the right answer plus half the remaining probability needed to be spanned.
for (i in seq(0,1,.05)) {
lines(theta_map,irt(theta=theta_map,a=1,b=0,c=i), type="l", ylim=c(0,1))
}

# As c gets large the probability span of the IRT function goes from 1 to 0.  This means that the difference in probabily as a result of ability is decreasing as the guessing parameter gets large.  Thus, a test item that has a high guessing parameter in general has less ability to discriminate between low ability respondents and high ability respondents.  Thus, true or false questions may be uniformative indicators of student ability in many cases.

##########################    Overview      ############################

# What do these graphs tell us about the ideal parameter set of our items?

# In general higher a is better so as to help us discriminate between ability levels.

# There is no obvious ideal level of b.  If we have a specific ability level that we would like to make sure all of our students reach then having all of our items share a similar b might be ideal.  However, if on the other hand we would like to be able to asses the overall ability of students then having items with bs that vary along our entire range of interest might be preferable.

# As for the guessing parameter, I am pretty sure we always want that to be smaller.  That is, if it is harder for students to guess the right answer then when students get the right answer, we have more confidence that that reflects their true ability rather than a good guess.

## Thursday, September 13, 2012

### Relating Classics Test Theory Parameters to IRT Parameters

* Relating Classics Test Theory Parameters to IRT Parameters

* This post follows some of the discussion in Multidimensional Item Response Theory by Mark Reckase's chapter 2.

* First let's relate Classical Test theory ideas of difficulty to that of IRT parameter b - difficulty.

* We will use for our underlying true DGP the three parameter logistic model.

* (2.13) P(U=u) = c + (1-c) * exp(a(t - b))/(1 + exp(a(t - b)))

* That is, the probability of getting the problem right, is a function of c (the guessing parameter of the item), a (the discriminatory parameter of the item), b (the difficulty parameter of the item), and t (the ability of the test taker).

* We want to equate this to the classical test theory idea of difficult.  In classical test theory the probability of getting a item correct is the difficulty of the item.

* Let us imagine a heterogenous group of 1000 students

clear
set obs 1000

* Create a student ID
gen stud_id = _n

gen t = rnormal()

* Now let's see how we can compare classical difficulties (CD) to IRT difficulty parameter b.

* Let's imagine that all of our students test the same test with 100 items that range in b value which is independent of the choices of parameters a and c.

* Let's have all of our data listed vertically.

* Create 200 test items for each student
expand 200

* Give each item a different ID
bysort stud_id: gen item_id = _n

* There are many ways to make sure all of the items have the same parameters.  I will use a for loop though generating a seperate data set for all of the items and merging it in would be another good way or drawing all of the parameters from distributions then taking the average accross all of the items of the same ID would probably be the most efficient code wise but would make it difficult to specify exact distributional parameters.
gen a = .
gen b = .
gen c = .

qui forv i = 1/200 {
* This will only draw one random variable for each local macro
local a = runiform()/2+.4
local b = rnormal()*2
local c = runiform()/4

* This will assign that draw to the item i'
replace a = a' if item_id==i'
replace b = b' if item_id==i'
replace c = c' if item_id==i'
}

* Now let's generate the probability of getting that problem correct given the parameter values and the student t scores.

gen P = c + (1-c) * exp(a*(t - b))/(1 + exp(a*(t - b)))

* If we try to do a direct scatterplot then we are overwhelmed.

* Instead we want to know the probability of a correct answer for each item (given the population being tested).

* Let's us first preserve the current state of our data.
preserve

* So we collapse the data set to item level.

* The default of the collapse command is to take the mean.
collapse a b c P t, by(item_id)
* I included the t value just as a debugging test.  t should be constant accross all items.

label var b "IRT b"
label var P "Difficulty (Probability of Correctly Answer)"

scatter P b

* The reason things start fanning out as b gets large is due to the guessing parameter c.

* Even when b is so large that the probability of getting the answer correct based on knowledge is close to zero there is still the chance of guessing the correct answer.

restore

* In order to compare the discrimination parameter to classical test theory we will look at the Point Biserial correlation.  Which is the correlation between a the responses to an item on the test and the total test score.

* First we need to draw actual item reponses

gen u = rbinomial(1, P)

* Now let's generate total test scores

bysort stud_id: egen total_score = sum(u)

* Now let's generate the point-biserial values for our items

gen pbiserial = .

qui forv i = 1/200 {
corr u total_score if item_id == i'
replace pbiserial = r(rho) if item_id == i'
}

preserve
collapse a pbiserial, by(item_id)

label var a "IRT discriminatory parameter (a)"
label var pbiserial "Classical test theory point-biserial correlation"
twoway (lfitci pbiserial a)  (scatter pbiserial a)

restore

* In order to approximate c using just classical test scores we will look at the lowest 10% of students in terms of total scores and see how they perform on each question on average.

xtile score_pct = total_score, nquantiles(10)
* This should have created 5 groups ranked accounting to total_score

bysort item_id: egen c_hat = mean(u) if score_pct == 1

preserve
collapse c c_hat, by(item_id)

label var c "IRT guessing parameter (c)"
label var c_hat "A guess at the guessing parameter"
twoway (lfitci c_hat c)  (scatter c_hat c)

restore

* We can see there is some relationship between our guess at the guessing parameter using the item responses for the lowest 10% of students and the true guessing parameter.

* The problem with this graph is that items have different difficulties

* Finally let's look at out estimates of student ability t relative to that of their total test score

preserve
collapse total_score t, by(stud_id)

label var total_score  "Total test score"
label var t "IRT univeratiate ability (t)"
twoway (lfitci total_score t)  (scatter total_score t)

restore

* It seems that total test score provides a reasonable linear approximation for student ability even when ability is drawn using a IRT data generating process.

* The largest advantage of IRT relative to that of classical test theory total test performance estimators is the external applicability of the estimates.  IRT is supposed to predict future performance on different tests.  While, classical test theory only predicts performance on the same or similar tests (if I understand this properly).

## Wednesday, September 12, 2012

### Why Item Response Theory is very Cool!

Currently I am reading Multidimensional Item Response Theory by one of my mentors Mark Reckase and I realize how very cool item reponse theory (IRT) is.

The underlying purpose of item response theory is measure two things simultaneously.

1. Measure the item's parameters.  This means measure the characteristics of the test item such as difficulty of the item, likelihood that someone will correctly guess the item, and typically the discriminatory power of the item, ie. how much power the item has to identify the difference in ability between one student and another student.

2. While attempting to measure the items' parameters on a test, IRT methods must also estimate the ability of the students taking the test simultaneously.  That is to say that each student has a different level of ability when entering the test and generally that ability level is unknown a-priori.  Of course after taking the test the ability level is still not "known" but at least some reasonable approximation of the ability level of the student should be known at that time.

So why is this cool?  Well, imagine having a data set with K items (variables) and N students (observations). Now, only knowing that either the students got the answers correct or they failed do an estimation which is largely similar to a logit model except that your exogenous variables (Xs) also need to be estimated.

At this point many of you will probably say, well, this is clearly possible if difficult you are willing to make some identifying assumptions.

So far, the only identifying assumptions that I see needed are 1. The probability that a student answers a question correctly is greater the greater the ability of the student, dP(Y=1|X,T,D)/dX>0. And 2. The probability that the question is answered incorrectly is greater the greater the difficulty of the  item dP(Y=1|X,T,D)/dD<0 .="." and="and" assumption="assumption" form="form" know="know" make="make" model.="model." nbsp="nbsp" of="of" p="p" structural="structural" that="that" the="the" underlying="underlying" we="we" well="well">
I will be posting more on IRT as I learn.

## Tuesday, September 11, 2012

### Calculate an Empirical CDF for a random variable

# First lets define the function. ecdf (Empirical Cumulative Distribution Function)
ecdf <- function(x, bins=100) {

# We will sort the imput variable from smallest to largest
x_sort = sort(x)

# Now we will create a percentile variable that will rank x from slightly above zero to slightly below 1.
pct =  (1:length(x)/(length(x)+1))

# We will now ceate a number of groups equal to length(x)/bins
grps = ceiling(pct*bins)

# First let's create an empty vector of zeros with length = length(pct) = length(x)
avpct = 0*pct
# Next, let's replace avpct with the mean of pct for each group.
for (i in 1:bins) avpct[grps==i] = mean(pct[grps==i])

# Create a bunch of empty value to hold the CDF values
cdf_bins = cdf_pcts = cdf_values = 1:bins

# Now loop through each bin value
for (i in 1:bins) {
cdf_pcts[i] = mean(pct[grps==i])
cdf_values[i] = mean(x_sort[grps==i])
}

# Because there is no
data.frame(bins=cdf_bins, pcts=cdf_pcts, xvar=cdf_values)

}

# We will first generate some data, y is 4000 normal draws divided by a uniform(.1,1.1) draw.
y <- rnorm(4000)/(runif(4000)+.1)

cdf1 = ecdf(y, 100) # Calculate an Empirical CDF for variable y with 100 bins
cdf1

# We can see the ECDF is pretty steep around the mean 0.  This is because most of the observations fall between -10 and 10, yet a very few are either large (around 30) or small (around -30), forcing the x dimension to be large.
plot(cdf1$xvar, cdf1$pcts, type="l")

# In a future post I hope to use the empirical CDF information to calculate an inverse CDF that can be used to draw correlated non-normally distributed random variables.

## Monday, September 10, 2012

### LIE: Law of Iterative Expectations, Word Problem

The law of iterative expectations is extremely useful and I have been trying to think of ways of explaining it.

It basically state that E(Y)=E(E(Y|X))=E(E(Y|X,Z)) where X and Z are sets of covariates.

See my previous post for more information: Law of Iterative Expectations/Law of Total Expectations

I have been trying to explain it in a manner that is intuitively appealing.  Often times I find notation to be a little difficult to dissect.

Imagine Bob and Susie are brother and sister, they go to the same school and always buy lunch.  Their parents randomly decide how much to give them for lunch each day.  But their parents like Susie more than Bob so however much money they give Bob, Susie expects to get a larger amount of money.  There are a few things that you might want to find out.

a. What is the expected amount of money that Bob E(X) will get?
b. What is the expected amount of money that Susie E(Y) will get?
c. What is the distribution of lunch money to Bob (pdf(X))?
d. How much money does Susie expect to get given that we know how much Bob got E(Y|X)?

We know already from the setup that E(Y)>E(X).  {This is not important for the example}

If we know a and b, can we infer c or d? No because in general E(Y)=f(E(X)) and unless we make some assumption about f, then cannot identify the relationship between a and b.

What about if we knew a,b,c? No.  Imagine, if we had a simple distribution p=1/2, x=1: p=1/2, x = 5, thus E(X) = 3, because 1/2*1+1/2*5 = 3.  Imagine that we know E(Y)=6.  This still does not infer the relationship d.  Why, because E(Y|X)=2*X or E(Y|X)=X+3 or any other of a set of infinite functional forms.

* So what can we infer?

* 1. If we know the distribution of X (c), then we know the expected value of X (a).  This is because the expected value function is defined as only the integration of x across the pdf of x.

* 2. If we know the distribution of X (how frequently Bob gets each amount of money) and we know the expected relationship between how much money Bob gets and how much Susie gets (d) then we can figure out how much money Susie gets.  How does this work?  Well imagine that Bob gets $2 with probability 1/3,$4 with probability 1/3, and $6 with probability 1/3. Imagine also that Susie expects to gets the squared amount of whatever Bob got E(Y|X)=X^2. Thus we can figure out how much Susie expects to get on any day without knowing how much Bob has gotten yet E(Y). E(Y)=E(E(Y|X)) = 1/3*2^2 + 1/3*4^2 + 1/3*6^2 = (56)/3 ~ 18.6. ## Sunday, September 9, 2012 ### Matrix Operations in Mata * This post demonstrates a few methods for how to input matrices into Mata and how to do some basic matrix operations. Mata is a matrix programming language so basic matrix operations are extremely easy. mata // Let's first build some matrices in Mata // They can be built directly A = ( 2 , -1, 5 \ 3, 0 , -1 \ 3, 3 , -1) A B1 = (0 , 3 ,-1) B2 = (3 , 0 , 0) B3 = (0 , 2 , 0) // Or through a combination of vectors B = (B1 \ B2 \ B3) B // We can also start with an empty matrix and fill in values C = J(3,2,4) // The J command creates a matrix with 3 rows, 2 colums, and with default values of 4 C // We can replace individual elements once the matrix is created C[1,2] = 3 C[3,1] = 7 // Or entire submatrices C[2,] = (2,6) C // Now let's see how various matrix operations perform in Mata: // a. AB A*B // b. BA B*A // c. A+B A+B // d. A'B' A'*B' // e. B'A' B'*A' // f. A'B A'*B // g. (AB)' (A*B)' // h. ABC A*B*C // i. C'AC C'*A*C // j. CAC' C*A*C' // This does not exist because C*A cannot be multiplied as C is 3x2 and A is 3x3 // k. trace(C'AC) trace(C'*A*C) // l. trace(CAC') trace(C*A*C') // Likewise the trace does not exist end * Matrices can also be input into Mata from data sets. clear set obs 5 gen y = 3 replace y = 6 if _n == 2 replace y = 10 if _n == 3 replace y = 8 if _n == 4 replace y = 2 if _n == 5 mata // The command st_data retrieves data from stata variables to be used in Mata y = st_data(. , "y") y // Once input, matrices can easily be manipulated y*y' // The square of the norm of y norm(y)^2 // This happens to be identical to: sum(y:^2) // The norm is the equclidean norm which is the square square root of the sum of all of the squares of a vector. // Thus the square of it is just the sum of the squares. // As should be clear, manipulating matrices in Mata is extremely easy. // Thus Stata is able to pack a powerful Matrix programming language inside an effective high level user language. // The largest frustration that I have had with Mata is the relative quality of the documentation. // I find the documentation of Mata much harder to use than that of Stata (at least in version 11, perhaps version 12 has better documentation). end ## Saturday, September 8, 2012 ### Mata speed gains over Stata * The inclusion of Mata as an available alternative programming language for Stata users was a great move by Stata. * Mata in general runs much quicker than programming on the surface level in Stata. * In Stata each loop that runs is compiled (interpretted into machine code) as it runs creating a lot of work for the machine. * In Mata on the other hand, the entire loop is compiled prior to running. * Let's see how this works. * Let's say we want to add up the square of the numbers 1 through 100000 * Method 1: Surface loop timer clear 1 timer on 1 local x2 = 0 forv i = 1/1000000 { local x2 = x2'+i'^2 } di x2' timer off 1 timer list 1 * On my laptop, this takes about 13.5 seconds * Method 2: Mata loop timer clear 1 timer on 1 mata x2=0 // This command can be read as start i at 1, // keep looping so long as i is less than 1000000, // the third argument looks a little fishy but it is syntax // that has been around for a while (at least since C). // It would be identical to writing i=i+1, in other words, add 1 to i. // Following the for loop we can immediately place a since line command. for (i = 1; i <= 1000000; i++) x2=x2+i^2 // If there is nothing done with the value x2 then mata displays this value. // R handles this identically x2 end timer off 1 timer list 1 * In contrast, my computer completed the loop using mata in .27 seconds, many magnitudes of speed faster. * However this does not mean you need to learn to use mata (since it has its own limitations and syntax) in order to speed up your commands. * Method 3: Use Stata's data structure to accomplish vector tasks timer clear 1 timer on 1 clear set obs 1000000 gen x2 = _n^2 * The sum command will calculate the mean of x2 which is the same as the sum of x2 divided by it's number of observations. sum x2 * We can reverse that operation easily. di r(N)*r(mean) timer off 1 timer list 1 * Using a little knowledge of how Stata stores post command information this method does the same trick in .2 seconds * Method 4: The speed gains in 3 was as a result of using the vector structure of data columns. Mata can do very similar things even easier. timer clear 1 timer on 1 // This command looks a little fishy, but it is easy to understand. // Order of operations must be taken into account. // First the 10^6 is evaluated which equals 1000000 // Then the vector 1..10^6 is made which looks like 1 2 3 ... 1000000 // The .. tells mata to make a count vector. // If I had written :: then mata would have made a column vector instead. // Once the vector is made then the command :^2 tells stata to do a piece wise squaring of each term in the vector. // Finally the sum command adds all of the elements of the vector together to generate the result we were looking for. mata: sum((1..10^6):^2) timer off 1 timer list 1 * The result is that this command only took .04 seconds to run through efficient coding in Mata. # As a matter of comparison, this command system.time(sum((1:10^6)^2)) # took .04 seconds in R # And the loop: x=0 system.time(for(i in 1:10^6) x=x+i^2) # 1.3 seconds # Thus Mata in this example is significantly faster than Stata and about the same speed as R. ## Friday, September 7, 2012 ### Matrix operations in R # There are many ways of inputing matrices into R # Cbind will bind vectors or other matrices together by adding on the columns of one to the other. A = cbind(c(2,3),c(3,2),c(1,3)) # by the way, the c() function binds a set of elements seperated by commas into a vector. A # Rbind does the same as cbind but uses rows instead B = cbind(c(-1, 1, 4), c(0, 1, 5), c(3, -1, 1)) B # We can also create a matrix by using the matrix command. # Data input in this manner is read from a vector into columns. C = matrix(c(1,4,3,2),nrow = 2, ncol=2) C # An array can accompish the same effect as a matrix. # The largest difference is that an array can take on more than two dimensions. D = array(c(10,4,5,2), dim=c(2,2)) D # One need not populate an matrix/array at creation when specifying matrix or array. E = matrix( NA, nrow = 4, ncol = 4) E # We can see the matrix is filled with empty values # We can replace individual elements by specifying their positions E[1,1] = 3 E[2,1] = 0 E[3,1] = 4 E[4,1] = -1 # As well as entire columns or rows E[,2] = c(0,2,3,2) # Or subsection of the matrix with another matrix E[,3:4] = cbind(c(2,3,2,1),c(-1,2,1,0)) E # Sometimes you do not need to specify every element of the matrix if there are common elements F = matrix(0, nrow=3, ncol=3) F[1,1] = 4 F[2,2] = 3 F[3,3] = 6 F # F is a diagnol matrix which is populated first with 0s then filled with the diagnol values. # Sometimes we start with a data set and want to convert it to a matrix G = data.frame(score1=c(5,10,-7),score2=c(-1,17,3),score3=c(0,5,2)) G # Now let's convert it to a matrix G = data.matrix(G) G # Now let's do some matrix operations with the matrices we have defined: # a. C x D C D C %*% D # Which is different from element wise multiplication which is C * D # b. A x B A B A %*%B # c. B x A # d. E x E' = E times the transpose of E E E %*% t(E) # e. F^-1 or F inverse. We know the inverse of F exists because it is a diagnol matrix with poisitive diagnol non-zero elements. solve(F) require(MASS) ginv(F) # Neither commands work for this particular application. # However, it is easy to see that in a diagnol matrix the inverse is just. # First we will make Finv the same as F Finv = F # This is an interesting bit of R functionality # On the right the diag command is retrieving the diagnol of F and doing a element wise 1/x operation. # The diag on the left is specifying target values to be replaced. diag(Finv) = 1/diag(F) # This might seem a little odd. Let's try this kind of thing on Z Z <- Finv Finv diag(Z) <- 23 Z # In this case because 23 is a single number that can be duplicated throughout the diagnol of Z, there is no problem. # We can check that this is really the inverse of F Finv %*% F # Or equally good F %*% Finv # Interestingly element by element multiplation yeilds the same result in this example F * Finv # f. Rank(D) D qr(D)$rank

# F only has rank 1.  This can be seen by dividing col 1 by col 2.

D[,1]/D[,2]

# Both numbers are two meaning item [1,1] = [1,2]*2 and [2,1]=[2,2]*2

# Interestingly this trick works for rows as well:

D[1,]/D[2,]

# g. B + G

B
G

B + G

# h. B - G

B - G

# Addition and subtraction is element by element in matrix notation

# i. Rank(C)

C
qr(C)$rank # C does not suffer from linearity problems C[1,]/C[2,] # j. -G G -G # k. Trace(E) # Trace is the sum of diagnol elements E sum(diag(E)) # l. Rank(F) # Because F is a diagnol matrix (with no zero diagnol elements) the rank must be equal to the min of the dimensions, 3. qr(F)$rank

# kF, where k = -7

-7*F

# n. BF

B
F
B %*% F

# o. FB

F %*% B
# With matrices, BF != FB

# p. determinate of C or |C|
C
det(C)

# q. |D|
D
det(D)
# Which looks kind of funny but that is because R usings search algorithms to find the determinate and they are not exact.

# But calculating the determinate of a 2d matrix is easy:
D[1,1]*D[2,2]-D[1,2]*D[2,1]

# r. | CD |
C; D
H = C %*% D
det( H )
# Which also happens to be zero

H[1,1]*H[2,2]-H[1,2]*H[2,1]

# Sorry for the repetitive examples.  I had homework for a class and thought I might as well turn it into a post that someone might find useful.

## Thursday, September 6, 2012

### Drawing jointly distributed non-normal random variables

* This method only approximates joint non-normal draws (which is really what any method does).

* I was recently told that it was "impossible" to draw joint non-normal distributions.

* But you will see that the approximation looks pretty good.

* It is easy to draw jointly distributed non-normal draws so long as you can start by drawing jointly distributed normal draws.

set more off

clear
set obs 10000

* For instance let's draw four variables.
* 1. a chi2 with 5 degrees of freedom
* 2. a poisson k = 5
* 3. a uniform variable with min = -5 and max = 5
* 4. a random f distribution draw with 5 and 5 degrees for numerator and denominator degrees of freedom.

* First we will specify the correlation matrix.
* The only constraint as far as I know is that the covariance matrix has to be PSD.
* This in practicality limits the possible correlations between variables since cross terms tend to cause vialations more likely in the PSD requirement.

matrix c = (  1, .7,-.3,  .2 \ ///
.7,  1, .2, -.1 \ ///
-.3, .2,  1,  .3 \ ///
.2,-.1, .3,   1 )

* If we do not specify a mean or covariance matrix then the default draws are standard normals which is what we want for simplicity.
drawnorm x1 x2 x3 x4, corr(c)

corr x?
spearman x?

* Now all we need to do is turn our normal draws into uniform draws.
* Note: if x~N(0,1) and THETA is the CDF of the normal then y=CDF(x)~uniform
* So for any new distribution with CDF ALPHA and inverse INVALPHA the variable z=INVALPHA(y) ~ alpha.

gen y1 = normal(x1)
gen y2 = normal(x2)
gen y3 = normal(x3)
gen y4 = normal(x4)

sum
* Looking good.  The next step is that we take the inverse CDF of the distributions of interest.

gen z1 = invchi2(5, y1)
label var z1 "chi2"
* The inverse poisson distribution seems to be incorrectly defined in Stata so that it uses 1-p rather than p to calucalate the inverse.
gen z2 = invpoisson(5, 1-y2)
label var z2 "Poisson"
* It is easy to transform a uniform (0,1) to (a,b) by subtracting a and multiplying by (b-a)
gen z3 = y3*10-5
label var z3 "Uniform"
gen z4 = invF(5, 5, y4)
label var z4 "F distribution"
corr z?
spearman z?
* We can see that the spearman rank correlation is maintained with the standard pearson correlations are only slightly diminished by the non-linear transformations.

* In general the correlations are slightly drawn towards zero so if possible it might be worth it to exagerate the correlations in the matrix c so that they end up being drawn more closely to the desired levels.

hist z1, saving(chi2, replace) nodraw
hist z2, saving(poisson, replace) nodraw
hist z3, saving(normal, replace) nodraw
hist z4, saving(invF, replace) nodraw

graph combine chi2.gph poisson.gph normal.gph invF.gph

* Much of the content of this post was covered in a previous post under the title: Drawing Rank Correlated Random Variables.  It might be worth looking over the previous post if you have additional questions.