Monday, December 9, 2013

To Cluster or Not to Cluster - That is the Question

Several prominent econometricians have come down on either side of this
question, typically in settings where the number of clusters is small
and the number of observations in each cluster is large.  One camp
argues that clustering is the only reasonable way to obtain consistent
standard errors, while the other camp argues that clustered standard
errors have poor small-sample properties when clusters are few.
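
For reference, by "clustering" I mean the cluster-robust (Liang-Zeger) variance estimator that Stata's cluster() option implements, which, up to finite-sample corrections, takes the sandwich form

$$\widehat{V}(\hat\beta)=(X'X)^{-1}\left(\sum_{g=1}^{G}X_g'\hat{u}_g\hat{u}_g'X_g\right)(X'X)^{-1}$$

where $g$ indexes the $G$ clusters.  Its justification is asymptotic in the number of clusters, which is exactly where the two camps part ways.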

In this post I will explore how clustering works given these different
conditions.

cap program drop cluster_se

program define cluster_se, rclass
syntax [anything], NGrp(numlist >0 integer) NInd(numlist >0 integer)
  
  clear
  set obs `ngrp'

  * In order to examine clustering we will structure the error
  * in one of the most basic forms possible.  Each group gets
  * its own value of x and its own group-level unobservable v,
  * and x and v are uncorrelated.
  gen x=rnormal()
  gen v=rnormal()
  gen id=_n

  * Duplicate the data by number of individuals in each group
  expand `nind'

  * u is a function of both the group mean error and the individual
  * observation
  gen u=v+rnormal()

  * In order to test if our standard errors are working properly
  * let's look at the rate of false rejections given that there
  * is no true relationship between x and y.
  gen y=0*x+u
  
  * Basic regression
  reg y x
  return scalar noclus_se=_se[x]
  
  * Grab the p values
  test x=0
  return scalar noclus_r=`=r(p)<.05'

  * Now cluster across our groups
  reg y x, cluster(id)
  return scalar clus_se=_se[x]
  
  test x=0
  return scalar clus_r=`=r(p)<.05'

end

* First let's see how our new command works
cluster_se , ng(100) ni(100)
return list

simulate clus_r = r(clus_r) clus_se = r(clus_se)  /// 
       noclus_r = r(noclus_r) noclus_se = r(noclus_se),  /// 
    rep(100): cluster_se , ng(100) ni(100)

sum
* We can see that by clustering we get roughly the correct 5%
* rejection rate, while by failing to cluster we reject ~74%
* of the time.  Notice also that the standard deviation of the
* standard error is much larger for the clustered standard errors.
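* This over-rejection is roughly what a back-of-the-envelope
* Moulton-style calculation would predict: with a regressor that
* is constant within groups of size m and an intraclass error
* correlation of rho, the conventional OLS variance understates
* the true variance by a factor of about 1+(m-1)*rho.  Here
* rho = .5 and m = 100, so the naive standard error is too small
* by a factor of roughly sqrt(1+99*.5), or about 7.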

* Now let's see what happens if we decrease the number of
* groups.

simulate clus_r = r(clus_r) clus_se = r(clus_se)  /// 
       noclus_r = r(noclus_r) noclus_se = r(noclus_se),  /// 
    rep(100): cluster_se , ng(10) ni(100)

sum
* We can see that our clustered standard errors are now
* overrejecting, though they are still much better than
* the non-clustered ones.

* Let's try something even more extreme.
simulate clus_r = r(clus_r) clus_se = r(clus_se)  /// 
       noclus_r = r(noclus_r) noclus_se = r(noclus_se),  /// 
    rep(100): cluster_se , ng(3) ni(200)

sum
* Now the clustered standard errors perform even worse,
* though they still do much better than the
* non-clustered standard errors.

simulate clus_r = r(clus_r) clus_se = r(clus_se)  /// 
       noclus_r = r(noclus_r) noclus_se = r(noclus_se),  /// 
    rep(100): cluster_se , ng(3) ni(20)

sum
* Clustering is still giving us better rejection rates though
* neither method is working particularly well at this point.

* From this simple simulation I would be inclined to argue
* that clustering is always recommended when possible.


Thursday, December 5, 2013

Incidental Parameters Problem with Binary Response Data and Unobserved Individual Effects

It is a well-known problem that in some models, even as the number of observations becomes large, estimators of the parameters of interest fail to be consistent because the number of nuisance parameters grows with the sample size.  The leading case is a binary response model estimated on panel data in which "fixed effects" correlated with the explanatory variables appear in the population.

One approach typically implemented by researchers is to observe individuals or organizations over multiple periods of time; this is called panel data.  Within the context of panel data it is often assumed that the unobserved effects (genetics, motivation, business structure, and whatever other unobservables might be correlated with the explanatory variables) are unchanging over time.

If we are willing to make that assumption, then it is relatively easy to remove the effect of the unobservables from our analysis.  In my previous post I demonstrate three distinct but equivalent methods for accomplishing this task when our structural model is linear.

However, when our outcome is a binary response variable (graduate from college or not, get married or not, take the job or not, etc.) it is usually no longer logically consistent to stick with a linear model.
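
To be concrete, the kind of model we have in mind (mirroring the simulation below) is

$$P(y_{it}=1\mid x_{it},a_i)=G(x_{it}\beta + a_i)$$

where $G$ is the logistic function $\Lambda(z)=e^z/(1+e^z)$ for the logit and the standard normal CDF $\Phi$ for the probit, and $a_i$ is the time-constant individual effect.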

In addition, not all of the remedies which worked for the linear model provide consistent estimators for non-linear models.  Let us see this in action.

First we will start with generating the data as we did in the December 4th post.

# Let's say: x = x.base + fe
 
set.seed(2)
 
nperson <- 500 # Number of persons
nobs <- 5      # Number of observations per person
 
 
fe.sd <- 1 # Specify the standard deviation of the fixed effect
x.sd  <- 1 # Specify the base standard deviation of x
 
# First generate our data using the time constant effects
constantdata <- data.frame(id=1:nperson, fe=rnorm(nperson))
 
# We expand our data by nobs
fulldata <- constantdata[rep(1:nperson, each=nobs),]
 
# Add a time index, first define a group apply function
# that applies by group index.
gapply <- function(x, group, fun) {
  returner <- numeric(length(group))
  for (i in unique(group)) 
    returner[i==group] <- get("fun")(x[i==group])
  returner
}
 
# Using the generalized apply function coded above
fulldata$t <- gapply(rep(1,length(fulldata$id)), 
                     group=fulldata$id, 
                     fun=cumsum)
 
# Now we are ready to calculate the time-variant xs
fulldata$x <- fulldata$fe + rnorm(nobs*nperson)
 
# This is where our simulation diverges from a linear model.
 
# Let us define the linear component as:
lc <- .5*fulldata$x + .5*fulldata$fe
 
# Now we will define the logit probability as
fulldata$pl <- exp(lc)/(1+exp(lc))
 
# Now we define the probit probability as
fulldata$pp <- pnorm(lc)
 
cor(fulldata$pl,fulldata$pp)
 
plot(fulldata$pl,fulldata$pp, main="Probit and Logit Probabilities", 
     xlab="Logit", ylab="Probit")
 
plot(sort(lc), sort(fulldata$pl), main="Probit and Logit Probabilities",
     ylab="P(y=1|x,a)", xlab="xb+a",
     type="l", lwd=2, lty=3)
  lines(sort(lc), sort(fulldata$pp), lwd=2, lty=2)
legend(-3.3,.8,c("logit","probit"), lty=c(3,2), lwd=2) 


# We can see the probit tends to have a shorter range in which the 
# action is happening. 
  
# We should really think of the probit and the logit as now
# two different sets of data.  Now let's generate our outcomes.
fulldata$yl <- fulldata$pl>runif(nobs*nperson)
fulldata$yp <- fulldata$pp>runif(nobs*nperson)
 
# Let's try to estimate our parameters with the logit.
glm(yl ~ x, data = fulldata, family = "binomial")
 
# We can see our estimator is upwards biased as we expect.
 
# Now we will try to include fixed effects as if we were not
# aware of the incidental parameters problem.
glm(yl ~ x+factor(id), data = fulldata, family = "binomial")
 
# We can see, that including a matrix of dummy variables
# seems to actually make our estimator worse.
 
# Instead let's try the remaining fix that is available to us
# from the previous post listing 3 fixes.
 
# We will include an average level of the explanatory variable
# for each individual.  This is referred to as the 
# Chamberlain-Mundlak device.
fulldata$xmean <- ave(fulldata$x, group=fulldata$id)
 
glm(yl ~ x+xmean, data = fulldata, family = "binomial")
# We can see that including an average effect significantly
# reduces the inconsistency in the estimator.
 
# Now, let's see what happens if we do the same things in the
# probit model.
glm(yp ~ x, data = fulldata, family = binomial(link = "probit"))
# Probit experiences similar upward bias to that of the logit.
 
glm(yp ~ x+factor(id), data = fulldata, 
    family = binomial(link = "probit"))
 
glm(yp ~ x+xmean, data = fulldata, family = binomial(link = "probit"))
# Interestingly, including the Chamberlain-Mundlak device in the probit,
# though theoretically inconsistent, does seem to produce estimates
# comparable to those from including the device in the logit, at least
# at the sample sizes simulated here.

Wednesday, December 4, 2013

Unobserved Effects With Panel Data

It is common for researchers to be concerned about unobserved effects being correlated with observed explanatory variables.

For instance, if we were curious about the effect of meditation on emotional stability we may be concerned that there might be some unobserved factor such as personal genetics that might  predict both likelihood to meditate and emotional stability.

In order to remove this potentially biasing effect we could think about taking measurements over multiple periods for the same individual inquiring about frequency of meditation and emotional stability.

If, after removing the time-constant effects (of which genetics is presumably a component), we observe that within the same individual there is still a relationship between meditation and emotional stability, then we may feel we are on firmer ground with our hypothesis that meditation may lead to greater emotional stability.

In order to accomplish the goal of estimating this relationship we may experiment with a "fixed effects" model defined as:

$$y_{it}=x_{it}\beta + a_i+u_{it}$$

In this typical linear model with panel data, there is no problem including an arbitrary number of dummy variables.  Let's see this in action.

nperson <- 300 # Number of persons
nobs <- 3      # Number of observations per person
 
# In order for unobserved person effects to be a problem they must be
# correlated with the explanatory variable.
 
# Let's say: x = x.base + fe
 
fe.sd <- 1 # Specify the standard deviation of the fixed effect
x.sd  <- 1 # Specify the base standard deviation of x
 
beta <- 2
 
# First generate our data using the time constant effects
constantdata <- data.frame(id=1:nperson, fe=rnorm(nperson))
 
# We expand our data by nobs
fulldata <- constantdata[rep(1:nperson, each=nobs),]
 
# Add a time index, first define a group apply function
# that applies by group index.
gapply <- function(x, group, fun) {
  returner <- numeric(length(group))
  for (i in unique(group)) 
    returner[i==group] <- get("fun")(x[i==group])
  returner
}
 
# Using the generalized apply function coded above
fulldata$t <- gapply(rep(1,length(fulldata$id)), 
                         group=fulldata$id, 
                         fun=cumsum)
 
# Or a more simplified function
indexer <- function(group) {
  returner <- numeric(length(group))
  for (i in unique(group)) 
    returner[i==group] <- 1:sum(i==group)
  returner
}
 
# Is a special case of gapply
fulldata$t <- indexer(fulldata$id)
 
# Now we are ready to calculate the time-variant xs
fulldata$x <- fulldata$fe + rnorm(nobs*nperson)
 
# And our unobservable error
fulldata$u <- rnorm(nobs*nperson)
 
# Finally we are ready to simulate our y variables
fulldata$y <- .5*fulldata$x + .5*fulldata$fe + fulldata$u
 
# First let's see how our standard linear model performs:
summary(lm(y~x, data=fulldata))
 
# Adding a dummy variable removes the bias
summary(lm(y~x+factor(id), data=fulldata))
 
# The same result can be obtained by removing the per-person mean from
# both the explanatory variable and the dependent variable.
# Why is that?


Think of the problem as:
$$y_{it}=x_{it}\beta + a_i+u_{it}$$
So $$y_{it}-\bar{y}_i=(x_{it}-\bar{x}_i)\beta + (a_i-\bar{a}_i)+(u_{it}-\bar{u}_i)$$ where a bar denotes the per-person average over time.

Because the unobservable effect is constant over time, it equals its own per-person average, so $a_i-\bar{a}_i=0$ and the effect drops out.  (Including the full set of person dummies accomplishes exactly the same thing, which is why the two approaches give identical estimates of $\beta$.)

Thus: $$y_{it}-\bar{y}_i=(x_{it}-\bar{x}_i)\beta + (u_{it}-\bar{u}_i)$$

fulldata$ydemean <- fulldata$y-ave(fulldata$y, group=fulldata$id)
fulldata$xdemean <- fulldata$x-ave(fulldata$x, group=fulldata$id)
 
summary(lm(ydemean~xdemean-1, data=fulldata))
 
# We can also accomplish this by adding to the regression the
# Chamberlain-Mundlak device: the total or the mean of the explanatory
# variables at the level of each individual.
fulldata$xmean <- ave(fulldata$x, group=fulldata$id)
 
fulldata$xsum <- gapply(fulldata$x, group=fulldata$id, fun=sum)
 
# It is a little trickier to see how this accomplishes the task
# of differencing out the unobserved effect.
 
# This is how I think of it.  The unobserved individual effect must be
# correlated with the explanatory variable in aggregate to be a problem.
# However, that correlation can only be on the individual level, since
# by definition the "fixed effect" is constant on the individual level.
# Thus by creating a new variable which is the average or total for
# each individual, we are allocating to that variable any variation in
# the unobserved effect which correlates with the explanatory variable.
 
summary(lm(y~x+xmean, data=fulldata))
summary(lm(y~x+xsum, data=fulldata))
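
A compact way to formalize the intuition above is the Mundlak projection: suppose the individual effect can be written as $$a_i=\bar{x}_i\gamma + r_i$$ with $r_i$ uncorrelated with $\bar{x}_i$.  Substituting into the original model gives $$y_{it}=x_{it}\beta+\bar{x}_i\gamma+r_i+u_{it}$$ so the added regressor $\bar{x}_i$ absorbs the part of $a_i$ that is correlated with the explanatory variable, and pooled OLS with xmean (or equivalently xsum) included recovers the same estimate of $\beta$ as the within regressions above.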

Sunday, December 1, 2013

More Explorations with catR

# For the purposes of simulating computerized adaptive tests
# the R package catR is unparalleled.
 
# catR is an excellent tool for students who are curious about
# how a computerized adaptive test might work. It is also useful
# for testing companies that are interested in seeing how
# their choices of number of items, model, stopping rule,
# or quite a few of the other options available when designing
# a specific computerized adaptive test play out in practice.
 
# In this post I will explore some of the features of the 
# function randomCAT, an extremely powerful function 
# that simulates an entire response pattern for an individual.
 
# In a previous post I explored some of the other functions
# in catR in order to demonstrate step by step how to use 
# the package to simulate a test.
 
library("catR")
 
# First let's generate an item bank. 
# Items specifies how many items to generate
 
# Model specifies which model to use in generating the items.
# aPrior, bPrior, and cPrior specify the distributions from which
# to draw the parameters for each item.
 
# The final set of arguments is for specifying
# what range of theta values the bank will initially
# draw item parameters for.  Theta values are the latent
# traits that item response theory is concerned with estimating.
Bank <- createItemBank(items = 500, model = "3PL", 
                       aPrior=c("norm",1,0.2), 
                       bPrior=c("norm",0,1), 
                       cPrior=c("unif",0,0.25),
                       thMin = -4, thMax = 4,
                       step = 0.05)
 
# We may want to examine the object we have created called "Bank"
attributes(Bank)
 
# Within the Bank object of class "itBank" there are three named
# attributes.
 
# itemPar lists the item parameters for those items which have been
# generated.  We could see a histogram of difficulty parameters (b) by
# targeting within the Bank object:
 
hist(Bank$itemPar[,2], breaks=30, 
     main="Distribution of Item Difficulties",
     xlab="b parameter")
# We can also see how much information a particular item would add
# across a range of ability levels.  This information is already
# available within the Bank object under the names infoTab and
# theta.
 
# Plot the first item's information
plot(rep(Bank$theta,1), Bank$infoTab[,1],
     type="l", main="Item 1's information",
     xlab="Ability (theta)", ylab="Information")
 
# Plot the first 3 items
# By specifying type = "n" this plot is left empty
nitems <- 3
plot(rep(Bank$theta,nitems), Bank$infoTab[,1:nitems], type="n",
     main=paste0("First ",nitems," items' information"),
     xlab="Ability (theta)", ylab="Information")
# Now we plot each item's information curve
for (i in 1:nitems) lines(Bank$theta, Bank$infoTab[,i],
                          col=grey(.8*i/nitems))
# We can see how different items can have information that
# spans different ability estimates as well as some items
# which just have more information than other items.
 
# Plotting all 500 items (same code as previously but now
# we specify the number of items as 500)
nitems <- 500
plot(rep(Bank$theta,nitems), Bank$infoTab[,1:nitems], type="n",
     main=paste0("First ",nitems," items' information"),
     xlab="Ability (theta)", ylab="Information")
for (i in 1:nitems) lines(Bank$theta, Bank$infoTab[,i],
                          col=grey(.8*i/nitems))
 
# This plot may look nonsensical at first, but it actually
# provides some useful information.  From it you can see the
# maximum amount of information available for any one
# item at different levels of ability.  In the places where
# there is only one very tall item standing out we may be
# concerned about item exposure, since subjects who appear to
# be in the area of that item are disproportionately more likely
# to get that same high-info item than other subjects
# for whom the next highest item is very close in information
# to the max item.
 
# To see the max information for each ability we can add a line.
lines(Bank$theta, apply(Bank$infoTab, 1, max), col="blue", lwd=2)
 
# We might also be interested in seeing how much information
# on average a random item chosen from the bank would provide,
# or in other words what is the expected information from a
# random item drawn from the bank at different ability levels.
lines(Bank$theta, apply(Bank$infoTab, 1, mean), col="red", lwd=2)
 
# Or perhaps we might want to see what the maximum average information
# for a 20 item test might be.  So we calculate the average information
# for the top 20 items at different ability levels.
maxmean <- function(x, length=20) mean(sort(x, decreasing=T)[1:length])
maxmean(1:100) # Returns 90.5, seems to be working properly
 
lines(Bank$theta, apply(Bank$infoTab, 1, maxmean), col="orange", lwd=3)
# Now this last line is very interesting because it reflects,
# per item, the maximum amount of information this bank can provide
# given a fixed length of 20.  Multiply this curve by 20 and it will
# give us the maximum information this bank can provide given a
# 20 item test and a subject's ability.
 
# This can really be thought of as a theoretical maximum which
# any particular CAT test might attempt to meet but will on
# average always fall short of.
 
# We can add a legend
legend(-4.2, .55, c("max item info", "mean(info)",
                    "mean(top items)"),
       lty = 1, col = c("blue","red","orange"), adj = c(0, 0.6))
 
library("reshape")
library("ggplot2")
 
# Let's separate out infoTab
infoTab <- Bank$infoTab
 
# Let's add three columns to infoTab for max, mean, and mean(top 20)
infoTab <- cbind(infoTab,
                 apply(Bank$infoTab, 1, max),
                 apply(Bank$infoTab, 1, mean),
                 apply(Bank$infoTab, 1, maxmean))
 
# Melt will turn the item information array into a long object
items.long <- melt(infoTab)
 
# Let's assign values to the first column which are thetas
items.long[,1] <- Bank$theta
 
# Now we are ready to name the different columns created by melt
names(items.long) <- c("theta", "item", "info")
 
itemtype <- factor("Item", c("Item","Max", "Mean", "Mean(Max)"))
items.long <- cbind(items.long, type=itemtype)
items.long[items.long$item==501,4] <- "Max"
items.long[items.long$item==502,4] <- "Mean"
items.long[items.long$item==503,4] <- "Mean(Max)"
 
# Now we are ready to start plotting
# Assign the data to a ggplot object
a <- ggplot(items.long, aes(x=theta, y=info, group=item))
 
# Plot a particular instance of the object
a + geom_line(colour = gray(.2)) +
    geom_line(aes(colour = type), size=2,
              subset = .(type %in% c("Max", "Mean", "Mean(Max)")))
# Now let's look at how the randomCAT function works.
# There are a number of arguments that the randomCAT function
# can take.  They can be defined as lists which are fed
# into the function.
 
# I will specify only that the stopping rule is 20 items.
# Specifying trueTheta tells randomCAT what the true ability
# level is that we are estimating.
res <- randomCAT(trueTheta = 3, itemBank = Bank,
                 test=list(method = "ML"),
                 stop = list(rule = "length", thr = 20))
# I specify the test (theta estimator) as using ML because the
# default, which is Bayesian, is strongly centrally
# biased in this case.
 
# Let's examine what elements are contained within the object "res"
attributes(res)
 
# We can see our example response pattern.
thetaEst <- c(0, res$thetaProv)
 
plot(1:21, thetaEst, type="n",
     xlab="Item Number",
     ylab="Ability Estimate",
     main="Sample Random Response Pattern")
# Add true ability line
abline(h=3, col="red", lwd=2, lty=2)
# Add a line connecting responses
lines(1:21, thetaEst, type="l", col=grey(.8))
# Add the response pattern to the plot
text(1:21, thetaEst, c(res$pattern, "X"))
# Add the legend
legend(15, 1, "True Ability", col="red", lty=2, lwd=2)
# Plot the sample item information from the set of items selected.
plot(rep(Bank$theta,20), Bank$infoTab[,res$testItems], type="n",
     main="High information items are often selected",
     xlab="Ability (theta)", ylab="Information")
for (i in 1:500) lines(Bank$theta, Bank$infoTab[,i], col=grey(.75))
# Now we plot the items that were actually administered
for (i in res$testItems) lines(Bank$theta, Bank$infoTab[,i],
                               lwd=2, col=grey(.2))
# Now let's see how randomCAT performs with a random draw
# of 150 people with different ability levels.
 
npers <- 150   # Specify number of people to simulate
 
theta <- rnorm(npers) # Draw a theta ability level vector
 
thetaest <- numeric(npers)
# Creates an empty vector of zeros to hold future estimates
# of theta
 
# Create an empty item object
items.used <- NULL
 
# Create an empty object to hold b values for items used
b.values <- NULL
 
for (i in 1:npers) {
  # Input the particular theta[i] ability for a particular run.
  res <- randomCAT(trueTheta = theta[i],
                   itemBank = Bank,
                   test=list(method = "ML"),
                   stop = list(rule = "length", thr = 20))
  # Save theta final estimates
  thetaest[i] <- res$thFinal
  # Save a list of items selected in each row of items.used
  items.used <- rbind(items.used, res$testItems)
  # Save a list of b values of items selected in each row
  b.values <- rbind(b.values, res$itemPar[,2])
}
 
# Let's see how our estimated thetas compare with the true values
plot(theta, thetaest,
     main="Ability plotted against ability estimates",
     ylab="theta estimate")
# To get a sense of how much exposure our items get 
itemTab <- table(items.used)
 
length(itemTab)
# We can see we only have 92 items used across all 150 subjects
# taking the CAT exam.
 
mean(itemTab)
# On average each item used is exposed 32 times, which means
mean(itemTab)/150
# over a 20% exposure rate on average; in addition, some items
# have much higher exposure rates.
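 
# The average hides the worst case, so it is also worth checking
# the single most frequently administered item:
max(itemTab)        # count for the most frequently used item
max(itemTab)/npers  # its exposure rate across the simulated test takers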