Wednesday, October 31, 2012

Clustering Standard Errors - State Panel Data Example

* Imagine that you are trying to evaluate corporate state labor taxes as a predictor of state employment.

* First let's generate our states
set seed 1033

set obs 50

gen state=_n

* Let's generate some starting values for unemployment.
gen base_employment=runiform()*.3

* Let's imagine that there is an annual trend in unemployment for each state.
gen trend=rnormal()*.025

* The policy to cut unemployment is enacted in different states around year 10.
gen policy_start = rpoisson(10)

expand 20

bysort state: gen t=_n

gen policy=(t>policy_start)

gen employment = .01*policy + base_employment + trend*t + rnormal()*.06

* The nieve regression would be to directly estimate the effect of the policy.
reg employment policy

* However, we might be concerned that the sampling is clustered.
* In order to help controlled for correlated errors by cluster we can cluster the standard errors.

* We may be interested in the interclass correlation.
loneway employment state
* This happens to be large.

reg employment policy, cluster(state)
* This substantially increases our standard errors size and results in a failure to reject the null.
* But, in this case we know that there is an effect of the policy, should we still cluster our standard errors?

* The answer is yes, we need to cluster our standard errors.

* To show this I will simulate the data 100 times with the alternative scenario (that the null is true and there is no effect).

cap program drop cluster_test
cap program define cluster_test, rclass
  set obs 50
  gen state=_n
  gen base_employment=runiform()*.3
  gen trend=rnormal()*.025
  gen policy_start = rpoisson(10)
  expand 20
  bysort state: gen t=_n
  gen policy=(t>policy_start)
  gen employment = .00*policy + base_employment + trend*t + rnormal()*.06
  * NOTE: Now the policy has no effect.
  reg employment policy
     local p1 = ttail(e(df_r), abs(_b[policy]/_se[policy]))
  return scalar sig1 = (`p1'<.05)
  reg employment policy, cluster(state)
     local p2 = ttail(e(df_r), abs(_b[policy]/_se[policy]))
  return scalar sig2 = (`p2'<.05)

simulate sig1=r(sig1) sig2=r(sig2), reps(100): cluster_test

* sig1 is from the regression without clustered standard errors.
* sig2 is from the regression with clustered standard errors.
* We can see that both rejections too frequently reject the null (target is 5%).
* However, the difference between unclustered and clustered is the difference between falsely rejecting the null 56% of the time and 12% of the time.
* You can repeate the simulation above using 500 or 5000 states above.
* The more states you use the closer the type 1 error gets to 5%.
* However, increasing the number of years does not impove the estimates.

* There is one more thing I would like to do with this data so let's generate it once more.

* We may be concerned that our policy was not exogenously given to each state but rather as a product of an endogenous connection between employment and the policy.

* One method to test the exogeniety of the policy to so test if the year before the policy was enacted, if there was any predictive power on unemployment.
gen year_before=policy_start-1

gen policy_lead=(t==year_before)

reg employment policy_lead policy
* We may be tempted to not cluster the errors but clustering is just as important here as previously.

reg employment policy_lead policy, cluster(state)
* Unsurprisingly there is on evidence of endogeniety, since treatement was not endogenous by construction in this case.

Tuesday, October 30, 2012

Regression Analysis - OLS

* Often I simulate problems in order to verify that the method is working as I expect it to.

* Let's first set the number of observations

set obs 10000

gen x = runiform()
gen u = rnormal()

gen y = 2*x + 4*u

* To verify that we can actually estimate the coefficient on x (2) from OLS we can simply run it.

reg y x

* This is not a proof that OLS works but rather a simple test that can indicate that it might not be working.

* A simple example to show that OLS might not be working properly is:

reg y x if y>0

* OLS cannot be assumed to be unbiased when the dependent variable is censored.  If we did not know this already, we could use the previous regression to demonstrate that there is probably a problem.

* Of course a single regression would not be enough.

* We would need a monte carlos simulation to show biasedness (or a mathematical proof of course)

* A simple program can be written as such using the above code.

capture program drop censoredOLS
program censoredOLS
  set obs 10000
  gen x = runiform()
  gen u = rnormal()
  gen y = 2*x + 4*u
  reg y x if y>0

* It is useful to know that the coefficient from the regression can be targeting with the _b[x] command.

simulate b_x = _b[x], reps(1000): censoredOLS
sum b_x

* We can see from the simulation that the individual low estimate on the coefficient of x was not a fluke.

Monday, October 29, 2012

Monster Mash - Spatial Multi-Agent Simulation

# This simualtion demonstrates how to build a simple spatial 
#multi-agent simulations using R. (repo)
# There is a 20 x 20 grid in which the agents occupy.  
# If they walk off one edge they end up on the other side.
gridx = 20
gridy = 20
# Let's first specify the initial number of agents.
n.Bob = 4
n.Frank = 1
n.Dracula = 1
n.Hunter = 2
# Specify number of moves before the monsters start fighting and expanding
n.moves = 10
# Now we will create a movement matrix for each monster.
for(v in c("Bob","Frank","Dracula","Hunter")){
# Zombie is type 1, Frankenstien's Monster is type 2, Vampires is type 3, and 
Hunter type 4.
for(i in 1:get(paste("n.","Bob",sep=""))) {
  # Specify initial positions
  x = ceiling(gridx*runif(1))
  y = ceiling(gridy*runif(1))
  type = t
  # Create a vector for each monster in the simulation.
  assign(paste(v,i,sep=""), t(as.matrix(c(x=x,y=y,t=type))))
# Due to number of files separate objects it is not very easy to plot them.
plot(0,0,xlim=c(1,gridx),ylim=c(1,gridy), xlab="X", ylab="Y", 
main="Zombie-Grey, Frankenstien-Green, Vamp-Red, Hunter-Purple")
# Let's speficy the color of each monster type and hunter.
# Remember Zombie is 1, Frankenstien's 2, Vampires 3, and Hunter 4.
mon.col = c("gray", "green", "red", "purple")
for(v in c("Bob","Frank","Dracula","Hunter")){
for(i in 1:get(paste("n.","Bob",sep=""))) {
  handle = get(paste(v,i,sep=""))
  points(handle.l[1],handle.l[2], col=mon.col[handle.l[3]], cex=3, pch=19)
# Speficy a minmax function that keeps the monsters in the grid
minmax <- function(x,xmin,xmax) max(min(x,xmax),xmin)
for(ii in 1:n.moves) {
for(v in c("Bob","Frank","Dracula","Hunter")){
for(i in 1:get(paste("n.","Bob",sep=""))) {
  handle = get(paste(v,i,sep=""))
  final = rbind(handle,c(minmax(handle.l[1]+(-1)^rbinom(1,1,.5), 1, gridx),
      minmax(handle.l[2]+(-1)^rbinom(1,1,.5), 1, gridy), handle.l[3]))
plot(0,0,xlim=c(1,gridx),ylim=c(1,gridy), xlab="X", ylab="Y", 
main=c("Monster Movement", 
"Zombie-Grey, Frankenstien-Green, Vamp-Red, Hunter-Purple"))
# Let's speficy the color of each monster type and hunter.
# Remember Zombie is 1, Frankenstien's 2, Vampires 3, and Hunter 4.
mon.col = c("gray", "green", "red", "purple")
for(v in c("Bob","Frank","Dracula","Hunter")){
for(i in 1:get(paste("n.","Bob",sep=""))) {
  handle = get(paste(v,i,sep=""))
  turn <- nrow(handle)
  points(handle[1,1],handle[1,2], col=mon.col[handle.l[3]], cex=2, pch=19)
   arrows(x1=handle[-1,1], y1=handle[-1,2],
         col=mon.col[handle.l[3]],pch=19, length=.1)  
  points(handle.l[1],handle.l[2], col=mon.col[handle.l[3]], cex=3, pch=19)

# Large circles are end points, small beginning points and arrows movement over the different turns.

# The agents do not interact with each other so this simulation really is not that interesting.  In a future post I might explore what happens when agents can encounter each other and how the entire simulation resolves.

Sunday, October 28, 2012

Explosive Roommate Combinations

# I have been thinking of late about roommates and conflicts.  It seems to me that the more roommates you have, the more likely you are to have some kind of explosive conflict.  There seems to be very discrete thresholds.

# I believe this is because each roommate pair has the possibility of severely bothering another roommate.  Thus if there is only two people living in a house the probability of conflict is delta (combination choose 2 from 2).  [remember comb = n!/k!(n-k)!]
# Which is 1 because there is only one combination possible from a pool of only towo.

# If there is three people then the probability is combination choose 2 from 3.
# Which means there are 3 potential conflicts. A&B B&C or A&C

# If there is three people then the probability is combination choose 2 from 4.
# There are 6 of these A&B B&C C&D A&C A&D B&D.

# If you were to cram 5 roommates into one house then there could be 10 potential combinations.

# IF there were six roommates then potential number of conflicts increases to 15.

# Thus large families always seem to be fighting.  Likewise it is hard to find examples of non-family members sharing a home with more than three or four roommates.

# The overaching point is that though we are only increasing home occupancy by 1/2,1/3,1/4,or 1/6 the effects of adding one more person discontinuously increases the number of potential conflicts.

# Let's see how the map of combinations work
b <- c(NA)

# We need to loop through all of the a's for this operation because a is already a vector so 1:a will not work for us.
for(i in (a)) b[i-1] <- prod(1:i)/(prod(1:2)*prod(1:max(i-2,1)))

plot(a,b, ylab="Combinations", xlab="Items", main=c("The number of potential conflicts increases", "disproportionately relative to the number of roommates"))

# Thus if each person is likely to be in conflict with each other person at a probability of 10% per combination then likely number of conflicts per year with 20 people is 20 conflicts.  Thus it is necessary for large organizations of people to establish rules by which conflicts are managed.


Saturday, October 27, 2012

Power Analysis

# Power analysis is a frequently used tool to calculate the minimum sample size neccessary in order to show that a hypothesized result exists given some specified rate of type I and type II error.  A typical level of error for type I is alpha=5%.  Type II error for the purpose of power analysis is often specified at a beta=20% level.

# Let's first look at how selecting a sample corresponds to rejection rates and type I errors. We can do this based on the normality assumption of the underlying distribution of errors.  If we assume that we know that the underlying errors are normally distributed then we can say with exact precision how frequently a difference in means will reject the null for a particular sample size.

# A critical feature of power analysis is the specification of the expected effect size.  The larger the specified effect size the smaller the statistical sample we need in order to reject the null.  Symmetrically, the larger the effect size you assume the more likely you will fail to reject the null in practice as your analysis fails to have sufficient power.  Thus, choosing expected effect size is a non-trivial task.  Sometimes, discipline specific requirements will inform your decision as to the a-priori effect size.  For instance in educational interventions, a new technology that is expensive may not be considered economically feasible unless it increases student performance by .1 on overall GPA.
d = .1

# It is also important to either have information about the population's sample variance or to make some assumption about the populations variance.

# I assume that the sample's standard deviation of GPAs is 1.
s = 1

# I would like to first look at the number of students that may be part of the study.  In this case anywhere between 2 and 500.
n = 2:500

# The standard errror is:
se = s/(n^.5)

# The resulting t score is:
t = d/se

# Probability of rejecting the null is drawn from the Student t-distribution because we are assuming a two sided rejection rate.
p <- pt(t,n)

plot(n, p, ylim=c(.5,1), ylab="Probability", col="blue",
          xlab="Sample Size", type="l", yaxs = "i", xaxs = "i" ,
          lwd=2, main="Probability of Rejecting the Null")
# I choose .975 because 5% significance level on a two sided test leaves only 2.5% in each tail.
lines(y=c(0.975,0.975), x=c(2,500), col="red", lwd=2)

# We can see that for an effect size that is only 10% of the standard deviation of the error we would need nearly 400 people before we can be confident of rejecting the null 95% of the time.  I believe there is a symetric argument for type I errors as well.  I will attempt to address this more later.

# At this point one may wonder if we really need to hassle with the t-statistic.  For the most part you would be right.  The difference between the t and the z is extremely small.

z = d/se
pz <- pnorm(z)
plot(n, p, ylim=c(.5,1), ylab="Probability", col="blue",
          xlab="Sample Size", type="l", yaxs = "i", xaxs = "i" ,
          lwd=8, main=c("The difference between t and z distributions is", "not driving results: t=blue, z=green"))
lines(n, pz, col="green", lwd=3)
lines(y=c(0.975,0.975), x=c(2,500), col="red", lwd=2)

# Thus, for simplicity, we will use the z distribution.

# Let's look briefly at what happens when we vary effect size.

# Start with a blank graph with a range of n from 2 to 100
n<- 2:100

# we have to redefine the standard error vector
se = s/(n^.5)

plot(n, n*0, ylim=c(.5,1), ylab="Probability", col="blue",
          xlab="Sample Size", type="n", yaxs = "i", xaxs = "i" ,
          lwd=2, main="Probability of Rejecting the Null")

# Now let's loop through the difference in power as a function of effect size:
for (d in seq(.1,1,.1)) {
  z = d/se
  pz <- pnorm(z)
  lines(n, pz, col=gray(d), lwd=2)
lines(y=c(0.975,0.975), x=c(2,200), col="orange", lwd=2)

# I will now strengthen the axes since they seem to have been somewhat overwritten by the other lines on the graph.
lines(y=c(1,1), x=c(2,200), lwd=2)      
lines(y=c(0,1), x=c(2,2), lwd=2)
# Thus we can see that if the effect is close to one standard deviation, then a very small sample size is all that is needed to identify the effect.

# Overall the construction of power analysis is found by rearranging the common form of the t-test.

# t = (M0 - M1)/(SE)
# SE = s/n^.5
# t=n^.5 * (M0-M1)/s
# (t*s/(M0-M1))^2 = n
# d=M0-M1
# n = t^2 (var/d^2)

# In order to find t you find what t value is required in order to reject at the desired level.

# T~1.96 at 5% rejection rate.

# Let's imagine our previous example

# Let's see how our required sample size increases as our desired rejection rate also increases.
alpha = seq(.5,.975,.025)

target.z <- qnorm(alpha)

nreq = (target.z)^2*s^2/d^2

plot(nreq, alpha, type="l", main="As the required rejection rate increases, patient count increases")

# Thus on a target difference in means of .1 standard deviation, there need be at least:

# Around 384 observations.  (This is of course because the effect size is very small.)

# Statisticians often do not stop there.  They often also assume that the probabiliy of failing to reject the null when there actually is an effect (Type II error) is independent of Type I errors and thus need to be taken into account.  This is done by saying that we are willing to accept a 20% possibility that there is an effect but we are not detecting it.  We can adjust the requirement simply by adding an additional z score:

# This nearly doubles the required sample.

# At this point I would like to make the caveat that most of my posts I simply pull out of my head.  Thus, if there is something wrong I welcome corrections and apoligize.  I say this now because I am generally unfamiliar with power analysis and am just piecing together this simulation as a learning device.

Thursday, October 25, 2012

A gentle introduction to boostrap resampling methods

I found this paper by Kesar Singh and Minge Xie introducing bootstrapping resampling methods.  It looks to me like a nice introduction to the method.

Using Simulations to Maximize Stochastic Models

# Today, I was running late for class and decided to drive and park in two hour parking even though I had a three hour class.  I hoped the chance of being ticketted was low.  Whether it was or not, I ended up being ticketted for a wopping 35 dollars.  This got me thinking about the decisions that cities use to decide on how many officers should patrol their streets.  In some cities, I believe, parking laws and corresponding fees are structured in such a way as to cause minimal congestion while still providing parking for those who are looking for a short term soluation.  East Lansing, however I believe has a different incentive.  I believe they are generally uninterested in the welfare of the mostly students who come from other cities to access the University in East Lansing.  East Lansing, using parking as a revenue generating system that extracts rents from mostly non-East Lansing residents.

# If we are the city of East Lansing and are considering how to maximize rents in the form of traffic tickets we may use a simulation to identify the most likely strategy for maximizing over stochastic outcomes.

# The cost of employing one parking officer checking all of the vehicles takes 1 hour at a cost of $15.

cost_officer = 15

# Each minute there is a 20% chance of someone arriving.  There are 10 hours in which parking is enforced.  Thus there could be at most 60*10=600 people per day.

# Each entry is the arrival time a potential person
arrive = 1:600

# Person
showed_up = rbinom(600,1,.2)

# The duration each person stays is drawn from a poisson distribution.

duration = rpois(600, 30)*4*showed_up

depart = (arrive+duration)*showed_up

# What we want to know is, what is the optimal number of times to send out parking officers.

# In order to answer that we have to think about what the officers are doing.  First, they need to mark vehicles that are currently parked.  If there are vehicles that were marked more than 2 hours previously, then they write a ticket.  It is possible that vehicles will not pay even though they viloate the 2 hour limit because they either were not marked early enough or left before a follow up could be arranged.

# The choice variable is number of patrols which each cost 15 dollars.

numb_patrols = 50

# The patrols are spaced evenly over the 10 hours.

patrol_time = seq(0,600,length.out=numb_patrols+1)[-1]
# I add 1 to the numb_patrols because the first patrol is sent out not at the beginning of the parking day since by construction there is nobody parking before the parking hours.

# Now let's figure out who gets chalked and who gets ticketted.
chalked = 0*showed_up

# We want vehicles to be chalked by the first officer that they encounter but not to be chalked more than once.
for (v in patrol_time) {
  chalked[(arrive < v)&(chalked==0)] = v

ticketed = 0*showed_up
# Now let's see who gets ticketted
for (v in patrol_time) {
  print(((v-chalked)>120) & (chalked!=0) & (depart>v) )
  ticketed[(depart>v) & ((v-chalked)>120) & (chalked!=0)] =1

cbind(duration, ticketed)

# In order to calculate net benefits.  Compare the cost of highering patrols to that of the ticket value.

sum(ticketed)*35 - numb_patrols*15
# My approximation of the revenue is 1385 for one day of ticketing.  Let's see if we can't get a profitability curve as a function of the number of patrols.  To do this I will copy and compress the code above into a function.

park_fun <- function(num_officer) {
  returnvalues=matrix(NA, nrow=length(num_officer), ncol=2)
  for(i in 1:length(num_officer)) {
  numb_patrols = num_officer[i]
  cost_officer = 15
  arrive = 1:600
  showed_up = rbinom(600,1,.2)
  duration = rpois(600, 30)*4*showed_up
  depart = (arrive+duration)*showed_up
  patrol_time = seq(0,600,length.out=numb_patrols+1)[-1]
  chalked = 0*showed_up
  for (v in patrol_time) {
    chalked[(arrivev)&(chalked==0)] = v

  ticketed = 0*showed_up
  for (v in patrol_time) {
    ticketed[(depart>v) & ((v-chalked)>120) & (chalked!=0)] =1
  returnvalues[i,]<- cbind(sum(ticketed)*35, numb_patrols*15)

# Now let's map a profit function

park_results <- park_fun(1:100)

plot(1:100, park_results[,1], type="l", ylab="$", xlab="Number of Patrols", main="Jagged line is ticket revenue, Straight line is patrol cost", , lwd=2)
lines(1:100, park_results[,2],  col="purple", lwd=2)

# The city would like to run this simulation 100 or 1000 times to get expected distributions of retuns for each choice of number of patrols.  Then, you simply take the expected revenue and subtract it from expected cost (assuming the city is risk nuetral in parking violations).  Eyeballing the graph we can get an idea.  A number of patrols around 50 would be ideal to maximize profits.  Another key feature of the graph is the dimishing returns to more patrols after the first hump.  This is because it does little good to send out more cars to ticket if there are already a lot of cars ticketing other cars.

Wednesday, October 24, 2012

Hierarchical linear modelling

* Explorations of in heirachical linear modeling

* Hierarchical modeling is a a type of model that assigns to different levels different portions of the unexplained error term.

* What does this mean?

* Imagine that you are interested in predicting student success.

* However, you believe that there is a invidual effect size for each student, each teacher, each school, and each district.

* You would like to know what portion of the unexplained variance can be attributed to each level.

* Let's see how this works


* Imagine that there are 5 districts you have data on
set obs 5

gen dist_id=_n

* Each district has an effect size sd=2
gen dist_fe = rnormal()*2

* Each district has 5 schools
expand 5

gen school_id=_n

* Each school has an effect size sd=3
gen school_fe = rnormal()*3

* Within each school you have 8 teachers
expand 8

gen teach_id = _n

* Each teacher has an individual effect size sd = 4
gen teacher_fe = rnormal()*4

* Each teacher has twenty students.

expand 20

gen student_id = _n

* Each of these students has an effect size equal to sd = 5

gen student_fe = rnormal()*5

* Let's imagine for a second that each student stays with the same teacher for two semesters and we have information about student progress at the end of each semester.

expand 2

sort student_id

gen semester = mod(_n-1,2) + 1

* There is some transient error (shocks) that affect student performance during each semester sd = 6

gen u = rnormal()*6

* Finally let's imagine that there is some treatment such as extra tutoring that is randomly assigned to 20% of our students on a semester basis.

gen tutoring = rbinomial(1,.2)

* Let's now generate our test results

gen perform = 2*tutoring + dist_fe + school_fe + teacher_fe + student_fe + u

* For our first cut at our analysis of this data let's compare the mixed effect HLM model with a standard fixed effect model and random effects model.

* The standard fixed effect model estimates an effect for every individual.  In this case let us focus on students.

xtset student_id

xtreg perform tutoring, fe
* We can see that our estimate of the effect tutoring is close and that our variance in estimated student effects (sigma_e) is very close to 6.

* We might also try to estimate tutoring with a random effects model at the student level.
xtreg perform tutoring, re

* We expect that the re estimator outperforms the fe estimator in estimating the returns to tutoring because in this case we know that the assignment to tutoring is trully orthoganol to the student fixed effects.

* Now let's compare our previous estimates to the single level mixed effects
xtmixed perform tutoring || student_id:

* We can see that the xtmixed command produces an estimate on tutoring very nearly identical to that of the random effects model.

* As with the previous two estimates, we are generally satisfied with the estimates of the standard deviation of the student effect.

* Now let's start adding layers.

xtmixed perform tutoring || student_id: || teach_id: || school_id: || dist_id:

* Our estimate of tutoring has not improved.

* Likewise our ability to estimate the variance at each level does not seem to have much ability to distinguish effect size between the district level with a (sd of 2), the school (sd of 3), the teacher (sd of 4), and the student (sd of 5).

* Perhaps this is a result of the sample size being too small.

* At this point I will return to the beginning of the simulation and increase the number of districts to 50 with the resulting number of observations at 80,000.  However, the xtmixed command is very slow and computationally intensive so I will paste in the results directly.

/* The First etimation
Performing EM optimization:

Performing gradient-based optimization:

Iteration 0:   log restricted-likelihood =  -283974.7
Iteration 1:   log restricted-likelihood =  -283974.7

Computing standard errors:

Mixed-effects REML regression                   Number of obs      =     80000
Group variable: student_id                      Number of groups   =     40000

                                                Obs per group: min =         2
                                                               avg =       2.0
                                                               max =         2

                                                Wald chi2(1)       =    873.44
Log restricted-likelihood =  -283974.7          Prob > chi2        =    0.0000

     perform |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    tutoring |   1.975427   .0668413    29.55   0.000     1.844421    2.106434
       _cons |  -.4480866   .0439606   -10.19   0.000    -.5342478   -.3619255

  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
student_id: Identity         |
                   sd(_cons) |   7.231256   .0354216      7.162163    7.301015
                sd(Residual) |   5.984532   .0211588      5.943205    6.026147
LR test vs. linear regression: chibar2(01) = 17369.05 Prob >= chibar2 = 0.0000

* The second estimation (this took a very long time)

Performing EM optimization:

Performing gradient-based optimization:

Iteration 0:   log restricted-likelihood = -284364.73
Iteration 1:   log restricted-likelihood = -283987.22  (not concave)
Iteration 2:   log restricted-likelihood = -283981.69  (backed up)
Iteration 3:   log restricted-likelihood =  -283974.7  (not concave)
Iteration 4:   log restricted-likelihood =  -283974.7  (backed up)
Iteration 5:   log restricted-likelihood =  -283974.7  (not concave)
Iteration 6:   log restricted-likelihood =  -283974.7  (backed up)

Computing standard errors:

Mixed-effects REML regression                   Number of obs      =     80000

                |   No. of       Observations per Group
 Group Variable |   Groups    Minimum    Average    Maximum
     student_id |    40000          2        2.0          2
       teach_id |    40000          2        2.0          2
      school_id |    40000          2        2.0          2
        dist_id |    40000          2        2.0          2

                                                Wald chi2(1)       =    873.44
Log restricted-likelihood =  -283974.7          Prob > chi2        =    0.0000

     perform |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    tutoring |   1.975427   .0668413    29.55   0.000     1.844421    2.106434
       _cons |  -.4480866   .0439606   -10.19   0.000    -.5342478   -.3619253

  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
student_id: Identity         |
                   sd(_cons) |   3.621455   11.10185      .0089014     1473.35
teach_id: Identity           |
                   sd(_cons) |   3.608619   9.417657      .0216722    600.8684
school_id: Identity          |
                   sd(_cons) |   3.681086   8.431311       .041338    327.7954
dist_id: Identity            |
                   sd(_cons) |   3.550187   8.323714       .035854    351.5322
                sd(Residual) |   5.984525   .0211586      5.943198    6.026139
LR test vs. linear regression:       chi2(4) = 17369.05   Prob > chi2 = 0.0000

Note: LR test is conservative and provided only for reference.              */

* So?  As a result we can see that our estimators do not improve substantially as a result of more data.  They may even be worse.

Monday, October 15, 2012

Maximum Likelihoods and Item Response Theory

# Maximum likelihoods is a frequent mechanism for solving item response theory problems.  Outside of IRT, maximum likelihood is widely used to solve a range of other problems.

# The following post will follow a lecture I recently attended by Mark Reckase which illustrated graphically how maximum likelihood works to solve problems.  I found this lecture extremely useful because the arguments especially with regards to IRT are almost exactly analogous to binary response models in econometrics.

# Let us start with a single parameter IRT model also know as the Rasch model.

# Imagine that we have a two test items in which we have a single response by a student.  We know that the difficulty of those questions is b=-1 and b=2.  We would like to find out the skill level of the student.

# The student gets the first question right and the second one wrong.  Based on only this information, what is our best guess at the student's ability.

# Well, we know the probabilies of the student getting the different questions right or wrong is a function of the student's ability.  The response to the questions takes the logistic form.

# Let's see what that means.  First let's restrict the range of the student ability to be between -4 and 4

ability <- seq(-4,4,.1)

# Let's define the Rasch model
rasch <- function(theta,b) exp(theta-b)/(1+exp(theta-b))

# First let's plot the probability of getting the first problem right (b=-1) given different ability levels.
plot(ability, rasch(ability,b=-1), type="l", xlab = ~theta, ylab="Probability", main="Joint Probabilities", col="red", lwd=2)

# Now let's plot the probability of getting the second question wrong.  This is the inverse of the probabity of getting it right.  Thus 1-rasch(theta,b=2).

lines(ability, 1-rasch(ability,b=2), col="blue", lwd=2)

# We can see that the probability of getting either question individually right or wrong makes it impossible to identify the parameter of interest.  However, since we know one of them was right and one of them was wrong we can use that information in order to calculate the best approximation of the ability of the student.

# This is done via maximum likelihood methods.  By assuming that the probability of getting a correct response given the student's ability level is independent for each item, we can calculate the joint probability of getting both outcomes jointly.

# This is done by the standard formula for conditional independence.  P(I1=1,I2=0|theta)=P(I1=1|theta)P(I2=0|theta)

# Thus we can graph the joint probability by multiplying them by each other.

lines(ability, rasch(ability, b=-1)*(1-rasch(ability,b=2)), col="purple", lwd=2)

# It turns out in the rasch model each item has equal discriminatory power, thus each question is equally weighted in the probability calculation.  Thus the maximimum likelihood point happens to be at the midpoint between -1 and 2.
abline(v=.5, col="orange", lwd=2)

# Imagine instead of a correct response to the first and an incorrect response to the second we instead got an incorrect response to the first and a correct response to the second.

lines(ability, 1-rasch(ability,b=-1), type="l", xlab = ~theta, ylab="Probability", main="Joint Probabilities b=-1 wrong, b=2 right", col="red", lwd=2)
lines(ability, rasch(ability,b=2), col="blue", lwd=2)
lines(ability, rasch(ability, b=2)*(1-rasch(ability,b=-1)), col="purple", lwd=2)

# Interestingly we can see that though the responses to the questions are opposite the maximum likelihood estimator still yeilds a highest probability predicted value at 1/2.  The predicted error of the latter two combinations is much greater.

# Notice that if there were no variance in the responses to the questions then the joint maximum would be no more identifiable than the maximum of individual curves.  On the other hand interestingly the difficulties of the questions need not vary in order to identify a maximum.

plot(ability, rasch(ability,b=1), type="l", xlab = ~theta, ylab="Probability", main="Joint Probabilities with four items", col="red", lwd=2,ylim=c(0,1))
lines(ability, 1-rasch(ability,b=1), col="blue", lwd=2)
lines(ability, rasch(ability, b=1)*(1-rasch(ability,b=1)), col="purple", lwd=2)

# Okay, let's imagine that we introduce information about the previous item responses into the current information set.  Imagine the student got b=-1 correct

lines(ability, rasch(ability,b=-1), col="pink", lwd=2)
lines(ability, rasch(ability, b=1)*(1-rasch(ability,b=1))*(rasch(ability,b=-1)), col="yellow", lwd=2)

# and b=2 right this time.
lines(ability, rasch(ability,b=2), col="green", lwd=2)
lines(ability, (rasch(ability,b=2))*rasch(ability, b=1)*(1-rasch(ability,b=1))*(rasch(ability,b=-1)), col="yellow", lwd=2)

# With each additional bit of information the MLE becomes more precise.  That is the tails get smaller and the estimate becomes a steeper hill with a more clearly defined peak.  I do not believe this has to be the case.  If the student were to keep getting easy problems wrong and hard problems right I am not sure if the estimates would improve.

# In order to see how the joint probabilities are changing we can look at just the joint probabilities rescaled so that their maximum is 1.

# I would also like to redefine the ability spectrum to be between -4 and 8 since the expected student ability is creeping to the right.

ability <- seq(-4,8,.1)

# First let's plot the two item joint probability function:
two.joint <-  rasch(ability, b=2)*(1-rasch(ability,b=-1))

plot(ability,two.joint/max(two.joint), col="blue", lwd=2, xlab=~theta, type="l", main="Joint probability distributions")

three.joint <- two.joint*rasch(ability,b=-1)
lines(ability,three.joint/max(three.joint), col="red", lwd=2)

four.joint <- three.joint*rasch(ability,b=2)
lines(ability,four.joint/max(four.joint), col="purple", lwd=2)

# Now let's add a fifth item with a difficulty of 3 that the student got wrong.
five.joint <- four.joint*(1-rasch(ability,b=3))
lines(ability,five.joint/max(five.joint), col="green", lwd=2)

# We can see that the result is a clear tightenning of the distribution.

# It looks like the most likely ability level of the student is around 2.

# Now let's imagine that the student happened to get a really easy problem wrong by filling in the wrong number on a multiple choice.

six.joint <- five.joint*(1-rasch(ability,b=-3))
lines(ability,six.joint/max(six.joint), col="pink", lwd=2)
# We can see that the performance on a single easy question has substantially decreased our estimate of the student's ability.

# In order to see if the distribution has also widenned I will move the max of the five and six joint distributions to be centered at 0.

plot(ability-ability[five.joint==max(five.joint)],five.joint/max(five.joint), col="green", lwd=3, type="l", xlim=c(-3,3))
lines(ability-ability[six.joint==max(six.joint)],six.joint/max(six.joint), col="purple", lwd=1, type="l")
# Adding the sixth item does appear to have increased the width of the joint distribution ever so slightly indicating that getting the easy problem wrong not only substantially decreased the estimate of the student ability but also made that estimate less precise.

Thursday, October 11, 2012

Stata - Write your own "fast" bootstrap

  * Imagine we would like to bootstrap the standard errors of a command using a bootstrap routine.

  * I created a previous post demonstrating how to write a bootstrap command.  This is a similar post however the bootstrap is much faster than the previous one.
  * First let's generate some data.
  set obs 1000
  gen x1 = rnormal()
  gen x2 = 2*rnormal()
  gen u = 5*rnormal()
  gen y = 5 + x1 + x2 + u
  local dependent_var x1 x2
  local command_bs reg y `dependent_var'
  * First let's see how the base command works directly.
  * As a matter of comparison this is the built in bootstrap command.
  bs, rep(100): `command_bs'
  * The following code is yet another user written bootstrap alternative.
  * I wrote this
  * Specify the number of bootstrap iterations
  local bs = 100
  * Save the number of observations to be drawn
  local N_obs = _N
  * Number of terms plus one for the constant of coefficients to be saved
  local ncol = wordcount("`dependent_var'")+1
  mata theta = J(`bs',`ncol',0)
  forv i = 1(1)`bs' {
    * Preserve the initial form of the data

    * Draw the indicators of the resample
mata draw = ceil(runiform(`N_obs',1):*`N_obs')

* Create an empty matrix to hold the number of items to expand
mata: expander = J(`N_obs',1, 0)

* Count the number of items per observation to generate.
    qui mata: for (i=1 ; i <= `N_obs'; i++) expander[i]=sum(draw:==i) ; "draws complete"

* Pull the expander value into stata
getmata expander=expander

* Drop unnessessary data
qui drop if expander == 0

* Expand data, expand==1 does nothing
    qui expand expander

* Run a regression
    qui `command_bs'

* Send to mata the matrix of results
mata theta[`i',] = st_matrix("e(b)")
* Configure the visual display
di _c "."
    if (int(`i'/10)==`i'/10) di _c "|"
    if (int(`i'/50)==`i'/50) di " " `i'

  * The estimates of the coefficients have been saved into a theta matrix.
  mata theta
  * Now let's calculate the standard deviations.
  mata bs_var = variance(theta)
  mata bs_var
  mata bs_se = diag(bs_var):^.5
  mata bs_se

Wednesday, October 10, 2012

Bootstrapping Percentile Count Vs. Standard Error Approximation

* A typical method for calculating confidence intervals is by using the estimates from some number of redraws and taking the standard error from that.

* Then using the standard error to calculate the confidence interval drawing from the t-distribution.

* An alternative method is to use the draws themselves as the sample and to calculate the confidence interval as the range of values that 1-alpha part of draws fall within.

* This might seem fishy but it is really the theoretically superior method (I believe).

* This simple simulation will see how well both methods do.

* First let's program a data generating command.

cap program drop mysim
program define mysim, rclass

  * The first argument that the user puts into the mysim function will be the population size.
  set obs `1'

  * Now let's generate some simple data.
  gen x = rnormal()
  gen u = rnormal()

  * The coefficient of interest is the coefficient on x.
  * Specifically, a good 95% confidence interval will have the simulated value of the coefficient (2) fall within it 95% of the time.
  gen y = 5+2*x+u

  * Now let's bootstrap the standard errors.
  bs, rep(`2') saving(bstemp.dta, replace): reg y x
  * Simple OLS need not use bootstrapping to derive a useful confidence interval.
  * However, for this exercise because OLS is fast and speed is useful in Monte Carlo aproaches.
  matrix CIp = e(ci_normal)
    scalar CInlower = CIp[1,1]
    scalar CInupper = CIp[2,1]

  if (CInlower<2 amp="amp" nupper="nupper">2) return scalar CIn = 1
  else return scalar CIn = 0

  * Now let's see if we can't use the empirical draws to calculate a CI.
  * First let's load the BS data data that was generated from the commands.
  use bstemp.dta, clear

  * Sort the beta coefficients
  sort _b_x

  * Now let's calculate the values that we will use for our CI.
  * We want to start looking at the sorted estimates at the 5% observation or greater to the 95% observation or less and see how frequently the true value (2) falls within that.

  * The number of observations in `2'.
  scalar lower1 = floor(`2'*.025)
  * This makes it so that it rounds down at the bottom and up at the top.  This is less conservative.
  scalar upper1 = ceil(`2'*.975)
  di "CIp1: " _b_x[lower1] " & "_b_x[upper1]
  if (_b_x[lower1]<2 amp="amp" b_x="b_x" upper1="upper1">2) return scalar CIp1 = 1
  else return scalar CIp1 = 0

  scalar lower2 = ceil(`2'*.025)
  * This makes it so that it rounds up at the bottom and down at the top.  This is more conservative.
  scalar upper2 = floor(`2'*.975)
  di "CIp2: " _b_x[lower2] " & "_b_x[upper2]
  if (_b_x[lower2]<2 amp="amp" b_x="b_x" upper2="upper2">2) return scalar CIp2 = 1
  else return scalar CIp2 = 0


mysim 100 200
return list
* One run produces results are effectively identical because the difference in methods is small.

* Let's instead try 1000 simulations.
simulate CIn=r(CIn) CIp1=r(CIp1) CIp2=r(CIp2), reps(10000):  mysim 100 200


* Normally I leave results to be run by blog readers.  However, this took a long time and many repetitions is generally important.
/*   Variable |       Obs        Mean    Std. Dev.       Min        Max
         CIn |      9992    .9377502    .2416208          0          1
        CIp1 |      9992    .9334468    .2492592          0          1
        CIp2 |      9992    .9334468    .2492592          0          1

* It seems that the percentile method is not working as well as the standard error approximation method.

* However, the difference is slight though surprising.  I suspect that I might not be choosing the correct lower and upper range combination.  Of course the difference is slight.  Only 4 draws out of 10,000.

* The percentile method has other values as well.  It can also be used to generate a joint rejection rate. This can be very useful in cracking difficult problems.

Tuesday, October 9, 2012

Optimal Presidential Seeking Behavior

# Imagine that we have a spectrum of voters.  On the left is communist-hippies.  On the fights is fascist-Nazis.

# Zero is perfectly between the two, the "Average voter"

n.voters <- 30000

voter.views <- rnorm(n.voters)

# The command mat.binom will be useful in drawing a matrix of binomial results. I will use a command I programmed in a previous post. (
mat.binom <- function(n,p) {
  bin.mat <- p*0
  for (i in 1:nrow(p)) {
    for (ii in 1:ncol(p)) {
    # This will draw a random binomial for each of the probabilities.
    bin.mat[i,ii] <- rbinom(1,n,p[i,ii])

# There are two parties in this world.  The probability of being in the left party is based on the normal cdf of your world views.  And the probability of being in the right party is based on 1-cdf of your world views.

# Party discrimination greater than zero means that the parties are more discriminated while less than zero results in less discrimination.
alpha = 1.2

# Party inclination
party.inclination <- cbind(alpha*pnorm(voter.views),alpha*(1-pnorm(voter.views)))
# NĂ©ed to make sure probabilities max at 1 and bottom at 0
party.inclination[party.inclination<0 0="0" p="p">party.inclination[party.inclination>1] <- 1

party <- mat.binom(1, party.inclination)

# If the person happens to be in both parties then we will make it so that the person is only in the party for which the person is the most inclined.

party[(party[,1]==party[,2])&(party[,1]==1)&(party.inclination[,1]>party.inclination[,2]),2] <- 0
# Let's generate a graph with three histograms showing the party.
# Par mfrow defines the matrix that the histrograms will be presented as. (3x1) in this case

# Now to presenting the histograms
hist(voter.views[party[,2]==1], main="Registered Democrans", xlab="Voter Views", xlim = range(voter.views), col="blue")
hist(voter.views[party[,1]==1], main="Registered Republicats", xlab="Voter Views", xlim = range(voter.views), col="red")
hist(voter.views[party[,2]==0&party[,1]==0], main="Independents", xlab="Voter Views", xlim = range(voter.views), col="purple")

# Generally speaking there should be relatively few independents (when alpha is not small).

party.factor = 0*1:n.voters # 0 is Independent
party.factor[party[,2]==1] = 1 # 1 is Republicrat
party.factor[party[,1]==1] = 2 # 2 is Democran

# Now let us imagine there is a primary for each party.  The candidates must first win the primary before seeking the overall win.

# There are many interesting ways of setting up a dynamic game in which each candidates positions will be based on each other candidates positions.  However, I am going to do a simpler simplification.

# Each candidate chooses two platforms.  1. The Primary Platform, and 2. the Presidential Platform.

# Each candidate will be specified as one of 19 platforms in the primary and presidential in which their positions are drawn from a normal cdf at every 5% interval (starting at 5% and ending at 95%.  Thus we have:

num.pos <- 19
pos.range <- qnorm((1:num.pos)/(num.pos+1))

# Let's see how well each position does relative to all other positions in each of the primaries.

# First let's define a function that will take two positions and a vector and return a 1 or 0.  1 if the first position wins the most points, zero otherwise.

pos.evaluate <- function(pos1,pos2,voters) {
  # The number who vote for candidate 1 because the distance pos1-voters position is smaller than the distance between that and the alterantive position
  nvote1 <- sum(abs(pos1-voters)-abs(pos2-voters)<0 p="p">  nvote2 <- sum(abs(pos1-voters)-abs(pos2-voters)>=0)
  # Finally, let us return 1 if the number of votes for position 1 is greater than position 2.

# Let's sample the code.
pos.evaluate(1,5, 1:11)
# Someone with a 1 does not win the vote on a uniform range from 1 to 11.
pos.evaluate(6,5, 1:11)
# Someone with a 6 does win the vote. That is because 6 gets votes 6 through 11 while 5 gets 1 through 5.

# Number of wins initially starts at zero. <- data.frame(position = pos.range, democrans = 0*1:num.pos, republicrats=0*1:num.pos,presidential=0*1:num.pos)

# Now let's see which strategy garnishes the most votes in each primary.
for (i in 1:num.pos) {
  for (ii in 1:num.pos) {$democrans[i] <-$democrans[i] + pos.evaluate(pos.range[i],pos.range[ii], voter.views[party.factor==1])$republicrats[i] <-$republicrats[i] + pos.evaluate(pos.range[i],pos.range[ii], voter.views[party.factor==2])$presidential[i] <-$presidential[i] + pos.evaluate(pos.range[i],pos.range[ii], voter.views)
# By looking to the highest number of wins we can see that in each primary, the dominant strategy is to appear much more liberal for democrans and much more conservative for republicrats while both parties want to be solidly in the center during the presidential election.

plot($position,$democrans, type="l", col="red", ylab="# of wins", xlab="Position", main="Results of Position Strategies")
lines($position,$republicrats, col="blue")

# Thus, not just the fringe people on either end of the position spectrum will always feel disappointed by the position taken by candidates when they are attempting to win the national vote.  Because even if the candidates claim to be very liberal or very conservative in the primary they tend towards center when trying to win the total vote.

Monday, October 8, 2012

Simulating Spatial Data

# Spatial data tags are an increasingly recorded for data that is being generating as a result of widescale implementation of GPS technology.

# In this post I will present a simulation in which the population is distributed around a single town center.

# The important characteristic of the population is that their characteristics are autocorrelated.  This can either be in response values or in error terms.

# Let's first specify an imaginary population.

population = 1000

# Specify the location in which the population is centered

center.X = 0
center.Y = 0

# Dispersion, the average distance of an individual from the center = 40

# Now let's generate our population position

# First off.  We would like our population to be distributed in a circle around the center.  So we need to pick an angle in radians.

angle <- pi*runif(population)*2

# This is the distance from the center that each member of the population will be.  If this were set to a constant then the following commands would end up drawing a circle.
distance <- rnorm(population)*

mydata <- data.frame(x=center.X+distance*cos(angle), y=center.Y+distance*sin(angle))

smoothScatter(mydata, nrpoints=0, main="Population Density")

# Let's imagine that there is an unobserved variable called "soil quality" which varies in both the x and y.

mydata$soil.quality <- sin(mydata$x*pi/50+mydata$y*pi/100)+1

# Farm size is random. But on average smaller as the plots get closer to the city center.

mydata$size <- runif(population)/4+abs(distance)/100

# In order to produce some cool graphs we will need to install a new package:

qplot(x, y, data=mydata, size=size, colour = soil.quality, main="Farms tend to be smaller near town")

# Rainfall also is spatially correlated.
mydata$rainfall <- sin(mydata$x*pi/60+mydata$y*pi/160) + sin(mydata$y*pi/60)+1

qplot(x, y, data=mydata, size=size, colour = rainfall, main="Rainfall is also distributed spatially")

# Now, let's imagine some technology usage, say fertilizer.

mydata$fert.use = mydata$rainfall-mydata$soil.quality+mydata$size+rnorm(population) + 4

qplot(x, y, data=mydata, size=size, colour = fert.use, main="Fertilizer use as a result should also be spatially distributed")

# Now let's see if we can't test if we can if fertilizer use is spatially correlated.

# The trick is figuring out what that means.

# I will define it as this, spatial correlation is the test to see if the use of fertilizer by one person is correlated with the use of fertilizer by another person.

# So I need to figure out a way of finding out what the closest person fertilizer usage is.

neighbor.fert.use <- neighbor.x <- neighbor.y <- 0

for (i in 1:population) {
  # We will look at each person i and find the person which is closest.
  # First let us constuct a variable that measures distance.
  # This is that standard Euclidean distance formula.
  distance.from.i = ((mydata$x[i]-mydata$x)^2 + (mydata$y[i]-mydata$y)^2)^.5

  # The following set of nested statements can be somewhat confusing.  Read from the inside statement first.
  # rank() will create a vector of length population that ranks distance from i.
  # Outside of that is a logical operator that will create a vector of length population which is all False except from rank == 2 which is true which means that mydata$fert will be drawn from that single value of rank==2.
  # Finally we assign it to the ith place in the neighbor.fert.use vector.
  neighbor.fert.use[i]<- mydata$fert.use[rank(distance.from.i, ties.method="random")==2]
  neighbor.x[i]<- mydata$x[rank(distance.from.i, ties.method="random")==2]
  neighbor.y[i]<- mydata$y[rank(distance.from.i, ties.method="random")==2]

plot(mydata$x,mydata$y, main="Arrows indicate closest farm")
for (i in 1:population) arrows(x0=mydata$x[i], x1=neighbor.x[i],y0=mydata$y[i], y1=neighbor.y[i],length = 0.075)

# From the plot we can see that the farm matching algorithm specified above appears to be working well.  We can see from that plot that every farm has a farm that is closest to it however there is some farms that are not the closest farm to any other farm (which makes sense).

cor(neighbor.fert.use, mydata$fert.use)
# From the correlation between fertilizer uses between each farm and it's closest neighbor I get a correlation of around .4
# This indicates that fertilizer use is spatially correlated.  However, as a result of how we set up the model this correlation is rather simple to handle.  It is not because a neighbor uses fertilizer that a farmer will use fertilizer but rather the effect of unobservables which is driving the use of fertilizer.  Mainly rainfall variation and soil quality variation.

# Imagine that we observe rainfall but not soil quality.  Let's see how well we can predict fertilizer usage.

# first let's remember how fertilizer use is calculated:
# mydata$fert.use = mydata$rainfall-mydata$soil.quality+mydata$size+rnorm(population) + 4

# We can see that the coefficient on rainfall is too small.  This is because within the construction of this sample rainfall is correlated with soil quality.

# Let's try to include technology choice of closest farm neihbor as a proxy for the spatial correlation of soil quality.

# We can see that the coefficient on rainfall is even smaller.  This indicates that controlling for the neighbors choice is not helping.  Why is that?

# Sure, controlling for the neighbors fertilizer use is controlling for some of the soil quality variable.  However, what else is in the neighbor's decision = rainfail+size.  Since size is correlated spatially and rainfall is correlated spatially controlling for the neighbor's choice in effect controls for some of the effects of the explanatory variables.  Thus, both rain and size variables suffer as a result of controlling for the nearest neighbor's technology choice.