Thursday, January 31, 2013

US Daily Gun Deaths R Animation - Sandy Hook

R script

# Listenning to NPR about gun deaths in the US got me thinking.

# Find the article here:

# Let's create an animation of gun deaths since Dec 14, 2012
gun.deaths <- getcsv.php="" gun-deaths="" http:="" p="" read.csv="""">

# I first need to change the dates from days to days for later reference
  # First I will make a table of the dates wich will include a count = table(gun.deaths$date)
    # Calculate Cumulative Dead
    cum.deaths =
    for (i in 1:(length(cum.deaths)-1)) cum.deaths[i+1] = cum.deaths[i][i+1]

  plot(, main="Daily Total US Deaths by Gun", ylab="Death count")

  # Number of days in our data (constantly updated every time we run the code
  ndays = length(

  # This complicated bit of code will force the dates which are currently factor variables into string variables
  gun.deaths$dates = t(data.frame(lapply(gun.deaths$date, as.character), stringsAsFactors=FALSE))

  # Create an empty factor to be filled
  gun.deaths$day = NA
  # This command loops through all of the days and checks if each individual entry in the data from is from that day.
  # If it is, then it assigns that day to the day entry.
  for (i in 1:ndays) gun.deaths$day[gun.deaths$dates==names([i]] = i

# We will cut the data into different age categories

# Some individuals have ages missing.  We will code these as category 0.
  gun.deaths$age[$age)] <- 999="" p="">
  gun.deaths$ = ""
  gun.deaths$[gun.deaths$age<12 children="" p="">  gun.deaths$[gun.deaths$age>11 & gun.deaths$age<20 p="" teens="">  gun.deaths$[gun.deaths$age>=20 & gun.deaths$age<30 adults20="" p="">  gun.deaths$[gun.deaths$age>=30 & gun.deaths$age<40 adults30="" p="">  gun.deaths$[gun.deaths$age>=40 & gun.deaths$age<65 madults="" p="">  gun.deaths$[gun.deaths$age>65 & gun.deaths$age<999 nbsp="" oadults="" p="">
# Adjust the latitude and logitude variables to account for a rescaling of the graph later
# as well as some noise which will help identify when there are multiple deaths in the same city.
  nll = length(gun.deaths$lng)
  gun.deaths$x = ((gun.deaths$lng+125)/(60))*ndays+rnorm(nll)*.006
  # For the graph that will be produced the 20 year olds have the highest likelihood of death.
  # Thus they will provide the y upper limit.
    ymax = ceiling(sum(gun.deaths$"adults20")/50)*50
  gun.deaths$y = ((gun.deaths$lat-24)/(27))*ymax+rnorm(nll)*.06

# Generate subsets of the data.
  children         = gun.deaths[gun.deaths$ == "children",]
  teens            = gun.deaths[gun.deaths$ == "teens",]
  adults20         = gun.deaths[gun.deaths$ == "adults20",]
  adults30         = gun.deaths[gun.deaths$ == "adults30",]
  madults          = gun.deaths[gun.deaths$ == "madults",]
  oadults          = gun.deaths[gun.deaths$ == "oadults",]

# This will count cumulative deaths by data subset = table(children$date)
  cum.children =
  for (i in 1:(length(cum.children)-1)) cum.children[i+1] = cum.children[i][i+1] = table(teens$date)
  cum.teens =
  for (i in 1:(length(cum.teens)-1)) cum.teens[i+1] = cum.teens[i][i+1] = table(adults20$date)
  cum.adults20 =
  for (i in 1:(length(cum.adults20)-1)) cum.adults20[i+1] = cum.adults20[i][i+1] = table(adults30$date)
  cum.adults30 =
  for (i in 1:(length(cum.adults30)-1)) cum.adults30[i+1] = cum.adults30[i][i+1] = table(madults$date)
  cum.madults =
  for (i in 1:(length(cum.madults)-1)) cum.madults[i+1] = cum.madults[i][i+1] = table(oadults$date)
  cum.oadults =
  for (i in 1:(length(cum.oadults)-1)) cum.oadults[i+1] = cum.oadults[i][i+1]

# This counts the total deaths = cum.adults20+cum.adults30+cum.teens+cum.children+cum.madults+cum.oadults

# In order to get a map of the US we will need to install the library maps

# Rescale the US map to fit in our data = map('state', plot=F)$x = ((map('state', plot=F)$x+125)/(60))*ndays$y = ((map('state', plot=F)$y-24)/(27)*ymax)

# Static Plot - Final image
i=ndays, height=10)
plot(cum.adults20, type="n", ylim=c(0,ymax),
     ylab="Cumulative Deaths by Age Group" ,
     main="US gun Deaths Since Dec 14, 2012",  cex.main=1.5,
           "   ", toString([i]), "Killed",
           "  ", toString(cum.deaths[i]), "Dead"))
  grid(lwd = 2) # grid only in y-direction

  # Draw the US state map
  lines(, type="l")

  # Place dots for every death
  lines(adults20$x, adults20$y, type="p", col="blue",pch=16, cex=.5)
  lines(adults30$x, adults30$y, type="p", col="purple",pch=16, cex=.5)
  lines(madults$x, madults$y, type="p", col="orange",pch=16, cex=.5)
  lines(oadults$x, oadults$y, type="p", col=colors()[641],pch=16, cex=.5)
  lines(teens$x, teens$y, type="p", col="red",pch=16, cex=.5)
  lines(children$x, children$y, type="p", col=colors()[258],pch=16, cex=.5)

  lines(cum.teens,    type="l", col=gray(.9), lwd=2)
  lines(cum.children, type="l", col=gray(.9), lwd=2)
  lines(cum.adults20, type="l", col=gray(.9), lwd=2)
  lines(cum.adults30, type="l", col=gray(.9), lwd=2)
  lines(cum.madults,  type="l", col=gray(.9), lwd=2)
  lines(cum.oadults,  type="l", col=gray(.9), lwd=2)

  lines(cum.teens,    type="l", col="red", lwd=1, cex=.5)
  lines(cum.children, type="l", col=colors()[258], lwd=1, cex=.5)
  lines(cum.adults20, type="l", col="blue", lwd=1, cex=.5)
  lines(cum.adults30, type="l", col="purple", lwd=1, cex=.5)
  lines(cum.madults,  type="l", col="orange", lwd=1, cex=.5)
  lines(cum.oadults,  type="l", col=colors()[641], lwd=1, cex=.5)

  lines(c(ndays,ndays), c(-15,ymax-27), lwd=2, lty=2)

  legend("topright", "Today", cex=1.5, bty="n")

legend(0, ymax+20, expression(Children, Teens, "Adults 20s", "Adults 30s",
   "Adults 40-65", "Adults 65+"), lty = 1:1, col = c(colors()[258], "red",
   "blue", "purple", "orange", colors()[641]),  adj = c(0, 0.6), ncol = 6,
   cex=.75, lwd=2

# Sequential Graphic - Animation

# Animation package must be installed

deaths.animation = function() {

for (i in c(1:ndays, rep(ndays,10))) {
  # Adding the rep(ndays,10)) causes the animation to wait for the equivalent of 10 days after ending.

plot(cum.adults20, type="n", ylim=c(0,ymax),
     ylab="Cumulative Deaths by Age Group" ,
     main="US gun Deaths Since Dec 14, 2012", cex.main=2,
     xlab=paste(names([i],"day",toString(i), "   ", toString([i]), "Killed", "  ", toString(cum.deaths[i]), "Dead"))
  # Draw the US state map
  grid(lwd = 2) # grid only in y-direction

  # Place dots for every death
  lines(, type="l")

  t.adults20 = adults20[adults20$day  t.adults30 = adults30[adults30$day  t.madults  = madults[madults$day  t.oadults = oadults[oadults$day  t.teens = teens[teens$day  t.children = children[children$day
  lines(t.adults20$x, t.adults20$y, type="p", col="blue",pch=16, cex=1.5)
  lines(t.adults30$x, t.adults30$y, type="p", col="purple",pch=16, cex=1.5)
  lines(t.madults$x, t.madults$y, type="p", col="orange",pch=16, cex=1.5)
  lines(t.oadults$x, t.oadults$y, type="p", col=colors()[641],pch=16, cex=1.5)
  lines(t.teens$x, t.teens$y, type="p", col="red",pch=16, cex=1.5)
  lines(t.children$x, t.children$y, type="p", col=colors()[258],pch=16, cex=1.5)

  p.adults20 = adults20[adults20$day==i,]
  p.adults30 = adults30[adults30$day==i,]
  p.madults  = madults[madults$day==i,]
  p.oadults = oadults[oadults$day==i,]
  p.teens = teens[teens$day==i,]
  p.children = children[children$day==i,]

  lines(p.adults20$x, p.adults20$y, type="p", col="blue",pch=1, cex=5)
  lines(p.adults30$x, p.adults30$y, type="p", col="purple",pch=1, cex=5)
  lines(p.madults$x, p.madults$y, type="p", col="orange",pch=1, cex=5)
  lines(p.oadults$x, p.oadults$y, type="p", col=colors()[641],pch=1, cex=5)
  lines(p.teens$x, p.teens$y, type="p", col="red",pch=1, cex=5)
  lines(p.children$x, p.children$y, type="p", col=colors()[258],pch=1, cex=5)

  lines(cum.teens[1:i],    type="b", col=gray(.9), lwd=7)
  lines(cum.children[1:i], type="b", col=gray(.9), lwd=7)
  lines(cum.adults20[1:i], type="b", col=gray(.9), lwd=7)
  lines(cum.adults30[1:i], type="b", col=gray(.9), lwd=7)
  lines(cum.madults[1:i],  type="b", col=gray(.9), lwd=7)
  lines(cum.oadults[1:i],  type="b", col=gray(.9), lwd=7)

  lines(cum.teens[1:i],    type="b", col="red", lwd=2)
  lines(cum.children[1:i], type="b", col=colors()[258], lwd=2)
  lines(cum.adults20[1:i], type="b", col="blue", lwd=2)
  lines(cum.adults30[1:i], type="b", col="purple", lwd=2)
  lines(cum.madults[1:i],  type="b", col="orange", lwd=2)
  lines(cum.oadults[1:i],  type="b", col=colors()[641], lwd=2)

  lines(c(ndays,ndays), c(-15,ymax-27), lwd=2, lty=2)

  legend("topright", "Today", bty="n", cex=2.5)

legend(0, ymax+20, expression(Children, Teens, "Adults 20s", "Adults 30s",
   "Adults 40-65", "Adults 65+"), lty = 1:1, col = c(colors()[258], "red",
   "blue", "purple", "orange", colors()[641]),  adj = c(0, 0.6), ncol = 6,
   cex=1.5, lwd=1
# deaths.animation()

ani.options(ani.width=1200, ani.height=900)
# In order for the saveGIF function to work you need install Image Magic Display (

# Map Updated Jan-31-2012


Wednesday, January 30, 2013

Return Text

do file

* The following program will create a tiny program that will send to the display input text.
cap program drop rt
program define rt
    local caller : di _caller()
    * This will parse the `0' text so that there is a before the colon and an after the colon
capt _on_colon_parse `0'

* This will save the after the colon in the macro s(after)
local command = s(after)

* Now we will display the command input
di as input `"`command'"'

* Next we will run the command.

rt: di "adsf"
rt: clear
rt: set obs 1000
rt: gen a = "324"
rt: sum

* Without rt the following commands will see only the output.
forv i = 1/5 {
 set obs `i'000
 gen a = rnormal()

* With rt prefix we will see both the command as well as the output.
forv i = 1/5 {
 rt: clear
 rt: set obs `i'000
 rt: gen a = rnormal()
 rt: sum

Tuesday, January 29, 2013

Stata Syntax Command's Syntax

do file

* When programming Stata "programs" (loosely known as functions in R) it is often times very useful to use the "syntax" command to parse arguments.

* This command takes the arguments input into the command and uses them to control the function of the command.

* For example:

cap program drop example1
program define example1, rclass

  syntax [anything]
  * The [] tell stata that this argument is optional
  * The anything tells Stata that this argument can be of any form excluding a comma (,)

  di "`anything'"

  * The local function 1 returns the first argument in the command
  di "`1'"

  * While 2 returns the second
  di "`2'"


example1 hello world
example1 *
example1 2123 123213 hello world

* We can see the display changes as a result of how the arguments are structured

* Programmers can force the user to input only inputs of a particular structure.

* For instance:

cap program drop example2
program define example2, rclass

  syntax [varlist]
  * The [] tell stata that this argument is optional
  * The anything tells Stata that this argument can be of any form excluding a comma (,)

  di "`varlist'"

  * The local function 1 returns the first argument in the command
  di "`1'"

  * While 2 returns the second
  di "`2'"

  * Variables targetted in this manner can be modified:
  foreach v of varlist `varlist' {
    rename `v' r_`v'_rename



gen hello = "content of variable hello"
gen world = "content of variable world"

example2 hello world
* Which works because we created the variables hello and world

example2 hello world
* This does not work because the above command already renamed the variables hello and world

* There is a lot of trickiness involved in effectively programming the syntax command and unfortunately it can be very frustrating because the only feedback stata gives is "invalid syntax" when something goes wrong.

* Also, unfortunately stata does not have many actual coding examples in which syntax is displayed.

* However, I think the following example will go a long way to answering many questions.

* This is the start of a command that will simulate data and test different estimators on that data.

* It will also set default parameters if the user does not input parameters into the simulation directly.

* Before executing the command, the command will display to the user the parameter set to be estimated

cap program drop example3
program define example3, rclass

  syntax [anything] [, NGrp(numlist >0 integer) NInd(numlist >0 integer) ///
                    Rgrp(numlist integer) Udist(string) NEXogenous NENdogenous]
  * The first few letters being capital allows the user to abreviate the option to just those two or three letters.

  * Set defaults if the user has not defined any.
  if "`ngrp'"==""   local ngrp=50
  if "`nind'"==""   local nind=20
  if "`rgrp'"==""   local rgrp=0
  if "`udist'"==""  local udist="rnormal()*5"

  * This sets the local parameter exogenous to be equal to 1 by default unless the user specifies ontherwise
  local exogenous=1
  if "`nexogenous'"!="" local exogenous=0

  local endogenous=1
  if "`nendogenous'"!="" local endogenous=0

  di as text _newline "Simulation Paramaters"
  di "Number of groups: `ngrp'"
  di "Average number of individuals in each group: `nind'"
  di "Group size random (0 no, 1 yes): `rgrp'"
  di "Error distribution: `udist'"

  di _newline _c "Models estimated: "
  if `exogenous' == 1 di _c "EXOGENOUS"
  if `endogenous' == 1 di " ENDOGENOUS"
  if  `exogenous' == 0 & `endogenous' == 0 di "No model estimated!"


  * The actual simulation


  * End of the command

  example3 , ngrp(454)
  example3 , ng(454)
  * See how the abreviation is possible

  example3 , nex  udist(runifrom()*3)
  example3 , nex  udist(runifrom()*3) r(1)

  example3 , nex nen

* It is very useful to check out the "help syntax" file.

Monday, January 28, 2013

HLM comparison with OLS - 2 levels, random coefficient on constant

do file

* xtmixed is capable of estimating a variance for the random effects on multiple levels.

* random effects are normalized to have mean 0.

* Our initial model is y_ij = bij + xij*B1 + uij

* With bij = B0 + v0j

* We can represent our model as:

* y_ij = B0 + v0j + xij*B1 + uij = B0 + xij*B1 + v0j + uij

* With v0j + uij an unobserved error term.

* Let's see an example.


set obs 60
* We will have 60 different schools (level 2 indexes)

gen school=_n

gen v0j = rnormal()*2
* Generate some school effect with standard deviation = 2

expand 200
* There are 200 individuals in each school.

gen x1 = rnormal()
* Each individual has a continuous predictor (say academic performance).

gen u = rnormal()*5
* each individual has an unobserved error.

gen y = 1 + x1*2 + v0j + u
* Each individual's outcomes are a function of individual ability plus the school effect plus individual error.

xtmixed y x1 || school:
* Now let's estimate both the effect of x1 on y as well as the effect of school variation on predicting outcomes.

* The standard deviation of the estimate of school effect is close to 2 which is the true so xtmixed is working well.

* In principal we can attempt to estimate the same thing using OLS with dummies.

qui tab school, gen(sch_id)
* Generate a set of dummy indicator variables for each school.

* School 1 is left out of the estimation as the reference school.
reg y x sch_id2-sch_id60

predict u_hat, resid

* We would like to calculate the standard deviations of the school estimates in order to compare with the xtmixed standard deviation estimates.

* Since we left school 1 out of the estimation it is considered a 0 effect.  All other school effect estimates are relative to school 0.
gen sch_coef_est = 0 if school==1

* The following code will save the school estimates to the coefficient variable that can then be used to find the standard deviation of the estimated school effects.
forv i=2/60 {
  qui replace sch_coef_est = _b[sch_id`i'] if school==`i'

sum sch_coef_est u

* We can see that the standard deviation of sch_coef_est is similar to that estimated by xtmixed as well as the standard deviation of the residuals.

* So both methods seem to be effective at estimating the variance of the school level effect.

corr sch_coef_est v0j

* We can also see that the OLS dummy variable method has produced individual school effects that highly correlated with the true school effects.

* In summary both methods seem to work well though we would probably favor the xtmixed (hierarchical linear model) estimator if we did not care about the actual individual school estimates because it provides an estimator that maintains more degrees of freedom.

Thursday, January 24, 2013

Choosing an appropriate level of analysis

do file

* Often times we are confronted with having data that has many potential levels of interest.

* It is not always clear what the best level of analysis is in that case.

* Imagine that you have district average achievement rates on a statewide exam.

* You also have individual characteristics of individual students in your population such as age, gender, socioeconomic status.

* Now your question is, what is the best way to predict district pass rates? (You do not have individual passing values).

* The difficulty is that you dependent variable (pass rates) is on an aggrogate level while your independent variables are on a individual level.

* What should you indist_de as your explanatory variables?
* a. Individual level?
* b. Mean by district or median or some other aggrogate statistic?

* The following simulation will attempt to shed some light on this question though I do not feel I can really resolve it in any way without more thought.

* I will structure the simulation directly into a "program" structure in order to speed up the monte carlo process which I will use to evaluate the estimators.

cap program drop ols_levels
program define ols_levels, rclass


  set obs 50
  * We have 50 districts

  gen dist_id = _n

  gen dist_1 = rnormal()
  gen dist_2 = rnormal()
  * Districts have some continuous observable heterogeniety

  expand `3'
  * We have parameter 3 # of students in each district

  * The first two arguments of the ols_levels command will specify the correlation between the individual chacteristics and the aggrogate characteristics.
  gen x1 = rnormal()+dist_1*`1'
  gen x2 = rnormal()+dist_2*`2'
  * The students each have some individual characteristics which are correlated with the district level heterogeniety.

  gen u = rnormal()*5
  * Each student has some unobserved achievement shock.

  * Now the next step is the one I had the most difficulty with.
  * We need to convert individual achievement into average achievement.
  * That is done by first generating individual achievement:

  gen y_ind = dist_1+dist_2+.5*x1-.75*x2+u
  * This is the value we would like to observe.
    label var y_ind "Individual achievement"

  * However, what we instead observe is
  bysort dist_id: egen y = mean(y_ind)
      label var y "District average achievement"

  * Now we ask the question, how do we estimate the effect of our explanatory variables on observed average achievement?

  * One suggestion would be to estimate observable explanatory variables on the observable dependent variable:
  reg y dist_1 dist_2 x1 x2, cluster(dist_id)
    * I cluster at the district id since normally the errors would be correlated on the district level.
    return scalar Adist_1=_b[dist_1]
    return scalar Adist_2=_b[dist_2]
    return scalar Ax1=_b[x1]
    return scalar Ax2=_b[x2]

  * Alternatively we could argue that it does not make sense to use disaggrogated explanatory variables when the dependent variable is aggrogated since the error that we might hope to explain through having the additional variance in the explanatory variables cannot be correlated with the disaggrogated explanatory variables.

  bysort dist_id: egen x1_mean = mean(x1)
  bysort dist_id: egen x2_mean = mean(x2)

  reg y dist_1 dist_2 x1_mean x2_mean, cluster(dist_id)

    return scalar Bdist_1=_b[dist_1]
    return scalar Bdist_2=_b[dist_2]
    return scalar Bx1_mean=_b[x1_mean]
    return scalar Bx2_mean=_b[x2_mean]

  * As a matter of reference let's see what happens when we only include the district level characteristics.
  reg y dist_1 dist_2, cluster(dist_id)

    return scalar Cdist_1=_b[dist_1]
    return scalar Cdist_2=_b[dist_2]
    return scalar Cx1=0
    return scalar Cx2=0

* Run the simulation one time with no correlation between dist_ variables and individual variables.
ols_levels 0 0 500

simulate Adist_1=r(Adist_1) Adist_2=r(Adist_2) Ax1=r(Ax1) Ax2=r(Ax2) ///
         Bdist_1=r(Bdist_1) Bdist_2=r(Bdist_2) Bx1_mean=r(Bx1_mean) Bx2_mean=r(Bx2_mean) ///
         Cdist_1=r(Cdist_1) Cdist_2=r(Cdist_2) Cx1=r(Cx1) Cx2=r(Cx2) ///
         ,reps(500): ols_levels 0 0 500

/*  Variable |       Obs        Mean    Std. Dev.       Min        Max
     Adist_1 |       500    .9988695    .0385498   .8776859   1.122928
     Adist_2 |       500    .9993321    .0360491   .8834813   1.105001
         Ax1 |       500    .0012883    .0017756  -.0043576   .0063994
         Ax2 |       500   -.0016733    .0018259  -.0076018   .0041119
     Bdist_1 |       500    .9985228    .0389298   .8760694   1.122633
     Bdist_2 |       500    .9993324    .0361465   .8882689   1.105491
    Bx1_mean |       500    .5431928    .7500353  -2.483676   2.507016
    Bx2_mean |       500   -.6719179    .7527237  -3.020345   2.034019
     Cdist_1 |       500    .9988699    .0385515   .8776129   1.122928
     Cdist_2 |       500    .9993329    .0360523   .8834765   1.104997
         Cx1 |       500           0           0          0          0
         Cx2 |       500           0           0          0          0 */

* We can see that though we know that x1 and x2 causes individual achievement, they do not provide a good basis for estimating average achievement.

* However, the mean x1 and mean x2 provide reasonable esimates of the effect of x1 and x2 on achievement.

* Can reason this through by thinking about how the average can be represented.

* y_mean = 1/N * sum(y_ind) = sum(dis_1)/N*b1 + sum(dis_1)/N*b2 + sum(x1)/N*b3 + sum(x2)/N*b4 + sum(u)/N

* When seen this way it seems clear that the average of the explanatory variables should be a better predictor than the variable on the individual level.

* But why does the individual value fail so badly?

* I believe it is because the individual value only succeeds at all because it is correlated with the average explanatory variable.

* We can test this hypothesis:

simulate Adist_1=r(Adist_1) Adist_2=r(Adist_2) Ax1=r(Ax1) Ax2=r(Ax2) ///
         Bdist_1=r(Bdist_1) Bdist_2=r(Bdist_2) Bx1_mean=r(Bx1_mean) Bx2_mean=r(Bx2_mean) ///
         Cdist_1=r(Cdist_1) Cdist_2=r(Cdist_2) Cx1=r(Cx1) Cx2=r(Cx2) ///
         ,reps(500): ols_levels 0 0 2


/*  Variable |       Obs        Mean    Std. Dev.       Min        Max
     Adist_1 |       500    .9988851    .5417397  -.8754451   2.458699
     Adist_2 |       500    .9978838    .5054726  -.4637527   2.396384
         Ax1 |       500    .2299528    .3559766  -.9725819   1.328953
         Ax2 |       500   -.3321666    .3767386  -1.615652   .7381132
     Bdist_1 |       500    .9953491    .5473069  -.8368359   2.489341
     Bdist_2 |       500    .9980698    .5141795  -.3417804   2.519113
    Bx1_mean |       500    .4715293    .7382465  -2.060023   2.356781
    Bx2_mean |       500   -.6794988    .7577947  -2.587295   1.302411
     Cdist_1 |       500    1.003058     .544771  -.9002677   2.433167
     Cdist_2 |       500    .9979357     .510302  -.6022506   2.372237
         Cx1 |       500           0           0          0          0
         Cx2 |       500           0           0          0          0 */

* We can see that now, while the individual level explanatory variables are not as effective as the district level ones, they provide estimates which are closer to the true parameter than the previous simulation.

* Finally, let's look at what happens when there is correlation between district level values and individual level explanatory variables.

simulate Adist_1=r(Adist_1) Adist_2=r(Adist_2) Ax1=r(Ax1) Ax2=r(Ax2) ///
         Bdist_1=r(Bdist_1) Bdist_2=r(Bdist_2) Bx1_mean=r(Bx1_mean) Bx2_mean=r(Bx2_mean) ///
         Cdist_1=r(Cdist_1) Cdist_2=r(Cdist_2) Cx1=r(Cx1) Cx2=r(Cx2) ///
         ,reps(500): ols_levels .45 -.45 200


/*  Variable |       Obs        Mean    Std. Dev.       Min        Max
     Adist_1 |       500    1.226031    .0521453   1.067398   1.384503
     Adist_2 |       500    1.334787    .0542503   1.155632    1.48664
         Ax1 |       500    .0021011    .0035268  -.0094846   .0116934
         Ax2 |       500   -.0036379    .0034786  -.0146368   .0058956
     Bdist_1 |       500    1.030821    .3416801   .1528734   1.992299
     Bdist_2 |       500    .9902028    .3403735  -.0965584   2.060089
    Bx1_mean |       500    .4359488    .7501554  -1.634525   2.529116
    Bx2_mean |       500   -.7689256    .7431586  -3.064452   1.332662
     Cdist_1 |       500    1.226977    .0520978    1.06962   1.383977
     Cdist_2 |       500    1.336426    .0542272   1.157392   1.485595
         Cx1 |       500           0           0          0          0
         Cx2 |       500           0           0          0          0 */

* Now we can see the situation has changed.

* Not only are the individual level explanatory variables not providing good estimates but they are actually causing the district level estimates to be biased because they are failing to control for the correlation between district mean explanatory variables and district mean dependent values.

Wednesday, January 23, 2013

Sticky Probit - clustered bootstrapped standard errors

do file

* There exist numerous estimators which can be used for a variety of special circumstances.

* If these estimators have been developed by an econometrician, then the econometrician has probably done the hard work of proving consistency (or unbiasedness) and estimated an asymptotically valid standard error estimator under well defined conditions.

* However, sometimes we want to design our own estimator and before doing the hard work of figuring out its theoretical properties, we would first like to see how well it works with simulated data.

* This post will go through one example of how this process may work.

* First let us imagine that I have a new estimator which is a combination of a linear probability model and a probit model.  I will call this estimator a sticky probit.  It is defined as prob(Y(t)=1)=ZA + (1- 1(k x n) A)* NormalCDF(XB) where Z is a K by N matrix of binary explanatory variables.  X is a L by N matrix of explanatory variables as well.  This formulation will become clear in a moment.

* In the event that Z only has one variable then prob(Y(t)=1)=ZA + (1-A)* NormalCDF(XB).  I was thinking that an interesting case would be when Z=Y(t-1).  Thus it would allow for an estimation easily interpretable time dependent outcomes.  For instance, imagine in the “real” world that probability of renting or owning a home was 90% dependent entirely upon what your previous period’s decision was.

* Let’s see this in action.

* p = zA + (1-zA)*normal(xB)

* First let’s program an MLE estimator to solve our “sticky probit”

set more off

cap program drop stickyprobit
program define stickyprobit
  * Tell Stata that this code is written in version 11
  version 11
  args ln_likehood xB zA
  qui replace `ln_likehood' = ln(`zA' + (1-`zA'[1])*normal(`xB')) if $ML_y==1
  qui replace `ln_likehood' = ln(1-(`zA' + (1-`zA'[1])*normal(`xB'))) if $ML_y==0
  * This only works if we are only estimating a single coefficient in zA.
  * If anybody knows how to generalize the above statement to allow for any number of parameters in zA to be estimated, please post it was a comment.


* Let's generate some sample data


* We have 500 panel level observations
set obs 500

gen id=_n

* We have some random continuous explanatory variables
gen x1 = rnormal()
gen x2 = rnormal()

* Let's first just think about A being a single constant.  So A=.5 is telling us that 50% of the decision to buy or rent this period will be dependent upon the decision the previous period.
gen A = .5

* The decision to rent at t=0 we will say is unobserved but that it enters the probability of renting this period in the form of the binomail draw.
gen p=A*rbinomial(1,.5) + (1-A)*normal(.5*x1 - .5*x2)

* Now we simulate the decision to rent at time period t=1
gen rent=rbinomial(1,p)

* We expand the data to make 150 time periods
expand 50
bysort id: gen t=_n

* Now we generate time varying explanatory variables
replace x1 = rnormal()  if t>1
replace x2 = rnormal()  if t>1

* This part of the data generating process is slightly complicated because in each period the decision to rent or buy is dependent upon the previous period's decision.
forv i=2/50 {
  qui replace p=A*rent[_n-1] + (1-A)*normal(.5*x1 - .5*x2) if t==`i'
  * The actual decision to rent or not is a binomial draw with the probability of 1 equal to p
  qui replace rent=rbinomial(1,p) if t==`i'

**** Done with simulation

* Now let's estimate

gen rent_lag = rent[_n-1]  if t>1

probit rent rent_lag x1 x2
* This is the benchmark

ml model lf stickyprobit (probit: rent=x1 x2, noconstant) (zA:rent_lag, noconstant)
ml maximize, iterate(10)

* I set iterate to 10 because the ml model has difficulty converging on the solution.
* I think this is because as A gets close to .5 then there is an increasing frequency in which infeasible values are evaluated (those are Ln(p<0 ln="" or="" p="">
* Because we know there there is serial correlation of the errors then we cannot trust that standard errors from the maximum likelihood estimator.

* Thus we need to bootstrap clustering at the observation level.

* In order to do this we will need to write a short program

cap program drop bsstickyprobit
program define bsstickyprobit

  ml model lf stickyprobit (probit: rent=x1 x2, noconstant) (zA:rent_lag, noconstant)
  ml maximize, iterate(10)
  * I will set iterate to 10 to speed up the boostrapped standard errors.

bs, cluster(id): bsstickyprobit

/* (running bsstickyprobit on estimation sample)

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Bootstrap results                               Number of obs      =     24500
                                                Replications       =        50
                                                Wald chi2(2)       =     11.92
Log likelihood = -12748.652                     Prob > chi2        =    0.0026

                                    (Replications based on 500 clusters in id)
             |   Observed   Bootstrap                         Normal-based
        rent |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
probit       |
          x1 |   .4972777    .145441     3.42   0.001     .2122186    .7823369
          x2 |  -.5287849   .1586004    -3.33   0.001    -.8396359   -.2179339
zA           |
    rent_lag |    .497552   .1232098     4.04   0.000     .2560652    .7390388
Warning: convergence not achieved

* Finally, let's compare our boot strapped errors with simulated errors.

cap program drop simstickyprobit
program define simstickyprobit, rclass

  set obs 150
  gen id=_n
  gen x1 = rnormal()
  gen x2 = rnormal()
  gen A = .5
  gen p=A*rbinomial(1,.5) + (1-A)*normal(.5*x1 - .5*x2)
  gen rent=rbinomial(1,p)
  expand 150
  bysort id: gen t=_n
  replace x1 = rnormal()  if t>1
  replace x2 = rnormal()  if t>1
  forv i=2/150 {
    qui replace p=A*rent[_n-1] + (1-A)*normal(.5*x1 - .5*x2) if t==`i'
    * The actual decision to rent or not is a binomial draw with the probability of 1 equal to p
    qui replace rent=rbinomial(1,p) if t==`i'
  **** Done with simulation
  gen rent_lag = rent[_n-1]  if t>1

  ml model lf stickyprobit (probit: rent=x1 x2, noconstant) (zA:rent_lag, noconstant)
  ml maximize, iterate(20)

  return scalar b1=[probit]_b[x1]
  return scalar b2=[probit]_b[x2]
  return scalar A=[zA]_b[rent_lag]


return list
* The list of returned values look good.

simulate A=r(A) b1=r(b1) b2=r(b2), rep(50): simstickyprobit


/*  Variable |       Obs        Mean    Std. Dev.       Min        Max
           A |        50    .3727054    .1289269     .23698   .5096082
          b1 |        50    .3494328    .1601192   .1524611   .5577423
          b2 |        50    -.348098    .1563565  -.5363456   -.145098

* We can see the standard deviation estimates from the clustered bootstrap are right on.

* It is important to note that the estimator is strongly biased.

Tuesday, January 22, 2013

Use Expand to Manually Reshape Data

do file

* Imagine that you have a wide data set that you would like to convert to a long data set.

* Your data set is structured in the following manner.

set obs 1000

gen id = _n

forv i=1/4 {
  gen var1_`i' = rnormal()
  gen var2_`i' = rnormal()
  gen var3_`i' = rnormal()
  gen var`=`i'+3' = rbinomial(1,.5)

order _all, alphabetic

* You should now have data in which there are three variables 1,2, and 3 which have four different records and four variables var4-var7 which are time invariant.

* We can use expand to reshape the data.

expand 4

* Now we have four instances of each of the original observations.

* We want now to create variables var1 var2 var3 and var4 which represent each different values for each panel period.

* First let's sort and label our different panel periods.

* Clump all of the same id's together
bysort id: gen year=_n

* Now we just copy our variables so that they only occur in the appropriate time period.

* I will do this manually though it would be very easy to do by macros.

gen var1 = var1_1 if year == 1
replace var1 = var1_2 if year == 2
replace var1 = var1_3 if year == 3
replace var1 = var1_4 if year == 4

gen var2 = var2_1 if year == 1
replace var2 = var2_2 if year == 2
replace var2 = var2_3 if year == 3
replace var2 = var2_4 if year == 4

gen var3 = var3_1 if year == 1
replace var3 = var3_2 if year == 2
replace var3 = var3_3 if year == 3
replace var3 = var3_4 if year == 4

* Now we just need to drop the extra variables.

drop var?_?

* This can also be accomplished with the reshape command:

set obs 1000

gen id = _n

forv i=1/4 {
  gen var1_`i' = rnormal()
  gen var2_`i' = rnormal()
  gen var3_`i' = rnormal()
  gen var`=`i'+3' = rbinomial(1,.5)

order _all, alphabetic

reshape long var1_ var2_ var3_, i(id)

* The tricky thing to remember with reshape is that it requires exact syntax on your variables to be converted.

* That is, reshape is looking for a number at the end of each variable name.  This number it will turn into the j variable.

Thursday, January 17, 2013

model selection tests

do file

* It is typically simple to test for superiority of one model over another model when using nested models.

* This can be accomplished in linear regression through use of a Wald test.

* For example:

* Let's say you have one hypothesis that H0: y = x1*B1 and another hypothesis that H1: y = x1*B1 + x2*B2 + x3*B3.

* We can see that H0 is nested within H1.  That is H0 = H1 when B2=B3=0

* Let's generate some sample data:

set obs 10000

gen x1 = rnormal()
gen x2 = rnormal()
gen x3 = rnormal()

gen u = rnormal()*2

gen y1 = x1*2 + u

* Under this case the null is true.

reg y1 x?

test x2=x3=0
* This test should reject the null about %5 of the time when the null is at an alpha = .05

gen y2 = x1*2 - x2*1 + x3*1 + u

reg y2 x?
* In Stata this test is very easy to perform.

test x2=x3=0
* This test should almost always reject the null since the effects of x2 and x3 on y is large.

* For example.

gen y_p1 = normal(x1*.5)
gen y_p2 = normal(x1*.5+x2*1-x3*.2)

gen Y1 = rbinomial(1,y_p1)

probit Y1 x1 x2 x3

test x2=x3=0

* When looking at nested maximum likelihood estimators the likelihood ratio test can be used to test if the difference in explanatory powers of the models is statistically significant.

* The estimator is 2(lk0-lkA) distributed chi-squared with degrees of freedom equal to the difference in degrees of freedom of the two models.

* Let's see this in action:

probit Y1 x1 x2 x3
* Now let's save the degrees of freedom and log likelihood.

global df0 = e(df_m)
global ll0 = e(ll)

* Now for the other model

probit Y1 x1
* Now let's save the degrees of freedom and log likelihood.

global df1 = e(df_m)
global ll1 = e(ll)

* Now let's get the chi-squared statistic.

di "Chi-squared =" 2*(${ll0}-${ll1}) " with DF = " ${df0}-${df1}

di "P value = " 1-chi2(${df0}-${df1},2*(${ll0}-${ll1})) " probability"

* Notice how very similar this LR test is to the previous Wald test in this case.

* Alternatively when the models are actually different:
gen Y2 = rbinomial(1,y_p2)

probit Y2 x1 x2 x3

test x2=x3=0

probit Y2 x1 x2 x3

global df0 = e(df_m)
global ll0 = e(ll)

probit Y2 x1

global df1 = e(df_m)
global ll1 = e(ll)

di "Chi-squared =" 2*(${ll0}-${ll1}) " with DF = " ${df0}-${df1}
di "P value = " 1-chi2(${df0}-${df1},2*(${ll0}-${ll1})) " probability"

* Notice that the Chi-squared value for the Wald test and the LR test is quite different yet the conclusion is identical when it comes to rejecting the null.

* When models are not nested the problem becomes a bit more challenging.

* The Vuong test is a useful test of the goodness of fit of non-nested models.

* Imagine the above data Y2 which is generated based on the assumptions for the probit model.

* We are not sure that the probit is the better model or the logit.

* With the Vuong (1989) test of non-nested models we can construct a test statistic based on the log likelihoods of each individual observation.

* This is done by treating the difference in log likelihoods as a random variable and constructing a chi-squared estimator based on that difference (once again referencing Wooldridge class notes from EC 821B).

* First we will estimate the probit.

probit Y2 x1 x2 x3

* We will predict the fitted values which are the predicted probabilities of a draw of 1.

predict Y2_probit

* The log likelihood can be found from this by taking the log of those probabilities when the draw is 1 and the log of 1 less those probabilities when the draw is 0.

gen ll_probit = Y2*log(Y2_probit) + (1-Y2)*log(1-Y2_probit)

* In order to test that this is actually the case:
sum ll_probit

di "The log likelihood from the MLE should be equal to " r(N)*r(mean)

* Bingo!

* Now we will do the same with the logit.

logit Y2 x1 x2 x3

* We will predict the fitted values which are the predicted probabilities of a draw of 1.

predict Y2_logit

* The log likelihood can be found from this by taking the log of those probabilities when the draw is 1 and the log of 1 less those probabilities when the draw is 0.

gen ll_logit = Y2*log(Y2_logit) + (1-Y2)*log(1-Y2_logit)

* In order to test that this is actually the case:
sum ll_logit

di "The log likelihood from the MLE should be equal to " r(N)*r(mean)

* We next construct a variable called dif, which is a measure of the individual differences of the likelihoods.

gen dif = ll_probit-ll_logit

* Now this variable should only be statistically different from 0 about 5% of the time (at the 5% level) if the two models have equal explanatory power.

reg dif

* The constant dif only seems to be statistically significant a small proportion of the times indicating that as discussed previous the difference between a probit and a logit model is extremely small.

* It is possible to compare many other non-nested models in this way.

* Binary response models are particularly convenient examples because the log-likelihood statistic is so easy to construct.

* However, log likelihoods are easy to recover and or necessary to construct for many maximum likelihood procedures.

Tuesday, January 15, 2013

Creating an easy pie chart from data vectors

R script

# Pie charts can be a memorable way of representing information.

# There are a number of online examples showing how to create pie charts in R.

# For example:

# Imagine you have survey response data from 100 respondents and your survey answers range from A do D.

# This will draw 100 draws from the LETTERS vector between A and D
response = LETTERS[ceiling(runif(100)^2*4)]

# You would like to now see the pie chart of results you can simply tabulate.

# We can see that about half the results are A followed by B then C then D.

# Note, that without including percentages it is often difficult to identify the differences in the individual sizes of each slice.

Thursday, January 10, 2013

Efficiency of LAD vs OLS

R Script

# The following post follows class notes from Panel Data Analysis II by Jeffrey Wooldridge (though all errors are mine)

# When deciding on the best method to estimate a coefficient, LAD and OLS are two commonly considered.

# Often times these estimators are theoretically estimating the same coefficient.

# That is, in the linear relationship y=xB + u with u~symmetrically distributed and independent of x with E(u|x)=0 and med(u|x)=0, both OLS and LAD are consistent estimators of B.

# However, in terms of efficiency, the two estimators individual efficiency is variable.

# Assuming u is independent of x, the asymptotic variance of the LAD estimator (n^.5 * (B[LAD]-B))= is 1/(4*f(0)^2)*E(x'x)^-1

# with f(0) being the pdf of the error distribution at the quantile being estimated (median 0).

# While the asymptotic variance of OLS is (n^.5 * (B[OLS]-B))=sigma2*E(x'x)^-1

# with sigma2 being the variance of the error distribution.

# Now we can evaluate the differences in asymptotic variances given difference distributions of errors.

# For instance, assume the underlying distribution is standard normal

# sigma2=1

# f(0)


# Yields 1.57.  This indicates that LAD is 57% less efficient than OLS when the errors are distributed normally.

# On the other hand.  If we had a distribution of errors v = exp(|u|)*(2-|u|)*sign(u) with u~normally.

N = 1000000

u = rnorm(N)

v = exp(abs(u))*(2-abs(u))*sign(u)

# We can estimate the variance of the errors by sampling.


# Is estimated around 12.7

# To calculate the density at the median 0 we can sum across the pdf values of u in which v = 0.

# Which is u=2 and u=-2.  Because the normal is symmetric:


# Which is about .108

# So avar(lad) = (X'X)^-1 *

# Which is about 2.315*(X'X)^-1

# As opposed to the OLS which is around 12.7*(X'X)^-1

# Thus the LAD estimator is about 6 times more effective than the OLS estimator when the error term is distributed in this manner.

# A leading example of when the LAD estimator is preferred to the OLS estimator is when the errors are distributed double exponential f(u)=1/2k * exp(-|u|/k)

# The variance(u) = 2*k^2 making the avar = 2*k^2*(x'x)^-1 while the density at 0 is 1/2k which makes the avar = k^2*(x'x)^-1 making LAD twice as efficient as OLS.  Of course despite this special cases, we know that most data tends to look more normal than fat tailed making OLS preferable to LAD.