Thursday, August 30, 2012

Cronbach's alpha

* Cronbach's alpha coefficient is a widely used measure of internal consistency or reliability of a psychometric test score for a sample of examinees.

* First let us imagine that we have a test of 100 items to be administered to 1000 people.

* Let's imagine the test in only attempting to measure a single ability (math competency).

set obs 1000
gen stu_id = _n
label var stu_id "Student ID"

* Let's generate our data using IRT specifications:

* Each of the 1000 test takers has a different ability
gen theta = rnormal() + .5      
  label var theta "Individual ability"

* Each item (testing question) has three parameters:
* a - discrimination (the ability this item has of deciphering between people of different ability levels)
* b - the difficulty of this item (1 is easy, 0 is hard)
* c - the guessing probability (the probability that someone who knows nothing will guess the correct answer

* Let us generate now the 100 items, for all 1000 test takers

* Each item will have a different a,b, and c.

* Let's also generate a total score for each person
gen total_score = 0

forv i = 1/100 {

  local a = .2 + runiform()/4
  local b = .3 + runiform()/2
  local c = .2 + runiform()/3
  * I am not very sure what the best way of parameterizing these items are since I know this matters.
  local a`i' = `a'
  local b`i' = `b'
  local c`i' = `c'
  * We will generate item responses as pi(theta) = ci + (1-ci)/(1 + exp(-ai*(theta-bi)))
  * This is saying for a person, the probability of getting the item i right is a function of the item parameters ai bi ci as well personal ability theta.
  gen item`i' = rbinomial(1, `c'+(1-`c')/(1+exp(-`a'*(theta-`b'))))

  * Add the result of this item to the total score
  replace total_score = total_score + item`i'

forv i = 1/100 {
  di "Item `i': a = `a`i'' , b = `b`i'' , c = `c`i''"
* We can see that items with higher "difficulty" have more people getting the item right.

* We now have a collection of items.

* Now to calculate the alpha coefficient we need to do the following
* alpha = K/(K-1) * (1-sum(var(items))/var(total_score))
* K is the number of components (so K is 100)

local K = 100
* variance of total score is easy
sum total_score
local var_total_score = r(sd)^2

* To calculate the sum of the variance of items is a bit more complicated
local sum_var_items = 0
forv i = 1/100 {
  qui sum item`i'
  local sum_var_items = `sum_var_items' + r(sd)^2

di "Sum of Item Variances is `sum_var_items'"

* Now we are ready to calculate the alpha:
di "alpha = " `K'/(`K'-1) * (1-`sum_var_items'/`var_total_score')

* I am learning this as I am doing it.  Please correct me if I have made any mistakes.

* Right now the alpha for the current items is only .55 Which is really pretty poor.

* Try increasing the different parameters in the model a, b, c (their constants and their variations)

Wednesday, August 29, 2012

Calculating Mutually Exclusive Fixed Effects

* Let's imagine we would like to estimate how effective 90 different teachers in 3 different grades each teaching 20 students are individually.  Every student receives a teacher and all teachers are assigned to students randomly (an important assumption often violated).

set obs 90

gen teffect = rnormal()+1.5
  label var teffect "True Teacher Effect"

gen tid = _n

* This should create 3 different grades for the teachers to be assigned to
gen grade = ceil(_n/30)

* We will expand to 20 students per teacher
expand 20

* This is the base student effect level.
gen student_effect = rnormal()

* This is the starting levels of the teachers
gen start_level = rnormal()

* This is a random normal variation achievement gain over the year
gen u = rnormal()

gen current_level = teffect + student_effect + .75*start_level + 2*grade + u*5

* Multiplying start_level by .75 implies that students retain 75% of the ability that they had going into the school year.

* Now we want to see how well we can infer teacher ability

tab tid, gen(tid_)

* We might want to start with a straightforward regression of current achievement on teacher id
reg current_level tid_2-tid_90 start_level

* Since tid=1 is ommitted due to multicolinearity, we will set its effect equal to 0 as a base of reference.
gen reg_res = 0 if tid==1

* Note, you may want to identify the true magnitude of the teacher effect.  This however, is not possible because all students have recieved teachers.  Therefore, we can only at best hope to estimate how good teachers are relative to each other.

forv i = 2/90 {
  cap replace reg_res = _b[tid_`i'] if tid == `i'

* The problem with this is now we have to figure out how to compare teachers.

* One way would be to correlate the estimated teacher effect with the true teacher effect (which we know).
corr reg_res teffect

* This correlation looks really bad primarily because teachers are only teaching in one grade each and grades have different learning effects.

spearman reg_res teffect
* The spearman rank correlation fairs even worse than the pearson correlation

two (scatter reg_res teffect if grade==1)  ///
    (scatter reg_res teffect if grade==2)  ///
    (scatter reg_res teffect if grade==3), ///
legend( label(1 "Grade 1") label(2 "Grade 2") label(3 "Grade 3"))

* We can see generally their is a correlation between higher teacher effect and higher estimates of teacher effects across all grades.  However, within grades the correlation is even more clear.

* One may attempt to correct this problem by including grade dummies
reg current_level tid_2-tid_90 i.grade start_level

* However, the system experiences multicolinearity issues and sometimes drops the grade dummies.

* To control this we will drop the first teacher in each grade.

tab grade, gen(grade_)

reg current_level tid_2-tid_30 tid_32-tid_60 tid_62-tid_90 grade_1-grade_3, nocon

* This regression still is a little fishy however.

* Within each grade the estimated teacher effects is relative to the omitted teacher.

* Thus if the omitted teacher is high in grade 1 and low in grade 2 then the correlations will be thrown off.

gen reg_GD = 0 if tid==1 | tid==31 | tid == 61
forv i = 2/90 {
  cap replace reg_GD = _b[tid_`i'] if tid == `i'

corr reg_GD teffect
spearman reg_GD teffect
* Including the grade dummies greatly improves the teacher estimates.

* An alternative method would be to demean current achievement

bysort grade: egen mean_current_level = mean(current_level)

gen dm_current_level = current_level-mean_current_level

reg dm_current_level tid_2-tid_90 start_level

gen dm_results = 0 if tid==1
forv i = 1/90 {
  cap replace dm_results = _b[tid_`i'] if tid == `i'

* The problem with this is now we have to figure out how to compare teachers.
corr dm_results teffect
spearman dm_results teffect

* Finally an alternative approach may be to do the original regression but demean the teacher estimates by grade post estimation.

bysort grade: egen mean_reg_res = mean(reg_res)
gen dm_reg_res = reg_res - mean_reg_res

* The problem with this is now we have to figure out how to compare teachers.
corr dm_reg_res teffect
spearman dm_reg_res teffect

two (scatter dm_reg_res teffect if grade==1)  ///
    (scatter dm_reg_res teffect if grade==2)  ///
    (scatter dm_reg_res teffect if grade==3), ///
legend( label(1 "Grade 1") label(2 "Grade 2") ///
label(3 "Grade 3")) title(Demeaned dependent variable)

Tuesday, August 28, 2012

Nested functions in R and Stata

#----------------- Starting in R -------------------------#

# Both R and Stata have built in limit of how many nested functions each package will allow.
a <- 0

recursive_call <- function() {
  a <<- a+1

# R gives up at 2494 calls.

#-----------------Switching to Stata---------------------*

* This is an identical program (function) in Stata

global a=0

cap program drop recursive_call
program recursive_call

  global a=$a + 1
  di $a


* Stata on the other hand throws in the towel at 63.  This looks bad but look at the response by Alan Riley!

* The Following Code is code from a email comment by Alan RileyStataCorp LP Vice President, Software Development

/* --------------------------- And In Mata------------------------------ */
set more off // don't pause every screen full of output

a = 0

function recursive_call() {
external a
a = a+1
printf("%f\n", a)




Alan: The above will error out at some point when it hits its recursion limit,
and Mata will print out a (long) traceback log intended to help in
debugging when a Mata function hits some system limit. Due to such deep
recursion, this traceback log will be long and so the output from
recursive_call() will scroll off the top of your screen. In any case,
after running the above, you can then re-enter Mata and see what the value
of "a" is:



You should see 32766.

As an aside, you will notice that

* I imposed an upper limit of recursive calls at 10000, if left uncontrolled it will keep going till it gives an error. I am impressed with the speed and power of Mata!

Monday, August 27, 2012

Test Rejection Power for Failure of Normality Assumption

* In order for OLS to be the best unbiased estimator, the error must be distributed normally (along with other OLS assumptions).

* Failure or the normality assumption does not cause bias but may cause the estimator to be potentially less efficient than some non-linear estimators.  However, so long as the error is spherical (homoskedastic and uncorrelated) then the OLS estimator is still BLUE (best lineary unbiased estimator).

* Likewise, in order to calculate p-values and confidence intervals accurately the underlying distribution must be normal.

* However, by the central limit theorem so long as E(error^2)
* Of course, asympotics sometimes only appear in very large samples.

* In this post I will explore what happens to confidence intervals when the error is not distributed normally distributed and the sample size is "small".

cap program drop randols
program define randols, rclass

  * This tells stata to call the first argument in the command randdraw N and the second distribution
  args N distribution

  * The first arugment of the
  set obs `N'

  * Generate the variable
  gen u = `distribution'

  gen x = rnormal()

  gen y = 1*x + u

  reg y x

  * The Degrees of freedom are N less the number of parameters estimates (2)
  local DOF = `=`N'-2'

  * The (10%) condifence interval is Bhat-t(.05)*SE, Bhat+t(.05)*SE
  local lowerCI = _b[x]-invttail(`DOF', .05)*_se[x]
  local upperCI = _b[x]+invttail(`DOF', .05)*_se[x]

  * Now we check if the estimated coefficient is within the CI
  * Since the true coefficient is 1, we are checking if the CI encloses 1.
  if `lowerCI'<1 amp="amp" upperci="upperci">1 return scalar within = 1
  if !(`lowerCI'<1 amp="amp" upperci="upperci">1) return scalar within = 0

randols 100 runiform()
  return list

* Now we are set up to do some tests to see how well the 90% CI is working.

* If the CI is too wide (the standard error estimates are too large) then the CI will enclose the true value more than 90% of the time.

* If the CI is too tight then the CI will enclose the true vale too few times.

* As a means of comparison let's start with a normal distribution of the error.

* Set the random see so as to create replicable runs.
simulate r(within), reps(2000) seed(1011) : randols 100 rnormal()

* We can see that the standard estimates are working well, capturing the CI at 90.55% of the time.

* We do not expect exactly 90% since our simulation is a series of random draws as well.
* Now let's try a uniform distribution.  Note because the uniform has a non-zero mean this will effect the constant estimate, but since we are not interested in the constant we don't mind.
simulate r(within), reps(2000) seed(1011) : randols 100 runiform()
* Likewise the random uniform distribution does not present a serious problem in estimating CI.

* Perhaps the bernoulli(.5)
simulate r(within), reps(2000) seed(1011) : randols 100 rbinomial(1,.5)
* Likewise, the normality assumption is working very well even in binomial(.5) errors.

* Perhaps the bernoulli(.15) will be a little more challenging
simulate r(within), reps(2000) seed(1011) : randols 100 rbinomial(1,.5)
* Nope

* It should not matter if we increase the size of the error either since the CI should increase in size proportionate the size of the error estimated in the underlying population.

* Perhaps the bernoulli(.15) will be a little more challenging
simulate r(within), reps(2000) seed(1011) : randols 100 rbinomial(1,.5)*100
* Nope

* Perhaps if we reduce the sample size, this should make non-normal distribution more problematic.
simulate r(within), reps(2000) seed(1011) : randols 10 rbinomial(1,.5)*100

* Even using a sample size of 10 the CI is working beautifully.

* Let's try using a distribution which is extremely problematic:

* CDF(standard Cauchy) = 1/pi * arctan(x) + 1/2
* runiform() = 1/pi * arctan(x) + 1/2
* (runifrom() - 1/2) = 1/pi * arctan(x)
* pi(runifrom() - 1/2) = arctan(x)
* tan(pi(runifrom() - 1/2)) = x

simulate r(within), reps(2000) seed(1011) : randols 10 tan(_pi*(runiform()-1/2))

* For even the Cauchy distribution (a distribution for which the CLT does not apply) the CI calculations are close

* Let's try a larger sample size.

simulate r(within), reps(2000) seed(1011) : randols 1000 tan(_pi*(runiform()-1/2))

* Alright, Cauchy is not enough to throw off our CI.

* How about:

simulate r(within), reps(2000) seed(1011) : randols 1000 1/(rnormal()/10)

* There should be a lot of observations falling near the discontinuous region at 1/0.

* Yet the CI approximation assuming normally distributed errors continues to work extremely well with a overrejection rate of .002% off.

* This is a pretty convincing series of simulations to me.

* If you can concieve of a distribution of errors/sample size that makes the CI fail badly please post them as a response to this post!

Saturday, August 25, 2012

Breaking down R2 by variable

* It is not entirely possible to assign R2 by individual variable within a regression.

* Let's first look at what R2 is

set obs 1000

gen x1=rnormal()
gen x2=rnormal()*(5)
gen x3=rnormal()*(10)

* I want x1 to explain some variation but x2 to explain more than x1 and x3 to explain even more.

gen u=rnormal()*10

gen y = x1 + x2/2 + x3/3 + u

reg y x1 x2 x3

* Now we see that (as expected) x1 has the largest coefficient, however we know x2 and x3 are more important to calculating y.

* It is not clear at all how much of R2 each variable is explaining.

* One interpretation is that it is sum of squares model over sum of squares total.

di "R2 = " e(mss) /(e(rss) + e(mss))

* This is easy to calculate

* SST = sum{across I}(Yi-mean(Y))^2 =  N*mean((Yi-mean(Y))^2)
sum y
gen ydemean2 = (y-r(mean))^2
sum ydemean2
di "SST = " r(mean)*r(N)

* An identical process can be used to find RSS if you were to use yhat rather than y.
predict yhat
sum yhat
gen yhatdemean2 = (yhat-r(mean))^2
sum yhatdemean2
di "SSR = " r(mean)*r(N)

* R2 can also be related to the correlation between yhat and y

corr yhat y

di "R2 = " r(rho)^2

* So can we break apart the R2 into seperate variable components?

* Perhaps

reg y x1
reg y x2
reg y x3

* This works fine if the explainatory variables are independent.  However, if they are not the sum of the separate R2s get larger that the collective R2 rather quickly.

* An alternative approach would be through use of explanatory variables.

reg y x1 x2 x3

forv i = 1/3 {
  gen yhat_x`i' = x`i'*_b[x`i'] /* the constant is unimporant */
  corr yhat_x`i' y
  di "Maximum R2 (by x`i') is " r(rho)^2

* Interestingly, this gives identical results

* A final method seems appealing in seperating the explanatory power out of each variable.

* I call it the Partial Variation in Projection (PVP)

* For instance, it may be calculated as
sum x1
local pvp1 = r(sd)*_b[x1]
di "PVP(x1) = " `pvp1'

sum x2
local pvp2 = r(sd)*_b[x2]
di "PVP(x2) = " `pvp2'

sum x3
local pvp3 = r(sd)*_b[x3]
di "PVP(x3) = " `pvp3'

* We can see the PVP value is performing well.  x2 is twice as large as x1 and x3 is about three times that of x1.

* A scaled estimator would be in terms of total explanatory power by variable (portion of R2 by variable)

di "PVP(x1) scaled = " `pvp1' / (`pvp1'+`pvp2'+`pvp3')
di "PVP(x2) scaled = " `pvp2' / (`pvp1'+`pvp2'+`pvp3')
di "PVP(x3) scaled = " `pvp3' / (`pvp1'+`pvp2'+`pvp3')

* Finally the question ends up being, does this method releave useful values?

* Looking back at how y was generated:

* gen x1=rnormal()
* gen x2=rnormal()*(5)
* gen x3=rnormal()*(10)

* gen u=rnormal()*4

* gen y = x1 + x2/2 + x3/3 + u

* The portion x1 + x2/2 + x3/3 is explainable variance.

sum x1
local sd_x1 = r(sd)*1
sum x2
local sd_x2 = r(sd)/2
sum x3
local sd_x3 = r(sd)/3

* Within explainable variance:

di "x1 represents: " `sd_x1'/(`sd_x1'+`sd_x2'+`sd_x3')
di "x2 represents: " `sd_x2'/(`sd_x1'+`sd_x2'+`sd_x3')
di "x3 represents: " `sd_x3'/(`sd_x1'+`sd_x2'+`sd_x3')

* We can see that the coefficient estimates above approximate the true values.

* However this type of analysis of explanatory power by variable might only be useful for the case when explanatory variables are orthoganol.  If explanatory variables were correlated then the variance in one variable is not indepentent of the variance in the other and thus together they will have less explanatory power than two orthoganal explanatory variables, ceteris parabis.

Friday, August 24, 2012

Central Limit Theorem

* I find the central limit theorem pretty darn amazing. 

* For a series of random draws from any well behaved distribution, the mean of that series of samples from that distribution is normally distributed as the size of those samples gets large.

* This might seem at first like a lot of mumbo jumbo but this result is easily demonstrable and extremely useful.

* To show the central limit theorem (CLT) in action we will use Stata's simulate command.

* First we need to program a "program" to simulate

cap program drop randdraw
program define randdraw


  * This tells stata to call the first argument in the command randdraw N and the second distribution
  args N distribution

  * The first arugment of the
  set obs `N'

  * Generate the variable
  gen x = `distribution'

  sum x


* It is not neccessary that the mean be equal to zero but it might be useful for comparison between different distributions
randdraw 1000 "rpoisson(10)-10"
* This command drew 1000 draws from the poisson distribution less 10.  Which has a E(poisson(10)-10)=E(poisson(10))-10=10-10=0

simulate mean_x =r(mean), reps(1000): randdraw 1000 "rpoisson(10)-10"

hist mean_x, kden
* Looks pretty normal, but that is not so surprising because the poisson distribution begins to look pretty normal once k gets as large as 10.

* So how exactly does the central limit relate to all of this and why is it so cool?

* The argument is that as N the number of draws going into each mean calculation gets large, the distribution of the means begins to look normal.

* Lets first try something as far from normal as possible (not a technical definition)
simulate mean_x =r(mean), reps(1000): randdraw 10 "rbinomial(1,.05)-.05"

* This is the mean of a bernoulli distribution (weighted to .05) either 0 or 1 with each mean representing 10 draws.
hist mean_x

* We can see that this is clearly not normally distributed.  However what happens when we increase the draws to 100 rather than 10?
simulate mean_x =r(mean), reps(1000): randdraw 100 "rbinomial(1,.05)-.05"
hist mean_x

* Already things are beginning to look a lot more like a normal distribution.  Obviously it is not normal because the potential outcomes are discrete.

* Perhaps it will look better at 1000 draws.
simulate mean_x =r(mean), reps(1000): randdraw 1000 "rbinomial(1,.05)-.05"
hist mean_x

* This is what is meant by asymptotically converging on a normal distribution.  As N gets large the distribution of the means begins to look more like a normal distribution (so long as the variance of x is finite).
* Crazy, huh?

* Perhaps it will look better at 10000 draws.
simulate mean_x =r(mean), reps(1000): randdraw 10000 "rbinomial(1,.05)-.05"
hist mean_x, kden

* Still not looking perfect.
simulate mean_x =r(mean), reps(1000): randdraw 100000 "rbinomial(1,.05)-.05"
hist mean_x, kden

* Now things are looking very close to normal.  Of course there will be some randomness in the 1000 draws which have enough inherent variation to make even a normal variable look non-normal.

* This might take a little while but let's increase the draws one more time.
simulate mean_x =r(mean), reps(1000): randdraw 1000000 "rbinomial(1,.05)-.05"
hist mean_x, kden

* So using a x variable with are bernoulli(.05) leads to a convergence in means that looks normal not until around 100,000 draws.  This might sound bad but it really is not.

* Because, usually we don't need the individually drawn means TO BE normal just to be approximated by a normal distribution as well as their standardized values as being approximated by a chi-squared distribution for null tests.

* In addition, a bernoulli(.05) is an extreme distribution with a large skew and discrete outcomes.  Most distributions, even binary and count distributions are going to have properties which are more similar to a normal distribution causing more rapid convergence on normality.

* Even a .5 bernoulli begins to look pretty normal at 1000 draws.
simulate mean_x =r(mean), reps(1000): randdraw 1000 "rbinomial(1,.5)-.5"
hist mean_x, kden

* Let alone distributions such as uniform
simulate mean_x =r(mean), reps(1000): randdraw 1000 "runiform()-.5"
hist mean_x, kden

* Compare this to a normal distribution for which the mean by definition normally distributed (because Normal + Normal ~ Normal)
simulate mean_x =r(mean), reps(1000): randdraw 1000 "rnormal()"
hist mean_x, kden

Thursday, August 23, 2012

A Visual Exploration of Two Dimensional Random Walks

# Let's create N variable vectors that will hold our different random walks.
N <- 7

# Now let's speficy the number of walks to take
S <- 50

# Create a matrix to hold X and Y values
X <- Y <- matrix(0,nrow = S, ncol = N)

# I am going to save my graphs to the "Rplots" directory

# For each value of X and Y add a random normal draw each of the S rounds
for (s in 2:S) {
  Xmov <- rnorm(N)
  X[s,] <- X[s-1,]+Xmov
  Ymov <- rnorm(N)
  Y[s,] <- Y[s-1,]+Ymov

# We can see a pure random walk looks pretty abrupt


plot(X,Y, type="n",frame.plot=F , axes=FALSE, xlab="",ylab="")

for (n in 1:N) {
  arrows(x0=X[1:S-1,n], y0=Y[1:S-1,n], x1=X[2:S,n], y1=Y[2:S,n], col = rainbow(N)[n], length=.05)

savePlot(filename = "2012_8_23_rand_walk1.png",
         type = "png")

# However, let's allow for some momentum (whatever speed the point had at the previous step that speed is retained during this step plus some random movement.  This is in effect a data source that exhibits a random walk even after first differencing.

X2 <- Y2 <- matrix(0,nrow = S, ncol = N)

for (s in 3:S) {
  # Add an inertia term
  Xinertia <- X2[s-1,] - X2[s-2,]
  Xmov <- rnorm(N) + Xinertia
  X2[s,] <- X2[s-1,]+Xmov

  Yinertia <- Y2[s-1,] - Y2[s-2,]
  Ymov <- rnorm(N)  + Yinertia
  Y2[s,] <- Y2[s-1,]+Ymov

# We can see a pure random walk looks pretty abrupt

plot(X2,Y2, type="n",frame.plot=F , axes=FALSE, xlab="",ylab="")

for (n in 1:N) {
  arrows(x0=X2[1:S-1,n], y0=Y2[1:S-1,n], x1=X2[2:S,n], y1=Y2[2:S,n], col = rainbow(N)[n], length=.05)

savePlot(filename = "2012_8_23_rand_walk2.png",
         type = "png")

# We can see the movement seems to be much smoother but cause rapid divergence in individual draws

# Lets try one more specification.  First we will cause there to be a common current vector (1,1) pushing the different strands towards the positive direction.  Also, there will be more momentum.

X3 <- Y3 <- matrix(0,nrow = S, ncol = N)

for (s in 7:S) {
  # Add an inertia term
  Xinertia <- 5*(X3[s-1,] - X2[s-2,]) +
              4*(X3[s-2,] - X2[s-3,]) +
              3*(X3[s-3,] - X2[s-4,]) +
              2*(X3[s-4,] - X2[s-5,]) +
              1*(X3[s-5,] - X2[s-6,])
  Xmov <- rnorm(N) + Xinertia/750 + 1
  X3[s,] <- X3[s-1,]+Xmov

  Yinertia <- 5*(Y3[s-1,] - Y2[s-2,]) +
              4*(Y3[s-2,] - Y2[s-3,]) +
              3*(Y3[s-3,] - Y2[s-4,]) +
              2*(Y3[s-4,] - Y2[s-5,]) +
              1*(Y3[s-5,] - Y2[s-6,])
  Ymov <- rnorm(N) + Yinertia/750 + 1
  Y3[s,] <- Y3[s-1,]+Ymov


plot(X3,Y3, type="n",frame.plot=F , axes=FALSE, xlab="",ylab="")

for (n in 1:N) {
  arrows(x0=X3[1:S-1,n], y0=Y3[1:S-1,n], x1=X3[2:S,n],
         y1=Y3[2:S,n], col = rainbow(N)[n], length=.05)

savePlot(filename = "2012_8_23_rand_walk3.png",
         type = "png")

Wednesday, August 22, 2012

Sharp Regression Discontinuity Example and Limitations

* Regression discontinuity is a method of analysis dating back to work by Thistlewait and Cook (1960) but recently popularized by a number of important papers such as Hahn, Todd, and Van der Klaauw (2001) <

* The method is argued to require weaker assumptions than natural experiments.

* The method is seemingly deceptively simple.

* Imagine there is some rule implemented at z = c a heterogenous value in the population.

* z could be correlated with the outcome variable y of interest.  However, if one were to look at the group of individuals whose value of z were sufficiently close to c then one would find that the only remaining difference would be the result of either recieving the treatment T or not recieving the treatment.

* Let's see how this works.

set obs 20000
set seed 101

* Imagine that a school has 20000 incoming students.

* They have SAT scores drawn from a uniform distribution ( not a realistic assumption)
gen SAT = 600 + int(181*uniform())*10

* As an administrator you would like to give out merit based scholarships to encourage students to do well at your school.

* This we will call a score of 2130.

recode SAT (0/2130=0) (2130/2400=1), gen(scholarship)

* There is also some measurable level of mentoring that affects performance which is independent of SAT scores and scholarship.
gen mentoring = rbinomial(1,.5)

* Let's also imagine that students with top SAT scores are more likely to do well without the scholarship.
gen performance = 25 + 2*(SAT/1500)^3 + 1*scholarship + 1*mentoring + rnormal()*5

twoway (scatter performance SAT , msize(tiny) msymbol(circle))  ///
       (lfit  performance SAT if SAT<2130 circle="circle" msize="msize" msymbol="msymbol" nbsp="nbsp" p="p" tiny="tiny">   (lfit performance SAT if SAT>2130, msize(tiny) msymbol(circle)),    ///
  legend(label(2 "No Scholarship") label(3 "Scholarship"))

* Performance is some index that your team has developed that combines grades, time to completion, post graduation job success, entry to graduate schools, as well as alumni contributions.

* You think that the relationship between SAT and performance is nonlinear
* Sepecification 0:
reg performance SAT scholarship mentoring

* However, you are suspicious of the relationship between students recieving the scholarship and future success.

* Thus RD! You look instead at those students who almost got the scholarship and those who just barely qualified for the scholarship.

* Sepecification 1:
reg performance SAT scholarship mentoring if SAT > 1930 & SAT < 2330

* We can see that our estimates are closer.  However they are not perfect yet it does not help if we restrict our data further.
* Sepecification 2:
reg performance SAT scholarship mentoring if SAT > 2070 & SAT < 2180

* Sepecification 3:
reg performance SAT scholarship mentoring if SAT > 2100 & SAT < 2160
* This is because we rapidly loose observations as our

* Sepecification 4:
reg performance SAT scholarship mentoring if SAT > 2115 & SAT < 2145

* It is easy to see the problem with the RD approach in this example.  RD is highly sensitive to sample selection.

* If our selection is too narrow then we do not have enough data to identify the effect or the confidence interval for useful analysis.

* In 0 we can see that our confidence interval does not enclose the true parameter estimate, while in all other specifications it does.

* However, from specifications 2-4 we might be forced to conclude that after using RD "precision", the scholarship had no effect size significantly difference from zero.

* We can that the estimated effect of mentoring suffers equally bad as that of estimating the effect of scholarship by restricting the sample.

* Ultimately using RD to restrict the sample is going to have an equally deliterious effect on other coefficients of interest.

* This might sound overall negative.  However, when you have large enough sample sizes things start improving.

* Let's imagine that the administrator is able to use multiple years of students to estimate the effect of the program.

* Increase sample size to 2 or 3 times the current sample size and include a year fixed effect and the RD estimator is still going to outperform the biased estimator which cannot get better through the inclusion of more data.

Tuesday, August 21, 2012

Symbolic Math - Differentiation in R

# While R is primarily used as a statistical language for numerical calculations it is often useful to be able to manipulate symbols so as to avoid approximation error.

# In addition to a vector language R also can handle a limitted symbolic language.

# In order to do this first you specify a symbolic "expression".

fx = expression(x^2, 'x')

D(fx , 'x')
# We can see that this expression yeilds 2*x

D(D(fx , 'x'),'x')
# This will take the second derivative 2

# Now let's try something a little more sophisticated
fx = expression(x^5-1/x+cos(x)^x, 'x')

D(fx , 'x')

# R can also handle multiple variables.

fxyz = expression((x*y)^5-1/x^z+cos(x)^x)

# deriv is a similar function to D except that it expands solutions out.  The results are a little more complicated to interpret.

deriv(fxyz , c('x', 'y', 'z'))

# Integration on the other hand is a bit more difficult.  It requires external packages.

Monday, August 20, 2012

Lorenz Attractor Simulation

# This is a brief exploration of Chaos theory.  The Lorenz attractor is an equation used to model convection.  It is defined by the system of equations:

# dx/dt = sigam(y-x)
# dy/dt = x(rho-z)-y
# dz/dt = xy-beta*z

# With sigma, rho, and beta representing model parameters.

# We can easily approximate this system by a series of discreet time steps

# First set the initial values
y = 5
x = 5
z = 5

# Now let's set the parameters
sigma = 10
rho = 28
beta = 8/3

for (i in 1:9999) {
  x[i+1] = x[i] + sigma*(y[i]-x[i])/200
  y[i+1] = y[i] + (x[i]*(rho-z[i])-y[i])/200
  z[i+1] = z[i] + (x[i]*y[i]-beta*z[i])/200

plot(x[!],y[!], type="n")

for(i in 1:(length(x)-1)) 
lines (x[i:(i+1)], y[i:(i+1)], col = rainbow(length(x))[i]) 

Sunday, August 19, 2012

Law of Iterative Expectations/Law of Total Expectations

* The law of iterative expectations (LIE) is an extremely useful rule that can be frequently and usefully employed in many econometric proofs.

* This post will explore some simulated results in an attempt to strengthen our intution.

* LIE is the statement E(Y)=E(E(Y|X))

* For a proof using sums (

* A proof using pdf probability rules can be found at (

* Example 1: OLS

* In OLS (given the linearity assumption and the zero conditional mean assumption) E(Y|X) = XB

* Thus using LIE: E(Y) = E(E(Y|X)) -> in the sample analogue mean(Y) = mean(XBhat) = mean(Yhat)

* Let's see this in action

set obs 1000

gen x = rnormal() + 5
gen u = rnormal()

gen y = x + u*10

reg y x

predict yhat
sum y yhat

* Example 2: Imagine you are trying to calculate the reccomended cooking times for a frozen food that you have developed that is best for the most people.

* You know that for each 1000 feet above sea level you need to reduce the temperature by 5% (this is entirely fictional).  You have experimented and found the ideal cooking temperature at sea level is 350. The result is E(Y|x)=350*(.95)^x where x is 1000 feet.

* Now you want to calculate a single cooking tempurature reccomendation that gives the ideal tempurature for an entire population.

* You know 30% of consumers live at sea level, 20% at 500 feet, 20% at 1000, 10% at 2000, 10% at 3500, and 10% at 5000.

* Thus expected best temperature reccomendation for a randomly drawn person is:

di .3*350*(.95)^0 + .2*350*(.95)^.5 + .2*350*(.95)^1 + .1*350*(.95)^2 + .1*350*(.95)^3.5 + .1*350*(.95)^5

* Thus E(Y)=E(E(Y|X))

* Note that this is a different temperature than taking the temperature at the average elevation.

di "Average elevation = " .3*0 + .2*500 + .2*1000 + .1*2000 + .1*3500 + .1*5000

di "Temperature reccomendation at average elevation = " 350*(.95)^((.3*0+.2*500+.2*1000+.1*2000+.1*3500+.1*5000)/1000)

Saturday, August 18, 2012

Inverse Probability Wieghting to Correct for Sample Selection/Missing Data

* Imagine that you have some data set that is missing some of the variables of interest but you have a complete set of explanatory variables.  You might be concerned that the selection from the sample is correlated or is causing correlation in errors with your explanatory variable of interests thus creating potential bias.

* Imagine that you are interested in estimating if obedience school for dogs has the potential to reduce their risk of biting people.  As the y variable you have self-reported (by owners) number of bites.  As the explanatory variable you have breed aggressiveness and an indicator if the dog went to obedience school.

* Imagine also that you have information on the aggressiveness of the owners of the dogs which is correlated with the error in estimating the number of bites.  It is also an explanatory variable of selection.

set obs 100000

gen n_classes = rpoisson(1)
gen breed_agg = rnormal()
gen owner_agg = rnormal()

* First lets calculate selection
gen p = normal(.5 -.5*n_classes + .5*owner_agg)
gen s = rbinomial(1,p)
gen u = rnormal()+owner_agg

* First let's assume there is no selection
gen bites = 2 + breed_agg - n_classes + 3*u

reg bites breed_agg n_classes
* We can see in this case the estimates look good (absent of selection)

replace bites = . if s == 0

reg bites breed_agg n_classes
* Now, the effects of classes seem greatly diminished in our observables because of the correlation between selection and the error component resulting from ownernship aggressiveness.

* There are two ways I can think of generating an unbiased estimates.

* We must do this by removing the correlation with selection and the error.

reg bites breed_agg n_classes owner_agg
* Is the easiest way to do this.  However, this post is about inverse probability weighting.  So that is what we will do.

probit s n_classes owner_agg

predict shat
* We want to estimate probability of selection from observables

gen ishat = 1/shat
* Then first the inverse of it to use in the pweight command

reg bites breed_agg n_classes [pweight=ishat]
* We can see this estimate is working well though the previous regression was generally better.

* This post deals with inverse probability weighting in simple OLS.  A future post will address inverse probability weighting in M-estimation:

Friday, August 17, 2012

The DELTA method

# The delta method is an extremely useful tool for estimating the standard errors of non-linear models.

# This post will use as a reference David Patterson of the University of Montana's post on the delta method

# The delta method states that the standard error of a distribution f(x) is approximated by f'(Mx) var(x) f(Mx) where f(x) is the gradiant of F(x) with respect to x (first derivative), and Mx=mean(x).

# Let's first generate a base variable x~N with mean 5 and standard deviation 3.

# Example 1: F(x) = x^2 -> f(x) = 2x ->
# thus var(F(x))~2*mean(x)*var(x)*2*mean(x) ->
# var(F(x)) ~

# Let's see how well our approximation compares with actual data
x2 = x^2

# We can see that our estimate (900) is pretty close to the sample variance (1050ish)

# Using our sampling distribution

# Example 2: F(x) = 1/x -> f(x) = (-1)*x^(-2)
# var(F(x)) ~ (-1)*mean(x)^(-2) * var(x) * (-1)*mean(x)^(-2)
# = mean(x)^(-4) * var(x)
mean(x)^(-4) * var(x)

# Let's see how well our approximation compares with actual data
xi = 1/x

# In this case, not very well.  This is probably because the distribution of x approaches the nondiferentiable point 0 for some xs making a first order approximation fail badly .

# Let us try with another variable
z = rnorm(10000)+20

mean(z)^(-4) * var(z)

zi = 1/z

# We can see now the delta method is working well.

# Example 3: F(z) = ln(z) -> f(z) = 1/z
# var(F(z)) ~ (1/mean(z))*var(z)*(1/mean(z))
# = (1/mean(z))^2*var(z)

lnz = log(z)

# Once again, looking good.

# Now let's imagine an estimator.  First off we need to recognize that estimators are random variables.

# Bhat = (X'X)^(-1)X'Y = (X'X)^(-1)X'(XB + e) = (X'X)^(-1)X'XB + (X'X)^(-1)X'e
#      = B + (X'X)^(-1)X'e
# E(Bhat|x) = E(B|x) + E[(X'X)^-1 X'e|x] -> assume (e|x)=0 ->
# E(Bhat|x) = B + (X'X)^-1 X'E[e|x] = B
# var(Bhat|x) = var(B + (X'X)^(-1)X'e|x) = var((X'X)^(-1)X'e|x)
# = (X'X)^(-1)X'var(e|x)((X'X)^(-1)X')' = (X'X)^(-1)X'var(e|x)X(X'X)^(-1)
# = (X'X)^(-1)X' *I X(X'X)^(-1)
# = sigma2*(X'X)^(-1)X'X(X'X)^(-1) = sigma2*(X'X)^(-1)

e = rnorm(1000)
y = 5*x + e*100

# Let's estimate OLS without residuals


uhat = residuals(ols)

# var(Bhat|x) ~ sigma2(X'X)^(-1) estimate

# standard error

Thursday, August 16, 2012

Linear models with heteroskedastic errors

* Imagine we have a sample of hotels and prices of rooms in those hotels
set obs 200
set seed 101
* This sets the random seed at 101.  Thus every time this simulation is run it will product identical results.

gen star = ceil(runiform()*7)/2+.5
  label var star "Number of stars of hotel"

gen id = _n

gen het = abs(rnormal())
  label var het "Unobserved heterogeneity."

* Generate a the number of reservations observed per hotel.
gen num_reservations = rpoisson(150)
expand num_reservations

gen seasonality = abs(rnormal())
  label var seasonality "Seasonal demand"

gen v = abs(rnormal())
  label var v "Unobserved variance in the variance term"

gen u = rnormal()*(5+het*15+7.5*star+22.5*seasonality+v*10)
  label var u "Error term"

gen p = 175 + 20*star + 15*seasonality + u + het

sum p
* There are some p values which are less than 0 but we can think of those as special deals, coupons, refunds, or other situations that might result in the effective price being less than 0 dollars.
* If we were to eliminate the less than 0 prices then we would in be enforcing left censoring which is a different problem.  See "tobit".  This blog has several posts touching on the use of the tobit.

* Estimate the price through direct OLS
reg p star seasonality

* Set the panel level observation
xtset id

* Though we have panel data we cannot effectively use fixed effect or random effects approaches to identify the price effect having one more star has on prices.
xtreg p star seasonality, fe
xtreg p star seasonality, re

* Now let's attempt a two step method to more efficient identify the error and reestimate the OLS.

reg p star seasonality
* The OLS regression looks pretty good.  Let's see if we can improve on it.  Note, the 95% confidence interval did not capture the true coefficient of 20 but that is not necessarily a problem.

reg p star seasonality, robust
reg p star seasonality, cluster(id)
* Using robust and cluster robust estimates of the standard error does not change the variance so much that the 95% confidence interval encloses the 20.

predict uhat, resid

gen uhat_abs = abs(uhat)
  label var uhat_abs "Abs of OLS residual"

two  (scatter uhat_abs seasonality) (scatter uhat_abs star), ///
              legend(label(1 "Seasonality") label(2 "Stars"))

* Since E(u)==0 var(u) is equal to u squared Var(u)=(u-E(u))^2
* Likewise, u^2 is approximated by u^2
* Similarly sd(u) should be approximated by (u^2)^.5=abs(u)
reg uhat_abs star seasonality

predict uhat_abs_hat, xb
  * I might be doing this wrong.  I am trying aweights which say something about weighting by the inverse of the variance of the observation.)

gen uhat2 = 1/uhat_abs_hat^2

reg p star seasonality
* Let's see how the unweighted estimate performs

reg p star seasonality [aweight = uhat2]
* Using variance weights does not seem to improve the estimate

* Alternatively we can use the MLE estimator allowing the conditional standard deviation of the error as well as the conditional mean to vary linearly.

cap program drop mle_ols
program mle_ols
  args log_like xb sigma
  qui replace `log_like' = ln(normalden($ML_y1-`xb',0,`sigma'))

ml model lf mle_ols (price: p = star seasonality) (sigma: star seasonality)
ml maximize

* Using the MLE estimator we seem to have gained precision in the 3rd decimal place of the coefficients.  The 95% CI still does not enclose the 20 but it is closer.  Still, this is not indicative of a problem.  Perhaps if we simulated this 1000 times and substantially more than 50 times the CI did not enclose the 20 then we might be worried.

Wednesday, August 15, 2012

A brief look at fe vs re

global num_reps = 100
* Set the number of repetitions

quietly forv i = 1(1)$num_reps {
* Loop through the simulation 100 times

gl noi
if `i'==1 gl noi noi
  * Only display regression results on the firt loop

set obs 1000

gen id = _n

gen het = rnormal()
  label var het "Individual heterogeneity"
gen xauto = rnormal()
  label var xauto "Autocorrelation in x"
gen uauto = rnormal()
  label var uauto "Autocorrelatoin in u"

expand 5
  * Create 5 observations per individual

gen x = rnormal() + .5*xauto
  * Generate an x that is part auto correlated and part unique draws
gen y = 5*x + rnormal()*50 + uauto*25
  * Generate a y in which error is also autocorrelated by individual

bysort id: gen t=_n
  * Create a time variable

xtset id t
  * Assign panel data indicators

$noi xtreg y x, fe
  * Run a fixed effect regression
gl fe`i' =  _b[x]
  * Save the results in the global fe#

$noi xtreg y x, re
  * Run a random effect regression
gl re`i' =  _b[x]
  * Save the results in the global re#

noi di `i'
  * Display repetition number

* Clear the old memory
set obs $num_reps

* Create empty variable holders.
gen re=.
gen fe=.

* Loop through the repetititions saving each to a variable
forv i=1(1)$num_reps {
  replace re = ${re`i'} if _n==`i'
  replace fe = ${fe`i'} if _n==`i'

sum re fe

Tuesday, August 14, 2012

Proof IV is same as 2SLS

This is a standard proof.  Normally I am the last one to attempt proofs but I wanted to try out $ \LaTeX{} $ in blogger.

LaTeX Script:

  % This is a comment; it is not shown in the final output.
  % The following shows a little of the typesetting power of LaTeX
First Stage: Standard OLS \\
(1.1) X=Z\widehat{\gamma\
(1.2) Z'X=Z\widehat{\gamma\\
(1.3) Z'X=Z'Z\widehat{\gamma\\
(1.4) (Z'Z)^{-1}Z'X=(Z'Z)^{-1}Z'Z\widehat{\gamma\\
(1.5) (Z'Z)^{-1}Z'X=\widehat{\gamma\\

Second Stage, insert projection of Z on X ( \widehat{X= Z \widehat{\gamma): \\
(2.1) Y= \widehat{X\beta _{2SLS\\
(2.2) \widehat{X}' Y= \widehat{X}\widehat{X\beta _{2SLS\\
(2.3) ( \widehat{X}\widehat{X})^{-1\widehat{X}' Y=  ( \widehat{X}\widehat{X})^{-1}  \widehat{X}\widehat{X\beta _{2SLS\\
(2.4) ( \widehat{X}\widehat{X})^{-1\widehat{X}' Y=  \beta _{2SLS\\
From (1.5) we insert  \widehat{X= Z \widehat{\gamma= Z (Z'Z)^{-1}Z'X \\
and  \widehat{X}' = \widehat{\gamma}'Z = X'Z (Z'Z)^{-1}Z'. \\
(2.5) ( X'Z (Z'Z)^{-1Z'Z (Z'Z)^{-1Z'X )^{-1}  X'Z (Z'Z)^{-1Z'Y=  \beta _{2SLS\\
(2.6) ( X'Z  (Z'Z)^{-1Z'X )^{-1}  X'Z (Z'Z)^{-1Z'Y=  \beta _{2SLS\\

We will first derive the IV estimator: \\
(3.1) Y = X \beta _{IV\\
(3.2) Z'Y = Z'X \beta _{IV\\
(3.3) (Z'X)'Z'Y = (Z'X)'Z'X \beta _{IV\\
(3.4) X'ZZ'Y = X'ZZ'X \beta _{IV\\
(3.5) (X'ZZ'X)^{-1X'ZZ'Y = (X'ZZ'X)^{-1X'ZZ'X \beta _{IV\\
(3.6) (X'ZZ'X)^{-1X'ZZ'Y =  \beta _{IV\\

To show that IV is same as the 2SLS we take an addition step after (3.2). Somewhat artificially we multiply both sides by X' Z (Z'Z)^{-1}  \\
(3.3b) X' Z (Z'Z)^{-1Z'Y = X' Z (Z'Z)^{-1Z'X \beta _{IV\\
(3.4b) (X' Z (Z'Z)^{-1Z'X)^{-1X' Z (Z'Z)^{-1Z'Y \\
                              = (X' Z (Z'Z)^{-1Z'X)^{-1X' Z (Z'Z)^{-1Z'X \beta _{IV\\
(3.5d) (X' Z (Z'Z)^{-1Z'X)^{-1X' Z (Z'Z)^{-1Z'Y = \beta _{IV\\
  Which must equal (3.6) because we have done only equal operations to both sides and (3.5d) is the same as (2.6).  Thus the IV estimator is the same as the 2SLS estimator.