Thursday, May 31, 2012

Value-added modelling - Stata simulation - iPad example

* Value-added modelling is a common approach for trying to infer the "quality" or "value" of inputs.

* In education, these methods have become quite popular politically and academically.

* But in a sense using these methods in education is an abstraction.

* To grasp how value-added methods were first developed, let us start with a production example.

* Imagine that you are manufacturing tablet computers (iPad).

* At each step of the production process a different "value" is added to the product as a result of the particular inputs.

* To conceptualize this, think of the product at each stage as being worth a certain value that is then bought by the next person in the manufacturing chain.

* However, you do not want to use the market to assemble the tablet computers.

* You want to assemble them in house.

* Therefore you want to figure out a way of inferring how much value each input has.

* To do this, imagine that you run a series of non-market valuations after each stage of the production process to infer a current price.

* Then you take the change in that market price as a measure of the value of those inputs.

set seed 11

* Let's do this:

clear
set obs 10

* 10 different companies

gen comp_id = mod(_n-1,10)+1
  label var comp_id "Company ID"

* Each of the companies has a different assembly line structure.
* Let's imagine that at each stage that company's structure adds a constant amount of value to all products.

gen comp_fe = runiform()
  label var comp_fe "The fixed effect (specific to that company) added to each product each stage"

* Now let's imagine that each company produces 10 different products (over the sample time)
expand 10

sort comp_id

* Generate a list of product ids
gen prod_id = _n
  label var prod_id "Product ID"

* Each product line has some inherent design component that makes it more or less valued at each progressive stage than other products.
gen prod_fe = rnormal()/2 + 1/4
  label var prod_fe "Product Fixed Effect"

* Now imagine that there are 5 stages of production for each product
expand 5
bysort prod_id: gen prod_stage = _n

* You have different stages, beginning with the initial product idea.
* Stage 1 gather raw materials
* Stage 2 manufacture components
* Stage 3 assemble components (probably in China)
* Stage 4 ship product
* Stage 5 sell product at the retail locations



label var prod_stage "Production stage"

* Now imagine also that there are 100 contractors in the tablet computer production market.

* These contractors get random contracts as to which product to work on.

* This is what we really want to know.

* How good are these contractors at "adding value" to the product?

* This is the trickiest part of the code so far.

* First we need to generate the contractors.

* Keep the data that we have generated so far:
preserve

* There are several ways of doing this.

* I will use the merge m:1 (many-to-one) command to accomplish this task.
clear

* Imagine that our 100 contractors are subsidiaries of 5 different umbrella companies.

set obs 5

gen cont_company_id=_n
  label var cont_company_id "Contracting company ID"

* Each of these companies has a different work ethic
gen cont_company_fe = rnormal()*.25
  label var cont_company_fe "Contracting company effectiveness"

* Each of the contracting companies has 20 contractors they manage.

expand 20

gen cont_id = _n
  label var cont_id "Contractor ID"

gen cont_fe = rnormal() + 1
  label var cont_fe "Contractor effectiveness"

* Now we will save the contractor information to a temporary data file
save "contractor.dta", replace

restore

* Let us first assign contractor IDs randomly:

gen cont_id=int(runiform()*100+1)

* Now let's merge in the contractor data

merge m:1 cont_id using "contractor.dta"
drop _merge

* Let us assume that whenever a product is developed there is some unobserved component that adds or subtracts random value from the product line at each stage, independent of all other inputs.

gen rand_effect = rnormal()
  label var rand_effect "Random idiosyncratic production effect unique to each product at each stage"
* In other words the error component

* Finally, imagine that at each stage there is an "average" amount of value added at that stage.

* We will use another merge command to do this:

preserve

clear
set obs 5

gen prod_stage=_n

gen stage_fe = runiform()*2
  label var stage_fe "Production stage fixed effect"

save "Stage.dta", replace

restore

merge m:1 prod_stage using "Stage.dta"
drop _merge

* Let us first add on a production stage zero representing the initial "value" of the product idea.

* This is a little tricky to do.

expand 2 if prod_stage == 1, generate(expand_indicator)

* I will expand the data for production stage 1 and indicate the created data with expand indicator.
replace prod_stage = 0 if expand_indicator==1
drop expand_indicator

* Now I want to make sure that stage 0 production is not done by any contractors so there is only the company effect and product effect.

foreach v in cont_id cont_company_id cont_company_fe cont_fe rand_effect stage_fe {
  replace `v' = 0 if prod_stage == 0
}

*****
* Now let us start to generate the values of the products at each stage.

* This is a cumulative model, i.e. a value added model.

* So in effect y=lambda*y[t-1] + XB + v

* To begin with we will generate the initial value of y.

gen value=abs(rnormal()) + prod_fe +  comp_fe if prod_stage==0

* First let us double check to make sure our data is sorted properly.
sort prod_id prod_stage

* This is the retention value of a product from each previous stage.

* If lambda is low then little of the value created in previous production stages is retained once the product is processed further.

* If lambda is high then it means that the previous value is retained plus any value added of progressive stages.
gen lambda=.95

* Now let us generate the cumulative value added data.
replace value=lambda*value[_n-1] + prod_fe +  comp_fe + cont_fe +cont_company_fe + stage_fe + rand_effect if prod_stage>0

**** Simulation END

bysort prod_stage: sum value
* We can see that on average at each production stage there is an increasing value of the product.
* However, we can also see that the variance in values increases as the value increases.
* This is because of the cumulative variance effect of the various components combined with the high retention value of each previous stage (lambda).
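* One compact way (not from the original post) to see both the rising mean and
* the rising spread at once is to tabulate the value by production stage:
tabstat value, by(prod_stage) statistics(mean sd)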

* Now that we have data generated through a Value-added simulation we can start testing different value added estimators.

* That will be for a later post!

Wednesday, May 30, 2012

Estimating Random Coefficients on x


* Stata code version 11.

* In Econometrics terms this is what we are thinking about:

* y=xb + e    e is a random variable
* b = B + u   u is a random variable

* Ie. there is an average effect B of x but there is also a individual effect, u.
* We want to know how much the effect of the policy varies across individuals.

* Therefore:
* Y=X(B+u) + e
* Y=XB + Xu + e
* Y=XB + v ;  v=(Xu + e)

clear
set seed 111
set obs 1000

gen u = rnormal()*3

* We cannot hope to estimate every individual u.
* However, we can get an estimate of the standard deviation of u.
* Which is >>>> 3 <<<<

gen b=u + 4

gen x=rnormal()

gen e=rnormal()*5

gen y= b*x + e

* We will attempt to use xtmixed to recover the variance on u and e.
* The standard deviation on u should be estimated by sd(x) since u is the random coefficient on x.
* And sd(_cons) is an estimate of the standard deviation of e.
* This might seem fishy but having a random coefficient on a constant is the same as having the standard additive error in the model.

xtmixed y x || x: x , iterate(2)
* xtmixed is tested with several specifications of iterate because it tends to bog down.
* Waiting for the algorithm to converge seemed far too time consuming.
* This is especially true given the almost non-existent difference between 5 and 10 iterations.

xtmixed y x || x: x , iterate(5)

xtmixed y x || x: x , iterate(10)

xtmixed y x || x: x , iterate(20)

xtmixed y x || x: x
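* A simple diagnostic (not in the original post): since the composite error is
* v = x*u + e, its variance should grow with x^2.  Regressing the squared OLS
* residuals on x^2 should give a slope near Var(u) = 9 and a constant near
* Var(e) = 25.  The variable names below are just for this sketch.
reg y x
predict v_hat, residual
gen v_hat_sq = v_hat^2
gen x_sq = x^2
reg v_hat_sq x_sq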

Tuesday, May 29, 2012

Write your own System OLS in Mata

* This is a method related to Seemingly Unrelated Regression (SUR) referred to in a previous post.

* Imagine that you have three different equations that have different dependent variables but the same explanatory variables.

* Let us think of the example of the demand for capturing and mining an asteroid and the research that goes into it. See Tony Cookson's post on the subject

* In general we might have three different relationships that we are trying to uncover without understanding the relationships between them.

* In equation 1: we have the price of platinum
* In equation 2: we have millions of dollars of investments in private spacecraft
* In equation 3: we have an index representing the probability of making a launch attempt

* This will specify the means of the three variables to be drawn jointly
matrix m = (0,0,0)

* This will specify the covariance matrix for the errors
matrix C = (12, -5, 5 \ -5, 12, -3 \ 5, -3, 6)
matrix list C

* Clear the memory and tell Stata to prepare for 100 observations
clear
set obs 100

* Draw the three error terms jointly
drawnorm u1 u2 u3, means(m) cov(C)

* Now let's generate our exogenous variables
gen platnum_Q = 10 + rnormal()
  label var platnum_Q "The current quantity of platinum available on the market"
gen invest_age = 50 + rnormal()
  label var invest_age "Average age of investors"
gen national_news_coverage = rpoisson(10)
  label var national_news_coverage "The current amount of national news coverage of space crafts."

* Now generate the dependent variables
gen platnum_price = 3*platnum_Q-0*invest_age-.1*national_news_coverage+u1*20
gen space_investment = 2*platnum_Q+3.5*invest_age +5.2*national_news_coverage+u2*5
gen launch_index = 6.01*platnum_Q+5.3*invest_age-10*national_news_coverage+u3*10

* send variables to mata.

putmata x1=platnum_Q x2=invest_age x3=national_news_coverage, replace
putmata y1=platnum_price y2=space_investment y3=launch_index, replace

mata
N=rows(x1)
cons = J(N, 1, 1)

// Save the explanatory variables to a matrix
X=(x1 , x2 , x3, cons)

// Save the dependent variables to a matrix
// Notice that at this point this is the same as OLS except that in OLS there is only one Y.
Y=(y1 , y2 , y3)

// The system OLS estimator (all equations estimated jointly) is:
B_OLS = invsym(X'*X) * (X'*Y)
B_OLS
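
// As an extra illustration (not in the original post), the system residuals can
// be used to estimate the cross-equation error covariance matrix, which is the
// piece of information that feasible GLS/SUR would exploit:
E = Y - X*B_OLS
Sigma = (E'*E)/N
Sigma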

end

reg platnum_price platnum_Q invest_age national_news_coverage
reg space_investment platnum_Q invest_age national_news_coverage
reg launch_index platnum_Q invest_age national_news_coverage

* This is the same as using seemingly unrelated regression.
mvreg platnum_price  space_investment launch_index = platnum_Q invest_age national_news_coverage

* However, seemingly unrelated regression allows for restrictions on coefficients.
sureg (platnum_price=platnum_Q invest_age national_news_coverage) ///
     (space_investment=platnum_Q invest_age national_news_coverage) ///
(launch_index=platnum_Q invest_age national_news_coverage)
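
* One advantage of the joint estimation (this check is not in the original post)
* is that cross-equation hypotheses can be tested directly, for example whether
* platnum_Q has the same effect on prices as on investment:
test [platnum_price]platnum_Q = [space_investment]platnum_Q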

Monday, May 28, 2012

Asymmetric Error with Right and Left Censoring


* Dependent variable is censored:

* In a previous post I looked at what happens when we use the tobit maximum likelihood estimator even when the error term is not normally distributed.

* In general we see that despite the failure of the normality assumption the tobit is shown to be a good estimator in a wide variety of situations with different error structures.

* However, all of those distributions of errors were symmetric distributions.

* There is no reason to believe that in general the unobserved heterogeneity should be symmetric around the expected value.

* Let's see what happens as we relax this assumption

cap program drop tobit_monte_carlo
program define tobit_monte_carlo, rclass

  * Let's first set up the simulation
  clear

  * Set the number of observations
  set obs 3000

  * Let's imagine that we are trying to infer the damages caused by various things to homes in coastal cities.

  * Generate some explanatory variables
  gen weather = rpoisson(1)
    label var weather "The home was hit by extreme weather."

  gen crime = rbeta(2,6)
    label var crime "Property crime rate in home's area"

  gen occupants = rpoisson(4)
    label var occupants "The number of people occupying the home"

  gen age = (runiform()*40)+18
    label var age "The age of the owner"

  gen age2=age^2

  gen credit = (runiform()*600)+200
    label var credit "The credit worthiness of the owner"

  * Now lets imagine that there is a lot of low level unexplained damages

  * This will loop over the values 2, 1, 6, and 9.
  foreach i in 2 1 6 9 {
   * This generates a error distribution
   gen e`i' = rbeta(2,`i')
   sum e`i'
   replace e`i'=(e`i'-r(mean))/r(sd)
   if `i'==10    replace e`i'=e`i'*(-1)
 
   * The name option saves the graph to memory with the name e`i'
  if "`1'"=="graph" qui hist e`i', title(e~rbeta(2,`i')) name(e`i', replace)
   * This creates a local list of all of the graphs in memory by adding on to the list every time this loops.
local graphnames `graphnames' e`i'

  }
  * Graphs the combined 4 graphs
  if "`1'"=="graph" graph combine `graphnames'

  foreach i in 2 1 6 9 {
  * First let's generate the true thing we would like to understand. True amount of home damage.
    gen home_damage`i' = -10000 + 100000*weather + 10000*crime + 5000*occupants - 500*age + 20*age2 + 100*credit + e`i'*100000
  }
  sum home_damage*
  * We can think of repairs made to the home as a reasonable interpretation of negative values of home_damage.

  * However, we only have information on insurance payments.  Meaning each home had a deductible:
  gen deductible = 5000

  * Each home also has a maximum that the insurance policy will cover:
  gen maximum = `2'

  * Let us first impose our maximums and minimums
  foreach i in 2 1 6 9 {
    gen insurance_claims`i' = min(home_damage`i', maximum)
* This puts a cap on payouts but it is a little trickier figuring out minimums

* We know that if the claim is less than the deductible then it is not recorded.
replace insurance_claims`i' = 0 if insurance_claims`i' < deductible
  }
  sum insurance_claims*
  * We can see the different distributions of errors slightly affect payouts but not by much.

*****************************************************************
*** simulation end

* So we want to know, how much did the different factors affect home damages?
* We can observe the insurance claims, the deductibles, and the maximum payout but not any damages that are less or more than that.

* remember home_damage`i' = -10000 + 100000*weather + 10000*crime + 5000*occupants - 500*age + 20*age2 + 100*credit + e`i'*100000

* create a return list for the simulation command
gl return_list

* Let's see how well the OLS estimator does at recovering the coefficients
  foreach i in 2 1 6 9 {
    reg insurance_claims`i' weather crime occupants age age2 credit
    foreach v in weather crime occupants age age2 credit {
 return scalar OLS_`v'`i' = _b[`v']
 gl return_list $return_list OLS_`v'`i'=r(OLS_`v'`i')

}
  }

  * Let's see how well the tobit estimator does at recovering the coefficients
  foreach i in 2 1 6 9 {
    tobit insurance_claims`i' weather crime occupants age age2 credit, ll(5000) ul(`2')
     foreach v in weather crime occupants age age2 credit {
 return scalar Tob_`v'`i' = _b[`v']
 gl return_list $return_list Tob_`v'`i'=r(Tob_`v'`i')
}
 }
* End program
end

tobit_monte_carlo graph 100000

di "simulate $return_list , reps(50): tobit_monte_carlo nograph 100000"

simulate $return_list , reps(50): tobit_monte_carlo nograph 100000

order *weather* *crime* *occup* *age? *age2? *credit*

sum
* It seems that the OLS estimator generally outperforms the tobit estimator.
* There is no reason that this should be the case except that the data suffers from both top and bottom coding.

* I suspect that if there was only bottom coding then the Tobit estimator would outperform the OLS estimator.

* In order to test this we can try:
simulate $return_list , reps(50): tobit_monte_carlo nograph 10000000

order *weather* *crime* *occup* *age? *age2? *credit*

sum
* It seems that the OLS estimator still generally outperforms the tobit estimator.
* Though it is hard to say.  Perhaps this is due to the small sample size.
* A larger sample size should help the QMLE be more consistent.
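
* A quick way (not in the original post) to gauge the bias from the simulate
* output is to compare the mean of each set of estimates to the true
* coefficient, for example the weather coefficient (true value 100000):
foreach v of varlist OLS_weather* Tob_weather* {
  qui sum `v'
  di "`v': mean = " r(mean) "  bias = " (r(mean) - 100000)
}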

Friday, May 25, 2012

Estimating Random Coefficients in Panel Data


* Imagine that you are concerned that each of your individual panel blocks has a unique random coefficient.

* This might be reasonable if say you are wondering what the effect on employment will be if you increase the federal minimum wage.

* States that already have a minimum wage higher than the proposed increase will probably not see much change, while states that are well below the intended increase might see a large change.

* Note that if you define your x variable reasonably well in this case you might not even worry about random coefficients.

* Let us start by setting up the simulation.

clear
set seed 11
set obs 50
* Generate 50 states

gen state=_n

* Now let us specify a random number of observations per state

gen nobs = rpoisson(15)*4

* Now lets specify a random coefficient per state for the policy change.

gen b=-1+rnormal()

* Now let's expand our sample
expand nobs

gen x=rnormal()*3 + 10

gen y= b*x + rnormal()*3

* The number of observations varies across states.

* In Econometrics terms this is what we are worrying about:

* Y=Xb + e    e is a random variable
* b = B + u   u is a random variable

* Therefore:
* Y=X(B+u) + e
* Y=XB + Xu + e
* Y=XB + v ;  v=(Xu + e)

xtmixed y x || state: x, nocon

reg y x

* We can see that both xtmixed and normal OLS work pretty well.

* This is primarily because v is uncorrelated with x

* Imagine now that b is correlated with x.

gen b2=-1+rnormal()*2+x-10

sum b2
* b2 still has an average effect of -1

gen y2= b2*x + rnormal()*3

reg y2 x
* We can see that OLS is biased positively because:

corr b2 x

xtmixed y2 x || state: x, nocon

* xtmixed does seem to improve the outcome because the state: x specification allows v to be correlated with x.

* However, let's test this theory with a Monte Carlo estimation:

* First we need to define a program for stata to repeat
cap program drop xtmixed_monte_carlo
program define xtmixed_monte_carlo, rclass

  clear
  * remove the set seed option
  set obs 50
  * Generate 50 states

  gen state=_n

  * Now let us specify a random number of observations per state

  gen nobs = rpoisson(15)*4

  * Now lets specify a random coefficient per state for the policy change.

  gen b=-1+rnormal()

  * Now let's expand our sample
  expand nobs

  gen x=rnormal()*3 + 10

  gen y= b*x + rnormal()*3

  * The number of observations varies across states.

  * In Econometrics terms this is what we are worrying about:

  * Y=Xb + e    e is a random variable
  * b = B + u   u is a random variable

  * Therefore:
  * Y=X(B+u) + e
  * Y=XB + Xu + e
  * Y=XB + v ;  v=(Xu + e)

  xtmixed y x || state: x, nocon
  * Now we define a return command that will return the cofficients of interest.
    return scalar XTM1 = _b[x]

  reg y x
  * Now we define a return command that will return the cofficients of interest.
    return scalar OLS1 = _b[x]

  * We can see that both xtmixed and normal OLS work pretty well.

  * This is primarily because v is uncorrelated with x

  * Imagine now that b is correlated with x.

  gen b2=-1+rnormal()*2+(x-10)

  sum b2
  * b2 still has an average effect of -1

  gen y2= b2*x + rnormal()*3

  reg y2 x
  * We can see that OLS is biased positively because:
  * Now we define a return command that will return the cofficients of interest.
    return scalar OLS2 = _b[x]

  corr b2 x

xtmixed y2 x || state: x, nocon
  * Now we define a return command that will return the cofficients of interest.
    return scalar XTM2 = _b[x]

end

xtmixed_monte_carlo
* We can see the coefficient estimates of our first simulation run.

* This might take a little while.  This command will repeat the entire simulation 50 times, storing the estimates each time.
simulate OLS1=r(OLS1) OLS2=r(OLS2) XTM1=r(XTM1) XTM2=r(XTM2), reps(50): xtmixed_monte_carlo

sum
* We can see that clearly OLS and Xtmixed are both biased because of the correlation between u and x.

* However, they appear to work quite well at identifying the average coefficient when that correlation is not present.

* What xtmixed helps do is to identify the variance of u as well.

simulate OLS1=r(OLS1) OLS2=r(OLS2) XTM1=r(XTM1) XTM2=r(XTM2), reps(500): xtmixed_monte_carlo
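
* Summarizing the larger simulation (this check is not in the original post):
* OLS1 and XTM1 should remain centered near the true average effect of -1,
* while OLS2 and XTM2 should remain biased upward.
sum OLS1 OLS2 XTM1 XTM2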

Tuesday, May 22, 2012

Tobit Normality Assumption Fail - Tobit Still Works

* This simulation looks at what happens when the underlying data generating process is not normal (key assumption with Tobit).



* This post is a follow up to the previous post on Bottom Coding and Tobit on May 21st.

set seed 11

  * Let's first set up the simulation
  clear

  * Set the number of observations
  set obs 1000

  * Set the random seed
 set seed 101

  * Generate some explanatory variables
  gen man_num_sibs = rpoisson(3)
    label var man_num_sibs "The number of siblings that the man has"

  gen woman_num_sibs = rpoisson(3)
    label var woman_num_sibs "The number of siblings that the spouse has"

  gen income = abs(rnormal())*2
    label var income "Family income, 10k/year"
   
  * Generate the number of children each man has with the error being
  * drawn from a poisson distribution which has either positive or negative
  * signs randomly
  gen e1 = rpoisson(3)*(-1)^rbinomial(1,.5)
  sum e1
  replace e1=e1/r(sd)*2

gen Y1 = .8*man_num_sibs + .6*woman_num_sibs - 2*income + e1
    label var Y1 "The true underlying amount of children some men would have"
 
  * Restrict the number of children to the non-negative range.
gen Nchildren1 = max(Y1,0)

tobit Nchildren1 man_num_sibs woman_num_sibs income, ll(0)
* Despite a very non-normal error the tobit estimator still works quite well

  * Generate the number of children each man has with the error being
  * drawn from a log normal distribution with random positive or negative signs
  gen e2 = exp(rnormal())*(-1)^rbinomial(1,.5)
  sum e2
  replace e2=e2/r(sd)*2
gen Y2 = .8*man_num_sibs + .6*woman_num_sibs - 2*income + e2
    label var Y2 "The true underlying amount of children some men would have"
 
  * Restrict the number of children to the non-negative range.
gen Nchildren2 = max(Y2,0)

  tobit Nchildren2 man_num_sibs woman_num_sibs income, ll(0)
* Despite a very non-normal error the tobit estimator still works quite well

  * Generate the number of children each man has with the error being
  * drawn from a double log normal distribution with random positive or negative signs
  gen e3 = exp(exp(rnormal()))*(-1)^rbinomial(1,.5)
    sum e3
  replace e3=e3/r(sd)*2
gen Y3 = .8*man_num_sibs + .6*woman_num_sibs - 2*income + e3
    label var Y3 "The true underlying amount of children some men would have"
 
  * Restrict the number of children to the non-negative range.
gen Nchildren3 = max(Y3,0)

  tobit Nchildren3 man_num_sibs woman_num_sibs income, ll(0)
* Despite a very non-normal error the tobit estimator still works pretty well.



* Compare with OLS.  Hard to tell which is preferred from this.  It would be useful to use
* a monte carlo simulation to discover if the tobit seems unbiased.  See the previous post
* on using simulations to understand bias.
reg Nchildren3 man_num_sibs woman_num_sibs income

  * Generate the number of children each man has with the error being
  * drawn from a chi-squared distribution with random positive or negative signs
  gen e4 = (-1)^rbinomial(1,.5)*(rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2+ ///
                                 rnormal()^2+rnormal()^2+rnormal()^2+rnormal()^2)
  label var e4 "Bimodal error - tobit still works"
 
    sum e4
  replace e4=e4/r(sd)*2
gen Y4 = .8*man_num_sibs + .6*woman_num_sibs - 2*income + e4
    label var Y4 "The true underlying amount of children some men would have"
 
  * Restrict the number of children to the non-negative range.
gen Nchildren4 = max(Y4,0)

  tobit Nchildren4 man_num_sibs woman_num_sibs income, ll(0)
* Despite a very non-normal error the tobit estimator still works quite well

sum e?

hist e4, kden
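
* As an additional check (not in the original post), a formal skewness/kurtosis
* test should strongly reject normality for these constructed errors:
sktest e1 e2 e3 e4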

Monday, May 21, 2012

Dependent variable is bottom coded

* This simulation follows James Tobin's 1958 paper:
* Estimation of Relationships for Limited Dependent Variables By JAMES TOBIN

* You want to estimate the Bs from E(Y)=XB but you can only observe W, where W=W_min if Y<W_min, W=W_max if Y>W_max, and W=Y otherwise.

* This kind of data censorship is very common.  Many things are only observable in that they are never less than zero or greater than some value.

* For example, if you are trying to estimate the number of children a man is likely to have, there is no defined upper limit but there is clearly a well defined lower limit.
set seed 101

cap program drop gen_data
* Define a program to generate the data
program gen_data

  * Let's first set up the simulation
 clear

 version 11

  * Set the number of observations
  set obs 10000

  * Set the random seed

  * Generate some explanatory variables
  gen man_num_sibs = rpoisson(3)
    label var man_num_sibs "The number of siblings that the man has"

  gen woman_num_sibs = rpoisson(3)
    label var woman_num_sibs "The number of siblings that the spouse has"

  gen income = abs(rnormal())*2
    label var income "Family income, 10k/year"
   
  * Generate a random normal error
  gen e = rnormal()*2

  * Generate the number of children each man has
  gen Y = .8*man_num_sibs + .6*woman_num_sibs - 2*income + e
    label var Y "The true underlying amount of children some men would have"
 
  * Restrict the number of children to the non-negative range.
  gen Nchildren = max(Y,0)

  * Execute whatever command is specified with this program
  `0'
  * End the data generation
end

* Run the program once to generate the data
gen_data

reg Nchildren man_num_sibs woman_num_sibs income
* We can see that all of the coefficients are biased towards zero.

* This makes sense: if you restrict the range of the ys,
* then the magnitudes of the coefficients are also correspondingly restricted.

* However, we want to know what the uncensored effect of income on the number of children is.

* Given the structure of the data we might restrict our sample say only to couples who both come from a family of 3 or more children.

gen restrict = 1 if man_num_sibs>=3 & woman_num_sibs>=3

reg Nchildren man_num_sibs woman_num_sibs income if restrict==1

* We can see that the estimates are now less biased if more noisy.  Let's try a further restriction:
gen restrict2 = 1 if man_num_sibs>3 & woman_num_sibs>3

reg Nchildren man_num_sibs woman_num_sibs income if restrict2==1
* The estimate is still pretty bad looking.

* We can see that the estimates are now less biased if more noisy.  Let's try a further restriction:
gen restrict3 = 1 if man_num_sibs>4 & woman_num_sibs>4

reg Nchildren man_num_sibs woman_num_sibs income if restrict3==1
* The estimate is looking better.

reg Nchildren man_num_sibs woman_num_sibs income if man_num_sibs>5 & woman_num_sibs>5 & income<1
* The estimate actually looks pretty good.  This does not necessarily mean that the estimator is unbiased,
* just that this particular draw of the estimator happens to look good.

simulate, rep(50): gen_data reg Nchildren man_num_sibs woman_num_sibs income if man_num_sibs>5 & woman_num_sibs>5 & income<1
sum

* We can see that the estimates look pretty close to unbiased. 
* However, the noisiness of the estimates is so large due to the restricted sample size that it would be nearly impossible for any single point estimate draw to be large enough to reject the null.

* The more concerning thing with a restriction such as this is that it is not clear any more what is being measured. 
* I.e. what is the effect of income on children in a family where both parents come from families of 4 siblings or more and where the income is less than 10,000 a year?
* Are these samples really comparable?

* A final restriction that might seem worth investigating is:
gen_data reg Nchildren man_num_sibs woman_num_sibs income if Nchildren>0

* This is really creating a sort of selection bias.  It does not help out the estimates.

* In effect all of these estimates are: w = XB + e + u (sample censorship error)

* Because we are simulating data we can actually observe u (because we know Y).
gen u = Y - Nchildren

reg u man_num_sibs woman_num_sibs income
* We can see that all of the explanatory variables are strong predictors of the censorship error.

* What is a little tricky is why:
reg u man_num_sibs woman_num_sibs income if Nchildren>0
* finds no correlation even though the restricted regression (if Nchildren>0) is still biased.

* Under this estimate the bias is from a different source.  This we can observe as well.

* It is due to there now being a correlation between the original errors (e) and the explanatory variables:

* We know that initially corr(X,e)=0
reg e man_num_sibs woman_num_sibs income

* However, what happens when we restrict the sample?
reg e man_num_sibs woman_num_sibs income if Nchildren>0

* Strange? No, not really.

* Just as the data is censored as a function of the X variables (producing a censorship error), so too if the data is restricted as a function of the explanatory variables then the unobserved error will be restricted in a systematic fashion that is a function of the explanatory variables.

  label var e "e (No restrictions)"

gen e2=e if Nchildren>0
  label var e2 "e (# of children>0)"
 
two (line e income, sort ) (line e2 income, sort col(yellow)),  plotregion(fcolor(gs10)) ///
   title(Data restrictions can lead to unfortunate correlations)
  
* However, fortunately in this case James Tobin made some thoughtful adjustments to OLS:
* Assuming that the error is normally distributed:

* Remember E(Y) = XB

* He recognized that the probability of observing W=Nchildren=W_min is equal to
* P(Y + e < W_min) = P(e < W_min - Y) = 1 - Unit_Normal_CDF{(Y-W_min)/sd(e)}

* In other words, the likelihood of observing an outcome equal to the
* minimum value (W_min) is equal to the probability that the random
* error e will push the Y value below W_min.

* Likewise the probability of observing an outcome above W_min
* but below some value x is:
*
* P(W_min < W <= x | Y) = P(e <= x - Y) - P(e <= W_min - Y)
*                       = Unit_Normal_CDF{(x-Y)/sd(e)} - Unit_Normal_CDF{(W_min-Y)/sd(e)}
* so each uncensored observation contributes the normal density of (W-Y)/sd(e) to the likelihood.
* Stata has a built in command called the tobit:
* You just need to set the lower limit:

tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)
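
* As an illustration (not in the original post) of the likelihood described
* above, the censored-at-zero likelihood can also be coded directly with
* Stata's ml command.  The program name mytobit_ll is just for this sketch.
cap program drop mytobit_ll
program define mytobit_ll
  args lnf xb lnsigma
  * Uncensored observations contribute the normal density of (y - xb)
  qui replace `lnf' = ln(normalden($ML_y1, `xb', exp(`lnsigma'))) if $ML_y1 > 0
  * Censored observations contribute P(latent y <= 0) = normal((0 - xb)/sd(e))
  qui replace `lnf' = ln(normal((0 - `xb')/exp(`lnsigma'))) if $ML_y1 <= 0
end
ml model lf mytobit_ll (Nchildren = man_num_sibs woman_num_sibs income) (lnsigma:)
ml maximize
* The estimates should essentially match the built-in tobit command above
* (with sigma reported on the log scale).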

simulate, reps(50) : gen_data tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)
sum
* Looks like some pretty unbiased estimates

gen_data

tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)

* Now what happens when we do not allow the dependent variable to be continuous
replace Nchildren=round(Nchildren)

tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)

simulate, reps(50) : gen_data tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)
sum

* We can see that making the number of children whole numbers does not bias the results.
* However, it does cause the estimators to be less precise (larger standard deviations).

* We can explain this easily:
* Y=XB + u
* Y_round = XB + u + rounding_error

* Since corr(rounding_error,X)=0, the only price of rounding the dependent variable is a larger error resulting in less precise estimators.


* Note, a poisson regression would typically be thought of as an appropriate regression given the count nature of the rounded data.

* This, however, would not be appropriate given the underlying data generating process.
 poisson Nchildren man_num_sibs woman_num_sibs income

Friday, May 18, 2012

Unobserved effects model - extensions


* I often worry about linearity assumptions and their
* implications.  However, there are some serious strengths
* to the assumptions of linearity.

* For instance, what if the relationship between unobservable
* variable z and the observable variable education is non-linear
* and the relationship between z and earnings is either linear or
* non-linear, does including a fixed effect in the model
* still control for this potential source of bias?

* Stata code
clear

* Imagine we have 10,000 individuals that we track
set obs 10000
set seed 1021

gen intelligence = abs(rnormal())
  label var intelligence "Intelligence"

gen int_test = intelligence+rnormal()/3
  * Imagine we have data from some intelligence test available.
  * It is a proxy for intelligence but does not truly measure
  * underlying intelligence.

sort int_test
gen int_tile = _n
  * Ranks the intelligence of subjects from
  * 1 to _N

gen int_per = int_tile/(_N+1)
  * changes the scale of intelligence from 1 to _N to
  * 0 to 1

* Now lets assume you fit a normal distribution
gen IQ_test = invnormal(int_per)
  label var IQ_test "Normal fitted values for intelligence approximation"
  hist IQ_test, bin(100) kden

gen GPA = round(4*(1-invibeta(2,5, 1-int_per)),.1)
  hist GPA, bin(20) kden
* Imagine that there has been grade inflation


gen id = _n
  label var id "Individual specific ID"

* create 5 observations for each initial observation
expand 5

bysort id: gen year=_n*5
  label var year "Year of observation"

tab year

gen education=0
bysort id: replace education = round(rbinomial(1,.3)+intelligence^.7+education[_n-1],.25) if _n>1
  label var education "Explanatory variable education (with time constant and time varying components)"
* Now education is non-linearly a function of intelligence

corr education intelligence
* So there is clearly a strong correlation between intelligence
* and years of education in the data.

gen intel_7= intelligence^.7

corr education intel_7
* has an even better fit.



gen u = 10*rnormal()+2*intelligence^2
  label var u "Error term (correlated with unobservables intelligence)"
  * Likewise the error is a different non-linear function of intelligence

corr u edu
  * Thus the error term and education are correlated.

gen earnings = education + u
  label var earnings "Earnings"


reg earnings education
* We can see that OLS is biased because of the correlation
* between the error term and the explanatory variable.

* Note: that though the earnings expression above did not specify
* an intercept, there exists one as a result of the error term
* having a non-zero expected value because intelligence has a
* non-zero expected value.

* Let us try including intelligence
reg earnings education intelligence

* Let us try including our intelligence test as a proxy for
* intelligence.
reg earnings education int_test

* We can see that the estimate is still biased if not as badly as
* that of not including any proxy.

* Imagine instead of having a correctly scaled but noisy
* intelligence test we have an IQ test that imposes its
* own scale on intelligence from a noisy measure.
reg earnings education IQ_test
* This creates a more biased measure.

* Now imagine you do not even have an intelligence test but
* instead GPA which imposes its own inflated scale as a measure
* of intelligence.
reg earnings education GPA
* This estimate is more biased.  Ultimately, this is because
* intelligence is more correlated with IQ than GPA.

corr intelligence IQ_test GPA

* Note that even though GPA is a bad measure it is still a better
* proxy than including no proxy at all.

* Now, the nice thing about this data is that you need not worry too
* much about proxies for intelligence, or even observe intelligence at all,
* because intelligence is assumed to be a fixed factor in your data
* and the effect of intelligence can therefore be controlled for
* by a simple fixed effect model.

areg earnings education, absorb(id)
* This shows the immense power of having panel data.  No matter what
* the relationship is between the fixed unobserved variable, the
* explanatory variables, and the error, so long as intelligence enters
* the earnings equation additively (rather than altering the returns to
* education), a simple fixed effects model is sufficient to control for this bias.
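
* An equivalent check (not in the original post), after declaring the panel
* structure, is the standard fixed effects estimator:
xtset id year
xtreg earnings education, fe
* The coefficient on education should match the areg, absorb(id) estimate above.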

* However, if the returns to education (the coefficient) is also
* correlated with intelligence (reasonable) then we are in a different
* much more complicated scenario and we had best start thinking
* about the world of Correlated Random Coefficients.

Thursday, May 17, 2012

Unobserved fixed effects model


* Often times we are concerned that there are some unobserved
* factors which are correlated with our explanatory variables x
* as well as with our error term u.

* For example, we might be concerned that intelligence is
* correlated with years of schooling as well as future
* expected earnings.  However, fortunately, intelligence
* is thought of as a time constant factor.

* Therefore, if we remove time constant factors we might
* be able to approximate the returns to education.
* (This is assuming the returns to years of education is
* constant.  If it is a function of intelligence then
* we are going to need to think about being more clever
* about this.)

* Stata code
clear

* Imagine we have 200 individuals that we track
set obs 200
set seed 101

gen c = rnormal()
  label var c "Time Constant Heterogeneity (individual specific)"

gen id = _n
  label var id "Individual specific ID"

* create 2 observations (time periods) for each initial observation
expand 2

bysort id: gen year=_n
  label var year "Year of observation"

tab year

gen x = rnormal()+c
  label var x "explanatory variable X (with time constant and time varying components)"

gen u = 3*rnormal()+3*c
  label var u "Error term (correlated with unobservables c)"

gen y = x + u
  label var y "Outcome variable"

reg y x
* We can see that OLS is biased

xtset id year
* Tells stata to use id as a panel data individual identifier

xtreg y x, fe
* However, the fixed effect estimator is unbiased because it
* successfully eliminates the correlation between the time
* constant correlation between the x and the error u.

* Note: an identical command is:
reg y x i.id

* Or:
areg y x, absorb(id)

* An alternative approach is the Chamberlain-Mundlak device.
* If we fear that the constant part of x might be correlated
* with u then we can easily control for that by including
* it in the regression:
bysort id: egen x_mean = mean(x)

reg y x x_mean

* When there are only two time periods the first difference estimator is
* essentially the same as fixed effects, but with more than two time periods
* it tends to be different.  Differencing is also effective at removing time
* constant effects.
gen y_diff = y-l.y
gen x_diff = x-l.x

reg y_diff x_diff

Wednesday, May 16, 2012

Testing the Rank Condition of IV Estimators


* One of the key assumptions of the IV estimators
* is that the rank E(Z'X)=K=# of elements of B.

* It is difficult to test the rank condition in
* general.  However following Wooldridge 2010 it
* is easy to test the rank condition of instrumental
* variables when there is only one endogenous variable
* xk

* xk = d1 + d2*x2 + ... + d(k-1)*x(k-1) + theta1*z1 + r

* In order for the rank condition to hold we need theta1 != 0

*******************************************************
* Let's see this test in action:

clear
set obs 1000
set seed 100

gen x1 = rnormal()
gen x2 = rnormal()
gen x3 = rnormal()

* generate the unobserved factor V which contributes
* to create a correlation between xk and u.
gen V = rnormal()

* generate the instrumental variable
gen z = rnormal()

* generate xk the suspected endogenous variable.
gen xk = rnormal() + V + z

gen u = rnormal()*2 - 2*V

gen y = 1*x1 + 2*x2 + 4*x3 + xk + u

* Simulation End

reg y x1 x2 x3 xk
* We can see that xk is biased

* In order to test the rank condition of the IV estimator
* we estimate

reg xk x1 x2 x3 z
* z appears to be strongly significant.  Therefore we
* conclude that the rank condition is sustained.
* Therefore we can safely use the IV estimator:

ivreg y x1 x2 x3 (xk=z)
* So the IV estimator seems to be working pretty well.
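
* As a side note (not in the original post), the newer ivregress command
* provides built-in first-stage diagnostics that formalize this check:
ivregress 2sls y x1 x2 x3 (xk=z)
estat firststage
* A large first-stage F statistic on z is consistent with the rank condition holding.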

* When might we expect the rank condition to fail?

*******************************************************
* When the instrumental variable is weak (has small
* explanatory power relative to xk):

clear
* Set the number of observations to generate to 1000.
set obs 1000
set seed 101

* generate the instrumental variable
gen z = rnormal()

gen x1 = rnormal()
gen x2 = rnormal()
gen x3 = rnormal()

* generate the unobserved factor V which contributes
* to create a correlation between xk and u.
gen V = rnormal()

* generate xk the suspected endogenous variable.
gen xk = 2*rnormal() + V + z*.1

reg xk x1 x2 x3 z
* We cannot see a statistical difference between
* theta and 0 (though we know one exists).

* Note, just because our rank condition test fails
* does not mean that the underlying rank condition
* fails.  In this case E(Z'X)!=0, therefore the
* rank condition does not fail in the population.
* In order to see this, try increasing the number
* of observations.

* Generate the endogeneity between xk and y
gen u = rnormal()*2 - 2*V

gen y = 1*x1 + 2*x2 + 4*x3 + xk + u

ivreg y x1 x2 x3 (xk=z)
* The IV estimator is simply too weak to get
* good estimates.

reg y x*
* Once again the OLS regression works pretty well even
* when the IV estimator is not working well.

Tuesday, May 15, 2012

The Weak Instrument Problem

* Instrumental variables are extremely useful in
* economics in general because they provide a
* mechanism to get around the issue that some
* important predictive variables of interest
* in economics might also be correlated with
* other unobservable factors that might affect
* the outcome.

* A typical example is years of education and wages:
* wage = alpha + beta*years_of_education + u

* The problem is that the number of years of
* education might be related to motivation which
* might also be related to u.  So
* corr(years_of_ed, u) != 0 and the
* normal OLS estimators will be biased
* because:

* Y = XB + u
* X'Y = X'XB + X'u
* (X'X)^-1 X'Y = B + (X'X)^-1 X'u

* If we assume that X and u are uncorrelated and
* the Xs are fixed then an unbiased estimate
* of B is B_hat:

* E((X'X)^-1 X'Y) = B + E((X'X)^-1 X'u)
* E((X'X)^-1 X'Y) = B
* B_hat = (X'X)^-1 X'Y

* But if corr(X,u)!=0 then:
* B_hat = (X'X)^-1 X'Y
* B_hat = (X'X)^-1 X'(XB + u)
* B_hat = (X'X)^-1 X'XB + (X'X)^-1 X'u
* B_hat = B + (X'X)^-1 X'u
* E(B_hat) = B + E((X'X)^-1 X'u)

* to show the bias: Assume fixed Xs
* E(B_hat) = B + (X'X)^-1 E(X'u)
* E(X'u)!=0 if corr(X,u)!=0
* so E(B_hat) != B

* likewise, to show inconsistency:
* plim(B_hat) = B + (X'X)^-1 plim(X'u)
* plim(X'u)!=0 if corr(X,u)!=0
* so plim(B_hat) != B

* So in order to get around this economists
* often use the instrumental variables estimator.

* Imagine that there is some instrument Z
* which is uncorrelated with u and
* is correlated with X. Z must also be
* uncorrelated with B in the standard case.

* Y = XB + u
* Z'Y = Z'XB + Z'u
* (Z'X)^-1 Z'Y = (Z'X)^-1 Z'XB + (Z'X)^-1 Z'u
* (Z'X)^-1 Z'Y = B + (Z'X)^-1 Z'u

* Since we have assumed corr(Z,u) = 0,
* E(Z'u)=0 at the population level.

* Thus we get the IV estimator:
* (Z'X)^-1 Z'Y = B_hat
* B_hat = (Z'X)^-1 Z'Y

* To show consistency:
* B_hat = (Z'X)^-1 Z'Y
* B_hat = (Z'X)^-1 Z'(XB + u)
* B_hat = (Z'X)^-1 Z'XB + (Z'X)^-1 Z'u
* B_hat =  B + (Z'X)^-1 Z'u

* plim(B_hat) =  B + plim((Z'X)^-1 Z'u)
* plim(B_hat) =  B + (Z'X)^-1 plim(Z'u)
* plim(B_hat) =  B

* Note: the IV estimator is not unbiased but
* it is consistent!

* However, if corr(Z,u) != 0 even if it is small
* and the instrument is "weak" that is
* |Z'X| is small then we have a weak instrument
* problem.  This is because (Z'X)^-1 is 1 over
* a small number which means that
* even a slight correlation in corr(Z,u) can cause
* a large bias in the IV estimator.

* Let's see how that works!

clear
set obs 1000

* First the standard IV estimator
gen Z1 = rnormal()

* However, Z is not correlated with u yet.
gen u1 = rnormal()

* So Z is weak and remember u1 has to be
* correlated with X in order to justify
* use of the IV estimator:
gen X1 = rnormal()*10 + Z1 + u1*2

reg X1 Z1

gen Y1 = 3*X1 + 5*u1

ivreg Y1 (X1=Z1)
* We can see that Z is a pretty effective
* instrument at estimating B even if
* the correlation between Z and X is small.

******************************************
* Now let's see what happens when there
* is a small correlation between Z and u

* Imagine there is some additional
* explanatory variable V which is unobserved
* but partially explains the instrument
* as well.

gen V = rnormal()

* First the standard IV estimator
gen Z2 = rnormal() - V

* However, Z is somewhat correlated with u.
gen u2 = 10*rnormal() + V/4

* So Z is weak and remember u2 has to be
* correlated with X in order to justify
* use of the IV estimator:
gen X2 = rnormal()*10 + Z2/4 + .1*u2

reg X2 Z2

gen Y2 = 3*X2 + 5*u2

ivreg Y2 (X2=Z2)
* In this case we get a better estimate of
* B by using the known biased OLS estimator
* than by using the instrumental variables
* estimator.

reg Y2 X2

* Even though the correlation between the
* instrument and the error is small
pwcorr Z2 u2, sig

******************************************
* This is primarily a function of the weakness
* of Z at explaining X.  See what happens
* if Z has more explanatory power

gen X3 = rnormal()*10 + 10*Z2 + .1*u2

reg X3 Z2

gen Y3 = 3*X3 + 5*u2

ivreg Y3 (X3=Z2)
* Now the IV estimate is close to the true value even though Z
* is slightly correlated with the error u.

reg Y2 X2

* Even though the correlation between
* Z and u is the same as previously,
* the strength of the instruments
* in explaining X wins out and gives us a
* better estimator than OLS.
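
* A rough back-of-the-envelope check (not in the original post): with a single
* regressor, the asymptotic bias of the IV estimator is approximately
* cov(Z, error)/cov(Z, X), where here the structural error is 5*u2.
gen err2 = 5*u2
corr Z2 err2, cov
local cov_zu = r(cov_12)
corr Z2 X2, cov
di "Approximate IV bias with the weak instrument (X2):   " `cov_zu'/r(cov_12)
corr Z2 X3, cov
di "Approximate IV bias with the strong instrument (X3): " `cov_zu'/r(cov_12)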

Sunday, May 13, 2012

Average Treatment Effects and Correlated Random Coefficients


* Average Treatment Effects and Correlated Random Coefficients

* Wooldridge, J. (2003). Further results on instrumental
* variables estimation of average treatment effects in
* the correlated random coefficient model. Economics
* Letters, 79(2), 185-191. doi:10.1016/S0165-1765
* (02)00318-X

* This simulation follows the paper by Wooldridge which
* shows that it is possible to consistently estimate
* the population average of the correlated random
* coefficient (CRC) model with multiple treatment
* variables under the standard assumptions under
* which instrumental variables (IV) estimators are
* consistent.

* In order to understand CRC think of the following
* model. Equation (1):

* E(y|a,b,w)=a+wb=a + b1*w1 + b2*w2 ... + bg*wg     (1)

* b can depend on unobserved heterogeneity as well
* as w.  Writing the equation in error form (2):

* y = a + wb + e, E(e|a,b,w)=0

* b can vary for every individual observation i.

* Therefore we are not trying to estimate each individual
* b which is impossible, but instead we are attempting
* to estimate the E(b) or the average treatment effect (ATE).

* This in general is difficult when bj is correlated with
* unobserved heterogeneity of individual j.

* In order to tackle this problem Wooldridge introduces
* a instrumental variable which is redundant or ignorable
* in covariates x and instruments z ie.

* E(y|a,b,w,x,z)=E(y|a,b,w)                          (6)

* The next assumption is what separates xs from zs.

* E(a|x,z)=E(a|x)=gamma0 + x*gamma                   (7)

* This says that the mean a can depend on the explanatory
* variables x but not on the instrumental variable z.

* E(bj|x,z)=E(b|x)=beta0 + (x-E(x))*deltaj

* This assumption says that the average coefficient can
* depend on the explanatory variables x but not on the
* instrumental variable z.

* We can write a = gamma0 + x*gamma + c , E(c|x,z)=0 (8)
* and b=beta0 + (x-E(x))*deltaj + vj, E(vj|x,z)=0    (9)

* Ultimately by substituting this back into (1) we get:

* y=gamma0 + xgamma + wbeta + w1(x-E(x))*delta1 +....
*  wG(x-E(x))*deltaG + c + wv + e                    (10)

* The composite error term is c + wv + e.  Under the
* assumptions thus far E(c|x,z)=E(e|x,z)=0.  However
* E(wv|x,z)!=0 because b is generally not a
* deterministic linear function of x.

* Let us simulate up to this point imagining that we do
* have our bs as deterministic linear functions of xs.

clear
set obs 10000

gen x1 = rnormal()
gen x2 = rnormal()

* We will force w to be uncorrelated with x.
gen w1 = rnormal()
gen w2 = rnormal()

* Each individual has his/her own intercept or starting
* point c
gen c=rnormal()
gen a= -5 + 3*x1 - 2*x2 + c

* Let us first imagine b being a deterministic function
* of observables x.  Each idividual has a unique response
* w.
gen b1 =  1 + .5*x1 + -2*x2
gen b2 = -2 + 1.75*x1 + 3*x2

* Imagine w1 as being years of education, w2 as being
* offered a job right out of college and x1 as intelligence
* and x2 as GPA.  The interesting thing is having a high
* GPA might be correlated with years of education
* but it also might help explain the effect years of education
* has on y (future income).

* Let us generate our reduced form error
gen e=rnormal()*10

* Now let's generate our y variables
gen y = a + b1*w1 + b2*w2+e

* First let us generate our w1(x-E(x)) variables:
* The average of observed x is of course not the E(x) but it is
* a consistent estimator of E(x).

sum x1
gen w1_x1 = w1*(x1-r(mean))
gen w2_x1 = w2*(x1-r(mean))

sum x2
gen w1_x2 = w1*(x2-r(mean))
gen w2_x2 = w2*(x2-r(mean))

* Now we can estimate all of our coefficients directly using
* equation 10.

* y=gamma0 + x gamma + w beta + w1(x-E(x))*delta1 +....
*  wG(x-E(x))*deltaG + c + wv + e                    (10)
reg y x1 x2 w1 w2 w1_x1 w1_x2 w2_x1 w2_x2

* One can see that in the case where b is completely
* linearly dependent on x the above CRC estimator works fine.
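
* As a quick check (not in the original post), the coefficients on w1 and w2
* estimate the average treatment effects, which by construction are
* E(b1) = 1 and E(b2) = -2 since E(x1) = E(x2) = 0:
di "Estimated ATE of w1: " _b[w1] "  (true value  1)"
di "Estimated ATE of w2: " _b[w2] "  (true value -2)"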

* Stay tuned for what happens when there is some error in b!

Saturday, May 12, 2012

Average Partial Effects




* Average Partial Effects (APEs)

* Stata simulation to generate a binary response variable.
* We want to estimate the average partial effect
* of the explanatory variables.  This is often a more useful
* estimate than the coefficients from the probit.

* Set up the simulation
set seed 101
clear
set obs 10000

* Generate two explanatory variables
gen subsidy = rbinomial(1,.5)
  label define subsidy 0 "No Subsidy" 1 "Subsidy"
  label values subsidy subsidy
gen road_access = rbinomial(1,.3)
  label define road_access 0 "No Road Access" 1 "Road Access"
  label values road_access road_access

* Generate the error terms
gen u=rnormal()*5

* Generate the linear form that will be normally transformed
gen XB_u= 2*subsidy + road_access + u

sum XB_u

gen XB_u_norm = (XB_u-r(mean))/r(sd)
gen double Py=normal(XB_u_norm)
label var Py "True probability of market sale"
gen y=rbinomial(1,Py)

* We have to first calculate the true average partial effect,
* since it is not possible to specify it ex ante directly through
* the simulation procedure.

* Generate y with subsidy = 0
gen XB_X10_u= 2*0 + road_access + u

* Calculate the probability of 1 with subsidy == 0
sum XB_u
gen XB_subsidy0_u_norm = (XB_X10_u-r(mean))/r(sd)

* Calculate the average partial effect of subsidy
gen double Py_subsidy0=normal(XB_subsidy0_u_norm)

* Generate partial effect.
gen double pe_subsidy0=(Py-Py_subsidy0)/subsidy

* Generate y with road_access = 0
gen XB_RA0_u= 2*subsidy +1*0 + u

sum XB_u
gen XB_RA0_u_norm = (XB_RA0_u-r(mean))/r(sd)

* Calculate the average partial effect of subsidy
gen double Py_RA0=normal(XB_RA0_u_norm)

* Calculate the average partial effect
* of road_access.
gen double pe_RA=(Py-Py_RA0)/road_access

sum pe*

************************************************************************
* Simulation END

probit y subsidy road_access

margins, dydx(*)
* This command finds the average partial effect of the explanatory variable on the
* probability of observing a 1 in the dependent variable.  This is basically taking
* the partial effects estimated by the probit for each observation then taking the
* average across all observations.

* This is extremely useful because the direct results of the probit estimations
* cannot be directly interpreted as partial effects without a transformation.
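
* As an illustration (not in the original post) of what margins is doing, the
* average effect of the binary subsidy variable can also be computed by hand
* from the probit coefficients still in memory, as the average discrete change
* in the predicted probability from subsidy=0 to subsidy=1.  The variable names
* below are just for this sketch.
gen double p_s1 = normal(_b[_cons] + _b[subsidy]*1 + _b[road_access]*road_access)
gen double p_s0 = normal(_b[_cons] + _b[subsidy]*0 + _b[road_access]*road_access)
gen double ape_subsidy_hand = p_s1 - p_s0
sum ape_subsidy_hand
* This should be close to the margins output above (margins treats subsidy as
* continuous here, so the numbers need not match exactly).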

reg y subsidy road_access
* Interestingly, the OLS model ignoring the binary nature of the response variable
* yields results that are very similar to the post estimation partial effects estimates.

predict u_hat, residual
graph box u_hat, over(subsidy) over(road_access) ///
  title(Residuals by Road access and Subsidy) legend(on)
* We can see that the errors are heteroskedastic in the explanatory variables.

* We use heteroskedasticity robust standard errors since by necessity
* the errors are heteroskedastic.
reg y subsidy road_access, robust

sum pe*
* We can see the results of the previous estimations are pretty good at estimating
* the true partial effects.

Friday, May 11, 2012

Bias vs. Consistency: What do simulations tell us?


Simulations do not have much to say about consistency.  However, they can tell us a great deal about how biased an estimator is under various scenarios.  In general simulations can examine what happens in gritty scenarios where standard assumptions are violated.  For instance, what happens if the error is slightly correlated with the explanatory variable?
They can also serve to show how fast an estimator converges on the true value.  In general we often only know that an estimator is "consistent" or "relatively efficient".  What is not clear is how badly the estimator performs when we have limited data.  Some estimators seem to work well with 100 observations, others need 10,000.
In order to test the unbiasedness of estimators we do a Monte Carlo type analysis.  This analysis draws the data repeatedly many times and applies the estimator to each draw.  If the average of those estimates converges on the true parameter as the number of simulations increases, then the estimator is unbiased under the conditions created by the simulation.
Consistency, which is more difficult to show, can also be explored by simulation.  This is generally what happens when you change the number of observations in a simulation from 100, to 1,000, to 10,000.  We usually see that the estimate gets better as the simulated sample size gets larger.  However, failure to approach the true parameter value at any sample size from 100 to 100 million observations does not demonstrate a failure of an estimator to converge.  Convergence is a technical term with a technical definition that allows for any speed of convergence, so long as convergence happens as the number of observations approaches infinity.
Thus, simulations are primarily useful for testing the bias of estimators under different scenarios.  In addition, they can also tell us something about the consistency of estimators.
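
A minimal sketch of the Monte Carlo procedure described above (the program name and parameter values here are purely illustrative):

cap program drop mc_example
program define mc_example, rclass
  * Draw a fresh sample and estimate the parameter of interest
  clear
  set obs 100
  gen x = rnormal()
  gen y = 2*x + rnormal()
  reg y x
  return scalar b = _b[x]
end

simulate b=r(b), reps(200): mc_example
sum b
* If the mean of b stays close to the true value of 2 as reps grows,
* the estimator appears unbiased in this scenario.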

Thursday, May 10, 2012

Reverse Engineering a Probit





* Stata simulation of a Probit

* Imagine that you are interested in estimating a binary response variable.
* For instance the likelihood of a farmer choosing to go to market with
* some portion of his or her yield rather than consuming it on the farmer's
* personal plot.

* This is a binary response variable (i.e. one taking on a value of 0 or 1).

* There are a number of functional forms that help transform a linear model
* with no true minimum or maximum into a binary model for which the probability
* of an outcome approaches either 0 or 1.

* The most common choices are the probit and logit models though other options
* present themselves as well.

* From a maximum likelihood perspective think of ordinary least squares as
* the model that maximizes the probability of observing an outcome when the error
* is distributed with the probability density function of the normal distribution.

* max_choosing{beta} ln(pdf(u)) where u=Y-XB

* For a probit, what you are instead doing is maximizing the likelihood of
* observing the outcomes using the cumulative distribution function (CDF) of 'yhat'.  That is:

* max_choosing{beta} ln(CDF(XB)) where Yhat=XB

* So this guides us on how to simulate a probit model accurately.

* Set the random seed
set seed 101
* Clear the memory
clear
* Tell Stata to prepare 10,000 rows for data entry
set obs 10000

* Generate yield as a chi-squared distribution
gen yield = rnormal()^2

* Generate number of family members. Poisson has the same mean and variance.
* Add +1 to ensure that the minimum household size is 1.
gen household_size = rpoisson(5)+1

* Generate the error
gen u=rnormal()*5

* First we generate the probability of selling at the market.
* In order to do that we need to find the CDF of the XB+u
* values.  Stata first needs to know the variance and mean of
* XB+u.

**** Note the coefficients on yield and household size
gen XB_u = 10*yield - household_size + u

sum XB_u

* Now we normalize.
gen XB_u_norm = (XB_u-r(mean))/r(sd)

* Then we calculate the probabilities using normal CDF.
gen double Py=normal(XB_u_norm)
label var Py "True probability of market sale"

* Finally we need to change this from a probability vector to a response
* vector.
gen y=rbinomial(1,Py)

************************************************************************
* Simulation END

* Now let's see how well the probit does at recovering our initial parameters:

probit y yield household_size

predict yhat
label var yhat "Estimated probability of market sale"

* Now, wait a second.  These values look way too small!
* But not so.  Take a look at the normalization we did above.

sum XB_u
di "This normalization shrunk the coefficients by a factor of " r(sd)

* So to recover the estimates multiply by r(sd)

di "The estimate for yield is " r(sd)*_b[yeild]
di "The estimate for household size is "  r(sd)*_b[household_size]

* Not exactly on but pretty close.

two (line yhat Py, sort) if _n/10==int(_n/10), title(Probit Probability Fit compared with True)
* The if statement thins out the distribution, telling Stata to show only
* 1 out of 10 observations.

* Later we can compare how well the probit performs relative to a logit and a
* linear heteroskedastically robust OLS model.
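
* (Aside, not from the original post: to connect the likelihood discussion at
*  the top of this post to code, here is a minimal hand-rolled probit
*  log-likelihood estimated with Stata's -ml- command.  The evaluator name
*  myprobit_lf is made up; it should closely reproduce the -probit- results
*  above.)
cap program drop myprobit_lf
program define myprobit_lf
  args lnf xb
  * Log-likelihood: ln(Phi(xb)) when y==1 and ln(1-Phi(xb)) when y==0
  qui replace `lnf' = ln(normal(`xb'))  if $ML_y1 == 1
  qui replace `lnf' = ln(normal(-`xb')) if $ML_y1 == 0
end

ml model lf myprobit_lf (y = yield household_size)
ml maximize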



Wednesday, May 9, 2012

Oh wait! Quantile regression wins!



* Stata simulation to estimate performance of quantile regression at
* different quantiles compared with other alternatives.

clear
set obs 10000
set seed 101

gen x1 = runiform()
* Single exogenous x

gen u = rnormal()
* Single uncorrelated error u

gen beta=1
  label var beta "True Beta"
* At first beta is = 1

gen y0 = u

gen y = beta*x1 + y0
* Generate a starting point for y

forv i=1/20 {
  cap drop ytile
  * Removes old ytile from the data
  xtile ytile=y, nq(100)
  * Generates a variable that orders all of the
  * y variables into 100 quantiles
  qui replace beta = 8-((200+ytile)*.21)^.5
  * beta goes from 8-(201*.21)^.5 at the 1st quantile to 8-(300*.21)^.5 at
  * the 100th, i.e. from roughly 1.5 down to roughly 0.06.
  replace y = beta*x1 + u + 190

  di "`i'"
}
* This repetitive routine makes it so that beta is smoothly related to y

  replace y = y-190
* Shift y back down to a smaller range so that the estimated beta hats are easier to interpret

line beta y, sort
* We can see that beta changes across the different quantiles of y.
* If x1 is a policy variable, its effect is larger at low values of y and
* smaller at high values of y, so the pattern of the betas acts to keep
* the variance of y constant or even decrease it.  The variance of y can
* still end up larger, however, since in effect we are adding two random
* variables together.

sum y y0
gen beta_hat=.
  label var beta_hat "Estimated Beta from qreg"
gen beta_hat1=.
  label var beta_hat1 "Estimated Beta from reg at quantile"
gen beta_hat2=.
  label var beta_hat2 "Estimated Beta from reg at quantile with overlap"
gen beta_hat3=.
  label var beta_hat3 "Estimated Beta from qreg at quantile with overlap"

forv i =1/100 {
  cap qreg y x1, quantile(`=`i'/100')
  * This tells Stata to run a quantile regression at each quantile;
  * the 'cap' tells it not to stop if there is an error.

  if _rc==0 qui replace beta_hat=_b[x1] if ytile==`i'
  * If _rc==0 then there was no error in the previous quantile regression

  * This does a regression for each of the 100 quantiles
  cap reg y x1 if ytile==`i'
  * This tells Stata to run an OLS regression using only the observations
  * in that specific quantile; the 'cap' tells it not to stop on an error.

  if _rc==0 qui replace beta_hat1=_b[x1] if ytile==`i'
  * If _rc==0 then there was no error in the previous regression


  * This does a regression for each of the 100 quantiles but allows for
  * overlap.  We expect that the estimated betas will not be independent
  * because there is overlap in the sample.
  cap reg y x1 if ytile<= `i'+15 & ytile>= `i'-15
  * This tells Stata to run an OLS regression over a window of quantiles
  * (from `i'-15 to `i'+15); the 'cap' tells it not to stop on an error.

  if _rc==0 qui replace beta_hat2=_b[x1] if ytile==`i'
  * If _rc==0 then there was no error in the previous regression

  cap qreg y x1 if ytile<= `i'+15 & ytile>= `i'-15
  * This runs a median regression (qreg's default) over the same window of
  * quantiles; the 'cap' tells it not to stop on an error.

  if _rc==0 qui replace beta_hat3=_b[x1] if ytile==`i'
  * If _rc==0 then there was no error in the previous qreg

  di _continue " `i'"
}

two (line beta y, sort) (line beta_hat y, sort) (line beta_hat1 y, sort) (line beta_hat2 y, sort), title(Results of Qreg)

two (connected beta ytile, sort) (line beta_hat ytile, sort) (line beta_hat1 ytile, sort) ///
 (line beta_hat2 ytile, sort) (line beta_hat3 ytile, sort), title(Qreg against semi-parametric alternatives) legend(on size(vsmall))

* Compared to the other estimators, quantile regression is doing an awesome job!

Tuesday, May 8, 2012

A note on generating random numbers


The ‘set seed’ command tells Stata where to begin drawing random numbers from. To understand this, keep in mind that Stata cannot generate truly random numbers; it can only draw numbers that look random. (This is a limitation of deterministic computers.  If Stata were running on a quantum computer then truly random numbers would be possible.)

There is a benefit in this, however.  Numbers that look random but are not are still sufficiently close to random to let us make inferences about how truly random numbers would behave.  And the real benefit of not dealing with truly random numbers is that the draws can be replicated.

This is extremely convenient in that it allows two researchers in different parts of the world to run the same simulation and get exactly the same results.  The exact values themselves are usually unimportant.  What is useful is knowing that you can change parameters this way or that way while the underlying 'random' draws remain the same.  Likewise, if a researcher elsewhere runs the same simulation with different parameters, the differences between the two sets of results are caused by the change in parameters, not by a change in the random draws.
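
As a minimal illustration of this point (the particular numbers drawn do not matter, only that the two draws match), resetting the same seed reproduces the same 'random' draws exactly:

clear
set obs 5
set seed 12345
gen x1 = runiform()
* Reset the seed and draw a second variable the same way
set seed 12345
gen x2 = runiform()
* The assertion passes because the two sets of draws are identical
assert x1 == x2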

Quantile Regression Fail


* Stata simulation to estimate performance of quantile regression at
* different quantiles

clear
set obs 10000
set seed 101

gen x1 = runiform()
* Single exogenous x

gen u = rnormal()
* Single uncorrelated error u

gen beta=1
  label var beta "True Beta"
* At first beta is = 1

gen y0 = u

gen y = beta*x1 + y0
* Generate a starting point for y

forv i=1/20 {
  cap drop ytile
  * Removes old ytile from the data
  xtile ytile=y, nq(100)
  * Generates a variable that orders all of the
  * y variables into 100 quantiles
  qui replace beta = 4-((ytile+30)/20)^.5
  * beta goes from 4-(31/20)^.5 at the 1st quantile to 4-(130/20)^.5 at the
  * 100th, i.e. from roughly 2.75 down to roughly 1.45.
  replace y = beta*x1 + u + 190

  di "`i'"
}
* This repetitive routine makes it so that beta is smoothly related to y

  replace y = y-190
* Shift y back down to a smaller range so that the estimated beta hats are easier to interpret

line beta y, sort
* We can see that beta changes across the different quantiles of y.
* If x1 is a policy variable, its effect is larger at low values of y and
* smaller at high values of y, so the pattern of the betas acts to keep
* the variance of y constant or even decrease it.  The variance of y can
* still end up larger, however, since in effect we are adding two random
* variables together.

sum y y0

gen beta_hat=.
  label var beta_hat "Estimated Beta"
forv i =1/100 {
  cap qreg y x1, quantile(`=`i'/100')
  * This tells Stata to run a quantile regression at each quantile;
  * the 'cap' tells it not to stop if there is an error.

  if _rc==0 qui replace beta_hat=_b[x1] if ytile==`i'
  * If _rc==0 then there was no error in the previous quantile regression
 
  di _continue " `i'"
}

two (line beta y, sort) (line beta_hat y, sort), title(Results of Qreg)

two (line beta ytile, sort) (line beta_hat ytile, sort), title(Results of Qreg)

* The quantile regression seems to do a mediocre job of picking up
* the shape of the beta function.  However, near the edges, which are
* often exactly the quantiles people are interested in, it fails.

* One might say that this is due to the tails having too few observations.
* However, increasing the sample size to 100,000 gives almost identical results!

* Ah, but hope is not lost.  Stay tuned for future releases of semi-parametric
* estimators!

Monday, May 7, 2012

Estimating Supply

* Stata Simulation
* Simulate Commodity Supply
* Classical Microeconomics

* Let's think of the typical firm in classical microeconomics

* Max over x of {Profit function = p*f(x) - w*x}

* In order for there to be an interior solution: f'(x)>0, f''(x)<0

* There is an infinite number of functions that have this property.

* Let's choose a typical one.

* 0 < alpha < 1
* f(x) = x^(alpha)
* so f'(x) = alpha*x^(alpha-1)
* and the first order condition is: p*alpha*x^(alpha-1) - w = 0
* x^(alpha-1)=w/(p*alpha)
* x=(w/(p*alpha))^(1/(alpha-1))
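
* (Aside, not in the original post: a quick numerical check of the first
*  order condition.  The p, w, and alpha values below are arbitrary; profit
*  should be highest at the derived x*.)
local p = 2
local w = 1
local alpha = .8
local xstar = (`w'/(`p'*`alpha'))^(1/(`alpha'-1))
di "x* = " `xstar'
di "Profit at x*     : " `p'*`xstar'^`alpha' - `w'*`xstar'
di "Profit at .9*x*  : " `p'*(.9*`xstar')^`alpha' - `w'*.9*`xstar'
di "Profit at 1.1*x* : " `p'*(1.1*`xstar')^`alpha' - `w'*1.1*`xstar'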

* Let's imagine 1,000 cities

* Each has a different market price and different input price

clear
set obs 1000
set seed 101

* Declare the parameter of interest
global alpha=.8

gen p = (runiform()+2.3)+1
 * hist p
gen w = rnormal()^2+1
 * hist w
 
* Here p is roughly uniform on [3.3, 4.3] while w is right skewed, giving
* two different price distributions.  Firms produce based on:
gen x=(w/(p*$alpha))^(1/($alpha-1))

sum x

* Generate the output as a result of the inputs
* Plus some stochastic production shocks
gen y=x^$alpha*(runiform()+.5)
* A more challenging fit would be:
* gen y=x^$alpha*(runiform()^3*4)

sum y, detail

* Most firms produce between 2.5 and 52 units

* Firms make profit:
gen profit = p*y - w*x

sum profit

* First thing: can we recover the marginal product of x?
* ln(y) = alpha*ln(x) + ln(runiform()+.5)
* E(ln(y)) = alpha*E(ln(x)) + E(ln(runiform()+.5))
* E(ln(runiform()+.5)) < 0: since E(runiform()+.5) = 1 and ln(1) = 0,
* Jensen's inequality gives E(ln(runiform()+.5)) <= ln(E(runiform()+.5)) = 0.
* However, that only shifts the constant and should not throw off our
* estimate of alpha.

gen lny=ln(y)
gen lnx=ln(x)

reg lny lnx
di _b[lnx] "is pretty close to $alpha"


* We could also try to estimate this by non-linear least squares
nl (y=x^{alpha})

* This seems to do a pretty good job as well.

* We might however be interested in seeing how supply responds to changes
* in prices.

* let's try a simple OLS
reg y p w

predict y_hat
  label var y_hat "OLS fitted values"
line y_hat y, sort
* does not quite look right


* The linear specification is not quite the right functional form.  Let's try a different approach.
gen lnp = ln(p)
gen lnw = ln(w)

reg lny lnp lnw
  predict lny_hat

* I want to graph both lny_hat and y_hat on the same graph,
* but I will need to unitize everything (rescale each distribution to
* start at 0 and end at 1).

foreach v in lny_hat lny y_hat y {
  sum `v'
  gen unit_`v'=(`v'-r(min))/(r(max)-r(min))
}

label var unit_lny_hat "ln_y_hat unitized"
label var unit_y_hat "y_hat unitized"

two (connected unit_lny_hat unit_lny, sort msize(vsmall) color(black)) ///
    (connected unit_y_hat unit_y, sort  msize(vsmall) color(green))  , ///
    ytitle("Unitized Y or lnY")
   
* Elasticities, however, work even better.  Why is that?
* Look first at how x is generated:
* x = (w/(p*$alpha))^(1/($alpha-1))
* and y = x^$alpha*(runiform()+.5), so
* y = (w/(p*$alpha))^($alpha/($alpha-1)) * (runiform()+.5)
* Taking logs:
* lny = ($alpha/($alpha-1))*{ln(w) - ln(p) - ln($alpha)} + ln(runiform()+.5)
* The last term does not depend on the prices, so it only shifts the constant:
* E(lny) = ($alpha/($alpha-1))*{E(ln(w)) - E(ln(p)) - ln($alpha)} + E(ln(runiform()+.5))

di "Therefore (1/($alpha-1))*$alpha = " (1/($alpha-1))*$alpha " ~ " -_b[lnp] " ~ " _b[lnw]
di "And (1/($alpha-1))*$alpha)*-ln($alpha) = " (1/($alpha-1)*$alpha)*-ln($alpha) " ~ "  _b[_cons] " ?"

Sunday, May 6, 2012

Estimating demand (AIDS) - Part 3

* Estimating demand (AIDS) - Part 3
* Stata simulation and estimation

* v = max{x,y,z}{x^a * y^b * z^c} s.t. (W = xPx + yPy + zPz)
* Since ln is monotonic and increasing this is the same as:
* max{x,y,z}{a*ln(x) + b*ln(y) + c*ln(z)} s.t. (W = xPx + yPy + zPz)

* L = a*ln(x) + b*ln(y) + c*ln(z) + M*(W - xPx - yPy - zPz)
* Lx:   a/x - M*Px = 0
* Ly:   b/y - M*Py = 0
* Lz:   c/z - M*Pz = 0
* LM:   W - (xPx + yPy + zPz) = 0

* a/(xPx)           = M
* b/(yPy)           = M
* c/(zPz)           = M
* a/(xPx)           = b/(yPy)
* xPx/a             = yPy/b
* xPx               = a*yPy/b
* zPz               = c*yPy/b
* W - (a*yPy/b + yPy + c*yPy/b) = 0
* W - (a/b + b/b + c/b)*yPy = 0
* W = ((a+b+c)/b)*yPy
* y = W*b/(Py*(a+b+c))
* x = W*a/(Px*(a+b+c))
* z = W*c/(Pz*(a+b+c))

* (In the two-good case without z these would reduce to y = W/(Py*(a/b + 1))
* and, because the solutions are symmetric, x = W/(Px*(b/a + 1)).)
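
* (Aside, not in the original post: a quick sanity check that the derived
*  demands exhaust the budget.  The parameter and price values below are
*  arbitrary.)
local a = .3
local b = .4
local c = .05
local W = 10
local Px = 2
local Py = 3
local Pz = 4
local x = `W'*`a'/(`Px'*(`a'+`b'+`c'))
local y = `W'*`b'/(`Py'*(`a'+`b'+`c'))
local z = `W'*`c'/(`Pz'*(`a'+`b'+`c'))
di "x*Px + y*Py + z*Pz = " `x'*`Px' + `y'*`Py' + `z'*`Pz' "  (should equal W = `W')"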

clear

* So let's start the simulation.
* Say we have a cross section of 10,000 cities
set seed 101
set obs 10000

* First let us set the parameters a, b, and c; these are not observable
gen a = .3
gen b = .4
gen c = .05


* Each city faces different prices.  The added constants are there so that
* no city gets a price too close to zero.
gen Px=runiform()*4.2+3
gen Py=runiform()*4.5+1
gen Pz=runiform()*4.2+2

* each city has a different level of wealth
gen W=runiform()*5+1

* Let's generate the errors.  We want the errors to be correlated, which
* can be gotten at by a little trick, but I also wanted the errors to be
* uniform.  Unfortunately, adding together two uniform distributions does
* not produce another uniform distribution, so this is only approximate.
  gen rv1=runiform()
  gen rv2=runiform()
  gen rv3=runiform()
 
* The correlation between the variables is
  gl rho12 = .5
 
   
  gen v1=(rv1)*5.3
  gen v2=(rv1*$rho12 + rv2*(1-$rho12)^.5)*.3
  gen v3=(rv1*$rho12 + rv3*(1-$rho12)^.5)*.3

  corr v1 v2
  * Not quite perfect but a good approximation

* Now let's generate consumption level
gen y = W/Py *b/(a+b+c)   + v1
  label var y "Food"
gen x = W/Px *a/(a+b+c)   + v2
  label var x "Entertainment"
gen z = W/Pz *c/(a+b+c)   + v3
  label var z "Unobserved Remitances"

* Adjust wealth to reflect true wealth 
replace W = y*Py + x*Px + z*Pz

sum x y

************************************************************
* simulation END
************************************************************

* One of the difficulties (and strengths) in estimating from simulations is that
* you don't always know when an estimator is performing correctly.

* For instance if we want to estimate demand response to change in price
* dy/dPy then due to the non-linear nature of real demand functions
* the most straightforward regressions are going to give us some funky estimates.

* For example:
reg y W Py
local dydW=_b[W]
local dydPy=_b[Py]
predict y_hat
  label var y_hat "OLS fitted values"

* Everything looks good but what does it mean?
* The true effect of dy/dW is
* dy/dW=1/Py *b/(a+b+c)
sum Py

gen b_abc=b/((a+b+c)*r(mean))
sum b_abc

di "So and estimate of `dydW' is looking pretty close to " r(mean)

* This is expected because y is linear in W.

* {remember y = W*b/(Py*(a+b+c)) = (Py^-1) * W * b/(a+b+c) }
* However, check out what happens when you take dy/dPy
* dy/dPy = (-1)*(Py^-2) * W * b/(a+b+c)

gen dydPy_true = (-1)*(Py^-2) * W * b/(a+b+c)
sum dydPy_true
di "The estimate of `dydPy' is looking pretty close to " r(mean) " as well"

* We can see that the OLS estimate is biased upwards because E(g(Py)) > g(E(Py))
* when g is convex (Jensen's inequality), and dy/dPy is a convex function of Py.
* However, the degree of the bias is not so large as to make us worry.

* An alternative approach which is rather appealing is taking the log of everything
* ln(y) = By1 ln(W) - By2 ln(Py) + ln(b/(a+b+c))
* ln(x) = Bx1 ln(W) - Bx2 ln(Px) + ln(a/(a+b+c))

gen lny=ln(y)
gen lnx=ln(x)
gen lnW=ln(W)
gen lnPy=ln(Py)
gen lnPx=ln(Px)

* This is really not so useful with the Cobb-Douglas.  However, it does give us a method of checking
* if we think our demand function is Cobb-Douglas:

reg lny lnW lnPy
* We might attempt to use the constant to get a grasp of the consumer parameters:
* the constant estimates ln(b/(a+b+c)), so exp(_b[_cons]) estimates b/(a+b+c)
 di exp(_b[_cons])
 sum b_abc
predict lny_hat

gen y_ln_exp_hat = exp(lny_hat)
  label var y_ln_exp_hat "OLS fitted values from Log regression"

two (line y Py, sort) (line y_hat Py, sort color(blue) ) (line y_ln_exp_hat Py, sort color(red))

reg lnx lnW lnPx
local a_abc=_b[_cons]
predict lnx_hat
gen x_ln_exp_hat = exp(lnx_hat)

* If there were no bias in the estimates then we would expect By1 = By2 = Bx1 = Bx2 = 1.
* But the ln function is concave, so our estimates will be downward biased by
* Jensen's inequality.  In this case all of the estimates seem to be biased by at least 50%.

*******************************************************************************************
* So in practice what do people do?

* A common model is AIDS (the Almost Ideal Demand System), which estimates expenditure
* shares as a function of prices and total expenditure.
* Let's see how it performs on this data (assume we observe z and Pz).

* First the linear approximate AIDS (LA/AIDS), which uses the Stone price index
* generate share values
gen sy = Py*y/W
  label var sy "Share of expenditures on y"
gen sx = Px*x/W
gen sz = Pz*z/W

sum s*

gen lnPz = ln(Pz)

* Then generate lnP, the price index (P)
gen lnP = sx*lnPx + sy*lnPy + sz*lnPz

* Now let's generate the log real expenditure term, ln(W/P)
gen lnW_P = ln(W)-lnP

* So to estimate AIDS, regress each share on log prices and log real expenditure
reg sx lnPx lnPy lnPz lnW_P
  predict sx_hat
 
reg sy lnPx lnPy lnPz lnW_P
  predict sy_hat
  label var sy_hat "y hat from AIDS"
 
reg sz lnPx lnPy lnPz lnW_P
  predict sz_hat
 
two (line sy Py, sort) (line sy_hat Py, sort)

* It seems the AIDS model fits pretty well.