Monday, April 30, 2012

Production Possibility Frontier: Simulation & Estimation

* Production Possibility Frontier
* Stata simulation and estimation

* This simulation estimates a production possibility frontier model
* while simultaneously estimating the effects of market power
* on productivity.  It loosely models the paper by Saleem Shaik,
* Albert Allen, Seanicaa Edwards, and James Harris 2009
* "Market Structure Conduct Performance Hypothesis Revisited Using
* Stochastic Frontier Efficiency Analysis"

* v represents firm or time specific random errors
* u represents technical inefficiencies

* Performance(u)=f(X1)
* Output(Y2)=f(X2;B)+v-u

* We want to ask whether performance (u) is explained by market share
* and market concentration.

* u = intercept + B1 mshare + B2 Mconc + e

* Y = intercept + A1 labor + A2 capital + v - u

* They model different firms in different industries, which they hope
* share the same production structure but differ in market shares
* and market concentration.

* Set the random seed
set seed 101

* I simulate 10 firms in 7 different industries

clear
set obs 10

forv i=1/7 {
  * Generate variables for each firm in each industry; the _# suffix
  * identifies the industry.

  * Generate raw market shares using a beta(2,5) distribution, which has
  * mean 2/7 (about .29) and support from 0 to 1.  Once the shares are
  * standardized to sum to 1 below, the average share is .1.
  gen share_`i' = rbeta(2,5)
 
  * Create a measure of total market share
  egen total_share_`i' = total(share_`i')
 
  * Standardize the shares so that they add up to 1
  replace share_`i' = share_`i'/total_share_`i'
 
  * Drop total share
  drop total_share_*
 
  * Order firms from largest to smallest
  gsort -share_`i'

  * Label the firms according to size
  gen firm_`i'=_n

  * Create the concentration index CRM.  Note that gen's sum() is a
  * running sum, so with firms sorted largest to smallest, row m holds
  * the m-firm concentration ratio (see the short aside after the loop).
  gen CRM_`i' = sum(share_`i')
 
  * But we will only use CR4
  gen CR4_temp = CRM_`i' if firm_`i'==4
  egen CR4_`i' = mean(CR4_temp)
  drop CR4_temp
}
sum
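* Aside (referenced above): gen's sum() is a running sum, while egen's
* total() returns a single overall total.  A minimal illustration with
* throwaway data, wrapped in preserve/restore so the simulation is untouched:
preserve
clear
set obs 3
gen x = _n
gen runsum = sum(x)    // running sum: 1, 3, 6
egen tot = total(x)    // overall total: 6 in every row
list
restore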

* Now we want to create 1000 orders per firm
expand 1000
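* (expand 1000 replaces each of the 10 observations with 1,000 copies,
* so each industry panel now has 10,000 rows.)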

forv i=1/7 {

  * Generate input levels for each order (observation)
  gen labor_`i'=runiform()*10
  gen capital_`i'=runiform()*10

  * Larger firms (share) tend to get larger contracts
  qui replace labor_`i'=labor_`i'+6*share_`i'
  qui replace capital_`i'=capital_`i'+6*share_`i'

  * u is a half-normal draw plus the extra terms that Shaik et al. use.
  * The argument is that larger firms (large share) are large because
  * they are efficient, but that high market concentration leads to
  * greater inefficiency.
  gen u_`i' = 1+abs(rnormal()*5) - 2*share_`i' + CR4_`i'
 
  * v is normally distributed
  gen v_`i' = rnormal()*20
 
  * Now we are ready to formulate our production model
  gen y_`i' = 5*labor_`i' + 5*capital_`i' + 10*((labor_`i')*(capital_`i'))^.75 - u_`i' + v_`i'
}
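* As a quick sanity check on the simulation: the composed error v - u of a
* stochastic frontier model should be left-skewed, since u enters one-sided.
* A minimal check on industry 1 (the check variable is temporary):
gen e_check = v_1 - u_1
sum e_check, detail
drop e_check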

* Unfortunately our data are now in wide format, while the xtfrontier
* command expects long format.  So we will transform from wide to long
* with the reshape command.

* reshape requires an identifier variable for each observation
gen obs_num = _n
reshape long share_ firm_ CRM_ CR4_ labor_ capital_ u_  v_ y_, i(obs_num) j(industry)

sort industry obs_num

* I will map out the production possibility frontier
xtile x_y = y, nquantiles(50)
xtile x_labor = labor, nquantiles(50)
xtile x_capital = capital, nquantiles(50)

foreach i in 5 10 15 20 25 30 35 40 45 {
  * Despite the min_ prefix, these hold mean input levels by labor
  * percentile for observations in output quantile `i' (of 50) - an isoquant.
  bysort x_labor: egen min_labor_`i'=mean(labor) if x_y==`i'
  bysort x_labor: egen min_capital_`i'=mean(capital) if x_y==`i'
 
  gen min_capital_`i'2=min_capital_`i'^2
  gen min_capital_`i'3=min_capital_`i'^3
  gen min_capital_`i'4=min_capital_`i'^4
  gen min_capital_`i'5=min_capital_`i'^5

  qui reg min_labor_`i' min_capital_`i' min_capital_`i'2 ///
        min_capital_`i'3 min_capital_`i'4 min_capital_`i'5
  predict min_labor_fit_`i'
  label var min_labor_fit_`i' "`=`i'*2'th output percentile"
}

* Creates a noisy graph of production possibilities
twoway (line min_labor_20 min_capital_20) ///
       (line min_labor_30 min_capital_30) ///
       (line min_labor_40 min_capital_40) ///
        , title(Production Possibility Frontier)

* A smoother graph is found by using the fitted values from the previous regression.
twoway (line min_labor_fit_5 min_capital_5, sort) ///
       (line min_labor_fit_10 min_capital_10, sort) ///
       (line min_labor_fit_15 min_capital_15, sort) ///
       (line min_labor_fit_20 min_capital_20, sort) ///
       (line min_labor_fit_25 min_capital_25, sort) ///
       (line min_labor_fit_30 min_capital_30, sort) ///
       (line min_labor_fit_35 min_capital_35, sort) ///
       (line min_labor_fit_40 min_capital_40, sort) ///
       (line min_labor_fit_45 min_capital_45, sort) ///
       , title(Production Possibility Frontier)

* Now to remove the unnecessary trailing _ from the variable names
foreach v in share_ firm_ CRM_ CR4_ labor_ capital_ u_  v_ y_ {
  * subinstr finds the first instance of _ in `v' and replaces it with nothing
  local v_temp = subinstr("`v'","_","",1)
  rename `v' `v_temp'
}

gen labor_capital = (labor*capital)^.75

* For comparison, start with a single OLS regression:
reg y labor capital CR4 share

* And the same OLS regression with the interaction term included:
reg y labor capital labor_capital CR4 share


* However, the simultaneous-equation estimate gets at what we are actually
* interested in: "How do market share and concentration predict performance?"
* The OLS regression only tries to predict output, which yields a weaker estimate.

* Set the different industries as the panel variable
xtset industry

* First step: estimate the production frontier model, with a maximum
* number of iterations of 10.
xtfrontier y labor capital, ti iter(10)

xtfrontier y labor capital labor_capital, ti
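* As a rough sketch of the second step described at the top (not the
* simultaneous estimator in the paper): recover the estimated inefficiency
* component with predict's u option and regress it on market share and
* concentration.
predict u_hat, u
reg u_hat share CR4
drop u_hat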

* I have not gotten this to work entirely the way I would like.  I might return to it in the future.

Sunday, April 29, 2012

Seemingly Unrelated Regression - with Correlated Errors


* Stata Simulation and Estimation

* This code simulates a systems OLS: Seemingly Unrelated Regression (SUR) with panel data
* on three different dependent variables, each with different explanatory variables.
* In this case we imagine that we want to predict the domestic prices of
* wheat, corn, and soybeans in Moldova from their world prices.
* This simple analysis would not work as cleanly for a large country,
* since domestic prices there could feed back into world prices.

* Imagine a 3 equation SUR system with:

* y1 = x1B1 + u1
* y2 = x2B2 + u2
* y3 = x3B3 + u3

* The key assumption in SUR is that E(y1|x1,x2,x3)=E(y1|x1)=x1B1; in other words,
* leaving x2 and x3 out of the y1 equation creates no bias.  In our example
* this says that you do not need to know the world price of corn to
* predict the domestic price of wheat, so long as you are already using the
* world price of wheat in your analysis.

* Clear the memory
clear
set seed 101

* Imagine that we have data for 20 years starting in 1985
set obs 20
gen year=1984+_n

* First let's imagine that over the years there are random price shocks that hit all three
* domestic products, such as droughts, heat waves, and early frosts.  These shared shocks
* are captured by the error term v.

gen v=4*rnormal()

* Let's imagine also that the expected price of wheat is 4, corn 3, and soybeans 3.5

* First we generate the shocks to different prices that are not shared plus the shocks that
* are shared (v).

gen u_wheat=rnormal()+v
gen u_corn=rnormal()+v
gen u_soy=rnormal()+v

* Then we generate world prices for year 1985

gen wheat_world=abs(4 + 2*rnormal()) if year ==1985
gen corn_world =abs(3 + 2*rnormal()) if year ==1985
gen soy_world=abs(3.5 + 2*rnormal()) if year ==1985

* Let's assume that world prices are AR(1).  That is, this year's price depends directly
* on last year's price plus a fresh shock (so correlation decays with the lag length).

replace wheat_world=abs(4/2 + 2*rnormal() + wheat_world[_n-1]/2) if year !=1985
replace corn_world =abs(3/2 + 2*rnormal() + corn_world[_n-1]/2)  if year !=1985
replace soy_world=abs(3.5/2 + 2*rnormal() + soy_world[_n-1]/2)   if year !=1985
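* An equivalent, arguably safer, formulation would use Stata's time-series
* lag operator (a sketch, assuming the data are tsset by year):
* tsset year
* replace wheat_world=abs(4/2 + 2*rnormal() + L.wheat_world/2) if year!=1985
* The explicit [_n-1] subscripts above work because replace processes
* observations in order, so each year sees the already-updated prior year.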

* Because the process is AR(1), conditioning on the previous period's price captures
* all of the persistence in prices.

* The abs function ensures that world prices never become negative.

* In addition, let's say there was a common trend that caused world prices to rise gradually
* toward the center of our sample before falling again.

* We want the first and last years to have no trend.  Let the trend be a second-degree
* polynomial, t(year) = alpha0 + alpha1*year + alpha2*year^2, with t(1985)=0, t(1995)=5,
* and t(2004)=0; fitting a quadratic through three points recovers it exactly.

gen t = 0 if year == 2004 | year == 1985
replace t = 5 if year == 1995
gen year2=year^2

reg t year year2
predict trend
drop t
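* Quick check: with three anchor points and three parameters the quadratic
* should fit them exactly.
list year trend if inlist(year, 1985, 1995, 2004)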

* Now lets add the trend back to world prices

replace wheat_world=wheat_world+trend
replace corn_world =corn_world+trend
replace soy_world=soy_world+trend

* Let's see what the world prices look like
twoway (line wheat_world year) (line corn_world year) (line soy_world year)

* Now, let us specify how we want world prices to influence domestic prices.
* In a free global market, world prices should be highly predictive of local prices.
* So let us say the coefficient on each world price in predicting the domestic
* price is .75.  To put everything together:

gen wheat_moldova = .75*wheat_world + abs(u_wheat + 2)
gen corn_moldova  = .75*corn_world + abs(u_corn + 2)
gen soy_moldova   = .75*soy_world + abs(u_soy +2)

* Let's see how the world and domestic prices compare
twoway (line wheat_world year) (line wheat_moldova year) ///
       (line corn_world year)  (line corn_moldova year) ///
       (line soy_world year)   (line soy_moldova year)

* We could do three different regressions:
reg wheat_moldova wheat_world
reg corn_moldova corn_world
reg soy_moldova soy_world

* These give us unbiased and consistent estimates.  However, we can also use
* SUR, which exploits the correlation between the errors (v).  Thus:

sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world)
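* sureg's corr option also displays the cross-equation residual correlation
* matrix and a Breusch-Pagan test of independence, which is what justifies
* SUR over equation-by-equation OLS here:
sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world), corr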

* Since we think (know) that the coefficients on the different world prices are the same,
* we can specify constraints to force them to be equal across equations.

constraint 1 [wheat_moldova]wheat_world=[corn_moldova]corn_world
constraint 2 [wheat_moldova]wheat_world=[soy_moldova]soy_world

sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world) , const(1 2 )
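* Rather than simply imposing the constraints, we could first check whether
* the data reject them with a Wald test on the unconstrained model:
qui sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world)
test ([wheat_moldova]wheat_world=[corn_moldova]corn_world) ///
     ([wheat_moldova]wheat_world=[soy_moldova]soy_world)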


* To test the relative merits of SUR against OLS we will use Monte Carlo simulation.
* This is the same code as before, but now it must be wrapped in a program
* with return statements.
cap program drop sur_test

program define sur_test, rclass
  clear
* set seed 101  - must turn this off or else every repetition will be exactly the same
  set obs 20
  gen year=1984+_n
  gen v=4*rnormal()
  gen u_wheat=rnormal()+v
  gen u_corn=rnormal()+v
  gen u_soy=rnormal()+v
  gen wheat_world=abs(4 + 2*rnormal()) if year ==1985
  gen corn_world =abs(3 + 2*rnormal()) if year ==1985
  gen soy_world=abs(3.5 + 2*rnormal()) if year ==1985
  replace wheat_world=abs(4/2 + 2*rnormal() + wheat_world[_n-1]/2) if year !=1985
  replace corn_world =abs(3/2 + 2*rnormal() + corn_world[_n-1]/2)  if year !=1985
  replace soy_world=abs(3.5/2 + 2*rnormal() + soy_world[_n-1]/2)   if year !=1985
  gen t = 0 if year == 2004 | year == 1985
  replace t = 5 if year == 1995
  gen year2=year^2
  reg t year year2
  predict trend
  drop t
  replace wheat_world=wheat_world+trend
  replace corn_world =corn_world+trend
  replace soy_world=soy_world+trend
  gen wheat_moldova = .75*wheat_world + abs(u_wheat + 2)
  gen corn_moldova  = .75*corn_world + abs(u_corn + 2)
  gen soy_moldova   = .75*soy_world + abs(u_soy +2)
* Same code as before but without the comments up to this point
  reg wheat_moldova wheat_world
* return scalar passes each estimate back to the caller
return scalar wheat_ols=_b[wheat_world]
  reg corn_moldova corn_world
return scalar corn_ols=_b[corn_world]
  reg soy_moldova soy_world
return scalar soy_ols=_b[soy_world]
  sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world)

* This retrieves the coefficients and sends them out as returns.  There must be a better
* way of doing this (see the note below).
matrix A= e(b)'
svmat A, names(sureg)

cap return scalar wheat_sureg=sureg[1]
cap return scalar corn_sureg=sureg[3]
cap return scalar soy_sureg=sureg[5]
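* (One better way, I believe, is Stata's equation-qualified coefficient
* syntax, which reads each estimate directly and avoids svmat, e.g.
*   return scalar wheat_sureg=[wheat_moldova]_b[wheat_world]
* though the svmat approach above also works.)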

* Since we think (know) that the coefficients on the different world prices are the same,
* we can specify constraints to force them to be equal across equations.

constraint 1 [wheat_moldova]wheat_world=[corn_moldova]corn_world
constraint 2 [wheat_moldova]wheat_world=[soy_moldova]soy_world

sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world) , const(1 2 )

matrix A2=e(b)'
svmat A2, names(sureg2)

cap return scalar wheat_sureg2=sureg2[1]
cap return scalar corn_sureg2=sureg2[3]
cap return scalar soy_sureg2=sureg2[5]

end
* End of the program

sur_test

simulate wheat_ols=r(wheat_ols) wheat_sureg=r(wheat_sureg) wheat_sureg2=r(wheat_sureg2) ///
   corn_ols=r(corn_ols) corn_sureg=r(corn_sureg) corn_sureg2=r(corn_sureg2) ///
   soy_ols=r(soy_ols) soy_sureg=r(soy_sureg) soy_sureg2=r(soy_sureg2), rep(100): sur_test

sum
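* To make the comparison with the true value of .75 explicit, a compact
* summary of the mean and standard deviation of each estimator:
tabstat wheat_ols wheat_sureg wheat_sureg2 corn_ols corn_sureg corn_sureg2 ///
        soy_ols soy_sureg soy_sureg2, stat(mean sd) col(stat)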
 
* From the simulation we can see that SUR outperforms OLS and that the constrained
* SUR (sureg2) outperforms the unconstrained SUR.  This shows up in how much closer
* the average coefficient estimates are to the true value of .75.  In addition, SUR
* and the constrained SUR are more efficient: their estimates have smaller standard
* deviations.

Saturday, April 28, 2012

vlook

* Stata code, custom command
* vlook is a command for rapidly assessing how well the other variables in the
* current data set can explain a single variable of interest.

* This code creates a Stata command that runs diagnostics on a single variable in a data set.


* Drops the program vlook if it already is in memory
cap program drop vlook


* Defines the program vlook
program define vlook, rclass


  * Summarizes the variable of interest
  sum `1', detail  
    
  * Creates a histogram of the variable
  hist `1'
  
  *** This section checks the correlation between the variable and all other variables in the dataset
  * Declare empty locals to hold lists of correlated variables
  local corr_matrix25
  local corr_matrix5
  local corr_matrix75
  
  * Loops through all possible variables
  foreach v of varlist * {
    * Exclude the variable being examined
    if ("`v'"!="`1'") {
        * Correlate each of the other variables with `1'
      cap corr `v' `1'
        * If there is no error (_rc==0) and the absolute correlation is at
        * least .25 but less than .5, add `v' to the local corr_matrix25.
      if  (_rc==0&abs(r(rho))>=.25&abs(r(rho))<.5) local corr_matrix25 `corr_matrix25' `v'
        * If the absolute correlation is at least .5 but less than .75,
        * add `v' to the local corr_matrix5.
      if  (_rc==0&abs(r(rho))>=.5&abs(r(rho))<.75) local corr_matrix5 `corr_matrix5' `v'
        * If the absolute correlation is at least .75, add `v' to the
        * local corr_matrix75.
      if  (_rc==0&abs(r(rho))>=.75) local corr_matrix75 `corr_matrix75' `v'
    }
  }
  
  di "`1' is correlated with $corr_matrix25 at (.25<=corr<.5)"
    corr `1' $corr_matrix25
  di "`1' is correlated with $corr_matrix5 at (.5<=corr<.75)"
    corr `1' $corr_matrix5
  di "`1' is correlated with $corr_matrix75 at (.75<=corr)"
    corr `1' $corr_matrix75


  qui reg `1' `corr_matrix75' `corr_matrix5' `corr_matrix25'
  * This saves the R2 from the previous regression.  It is the upper limit of the variance
  * explainable by the available variables that are at least .25 correlated (in absolute
  * value) with the variable of interest.
  local upperlimit = string(e(r2),"%9.2f")


  *** This section makes a list of variables with explanatory power to analyze with respect
  *** to the variable of interest.
  * Creates an empty local to hold the variables.
  local reg_vars
  * Loops only through variables that are at least .5 correlated with the variable of interest.
  foreach v in `corr_matrix75' `corr_matrix5' {
    * This creates a flag telling Stata to exclude a variable from the regression.
    local flag=0
    * Loops through all of the variables that have already been added to reg_vars.
    foreach vv in `reg_vars' {
      qui corr `v' `vv'
      * If the variable is too correlated with one already in the list, exclude it.
      if (("`v'"!="`vv'")&(abs(r(rho))>=.90)) {
        local flag=1
        di "Excluding `v' from regression because of high correlation with `vv'"
        * Ends the innermost loop (Stata's loop-break syntax)
        continue, break
      }
    }
    * If the flag has not been tripped, add the variable `v' to the regression list.
    if (`flag'==0) local reg_vars `reg_vars' `v'
  }
  
  return local regvars `reg_vars'
  
  *** This does a detailed quantile analysis of the explanatory power of the variables
  *** around different quantiles.
  
  local quantile_list 125 250 375 500 625 750 875
  * First declare a matrix to hold the results of the quantile regression
  
  local nrows = wordcount("`reg_vars'") 
  local ncols = wordcount("`quantile_list'") 
  
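  * J(rows, cols, value) creates a Mata matrix filled with value: one row for
  * the quantile labels, one row per regressor, and one row for _cons.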
  mata: A=J(strtoreal(st_local("nrows"))+2, strtoreal(st_local("ncols")), 0)
  
  * Assign the quantile levels to the first row of the matrix
  mata: i=0
  foreach v in `quantile_list' {
    mata: i++
    mata:  A[1,i]=strtoreal(st_local("v"))/1000
  }


  
  mata: i=0
  
  foreach q in `quantile_list'  {
    mata: i++
qui di "qreg `1' `reg_vars', quantile(`=`q'/1000')"
qui qreg `1' `reg_vars', quantile(`=`q'/1000')
    mata: ii=1  


    foreach v in `reg_vars' _cons {
      mata: ii++
 cap local temp=string(_b[`v'],"%9.2f")
 if _rc!=0 local temp=.
 mata: A[ii,i]=strtoreal(st_local("temp"))
    }
  }
  
  * This displays the variables corresponding to what rows in the quantile regression.
  di "Rows of the coefficient matrix on the quantile regression:"
  local i = 0
  foreach v in quantile `reg_vars' _cons {
   local i = `i'+1
   di "`i' - `v'"
  }
  mata:A


  di _newline "reg `1' `reg_vars'"
  reg `1' `reg_vars'
  
  di "The upper limit on explainable variance is `upperlimit'"


end




*******************************************************
*                  Command Example                    *


use http://www.ats.ucla.edu/stat/stata/dae/crime, clear
vlook crime

Friday, April 27, 2012

Easy recode of all SPSS or SAS missing values into Stata

* Replaces observations that match the value after the == sign (-9999) with Stata missing (.)


foreach v of varlist * {
    capture replace `v'=. if `v'==-9999
}
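* If the raw data use several missing codes (say -9999 and -8888; hypothetical
* codes for illustration), the same loop works with inlist:

foreach v of varlist * {
    capture replace `v'=. if inlist(`v', -9999, -8888)
}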

Program to recode all missing in SPSS to Stata missing

* Stata code
* Converts all values that represent missing (such as -9999) to Stata missing values (.)

* Drops the program if it is already in the memory.
cap program drop recode_missing


* Defines a new program called recode_missing.
program define recode_missing

  local variable_changed = 0
  local variable_skipped_count = 0
  local variable_skipped


  * Loops through all variables identified by the first argument in the command.
  foreach v of varlist `1' {
   * Replaces the variable observations which match the second argument.
    capture replace `v'=. if `v'==`2'
   if _rc==0 local variable_changed = `variable_changed' + 1
   if _rc!=0 { 
      local variable_skipped_count = `variable_skipped_count' + 1
      local variable_skipped `variable_skipped' "`v'" 
   }
  }


  di "Recoded observations from `1' which equalled `2' into stata missing"
  di "# of variables changed: `variable_changed' "
  if `variable_skipped_count'!=0 {
 di "# of variables skipped: `variable_skipped_count' "
 di "skipped:  `variable_skipped'"
  }
end


* Example syntax
* . recode_missing * -9999
* This command will recode all values of -9999, in every variable, into Stata missing values (".").






***********************************************************
*          example of the command  in action              * 
use http://www.ats.ucla.edu/stat/stata/dae/logit.dta, clear
gen test="adsf"


* It does not make logical sense to recode 0s as missing; this is just to demonstrate the command.
recode_missing * 0


* Recoded observations from * which equaled 0 into stata missing
* # of variables changed: 4 
* # of variables skipped: 1 
* skipped:  test