## Monday, April 30, 2012

### Production Possibility Frontier: Simulation & Estimation

* Production Possibility Frontier
* Stata simulation and estimation

* This simulatoin estimates a production possibility frontier model
* while simultaneously estimating the effects of market power
* on productivity.  It loosely models the paper by Saleem Shaik,
* Albert Allen, Seanicaa Edwards, and James Harris 2009
* "Market Structure Conduct Performance Hypothesis Revisited Using
* Stochastic Frontier Efficiency Analysis"

* v represents firm or time specific random errors
* u represents technical inefficiencies

* Performance(u)=f(X1)
* Output(Y2)=f(X2;B)+v-u

* We want to ask if performance u is explained by: market share and
* market concentration

* u = intercept + B1 mshare + B2 Mconc + e

* Y = intercept + A1 labor + A2 capital + v - u

* They model different firms in different industries which they
* hope have the same production structure but different market shares
* and market concentration.

* Set the random seed
set seed 101

* I simulate 10 firms in 7 different industries

clear
set obs 10

forv i=1/7 {
* Generate labels for each firm in each industry _# refers to the different
* industries

* Generate market shares using a beta distribution.  The mean is about .1
* but it ranges from 0 to 1
gen share_`i' = rbeta(2,5)

* Create a measure of total market share
egen total_share_`i' = sum(share_`i')

* Standardize the shares so that they add up to 1
replace share_`i' = share_`i'/total_share_`i'

* Drop total share
drop total_share_*

* Order firms from largest to smallest
gsort -share_`i'

* Label the firms according to size
gen firm_`i'=_n

* Create concentration index CRM
gen CRM_`i' = sum(share_`i')

* But we will only use CR4
gen CR4_temp = CRM_`i' if firm_`i'==4
egen CR4_`i' = mean(CR4_temp)
drop CR4_temp
}
sum

* Now we want to create 1000 orders per firm
expand 1000

forv i=1/7 {

* Let's create vectors of input per observation
gen labor_`i'=runiform()*10
gen capital_`i'=runiform()*10

* Larger firms (share) tend to get larger contracts
qui replace labor_`i'=labor_`i'+6*share_`i'
qui replace capital_`i'=capital_`i'+6*share_`i'

* u is the half normal function plus some extra terms
* that Shaik uses.  The argument is that larger firms (large share)
* are large because they are efficient but that high
* market concentration leads to greater inefficiency.
gen u_`i' = 1+abs(rnormal()*5) - 2*share_`i' + CR4_`i'

* v is normally distributed
gen v_`i' = rnormal()*20

* Now we are ready to formulate our production model
gen y_`i' = 5*labor_`i' + 5*capital_`i' + 10*((labor_`i')*(capital_`i'))^.75 - u_`i' + v_`i'
}

* Unfortunately our data is wide now when we want to use the xtfrontier command.
* So we will transform from wide to tall with the reshape command

* This is neccessary
gen obs_num = _n
reshape long share_ firm_ CRM_ CR4_ labor_ capital_ u_  v_ y_, i(obs_num) j(industry)

sort industry obs_num

* I will map out the production possibility frontier
xtile x_y= y, nquantiles(50)
xtile x_labor=labor, nquantiles(50)
xtile x_capital=capital, nquantiles(50)

foreach i in 5 10 15 20 25 30 35 40 45 {
bysort x_labor: egen min_labor_`i'=mean(labor) if x_y==`i'
bysort x_labor: egen min_capital_`i'=mean(capital) if x_y==`i'

gen min_capital_`i'2=min_capital_`i'^2
gen min_capital_`i'3=min_capital_`i'^3
gen min_capital_`i'4=min_capital_`i'^4
gen min_capital_`i'5=min_capital_`i'^5

qui reg min_labor_`i' min_capital_`i' min_capital_`i'2 ///
min_capital_`i'3 min_capital_`i'4 min_capital_`i'5
predict min_labor_fit_`i'
label var min_labor_fit_`i' "`=`i'*2'% percentile"
}

* Creates a noisy graph of production possibilities
twoway (line min_labor_20 min_capital_20) ///
(line min_labor_30 min_capital_30) ///
(line min_labor_40 min_capital_40) ///
, title(Production Possibility Frontier) ///

* A smoother graph is found by using the fitted values from the previous regression.
twoway (line min_labor_fit_5 min_capital_5, sort) ///
(line min_labor_fit_10 min_capital_10, sort) ///
(line min_labor_fit_15 min_capital_15, sort) ///
(line min_labor_fit_20 min_capital_20, sort) ///
(line min_labor_fit_25 min_capital_25, sort) ///
(line min_labor_fit_30 min_capital_30, sort) ///
(line min_labor_fit_35 min_capital_35, sort) ///
(line min_labor_fit_40 min_capital_40, sort) ///
(line min_labor_fit_45 min_capital_45, sort) ///
, title(Production Possibility Frontier) ///

* Now to remove the unneccessary _s from the variables
foreach v in share_ firm_ CRM_ CR4_ labor_ capital_ u_  v_ y_ {
* This command will find the first instance of _ in the `v' and replace it empty
local v_temp = subinstr("`v'","_","",1)
rename `v' `v_temp'
}

gen labor_capital = (labor*capital)^.75

* This is pretty similar to doing a single OLS regression:
reg y labor capital CR4 share

* This is pretty similar to doing a single OLS regression:
reg y labor capital labor_capital CR4 share

* However, you can see that the simultaneous equation estimate is able to get at what
* we are interested in, "How does market share and concentration predict performance?"
* The OLS regression is only trying to predict output which ends up with a weaker estimate.

* Set the different industries as the panel variable
xtset industry

* First step; I will estimate the production fontier model, with a max number of
* interations of 10.
xtfrontier y labor capital, ti iter(10)

xtfrontier y labor capital labor_capital, ti

* I have not got this to work entirely the way I would like it to.  I might return to this in the future.

## Sunday, April 29, 2012

### Seemingly Unrelated Regression - with Correlated Errors

* Stata Simulation and Estimation

* This code simulates a systems OLS: Seemingly Unrelated Regression (SUR) with Panel data
* on three different dependent variables each with different explanatory variables.
* In this case we will image that we want to predict the domestic price of
* wheat, corn, soybeans from the world price of wheat, corn, soybeans in Moldova
* This simple analysis would not be as easy if you wanted to use a large country
* since the price of these products domestically could cause changes in world prices.

* Imagine a 3 equation SUR system with:

* y1 = x1B1 + u1
* y2 = x2B2 + u2
* y3 = x3B3 + u3

* The key in SUR is to assume that E(y1|x1,x2,x2)=E(y1|x1)=x1B1 or in other words that it
* does not create bias in estimating y1 if you were to leave out x2 and x3.  In the example
* this argument explicitly states that you do not need to know the world price of corn to
* predict the domestic price of wheat so long as you are already using the world price of

* Clear the memory
clear
set seed 101

* Image that we have data for 20 years starting in 1985
set obs 20
gen year=1984+_n

* First let's image that over the years there are random price shocks that hit all three domestic
* products such as drought, heat waves, early frosts.  These are captured by the error term v.

gen v=4*rnormal()

* Let's imagine also that the expected price of wheat is 4, corn 3, and soybeans 3.5

* First we generate the shocks to different prices that are not shared plus the shocks that
* are shared (v).

gen u_wheat=rnormal()+v
gen u_corn=rnormal()+v
gen u_soy=rnormal()+v

* Then we generate world prices for year 1985

gen wheat_world=abs(4 + 2*rnormal()) if year ==1985
gen corn_world =abs(3 + 2*rnormal()) if year ==1985
gen soy_world=abs(3.5 + 2*rnormal()) if year ==1985

* Let's assume that world prices are AR1.  That is, last year's price is positively correlated
* with this year's price but not next year's price.

replace wheat_world=abs(4/2 + 2*rnormal() + wheat_world[_n-1]/2) if year !=1985
replace corn_world =abs(3/2 + 2*rnormal() + corn_world[_n-1]/2)  if year !=1985
replace soy_world=abs(3.5/2 + 2*rnormal() + soy_world[_n-1]/2)   if year !=1985

* World prices are also autoregressive to the first degree AR1.  That is controlling for
* previous period prices perfectly predicts current prices.

* The abs function ensures that world prices never become negative.

* In addition, let's say there was a common trend that caused world prices to rise graduatelly
* near the center of our data before falling again.

* We want the year 1 and year 20 to have no trend.  Let's let the trend be a second degree polynomial.
* so t[year]=alpha1*year + beta*year^2.  we want t=0, t=5, t=0

gen t = 0 if year == 2004 | year == 1985
replace t = 5 if year == 1995
gen year2=year^2

reg t year year2
predict trend
drop t

* Now lets add the trend back to world prices

replace wheat_world=wheat_world+trend
replace corn_world =corn_world+trend
replace soy_world=soy_world+trend

* Let's see what the world price look like
twoway (line wheat_world year) (line corn_world year) (line soy_world year)

* Now, let us specify how we want world prices to influence domestic prices.
* In an free global market world prices should be perfectly predictive of local prices.
* So let us say the coefficients on world prices in predicting domestic prices is 1.5.
* To put everything together:

gen wheat_moldova = .75*wheat_world + abs(u_wheat + 2)
gen corn_moldova  = .75*corn_world + abs(u_corn + 2)
gen soy_moldova   = .75*soy_world + abs(u_soy +2)

* Let's see what the world price look like
twoway (line wheat_world year) (line wheat_moldova year) ///
(line corn_world year)  (line corn_moldova year) ///
(line soy_world year)   (line soy_moldova year)

* We could do three different regressions:
reg wheat_moldova wheat_world
reg corn_moldova corn_world
reg soy_moldova soy_world

* Which gives us unbiased and consistent estimates.  However, we could also do
* SUR which should allow for correlation between the errors (v).  Thus

sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world)

* Since we think (know) that the coefficients on the different world prices are the same
* we could specify a constraint to force them to be the same accross equations.

constraint 1 [wheat_moldova]wheat_world=[corn_moldova]corn_world
constraint 2 [wheat_moldova]wheat_world=[soy_moldova]soy_world

sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world) , const(1 2 )

* To test the relative merits of SUR next to OLS we will use Monte Carlo simulation.
* Same code as previously but now a program needs be defined.  With return statements.
cap program drop sur_test

program define sur_test, rclass
clear
* set seed 101  - must turn this off or else every repetition will be exactly the same
set obs 20
gen year=1984+_n
gen v=4*rnormal()
gen u_wheat=rnormal()+v
gen u_corn=rnormal()+v
gen u_soy=rnormal()+v
gen wheat_world=abs(4 + 2*rnormal()) if year ==1985
gen corn_world =abs(3 + 2*rnormal()) if year ==1985
gen soy_world=abs(3.5 + 2*rnormal()) if year ==1985
replace wheat_world=abs(4/2 + 2*rnormal() + wheat_world[_n-1]/2) if year !=1985
replace corn_world =abs(3/2 + 2*rnormal() + corn_world[_n-1]/2)  if year !=1985
replace soy_world=abs(3.5/2 + 2*rnormal() + soy_world[_n-1]/2)   if year !=1985
gen t = 0 if year == 2004 | year == 1985
replace t = 5 if year == 1995
gen year2=year^2
reg t year year2
predict trend
drop t
replace wheat_world=wheat_world+trend
replace corn_world =corn_world+trend
replace soy_world=soy_world+trend
gen wheat_moldova = .75*wheat_world + abs(u_wheat + 2)
gen corn_moldova  = .75*corn_world + abs(u_corn + 2)
gen soy_moldova   = .75*soy_world + abs(u_soy +2)
* Same code as before but without the comments up to this point
reg wheat_moldova wheat_world
* Return, returns a scalar
return scalar wheat_ols=_b[wheat_world]
reg corn_moldova corn_world
return scalar corn_ols=_b[corn_world]
reg soy_moldova soy_world
return scalar soy_ols=_b[soy_world]
sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world)

* This retrieves the cofficients and sends them out as returns.  There must be a better
* way of doing this.
matrix A= e(b)'
svmat A, names(sureg)

cap return scalar wheat_sureg=sureg
cap return scalar corn_sureg=sureg
cap return scalar soy_sureg=sureg

* Since we think (know) that the coefficients on the different world prices are the same
* we could specify a constraint to force them to be the same accross equations.

constraint 1 [wheat_moldova]wheat_world=[corn_moldova]corn_world
constraint 2 [wheat_moldova]wheat_world=[soy_moldova]soy_world

sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world) , const(1 2 )

matrix A2=e(b)'
svmat A2, names(sureg2)

cap return scalar wheat_sureg2=sureg2
cap return scalar corn_sureg2=sureg2
cap return scalar soy_sureg2=sureg2

end
* End of the program

sur_test

simulate wheat_ols=r(wheat_ols) wheat_sureg=r(wheat_sureg) wheat_sureg2=r(wheat_sureg2) ///
corn_ols=r(corn_ols) corn_sureg=r(corn_sureg) corn_sureg2=r(corn_sureg2) ///
soy_ols=r(soy_ols) soy_sureg=r(soy_sureg) soy_sureg2=r(soy_sureg2), rep(100): sur_test

sum

* From the simulation we can see that SUR outperforms ols and SUR2 outperforms sureg.
* This is visible generally by how much closer the average estimate of the cofficients are
* on the true value which is .75 In addition, you can see that SUR and SUR2 are more
* efficient because they have smaller standard errors.

## Saturday, April 28, 2012

### vlook

* Stata code, custom command
* vlook is a powerful command for rapidly assessing the capability of current data to explain one variable

* This code creates a stata command that does a diagnostic of a single variable in a data set.

* Drops the program vlook if it already is in memory
cap program drop vlook

* Defines the program vlook
program define vlook, rclass

* Summarizes the variable of interest
sum `1', detail

* Creates a histogram of the variable
hist `1'

*** This sections checks the correlation between the variable and all other variables in the dataset
* Declare empty variable holders
local corr_matrix25
local corr_matrix5
local corr_matrix75

* Loops through all possible variables
foreach v of varlist * {
* Excude the variable being examined
if ("`v'"!="`1'") {
* Correlates each of the variables in the dataset
cap corr `v' `1'
* If there is no error _rc and the correlation is greater than .25 and less then .5 add `v' to the
* global called corr_matrix25.
if  (_rc==0&abs(r(rho))>=.25&abs(r(rho))<.5) local corr_matrix25 `corr_matrix25' `v'
* If there is no error _rc and the correlation is greater than .5 and less then .75 add `v' to the
* global called corr_matrix5.
if  (_rc==0&abs(r(rho))>=.5&abs(r(rho))<.75) local `corr_matrix5'  `corr_matrix5' `v'
* If there is no error _rc and the correlation is greater than or equal to .5 and less then .75 add
* `v' to the global called corr_matrix75.
if  (_rc==0&abs(r(rho))>=.75) local `corr_matrix' `corr_matrix75' `v'
}
}

di "`1' is correlated with \$corr_matrix25 at (.25<=corr<.5)"
corr `1' \$corr_matrix25
di "`1' is correlated with \$corr_matrix5 at (.5<=corr<.75)"
corr `1' \$corr_matrix5
di "`1' is correlated with \$corr_matrix75 at (.75<=corr)"
corr `1' \$corr_matrix75

qui reg `1' \$corr_matrix75 \$corr_matrix5 \$corr_matrix25
* This saves the r2 from the previous regression.  It is the upper limmit of explainable variance by use
* of the available variables which are at lease .25% correlated with the variable of interest
local upperlimit = string(e(r2),"%9.2f")

*** This sections makes a list of variables with explanatory power to analysze with respect to the variable
*** of interest.
* Creates an empty global to hold variables.
local reg_vars
* Loops through only variables that are at least 50% correlated with the variable of interest.
foreach v in \$corr_matrix75 \$corr_matrix5 {
* This creates a flag to tell stata to exclude a variable from the regression.
local flag=0
* Loops through all of the variables that have already been added to reg_vars.
foreach vv in \$reg_vars {
qui corr `v' `vv'
* If any variable is too correlated with any other variable it is excluded from the regression.
if (("`v'"!="`vv'")&(abs(r(rho))>=.90)) {
local flag=1
di "Excluding `v' from regression because of high correlation with `vv'"
* Ends the most nested loop
break
}
}
* If the flag has not been sprung add the variable `v' to the regression
if (`flag'==0)   local reg_vars `reg_vars' `v'
}

return local regvars `regvars'

*** This does a detailed quantile analysis of the explanatory power of the variables
*** around different quantiles.

local quantile_list 125 250 375 500 625 750 875
* First declare a matrix to hold the results of the quantile regression

local nrows = wordcount("`reg_vars'")
local ncols = wordcount("`quantile_list'")

mata: A=J(strtoreal(st_local("nrows"))+2, strtoreal(st_local("ncols")), 0)

* Assign the quantile levels to the first row of the matrix
mata: i=0
foreach v in `quantile_list' {
mata: i++
mata:  A[1,i]=strtoreal(st_local("v"))/1000
}

mata: i=0

foreach q in `quantile_list'  {
mata: i++
qui di "qreg `1' `reg_vars', quantile(`=`q'/1000')"
qui qreg `1' `reg_vars', quantile(`=`q'/1000')
mata: ii=1

foreach v in `reg_vars' _cons {
mata: ii++
cap local temp=string(_b[`v'],"%9.2f")
if _rc!=0 local temp=.
mata: A[ii,i]=strtoreal(st_local("temp"))
}
}

* This displays the variables corresponding to what rows in the quantile regression.
di "Rows of the coefficient matrix on the quantile regression:"
local i = 0
foreach v in quantile `reg_vars' _cons{
local i = `i'+1
di "`i' - `v'"
}
mata:A

di _newline "reg `1' `reg_vars'"
reg `1' `reg_vars'

di "The upper limit on explainable variance is `upperlimit'"

end

*******************************************************
*                  Command Example                    *

use http://www.ats.ucla.edu/stat/stata/dae/crime, clear
vlook crime

## Friday, April 27, 2012

### Easy recode all missing SPSS or SAS into STATA

* Replaces the variable observations which match the argument after the == sign (-9999) with stata missing(.)

foreach v of varlist * {
capture replace `v'=. if `v'==-9999
}

### Program to recode all missing in SPSS to Stata missing

* Stata code
* Converts all values that equal missing such as -9999 to Stata missing values .

* Drops the program if it is already in the memory.
cap program drop recode_missing

* Defines a new program called recode_missing.
program define recode_missing

local variable_changed = 0
local variable_skipped_count = 0
local variable_skipped

* Loops through all variables identified by the first argument in the command.
foreach v of varlist `1' {
* Replaces the variable observations which match the second argument.
capture replace `v'=. if `v'==`2'
if _rc==0 local variable_changed = `variable_changed' + 1
if _rc!=0 {
local variable_skipped_count = `variable_skipped_count' + 1
local variable_skipped `variable_skipped' "`v'"
}
}

di "Recoded observations from `1' which equalled `2' into stata missing"
di "# of variables changed: `variable_changed' "
if `variable_skipped_count'!=0 {
di "# of variables skipped: `variable_skipped_count' "
di "skipped:  `variable_skipped'"
}
end

* Example syntax
* . recode_missing * -9999
* This command will recode all variables that contain a -9999 into stata missing values ".".

***********************************************************
*          example of the command  in action              *
use http://www.ats.ucla.edu/stat/stata/dae/logit.dta, clear