Monday, April 30, 2012

Production Possibility Frontier: Simulation & Estimation

* Production Possibility Frontier
* Stata simulation and estimation

* This simulation estimates a production possibility frontier model
* while simultaneously estimating the effects of market power
* on productivity.  It loosely models the paper by Saleem Shaik,
* Albert Allen, Seanicaa Edwards, and James Harris 2009
* "Market Structure Conduct Performance Hypothesis Revisited Using
* Stochastic Frontier Efficiency Analysis"

* v represents firm or time specific random errors
* u represents technical inefficiencies

* Performance(u)=f(X1)
* Output(Y2)=f(X2;B)+v-u

* We want to ask whether performance (u) is explained by market share
* and market concentration.

* u = intercept + B1 mshare + B2 Mconc + e

* Y = intercept + A1 labor + A2 capital + v - u

* They model different firms in different industries, which they hope
* share the same production structure but differ in market shares
* and market concentration.

* Set the random seed
set seed 101

* I simulate 10 firms in 7 different industries

clear
set obs 10

forv i=1/7 {
  * Generate variables for each firm in each industry; the _# suffix
  * identifies the industry.

  * Generate raw market shares using a beta(2,5) distribution, which has
  * mean 2/7 (about .29) and support from 0 to 1.  Once the shares are
  * standardized to sum to 1 below, the average share is .1.
  gen share_`i' = rbeta(2,5)
 
  * Create a measure of total market share
  egen total_share_`i' = total(share_`i')
 
  * Standardize the shares so that they add up to 1
  replace share_`i' = share_`i'/total_share_`i'
 
  * Drop total share
  drop total_share_*
 
  * Order firms from largest to smallest
  gsort -share_`i'

  * Label the firms according to size
  gen firm_`i'=_n

  * Create the concentration index CRM.  Note that gen's sum() is a
  * running sum, so with firms sorted largest to smallest, row m holds
  * the m-firm concentration ratio (see the short aside after the loop).
  gen CRM_`i' = sum(share_`i')
 
  * But we will only use CR4
  gen CR4_temp = CRM_`i' if firm_`i'==4
  egen CR4_`i' = mean(CR4_temp)
  drop CR4_temp
}
sum
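* Aside (referenced above): gen's sum() is a running sum, while egen's
* total() returns a single overall total.  A minimal illustration with
* throwaway data, wrapped in preserve/restore so the simulation is untouched:
preserve
clear
set obs 3
gen x = _n
gen runsum = sum(x)    // running sum: 1, 3, 6
egen tot = total(x)    // overall total: 6 in every row
list
restore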

* Now we want to create 1000 orders per firm
expand 1000
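* (expand 1000 replaces each of the 10 observations with 1,000 copies,
* so each industry panel now has 10,000 rows.)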

forv i=1/7 {

  * Generate input levels for each order (observation)
  gen labor_`i'=runiform()*10
  gen capital_`i'=runiform()*10

  * Larger firms (share) tend to get larger contracts
  qui replace labor_`i'=labor_`i'+6*share_`i'
  qui replace capital_`i'=capital_`i'+6*share_`i'

  * u is a half-normal draw plus the extra terms that Shaik et al. use.
  * The argument is that larger firms (large share) are large because
  * they are efficient, but that high market concentration leads to
  * greater inefficiency.
  gen u_`i' = 1+abs(rnormal()*5) - 2*share_`i' + CR4_`i'
 
  * v is normally distributed
  gen v_`i' = rnormal()*20
 
  * Now we are ready to formulate our production model
  gen y_`i' = 5*labor_`i' + 5*capital_`i' + 10*((labor_`i')*(capital_`i'))^.75 - u_`i' + v_`i'
}
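* As a quick sanity check on the simulation: the composed error v - u of a
* stochastic frontier model should be left-skewed, since u enters one-sided.
* A minimal check on industry 1 (the check variable is temporary):
gen e_check = v_1 - u_1
sum e_check, detail
drop e_check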

* Unfortunately our data are now in wide format, while the xtfrontier
* command expects long format.  So we will transform from wide to long
* with the reshape command.

* reshape requires an identifier variable for each observation
gen obs_num = _n
reshape long share_ firm_ CRM_ CR4_ labor_ capital_ u_  v_ y_, i(obs_num) j(industry)

sort industry obs_num

* I will map out the production possibility frontier
xtile x_y = y, nquantiles(50)
xtile x_labor = labor, nquantiles(50)
xtile x_capital = capital, nquantiles(50)

foreach i in 5 10 15 20 25 30 35 40 45 {
  * Despite the min_ prefix, these hold mean input levels by labor
  * percentile for observations in output quantile `i' (of 50) - an isoquant.
  bysort x_labor: egen min_labor_`i'=mean(labor) if x_y==`i'
  bysort x_labor: egen min_capital_`i'=mean(capital) if x_y==`i'
 
  gen min_capital_`i'2=min_capital_`i'^2
  gen min_capital_`i'3=min_capital_`i'^3
  gen min_capital_`i'4=min_capital_`i'^4
  gen min_capital_`i'5=min_capital_`i'^5

  qui reg min_labor_`i' min_capital_`i' min_capital_`i'2 ///
        min_capital_`i'3 min_capital_`i'4 min_capital_`i'5
  predict min_labor_fit_`i'
  label var min_labor_fit_`i' "`=`i'*2'th output percentile"
}

* Creates a noisy graph of production possibilities
twoway (line min_labor_20 min_capital_20) ///
       (line min_labor_30 min_capital_30) ///
       (line min_labor_40 min_capital_40) ///
        , title(Production Possibility Frontier)

* A smoother graph is found by using the fitted values from the previous regression.
twoway (line min_labor_fit_5 min_capital_5, sort) ///
       (line min_labor_fit_10 min_capital_10, sort) ///
       (line min_labor_fit_15 min_capital_15, sort) ///
       (line min_labor_fit_20 min_capital_20, sort) ///
       (line min_labor_fit_25 min_capital_25, sort) ///
       (line min_labor_fit_30 min_capital_30, sort) ///
       (line min_labor_fit_35 min_capital_35, sort) ///
       (line min_labor_fit_40 min_capital_40, sort) ///
       (line min_labor_fit_45 min_capital_45, sort) ///
       , title(Production Possibility Frontier)

* Now to remove the unnecessary trailing _ from the variable names
foreach v in share_ firm_ CRM_ CR4_ labor_ capital_ u_  v_ y_ {
  * subinstr finds the first instance of _ in `v' and replaces it with nothing
  local v_temp = subinstr("`v'","_","",1)
  rename `v' `v_temp'
}

gen labor_capital = (labor*capital)^.75

* For comparison, start with a single OLS regression:
reg y labor capital CR4 share

* And the same OLS regression with the interaction term included:
reg y labor capital labor_capital CR4 share


* However, the simultaneous-equation estimate gets at what we are actually
* interested in: "How do market share and concentration predict performance?"
* The OLS regression only tries to predict output, which yields a weaker estimate.

* Set the different industries as the panel variable
xtset industry

* First step: estimate the production frontier model, with a maximum
* number of iterations of 10.
xtfrontier y labor capital, ti iter(10)

xtfrontier y labor capital labor_capital, ti
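* As a rough sketch of the second step described at the top (not the
* simultaneous estimator in the paper): recover the estimated inefficiency
* component with predict's u option and regress it on market share and
* concentration.
predict u_hat, u
reg u_hat share CR4
drop u_hat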

* I have not gotten this to work entirely the way I would like.  I might return to it in the future.

Sunday, April 29, 2012

Seemingly Unrelated Regression - with Correlated Errors


* Stata Simulation and Estimation

* This code simulates a systems OLS: Seemingly Unrelated Regression (SUR) with panel data
* on three different dependent variables, each with different explanatory variables.
* In this case we imagine that we want to predict the domestic prices of
* wheat, corn, and soybeans in Moldova from their world prices.
* This simple analysis would not work as cleanly for a large country,
* since domestic prices there could feed back into world prices.

* Imagine a 3 equation SUR system with:

* y1 = x1B1 + u1
* y2 = x2B2 + u2
* y3 = x3B3 + u3

* The key assumption in SUR is that E(y1|x1,x2,x3)=E(y1|x1)=x1B1; in other words,
* leaving x2 and x3 out of the y1 equation creates no bias.  In our example
* this says that you do not need to know the world price of corn to
* predict the domestic price of wheat, so long as you are already using the
* world price of wheat in your analysis.

* Clear the memory
clear
set seed 101

* Imagine that we have data for 20 years starting in 1985
set obs 20
gen year=1984+_n

* First let's imagine that over the years there are random price shocks that hit all three
* domestic products, such as droughts, heat waves, and early frosts.  These shared shocks
* are captured by the error term v.

gen v=4*rnormal()

* Let's imagine also that the expected price of wheat is 4, corn 3, and soybeans 3.5

* First we generate the shocks to different prices that are not shared plus the shocks that
* are shared (v).

gen u_wheat=rnormal()+v
gen u_corn=rnormal()+v
gen u_soy=rnormal()+v

* Then we generate world prices for year 1985

gen wheat_world=abs(4 + 2*rnormal()) if year ==1985
gen corn_world =abs(3 + 2*rnormal()) if year ==1985
gen soy_world=abs(3.5 + 2*rnormal()) if year ==1985

* Let's assume that world prices are AR(1).  That is, this year's price depends directly
* on last year's price plus a fresh shock (so correlation decays with the lag length).

replace wheat_world=abs(4/2 + 2*rnormal() + wheat_world[_n-1]/2) if year !=1985
replace corn_world =abs(3/2 + 2*rnormal() + corn_world[_n-1]/2)  if year !=1985
replace soy_world=abs(3.5/2 + 2*rnormal() + soy_world[_n-1]/2)   if year !=1985
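* An equivalent, arguably safer, formulation would use Stata's time-series
* lag operator (a sketch, assuming the data are tsset by year):
* tsset year
* replace wheat_world=abs(4/2 + 2*rnormal() + L.wheat_world/2) if year!=1985
* The explicit [_n-1] subscripts above work because replace processes
* observations in order, so each year sees the already-updated prior year.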

* Because the process is AR(1), conditioning on the previous period's price captures
* all of the persistence in prices.

* The abs function ensures that world prices never become negative.

* In addition, let's say there was a common trend that caused world prices to rise gradually
* toward the center of our sample before falling again.

* We want the first and last years to have no trend.  Let the trend be a second-degree
* polynomial, t(year) = alpha0 + alpha1*year + alpha2*year^2, with t(1985)=0, t(1995)=5,
* and t(2004)=0; fitting a quadratic through three points recovers it exactly.

gen t = 0 if year == 2004 | year == 1985
replace t = 5 if year == 1995
gen year2=year^2

reg t year year2
predict trend
drop t
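* Quick check: with three anchor points and three parameters the quadratic
* should fit them exactly.
list year trend if inlist(year, 1985, 1995, 2004)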

* Now lets add the trend back to world prices

replace wheat_world=wheat_world+trend
replace corn_world =corn_world+trend
replace soy_world=soy_world+trend

* Let's see what the world prices look like
twoway (line wheat_world year) (line corn_world year) (line soy_world year)

* Now, let us specify how we want world prices to influence domestic prices.
* In a free global market, world prices should be highly predictive of local prices.
* So let us say the coefficient on each world price in predicting the domestic
* price is .75.  To put everything together:

gen wheat_moldova = .75*wheat_world + abs(u_wheat + 2)
gen corn_moldova  = .75*corn_world + abs(u_corn + 2)
gen soy_moldova   = .75*soy_world + abs(u_soy +2)

* Let's see how the world and domestic prices compare
twoway (line wheat_world year) (line wheat_moldova year) ///
       (line corn_world year)  (line corn_moldova year) ///
       (line soy_world year)   (line soy_moldova year)

* We could do three different regressions:
reg wheat_moldova wheat_world
reg corn_moldova corn_world
reg soy_moldova soy_world

* These give us unbiased and consistent estimates.  However, we can also use
* SUR, which exploits the correlation between the errors (v).  Thus:

sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world)
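* sureg's corr option also displays the cross-equation residual correlation
* matrix and a Breusch-Pagan test of independence, which is what justifies
* SUR over equation-by-equation OLS here:
sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world), corr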

* Since we think (know) that the coefficients on the different world prices are the same,
* we can specify constraints to force them to be equal across equations.

constraint 1 [wheat_moldova]wheat_world=[corn_moldova]corn_world
constraint 2 [wheat_moldova]wheat_world=[soy_moldova]soy_world

sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world) , const(1 2 )
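* Rather than simply imposing the constraints, we could first check whether
* the data reject them with a Wald test on the unconstrained model:
qui sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world)
test ([wheat_moldova]wheat_world=[corn_moldova]corn_world) ///
     ([wheat_moldova]wheat_world=[soy_moldova]soy_world)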


* To test the relative merits of SUR against OLS we will use Monte Carlo simulation.
* This is the same code as before, but now it must be wrapped in a program
* with return statements.
cap program drop sur_test

program define sur_test, rclass
  clear
* set seed 101  - must turn this off or else every repetition will be exactly the same
  set obs 20
  gen year=1984+_n
  gen v=4*rnormal()
  gen u_wheat=rnormal()+v
  gen u_corn=rnormal()+v
  gen u_soy=rnormal()+v
  gen wheat_world=abs(4 + 2*rnormal()) if year ==1985
  gen corn_world =abs(3 + 2*rnormal()) if year ==1985
  gen soy_world=abs(3.5 + 2*rnormal()) if year ==1985
  replace wheat_world=abs(4/2 + 2*rnormal() + wheat_world[_n-1]/2) if year !=1985
  replace corn_world =abs(3/2 + 2*rnormal() + corn_world[_n-1]/2)  if year !=1985
  replace soy_world=abs(3.5/2 + 2*rnormal() + soy_world[_n-1]/2)   if year !=1985
  gen t = 0 if year == 2004 | year == 1985
  replace t = 5 if year == 1995
  gen year2=year^2
  reg t year year2
  predict trend
  drop t
  replace wheat_world=wheat_world+trend
  replace corn_world =corn_world+trend
  replace soy_world=soy_world+trend
  gen wheat_moldova = .75*wheat_world + abs(u_wheat + 2)
  gen corn_moldova  = .75*corn_world + abs(u_corn + 2)
  gen soy_moldova   = .75*soy_world + abs(u_soy +2)
* Same code as before but without the comments up to this point
  reg wheat_moldova wheat_world
* return scalar passes each estimate back to the caller
return scalar wheat_ols=_b[wheat_world]
  reg corn_moldova corn_world
return scalar corn_ols=_b[corn_world]
  reg soy_moldova soy_world
return scalar soy_ols=_b[soy_world]
  sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world)

* This retrieves the coefficients and sends them out as returns.  There must be a better
* way of doing this (see the note below).
matrix A= e(b)'
svmat A, names(sureg)

cap return scalar wheat_sureg=sureg[1]
cap return scalar corn_sureg=sureg[3]
cap return scalar soy_sureg=sureg[5]
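* (One better way, I believe, is Stata's equation-qualified coefficient
* syntax, which reads each estimate directly and avoids svmat, e.g.
*   return scalar wheat_sureg=[wheat_moldova]_b[wheat_world]
* though the svmat approach above also works.)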

* Since we think (know) that the coefficients on the different world prices are the same,
* we can specify constraints to force them to be equal across equations.

constraint 1 [wheat_moldova]wheat_world=[corn_moldova]corn_world
constraint 2 [wheat_moldova]wheat_world=[soy_moldova]soy_world

sureg (wheat_moldova wheat_world) (corn_moldova corn_world) (soy_moldova soy_world) , const(1 2 )

matrix A2=e(b)'
svmat A2, names(sureg2)

cap return scalar wheat_sureg2=sureg2[1]
cap return scalar corn_sureg2=sureg2[3]
cap return scalar soy_sureg2=sureg2[5]

end
* End of the program

sur_test

simulate wheat_ols=r(wheat_ols) wheat_sureg=r(wheat_sureg) wheat_sureg2=r(wheat_sureg2) ///
   corn_ols=r(corn_ols) corn_sureg=r(corn_sureg) corn_sureg2=r(corn_sureg2) ///
   soy_ols=r(soy_ols) soy_sureg=r(soy_sureg) soy_sureg2=r(soy_sureg2), rep(100): sur_test

sum
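* To make the comparison with the true value of .75 explicit, a compact
* summary of the mean and standard deviation of each estimator:
tabstat wheat_ols wheat_sureg wheat_sureg2 corn_ols corn_sureg corn_sureg2 ///
        soy_ols soy_sureg soy_sureg2, stat(mean sd) col(stat)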
 
* From the simulation we can see that SUR outperforms OLS and that the constrained
* SUR (sureg2) outperforms the unconstrained SUR.  This shows up in how much closer
* the average coefficient estimates are to the true value of .75.  In addition, SUR
* and the constrained SUR are more efficient: their estimates have smaller standard
* deviations.

Saturday, April 28, 2012

vlook

* Stata code, custom command
* vlook is a command for rapidly assessing how well the other variables in the
* current data set can explain a single variable of interest.

* This code creates a Stata command that runs diagnostics on a single variable in a data set.


* Drops the program vlook if it already is in memory
cap program drop vlook


* Defines the program vlook
program define vlook, rclass


  * Summarizes the variable of interest
  sum `1', detail  
    
  * Creates a histogram of the variable
  hist `1'
  
  *** This section checks the correlation between the variable and all other variables in the dataset
  * Declare empty locals to hold lists of correlated variables
  local corr_matrix25
  local corr_matrix5
  local corr_matrix75
  
  * Loops through all possible variables
  foreach v of varlist * {
    * Exclude the variable being examined
    if ("`v'"!="`1'") {
        * Correlate each of the other variables with `1'
      cap corr `v' `1'
        * If there is no error (_rc==0) and the absolute correlation is at
        * least .25 but less than .5, add `v' to the local corr_matrix25.
      if  (_rc==0&abs(r(rho))>=.25&abs(r(rho))<.5) local corr_matrix25 `corr_matrix25' `v'
        * If the absolute correlation is at least .5 but less than .75,
        * add `v' to the local corr_matrix5.
      if  (_rc==0&abs(r(rho))>=.5&abs(r(rho))<.75) local corr_matrix5 `corr_matrix5' `v'
        * If the absolute correlation is at least .75, add `v' to the
        * local corr_matrix75.
      if  (_rc==0&abs(r(rho))>=.75) local corr_matrix75 `corr_matrix75' `v'
    }
  }
  
  di "`1' is correlated with $corr_matrix25 at (.25<=corr<.5)"
    corr `1' $corr_matrix25
  di "`1' is correlated with $corr_matrix5 at (.5<=corr<.75)"
    corr `1' $corr_matrix5
  di "`1' is correlated with $corr_matrix75 at (.75<=corr)"
    corr `1' $corr_matrix75


  qui reg `1' `corr_matrix75' `corr_matrix5' `corr_matrix25'
  * This saves the R2 from the previous regression.  It is the upper limit of the variance
  * explainable by the available variables that are at least .25 correlated (in absolute
  * value) with the variable of interest.
  local upperlimit = string(e(r2),"%9.2f")


  *** This section makes a list of variables with explanatory power to analyze with respect
  *** to the variable of interest.
  * Creates an empty local to hold the variables.
  local reg_vars
  * Loops only through variables that are at least .5 correlated with the variable of interest.
  foreach v in `corr_matrix75' `corr_matrix5' {
    * This creates a flag telling Stata to exclude a variable from the regression.
    local flag=0
    * Loops through all of the variables that have already been added to reg_vars.
    foreach vv in `reg_vars' {
      qui corr `v' `vv'
      * If the variable is too correlated with one already in the list, exclude it.
      if (("`v'"!="`vv'")&(abs(r(rho))>=.90)) {
        local flag=1
        di "Excluding `v' from regression because of high correlation with `vv'"
        * Ends the innermost loop (Stata's loop-break syntax)
        continue, break
      }
    }
    * If the flag has not been tripped, add the variable `v' to the regression list.
    if (`flag'==0) local reg_vars `reg_vars' `v'
  }
  
  return local regvars `reg_vars'
  
  *** This does a detailed quantile analysis of the explanatory power of the variables
  *** around different quantiles.
  
  local quantile_list 125 250 375 500 625 750 875
  * First declare a matrix to hold the results of the quantile regression
  
  local nrows = wordcount("`reg_vars'") 
  local ncols = wordcount("`quantile_list'") 
  
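  * J(rows, cols, value) creates a Mata matrix filled with value: one row for
  * the quantile labels, one row per regressor, and one row for _cons.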
  mata: A=J(strtoreal(st_local("nrows"))+2, strtoreal(st_local("ncols")), 0)
  
  * Assign the quantile levels to the first row of the matrix
  mata: i=0
  foreach v in `quantile_list' {
    mata: i++
    mata:  A[1,i]=strtoreal(st_local("v"))/1000
  }


  
  mata: i=0
  
  foreach q in `quantile_list'  {
    mata: i++
qui di "qreg `1' `reg_vars', quantile(`=`q'/1000')"
qui qreg `1' `reg_vars', quantile(`=`q'/1000')
    mata: ii=1  


    foreach v in `reg_vars' _cons {
      mata: ii++
 cap local temp=string(_b[`v'],"%9.2f")
 if _rc!=0 local temp=.
 mata: A[ii,i]=strtoreal(st_local("temp"))
    }
  }
  
  * This displays the variables corresponding to what rows in the quantile regression.
  di "Rows of the coefficient matrix on the quantile regression:"
  local i = 0
  foreach v in quantile `reg_vars' _cons {
   local i = `i'+1
   di "`i' - `v'"
  }
  mata:A


  di _newline "reg `1' `reg_vars'"
  reg `1' `reg_vars'
  
  di "The upper limit on explainable variance is `upperlimit'"


end




*******************************************************
*                  Command Example                    *


use http://www.ats.ucla.edu/stat/stata/dae/crime, clear
vlook crime

Friday, April 27, 2012

Easy recode of all SPSS or SAS missing values into Stata

* Replaces observations that match the value after the == sign (-9999) with Stata missing (.)


foreach v of varlist * {
    capture replace `v'=. if `v'==-9999
}
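* If the raw data use several missing codes (say -9999 and -8888; hypothetical
* codes for illustration), the same loop works with inlist:

foreach v of varlist * {
    capture replace `v'=. if inlist(`v', -9999, -8888)
}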

Program to recode all missing in SPSS to Stata missing

* Stata code
* Converts all values that represent missing (such as -9999) to Stata missing values (.)

* Drops the program if it is already in the memory.
cap program drop recode_missing


* Defines a new program called recode_missing.
program define recode_missing

  local variable_changed = 0
  local variable_skipped_count = 0
  local variable_skipped


  * Loops through all variables identified by the first argument in the command.
  foreach v of varlist `1' {
   * Replaces the variable observations which match the second argument.
    capture replace `v'=. if `v'==`2'
   if _rc==0 local variable_changed = `variable_changed' + 1
   if _rc!=0 { 
      local variable_skipped_count = `variable_skipped_count' + 1
      local variable_skipped `variable_skipped' "`v'" 
   }
  }


  di "Recoded observations from `1' which equalled `2' into stata missing"
  di "# of variables changed: `variable_changed' "
  if `variable_skipped_count'!=0 {
 di "# of variables skipped: `variable_skipped_count' "
 di "skipped:  `variable_skipped'"
  }
end


* Example syntax
* . recode_missing * -9999
* This command will recode all values of -9999, in every variable, into Stata missing values (".").






***********************************************************
*          example of the command  in action              * 
use http://www.ats.ucla.edu/stat/stata/dae/logit.dta, clear
gen test="adsf"


* It does not make logical sense to recode 0s as missing; this is just to demonstrate the command.
recode_missing * 0


* Recoded observations from * which equaled 0 into stata missing
* # of variables changed: 4 
* # of variables skipped: 1 
* skipped:  test