Monday, February 11, 2013
Non-Parametric Regression Discontinuity
* I recently attended an interesting seminar by Matias Cattaneo from the University of Michigan.
* He was presenting some of his work on non-parametric regression discontinuity design.
* The conclusions of the paper were interesting, but even more interesting was a release by him and coauthors of a Stata package that makes RD designs easy to implement:
* net install rdrobust, from(http://www-personal.umich.edu/~cattaneo/rdrobust) replace
* Regression discontinuity is a technique which allows identification of a localized effect around a natural or structured policy discontinuity.
* For instance, if you are wondering what effect federal grants have on college attendance, then you may be concerned that simply comparing students who are eligible for federal grants with those who are not will be problematic, because students who are eligible (low income) may differ from those who are not eligible (not low income).
* The RD argument is that if individuals do not, in response to the grant being available, move their reported income level to become eligible for the grant, then those who are near the cut-off and those just past it will be fundamentally very similar.
* This may occur if for instance the income cut-off for the grant is unknown.
* So even if students are systematically under-reporting their income, they are doing so unaware of the actual cut-off, so students sufficiently close to it, above and below, are arguably "the same" or drawn from the same pool, except that one group received the program and the other did not.
* A previous post deals with the case of assuming a linear structure in the underlying characteristics:
* http://www.econometricsbysimulation.com/2012/08/sharp-regression-discontinuity-example.html
* However, the more interesting case (potentially) may be when we assume a nonlinear response to the income in our dependent variable.
* But before going there let's think about what this method boils down to.
* Like all identification methods in statistics or econometrics without experimental data, identification of an effect is driven by some exogeneity argument.
* That is, x causes y and is unrelated to u (the error). When x may be correlated with the error, using an exogenous variable to force movement in the variable of interest may be sufficient to identify a causal effect.
* In this case, clearly it is not enough to simply see what the average y response (GPA, attendance, graduation rates, whatever) is to a change in grant level because those who receive the grants are systematically different from those who do not.
* However, because the cut off for receiving the grant is unknown, around the cut off the two samples who receive the grant and who do not can arguably be considered the same.
* Thus, we could say that the unknown position of the cut off is the random exogenous variable which near the cut off forces some students into the group that receives the grant and some students into the group that does not.
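* The identification logic above can be sketched quickly outside of Stata as well. The following Python snippet is a toy illustration (with a made-up data generating process and an arbitrary bandwidth of 0.2, not a formal bandwidth selector): fit a separate line on each side of the cut-off and estimate the jump at the cut-off.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
income = rng.uniform(0, 1, n)
grant = (income < 0.5).astype(float)            # eligible below the cut-off
y = 2 * income + 0.5 * grant + rng.normal(0, 0.3, n)

def rd_jump(x, y, cutoff=0.5, h=0.2):
    """Sharp-RD jump estimate: local linear fit on each side of the cut-off."""
    left = (x >= cutoff - h) & (x < cutoff)
    right = (x >= cutoff) & (x <= cutoff + h)
    b_left = np.polyfit(x[left], y[left], 1)     # line fitted below the cut-off
    b_right = np.polyfit(x[right], y[right], 1)  # line fitted above the cut-off
    return np.polyval(b_left, cutoff) - np.polyval(b_right, cutoff)

print(rd_jump(income, y))   # close to the true effect of .5
```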
* Let's imagine some non-parametric relationship between income and performance:
clear
set obs 10000
gen income = 3^((runiform()-.75)*4)
label var income "Reported Income"
sum income
gen perf0 = ln(income) + sin((income-r(min))/r(max)*4*_pi)/3 + 3
label var perf0 "Performance Index - Base"
scatter perf0 income
* Looks pretty non-parametric
* Let's add in some random noise
gen perf1 = perf0 + rnormal()*.5
label var perf1 "Performance Index - with noise"
scatter perf1 income
* Using the user written command rcspline, we can see the local average performance as a function of income.
* ssc install rcspline
rcspline perf1 income, nknots(7) showknots title(Cubic Spline)
* I specify "7" knots, which is the maximum allowed by the rcspline command.
* The spline seems to fit the generated data well.
* Now let's add a discontinuity at .5.
gen grant = income<.5
sum grant
* So about 60% of our sample is eligible for the grant.
* Now let's add the grant effect.
* First let's generate an income variable that is centered at the grant cut point.
gen income_center = income-.5
gen perf2 = perf1 + .5*grant - .1*income_center*grant
* Thus the grant is more effective for students with lower income.
label var perf2 "Observed Performance"
**** Simulation done: Estimation Start ****
rcspline perf2 income, knots(.15 .25 .35 .37 .4 .45 .5 .55 .6 .65 .75 .85 1.1 1.25 1.5) title(Cubic Spline)
* This is obviously not the ideal plot and I have had some difficulty finding a command which will generate the plot that I would like.
* However, we can see that there does appear to be "something" going on.
reg perf2 income grant
* We can see that our initial estimate of the effect of the grant is entirely wrong.
* It appears so far that the effect of the grant on performance is actually hindering performance (which we know is false).
* Now, let's try our new command rdrobust
rdrobust perf2 income_center
* The default cut point is at 0, thus using income_center works.
* Though this estimate is negative, and thus seems the reverse of what we would expect, the command is actually working quite well.
* That is because regression discontinuity estimates the effect of the discontinuity on the outcome under the default assumption that treatment switches on as the forcing variable crosses the cut-off from below.
* In this case, however, crossing the cut-off drives the grant to zero.
* Thus we must flip the sign of the RD estimate to recover the true effect in this case.
* Alternatively, we could switch the sign of income.
gen nincome_center = income_center*(-1)
rdrobust perf2 nincome_center
* rdrobust is a newly released command that has some extra bells and whistles that other regression discontinuity commands lack, as well as some oddities.
* I would suggest also looking at the more established user-written command rd (ssc install rd).
rd perf2 nincome_center
* This command is nice because it estimates many bandwidths through the mbw option.
* The default mbw is "100 50 200" which means, use the 100 MSE (mean squared error) minimizing bandwidth, half of it and twice it.
* We can plot our estimates of the treatment effect using a range of bandwidths.
gen effect_est = .
label var effect_est "Estimated Effect"
gen band_scale = .
label var band_scale "Bandwidth as a Scale Factor of Bandwidth that Minimizes MSE"
forv i = 1/16 {
rd perf2 nincome_center, mbw(100 `=`i'*25')
if `i' ~= 4 replace effect_est = _b[lwald`=`i'*25'] if _n==`i'
if `i' == 4 replace effect_est = _b[lwald] if _n==`i'
replace band_scale = `=`i'*25' if _n==`i'
}
gen true_effect = .5
label var true_effect "True effect"
two (scatter effect_est band_scale) (line true_effect band_scale)
* We can see around the 100% MSE bandwidth estimates are fairly steady though they dip a tiny bit.
Thursday, February 7, 2013
2SLS with multiple endogenous variables
* I am wondering whether, when using 2SLS, you must use a multivariate OLS in the reduced form, or whether you can run a separate first stage for each endogenous variable.
* Let's see!
clear
set obs 10000
* First generate the instruments
gen z1 = rnormal()
gen z2 = rnormal()
* Now the error. u1 and u2 are the parts of w1 and w2 correlated with the error u.
gen u1 = rnormal()
gen u2 = rnormal()
gen u3 = rnormal()*3
gen w1 = 1*z1 + .5*z2 + rnormal() + 2*u1
gen w2 = -.5*z1 + 1.5*z2 + rnormal() - 2*u2 + .2*u1
gen u = u1 + u2 + u3
gen y = 4 + w1*1 + w2*1 + u
* We can see because u is correlated with w1 and w2, OLS will be biased and inconsistent.
reg y w1 w2
* Instrumental variable regression could solve this problem
ivreg y (w1 w2 = z1 z2)
* Let's see about our 2SLS (which should be the same as IV)
reg w1 z1 z2
predict w1_hat
reg w2 z1 z2
predict w2_hat
reg y w1_hat w2_hat
* It seems that 2SLS using separate first-stage regressions produces the same coefficients.
* This is probably because, as with seemingly unrelated regressions, when every equation has the same regressors the system coefficients equal the equation-by-equation OLS estimates; only the standard errors can differ through cross-equation correlations (if I recall correctly).
* Note, however, that the standard errors from the manual second stage are not correct; ivreg computes the proper ones.
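* The coefficient equivalence can also be verified directly with linear algebra. Here is a minimal Python sketch mirroring the simulation above (coefficients only -- as noted, manual second-stage standard errors would not be correct):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # const, z1, z2
u1, u2 = rng.normal(size=n), rng.normal(size=n)
w1 = z[:, 1] + 0.5 * z[:, 2] + rng.normal(size=n) + 2 * u1
w2 = -0.5 * z[:, 1] + 1.5 * z[:, 2] + rng.normal(size=n) - 2 * u2
y = 4 + w1 + w2 + u1 + u2 + rng.normal(size=n) * 3

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stages run one endogenous variable at a time:
w1_hat = z @ ols(z, w1)
w2_hat = z @ ols(z, w2)
b_sep = ols(np.column_stack([np.ones(n), w1_hat, w2_hat]), y)

# Joint projection of all the endogenous variables on the instruments:
W = np.column_stack([np.ones(n), w1, w2])
b_joint = ols(z @ ols(z, W), y)

print(np.allclose(b_sep, b_joint))   # True: the coefficients are identical
```

The equivalence holds because projecting several left-hand-side variables on the same instruments jointly is exactly the same as projecting them one at a time.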
Wednesday, February 6, 2013
Non-Parametric PDF Fit Test
* This is an idea that I decided to explore before inspecting how others have addressed the problem.
* As noted in my previous post, we cannot use standard reasoning based on independent draws in order to test model fit.
* The following command will simulate random draws from the distribution being tested and see how closely they fit exact pdf quantiles.
* It will then use that information to test whether we can reject the null that the observed distribution is the same as the null distribution.
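* The logic can be sketched compactly in Python (a toy re-implementation of the idea, using only the normal null; the statistic here is a mean squared Q-Q distance rather than the regression-based t used below):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
nd = NormalDist()

def fit_stat(x):
    """Mean squared distance between the standardized order statistics and
    the normal quantiles at plotting positions (i - .5)/N."""
    z = np.sort((x - x.mean()) / x.std(ddof=1))
    n = len(x)
    q = np.array([nd.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])
    return np.mean((z - q) ** 2)

def simulated_p(x, reps=200):
    """One-sided p-value: the observed statistic against its simulated null."""
    t_obs = fit_stat(x)
    t_null = np.array([fit_stat(rng.normal(size=len(x))) for _ in range(reps)])
    return np.mean(t_null >= t_obs)

print(simulated_p(rng.normal(size=1000)))    # normal data: p roughly uniform
print(simulated_p(rng.uniform(size=1000)))   # uniform data: typically small
```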
cap: program drop dist_check
program define dist_check, rclass
* The syntax command parses the input into useful macros used for later calculations.
syntax varlist, Tgenerate(numlist >= 100) [Dist(string)]
if "`dist'"=="" local dist="normal"
if "`dist'"=="normal" di "Testing if `varlist' is normally distributed"
if "`dist'"=="uniform" di "Testing if `varlist' is uniformly distributed"
if "`dist'"=="poisson" di "Testing if `varlist' is Poisson distributed"
di "Simulate `tgenerate' draws"
* Construct a t-vector in Mata to store estimation results.
mata: tdist = J(`tgenerate', 1, .)
*******
* This will randomly draw from the distribution being tested a number of times equal to tgenerate.
*******
preserve
qui: drop if `varlist' == .
forv i=1(1)`tgenerate' {
tempvar x zx z p d
if "`dist'"=="normal" qui: gen `x' = rnormal()
if "`dist'"=="poisson" {
qui: sum `varlist'
qui: gen `x' = rpoisson(r(mean))
}
if "`dist'"=="uniform" qui: gen `x' = runiform()
sort `x'
qui: gen `p' = (_n-.5)/_N
if "`dist'"=="normal" {
qui: egen `zx' = std(`x')
qui: gen `z' = invnormal(`p')
qui: gen `d' = (`z'-`zx')^2
}
if "`dist'"=="poisson" {
qui: sum `x'
qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
* Stata's invpoisson() is parameterized in terms of the upper tail, so 1-p is necessary.
qui: gen `d' = (`z'-`x')^2
}
if "`dist'"=="uniform" {
qui: gen `z' = _n/_N
qui: gen `d' = (`z'-`x')^2
}
qui: reg `d'
local t = _b[_cons]/_se[_cons]
mata: tdist[`i', 1]=`t'
drop `x' `z' `p' `d'
cap drop `zx'
}
* From the above loop we should have a vector of t values.
* We can use that vector to construct confidence intervals, using the t-values observed when the null is true as cut-off points.
mata: tsorted = sort(tdist, 1)
* One tailed test
local c90 = floor(`tgenerate'*.90)+1
mata: st_local("t90", strofreal(tsorted[`c90']))
local c95 = floor(`tgenerate'*.95)+1
mata: st_local("t95", strofreal(tsorted[`c95']))
local c99 = floor(`tgenerate'*.99)+1
mata: st_local("t99", strofreal(tsorted[`c99']))
* Two Tailed
* 90% CI
local c90U = floor((`tgenerate'+.5)*.95)+1
mata: st_local("t90U", strofreal(tsorted[`c90U']))
local c90L = ceil((`tgenerate'+.5)*.05)-1
mata: st_local("t90L", strofreal(tsorted[`c90L']))
* 95% CI
local c95U = floor((`tgenerate'+.5)*.975)+1
mata: st_local("t95U", strofreal(tsorted[`c95U']))
local c95L = ceil((`tgenerate'+.5)*.025)-1
mata: st_local("t95L", strofreal(tsorted[`c95L']))
* 99% CI
local c99U = floor((`tgenerate'+.5)*.99)+1
mata: st_local("t99U", strofreal(tsorted[`c99U']))
local c99L = ceil((`tgenerate'+.5)*.01)-1
mata: st_local("t99L", strofreal(tsorted[`c99L']))
** Now we do the estimation
tempvar x zx z p d
qui: gen `x' = `varlist'
sort `x'
qui: gen `p' = (_n-.5)/_N
* We transform the data in different ways depending upon what distribution we are assuming.
if "`dist'"=="normal" {
qui: egen `zx' = std(`x')
qui: gen `z' = invnormal(`p')
qui: gen `d' = (`z'-`zx')^2
}
if "`dist'"=="poisson" {
qui: sum `x'
qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
qui: gen `d' = (`z'-`x')^2
}
if "`dist'"=="uniform" {
qui: sum `x'
qui: gen `z' = (`x'-r(min))/(r(max)-r(min))
qui: gen `d' = (`z'-`x')^2
}
* This is the regression of interest.
qui: reg `d'
local t = _b[_cons]/_se[_cons]
* Now we compare our t with the ts that were drawn when random variables were drawn from our distribution.
di _newline "Estimated t: `: di %9.3f `t''" _newline
di "One-way Analysis"
di " CI (1%) : 0.000 to `: di %9.3f `t99''"
di " CI (5%) : 0.000 to `: di %9.3f `t95''"
di " CI (10%): 0.000 to `: di %9.3f `t90''"
mata: psearch = abs(tsorted:-`t')
mata: a = 1::`tgenerate'
mata: b = psearch :== min(psearch)
mata: st_local("position", strofreal(sum(b:*a)))
local p1 = 1-`position'/`tgenerate'
local p2 = min(`position'/`tgenerate',1-`position'/`tgenerate')*2
di "One-sided p:`: di %9.3f `p1'' (`position' out of `tgenerate')"
di _newline "Two-way Analysis"
di " CI (1%): `: di %9.3f `t99L'' to `: di %9.3f `t99U''"
di " CI (5%): `: di %9.3f `t95L'' to `: di %9.3f `t95U''"
di " CI (10%): `: di %9.3f `t90L'' to `: di %9.3f `t90U''"
di "Two-sided p: `: di %9.3f `p2''"
return scalar p1 = `p1'
return scalar p2 = `p2'
restore
end
* Let's see how this works on a sample draw.
clear
set obs 1000
gen x = rnormal()+1
dist_check x, t(100) dist(poisson)
* Interestingly, some observed data can fit the null pdf "better" than draws from the null distribution themselves.
* Thus the estimated t is smaller for some distributions.
* For this reason I have included a two sided confidence interval.
* The following small program will be used to construct a Monte Carlo simulation to see how well the dist_check command is working.
cap program drop checker
program define checker
syntax [anything], obs(numlist > 0) rdraw(string) reps(numlist > 0) dist(string)
clear
set obs `obs'
gen x = `rdraw'
dist_check x, t(`reps') dist(`dist')
end
* Easily generates data
checker , obs(1000) rdraw(runiform()*30) reps(200) dist(normal)
* Now let's see it in action
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(50) rdraw(rnormal()*30) reps(100) dist(normal)
* With the above simulation, almost by definition, we know the test is sized properly, but it serves as a reference.
* We should reject at the 10% level for both one sided and two sided confidence intervals.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1
sum
* mean r1_10 and mean r2_10 are rejection rates for the one-tailed and two-tailed test respectively.
* Because the distribution truly is the null, we reject the null only about 10% of the time at the 10% level.
* This looks great.
* Now let's see about the test's power to reject the null when the true distribution is not the null.
checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* This distribution is clearly not uniform.
hist x
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* The following simulation estimates the power of the test against this normal alternative.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1
sum
checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* This distribution is almost uniform
hist x
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* Again we compute rejection rates, this time to gauge power.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1
sum
* I think there might be some slight errors. I hope this post is useful though I would not recommend using this particular command as there are more effective and better established commands already developed for testing distribution fit assumptions.
Tuesday, February 5, 2013
Test of PDF Fit - Does not work
* I came up with this idea for testing whether an observed distribution matches a hypothesized one.
* The idea is to standardize the data of interest and then, after ordering the data, test the difference between the empirical distribution of the data and the null distribution.
* The following program will sort the data and construct a pdf of our null (the normal in this case).
cap: program drop normal_check
program define normal_check
* Preserve the input data
preserve
* Sort from smallest to largest
sort x
* Standardize x
egen xnew = std(x)
* Create a CDF giving each observation's probability under the null.
gen p = (_n-.5)/_N
* Create a pdf projection of what each bin's expected value is under the null.
gen targ_x = invnormal(p)
* Calculate the difference (which is what we will be testing).
gen dif_norm = xnew-targ_x
* Regress the difference on a constant
reg dif_norm
* If the option "graph" was specified then graph the hypothetical and the observed densities.
if "`1'" == "graph" twoway kdensity xnew || kdensity x
* Restore the data to how it was before initiating the program
restore
end
* Now let's generate some sample data
clear
set obs 100
gen x = rnormal()
normal_check graph
* We can see that our PDF fitness test has failed.
* Perhaps if we bootstrap the process?
bs: normal_check
* We reject the null (though we know it is true).
* Nope. Why not?
* The sorting mechanism is a violation of the independence assumption.
* We need to generate alternative test statistics that will adjust for non-random sorting.
* My post tomorrow will demonstrate a response to this post which is generally effective at detecting underlying distributions.
* Distribution detection however is nothing new in statistics.
* Look up the ksmirnov command. There is also Anderson-Darling and Chi-Squared options though I do not know what is coded in Stata or what their syntax looks like.
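* For reference, the Kolmogorov-Smirnov idea is also available in Python via scipy.stats.kstest (a quick sketch; note that the classical KS p-value is only exact when the null parameters are fully specified in advance rather than estimated -- Lilliefors' test corrects for estimation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=1, scale=2, size=500)

# Against a fully specified null (standard normal): clearly rejected.
print(stats.kstest(x, "norm").pvalue)
# With plugged-in estimated parameters the statistic is small, but the
# classical p-value is then conservative (Lilliefors corrects this).
print(stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))).pvalue)
```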
Monday, February 4, 2013
Potential IV Challenges even with RCTs
* You design an RCT with a single experimental binary treatment.
* But you think your instrument has the power to influence multiple endogenous response variables.
* We know we cannot use a single instrument to instrument for more than one variable jointly.
* Does it still work as a valid instrument if it affects other variables that can in turn affect our outcome?
clear
set obs 1000
gen z = rbinomial(1,.5)
gen end1 = rnormal()
gen end2 = rnormal()
* end1 and end2 are the endogenous component of x1 and x2 which will bias our results if we are not careful.
gen x1 = rnormal()+z+end1
gen x2 = rnormal()+z+end2
gen u = rnormal()*3
gen y = x1 + x2 + end1 + end2 + u
reg y x1 x2
* Clearly the OLS estimator is biased.
ivreg y (x1=z)
ivreg y (x2=z)
* Trying to use the IV individually only makes things worse.
ivreg y x1 (x2=z)
* Trying to control for x1 does not help.
* So what do we do?
* If we have control over how the instrument is constructed, then we should think things through carefully, making sure that if we are interested in the effect of x1 on y, our instrument has a direct effect only on x1 and no additional effect on x2.
* Unfortunately, this is not always feasible.
* Another consideration we should take into account is potentially variable responses to the instrument.
* For instance:
clear
set obs 1000
gen z = rbinomial(1,.5)
gen end1 = rnormal()
* end1 is the endogenous component of x1, which will bias our results if we are not careful.
gen g1 = rnormal()
gen grc = g1+1
label var grc "Random Coefficient on the Instrument"
gen x1 = rnormal()+z*grc+(end1)
gen u = rnormal()*3
gen brc = g1+1
label var brc "Random Coefficient on the Endogenous Variable"
gen y = x1*brc + end1 + u
reg y x1
* OLS is still biased.
ivreg y (x1=z)
* Unfortunately, now so is IV.
* This is because, though our instrument is uncorrelated with our errors, the response to the instrument is variable, and that response may also be correlated with a variable response on the variable of interest.
* Thus, though corr(u,z)=0 and corr(x,z)!=0, we can still have problems.
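* The probability limit of the IV estimator in this design can be worked out and checked numerically. Here is a Python sketch of the same simulation, using the Wald ratio cov(y,z)/cov(x1,z), which is the IV estimator for a binary instrument:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200000
z = rng.binomial(1, 0.5, n).astype(float)
end1 = rng.normal(size=n)
g1 = rng.normal(size=n)
grc = g1 + 1                     # random coefficient on the instrument
brc = g1 + 1                     # random coefficient on x1
x1 = rng.normal(size=n) + z * grc + end1
y = x1 * brc + end1 + rng.normal(size=n) * 3

# Wald/IV estimator with a binary instrument:
b_iv = np.cov(y, z)[0, 1] / np.cov(x1, z)[0, 1]
print(b_iv)   # converges to E[grc*brc]/E[grc] = 2, not the average effect E[brc] = 1
```

Because grc and brc share the component g1, the IV estimand is weighted toward individuals who respond strongly to the instrument, which here pushes it to 2 rather than the average effect of 1.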
* This particular problem is actually something that I am working on with Jeffrey Wooldridge and Yali Wang.
Saturday, February 2, 2013
Chamberlain Mundlak Device and the Cluster Robust Hausman Test
* Unobserved variation can be divided in a useful manner.
* y_it = X_it*B + c_i + u_it
* c_i is a fixed individual effect if it is correlated with X_i (the time averages of X_it), or a random individual effect if it is uncorrelated.
* For c_i to bias B there must be some correlation between X_it and c_i.
* Therefore, if we create a new variable which "controls for" any time-constant variation in X_it, then the remaining individual effect must be uncorrelated with X_it.
* Thus the Chamberlain Mundlak Device was born.
* Let's see it in action!
clear
set obs 100
gen id = _n
gen A1 = rnormal()
gen A2 = rnormal()
* Let's say we have 2 observations per individual
expand 2
gen x = rnormal()+A1
gen u = rnormal()*3
gen y = -5 +2*x + A1 + A2 + u
* Now in the above model there is both a portion of the unobserved individual-level variance correlated with the average x (A1) and a random portion uncorrelated with the average x (A2).
* The fixed effect varies with the x variable while the random one does not.
* The standard approach in this case would be to use the Hausman test to differentiate between fixed effect and random effect models.
xtset id
* Let's first set id as the panel data identifier.
xtreg y x, fe
estimates store fe
* We store the estimates for use in the Hausman test
xtreg y x, re
hausman fe, sigmamore
* We strongly reject the null which we should expect so in classical econometric reasoning we choose to use the fixed effect estimator.
* An alternative method of estimating the fe estimator is by constructing the Chamberlain-Mundlak device.
* This device exploits the fact that the portion of the individual effect correlated with X can only be correlated with it through the time average of X for each individual.
bysort id: egen x_bar = mean(x)
reg y x x_bar
* Amazingly we can see that the new estimator is the same as the fe estimator above.
* Notice however the degrees of freedom.
* In the fe estimator we have used up half of our degrees of freedom.
* Yet our x estimate is the same size and our standard errors are very similar?
* If we double our sample size should not our standard errors decrease substantially?
* The answer is no. Why?
* I am going to run the fixed effect estimator manually.
reg y x i.id
* Look at our SSE or the R2. In the fixed effect model the R2 is much larger.
* This is because, in terms of the random effect (A2), the fixed effect model controls for both the portion of the unobserved individual-level variance which is correlated with the average x for each individual as well as the portion which is uncorrelated with the average x.
* The Chamberlain-Mundlak (CM) device, however, only controls for the portion of the variance correlated with the average Xs. Thus there is more unexplained variance, which would reduce precision, though this is approximately offset once the degrees of freedom are taken into account.
* The CM can be additionally useful because it provides an alternative form of the Hausman test.
reg y x x_bar
* The significance of the generated regressor x_bar indicates correlation between the unobserved individual effect and the regressors, i.e. it rejects the random effects (exogeneity) assumption.
* The test can be easily adjusted to be robust to cluster effects by specifying cluster in the regression.
reg y x x_bar, cluster(id)
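* The same equivalence can be checked quickly in Python (a toy version of the simulation above; the exact equality of the Mundlak coefficient and the within estimator is a standard result for balanced panels):

```python
import numpy as np

rng = np.random.default_rng(5)
n_id, t = 500, 2
ids = np.repeat(np.arange(n_id), t)
a1 = np.repeat(rng.normal(size=n_id), t)   # effect correlated with x
a2 = np.repeat(rng.normal(size=n_id), t)   # effect uncorrelated with x
x = rng.normal(size=n_id * t) + a1
y = -5 + 2 * x + a1 + a2 + rng.normal(size=n_id * t) * 3

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# The Mundlak regressor: the time average of x for each individual.
x_bar = (np.bincount(ids, weights=x) / t)[ids]
b_cm = ols(np.column_stack([np.ones_like(x), x, x_bar]), y)[1]

# The within (fixed effects) estimator: demean by individual, then OLS.
y_bar = (np.bincount(ids, weights=y) / t)[ids]
b_fe = ols((x - x_bar).reshape(-1, 1), y - y_bar)[0]

print(np.isclose(b_cm, b_fe))   # True: identical coefficients on x
```

By the Frisch-Waugh-Lovell theorem, residualizing x on its group mean leaves exactly the within deviation x - x_bar, which is why the two coefficients coincide.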
Friday, February 1, 2013
Macro Parsing
Do file
* In Stata, when programming commands, we often want to be able to develop our own syntax for our own commands.
* The command "syntax" is good at providing a method for parsing command inputs into easily managed forms.
* However, sometimes we may want syntax options not permitted by the command "syntax".
* If we were to program our own instrumental variable regression that looked like Stata's these limitations might cause problems.
* myivreg y x1 x2 (w1 w2 w3 = z1 z2 z3)
* This is one method how this kind of parsing could be accomplished using the gettoken command.
* Gettoken parses or breaks a macro when a certain character or set of characters is input.
* Let's see it in action
local example1 abc def ghi,jkl
di "`example1'"
* We can see that the local example1 is a disorganized string
* We can separate it into before and after the space with the following commands.
gettoken before after: example1
di "`before'"
di "`after'"
* The default parse character is " "
* We can specify an alternative parse character or set of characters using the parse option.
gettoken before after: example1, parse(",")
di "`before'"
di "`after'"
* Notice that the parse character is included in the after parse
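* As a Python aside, the same pattern is available there through str.partition, which splits at the first occurrence of a separator. Unlike gettoken, partition returns the parse character in its own slot instead of leaving it attached to the remainder:

```python
example1 = "abc def ghi,jkl"

# Split at the first space (gettoken's default parse character):
before, sep, after = example1.partition(" ")
print(before)   # abc
print(after)    # def ghi,jkl

# Split at the first comma:
before, sep, after = example1.partition(",")
print(before)   # abc def ghi
print(after)    # jkl
```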
* Let's see gettoken in: myivreg y x1 x2 (w1 w2 w3 = z1 z2 z3)
cap program drop myivreg
program define myivreg
di "The local 0 contains the entire command line: `0'"
gettoken main options: 0 , parse(",")
gettoken dep_exog brackets: 0, parse("(")
gettoken dep_var exog : dep_exog
di "Dependent Variables: `dep_var'"
di "Exogenous Variables: `exog'"
gettoken cntr: brackets, parse(")")
local cntr=subinstr("`cntr'", "(", "", .)
di "Inside brackets: `cntr'"
gettoken endog inst: cntr, parse("=")
local inst=subinstr("`inst'", "=", "", .)
di "Endogenous Variables: `endog'"
di "Instrumental Variables: `inst'"
* At this point we have all of the main pieces of elements we would need to run our ivreg.
ivreg `dep_var' `exog' (`endog' = `inst')
end
* Let's generate some dummy data
clear
set obs 100
* This will generate some strange data:
* x2 -> z3 -> w0 -> z2 -> w1 -> z1 -> w2 -> x1 -> y
* Where -> means "causes". This is not valid data for an IV regression.
foreach v in x2 z3 w0 z2 w1 z1 w2 x1 y {
gen `v' = rnormal() `corr_vars'
local corr_vars ="+`v'"
}
myivreg y (w0=z1)
myivreg y x1 x2 (w0 w1 w2 = z1 z2 z3)
* Everything seems to be working well.
myivreg y x1 x2 (w0 w1 w2 = z1 z2 z3), test
* However, adding the option "test" does not cause an error. Is this a problem? Not necessarily.
* We just have not specified any code to ensure that unused syntax is accounted for.
* Combining the gettoken command with the syntax command can accomplish this.
cap program drop myivreg2
program define myivreg2
di "The local 0 contains the entire command line: `0'"
* This is a new bit of code that ensures that no options will be included
syntax anything
gettoken dep_exog brackets: anything, parse("(")
gettoken dep_var exog : dep_exog
di "Dependent Variables: `dep_var'"
di "Exogenous Variables: `exog'"
gettoken cntr: brackets, parse(")")
local cntr=subinstr("`cntr'", "(", "", .)
di "Inside brackets: `cntr'"
gettoken endog inst: cntr, parse("=")
local inst=subinstr("`inst'", "=", "", .)
di "Endogenous Variables: `endog'"
di "Instrumental Variables: `inst'"
* At this point we have all of the main pieces of elements we would need to run our ivreg.
ivreg `dep_var' `exog' (`endog' = `inst')
end
myivreg2 y x1 x2 (w0 w1 w2 = z1 z2 z3)
myivreg2 y x1 x2 (w0 w1 w2 = z1 z2 z3), test
* Now the unused option is rejected, as intended.