In an experiment involving social policy, a question has arisen as to what statistical method to use. The outcome is 0/1 -- let's say, whether a criminal defendant shows up for trial or not. Then there are a control group and 4 treatment groups (thus, 5 indicator variables showing which group each defendant is in). So the point is to predict the 0/1 outcome based on which treatment group the defendant is in.

One possibility is to use ANOVA, but I'm not sure of that due to the assumption that the outcome is normally distributed, which an indicator variable is not.

I'd like to simulate a dataset such that the outcome (0/1) is as follows: Control group, 20% chance of not showing up (label that as 1, so 20% of the controls should show 1 in the outcome column and 80% should show zeros); treatment group 1, 18% chance of not showing up), treatment group 2, 16% chance of not showing up; treatment group 3, 14% chance of not showing up; and treatment group 4, 12% chance of not showing up. Then I'd like to be able to compare the results from ANOVA as well as a permutation test's

p-values.

Stata Code:

* Define a program called sampler to generate our data and run our tests.
cap program drop sampler
program define sampler , rclass
* The first argument of the sampler command is the sample
* size to be generated.
gl samp_size = `1'
* Start generating data
**clear**
set **obs** 5
* A token observation for each group defined by treatment status
**gen** treat = _n
* Generate the numbers in each group size
**gen** grp_size = ${samp_size} * ${ntreat1} if treat==1
replace grp_size = ${samp_size} * ${ntreat2} if treat==2
replace grp_size = ${samp_size} * ${ntreat3} if treat==3
replace grp_size = ${samp_size} * ${ntreat4} if treat==4
replace grp_size = ${samp_size} * ${ntreat5} if treat==5
tab treat, **gen**(treat)
* Generate a variable to store the true probability that a defendent
* will show up for trial.
**gen** p_show_up = ${treat1_trial} if treat==1
replace p_show_up = ${treat2_trial} if treat==2
replace p_show_up = ${treat3_trial} if treat==3
replace p_show_up = ${treat4_trial} if treat==4
replace p_show_up = ${treat5_trial} if treat==5
* Now let's **expand** our token observations to a full data set
**expand** grp_size
* Now we have the individuals make the actual decisions to show up
**gen** show_up = rbinomial(1,p_show_up)
* We can **do** an simple anova by using a multiple regression
**reg** show_up treat2-treat5
**test** treat2=treat3=treat4=treat5=0
return scalar F05 = `=r(p)<.05'
return scalar F10 = `=r(p)<.1'
**test** treat2=0
return scalar t2_05 = `=r(p)<.05'
return scalar t2_1 = `=r(p)<.1'
**test** treat3=0
return scalar t3_05 = `=r(p)<.05'
return scalar t3_1 = `=r(p)<.1'
**test** treat4=0
return scalar t4_05 = `=r(p)<.05'
return scalar t4_1 = `=r(p)<.1'
**test** treat5=0
return scalar t5_05 = `=r(p)<.05'
return scalar t5_1 = `=r(p)<.1'
**end**
* Define a program for defining an exact statistics for false rejection by
* randomly sorting the dependent variable.
cap program drop perm
program define perm, rclass
* Perm takes three arguments
* command: the command to be run after the dependent variables
* have been scrambled
* dep_var: a list of dependent variables to be scrambled
* tests: a list of tests to be run after running the command
**preserve**
syntax [anything], COmmand(string) DEp_var(varlist) tests(string)
foreach v in `dep_var' {
tempvar sorter orderer
**gen** `sorter' = rnormal()
**gen** `orderer' = _n
sort `sorter'
replace `v' = `v'[`orderer']
di "reorder `v'"
}
`command'
local t=1
gl return_list
foreach v in `tests' {
**test** `v'
return scalar t`t'_05 = `=r(p)<.05'
return scalar t`t'_10 = `=r(p)<.1'
gl return_list ${return_list} t`t'_05=r(t`t'_05) t`t'_10=r(t`t'_10)
local t=1+`t'
}
**restore**
**end**
* Begin Estimation
* Treatment one is the control group
* Define the fraction of subjects in each group
gl ntreat1 = 1/5
gl ntreat2 = 1/5
gl ntreat3 = 1/5
gl ntreat4 = 1/5
gl ntreat5 = 1/5
* Probability that a criminal defendent shows up for trial
gl treat1_trial = .20
gl treat2_trial = .18
gl treat3_trial = .16
gl treat4_trial = .14
gl treat5_trial = .12
sampler 5500
* Let's see how the permutation test works
perm, co(**reg** show_up treat2-treat5) de(show_up) ///
tests("treat2=treat3=treat4=treat5=0" "treat2=0" ///
"treat3=0" "treat4=0" "treat5=0")
simulate ${return_list}, **rep**(10000): ///
perm, co(**reg** show_up treat2-treat5) de(show_up) ///
tests("treat2=treat3=treat4=treat5=0" "treat2=0" ///
"treat3=0" "treat4=0" "treat5=0")
sum
/* Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
t1_05 | 10000 .0507 .2193954 0 1
t1_10 | 10000 .1002 .3002815 0 1
t2_05 | 10000 .0524 .2228435 0 1
t2_10 | 10000 .1015 .3020048 0 1
t3_05 | 10000 .054 .226029 0 1
-------------+--------------------------------------------------------
t3_10 | 10000 .1049 .3064398 0 1
t4_05 | 10000 .0519 .2218362 0 1
t4_10 | 10000 .1008 .3010788 0 1
t5_05 | 10000 .0514 .2208233 0 1
t5_10 | 10000 .1018 .3024002 0 1
These numbers are looking pretty good. I would suspect any bias in rejection
rates is very small from these estimates. */
gl save_values F05=r(F05) F10=r(F10) t2_05=r(t2_05) t2_1=r(t2_1) ///
t3_05=r(t3_05) t3_1=r(t3_1) t4_05=r(t4_05) ///
t4_1=r(t4_1) t5_05=r(t5_05) ///
b2=_b[treat2] b3=_b[treat3] b4=_b[treat4] b5=_b[treat5]
* Now let's use a Monte Carlo resampler to see how frequently we reject the null.
simulate ${save_values} , **rep**(5000): sampler 5500
sum
/* Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
F05 | 5000 .999 .0316101 0 1
F10 | 5000 .9994 .02449 0 1
t2_05 | 5000 .2672 .4425419 0 1
t2_1 | 5000 .381 .4856811 0 1
t3_05 | 5000 .7202 .4489457 0 1
-------------+--------------------------------------------------------
t3_1 | 5000 .8114 .3912297 0 1
t4_05 | 5000 .9686 .1744137 0 1
t4_1 | 5000 .9846 .1231498 0 1
t5_05 | 5000 .9994 .02449 0 1
b2 | 5000 -.0200947 .0163725 -.0754545 .0527273
-------------+--------------------------------------------------------
b3 | 5000 -.0399916 .0161132 -.1045455 .0254545
b4 | 5000 -.0601151 .0158372 -.1154545 -.0045455
b5 | 5000 -.0801233 .0155707 -.1354546 -.0254545
Looking at the estiamates of the coeficients b2-b5 the estimates seem
to be unbiased since their divergence from -.02, -.04, -.06, and -.08
is very small. However, we can **test** this as well.
*/
**reg** b2
**test** _cons==-.02
**reg** b3
**test** _cons==-.04
**reg** b4
**test** _cons==-.06
**reg** b5
**test** _cons==-.08
* All of the p values are well above the 10% rejection rate.
Formatted By EconometricsbySimulation.com