## Friday, August 10, 2012

### Selection and Bias

* Selection bias is often regarded as a severe condition that, once diagnosed, significantly limits the believability of any study.  In this simulation I will show that while selection can bias coefficients, the type and severity of the selection mechanism can produce widely different results.

* Generally speaking, consistent estimation is taken to require the assumption that selection is random.

* In this post I will look at several potential selection mechanisms and examine how each may bias estimates.

* I: Selection is based on one or more explanatory variables.  This might be the case if you were wondering what the returns to years of education were but people with more years of education were less likely to respond to the survey.

clear
set obs 100000
* Set the number of observations that will be generated.

gen x1 = rnormal()
gen x2 = runiform()
gen u = runiform()

gen s = rbinomial(1,1-x2)
* Selection is based on x2.  The more years of education a person has (measured on a 0 to 1 scale), the more likely the person is to opt out of the survey, so the probability of responding is 1-x2.

gen y = 1 + 1.5*x1 + 2*x2 + 10*(u-.5) if s == 1
* y is some income index

reg y x1 x2
* When selection is based on years of education (x2) alone, there is no detectable bias in the estimates.

* This type of selection is called "selection at random" in the statistics literature, because selection is not a source of bias when it is based on observable characteristics.  However, this is a poor name: even when based on observable characteristics, the selection is not random and lacks many of the convenient properties of truly random selection, such as the sample mean of x2 being an unbiased estimator of the population mean of x2.

sum x2 if s==1
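
* As an added check (not part of the original run), compare this with the mean of x2 in the full sample.  Because x2 is uniform on (0,1), the full-sample mean should be close to 0.5, while the mean among the selected observations will not be.
sum x2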

* II: Selection is based on the unobserved shock term u.  Imagine that some random portion of the target population gets hired by private corporations that pay more (thus larger u) but require that employees not respond to unapproved surveys.

clear
set obs 100000

gen x1 = rnormal()
gen x2 = runiform()
gen u = runiform()

gen s = rbinomial(1,1-u)

gen y = 1 + 1.5*x1 + 2*x2 + 10*(u-.5) if s == 1

reg y x1 x2
* Once again there is no bias in the coefficients on x1 and x2; however, the estimate of the constant is biased.  This is because the people who respond to the survey have, on average, a lower expected wage than those who do not respond.

* Thus E(u|s=1) < E(u) = 0.5, so the shock 10*(u-.5) has a negative mean among respondents.
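
* A quick added check of this claim (a sketch using only the variables defined above): with s drawn as rbinomial(1,1-u) and u uniform on (0,1), the mean of u among respondents should be near 1/3 rather than 0.5.
sum u
sum u if s == 1

* Under those assumptions the constant should be shifted by roughly 10*(1/3 - .5), which puts the estimate near the value displayed here instead of the true value of 1.
display 1 + 10*(1/3 - .5)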

* III: The consistency of the results for mechanism I is a bit misleading.  For one, we are assuming that x2 (education) is uncorrelated with the unobserved error u (which captures factors such as ability, family background, or geographic heterogeneity).  All of these factors are likely to cause difficulties when estimating the returns to education.  However, these problems exist even in the absence of selection.  What we are interested in is how selection biases estimates.  If we relax the implicit assumption of constant returns to education, it becomes easier to see how selection can bias these results.

clear
set obs 100000

gen x1 = rnormal()
gen x2 = runiform()
gen u = rnormal()
gen v = runiform()

* Imagine now that there is some unobserved heterogeneity v (hard-workingness) which affects the returns to education.
gen r = (2*v)

sum r
* The expected value of r is 1.

gen y = 1 + 1.5*x1 + 2*x2*r + 10*u

reg y x1 x2
* Without selection, the random coefficient (r) does not bias the average partial effect of x2 on y.
* One important reason is that cor(x2,r)=cov(x2,r)=0 and, using E(r)=1,
*  cov(x2,r) = E(x2*r) - E(x2)E(r) =  E(x2*r) - E(x2) = 0
* => E(x2*r) = E(x2)
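
* These moments can be checked directly in the simulated data (x2r is a helper variable introduced here for illustration only).  The correlation should be near 0 and the two means should be nearly equal.
cor x2 r
gen x2r = x2*r
sum x2 x2r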

gen s = rbinomial(1,v)
* Let's imagine that people who are hard-working are more likely to make the time to answer surveys.

reg y x1 x2 if s == 1
* Selection on v, however, does bias the coefficient on x2.

* This is because E(r|s=1)>1
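
* To see the size of the shift (an added check, assuming s is drawn as rbinomial(1,v) and r = 2*v as above), compare the mean of r with and without selection.  Since E(v|s=1) = E(v^2)/E(v) = 2/3, the selected mean of r should be near 4/3, which implies a coefficient on x2 near 2*(4/3), about 2.67, rather than 2.
sum r
sum r if s == 1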

* IV: Finally, we will examine selection based on the outcome variable y.  Imagine that people who have less y (income) are less willing to respond to the survey.  This will cause selection to be correlated with the explanatory variables x1 and x2 as well as the error u.

clear
set obs 100000

gen x1 = rnormal()
gen x2 = rnormal()
gen u = rnormal()

gen y = 1 + 1.5*x1 + 2*x2 + 7.5*u

sum y

* Now we will create an index of y from 0 to 1 from which selection will occur
gen yp = (y-r(min))/(r(max)-r(min))

gen s = rbinomial(1,yp)

reg y x1 x2 if s == 1
* Selection on y, however, does bias the estimates, even if an instrumental variable exists.

* This is because Corr(u,s)>0 and Corr(x,s)>0.  This alone does not imply Corr(x,u|s)!=0; however, in this case it does hold.

cor u x1 x2 if s == 1
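
* For reference (an added baseline, not in the original run), the same commands on the full sample should show correlations near 0 and coefficients near 1.5 and 2.
cor u x1 x2
reg y x1 x2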

* In summary, selection can bias OLS estimates if it is not taken into account (II, III, IV).  However, this bias may be small enough to be trivial in many cases (II, IV).  In contrast, when the standard assumption of constant coefficients is relaxed, selection can become a much more serious source of bias (III).

2. Hey Francis,

I recently read in Lomax (2007) that the effects of violation of assumptions in ANOVA/ANCOVA aren't well known. Recently, I've had several discussions with numerous colleagues regarding the use of ANOVA with observational data; more specifically, panel data with t = 3 and i = 500 (although there were only 170 individuals by the last panel). I tried to first explain how using an endogenous IV is problematic (comparing two groups), and then how ignoring the correlation of the measures over time and the clustering of observations within hierarchical units would also be problematic. I've not done too much in terms of simulations, so I was wondering if you had any thoughts on how people could set up simulation studies for cases like this to better advocate for the use of more appropriate modeling techniques (the results of the "analyses" also appear under a heading "EVIDENCE OF IMPACT" in the report that they prepared).