Sunday, July 22, 2012

The Importance of Independence


# Simulation in R
# It is a common assumption in most econometric models that data is drawn randomly from the population.

# However, this is often not the case because data are frequently collected for purposes other than econometric analysis.

# In addition, collecting random data might be infeasible or impossible in some instances.

# How would you randomly survey potential terrorists, for instance?

# There are a number of problems non-random sampling can cause.

# First let's draw 1000 standard normal xs with mean 0 and standard deviation 1.
x <- rnorm(1000,0,1)

# Let's imagine that our sample happened to consist of the 200 x's in the "middle"

# First let's sort x

xsort <- sort(x)

# Now let's sample non-randomly from the sorted "middle" of the population

xmiddle_sample <- xsort[400:599]

# For comparison, let's take the same index range from the unsorted (and therefore effectively random) data

x_sample <- x[400:599]

sd(xmiddle_sample)
# We can see the standard deviation is about 1/6 of what it should be (close to 1)

sd(x_sample)
# In the random sub-sample the standard deviation is much closer to 1

mean(xmiddle_sample)
# The mean of the middle sample is close to the true mean, but only because we happened to select symmetrically around the center of the distribution.

mean(x_sample)
# The mean of the random sample also works well.

# However, having a non-random sample can bias mean estimates as well as estimates of variation.
xlower_sample <- xsort[1:200]

sd(xlower_sample)
mean(xlower_sample)
# The standard deviation is still too small but now the mean is way off.

# So how does all of this translate into OLS?

# It depends to some extent on what elements are non-random.

# In Y = XB + u, if the sample is sorted and selected on X as above, then the only problem will be that the variance of X is small, making it hard to identify its effect on Y.

# If, however, the u term is the one that is sorted on and selected from, then our estimated standard errors will be too small, causing us to over-reject the null.  In addition, it may also bias our estimate of the constant.

# Finally, if sorting and selecting is based on a combination of X and u (sorting on Y, for instance), then non-random selection will additionally bias our estimates of the coefficients.

# Let's see how this works.  Let's start with 10,000 observations.

X <- rnorm(10000)
u <- rnorm(10000)*3

Y <- 2+X+u

# The coefficient on X is 1

# Let's bind the data together in a data frame
data.rand <- as.data.frame(cbind(Y,X,u))

summary(lm(Y~X, data=data.rand))

# First let's take a random sample from the data as a benchmark.  Since the rows are in random order, the first 1,000 rows serve as a random sample.

sample.rand <- data.rand[1:1000,]

summary(lm(Y~X, data=sample.rand))

# We can see a random subsample from the population does not pose us too many problems.

# We lose a little accuracy and the strength of our t-values diminishes.

# Now let's see what happens when we sort on X and then select from X

data.Xsort <- data.rand[order(data.rand$X),]

# We can see that in data.Xsort the X variable is sorted (we will only look at the first 40)

data.Xsort[1:40,]

# Now we select 1,000 observations from the lower middle of the sorted data

sample.Xsort <- data.Xsort[3001:4000,]

summary(lm(Y~X, data=sample.Xsort))

# We can see that, as predicted, though the sample size is the same as previously and the estimate of the coefficient on X is still reasonable, there is not enough variation in X to reject the null.
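
# As a quick check on that explanation, compare the spread of X in the
# X-sorted sub-sample with the spread of X in the random benchmark sample.
sd(sample.Xsort$X)
sd(sample.rand$X)
# The X-sorted sub-sample has only a small fraction of the variation in X,
# which inflates the standard error on the X coefficient.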

# Let's see how it works when the data is selected and sorted by u.

data.usort <- data.rand[order(data.rand$u),]

sample.usort <- data.usort[3001:4000,]

summary(lm(Y~X, data=sample.usort))

# As predicted, the constant is strongly biased.

# In addition, the standard errors are too small.  This would cause us to over-reject the null.
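
# A quick look at the error term shows where the bias in the constant comes
# from: the selected u's come from a narrow band below zero.
mean(sample.usort$u)
sd(sample.usort$u)
# The non-zero mean of u gets absorbed into the estimated constant, and the
# compressed spread of u is what shrinks the estimated standard errors.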

# We cannot see the over-rejection directly with this data as is, because the only regressor, X, truly has an effect on Y.

# But let us add a handful of irrelevant variables (true coefficients of zero) to see how often we reject.

# Loop through i = 1 to i = 20
for(i in 1:20) {
  # Generate the new variable name z1 z2 etc
  newvar <- paste("z",i,sep="")
  # Add the new variable as a column of sample.usort
  sample.usort[newvar]<-rnorm(1000)
}

# We can see we now have 20 random new variables
sample.usort[1:5,]

summary(lm(Y~X+
             z1+z2+z3+z4+z5+z6+z7+z8+z9+z10+
             z11+z12+z13+z14+z15+z16+z17+z18+z19+z20
                , data=sample.usort))

# It does not seem to be over-rejecting too badly in general.  But I suspect that if we did a repeated Monte Carlo simulation we would find the rejection rate to be greater than .1 at the .1 level, greater than .05 at the .05 level, etc.
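
# A rough sketch of how such a Monte Carlo might look (the names X.mc, u.mc,
# samp, and reject below are just illustrative): re-draw the population,
# re-select the sub-sample on u, regress Y on X plus one irrelevant variable z,
# and record how often z comes out significant at the .05 level.
reject <- rep(NA, 500)
for (r in 1:500) {
  X.mc <- rnorm(10000)
  u.mc <- rnorm(10000)*3
  Y.mc <- 2 + X.mc + u.mc
  pop <- data.frame(Y=Y.mc, X=X.mc, u=u.mc)
  # Select the same lower-middle 1,000 observations after sorting on u
  samp <- pop[order(pop$u),][3001:4000,]
  # Add one irrelevant regressor with a true coefficient of zero
  samp$z <- rnorm(1000)
  p.z <- summary(lm(Y~X+z, data=samp))$coefficients["z","Pr(>|t|)"]
  reject[r] <- (p.z < .05)
}
mean(reject)
# Compare this rejection rate to the nominal .05 level.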

# Finally let us look at what happens if the data is sorted by Y (which is a combination of X and u)

data.Ysort <- data.rand[order(data.rand$Y),]

sample.Ysort <- data.Ysort[3001:4000,]
cor(sample.Ysort$X,sample.Ysort$u)
# Now X and u are negatively correlated with each other.  Why?
# Plot u (red), Y (blue), and X (black) for the first 100 observations of the Y-sorted sample
plot(1:100, sample.Ysort$u[1:100], col = "red",
     ylim = range(sample.Ysort$u[1:100], sample.Ysort$Y[1:100], sample.Ysort$X[1:100]))
  points(1:100, sample.Ysort$Y[1:100], col = "blue")
  points(1:100, sample.Ysort$X[1:100], col = "black")
dev.copy(png,'independence.png')
dev.off()

# In order for an observation to fall in the range 3001-4000, generally either X has to be low and u relatively high, or u low and X relatively high, or X and u both a little low.
sample.Ysort[1:40,]

# If you were to take a different slice of the Y-sorted data, you would find a different correlation between X and u.
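
# A quick illustration of that point: the induced correlation between X and u
# depends on which slice of the Y-sorted data we happen to select.
cor(data.Ysort$X[1:1000], data.Ysort$u[1:1000])
cor(data.Ysort$X[3001:4000], data.Ysort$u[3001:4000])
cor(data.Ysort$X[9001:10000], data.Ysort$u[9001:10000])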

# The important point is that by selecting the sample on Y (assuming X has real effects), you are by necessity inducing a correlation between the error and X.
summary(lm(Y~X, data=sample.Ysort))

# Thus choosing your sampling mechanism is critical.
