* This simulation is inspired by the work of the blog FiveThirtyEight (http://fivethirtyeight.blogs.nytimes.com/).
* In order to make the simulation more interesting, I spent some time collecting real data.
* I have collected poll data from RealClearPolitics.com on nearly all 50 states and DC.
* Download the excel file posted at
( post a comment if there is a problem with downloading the file from dropbox.)
* Then load it into Stata.
* You will find that it has information listed by poll then by state alphabetically.
* The variable Sample is the number of people sampled while the MoE is the margin of error. This is basically half the confidence interval.
* You will also find variables Obama and Romney which is the percent of people indicating that they would vote for that condidate.
* The final bit of interesting information is the Electoral_College information which counts the number of votes associated with that state.
* The first problem you will notice is that some states do not have information.
* I have filled in the Romney and Obama values based on RCP's likely vote values.
* The second problem you will note is that the MoE and Sample are considered Text while what we need is Numbers.
destring Sample, replace force
destring MoE, replace force
* Let us assume that the states that do not have poll information are either solidly Obama or solidly Romney.
* Let's simply choose a sample size and MoE for missing values that will make the percentages not change.
replace Sample = 500 if Sample==.
replace MoE = 1 if MoE==.
* Let's first recover the Standard Errors: MoE = 1.96 * SE -> SE = MoE/1.96
gen se = MoE/1.96
* The first thing I would like to do is get the State average across the polls for both the variance estiamte and the means.
* If we assume that each poll is equally valid then to calculate the State average.
* For Obama: Total Mean=Sum(Obama/100*Sample)/Total Sample
* Same for Romney
* So let's first calculate the Total Sample by State
bysort State: egen Sample_Total = sum(Sample)
* Let's calculate the number of people per poll who declared Obama/Romney
gen NObama = Obama/100*Sample
gen NRomney = Romney/100*Sample
* Note: this number is usually adjusted for representation within the total population. Thus, we get numbers that are not whole as a result of this calculation.
* Now the total voters by State:
bysort State: egen NObama_Total = sum(NObama)
bysort State: egen NRomney_Total = sum(NRomney)
* The last calculation is simply Nvoters/Total_Voters
gen NS_Obama = NObama_Total/Sample_Total
gen NS_Romney = NRomney_Total/Sample_Total
* Now the trickier part is calculating the total variance by the state.
* First let's remember how the se is calculated: se=sd/n^.5
* We want to solve for the "individual" standard deviation => thus sd=se*n^.5 => sd^2 = var = se^2*n
* The reason we want to have the variance estimate is because variances can be added together assuming observations are independent which should be true in polls.
* First let's calculate our var
gen var_hat = se^2*Sample
* Once we have our variances we would like to sum them together to find our total sample variance
bysort State: egen var_total = sum(var_hat)
* Finally we would like to turn our total variance estimate back into a standard error.
* Thus: se=(var^.5)/(total_sample^.5)
* Finally, let's create a count of the number of polls by state
bysort State: gen Npolls = _N
* Now we have created all of the variables we need at this level. We will collapse the data to the state level.
collapse se_total Sample_Total NS_Obama NS_Romney Electoral_Count Npolls, by(State)
rename NS_Obama obama
rename NS_Romney romney
* Now let's calculate the new margins
gen obama_romney = obama-romney
* We can see the majority of the states swing Romney
gen obama_electoral = Elector if (obama_romney>0)
* However, just based on State averages we would expect Obama to take 307 electoral votes.
gen romney_electoral = Elector if (obama_romney<0 p="p">sum romney_electoral
* For Romney, the comparable number is 233. Hmm, well does not look good for Romney.
* Does our analysis end here? It need not! Thus far we have calculated the most likely outcome.
* Let's now try to figure out how frequently overall Obama will win (of all possible outcomes).
* For this, we will use simulation!
cap program drop election2012
program define election2012, rclass
* This simulation, as far as simulations go will be pretty straightforward.
* First let's rescale our voters so they they must either vote Obama or Romney and so that the number of interest is only likelihood of voting Obama.
gen obama_p = obama/(obama+romney)
* I am not sure how these pollsters generate their margin of error or more particularly what it is relative to.
* This will be the only tricky part of the simulation.
* I would like it to be possible to specify the correlation in errors between states.
* I will only allow one correlation to exist.
gen cor_e = rnormal() if _n==1
* cor_e is a common error among all states
* rho is coughly a correlation in error across states.
* if a first argument is not set then there is assumed to be no correlation in errors.
local rho = 0`1'
gen e = cor_e*`rho' + rnormal()*(1-`rho')^.5
* This creates correlation but it messes up standard errors.
* Fortunately, standard errors are easy to "fix".
replace e=e / r(sd)
local se_scalar = 1+0`2'
* The `2' is a scalar for the random error.
* We may think that the polls are more or less volatile than how people actually vote.
* If they are more volatile then `2' should be less than 0 or if they are more then `2' should be greater
* Now we will add in our random variance component
gen obama_final = obama_p + e*(se_total/100)*`se_scalar'
* Note the se is in percent units but obama_p is in proportion so we need to divide by 100.
* Now, we simply see how many times obama got 50% or more and count his electoral votes
gen obama_elect = Elector if obama_final>.5
local obama_elect_count = r(N)*r(mean)
return scalar obama_elect_count = `obama_elect_count'
gen romney_elect = Elector if obama_final<.5
local romney_elect_count = r(N)*r(mean)
return scalar romney_elect_count = `romney_elect_count'
* Now return a dummy indicating if Obama wins this round (gets more electoral votes):
local obama_win = (`obama_elect_count'>`romney_elect_count')
* Let's say ties go to Romney
return scalar obama_win = `obama_win'
* End Simulation programming
di "In one run of the simulation Obama got " r(obama_elect_count)
di "In one run of the simulation Romney got " r(romney_elect_count)
simulate, rep(1000): election2012
* Repeating the simulation 1000 times, I get that Obama wins 91% of the time.
hist obama_elect_count, nodraw name(Obama, replace) xtitle(Obama Electoral College Votes)
hist romney_elect_count, nodraw name(Romney, replace) xtitle(Romeny Electoral College Votes)
graph combine Obama Romney, ycommon xcommon cols(1) title(Obama vs Romney)
* Now let's add a fairly strong correlation in errors .4
simulate, rep(1000): election2012 .4
* Repeating the simulation 1000 times, I get that Obama wins 79% of the time. This is because we are in effect increasing the total system error by making it covariant causing the outcome to approach 50%
* Finally, let's double the standard errors arguing that polls suffer for random measurement error.
simulate, rep(1000): election2012 .4 2
* Repeating the simulation 1000 times, now Obama wins only 66.5% of the time.
* Is this something to worry Obama supporters? No, not really.
* If we increase the variance of the error then of course everything moves towards the mean - 50% victory.
* But realistically, it is probably not justified to double the standard deviation of the error.