Tuesday, November 6, 2012

2012 Election! Simulation and predictions. Obama wins 91% of simulated outcomes.

Original Code

* This simulation is inspired by the work of the blog FiveThirtyEight (http://fivethirtyeight.blogs.nytimes.com/).

* In order to make the simulation more interesting, I spent some time collecting real data.

* I have collected poll data from RealClearPolitics.com on nearly all 50 states and DC.

* Download the excel file posted at
( post a comment if there is a problem with downloading the file from dropbox.)

* Then load it into Stata.

* You will find that it has information listed by poll then by state alphabetically.

* The variable Sample is the number of people sampled while the MoE is the margin of error.  This is basically half the confidence interval.

* You will also find variables Obama and Romney which is the percent of people indicating that they would vote for that condidate.

* The final bit of interesting information is the Electoral_College information which counts the number of votes associated with that state.

* The first problem you will notice is that some states do not have information.

* I have filled in the Romney and Obama values based on RCP's likely vote values.

* The second problem you will note is that the MoE and Sample are considered Text while what we need is Numbers.

destring Sample, replace force
destring MoE, replace force

* Let us assume that the states that do not have poll information are either solidly Obama or solidly Romney.

* Let's simply choose a sample size and MoE for missing values that will make the percentages not change.

replace Sample = 500 if Sample==.
replace MoE = 1 if MoE==.

* Let's first recover the Standard Errors: MoE = 1.96 * SE -> SE = MoE/1.96

gen se = MoE/1.96

* The first thing I would like to do is get the State average across the polls for both the variance estiamte and the means.

* If we assume that each poll is equally valid then to calculate the State average.
*  For Obama: Total Mean=Sum(Obama/100*Sample)/Total Sample
* Same for Romney

* So let's first calculate the Total Sample by State
bysort State: egen Sample_Total = sum(Sample)

* Let's calculate the number of people per poll who declared Obama/Romney
gen NObama = Obama/100*Sample
gen NRomney = Romney/100*Sample

* Note: this number is usually adjusted for representation within the total population.  Thus, we get numbers that are not whole as a result of this calculation.

* Now the total voters by State:
bysort State: egen NObama_Total = sum(NObama)
bysort State: egen NRomney_Total = sum(NRomney)

* The last calculation is simply Nvoters/Total_Voters
gen NS_Obama = NObama_Total/Sample_Total
gen NS_Romney = NRomney_Total/Sample_Total

* Now the trickier part is calculating the total variance by the state.
* First let's remember how the se is calculated: se=sd/n^.5
* We want to solve for the "individual" standard deviation => thus sd=se*n^.5 => sd^2 = var = se^2*n
* The reason we want to have the variance estimate is because variances can be added together assuming observations are independent which should be true in polls.

* First let's calculate our var
gen var_hat = se^2*Sample

* Once we have our variances we would like to sum them together to find our total sample variance
bysort State: egen var_total = sum(var_hat)

* Finally we would like to turn our total variance estimate back into a standard error.
* Thus: se=(var^.5)/(total_sample^.5)
gen se_total=var_total^.5/Sample_Total^.5

* Finally, let's create a count of the number of polls by state
bysort State: gen Npolls = _N

* Now we have created all of the variables we need at this level.  We will collapse the data to the state level.
collapse se_total Sample_Total NS_Obama NS_Romney  Electoral_Count Npolls, by(State)

rename NS_Obama obama
rename NS_Romney romney

* Now let's calculate the new margins
gen obama_romney = obama-romney
sum obama_romney

* We can see the majority of the states swing Romney
gen obama_electoral = Elector if (obama_romney>0)
sum obama_electoral
di r(mean)*r(N)
* However, just based on State averages we would expect Obama to take 307 electoral votes.

gen romney_electoral = Elector if (obama_romney<0 p="p">sum romney_electoral
di r(mean)*r(N)
* For Romney, the comparable number is 233.  Hmm, well does not look good for Romney.

* Does our analysis end here?  It need not!  Thus far we have calculated the most likely outcome.

* Let's now try to figure out how frequently overall Obama will win (of all possible outcomes).

* For this, we will use simulation!
cap program drop election2012
program define election2012, rclass

  * This simulation, as far as simulations go will be pretty straightforward.
  * First let's rescale our voters so they they must either vote Obama or Romney and so that the number of interest is only likelihood of voting Obama.
  gen obama_p = obama/(obama+romney)
  * I am not sure how these pollsters generate their margin of error or more particularly what it is relative to.

  * This will be the only tricky part of the simulation.
  * I would like it to be possible to specify the correlation in errors between states.
  * I will only allow one correlation to exist.
  gen cor_e = rnormal() if _n==1
  * cor_e is a common error among all states

  * rho is coughly a correlation in error across states.
  * if a first argument is not set then there is assumed to be no correlation in errors.
  local rho = 0`1'

  di `rho'

  gen e = cor_e[1]*`rho' + rnormal()*(1-`rho')^.5

  * This creates correlation but it messes up standard errors.
  * Fortunately, standard errors are easy to "fix".
  sum e
  replace e=e / r(sd)

  local se_scalar = 1+0`2'
  * The `2' is a scalar for the random error.
  * We may think that the polls are more or less volatile than how people actually vote.
  * If they are more volatile then `2' should be less than 0 or if they are more then `2' should be greater

  * Now we will add in our random variance component
  gen obama_final = obama_p + e*(se_total/100)*`se_scalar'
  * Note the se is in percent units but obama_p is in proportion so we need to divide by 100.

  * Now, we simply see how many times obama got 50% or more and count his electoral votes
  gen obama_elect = Elector if obama_final>.5
  sum obama_elect
  local obama_elect_count = r(N)*r(mean)
    return scalar obama_elect_count = `obama_elect_count'

  gen romney_elect = Elector if obama_final<.5
  sum romney_elect
  local romney_elect_count = r(N)*r(mean)
    return scalar romney_elect_count = `romney_elect_count'

  * Now return a dummy indicating if Obama wins this round (gets more electoral votes):
  local obama_win = (`obama_elect_count'>`romney_elect_count')
  * Let's say ties go to Romney
return scalar obama_win = `obama_win'

* End Simulation programming

  di "In one run of the simulation Obama got " r(obama_elect_count)
  di "In one run of the simulation Romney got " r(romney_elect_count)

  simulate, rep(1000): election2012
  * Repeating the simulation 1000 times, I get that Obama wins 91% of the time.
  hist obama_elect_count, nodraw name(Obama, replace) xtitle(Obama Electoral College Votes)
  hist romney_elect_count, nodraw name(Romney, replace) xtitle(Romeny Electoral College Votes)
  graph combine Obama Romney, ycommon xcommon cols(1) title(Obama vs Romney)


* Now let's add a fairly strong correlation in errors .4
  simulate, rep(1000): election2012 .4
  * Repeating the simulation 1000 times, I get that Obama wins 79% of the time.  This is because we are in effect increasing the total system error by making it covariant causing the outcome to approach 50%

* Finally, let's double the standard errors arguing that polls suffer for random measurement error.
  simulate, rep(1000): election2012 .4 2
  * Repeating the simulation 1000 times, now Obama wins only 66.5% of the time.
  * Is this something to worry Obama supporters?  No, not really.
  * If we increase the variance of the error then of course everything moves towards the mean - 50% victory.
  * But realistically, it is probably not justified to double the standard deviation of the error.