* This simulation is inspired by the work of the blog FiveThirtyEight (http://fivethirtyeight.blogs.nytimes.com/).

* In order to make the simulation more interesting, I spent some time collecting real data.

* I have collected poll data from RealClearPolitics.com on nearly all 50 states and DC.

* Download the excel file posted at

* http://stata-bloggers.com/EconometricsBySimulation/2012_11_05_PollsRCP.xlsx

* http://stata-bloggers.com/EconometricsBySimulation/2012_11_05_PollsRCP.csv

( post a comment if there is a problem with downloading the file from dropbox.)

* Then load it into Stata.

* You will find that it has information listed by poll then by state alphabetically.

* The variable Sample is the number of people sampled while the MoE is the margin of error. This is basically half the confidence interval.

* You will also find variables Obama and Romney which is the percent of people indicating that they would vote for that condidate.

* The final bit of interesting information is the Electoral_College information which counts the number of votes associated with that state.

* The first problem you will notice is that some states do not have information.

* I have filled in the Romney and Obama values based on RCP's likely vote values.

* The second problem you will note is that the MoE and Sample are considered Text while what we need is Numbers.

destring Sample, replace force

destring MoE, replace force

* Let us assume that the states that do not have poll information are either solidly Obama or solidly Romney.

* Let's simply choose a sample size and MoE for missing values that will make the percentages not change.

replace Sample = 500 if Sample==.

replace MoE = 1 if MoE==.

* Let's first recover the Standard Errors: MoE = 1.96 * SE -> SE = MoE/1.96

gen se = MoE/1.96

* The first thing I would like to do is get the State average across the polls for both the variance estiamte and the means.

* If we assume that each poll is equally valid then to calculate the State average.

* For Obama: Total Mean=Sum(Obama/100*Sample)/Total Sample

* Same for Romney

* So let's first calculate the Total Sample by State

bysort State: egen Sample_Total = sum(Sample)

* Let's calculate the number of people per poll who declared Obama/Romney

gen NObama = Obama/100*Sample

gen NRomney = Romney/100*Sample

* Note: this number is usually adjusted for representation within the total population. Thus, we get numbers that are not whole as a result of this calculation.

* Now the total voters by State:

bysort State: egen NObama_Total = sum(NObama)

bysort State: egen NRomney_Total = sum(NRomney)

* The last calculation is simply Nvoters/Total_Voters

gen NS_Obama = NObama_Total/Sample_Total

gen NS_Romney = NRomney_Total/Sample_Total

* Now the trickier part is calculating the total variance by the state.

* First let's remember how the se is calculated: se=sd/n^.5

* We want to solve for the "individual" standard deviation => thus sd=se*n^.5 => sd^2 = var = se^2*n

* The reason we want to have the variance estimate is because variances can be added together assuming observations are independent which should be true in polls.

* First let's calculate our var

gen var_hat = se^2*Sample

* Once we have our variances we would like to sum them together to find our total sample variance

bysort State: egen var_total = sum(var_hat)

* Finally we would like to turn our total variance estimate back into a standard error.

* Thus: se=(var^.5)/(total_sample^.5)

gen se_total=var_total^.5/Sample_Total^.5

* Finally, let's create a count of the number of polls by state

bysort State: gen Npolls = _N

* Now we have created all of the variables we need at this level. We will collapse the data to the state level.

collapse se_total Sample_Total NS_Obama NS_Romney Electoral_Count Npolls, by(State)

rename NS_Obama obama

rename NS_Romney romney

* Now let's calculate the new margins

gen obama_romney = obama-romney

sum obama_romney

* We can see the majority of the states swing Romney

gen obama_electoral = Elector if (obama_romney>0)

sum obama_electoral

di r(mean)*r(N)

* However, just based on State averages we would expect Obama to take 307 electoral votes.

gen romney_electoral = Elector if (obama_romney<0 p="p">sum romney_electoral

di r(mean)*r(N)

* For Romney, the comparable number is 233. Hmm, well does not look good for Romney.

* Does our analysis end here? It need not! Thus far we have calculated the most likely outcome.

* Let's now try to figure out how frequently overall Obama will win (of all possible outcomes).

* For this, we will use simulation!

cap program drop election2012

program define election2012, rclass

preserve

* This simulation, as far as simulations go will be pretty straightforward.

* First let's rescale our voters so they they must either vote Obama or Romney and so that the number of interest is only likelihood of voting Obama.

gen obama_p = obama/(obama+romney)

* I am not sure how these pollsters generate their margin of error or more particularly what it is relative to.

* This will be the only tricky part of the simulation.

* I would like it to be possible to specify the correlation in errors between states.

* I will only allow one correlation to exist.

gen cor_e = rnormal() if _n==1

* cor_e is a common error among all states

* rho is coughly a correlation in error across states.

* if a first argument is not set then there is assumed to be no correlation in errors.

local rho = 0`1'

di `rho'

gen e = cor_e[1]*`rho' + rnormal()*(1-`rho')^.5

* This creates correlation but it messes up standard errors.

* Fortunately, standard errors are easy to "fix".

sum e

replace e=e / r(sd)

local se_scalar = 1+0`2'

* The `2' is a scalar for the random error.

* We may think that the polls are more or less volatile than how people actually vote.

* If they are more volatile then `2' should be less than 0 or if they are more then `2' should be greater

* Now we will add in our random variance component

gen obama_final = obama_p + e*(se_total/100)*`se_scalar'

* Note the se is in percent units but obama_p is in proportion so we need to divide by 100.

* Now, we simply see how many times obama got 50% or more and count his electoral votes

gen obama_elect = Elector if obama_final>.5

sum obama_elect

local obama_elect_count = r(N)*r(mean)

return scalar obama_elect_count = `obama_elect_count'

gen romney_elect = Elector if obama_final<.5

sum romney_elect

local romney_elect_count = r(N)*r(mean)

return scalar romney_elect_count = `romney_elect_count'

* Now return a dummy indicating if Obama wins this round (gets more electoral votes):

local obama_win = (`obama_elect_count'>`romney_elect_count')

* Let's say ties go to Romney

return scalar obama_win = `obama_win'

restore

end

* End Simulation programming

election2012

di "In one run of the simulation Obama got " r(obama_elect_count)

di "In one run of the simulation Romney got " r(romney_elect_count)

preserve

simulate, rep(1000): election2012

sum

* Repeating the simulation 1000 times, I get that Obama wins 91% of the time.

hist obama_elect_count, nodraw name(Obama, replace) xtitle(Obama Electoral College Votes)

hist romney_elect_count, nodraw name(Romney, replace) xtitle(Romeny Electoral College Votes)

graph combine Obama Romney, ycommon xcommon cols(1) title(Obama vs Romney)

restore

* Now let's add a fairly strong correlation in errors .4

preserve

simulate, rep(1000): election2012 .4

sum

* Repeating the simulation 1000 times, I get that Obama wins 79% of the time. This is because we are in effect increasing the total system error by making it covariant causing the outcome to approach 50%

restore

* Finally, let's double the standard errors arguing that polls suffer for random measurement error.

preserve

simulate, rep(1000): election2012 .4 2

sum

* Repeating the simulation 1000 times, now Obama wins only 66.5% of the time.

* Is this something to worry Obama supporters? No, not really.

* If we increase the variance of the error then of course everything moves towards the mean - 50% victory.

* But realistically, it is probably not justified to double the standard deviation of the error.

restore

Hi, the excel files are no longer available.

ReplyDeleteHi - can you repost the data files somewhere?

ReplyDelete