Thursday, December 1, 2016

Efficiently Saving and Sharing Data in R

I spent a day the other week struggling to make sense of a federal data set shared in an archaic format (an ASCII fixed format dat file).

It is essential for the effective distribution and sharing of data that it use the minimum amount of disk space and be rapidly accessible for use by potential users.

In this post I test four different file formats available to R users. These formats are comma separated values csv (write.csv()), object representation format as an ASCII txt (dput()), a serialized R object (saveRDS()), and a Stata file (write.dta() from the foreign package). For reference, rds files seem to be identical to Rdata files except that they deal with only one object rather than potentially multiple.
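As a rough sketch of how one data frame can be written in each of the four formats and read back, the snippet below uses a small made-up data frame (`df`) and temporary file paths. Both are assumptions for illustration, not the data used in this post.

```r
# Hypothetical example data; not the simulated data set from the post.
set.seed(1)
df <- data.frame(id = 1:100, x = rnorm(100))

out_dir  <- tempdir()
csv_file <- file.path(out_dir, "example.csv")
txt_file <- file.path(out_dir, "example.txt")
rds_file <- file.path(out_dir, "example.rds")
dta_file <- file.path(out_dir, "example.dta")

write.csv(df, csv_file, row.names = FALSE)  # comma separated values
dput(df, file = txt_file)                   # ASCII object representation
saveRDS(df, rds_file)                       # serialized single R object
if (requireNamespace("foreign", quietly = TRUE)) {
  foreign::write.dta(df, dta_file)          # Stata file
}

# Read each format back into R.
df_csv <- read.csv(csv_file)
df_txt <- dget(txt_file)
df_rds <- readRDS(rds_file)
```

Only the rds round trip is guaranteed to return an identical object. The text formats can lose a little numeric precision.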

In order to get an idea of how and where different formats outperformed each other I simulated a dataset composed of different common data formats. These formats were the following:

Numeric Formats

  • Index 1 to N - ex. 1,2,3,4,...
  • Whole Numbers - ex. 30, 81, 73, 5, ...
  • Big Numbers - ex. 36374.989943, 15280.050850, 5.908210, 79.890601, 2.857904, ...
  • Continuous Numbers - ex. 1.1681155, 1.6963295, 0.8964436, -0.5227753, ...

Text Formats

  • String coded factor variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd ...
  • String coded factor variables with 16 characters coded as strings
  • String coded factor variables with 64 characters coded as strings
  • Factor coded variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd - coded as 1,2,4,3,2, ...
  • Factor coded variables with 16 characters
  • Factor coded variables with 64 characters
  • String variables with random 4 characters - ex. jdhd, jdjj, ienz, lsdk, ...
  • String variables with random 16 characters
  • String variables with random 64 characters
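A data set with the variable types listed above could be simulated along these lines. The helper functions (`rand_string`, `rep_string`), the pool of 20 repeated values, and the distributions used for the numeric columns are all assumptions made for this sketch. The original simulation may differ.

```r
set.seed(1)
N <- 10000

# Hypothetical helper: n random lowercase strings of a given length.
rand_string <- function(n, len) {
  vapply(seq_len(n), function(i) {
    paste(sample(letters, len, replace = TRUE), collapse = "")
  }, character(1))
}

# Hypothetical helper: repeated strings drawn from a small pool of values.
rep_string <- function(n, len, pool = 20) {
  sample(rand_string(pool, len), n, replace = TRUE)
}

sim <- data.frame(
  index      = 1:N,
  whole      = sample(1:100, N, replace = TRUE),
  big        = round(exp(rnorm(N, mean = 5, sd = 3)), 6),
  continuous = rnorm(N),
  str_rep4   = rep_string(N, 4),
  str_rep16  = rep_string(N, 16),
  str_rep64  = rep_string(N, 64),
  fac4       = factor(rep_string(N, 4)),
  fac16      = factor(rep_string(N, 16)),
  fac64      = factor(rep_string(N, 64)),
  str_rand4  = rand_string(N, 4),
  str_rand16 = rand_string(N, 16),
  str_rand64 = rand_string(N, 64),
  stringsAsFactors = FALSE
)

str(sim)
```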

The format of a variable predicts how much space that variable takes up and therefore how time consuming it is to read or write. Variables that are easy to describe tend to take up little space. An index variable is an extreme example and can take up almost no space, as it can be expressed in an extremely compact format (1:N).

In contrast numbers which are very long or have a great degree of precision tend to have more information and therefore take more resources to access and store. String variables when filled with truly random or unique responses are some of the hardest data to compress as each value may be sampled from the full character spectrum. There is some significant potential for compression when strings are repeated in the variable. These repetitive entries can be either coded as a "factor" variable or a string variable in R.

As part of this exploration, I look at how string data is stored and saved when coded as either a string or as a factor within R.
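To make the string versus factor distinction concrete, here is a small sketch using the four example values from the list above. A factor keeps integer codes plus one copy of each label, while a character vector keeps the strings themselves. The variable names are hypothetical.

```r
set.seed(1)
s <- sample(c("fdsg", "jfkd", "jfht", "ejft"), 10000, replace = TRUE)
f <- factor(s)

typeof(s)            # "character"
typeof(f)            # "integer": the factor stores codes, not the text
levels(f)            # the four labels, stored once
head(as.integer(f))  # the underlying integer codes

# write.csv spells out the label for every row either way,
# so both versions produce a csv file of the same size.
csv_s <- tempfile(fileext = ".csv")
csv_f <- tempfile(fileext = ".csv")
write.csv(data.frame(x = s, stringsAsFactors = FALSE), csv_s, row.names = FALSE)
write.csv(data.frame(x = f), csv_f, row.names = FALSE)
c(string = file.size(csv_s), factor = file.size(csv_f))
```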

Raw Files

Let's first look at space taken when saving uncompressed files.

Figure 1: File Size

Figure 1 shows the file size of each of the saved variables when 10,000 observations are generated. The dataframe object is the data.frame composed of all of the variables. From the height of the dataframe bar, we can see that rds is the overall winner. Looking at the other variables, we can see only that csv appears to consistently underperform for most variable formats except for random strings.
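A per-variable size comparison of this kind can be sketched with `file.size()`. The helper `size_by_format` below is a hypothetical name, and it covers only the three base R formats.

```r
# Hypothetical helper: save one variable in three formats and report bytes on disk.
size_by_format <- function(x) {
  d <- data.frame(x = x, stringsAsFactors = FALSE)
  f <- tempfile()
  on.exit(unlink(f))

  write.csv(d, f, row.names = FALSE)
  csv <- file.size(f)

  dput(d, file = f)
  txt <- file.size(f)

  saveRDS(d, f)
  rds <- file.size(f)

  c(csv = csv, txt = txt, rds = rds)
}

# An index variable: dput() can write the column as 1:10000.
sizes_index <- size_by_format(1:10000)
sizes_index

# A continuous variable for comparison.
set.seed(1)
sizes_cont <- size_by_format(rnorm(10000))
sizes_cont
```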

Figure 2: File Size Log scaled

In Figure 2 we can see that rds is consistently outperforming all of the other formats with the one exception of index in which the txt encoding simply reads 1:10000. Apparently even serializing to bytes can't beat that.

Interestingly, there does not appear to be an effective size difference between repetitive strings encoded as factors, regardless of the size of the strings (4, 16, or 64). We can see that the inability of csv to compress factor strings dramatically penalizes the efficiency of csv relative to the other formats.

File Compression

But data is rarely shared in uncompressed formats. How does compression change things?

Figure 3: Zipped File Sizes Logged

We can see from Figure 3 that if we zip our data after saving, the other formats can do pretty much as well as rds. Comma delimited csv files are a bit of an exception, with factor variables suffering under csv. Yet random strings perform slightly better under csv than under other formats. Interestingly, rds files seem slightly larger than the other two file types. Overall though, it is pretty hard to see any significant difference in file size based on format after zipping.
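The effect of compression on a saved file can be approximated inside R without an external zip program. This sketch swaps in gzip via `memCompress()` rather than the zip utility used for the figures, so the byte counts will not match a real zip archive exactly. The example data is made up.

```r
set.seed(1)
x <- sample(c("fdsg", "jfkd", "jfht", "ejft"), 10000, replace = TRUE)

path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = x), path, row.names = FALSE)

# Read the file as raw bytes and gzip-compress it in memory.
raw_bytes <- readBin(path, what = "raw", n = file.size(path))
gz_bytes  <- memCompress(raw_bytes, type = "gzip")

c(raw = length(raw_bytes), gzip = length(gz_bytes))
```

Highly repetitive columns like this one shrink a great deal. Truly random strings shrink far less.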

So, should we stick with whatever format we prefer?

Not so fast. Sure, all of the files are similarly sized after zipping. This is useful for sharing files. But having to keep large file sizes on a hard drive is not ideal even if they can be compressed for distribution. There is finite space on any system and some files can be in the hundreds of MB to hundreds of GB range. Dealing with file formats and multiple file versions which are this large can easily drain the permanent storage capacity of most systems.

But an equally important concern is how long it takes to write and read different file formats.

Reading Speed

In order to test reading speeds, I loaded each of the different full dataframe files fifty times. I also tested how long it would take to unzip then load that file.
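A timing loop of this kind can be sketched with `system.time()`. The helper `time_read` and the small example data frame are assumptions for illustration. The timings in Figure 4 came from the much larger simulated data set.

```r
# Hypothetical helper: average elapsed seconds over repeated reads of one file.
time_read <- function(read_fun, path, reps = 50) {
  times <- replicate(reps, system.time(read_fun(path))[["elapsed"]])
  mean(times)
}

set.seed(1)
df <- data.frame(id = 1:1000, x = rnorm(1000))
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")
write.csv(df, csv_file, row.names = FALSE)
saveRDS(df, rds_file)

timings <- c(
  csv = time_read(read.csv, csv_file),
  rds = time_read(readRDS, rds_file)
)
timings
```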

Figure 4: Reading and unzipping average speeds

From Figure 4, we can see that read speeds roughly correspond with the size of files. We can see that even a relatively small file (30 MB csv file) can take as long as 7 seconds to open. Working with large files saved in an inefficient format can be very frustrating.

In contrast, saving files in efficient formats can dramatically cut down on the time taken opening those files. Using the most efficient format (rds), files could be 100 times larger than those used in this simulation and still open in less than a minute.


Finding common file formats that any software can access is not easy. As a result many public data sets are provided in archaic formats which are poorly suited for end users.

This results in a wide pool of software suites having the ability to access these datasets. However, with inefficient file formats comes a higher demand on the hardware of end users. I am unlikely to be the only person struggling with opening some of these large "public access" datasets.

Those maintaining these datasets will argue that sticking with the standard, inefficient format is the best of bad options. However, there is no reason they could not post datasets in rds formats in addition to the outdated formats they currently exist in.

And no, we need not worry that choosing one software language's format for saving data will bias access toward users of that language. Already many federal databases come with code supplements in Stata, SAS, or SPSS. To access these supplements, one is required to have paid access to that software.

Yet, R is free and its database format is public domain. Any user could download R, open an rds or Rdata file, then save that file in a format more suited to their purposes. None of these other proprietary database formats can boast the same.

Tuesday, November 8, 2016

Trump and Clinton Supporters Agree on Relative Morality ... Mostly

When looking at the endless scandals swirling around the heads of the two rival candidates Hillary Clinton and Donald Trump, it can seem like the two candidates are equally tainted. Many people throw up their hands pleading for some other option.

How can we evaluate the alleged actions of these two candidates?

Is there some kind of objective way to do so?

And how does the decision to support a candidate affect the perception of an offense?

In order to attempt to address these questions I recruited 137 submissions on Amazon Mechanical Turk, which together provided around 1100 five-way "relative offense" rankings of five randomly matched actions from a list of 150. Of these submissions, 46 were from individuals supporting Clinton, 37 from individuals supporting Trump, and 54 from individuals who answered other or did not supply a preference. Ranking was from 1 "Least Offensive" to 5 "Most Offensive".

Within any item set of five actions only one action could be matched with an individual ranking.
For each action the mean number of times that action was classified under each ranking was calculated. That value was multiplied by the action ranking and summed across all levels to create an index.

The smallest index values belong to the least offensive or non-offensive actions, while the highest index values belong to the actions respondents considered the most offensive.
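One reading of this index is sketched below: for each action, take the share of times it received each rank from 1 to 5, multiply each share by its rank, and sum. The response data here is made up, and the object names are hypothetical.

```r
# Hypothetical responses: one row per time an action was assigned a rank.
resp <- data.frame(
  action = c("A", "A", "A", "B", "B", "C", "C", "C"),
  rank   = c(1, 2, 1, 5, 4, 3, 3, 2)
)

# Count how often each action received each rank from 1 to 5.
tab <- table(resp$action, factor(resp$rank, levels = 1:5))

# Convert counts to within-action shares, then weight each share by its rank.
share <- prop.table(tab, margin = 1)
index <- drop(unclass(share) %*% 1:5)

sort(index)  # least offensive first
```

Under this reading the index is simply the mean rank each action received, so `tapply(resp$rank, resp$action, mean)` gives the same numbers.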

Table 1: This table shows all 150 actions ranked from least offensive on average to most offensive as rated by all respondents. The Trump, Clinton, and Other columns are how each of these respective supporters rank the actions. A .5 means that two actions were ranked the same. The Trump_Clinton column is calculated by taking the rankings of Trump supporters for an action and subtracting the rank for those actions for Clinton supporters. Values in which Trump supporters and Clinton supporters diverge by more than 30 ranks are highlighted.
Rank | Trump | Clinton | Other | Trump_Clinton | Associated with | Action
1 | 5 | 2 | 2.5 | 3 | | Eating meat.
2 | 1 | 2 | 15 | -1 | | Posting a much younger picture of yourself on a dating website.
3 | 6.5 | 5.5 | 8 | 1 | | Not wearing your seat belt when driving.
4 | 15.5 | 2 | 6 | 13.5 | | Dating someone of a different race.
5 | 2.5 | 15 | 2.5 | -12.5 | Donald Trump | Swearing in public.
6 | 8.5 | 17 | 2.5 | -8.5 | | Offering to pay someone 10 cents for every paper they deliver.
7 | 6.5 | 12 | 17.5 | -5.5 | | Stealing candy from a baby.
8 | 12 | 25 | 10 | -13 | | Jumping in line in front of someone else when waiting for customer service.
9 | 2.5 | 27 | 21 | -24.5 | | Not tipping a waiter after average service.
10 | 4 | 14 | 33.5 | -10 | | Hoard the armrest when sitting next to a stranger on a plane or movie theater.
11 | 60.5 | 5.5 | 5 | 55 | | Having sex with someone of the same gender.
12 | 27 | 8 | 38 | 19 | Trey Radel | Buying illegal drugs for personal consumption.
13 | 25.5 | 11 | 28.5 | 14.5 | | Using toilet paper to vandalizing a stranger's house.
14 | 37 | 13 | 25 | 24 | | Taking the cab hailed for someone else.
15 | 29 | 27 | 12 | 2 | | Leaving chewing gum under a public table.
16 | 13 | 57.5 | 20 | -44.5 | | Exaggerating the size of your penis in order to convince someone to have sex with you.
17 | 25.5 | 4 | 45 | 21.5 | | Illegally copying copyrighted music.
18 | 18 | 16 | 32 | 2 | | Not voting.
19 | 22 | 9 | 48 | 13 | | Illegally streaming copyrighted videos.
20 | 17 | 21.5 | 30.5 | -4.5 | | Publishing with permission a different person's work as your own.
21 | 30 | 51.5 | 12 | -21.5 | | Hunting a wild animal.
22 | 8.5 | 20 | 45 | -11.5 | | Using a bad or ineffective preparedness test to screen potential students.
23 | 57.5 | 36.5 | 12 | 21 | | Sticking gum in someone's hair.
24 | 43 | 40.5 | 22 | 2.5 | | Having sex with your landlord to pay the rent.
25 | 20 | 7 | 50.5 | 13 | | Not flushing a public toilet after pooping.
26 | 46.5 | 36.5 | 14 | 10 | Hillary Clinton | Continuing to stay married to your spouse after that person had multiple affairs.
27.5 | 32 | 33.5 | 19 | -1.5 | | Paying a prostitute for sex.
27.5 | 10.5 | 18 | 27 | -7.5 | | Spending money on something you do not need such as entertainment when you know there are people dying of hunger next door.
29 | 15.5 | 66 | 9 | -50.5 | Donald Trump | Earning 1000 times more money per hour as your lowest paid full-time employees.
30 | 46.5 | 39 | 7 | 7.5 | | Voting illegally twice.
31 | 10.5 | 29 | 40.5 | -18.5 | IRS | Using public funds to pay for professional development conferences which may not be very productive.
32 | 21 | 27 | 26 | -6 | Hillary Clinton | Refusing to release the transcripts of paid speeches you gave.
33 | 28 | 61 | 17.5 | -33 | | Earning 1000 times more money per hour as your lowest paid contractor.
34 | 34 | 40.5 | 23 | -6.5 | | Lying about damages in order to keep a tenant's security deposit.
35 | 63 | 64 | 2.5 | -1 | FBI | Selling guns with GPS trackers in them to a dangerous illegal organization with the intention of using the data to prosecute the organization.
36 | 19 | 19 | 65 | 0 | | Urinating on a toilet seat and not cleaning it.
37 | 74.5 | 61 | 24 | 13.5 | Donald Trump | Misrepresenting your success to convince others to invest in you.
38 | 65 | 51.5 | 28.5 | 13.5 | Donald Trump | Reposting images from a white supremacist group.
39 | 23.5 | 54 | 40.5 | -30.5 | | Writing false review's on a business's website because you are angry with them.
40 | 49.5 | 31.5 | 30.5 | 18 | | Sneaking out of a resturant without paying for a meal.
41 | 14 | 21.5 | 81 | -7.5 | | Copying a fellow student's homework.
42 | 35 | 57.5 | 39 | -22.5 | Hillary Clinton | A politician lying or falsely representing one quarter (25%) of his/her public statements.
43 | 32 | 36.5 | 58 | -4.5 | | Stealing products from a convenience store. (Shoplifting)
44 | 55 | 77.5 | 42.5 | -22.5 | Donald Trump | Falsely reporting the magnitude of your donations to charities to make yourself look better.
45.5 | 52.5 | 45.5 | 52 | 7 | Donald Trump | Publicly shaming someone for being fat or ugly.
45.5 | 81.5 | 69 | 16 | 12.5 | | Refusing to renounce the endorsement of a white supremacist group.
47 | 23.5 | 43 | 72 | -19.5 | | Speeding through a school zone.
48 | 52.5 | 49 | 45 | 3.5 | | Promising special interest groups to vote for them if they donate to your campaign.
49 | 79 | 24 | 53 | 55 | | Masturbating in a public bathroom.
50 | 57.5 | 61 | 36 | -3.5 | Donald Trump | Lying about the number of floors in a building you own in order to charge higher rent.
51 | 41 | 71 | 60 | -30 | | Using spray paint to vandalize a stranger's car.
52 | 41 | 36.5 | 68 | 4.5 | | Smoking in an area around non-smokers where "no smoking" signs are posted.
53 | 74.5 | 42 | 48 | 32.5 | | Spitting in someone's soup when they are not looking.
54 | 41 | 64 | 55.5 | -23 | | Publishing without permission a different person's work as your own.
55 | 44 | 112 | 50.5 | -68 | | Publicly renouncing homosexuality while secretly having sex with gay prostitutes.
56 | 74.5 | 48 | 48 | 26.5 | | Publicizing false claims about a person because you are angry with that person.
57 | 66.5 | 47 | 55.5 | 19.5 | Chris Christie | Using your position to control how public funds are used in to punish a political rival.
58 | 32 | 51.5 | 63 | -19.5 | | Providing alcohol to minors (a person less than 18 years old).
59 | 92 | 44 | 35 | 48 | | A public official hiring a less qualified friend over a well qualified stranger.
60 | 74.5 | 56 | 42.5 | 18.5 | | A fully mobile person refusing to give up a seat to a disabled or infirm person.
61 | 46.5 | 93 | 33.5 | -46.5 | | Publicizing someone's address in an attempt to intimidate someone else.
62 | 39 | 10 | 89 | 29 | Donald Trump | Judging a person on their appearance.
63 | 52.5 | 68 | 61 | -15.5 | | Selling illegal drugs.
64 | 74.5 | 31.5 | 75 | 43 | Hillary Clinton | Improperly storing national secrets.
65 | 52.5 | 64 | 77.5 | -11.5 | Donald Trump | A politician lying or falsely representing three quarters (75%) of his/her public statements.
66 | 81.5 | 30 | 70 | 51.5 | Donald Trump | Enjoying firing people from their occupation.
67 | 68 | 86.5 | 54 | -18.5 | | Stealing large sums of money from a company you work for.
68 | 46.5 | 100 | 57 | -53.5 | Donald Trump | Earning over 10 million dollars in a year and not paying any federal income taxes.
69 | 60.5 | 89 | 64 | -28.5 | Donald Trump | Filing bankruptcy in order to avoid paying the people who worked for you.
70 | 63 | 51.5 | 83.5 | 11.5 | | Sabotaging a competitors work.
71 | 81.5 | 73 | 73 | 8.5 | | Setting up fake accounts for your customers in order to increase profits.
72 | 81.5 | 23 | 93 | 58.5 | | Publicly exposing yourself.
73 | 100.5 | 101 | 37 | -0.5 | Donald Trump | Enter the occupied changing room of members of the opposite sex without consent.
74 | 49.5 | 83 | 77.5 | -33.5 | Clinton/Trump | A politician accepting money from special interest groups.
75 | 37 | 74 | 99 | -37 | | A police officer accepting a bribe in exchange for not writing a ticket.
76 | 106.5 | 81 | 67 | 25.5 | | Stealing someone's car.
 | | | | | | Driving drunk or high.
78 | 66.5 | 77.5 | 80 | -11 | IRS | Using public funds designated for tax collection to produce a parody video.
 | | | | | | Using racial epitaphs.
80 | 57.5 | 33.5 | 94 | 24 | | Cheating on a test.
81 | 90 | 59 | 74 | 31 | | Not giving food to a starving person in front of you.
82 | 63 | 85 | 62 | -22 | | A politician lying or falsely representing half (50%) of his/her public statements.
83 | 95 | 72 | 86.5 | 23 | Anthony Weiner | A married person sending photographs of his/her genitalia to someone who is not that person's spouse.
84 | 111.5 | 67 | 82 | 44.5 | | Giving a blind person inaccurate change because they cannot tell the difference.
85 | 106.5 | 84 | 71 | 22.5 | | Urging someone who is attempting to remain sober to take a drink.
86 | 93 | 70 | 83.5 | 23 | | Without authorization setting up and using a credit card in another person's name.
87 | 100.5 | 45.5 | 106 | 55 | | Masturbating in public.
88 | 84.5 | 86.5 | 77.5 | -2 | | Using public resources entrusted to your care to enrich yourself.
89 | 108 | 75 | 69 | 33 | | Mocking a disabled person for a physical handicap.
90 | 84.5 | 77.5 | 85 | 7 | | Bribing a public official to enrich oneself.
91 | 109 | 105.5 | 59 | 3.5 | Donald Trump | Setting up a fake educational institution in order to enrich yourself.
92 | 37 | 94 | 111.5 | -57 | | Torturing criminals as punishment for crimes.
93 | 90 | 88 | 95 | 2 | Bill Clinton | Cheating on your wife/husband.
94 | 69.5 | 102 | 91 | -32.5 | | Stalking someone you know in order to intimidate that person.
95 | 88 | 55 | 111.5 | 33 | Doctor Kevorkian | Killing someone who is in pain and going to die within the next six months and wants help dying.
96 | 98 | 90.5 | 86.5 | 7.5 | Tonya Harding | Hiring someone to break the leg of a rival athlete.
97 | 74.5 | 90.5 | 101 | -16 | | Using threats of lawsuits to silence a woman who claims to have been assaulted by you.
98 | 114 | 82 | 98 | 32 | Hillary Clinton | Using a private email server resulting in the risk of compromised national security.
99 | 87 | 97 | 107.5 | -10 | | Writing laws to prevent people who do not support you from voting.
100 | 74.5 | 107 | 105 | -32.5 | | Requiring a starving person attend your religious gathering before providing giving food.
101 | 97 | 98.5 | 104 | -1.5 | Donald Trump | Using your superior physical strength to force a person to kiss you.
102 | 86 | 103.5 | 107.5 | -17.5 | | Stealing someone's identity in order to commit a crime.
103 | 103 | 80 | 113.5 | 23 | | Burning down your house to claim the insurance money.
104 | 111.5 | 114 | 90 | -2.5 | | Using public resources entrusted to your care to enrich an ally or friend.
105 | 100.5 | 115 | 97 | -14.5 | | Preventing someone from registering their child in your school because of that person's skin color.
106 | 116 | 109 | 100 | 7 | Adolf Hitler | Encouraging racially motivated violence to promote your political aspirations.
107 | 104 | 120 | 102 | -16 | | Bribing a public official to protect oneself.
108 | 74.5 | 118 | 115 | -43.5 | | Rejecting someone's rental application because of that person's skin color.
109 | 90 | 124.5 | 109 | -34.5 | Bill Cosby | Sneaking a drug into someone's food or drink in order to force that person to have sex with you.
110 | 119 | 128 | 88 | -9 | | Killing a healthy and well behaved pet for because you don't have to the time to take care of it.
111 | 100.5 | 105.5 | 110 | -5 | | Stealing someone's needed pain medication for personal use.
112 | 139.5 | 123 | 96 | 16.5 | | Killing a healthy and well behaved pet because you find the animal annoying.
113 | 143 | 103.5 | 92 | 39.5 | Edward Snowden | Publicizing national security secrets which might result in lives being lost.
114 | 114 | 95.5 | 121.5 | 18.5 | Donald Trump | Threatening to sue someone in order to keep them from telling the truth.
115 | 132 | 124.5 | 103 | 7.5 | | A person of influence encouraging a crowd to physically attack someone or a group of people he/she does not like.
116 | 69.5 | 117 | 127 | -47.5 | George W Bush | Torturing terrorists with the hope of gathering information about future terrorist plots.
117 | 121 | 77.5 | 143.5 | 43.5 | | Ordering the assassination of a dictator.
118 | 105 | 120 | 126 | -15 | | Recording someone having sex without their knowledge.
119 | 114 | 122 | 121.5 | -8 | | Restricting the use of life saving technology in order to increase profits.
120 | 136 | 108 | 116.5 | 28 | | Misrepresenting the effectiveness of a life saving drug in order to increase profits.
121 | 133 | 98.5 | 131 | 34.5 | | Killing someone who is in pain and going to die within the next six months but does not want to die.
122 | 117 | 110 | 146.5 | 7 | | Not reporting an instance of known child abuse when you are legally mandated to report.
123 | 130.5 | 136.5 | 113.5 | -6 | | Physically abusing your spouse.
124 | 128.5 | 112 | 125 | 16.5 | | Sending someone's spouse to the front line to die in order to marry the surviving window.
125 | 122 | 135 | 120 | -13 | | Hiding information about the health risks of deadly product you sell.
126 | 124.5 | 134 | 123.5 | -9.5 | | Stealing large sums of money from a mentally infirm client.
127 | 110 | 144 | 135 | -34 | | A medical doctor refusing to treat a patient who needs urgent care because they are unable to pay.
128 | 119 | 129 | 123.5 | -10 | | Significantly raising the price of a life saving drug in order to increase profits.
129 | 139.5 | 131 | 116.5 | 8.5 | | Viewing child pornography.
130 | 141.5 | 138 | 118 | 3.5 | | Killing a healthy and well behaved pet for fun.
131 | 119 | 143 | 128 | -24 | | Someone with a known dangerous sexually transmitted disease having unprotected sex with someone who is unaware of the condition.
132 | 123 | 133 | 131 | -10 | | Using your superior physical strength to hold an unwilling person while you grabbed their genitalia.
133 | 128.5 | 116 | 131 | 12.5 | Brock Turner | Taking and sharing naked pictures of a person who is unconscious.
134 | 124.5 | 131 | 136.5 | -6.5 | Brock Turner | Have involuntary sex with a person you find unconscious.
135.5 | 130.5 | 112 | 136.5 | 18.5 | Vladimir Putin | Assassinating a political rival.
135.5 | 127 | 145 | 129 | -18 | | Physically forcing your spouse to have sex with you.
137 | 95 | 149.5 | 141.5 | -54.5 | Harry S Truman | Dropping an atomic bomb on a civilian city controlled by a rival nation
138 | 148.5 | 139 | 119 | 9.5 | | Allowing the executing of a convict if you have secret knowledge of the person's innocence.
139.5 | 138 | 120 | 149 | 18 | | A 40 year old having sex with a fifteen year old.
139.5 | 148.5 | 136.5 | 148 | 12 | | Using your power to pressure an employee to have sex with you.
141 | 137 | 140 | 141.5 | -3 | | Not reporting an instance of known child sexual abuse.
142 | 148.5 | 126 | 138 | 22.5 | | Killing innocent people to further your political agenda.
143 | 144.5 | 142 | 139.5 | 2.5 | Adolf Hitler | Encouraging people who follow you to attack and kill people you do not like.
144 | 144.5 | 127 | 139.5 | 17.5 | | Killing your spouse to claim the life insurance premium.
145 | 134.5 | 141 | 146.5 | -6.5 | | Killing someone for their money.
146 | 126 | 131 | 145 | -5 | George Washington | Possessing a personal slave who has no human rights.
147 | 134.5 | 147 | 134 | -12.5 | Donald Trump | Groping without warning the genitalia of someone else.
148 | 148.5 | 146 | 133 | 2.5 | | Making child pornography.
149 | 141.5 | 148 | 143.5 | -6.5 | | Killing innocent people for personal reasons such as curiosity or entertainment.
150 | 146 | 149.5 | 150 | -3.5 | | Killing somebody to protect your reputation.
* Please note that I am making no claim to the veracity of the scandal.
** Upon viewing this list I realize that there are a number of scandals I forgot to include related to Hillary. I am unconvinced that any of them would have been ranked very high on the list. However they should have been included. Sorry team Trump.

General Differences
Looking at the table we can see that it generally agrees with our intuition, with innocuous or generally non-offensive actions ranked at the top and actions that involve doing significant harm to others ranked at the bottom.

Interestingly, the rankings of Trump and Clinton supporters generally agree, with an 84% correlation in ranking values.
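The Trump_Clinton column and a correlation of this kind can be computed as sketched below. The five values per group are read from the first five rows of Table 1 and serve only as an illustration. The 84% figure comes from all 150 actions, so the correlation on this small subset will differ.

```r
# Ranks for the first five actions in Table 1 (illustrative subset only).
trump   <- c(5, 1, 6.5, 15.5, 2.5)
clinton <- c(2, 2, 5.5, 2, 15)

# Trump_Clinton: positive values mean Trump supporters ranked the action as more offensive.
trump_clinton <- trump - clinton
trump_clinton

# Agreement between the two sets of rankings.
cor(trump, clinton)

# Actions where the two groups diverge by more than 30 ranks.
which(abs(trump_clinton) > 30)
```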

From the differences in rankings of actions between Clinton and Trump supporters we start to get an idea of how these two different groups think. The largest difference is 68 ranks, with Trump supporters much more accepting than Clinton supporters of "Publicly renouncing homosexuality while secretly having sex with gay prostitutes". Clinton supporters seem less tolerant of deception in general, ranking exaggerations of penis size and the writing of false reviews on business webpages as much more offensive than their Trump counterparts do.

Clinton supporters also find it generally more offensive to do harm to others such as dropping the atomic bomb on civilian populations and torturing terrorists or criminals.

The two camps have fiercely different perspectives on money. Clinton supporters are very concerned with the perceived injustices caused by unequal access to resources. These supporters are much more prone to rank as more offensive inequality in earnings as well as politicians accepting public donations. These supporters find it much more offensive for a doctor to refuse treatment on the basis of insufficient funds.

There also appears to be a difference in how concerned the two groups are with racial inequality, with Trump supporters much less concerned with the act of rejecting an applicant due to the color of their skin and less likely to mind the use of racial epithets.

Perhaps unsurprisingly, for the largest scandals Trump supporters and Clinton supporters seem to disagree about how offensive the actions of their candidates are. Clinton supporters find the surprise groping of genitalia one of the worst actions someone can take, while Trump supporters place it a bit lower on the list. For Trump supporters, improperly storing national secrets and using a private email server are ranked as much more offensive than for Clinton supporters.

While Clinton and Trump supporters mostly agree in how they rank objectionable actions, they do seem to disagree in some areas in ways that are consistent with how the two groups are popularly represented. Clinton supporters are concerned with economic, social, and political justice. Trump supporters are concerned with protecting economic rights as well as individual freedoms such as the right to offend others through word or deed.

Thursday, April 14, 2016

Calculating Average Consumption From One Week of Purchases

A number of large surveys have attempted to quantify consumer consumption from a limited period of time observed. This task can be fairly complex as it is fraught with potentially large difficulties in directly observing who is consuming what. Rather than use this expensive method, some researchers have attempted to substitute more easily observed purchase patterns, inferring that in general households are going to consume what they purchase.

In order to aid in this analysis researchers collect data on both what is purchased and over what period of time it is to be consumed, for instance today (1) or over the next week (7).

Yet purchase patterns can be difficult to work with. Typically household purchases do not map perfectly to household consumption. For one, households can consume stocks from previous weeks. Likewise, households can purchase food to be held in stock for future weeks.

In order to adjust for missing consumption levels we want to adjust consumption to account for both the food items that will not all be consumed during the week of observation
$$ C_{current.purchases} = C_{purchase} \frac{1}{consumption.period} \qquad (1)$$

as well as the food items that were purchased the previous week and consumed this week. Here consumption.period is the number of days over which the purchase is to be consumed. We can calculate the probability of observing an individual purchase in the following way:
$$ P_{observing.purchase} = \frac{observation.period}{consumption.period} \qquad (2)$$

If the probability of observing a particular purchase is greater than 1 it need only be set to 1, since in that case it is likely that this particular purchase will appear one or more times in our data.

Now we can combine (1) and (2) by dividing the current purchases by the likelihood of observing those purchases.

$$ E(C_{current.purchases}) = C_{purchase} \frac{1}{consumption.period}/\frac{observation.period}{consumption.period}$$
$$=C_{purchase} \frac{1}{observation.period} \qquad (3)$$

We use equation (3) if the probability is less than 1; otherwise we can use equation (1).
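The adjustment described above can be sketched as a small R function: divide the purchase by its consumption period to get a daily amount, then divide by the probability of observing the purchase, capped at 1. The function name and argument names are hypothetical, and a seven day observation period is assumed as the default.

```r
# Hypothetical helper implementing the probability adjustment described above.
adjusted_daily <- function(purchase, consumption_period, observation_period = 7) {
  # Probability of observing the purchase, capped at 1.
  p_obs <- pmin(1, observation_period / consumption_period)
  # Daily consumption from the purchase, scaled up by the inverse probability.
  (purchase / consumption_period) / p_obs
}

# 4 units bought to last 30 days: probability 7/30, adjusted value 4/7 per day.
adjusted_daily(4, 30)

# 3 units bought to last 3 days: probability capped at 1, so simply 3/3 per day.
adjusted_daily(3, 3)
```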

Finally, in order to calculate average consumption we take the daily average of our estimated expected consumption levels, right?

Not even close. This only begins to capture the problem as we have multiple purchases often on different days consumed in different patterns throughout the week.

In order to get us closer to the appropriate level of estimated consumption we need to both infer the missing consumption as well as spread out the observed consumption so that when we look at daily averages good A purchased on day 1 with an expected consumption period of 1 week will also be included with good B purchased on day 7.

In order to explore how to estimate consumption from only observing a limited period of time I have written a simulation testing four methods of estimation. The true consumption level for any individual is 1 unit. If there are multiple goods consumed then that 1 unit of consumption is spread across all goods so that every day only one unit is consumed.

Using only one good we get the following results. M1 is just taking the mean consumption if we divide quantity of goods purchased by number of days expected to consume. M2 is adjusting consumption by the inverse of the likelihood of observing that consumption. M3 is spreading consumption across all of the days of the week observed. M4 is both adjusting by likelihood of observations and spreading consumption across days of the week observed.

Table 1: Sim # is the simulation number, # Items is the number of different food items purchased, and C Spread is the number of days consumption of that item is spread over. All values are simulated 250 times.

Sim # | # Items | C Spread | M1 | M2 | M3 | M4

Notice that with only 1 item consumed M1 and M3 are equivalent and M2 and M4 are equivalent. We can see that expected consumption for M2 and M4 provide much better estimates than for M1 and M3 when the consumption is spread out for goods for more than the observation period of one week on average.

Things get much more difficult when we include other goods in our calculation.

Table 2: Equivalent to Table 1 except now multiple items are being purchased at different periods (identified as # Items). In this the C Spread only refers to the first item. The remaining items are drawn randomly from the possible consumption spreads with much greater weight applied to lower consumption levels.
Sim # | # Items | C Spread | M1 | M2 | M3 | M4

When consuming multiple items simultaneously, spreading consumption out across all days observed becomes increasingly important. This is because daily consumption needs to be calculated as the sum of goods consumed each day, averaged across the number of days observed. Thus, while M2 does very well in Table 1, in Table 2 M3 and M4 do much better than either M1 or M2, and M4 does slightly better than any other method at approximating total consumption.

Figure 1: Estimator performance given different item consumption spreads. The above values are for the estimator value averaged across between 1 and 6 items consumed with only the first item being at that particular spread value. M is method 1 through 4 described above.
It is worth noting that all of the methods underestimate total consumption though M4 does the best at adjusting for the missing data problem.

There are some things to consider when estimating consumption data in this way. One important thing is that if consumption tends to be for goods consumed over a long period of time then using anything but directly dividing by the period of time expected to be consumed over is going to give some pretty lumpy values.

For instance, imagine someone buys four liters of oil which they expect to consume over the next 30 days. On average, in order to account for the oil not observed for the many other similar people who bought their oil in previous periods, you may want to divide the oil not by the thirty days (4L/30days) but by the probability adjustment value from equation (3).

Thus you get (4L/7days). Averaging this person with three similar people who did not happen to purchase oil that week, you approximate the population consumption level (4L/7days*1/4 = 1/7 of a liter per day, against a true level of 4L/30days = 2/15). Thus on average for the population estimate, you are pretty close. However for that one guy in your data you now have one person who looks like they are consuming 4/7 of a liter of oil per day.

When screening your data for outliers this oil consumption positively pops out of the page at you. So you figure it is some kind of recording error and replace it using population estimates.

But the problem here is entirely created by the method used to infer consumption levels. If instead you had taken the consumer at his/her reported level and said that average consumption for that individual is 4L/30days or 2/15 liters per day then you would never need to substitute out this particular outlier because it would not exist in the data in the first place.
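The oil example can be checked with a quick simulation. The setup is an assumption made for illustration: every person buys 4 liters once every 30 days on a uniformly random day, and we observe 7 of those days. The population average of the adjusted values lands near the true 4/30, while each individual value is either 0 or 4/7.

```r
set.seed(1)
n <- 100000

# Assumed setup: each person buys 4L once every 30 days, on a random day.
purchase_day <- sample(1:30, n, replace = TRUE)
observed     <- purchase_day <= 7      # purchase falls inside the observed week

# Probability-adjusted daily consumption: 4/7 if observed, 0 otherwise.
adjusted <- ifelse(observed, 4 / 7, 0)

mean(adjusted)   # close to the true population level of 4/30
4 / 30
max(adjusted)    # the individual-level value that looks like an outlier
```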

If you would like to review the R simulation used to generate these results you can find it here.

Monday, March 7, 2016

For Whom Will the Michigan Mitt Swing?

Tomorrow, March 8th, Michigan with 130 delegates gets to vote on which of the Democratic candidates, Hillary Clinton or Bernie Sanders, should be the Democratic presidential nominee. Michigan is an important state because it represents a large number of delegates.

Michigan has also been in the news frequently this election season with the poisoning of water in Flint as a result of changes in how the city sources its water. Both Democratic candidates have spent considerable time in the state.

In a recent post I predicted outcomes within states based on the proportion of contributions within those states which have given to either the Sanders campaign or the Clinton campaign. Based on the donor rates within Michigan I predicted a 65% share of the vote would go to Sanders.

As Clinton has adopted a "stay the course", Obama 2.0, campaign strategy, Democrats in Michigan may be more likely to vote for her relative to Democrats in other parts of the country who have not seen the recent growth rates Michigan has experienced.
Figure 1: A map of counties supporting Sanders or Clinton. Donations are mapped to zip code level. Zoom to larger map to see donations indicated as either S for Sanders or C for Clinton. Size of letters corresponds to number of donations from that zip code.
From Figure 1 we can see that by just looking at the number of contributions coming in by county we would expect Michigan to strongly support Bernie Sanders. However, population density is not well captured by county maps. We can see though that there is a strong level of support in the areas surrounding Detroit for both candidates.

Figure 2: Number of contributions in Michigan by contribution size for both candidates, Clinton and Sanders. Note that because the majority of funds are too small to itemize, these estimates underestimate the total funds contributed to Sanders by 70-80% while underestimating funds contributed to Clinton by only 10-20%.
We can see that in terms of total number of contributions, Sanders is strongly outraising Clinton in Michigan by a factor of 2 to 1. However, as has been noted previously, large/wealthy donors disproportionately back Hillary Clinton above all other candidates. When it comes to large sponsors giving more than $1000 to Clinton in Michigan, she has hundreds while Sanders has 15 (too few to appear on the figure).

Figure 3: Total itemized funds contributed by contribution size. Note that because the majority of funds are too small to itemize, these estimates underestimate the total funds contributed to Sanders by 70-80% while underestimating funds contributed to Clinton by only 10-20%.
Though these contributions represent a small proportion of the total contributions to either candidate, they do represent a large portion of the total funds contributed in Michigan. From Figure 3, we can see that those few large contributors make up a large portion of the funds donated in the state.

Who are these large donors?

Figure 4: Industrial backing of donors in Michigan.
As in the country at large, business executives and lawyers are Clinton's largest backers, while health care workers, engineers, artists, academics, and the self-employed form a broad coalition of support for Sanders.

Sunday, March 6, 2016

Clinton's Lack of Public Support Made up by Super-PACs

With only $30 million raised in February, far below the $43 million raised by her rival Bernie Sanders, Hillary Clinton is falling desperately short of public backing.

Fortunately, she has friends in high places. These friends are increasing their backing of her through the quasi-legal independent campaigning structures some of which are known as Super-PACs.

These organizations are a mixed batch. Many of them work for the collective interest of special interest groups, such as the National Nurses United For Patient Protection Super PAC, which backs Bernie Sanders, or the "League of Conservative Voters, Inc", which has spent, for instance, $162,115.70 supporting Hillary Clinton.

These PACs are free to support any candidate without fiscal limit, though they are legally required to act as independent agents not in contact with individual campaigns.

It is easy to understand how Super-PACs could be justified legally. If there are organizations that support particular special interests then shouldn't these organizations have the right to back whatever candidate is also supporting those positions?

However, where things get tricky is when candidates construct Super-PACs for the express purpose of skirting election laws. A famous case called Citizens United v. FEC in 2010 effectively reversed years of campaign finance reform law. An interesting note is that Citizens United Super PAC LLC has so far spent $140k supporting Clinton's campaign. (Interestingly, this same organization reports spending $512k opposing her.)

Anyways, the long and short of it is that super-wealthy donors who are prohibited from donating more than the legal limit to campaigns can set up Super-PACs in order to skirt election laws and back particular candidates. I do not know to what extent this is happening in the current campaign. However, it is important to recognize that a Super-PAC backed by a union composed of thousands of members (such as the nurse PAC supporting Sanders) is distinctly different from the typical organizations that people concerned with Super-PACs are talking about.

Saturday, March 5, 2016

Prediction: 64% Sanders Wins Majority of Pledged Delegates

There are many ways to predict the future. All of them have a fair degree of uncertainty. Nate Silver at FiveThirtyEight uses a measure of ethnicity and political leanings to predict how well Sanders will do in different states. This seems like a sound method to me, though it is not the only way to make predictions.

For the last month I have been playing with campaign contributions data and have seen a strong and steady increase in support for Bernie Sanders across the nation. I have mapped it county by county and the results are quite dramatic.

Yet, what does grassroots support really mean? Does it translate to votes?

In February, with only four states having voted, it was impossible to say how contributions translated to votes. But Super-Tuesday changed all of that!

With fifteen states having voted, we can now see if financial support maps to voting support.
Figure 1: This figure shows the relationship between the percent of support coming from each state (all contributions reported as of January 31st) and the percent of the vote (delegates when the vote is not available) in that state's primary.
From Figure 1, we can see there appears to be a pretty strong relationship between percent of vote actually cast and percent of support coming in for that candidate. Let's try a formal model:
$$Vote = \beta_0 + \beta_1 ContSanders + \beta_2 Primary + \beta_3 Closed$$
Vote is the actual vote in the state. The explanatory variables are the percent contributing to Sanders (ContSanders), Primary, and Closed. Primary and Closed refer to the difference between primary vs caucus voting systems and closed vs open systems. In closed systems only registered Democrats are allowed to vote.

Looking at Table 1, with 79% of the variance explained (r²), we can see that percent of support coming in for a candidate from a state at the end of February is a very good predictor of how the vote will go. Increasing the number of explanatory variables increases the r² to 86%.
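To make the simplest specification concrete (Vote regressed on ContSanders alone, the V1 column), here is a minimal ordinary least squares sketch in Python rather than the author's R code. The function name `fit_line` and the six data points are hypothetical and chosen only for illustration; they are not the post's data. The fuller models would add the Primary and Closed indicators and need a multiple-regression routine.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x.

    Returns (b0, b1, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                       # slope
    b0 = mean_y - b1 * mean_x            # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return b0, b1, 1 - ss_res / ss_tot   # share of variance explained


# Made-up illustrative values, NOT the data behind Table 1.
cont_sanders = [0.30, 0.45, 0.50, 0.62, 0.70, 0.85]  # share of contributions to Sanders
vote = [0.20, 0.35, 0.48, 0.50, 0.61, 0.86]          # share of vote for Sanders

b0, b1, r2 = fit_line(cont_sanders, vote)
print(f"intercept={b0:.3f} slope={b1:.3f} r2={r2:.3f}")
```

A slope near 1 with an intercept near 0 would mean that contribution share maps almost one-for-one onto vote share.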

Table 1: The regression of Vote on ContSanders is V1 while V2 and V3 allow the inclusion of the explanatory variables Primary and Closed.

sig    p < 0.001%    p < 0.01%    p < 0.01%
* Coefficient significant at 10%, ** at 1%, and *** at 0.1%

When examining the coefficient on ContSanders it is useful to reflect that while this value is statistically very different from zero, the point estimate is reasonably close to 1. This is the target number we would like if we were to directly interpret the proportion of supporters in a state as a good indicator of the proportion of the population in that state supporting Sanders. This interpretation does not make sense, however, since most of the contributions to the Sanders campaign (72% of them) are not itemized (and thus not included in this analysis) because they are less than the FEC threshold of $200, while a much smaller share (about 12%) of those for the Clinton campaign are not itemized.

A priori, I did not have any hypothesis as to how the Primary vs Caucus method was going to play out, though I did expect those states with Closed voting to be less likely to vote for Sanders, as he is strongly favored among independents.

From these coefficients we can now make predictions about how the rest of the states would vote if all of the states voted today (well, really March 1st, Super Tuesday). The results give a point estimate of 42% of primary locations going to Sanders, with a total expected number of pledged delegates of 1740 to Clinton's 2285. So Clinton is expected to win??

But wait! Not quite so fast!

The election will not be held tomorrow. The momentum has been strongly with Sanders and it should be expected to stay strongly with Sanders.
Figure 2: Percent of contributions going to Sanders relative to that of Clinton in the South. Interestingly, DC is the largest proportional supporter of Hillary. That is because it is the state/district which best exemplifies Hillary's primary backers: the wealthy. States with black outlines have yet to vote.
Figure 3: Percent of contributions going to Sanders relative to that of Clinton in the Northeast. States with black outlines have yet to vote.
Figure 4: Percent of contributions going to Sanders relative to that of Clinton in the Midwest. States with black outlines have yet to vote.
Figure 5: Percent of contributions going to Sanders relative to that of Clinton in the West. States with black outlines have yet to vote.

From Figures 2 through 5, we can see that support for Sanders has been steadily increasing in all areas of the country. The regions most friendly to Clinton are the South and the Northeast while those most friendly to Sanders are the West and the Midwest.

The South happens to be the region least supportive of the Sanders campaign, though it has had more votes so far than all other regions combined. Thus we may be getting a distorted picture of how the primary season may go based on how these first few states have voted.

We can fit a simple line to each state and then assume that growth in support will continue at a steady pace until the primary in that state.
Figure 6: Predicted support at time of primary mapped against support at end of January. ZZ denotes Democrats Abroad.
Predicting support based on the historic rate of donations predicts that almost all states will have greater support for Sanders than they did at the end of January. States which have experienced more growth in support for Sanders or have later primaries tend to end up further to the right on the graph. The diagonal line shows what would happen if there were no growth in support for Sanders over time.
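The per-state extrapolation described above can be sketched as follows, in Python rather than the author's R. The function names, the state labels, the observation days, and every number here are hypothetical. Clipping the projection to the range 0 to 1 is my own assumption, added because a straight line extended far enough will leave that range.

```python
def linear_trend(days, shares):
    """Least-squares line through (day, share) points; returns (intercept, slope)."""
    n = len(days)
    mean_d = sum(days) / n
    mean_s = sum(shares) / n
    sxx = sum((d - mean_d) ** 2 for d in days)
    sxy = sum((d - mean_d) * (s - mean_s) for d, s in zip(days, shares))
    slope = sxy / sxx
    return mean_s - slope * mean_d, slope


def project_share(days, shares, primary_day):
    """Extend the fitted line to the primary date, clipped to a valid share."""
    intercept, slope = linear_trend(days, shares)
    projected = intercept + slope * primary_day
    return min(1.0, max(0.0, projected))


# Hypothetical history: Sanders' share of contributions on given days.
history = {
    "State A": ([0, 30, 60, 90], [0.40, 0.44, 0.48, 0.52]),  # steady growth
    "State B": ([0, 30, 60, 90], [0.55, 0.55, 0.56, 0.56]),  # nearly flat
}
primary_day = {"State A": 150, "State B": 120}  # hypothetical primary dates

projected = {
    state: project_share(days, shares, primary_day[state])
    for state, (days, shares) in history.items()
}
for state, share in projected.items():
    print(f"{state}: projected share at primary = {share:.3f}")
```

A state with faster growth or a later primary moves further from its last observed share, which is the pattern the figure describes.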

Using these new expected support levels at the time of the primaries that have already happened we can fit a new model.

Table 2: This table shows the results of using expected proportion of Sanders supporters as predictors for election results rather than actual support as of the end of January.
sig    p < 0.001%    p < 0.01%    p < 0.01%

From Table 2, we can see that using predicted Sanders support rather than the support last observed at the end of January gives us a slightly higher r². However, those of us familiar with estimation will immediately realize that we have introduced a new level of uncertainty into the data. This is because we are using an estimated value to estimate yet another value.

Ignoring estimation uncertainty, using the best fit model I predict Sanders will get 57% of the pledged delegates. However, point estimates in statistics are almost never true. In order to estimate the error in the process I simulate randomly sampling from the distribution of possible coefficients for the Intercept, ContSanderP, Primary, and Closed coefficients and predict delegate distribution. 72% of the time Sanders is expected to get the majority of pledged delegates.

Yet, I have ignored the error in estimating the support for Sanders. Rather than doing something more complicated, I increase the standard error on all coefficients by a factor of 150% and simulate the delegate distribution again. Under this situation I predict Sanders will take the majority of the pledged delegates 64% of the time. (Note that the more you increase the standard error, the more watered down the predictions become, until all you have is a 50-50 chance of Sanders winning.)
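The simulation idea can be sketched in Python as below. This is not the author's R code. Every coefficient, standard error, and state record is made up for illustration. Coefficients are drawn as independent normals, an assumption that ignores any covariance between them. I read "a factor of 150%" as multiplying each standard error by 1.5, which is only one possible reading, and delegates are allocated in proportion to the predicted vote.

```python
import random


def majority_probability(coefs, std_errors, states, se_scale=1.0,
                         n_sims=2000, seed=1):
    """Share of simulations in which Sanders wins over half the pledged delegates."""
    rng = random.Random(seed)
    total = sum(s["delegates"] for s in states)
    wins = 0
    for _ in range(n_sims):
        # Independent normal draw per coefficient (assumption: no covariance).
        draw = {k: rng.gauss(coefs[k], std_errors[k] * se_scale) for k in coefs}
        sanders = 0.0
        for s in states:
            share = (draw["Intercept"]
                     + draw["ContSanderP"] * s["cont"]
                     + draw["Primary"] * s["primary"]
                     + draw["Closed"] * s["closed"])
            share = min(1.0, max(0.0, share))   # keep the vote share valid
            sanders += share * s["delegates"]   # proportional allocation
        if sanders > total / 2:
            wins += 1
    return wins / n_sims


# Hypothetical coefficients and standard errors, NOT those in Table 2.
coefs = {"Intercept": 0.05, "ContSanderP": 0.90, "Primary": -0.05, "Closed": -0.04}
std_errors = {"Intercept": 0.05, "ContSanderP": 0.10, "Primary": 0.03, "Closed": 0.03}

# Hypothetical states: projected contribution share, system dummies, delegates.
states = [
    {"cont": 0.60, "primary": 1, "closed": 0, "delegates": 100},
    {"cont": 0.55, "primary": 1, "closed": 1, "delegates": 150},
    {"cont": 0.70, "primary": 0, "closed": 0, "delegates": 50},
]

p_base = majority_probability(coefs, std_errors, states)
p_inflated = majority_probability(coefs, std_errors, states, se_scale=1.5)
print(f"P(majority), estimated SEs:      {p_base:.2f}")
print(f"P(majority), SEs scaled by 1.5:  {p_inflated:.2f}")
```

Setting `se_scale=0` collapses every draw to the point estimate, so the result is simply whether the point prediction clears half the delegates. Larger scales spread the draws and tend to pull the probability toward 50-50, as the post notes.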

It will come as no surprise to anybody that I am an avid Bernie Sanders supporter. The level of corruption and deceit that seems endemic to the Clinton campaign, combined with the consistent upright behavior and spot-on messages of the Sanders campaign, makes my endorsement of Sanders very easy. I might have considered supporting one of the Republican candidates; however, Trump seems to be cleaning house.

It might come as some surprise that a self-described economist would openly support Sanders. However, the exaggerated claims that "economists" are opposed to Sanders do not add up when you look at the actual financial support Sanders has received from economists. As of the end of January, Sanders has logged 155 contributions from economists compared with Clinton's 189. That is to say, 45% of contributions made to either campaign by economists have gone to the Sanders campaign.

So it was frustrating for me to see that Sanders seemed to be already falling behind in the pledged delegates for these first primary states. However, a few nights ago I built the models and crunched the numbers and was much relieved to find that Sanders was predicted not only to do well, but also to win the popular vote and the majority of pledged delegates!

I know there is much uncertainty in any kind of prediction, especially in an election season as surprising as this one. Thus, I caution against reading too much into this prediction or really any predictions that are coming out. Frankly, this model using contribution data seems to fit the data remarkably well and the results are encouraging. But even if the model predicted Clinton would take the pledged delegates and the popular vote, I would also strongly caution against reading too much into such predictions.

Only 15 states have voted representing only 25% of the pledged delegates. All the while those who learn about Sanders seem to like him more and more while for Clinton the phenomenon seems to be going the other way.

As for my code, I am happy to release it; however, it is not in good condition right now. I will have to revise it for public posting. That might take a few days, but I wanted to get this out right now.

End Note:
I could imagine someone saying, "What do pledged delegates matter anyways since so many delegates are determined by unpledged or super-delegates?"

I can tell you this for certain: if Clinton does not win by taking the majority of pledged delegates but rather through internal party politics, then it is highly unlikely the majority of Sanders supporters are going to support her in the general election. Already, many of us are put off by the games the DNC has been playing, first restricting the debate schedule so as to minimize air time for Democratic challengers to Clinton, then trying to cut Sanders off from access to the voter database.

This behavior, coupled with ongoing ethical and potential legal violations by the Clinton campaign, has given Sanders supporters a very strong dislike for underhanded tricks. Using party insiders to win the nomination against the will of the electorate would be seen as intolerable.

Just for fun I have included the following list of predicted outcomes and actual outcomes as well as the predicted number of pledged delegates if all pledged delegates were distributed proportionately to the vote. Notice that the predicted outcomes for any given state can vary quite significantly. However, it is only across states that we hope to come up with a cumulative expected outcome that may be reasonable.

State    Primary Date    Prediction    TRUE    Pledged Sanders    Pledged Clinton    N