Thursday, April 14, 2016

Calculating Average Consumption From One Week of Purchases

A number of large surveys have attempted to quantify household consumption from a limited period of observation. This task can be fairly complex, as it is difficult to directly observe who is consuming what. Rather than use this expensive method, some researchers have substituted more easily observed purchase patterns, inferring that in general households are going to consume what they purchase.

In order to aid in this analysis researchers collect data on both what is purchased and over what period of time it is to be consumed, for instance today (1) or over the next week (7).

Yet purchase patterns can be difficult to work with. Typically household purchases do not map perfectly to household consumption. For one, households can consume stocks from previous weeks. Likewise, households can purchase food to be held in stock for future weeks.

In order to adjust for missing consumption we want to account both for the food items that will not all be consumed during the week of observation
(1)
$$ C_{\text{current.purchases}} = C_{\text{purchase}} \frac{\text{days.remaining.in.observation.period}}{\text{days.expected.to.consume}} $$

as well as the food items that were purchased the previous week and consumed this week. We can calculate the probability of observing an individual purchase in the following way:
(2)
$$ P_{\text{observing.purchase}} = \frac{\text{observation.period}}{\text{days.expected.to.consume}} $$

Note that if this ratio is greater than 1, the probability need only be set to 1, since in that case the purchase is likely to appear one or more times in our data.

Now we can combine (1) and (2) by dividing the current purchases by the likelihood of observing those purchases.

(3)
$$ E(C_{\text{current.purchases}}) = C_{\text{purchase}} \frac{\text{days.remaining.in.observation.period}}{\text{days.expected.to.consume}} \Big/ \frac{\text{observation.period}}{\text{days.expected.to.consume}} $$
$$ = C_{\text{purchase}} \frac{\text{days.remaining.in.observation.period}}{\text{observation.period}} $$

Equation (3) applies if the probability is less than 1; otherwise we can use equation (1).
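As a rough sketch, equations (1) through (3) can be written as one small Python function. The function and argument names are illustrative assumptions, not part of the original analysis. The code follows the equations literally, so it caps the probability in (2) at 1 but does not cap the ratio in (1).

```python
def expected_current_consumption(quantity, days_remaining, days_to_consume,
                                 observation_period=7):
    """Adjust one purchase for the chance of observing it (equations 1-3).

    All argument names are hypothetical labels for the terms in the equations.
    """
    # Equation (1): part of the purchase consumed inside the observation window.
    current = quantity * days_remaining / days_to_consume
    # Equation (2): probability of observing the purchase, set to 1 if larger.
    p_observe = min(1.0, observation_period / days_to_consume)
    # Equation (3): divide by the probability of observing the purchase.
    # When p_observe is 1 this is simply equation (1).
    return current / p_observe


# A 30-day purchase of 4 units bought on the first day of a 7-day window.
long_lived = expected_current_consumption(4, days_remaining=7, days_to_consume=30)
# A 3-day purchase of 3 units bought with 2 days left in the window.
short_lived = expected_current_consumption(3, days_remaining=2, days_to_consume=3)
print(long_lived, short_lived)
```

For the long-lived good the days expected to consume cancel, leaving quantity times days remaining over the observation period, as in the second line of equation (3).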

Finally, in order to calculate average consumption we take the daily average of our estimated expected consumption levels, right?

Not even close. This only begins to capture the problem as we have multiple purchases often on different days consumed in different patterns throughout the week.


To get closer to the appropriate level of estimated consumption we need both to infer the missing consumption and to spread the observed consumption across the days observed, so that when we look at daily averages, good A purchased on day 1 with an expected consumption period of one week is counted alongside good B purchased on day 7.

In order to explore how to estimate consumption from only observing a limited period of time, I have written a simulation testing four methods of estimation. The true consumption level for any individual is 1 unit per day. If there are multiple goods consumed, then that 1 unit of consumption is spread across all goods so that every day only one unit is consumed.

Using only one good we get the following results. M1 is just taking the mean consumption if we divide quantity of goods purchased by number of days expected to consume. M2 is adjusting consumption by the inverse of the likelihood of observing that consumption. M3 is spreading consumption across all of the days of the week observed. M4 is both adjusting by likelihood of observations and spreading consumption across days of the week observed.

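Table 1: Sim # is the simulation number while # Items is the number of different food items purchased while C Spread is the number of days consumption of that item is spread over. All values are simulated 250 times.

| Sim # | # Items | C Spread | M1 | M2 | M3 | M4 |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | 1 | 2 | 1.00 | 1.00 | 1.00 | 1.00 |
| 3 | 1 | 3 | 1.00 | 1.00 | 1.00 | 1.00 |
| 4 | 1 | 5 | 1.00 | 1.00 | 1.00 | 1.00 |
| 5 | 1 | 6 | 1.00 | 1.00 | 1.00 | 1.00 |
| 6 | 1 | 7 | 1.00 | 1.00 | 1.00 | 1.00 |
| 7 | 1 | 8 | 0.88 | 1.01 | 0.88 | 1.01 |
| 8 | 1 | 9 | 0.74 | 0.95 | 0.74 | 0.95 |
| 9 | 1 | 10 | 0.70 | 1.00 | 0.70 | 1.00 |
| 10 | 1 | 15 | 0.42 | 0.89 | 0.42 | 0.89 |
| 11 | 1 | 20 | 0.35 | 1.01 | 0.35 | 1.01 |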

Notice that with only 1 item consumed, M1 and M3 are equivalent and M2 and M4 are equivalent. We can see that M2 and M4 provide much better estimates than M1 and M3 when consumption of the good is spread over more than the one-week observation period.
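The single-item pattern in Table 1 can be checked with a short Python sketch. This is not the R simulation behind the table. It assumes a household buys `spread` units every `spread` days, consumes one unit per day, and is equally likely to be at any point of that purchase cycle when the 7-day window opens. Rather than drawing random households, it enumerates every possible purchase day, so it returns expectations instead of noisy simulated means.

```python
def single_good_estimates(spread, window=7):
    """Expected M1 and M2 for one good under the assumptions stated above."""
    # Probability of seeing the purchase inside the window, capped at 1.
    p_observe = min(1.0, window / spread)
    m1_total = 0.0
    # offset = days from the start of the window to the next purchase.
    for offset in range(spread):
        observed = offset < window
        # M1: quantity purchased divided by days expected to consume.
        # Here that is spread units over spread days, or 0 if nothing is seen.
        m1_total += (spread / spread) if observed else 0.0
    m1 = m1_total / spread
    # M2: M1 scaled by the inverse of the likelihood of observing the purchase.
    m2 = m1 / p_observe
    return m1, m2


for spread in (5, 8, 10, 20):
    m1, m2 = single_good_estimates(spread)
    print(spread, round(m1, 3), round(m2, 3))
```

Under these assumptions M1 falls to 7 divided by the spread once the spread exceeds the window, while M2 stays at 1, which is roughly the pattern in the M1 and M2 columns of Table 1.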

Things get much more difficult when we include other goods in our calculation.

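Table 2: Equivalent to Table 1 except now multiple items are being purchased at different periods (identified as # Items). In this the C Spread only refers to the first item. The remaining items are drawn randomly from the possible consumption spreads with much greater weight applied to lower consumption levels.

| Sim # | # Items | C Spread | M1 | M2 | M3 | M4 |
|---|---|---|---|---|---|---|
| 12 | 2 | 1 | 0.72 | 0.72 | 0.90 | 0.91 |
| 13 | 2 | 2 | 0.71 | 0.71 | 0.90 | 0.91 |
| 14 | 2 | 3 | 0.71 | 0.71 | 0.91 | 0.92 |
| 15 | 2 | 5 | 0.64 | 0.65 | 0.87 | 0.89 |
| 16 | 2 | 6 | 0.66 | 0.67 | 0.83 | 0.84 |
| 17 | 2 | 7 | 0.60 | 0.60 | 0.80 | 0.81 |
| 18 | 2 | 8 | 0.59 | 0.62 | 0.78 | 0.83 |
| 19 | 2 | 9 | 0.59 | 0.65 | 0.74 | 0.83 |
| 20 | 2 | 10 | 0.58 | 0.65 | 0.70 | 0.81 |
| 21 | 2 | 15 | 0.58 | 0.71 | 0.64 | 0.83 |
| 22 | 2 | 20 | 0.54 | 0.69 | 0.59 | 0.83 |
| 23 | 3 | 1 | 0.62 | 0.62 | 0.86 | 0.88 |
| 24 | 3 | 2 | 0.61 | 0.62 | 0.88 | 0.89 |
| 25 | 3 | 3 | 0.58 | 0.59 | 0.87 | 0.88 |
| 26 | 3 | 5 | 0.53 | 0.53 | 0.83 | 0.84 |
| 27 | 3 | 6 | 0.55 | 0.55 | 0.81 | 0.82 |
| 28 | 3 | 7 | 0.50 | 0.51 | 0.78 | 0.79 |
| 29 | 3 | 8 | 0.52 | 0.53 | 0.78 | 0.82 |
| 30 | 3 | 9 | 0.52 | 0.54 | 0.75 | 0.81 |
| 31 | 3 | 10 | 0.50 | 0.54 | 0.72 | 0.80 |
| 32 | 3 | 15 | 0.49 | 0.55 | 0.68 | 0.82 |
| 33 | 3 | 20 | 0.47 | 0.53 | 0.63 | 0.76 |
| 34 | 4 | 1 | 0.58 | 0.58 | 0.84 | 0.85 |
| 35 | 4 | 2 | 0.55 | 0.55 | 0.85 | 0.86 |
| 36 | 4 | 3 | 0.52 | 0.52 | 0.84 | 0.85 |
| 37 | 4 | 5 | 0.48 | 0.48 | 0.81 | 0.82 |
| 38 | 4 | 6 | 0.49 | 0.49 | 0.80 | 0.82 |
| 39 | 4 | 7 | 0.46 | 0.46 | 0.77 | 0.78 |
| 40 | 4 | 8 | 0.46 | 0.48 | 0.76 | 0.79 |
| 41 | 4 | 9 | 0.48 | 0.49 | 0.76 | 0.81 |
| 42 | 4 | 10 | 0.48 | 0.50 | 0.73 | 0.79 |
| 43 | 4 | 15 | 0.45 | 0.49 | 0.70 | 0.79 |
| 44 | 4 | 20 | 0.44 | 0.47 | 0.68 | 0.78 |
| 45 | 5 | 1 | 0.56 | 0.57 | 0.85 | 0.86 |
| 46 | 5 | 2 | 0.52 | 0.52 | 0.84 | 0.85 |
| 47 | 5 | 3 | 0.49 | 0.49 | 0.83 | 0.84 |
| 48 | 5 | 5 | 0.45 | 0.45 | 0.79 | 0.80 |
| 49 | 5 | 6 | 0.45 | 0.46 | 0.78 | 0.79 |
| 50 | 5 | 7 | 0.45 | 0.45 | 0.78 | 0.79 |
| 51 | 5 | 8 | 0.44 | 0.45 | 0.76 | 0.78 |
| 52 | 5 | 9 | 0.44 | 0.45 | 0.76 | 0.79 |
| 53 | 5 | 10 | 0.44 | 0.45 | 0.74 | 0.79 |
| 54 | 5 | 15 | 0.44 | 0.46 | 0.71 | 0.77 |
| 55 | 5 | 20 | 0.42 | 0.45 | 0.69 | 0.76 |
| 56 | 6 | 1 | 0.52 | 0.53 | 0.83 | 0.85 |
| 57 | 6 | 2 | 0.49 | 0.49 | 0.83 | 0.85 |
| 58 | 6 | 3 | 0.47 | 0.48 | 0.82 | 0.83 |
| 59 | 6 | 5 | 0.44 | 0.45 | 0.79 | 0.81 |
| 60 | 6 | 6 | 0.44 | 0.45 | 0.80 | 0.81 |
| 61 | 6 | 7 | 0.43 | 0.44 | 0.78 | 0.79 |
| 62 | 6 | 8 | 0.45 | 0.46 | 0.76 | 0.79 |
| 63 | 6 | 9 | 0.44 | 0.45 | 0.77 | 0.80 |
| 64 | 6 | 10 | 0.42 | 0.44 | 0.73 | 0.78 |
| 65 | 6 | 15 | 0.43 | 0.45 | 0.74 | 0.82 |
| 66 | 6 | 20 | 0.41 | 0.43 | 0.70 | 0.76 |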

When multiple items are consumed simultaneously, spreading consumption out across all days observed becomes increasingly important. This is because daily consumption needs to be calculated as the sum of goods consumed each day, averaged across the number of days observed. Thus we see that while M2 does very well in Table 1, in Table 2 M3 and M4 do much better than either M1 or M2, and M4 does slightly better than any other method at approximating total consumption.

Figure 1: Estimator performance given different item consumption spreads. The above values are for the estimator value averaged across between 1 and 6 items consumed with only the first item being at that particular spread value. M is method 1 through 4 described above.

It is worth noting that all of the methods underestimate total consumption, though M4 does the best at adjusting for the missing data problem.

There are some things to consider when estimating consumption data in this way. One important thing is that if consumption tends to be for goods consumed over a long period of time then using anything but directly dividing by the period of time expected to be consumed over is going to give some pretty lumpy values.

For instance, imagine someone buys four liters of oil which they expect to consume over the next 30 days. Sure, on average, in order to account for the oil not observed for the many other similar people who bought their oil in previous periods, you may want to divide the oil not by the thirty days (4L/30days) but by the probability-adjusted value from equation (3).

Thus you get 4L/7days. Averaging this person with three similar people who did not happen to purchase oil that week, you approximate the population consumption level (4L/7days × 1/4 = 1/7 liters per day, close to the true 4L/30days = 2/15). Thus on average, for the population estimate, you are pretty close. However, for that one person in your data, you now have someone who looks like they are consuming 4/7 of a liter of oil per day.

When screening your data for outliers this oil consumption positively pops out of the page at you. So you figure it is some kind of recording error and replace it using population estimates.

But the problem here is entirely created by the method used to infer consumption levels. If instead you had taken the consumer at his/her reported level and said that average consumption for that individual is 4L/30days or 2/15 liters per day then you would never need to substitute out this particular outlier because it would not exist in the data in the first place.
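The size of that artificial outlier is easy to check with exact fractions. This is a small illustrative sketch using the numbers from the oil example; the variable names are assumptions.

```python
from fractions import Fraction

liters = Fraction(4)        # oil purchased
days_to_consume = 30        # reported consumption period
observation_period = 7      # one week of observed purchases

# Taking the consumer at their word: 4L over 30 days.
reported_rate = liters / days_to_consume
# Probability-adjusted value: the same 4L assigned to the 7 observed days.
adjusted_rate = liters / observation_period
# How much larger the adjusted individual value is than the reported one.
inflation = adjusted_rate / reported_rate

print(reported_rate, adjusted_rate, inflation)
```

The adjusted individual value is 30/7 times the reported one, a bit more than four times as large, which is why it stands out during outlier screening even though the population average is about right.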

If you would like to review the R simulation used to generate these results you can find it here.

Monday, March 7, 2016

For Whom Will the Michigan Mitt Swing?

Tomorrow, March 8th, Michigan, with 130 delegates, gets to vote on which of the Democratic candidates, Hillary Clinton or Bernie Sanders, should be the Democratic presidential nominee. Michigan is an important state because it represents a large number of delegates.

Michigan has also been in the news frequently this election season with the poisoning of water in Flint as a result of changes in how the city sources its water. Both Democratic candidates have spent considerable time in the state.

In a recent post I predicted outcomes within states based on the proportion of contributions within those states which have given to either the Sanders campaign or the Clinton campaign. Based on the donor rates within Michigan I predicted a 65% share of the vote would go to Sanders.

As Clinton has adopted a "stay the course", Obama 2.0, campaign strategy, Democrats in Michigan may be more likely to vote for her relative to Democrats in other parts of the country who have not seen the recent growth rates Michigan has experienced.
Figure 1: A map of counties supporting Sanders or Clinton. Donations are mapped to zip code level. Zoom to the larger map to see donations indicated as either S for Sanders or C for Clinton. The size of each letter corresponds to the number of donations from that zip code.
From Figure 1 we can see that by just looking at the number of contributions coming in by county we would expect Michigan to strongly support Bernie Sanders. However, population density is not well captured by county maps. We can see though that there is a strong level of support in the areas surrounding Detroit for both candidates.

Figure 2: Number of contributions in Michigan by contribution size for both candidates, Clinton and Sanders. Note that because the majority of contributions are too small to itemize, these estimates understate the total funds contributed to Sanders by 70-80% while understating funds contributed to Clinton by only 10-20%.
We can see that in terms of total number of contributions, Sanders is strongly outraising Clinton in Michigan by a factor of 2 to 1. However, as has been noted previously, large/wealthy donors disproportionately back Hillary Clinton above all other candidates. When it comes to large sponsors giving more than $1000 to Clinton in Michigan, she has hundreds while Sanders has 15 (too few to appear on the figure).

Figure 3: Total itemized funds contributed by contribution size. Note that because the majority of contributions are too small to itemize, these estimates understate the total funds contributed to Sanders by 70-80% while understating funds contributed to Clinton by only 10-20%.
Though these contributions represent a small proportion of the total number of contributions to either candidate, they do represent a large portion of the total funds contributed in Michigan. From Figure 3, we can see that those few large contributors make up a large portion of the funds donated in the state.

Who are these large donors?

Figure 4: Industrial backing of donors in Michigan.
Like the country at large, business executives and lawyers are Clinton's largest backers while health care workers, engineers, artists, academics, and the self-employed form a broad coalition of support for Sanders.


Sunday, March 6, 2016

Clinton's Lack of Public Support Made up by Super-PACs

Hillary Clinton, with only $30 million raised in February, far below the $43 million raised by her rival Bernie Sanders, is falling desperately short of public backing.

Fortunately, she has friends in high places. These friends are increasing their backing of her through the quasi-legal independent campaigning structures some of which are known as Super-PACs.

These organizations are a mixed batch. Many work for the collective interest of special interest groups, such as the National Nurses United For Patient Protection Super PAC, which backs Bernie Sanders, or the "League of Conservative Voters, Inc", which has spent, for instance, $162,115.70 supporting Hillary Clinton.

These PACs are free to support any candidate without fiscal limit, though they are legally required to act as independent agents not in contact with individual campaigns.

It is easy to understand how Super-PACs could be justified legally. If there are organizations that support particular special interests then shouldn't these organizations have the right to back whatever candidate is also supporting those positions?

However, where things get tricky is when candidates construct Super-PACs for the express purpose of skirting election laws. A famous case, Citizens United v. FEC in 2010, effectively reversed years of campaign finance reform law. An interesting note is that Citizens United Super PAC LLC has so far spent $140k supporting Clinton's campaign. (Interestingly, this same organization reports spending $512k opposing her.)

Anyways, the long and short of it is that super-wealthy donors who are prohibited from donating more than the legal limit to campaigns can set up Super-PACs in order to skirt election laws and back particular candidates. I do not know to what extent this is happening in the current campaign. However, it is important to recognize that a Super-PAC backed by a union composed of thousands of members (such as the nurses' PAC supporting Sanders) is distinctly different from the typical organizations that people concerned with Super-PACs are talking about.


Saturday, March 5, 2016

Prediction: 64% Sanders Wins Majority of Pledged Delegates

There are many ways to predict the future. All of them have a fair degree of uncertainty. Nate Silver at FiveThirtyEight uses a measure of ethnicity and political leanings to predict how well Sanders will do in different states. This seems like a sound method to me, though it is not the only way to make predictions.

For the last month I have been playing with campaign contributions data and have seen a strong and steady increase in support for Bernie Sanders across the nation. I have mapped it county by county and the results are quite dramatic.

Yet, what does grassroots support really mean? Does it translate to votes?

In February, with only four states having voted, it was impossible to say how contributions translated to votes. But Super-Tuesday changed all of that!

With fifteen states having voted, we can now see if financial support maps to voting support.
Figure 1: This figure shows the relationship between the percent of support going to Sanders from each state, for all contributions reported as of January 31st, and the percent of the vote (delegates when the vote is not available) he received in that state's primary.
From Figure 1, we can see there appears to be a pretty strong relationship between percent of vote actually cast and percent of support coming in for that candidate. Let's try a formal model:
$$Vote = \beta_0 + \beta_1 ContSanders + \beta_2 Primary + \beta_3 Closed $$
Vote is the actual vote in the state. The explanatory variables are the percent contributing to Sanders (ContSanders), Primary, and Closed. Primary and Closed capture the difference between primary vs caucus voting systems and closed vs open systems. For closed systems only registered Democrats are allowed to vote.

Looking at Table 1, with 79% of the variance explained (r2), we can see that the percent of support coming in for a candidate from a state at the end of January is a very good predictor of how the vote will go. Increasing the number of explanatory variables increases the r2 to 86%.

Table 1: The regression of Vote on ContSanders is V1, while V2 and V3 add the explanatory variables Primary and Closed.

| | V1 | V2 | V3 |
|---|---|---|---|
| (Intercept) | -24.317* | -16.122 | -11.988 |
| (Intercept)_se | {10.175} | {10.551} | {10.352} |
| ContSanders | 1.155*** | 1.118*** | 1.134*** |
| ContSanders_se | {0.165} | {0.155} | {0.147} |
| Primary | | -8.118 | -10.711* |
| Primary_se | | {4.642} | {4.710} |
| Closed | | | -6.883 |
| Closed_se | | | {4.474} |
| r2 | 0.790 | 0.833 | 0.862 |
| r2adj | 0.774 | 0.805 | 0.825 |
| fstat | 48.9 | 29.9 | 23 |
| pstat | 0 | 0 | 0 |
| sig | p < 0.001% | p < 0.01% | p < 0.01% |

* Coefficient significant at 10%, ** at 1%, and *** at 0.1%
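To make the fitted model concrete, here is a small Python sketch that plugs the V3 coefficients from Table 1 into the regression equation. Two things are assumptions: that Primary and Closed are coded as 0/1 dummy variables (1 for a primary rather than a caucus, 1 for a closed contest), and the example state, whose values are hypothetical and do not describe any real state.

```python
# V3 point estimates copied from Table 1.
V3 = {
    "intercept": -11.988,
    "cont_sanders": 1.134,
    "primary": -10.711,
    "closed": -6.883,
}


def predict_vote(cont_sanders, primary, closed, coef=V3):
    """Predicted Sanders vote share (percent) from the linear model.

    cont_sanders: percent of a state's contributions going to Sanders.
    primary, closed: assumed 0/1 indicators (not confirmed by the post).
    """
    return (coef["intercept"]
            + coef["cont_sanders"] * cont_sanders
            + coef["primary"] * primary
            + coef["closed"] * closed)


# Hypothetical state: 60% of contributions to Sanders, open primary.
open_primary = predict_vote(60, primary=1, closed=0)
# The same hypothetical state with a closed primary.
closed_primary = predict_vote(60, primary=1, closed=1)
print(round(open_primary, 1), round(closed_primary, 1))
```

With this coding, closing the contest lowers the predicted share by the Closed coefficient, about 6.9 points.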

When examining the coefficient on ContSanders it is useful to reflect that while this value is statistically very different from zero, the point estimate is reasonably close to 1. This is the target number we would want if we were to directly interpret the proportion of supporters in a state as a good indicator of the proportion of the state's population supporting Sanders. Taken literally, though, this interpretation does not make sense, since most of the contributions to the Sanders campaign (72% of them) are not itemized (and thus not included in this analysis) because they are less than the FEC threshold of $200, while a much smaller share (about 12%) of those to the Clinton campaign are not itemized.

A priori I did not have any hypothesis as to how the Primary vs Caucus method was going to play out though I did expect those states with Closed voting to be less likely to vote for Sanders as he is strongly favored among independents.

From these coefficients we can now make predictions about how the rest of the states would vote if all of the states voted today (well, really March 1st, Super Tuesday). The results give a point estimate of 42% of primary locations going to Sanders, with a total expected number of pledged delegates of 1740 to Clinton's 2285. So Clinton is expected to win??

But wait! Not quite so fast!

The election will not be held tomorrow. The momentum has been strongly with Sanders and it should be expected to stay strongly with Sanders.
Figure 2: Percent of contributions going to Sanders relative to that of Clinton in the South. Interestingly, DC is the largest proportional supporter of Hillary. That is because it is the state/district which best exemplifies Hillary's primary backers, the wealthy. States with black outlines have yet to vote.
Figure 3: Percent of contributions going to Sanders relative to that of Clinton in the Northeast. States with black outlines have yet to vote.
Figure 4: Percent of contributions going to Sanders relative to that of Clinton in the Midwest. States with black outlines have yet to vote.
Figure 5: Percent of contributions going to Sanders relative to that of Clinton in the West. States with black outlines have yet to vote.

From Figures 2 through 5, we can see that support for Sanders has been steadily increasing in all areas of the country. The regions most friendly to Clinton are the South and the Northeast, while those most friendly to Sanders are the West and the Midwest.

The South happens to be the region least supportive of the Sanders campaign, though it has held more votes so far than all other regions combined. Thus we may be getting a distorted picture of how the primary season may go based on how these first few states have voted.

We can fit a simple line to each state's support over time and then assume that growth in support will continue at a steady pace until that state's primary.
Figure 6: Predicted support at time of primary mapped against support at end of January. ZZ is Democrats Abroad.
Projecting support from the historic rate of donations predicts that almost all states will have greater support for Sanders than they did at the end of January. States which have experienced more growth in support for Sanders or have later primaries tend to end up further to the right on the graph. The diagonal line is what happens if there is no growth in support for Sanders over time.
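The extrapolation step can be sketched in a few lines of Python. This is an illustration of fitting a straight line and projecting it forward, not the code behind Figure 6. The day numbers and support percentages are hypothetical, and clamping the result to the 0-100 range is an added assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def extrapolate_support(days, shares, primary_day):
    """Project percent support to the primary date along the fitted line."""
    intercept, slope = fit_line(days, shares)
    predicted = intercept + slope * primary_day
    # Assumption: keep the projection inside a valid percentage range.
    return max(0.0, min(100.0, predicted))


# Hypothetical state: support measured every 30 days, primary on day 150.
days = [0, 30, 60, 90]
shares = [40.0, 43.0, 46.0, 49.0]
print(extrapolate_support(days, shares, primary_day=150))
```

The later the primary, the further the line is projected, which is why states with later primaries move further from the diagonal.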

Using these new expected support levels at the time of the primaries that have already happened we can fit a new model.

Table 2: This table shows the results of using the expected proportion of Sanders supporters as a predictor of election results rather than actual support as of the end of January.

| | V1 | V2 | V3 |
|---|---|---|---|
| (Intercept) | -22.886* | -15.49 | -11.352 |
| (Intercept)_se | {9.539} | {10.37} | {10.178} |
| ContSandersP | 1.053*** | 1.02*** | 1.032*** |
| ContSandersP_se | {0.144} | {0.14} | {0.133} |
| Primary | | -6.92 | -9.484* |
| Primary_se | | {4.63} | {4.688} |
| Closed | | | -6.864 |
| Closed_se | | | {4.435} |
| r2 | 0.805 | 0.835 | 0.865 |
| r2adj | 0.79 | 0.808 | 0.828 |
| fstat | 53.6 | 30.4 | 23.4 |
| pstat | 0 | 0 | 0 |
| sig | p < 0.001% | p < 0.01% | p < 0.01% |

From Table 2, we can see that using predicted Sanders support rather than that last observed at the end of January gives us a slightly higher r2. However, those of us familiar with estimation will immediately realize that we have introduced a new level of uncertainty into the data. This is because we are using an estimated value to estimate yet another value.

Ignoring estimation uncertainty, using the best-fit model I predict Sanders will get 57% of the pledged delegates. However, point estimates in statistics are almost never exactly true. In order to estimate the error in the process I simulate randomly sampling from the distribution of possible coefficients for the Intercept, ContSandersP, Primary, and Closed coefficients and predict the delegate distribution. 72% of the time Sanders is expected to get the majority of pledged delegates.

Yet, I have ignored the error in estimating the support for Sanders. Rather than doing something more complicated instead what I do is increase the standard error on all coefficients by a factor of 150% and simulate the delegate distribution again. Under this situation I predict Sanders will take the majority of the pledged delegates 64% of the time. (Note the more you increase the standard error the more watered down the predictions become until all you have is a 50-50 chance of Sanders winning.)
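The idea of sampling coefficients and then inflating their standard errors can be sketched as follows. This is a simplified Python illustration, not the simulation described above. It uses the V3 estimates and standard errors from Table 2, draws each coefficient independently from a normal distribution (an assumption, since a fitted model's coefficients are correlated), reads "a factor of 150%" as multiplying the standard errors by 1.5 (also an assumption), and looks at one hypothetical state rather than the full delegate count.

```python
import random

# (estimate, standard error) pairs copied from Table 2, model V3.
COEF = {
    "intercept": (-11.352, 10.178),
    "cont": (1.032, 0.133),
    "primary": (-9.484, 4.688),
    "closed": (-6.864, 4.435),
}


def share_of_draws_above_50(cont, primary, closed, se_scale=1.0,
                            draws=20000, seed=1):
    """Fraction of simulated coefficient draws predicting a vote above 50%."""
    rng = random.Random(seed)
    x = {"intercept": 1.0, "cont": cont, "primary": primary, "closed": closed}
    wins = 0
    for _ in range(draws):
        # Draw every coefficient independently and form the prediction.
        vote = sum(rng.gauss(mean, se * se_scale) * x[name]
                   for name, (mean, se) in COEF.items())
        if vote > 50:
            wins += 1
    return wins / draws


# Hypothetical state: 60% expected contribution share, open primary.
base = share_of_draws_above_50(60, primary=1, closed=0)
inflated = share_of_draws_above_50(60, primary=1, closed=0, se_scale=1.5)
print(base, inflated)
```

For this hypothetical state the point prediction is below 50, so widening the standard errors pulls the simulated probability up toward one half, the same watering-down effect noted above.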

Conclusion
It will come as no surprise to anybody that I am an avid Bernie Sanders supporter. The level of corruption and deceit that seems endemic to the Clinton campaign, combined with the consistently upright behavior and spot-on messages of the Sanders campaign, makes my endorsement of Sanders very easy. I might have considered supporting one of the Republican candidates; however, Trump seems to be cleaning house.

It might come as some surprise that a self-described economist would openly support Sanders. However, the exaggerated claims that "economists" are opposed to Sanders do not add up when you look at the actual financial support Sanders has received from economists. As of the end of January, Sanders has logged 155 contributions from economists compared with Clinton's 189. That is to say, 45% of contributions made to either campaign by economists have gone to the Sanders campaign.

So it was frustrating for me to see that Sanders seemed to be already getting behind in the pledged delegates for these first primary states. However, a few nights ago I built the models and crunched the numbers and was much relieved to find that not only was Sanders predicted to do well, but win the popular vote, and the majority of pledged delegates!

I know there is much uncertainty in any kind of prediction, especially in an election season as surprising as this one. Thus, I caution against reading too much into this prediction or really any predictions that are coming out. Frankly, this model using contribution data seems to fit the data remarkably well and the results are encouraging. But even if the model predicted Clinton would take the pledged delegates and the popular vote, I would also strongly caution against reading too much into such predictions.

Only 15 states have voted representing only 25% of the pledged delegates. All the while those who learn about Sanders seem to like him more and more while for Clinton the phenomenon seems to be going the other way.

As for my code, I am happy to release it; however, it is not in good condition right now. I will have to revise it for public posting. That might take a few days, but I wanted to get this out right now.

End Note:
I could imagine someone saying, "What do pledged delegates matter anyways since so many delegates are determined by unpledged or super-delegates?"

I can tell you this for certain: if Clinton does not win by taking the majority of pledged delegates but rather through internal party politics, then it is highly unlikely that the majority of Sanders supporters are going to support her in the general election. Already, many of us are put off by the games the DNC has been playing, first restricting the debate schedule so as to minimize the air time of Democratic challengers to Clinton, then trying to cut Sanders off from access to the voter database.

This behavior, coupled with ongoing ethical and potential legal violations by the Clinton campaign, has given Sanders supporters a very strong dislike for underhanded tricks. Using party insiders to win the nomination against the will of the electorate would be seen as intolerable.

Appendix:
Just for fun I have included the following list of predicted outcomes and actual outcomes as well as the predicted number of pledged delegates if all pledged delegates were distributed proportionately to the vote. Notice that the predicted outcomes for any given state can vary quite significantly. However, it is only across states that we hope to come up with a cumulative expected outcome that may be reasonable.

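| # | State | Primary Date | Prediction | TRUE | Pledged Sanders | Pledged Clinton | N |
|---|---|---|---|---|---|---|---|
| 1 | IA | 2/1/2016 | 46 | 50 | 22 | 24 | 3950 |
| 2 | NH | 2/9/2016 | 62 | 61 | 15 | 9 | 5030 |
| 3 | NV | 2/20/2016 | 44 | 47 | 17 | 20 | 4105 |
| 4 | SC | 2/27/2016 | 45 | 26 | 14 | 29 | 3757 |
| 5 | AL | 3/1/2016 | 38 | 20 | 10 | 33 | 2547 |
| 6 | AR | 3/1/2016 | 30 | 31 | 10 | 22 | 3473 |
| 7 | CO | 3/1/2016 | 59 | 59 | 39 | 27 | 13735 |
| 8 | GA | 3/1/2016 | 38 | 28 | 29 | 64 | 8820 |
| 9 | MA | 3/1/2016 | 61 | 49 | 45 | 36 | 25573 |
| 10 | MN | 3/1/2016 | 61 | 62 | 47 | 30 | 9623 |
| 11 | OK | 3/1/2016 | 53 | 56 | 21 | 18 | 3111 |
| 12 | TN | 3/1/2016 | 51 | 33 | 22 | 33 | 5575 |
| 13 | TX | 3/1/2016 | 46 | 34 | 75 | 119 | 27077 |
| 14 | VA | 3/1/2016 | 45 | 35 | 34 | 52 | 15702 |
| 15 | VT | 3/1/2016 | 97 | 86 | 14 | 0 | 12563 |
| 16 | ZZ | 3/1/2016 | 86 | NA | 11 | 2 | 920 |
| 17 | KS | 3/5/2016 | 59 | NA | 19 | 14 | 2997 |
| 18 | LA | 3/5/2016 | 37 | NA | 19 | 32 | 2853 |
| 19 | NE | 3/5/2016 | 44 | NA | 11 | 14 | 1724 |
| 20 | ME | 3/6/2016 | 63 | NA | 16 | 9 | 3829 |
| 21 | MI | 3/8/2016 | 65 | NA | 85 | 45 | 11472 |
| 22 | MS | 3/8/2016 | 56 | NA | 20 | 16 | 854 |
| 23 | FL | 3/15/2016 | 39 | NA | 83 | 131 | 28191 |
| 24 | IL | 3/15/2016 | 66 | NA | 102 | 54 | 21651 |
| 25 | MO | 3/15/2016 | 59 | NA | 42 | 29 | 6384 |
| 26 | NC | 3/15/2016 | 58 | NA | 62 | 45 | 11057 |
| 27 | OH | 3/15/2016 | 57 | NA | 82 | 61 | 10091 |
| 28 | AZ | 3/22/2016 | 70 | NA | 53 | 22 | 10288 |
| 29 | ID | 3/22/2016 | 77 | NA | 18 | 5 | 1677 |
| 30 | UT | 3/22/2016 | 58 | NA | 19 | 14 | 2859 |
| 31 | AK | 3/26/2016 | 72 | NA | 12 | 4 | 1891 |
| 32 | HI | 3/26/2016 | 61 | NA | 15 | 10 | 2823 |
| 33 | WA | 3/26/2016 | 85 | NA | 86 | 15 | 28682 |
| 34 | WI | 4/5/2016 | 70 | NA | 60 | 26 | 7374 |
| 35 | WY | 4/9/2016 | 73 | NA | 10 | 4 | 696 |
| 36 | NY | 4/19/2016 | 38 | NA | 95 | 152 | 55636 |
| 37 | CT | 4/26/2016 | 59 | NA | 32 | 23 | 9237 |
| 38 | DE | 4/26/2016 | 44 | NA | 9 | 12 | 1119 |
| 39 | MD | 4/26/2016 | 46 | NA | 44 | 51 | 14771 |
| 40 | PA | 4/26/2016 | 56 | NA | 105 | 84 | 16231 |
| 41 | RI | 4/26/2016 | 59 | NA | 14 | 10 | 2494 |
| 42 | IN | 5/3/2016 | 70 | NA | 58 | 25 | 5346 |
| 43 | WV | 5/10/2016 | 68 | NA | 20 | 9 | 1632 |
| 44 | KY | 5/17/2016 | 62 | NA | 34 | 21 | 2921 |
| 45 | OR | 5/17/2016 | 87 | NA | 53 | 8 | 15170 |
| 46 | PR | 6/5/2016 | 20 | NA | 12 | 48 | 566 |
| 47 | CA | 6/7/2016 | 81 | NA | 383 | 92 | 113314 |
| 48 | MT | 6/7/2016 | 85 | NA | 18 | 3 | 1658 |
| 49 | ND | 6/7/2016 | 92 | NA | 17 | 1 | 427 |
| 50 | NJ | 6/7/2016 | 53 | NA | 67 | 59 | 13513 |
| 51 | NM | 6/7/2016 | 91 | NA | 31 | 3 | 6658 |
| 52 | SD | 6/7/2016 | 85 | NA | 17 | 3 | 714 |
| 53 | DC | 6/14/2016 | 10 | NA | 2 | 18 | 9298 |
| Total: | | | | | 2249 | 1691 | |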

Wednesday, March 2, 2016

"To Pie or Not To Pie" That is the question! Graph theory

In several recent posts I have attempted to convey how the current primary season is funded (on the Democratic side). To help convey this information I have employed several different analytical angles and graphical strategies, all generated in my favorite statistical package, R. These graphs have included histograms, maps, bar-plots, box-plots, and yes, dare I say it, pie charts.

I wrote my most recent post and I was surprised to find that despite its inflammatory content, the only comments I received on it were criticizing my use of pie charts.

One article linked to the comment opened, "The pie chart is easily the worst way to convey information ever developed in the history of data visualization."

The article went on to list some very reasonable arguments as to why pie charts are not an effective method of conveying information. It does mention that there is a slight benefit when comparing large differences because "their only real use is to let people know what a fraction looks like."

But is this true?

The article states that charts are used because:
- Charts are a way to take information and make it more understandable.
- In general, the point of charts are to make it easier to compare different sets of data.
- The more information a chart is able to convey without increasing complexity, the better.

All of these points are great but fail to capture the two primary reasons I use charts:
- Stimulate interest in the reader.
- Provide a visual aid by which readers can understand and take away key information.

So with these graphing objectives in mind, let's look at the following graphs, all produced from the same data.

Figure 1: Campaign finance pie chart. Post Code
Figure 2: Campaign finance histogram chart. Post Code
Figure 3: Campaign finance map. Post Code (not yet provided)
Figure 4: Campaign finance barplot. Post Code
Figure 5: Itemized contribution size over time, boxplot. Post Code
Figure 6: Cumulative contribution over time. Notice the steep jumps in Clinton campaign reflects the effect of large donors while the smoothness in the Sanders campaign reflects the flow of numerous small donors. Post Code
A keen eye will immediately notice that all except the first figure are generated using ggplot2, my favorite R graphing package. ggplot2 goes out of its way not to provide a pie chart rendering tool, as its developers strongly discourage the use of pie charts. There is a bit of a workaround using polar coordinates and bar plots, which I decided not to use.

From looking at all six figures we can see that each of them is clearly trying to communicate the same information in a different way. Figures 1 and 2 are concerned about size of contributions, while Figure 3 provides geographic mapping of the number of contributions. Figure 4 reorganizes the information by industry category rather than contribution size while Figures 5 and 6 are more concerned with how donations change over time.

Now, looking over these figures, I have to ask: which of them even comes close to conveying the same information as effectively as the pie charts in Figure 1?

The histogram, Figure 2, provides almost the same information yet you have to spend a considerable amount of effort looking at the Figure then do some mental math multiplying size of donation to frequency of donation in order to mentally come up with values that almost resemble Figure 1.

I could instead generate a density plot to try to convey the same information.
Figure 7: Density curve of campaign contribution size. Code
Yet this does not capture the information I would like to convey (Figure 7). From this graph you may mistakenly assume that for the Clinton campaign small contributions are more important than large ones. However, this is not the case, as we know from Figure 1. The problem with a density graph like this is that it measures density, that is, the number of contributions at each amount. It does not reflect in any obvious way how important those contributions are.
Figure 8: Contribution size/importance plot. This is the same plot as a density plot (Figure 7), but rather than counting the number of contributions at each amount it calculates the total value of those contributions. Code


We get much closer to the information I am attempting to describe in Figure 1 with Figure 8. Figure 8 shows us that each campaign has certain peak amounts that account for most of its money. For the Clinton campaign the peak is around the $2700 mark (the maximum allowable without using Super-PACs), while for the Sanders campaign it is in the area below $100.
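The difference between Figure 7 and Figure 8 comes down to what gets summed in each bin. Here is a minimal Python sketch of that idea (the figures themselves were made in R). The function name `bin_contributions`, the $100 bin width, and the contribution amounts are all made up for illustration and are not taken from the campaign data.

```python
from collections import Counter

def bin_contributions(amounts, bin_width=100):
    """Return (count_by_bin, value_by_bin), keyed by the lower edge of each bin."""
    counts = Counter()
    values = Counter()
    for amount in amounts:
        lower_edge = int(amount // bin_width) * bin_width
        counts[lower_edge] += 1       # what a density plot like Figure 7 measures
        values[lower_edge] += amount  # what a value-weighted plot like Figure 8 measures
    return counts, values

# Hypothetical amounts: many small gifts and a few at the $2700 maximum.
amounts = [25] * 40 + [50] * 20 + [2700] * 3
counts, values = bin_contributions(amounts)

total_count = sum(counts.values())
total_value = sum(values.values())
for lower_edge in sorted(counts):
    count_share = counts[lower_edge] / total_count
    value_share = values[lower_edge] / total_value
    print(f"${lower_edge}+: {count_share:.0%} of contributions, {value_share:.0%} of dollars")
```

In this toy data the lowest bin holds nearly all of the contributions but only about a fifth of the dollars, which is the kind of gap a count-based density curve hides.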

Looking at Figure 8 we can gather basically the same information as that of the pie-chart. Maybe a little more, as we can see that there are certain peak values ($200, $500, $1000, $2000, $2700) which are more common donation amounts. Yet, I would argue that this information is not really important. It might even be a distraction from the main point of the original post (FALSE: Clinton Funded by "Grassroots").

Not only is the information potentially a distraction, but it requires additional analysis on the part of the reader to figure out what information the chart is trying to convey. A pie-chart, on the other hand, is an amazingly simple chart that anybody who has familiarity with pies or charts can easily read and understand when comparing large differences in proportions. Thus readers can at a glance get a full and easy-to-remember understanding of the information that is being transmitted.

Conclusion


Here we have it! One pie-chart that efficiently conveys certain types of information against seven other figures which struggle to convey the same information as what the pie-chart easily conveys.

My final suggestion therefore is that people start thinking more about what they are attempting to communicate with their charts and less about what the chart gurus are telling us to do.

Building effective graphics is like writing effective prose. Know what you want to say and figure out the easiest and most straightforward way of saying it, period.

Tuesday, March 1, 2016

FALSE: Clinton Funded by "Grassroots"

The blatant distortions of reality put forth by the Clinton campaign are so offensive as to be laughable at times. In the victory speech of Hillary Clinton in South Carolina she spent a significant portion of it talking about how her campaign is financed by "grassroots".

Well, looking at the breakdown of funding for her campaign, only about 12% of her funds are from individuals contributing less than $200 while the vast majority of her funding (77%) is from individuals contributing $1000 or more.

If you are going to tell me that a movement 77% funded by people giving $1000 or more is "grassroots", I am going to have to ask, "what grass are you smoking?" The only way you can call such a top-heavy movement "grassroots" is if you are growing grass in the Koch brothers' back yard!



Obviously some small portion of the Clinton campaign is funded by small donors. However, for the campaign to represent itself as "grassroots" powered by "small-donors" is frankly a complete falsehood, especially when compared with a truly grassroots-funded campaign.



From the second graph we can see what a true "grassroots" campaign looks like. This is the Sanders campaign which has received only 10% of its total funds from individuals giving $1000 or more and 72% of its funds from people giving less than $200.


You might think that the Clinton campaign only looks bad when compared with publicly backed campaigns such as that of Bernie Sanders. However, this is not the case either. As I have noted previously, the Clinton campaign has far more big sponsors than all other candidates currently campaigning combined. And this is not counting the millions of dollars paid into the Clinton Super-PAC.

But don't take my word for it. Run the analysis yourself.
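As a rough starting point, here is a minimal Python sketch of the bracket calculation behind the percentages quoted above. It assumes you already have a list of per-donor contribution totals in dollars. The function name `share_by_bracket` and the amounts are hypothetical placeholders, not the actual filing data, and the bracket cut-offs simply follow the $200 and $1000 thresholds used in this post.

```python
def share_by_bracket(amounts):
    """Return each bracket's share of total dollars (not of donor counts)."""
    dollars = {"under $200": 0.0, "$200 to $999": 0.0, "$1000 or more": 0.0}
    for amount in amounts:
        if amount < 200:
            dollars["under $200"] += amount
        elif amount < 1000:
            dollars["$200 to $999"] += amount
        else:
            dollars["$1000 or more"] += amount
    total = sum(dollars.values())
    return {bracket: value / total for bracket, value in dollars.items()}

# Hypothetical donors: ten give $100, two give $500, four give $2000.
example_amounts = [100] * 10 + [500] * 2 + [2000] * 4
shares = share_by_bracket(example_amounts)
for bracket, share in shares.items():
    print(f"{bracket}: {share:.0%} of funds")
```

Note that the shares are of dollars raised, so a handful of large donors can dominate even when small donors are far more numerous.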

Related Articles: 
Clinton Many More Rich Supporters Than All Other Candidates Combined
Big Business Backs Hillary: Small Bernie
Analysis: Clinton backed by Big Money: Sanders by Small
Overwhelming Growth In National Support for Bernie Sanders Mapped
Hillary 1993: Largest Drop in Girl Names EVER; Chelsea Distant Second
As First Lady, Popularity of Babies Named "Hillary" Dropped by an Unprecedented 90%
Hillary Clinton's Biggest 2016 Rival: Herself
Legally Rig An Election: A Citizen's Guide to Gerrymandering 
Nevada:Sanders has 6x the Supporters as Clinton
The Simple Reason Sanders Is Winning
Cause of Death: Melanin | Evaluating Death-by-Police Data
Obama 2008 received 3x more media coverage than Sanders 2016
The Unreported War On America's Poor
What it means to be a US Veteran Today

Friday, February 26, 2016

Clinton Many More Rich Supporters Than All Other Candidates Combined

With over 23,600 contributions at or above the federal individual contribution maximum of $2,700, Hillary Clinton has far more contributions from wealthy donors than any other candidate (Table 1: Huge). Not only that, but the number of huge contributions for Hillary Clinton exceeded the number for her chief rival Bernie Sanders by a factor of 60 to 1.

Table 1: This table shows the number of campaign contributions by size of the contribution.
| Candidate | Itemized ($) | Total ($) | Huge ($2700+) | Large ($1000-2699) | Med ($200-1000) | Small ($1-200) | Unitemized ($1-200)* |
|---|---|---|---|---|---|---|---|
| Clinton | 106,285,874.12 | 120,495,964 | 23,620 | 15,307 | 33,906 | 152,640 | 568,404 |
| Sanders | 26,509,365.75 | 93,883,341 | 394 | 3,442 | 29,374 | 319,374 | 2,694,959 |
| Carson | 23,995,461.23 | 57,001,577 | 1,137 | 4,722 | 24,411 | 159,917 | 1,320,245 |
| Cruz | 34,821,534.81 | 54,070,645 | 4,078 | 4,738 | 23,909 | 181,351 | 769,964 |
| Kasich | 7,459,832.14 | 8,401,505 | 1,837 | 1,327 | 1,977 | 1,524 | 37,667 |
| Rubio | 26,915,622.40 | 32,371,175 | 5,699 | 6,183 | 11,193 | 36,019 | 218,222 |
| Trump | 1,773,225.61 | 7,407,238 | 194 | 289 | 2,432 | 2,634 | 225,360 |

* Unitemized support is assumed given at $25 per contribution.
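One reading of the $25 assumption, which reproduces the table's unitemized counts for the rows checked here, is that the unitemized count is the gap between total and itemized dollars divided by $25. The short Python sketch below works through that arithmetic. The function name `estimated_unitemized_count` is my own label, and the derivation is inferred from the numbers rather than stated in the post.

```python
ASSUMED_GIFT = 25.0  # dollars per unitemized contribution, per the table footnote

# (itemized dollars, total dollars) copied from Table 1
table_totals = {
    "Clinton": (106_285_874.12, 120_495_964),
    "Sanders": (26_509_365.75, 93_883_341),
    "Carson": (23_995_461.23, 57_001_577),
}

def estimated_unitemized_count(itemized, total, gift=ASSUMED_GIFT):
    """Dollars not accounted for by itemized gifts, divided by the assumed gift size."""
    return round((total - itemized) / gift)

estimates = {
    name: estimated_unitemized_count(itemized, total)
    for name, (itemized, total) in table_totals.items()
}
for name, count in estimates.items():
    print(name, count)
```

Because the counts scale inversely with the assumed gift size, doubling the assumption to $50 would halve every entry in the unitemized column.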

Yet, Clinton is not only getting more support among wealthy people than her Democratic Socialist rival. Clinton also has 77% more support among wealthy people than all other candidates combined.
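The two headline comparisons can be checked directly against the Huge ($2700+) counts in Table 1. This is a small Python sketch of that arithmetic, and the variable names are mine.

```python
# Number of Huge ($2700+) contributions per candidate, from Table 1
huge = {
    "Clinton": 23_620,
    "Sanders": 394,
    "Carson": 1_137,
    "Cruz": 4_078,
    "Kasich": 1_837,
    "Rubio": 5_699,
    "Trump": 194,
}

# Everyone except Clinton, added together
others_combined = sum(count for name, count in huge.items() if name != "Clinton")

ratio_to_sanders = huge["Clinton"] / huge["Sanders"]
excess_over_others = huge["Clinton"] / others_combined - 1

print(f"Clinton to Sanders: about {ratio_to_sanders:.0f} to 1")
print(f"Clinton vs all others combined: {excess_over_others:.0%} more")
```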

In complete contrast, challenger Bernie Sanders has far more support among small contributors with nearly 5x as many small contributions (less than $200 in total) as Hillary Clinton.

Among Republicans the candidate who comes closest to the grassroots support that Sanders has is Ben Carson, who has fewer than half as many unitemized donations as Sanders.
Figure 1: A collection of histograms showing number of contributions grouped by size of contribution.
Figure 1 shows this information in yet another way.

The easiest and clearest way of reading this information is that Hillary is backed by big money and Sanders by small.
