Friday, March 22, 2019

Data Fun - Inspired by Darasaurus

After my recent post on Anscombe's Quartet in which I demonstrated how to efficiently adjust any data set to match mean, variance, correlation (x,y), as well as regression coefficients. Philip Waggoner tuned me onto Justin Matejka and George Fitzmaurice's Datasaurus R package/paper in which the authors demonstrate an alternative method of modifying existing data to fit a specified mean and variance for X and Y. Their method randomly applies small disturbances to individual observations to gradually move the data to match a target preference set.

Inspired by their use of point images which match a specific parameter set, I have done generated some of my own. For all of them X has a mean and variance of 1 and 11. While y has a mean and variance of 1 and 1. The correlation between X and Y is set to 0 which causes some distortion in the images. More on that in the post.
Figure 1: Shows a graph of 12 data sets each with 15 transitional data sets. The mean, variance, and correlations of X and Y are held constant throughout the sets and transitions.

Data Source

I generated the data myself using Mobilefish's upload photo and record clicks webapp. The source images are from images I found online.

The only slight trick to using the data generated by Mobilefish was that the y cooridates are typically tracked from the top of the page with software, yet most statistical graphing software plots with y starting from the bottom of the graph.

Raw Images

The raw data when plotted look like their source material..

New Images: Force Cor(X,Y)=0

When we force the correlation of X and Y to be zero certain point distributions become distorted.

For Bart and the Cat forcing cor(X,Y) has noticable distortions while for the flower minimal distortions seem to have been introduced.

New Images: Force Cor(X,Y)<>0

It gets even worse when we impose a constant correlation between X and Y. The following shows the distortions to the flower when we change b1, keeping Cor(X,Y) constant and fixing the Y plot limits.
Figure 8: Shows the effect on Var(Y) that changing b1, when all other factors are held constant.

Slight changes to the Anscombe-Generator Code

In order to generate graphs that had cor(X,Y)=0 I had to modify my previous code to allow variation in Y that was completely independent of X. The problem with my code was that if b1=1, my calculation used SSE = (b1^2 * var(X))*n in order to infer how large the variation in u needed to be (varianceu = (SSE/corXY^2 - SSE)/n). This backwards inference does not work if b1=0.

So, just for the special case of corXY=0 I have included an additional error term E which is helpful in the even that b1=0.


The thought of use points to make recognizable images had not occurred to me until I viewed Justin Matejka and George Fitzmaurice's Datasaurus work. I hope that in making a slightly more efficient distribution manipulator I will allow new and better datasets to be generated which will help students understand the importance of graphical exploration of their data.


My code as well as the slight change to Anscombe's code can be found here. The standarized data sets can be found here. They are the ones that end with std.csv

Tuesday, March 19, 2019

The importance of Graphing Your Data - Anscombe's Clever Quartet!

Francis Anscombe's seminal paper on "Graphs in Statistical" analysis (American Statistician, 1973) effectively makes the case that looking at summary statistics of data is insufficient to identify the relationship between variables. He demonstrates this by generating four different data sets (Anscombe's quartet) which have nearly identical summary statistics. His data have the same mean and variance for x and y, same correlations between x and y, and same regression coefficients on the linear projection of x on y. (There are certainly additional summary statistics less widely reported such as kurtosis or least absolute deviations/median regression which were not reported which would have indicated differences between the data.) Yet even with these differences, without graphing the data, any analysis would likely be missing the mark.

I found myself easily convinced by the strength of his arguments yet also curious as to how he produced the sample data that fit his statistical argument so perfectly. Given that he had only 11 points of data, I am drawn to think he played around with the data by hand till it fit his needs. This is suggested by the lack of precision on the statistics of the generated data (Anscombe's quartet).

If he could do it by hand, I should be able to do it through algorithm!

The benefits of having such an algorithm would be that I generate an arbitrary number of datasets and data that exactly fit specific sample parameters. I tried a few different methods of producing the data that I wanted.

Method 1 - randomly draw some points then select the remaining - fail

One method was to select just the last point or two from a set of data, say I wanted  to draw 11 X points with with mean 9 and variance 11 as found in the data. I attempted to draw 10 points then adjust the mean and variance by selectively drawing the 11th point. This approach however quickly fails as it relies too much on the 11th point. Say the mean from the first draws was unusually low with a mean of 8. In order to weight the sample mean back to 9 the 11th point would therefore need to be 19 in order to balance the x values at 9. Then you have to somehow figure out how to manage the variance which you know is already going to be blown up by the presence of my 11th value.

Method 2 - use optimization to select points which match the desired outcome - fail

Next I tried some search algorithms trying to use computation to search for possible values that fit my needed data. This was a highly problematic attempt that failed to produce any useful results.

Method 3 - brute force, randomly generate data - fail

The intent of the approach was to get data close to target parameters, then modifying individual data points to match desired properties.

Method 4 - modify random data to meet parameter specifications

Fortunately, after a little reflection I realized the smarter approach was to make use of what I know about means and variances as well as correlations to modifying the sample to fit my desired outcome. For instance, no matter what x I started with (so long as x had any variation) I could adjust it to fit my needs. If the mean of x needs to be mu_X. Then we can force it to be that:
$$ (1) X = X-mean(X) + \mu_X $$

Slightly more challenging, we could modify the variance of x by scaling the demeaned values of the sample. Since we know that
$$ (2) Var(aX)=a^2 * Var(X) $$
Define a to be a multiplicative scalar for x
$$(3) a = (\sigma^2_X/Var(X))^{1/2}$$

Using identities we can figure out how to modify the error term u in order to always return the desired regression values as well as the correct correlations (for more explanation see first fifty lines of notes in coding file).

Through use of such an algorithm we can feed in any draw of X and any dependency between X and U and we will get the same regression results:
Mean(X) = 9, Var(X) = 7.5, B0 = 3, B1 = .5, COR(X,Y)=.8.

Sample Data - Using Ascombe's Parameters

Sample data drawn to generate the following graphs can be found here.

The statistical results in R are displayed as follows. These results are designed to be exactly identical regardless of how the data is generated.

Table 1:

Call: lm(formula = y ~ x, data = xy8)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.00000    0.51854   5.785 5.32e-07 ***
x            0.50000    0.05413   9.238 3.18e-12 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared:   0.64, Adjusted R-squared:  0.6325 
F-statistic: 85.33 on 1 and 48 DF,  p-value: 3.18e-12

Since we cannot see any difference from looking at standard descriptive statistics, lets see how the data looks when graphed.
Figure 1: Graphs 1-4 are recreations of Anscome's Quartet. 5-6 are new.
Figures 1-4 are recreations of Anscombe's Quartet. Figure 1 is what we are often times thinking our data should look like in our heads. Figure 2 is a situation in which there is a nonlinear relationship between x and y which should be examined. Figure 3 could present a problem since there is no variation in x except one observation which drives all of the explanatory value of the regression. Figure 4 is similar except now there is variation in x and in y. However the relationship between the values is distorted by the presence of a single powerful outlier.

Figures 5-8 are figures I came up with. Figure 5 features a weak linear relationship between x and y which is exaggerated by a single outlier. Figure 6 is a negative log. Figure 7 is a example of heteroskedasticity. Figure 8 is an example of x taking only one of two values. 

Anscome emphasizes that the funkiness of the data does not necessarily mean the inference is not valid. That said, ideally removing a single point of data should not significantly change inference. Yet, researchers should know what their data looks like.

As for figure 7, generally we do not expect heroskedastic errors to present inference bias. Rather they suggest that using heteroskedasticity robust or White-Huber standard errors might improve the efficiency of our estimates (generally speaking).

Sample Data - Using Negative Slope Parameters

Sample data is drawn from the same parameters except that now the slope is negative. 

Table 2: 
Call: lm(formula = y ~ x, data = xy1)

    Min      1Q  Median      3Q     Max 
-2.3862 -0.6586 -0.2338  0.5721  3.6159 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.00000    0.51854   5.785 5.32e-07 ***
x           -0.50000    0.05413  -9.238 3.18e-12 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared:   0.64, Adjusted R-squared:  0.6325 
F-statistic: 85.33 on 1 and 48 DF,  p-value: 3.18e-12

We can see that changing the slope to negative does not change any of the other statistics.

Figure 2: Same as figure 1 except B1 = -0.5


Graph your data! If not presenting graphs in your final analysis at least graph it in the exploration phase. Ideally, presenters of data and analysis have some mastery of tools of data exploration and interaction which can presented with data (such as interactive data interfaces Shiny or Tableau).

Such supplementary data found in graphs will likely not be the basis of whether the arguments you are making through statistics are valid, but they will add credibility.


Find my code for generating exact linear relationships between XY regardless of the dependency of the errors U and X (u|x).

Friday, March 8, 2019

Mass Shooting Data - Tableau

Some experimentations with Mass Shooting data visualization using Tableau
Here is a related graph/animation generated in R

Wednesday, September 26, 2018

Strategizing Retirement Investments (In the US)

Major caveat! I have no investment training or finance training and all of my calculations are back of the envelope calculations put together from what information I can gather online. In addition, given that investment planning usually spans decades, massive uncertainties exist in tax schemes and expected rate of returns. Please consult a professional!

That said, as someone who has an income you are trying to decide if and how much to invest in your retirement account. Most people have several options: tax deferred retirement options such as traditional IRA, 401(k), 457, or 403(b). Some employers also offer matching programs which match contributions at a rate of 50%, 100%, or more to a limit each year. As tax deferred retirement accounts, money contributed to them are not taxed at the time of contribution (except for Social Security tax 6.2 and Medicare 2.9).

In addition to these programs there is the option of investing in a Roth IRA with income that has already been taxed but will suffer no additional capital gains tax. Finally, there is the option of investing directly (with pretaxed earnings) such as purchasing stocks, mutual funds etc. While direct investments give you the freedom to withdraw money at any time, they suffer the penalty of being liable for capital gains taxes (15% on nominal gains for long term investments).

Each of these investment choices has its relative merits. One thing to consider is tax bracket.  Each earning level of taxes are paid at a specific bracket. For example, for a couple the first 19k of earnings is paid at a 10% rate, while the next 58k of earnings is paid at 12%, while the next 88k is taxed at 22%, and so on with increases to 24%, 32%, 35%, and finally 37% at total earnings above 600k (see tax tables). Yet the combined effect of these changing marginal tax rates is to turn the effective tax rate into a globally non-linear function (Figure 1).

For the individual the federal tax rate for a median household income level $32K or for a married couple making $59k is about 11.5%. The overall effect of tax bracket structure is to overstate the tax obligation that most individuals and couples are required to pay. For example, a household making 103k (75 percentile) while paying 22% on their last dollar earned are overall only paying only around 14% of their total income in income tax. Even households making 250k (97 percentile) while their tax bracket is officially 24% on their last dollar earned are only paying a tax rate a little above 19% on overall income.

Figure 1: Overall Federal Income Tax Rate without Social Security and Medicare
That said, nobody likes to see their income disappear to taxes so I am not trying to say taxes are too high or too low. I am just noting that tax brackets alone for the vast majority of the population, are quite inaccurate tools at estimating tax liability. If we include the social security tax and Medicare (Figure 2), federal taxes become both more painful overall and less progressive (higher earners paying higher taxes). This figure is even more non-linear with important portions of the figure (earnings above 128k) experiencing a lower tax rate growth than earnings up to that point. Curiously, an individual income earner above 128k actually experiences a decline in overall tax rate until earning up to the next tax bracket of 157.5k.
Figure 2: Overall Federal Income Tax With Medicare and Social Security

But enough of this! About 75% of households are earning 110k or less. Those earning above that rate are likely already talking with financial managers in order to figure out how to plan their futures.

The overall point of the above discussion is that tax differed savings plans might be quite beneficial for retirees as many retirees expect to be at a lower income level than they were while working. Income drawn from such plans is taxed at their current income level rather than the level they had at the time of saving. Social Security and Medicare tax is always paid immediately upon all wage earnings (and thus constant under all scenarios) so it will not be included in the following analysis.

In order to analyze the benefit of choosing different investment mechanisms, I looked at three different factors, 1-overall effective lifetime income, 2-total investment worth at the end of retirement (inheritance available), and 3-annual effective income. The first two seem like they are pretty important for most people but arguably annual effective income is probably the most important factor. Ideally, retirees looking to enjoy their retirement will not suffer a minimal loss in purchasing power after retirement.

In order to look at different investment strategies, I simulated two earning levels, 55k per year gross and 110k per year gross. Workers work for 35 years with real annual salary increases of 0.5%. After 35 years of work, workers consider the next 20 years as years living on secondary income sources (1/3 of base income) as well as retirement savings withdrawn at a rate of 4% per year.

In order to make the numbers make sense, I attempted to simulate “real” money rather than nominal (including inflation). As a result I assume that tax brackets remain the same in terms of “real” earnings. Annual raises are modest as a result as well. In my simulation, I include a real investment rate of return of 5% and a capital gains tax of 20% on all direct investment withdraws. This does not take into account in the cost of such investments (which is tax free) while including a higher rate of capital gains tax to take into account the false effect inflation has on direct investments (and the corresponding tax on those investments).

In addition of the tax differed retirement accounts (traditional IRA or a 401k without matching – 401km00), the after tax but not capital gains taxed retirement accounts (Roth IRA), and the direct investment mechanisms (Direct), I also included two investment options  in which employers match contributions by either 50% or 100% (401km05 and 401km10).

Generally, all of the investments are considered on the basis of “investing” a certain amount of money 5.5k or 11k. There was one scenario in which I allowed for a greater investment quantity in order to allow the simulation of investing the same amount of after tax income (i401k00b). When the earning level was 55k and the investment was 5.5k for i401k00b the 401(k) investment level was instead 6.5k. And when the earning level was 110k and the investment was 11k for i401k00b the 401(k) investment level was instead 14k.

In my analysis I am concerned with "effective income" which I define as:  
effective income = all income – tax – investment contributions (only made for years working).

Gross Base Income 55k per Year
Under the different investment scenarios, total lifetime earnings vary between 2.3m and 2.8m in the case when employers provide matching funds (Table 3). When no matching funds are provided, lifetime earnings are more similar between scenarios with lifetime earnings between 2.3m and 2.5m. Differences are more exaggerated when considering total assets at the end of the retirement period included in total lifetime earnings with the highest level obtained under scenario i401km00b (no matching but investments made at a rate that are higher than 5.5k, but instead meant to achieve after tax 5.5k investment per year).
Figure 3: Total Lifetime Effective Earning and End of Retirement Assets for Gross Income of 55k per Year
Rather than look at lifetime effective income and total assets it might be better to look at per year income (Table 4). Without contribution matching, all of the scenarios result in a significant drop in effective income. The most dramatic drop is in the "No Saving" scenario in which the effective income of the worker drops from near the global maximum for the scenario to the global minimum of the scenario as the worker enters retirement without any additional income sources outside of the 1/3 base income (SS income or whatever) provided under all retirement scenarios.

Looking at the effective income while working levels, there are three levels that the simulated worker experiences. The highest being "No Savings", which results in the highest possible current effective income level. The second are the pretax saving schemes such as 401km10, 401km05, and 401km00. These schemes provide highest effective income during working years because they reduce the after tax income not by 5.5k but 5.5k after taxes which is 4.8k in year 1. The scenario in which the highest matching funds are available is that which produces the highest long run returns. The third effective income level on working years is the direct investment scenario, the Roth IRA scenario, and the 401km00b scenario (which is custom calibrated to achieve this). In this income scenario 401(k) investment even though it is taxed upon dismemberment produced higher earnings than that of the Roth in retirement though the difference does not appear to be large in this scenario.
Figure 4: Annual effective income under different investment mechanisms 55k gross income per year. Note a jitter is added to lines to aid in deciphering overlap.

Gross Base Income 110k per Year with 11k annual investment
Figure 5: Total Lifetime Effective Earning and End of Retirement Assets for Gross Income of 55k per Year

Overall, doubling the household income and the investment level results very similar relative distributions of Total Lifetime Effective Earnings or Total Assets at retirement.
Figure 6: Annual effective income under different investment mechanisms 55k gross income per year. Note a jitter is added to lines to aid in deciphering overlap.

Likewise when looking at lifetime smoothed effective income, the overall returns from different investment mechanisms remain similar. That said the difference between the investment strategy using the Roth IRA and increased investment using 401km00b (401 tax differed investment without any match at 11k after tax investment level which is 14k before tax) resulted in the same level of effective income during working years but significantly higher effective income in retirement.

That said, Roth IRA funds are more flexible than those of tax differed funds which suffer a 10% additional penalty if withdrawn before age 59.5. In addition, if you believe that your current income level is likely to be lower than your future income level, it probably means that you should consider investing more in your Roth over that of a tax differed investment instruments.

Comparing 55k to 110k per Year
The 110k per year scenario is basically a doubling of the 55k per year scenario in terms of both income and investment (except in a slight difference with the 401km00b scenario). Comparing the average annual effective income under these two scenarios we can infer the cost of higher income taxes in the 110k scenario. The average ratio is 1.94 which indicates a difference of .06 or 6%. Looking closely at Figures 1 or 2 we can see that the difference in overall tax rates between households making 55k per year and 110k per looks to be about 6% as well. Using larger tax differed investments (401km00b) does appear to be a way to achieve a smaller tax penalty with a decreasing in effective tax rate of closer to 4%.

Table 1: Shows the ratio of average annual effective income (same as total lifetime income). If the ratio were 2 then there would be no tax different between earners at 55k per year and 110k per year.

NoSavings 1.94
401km00 1.95
401km00b 1.96
401km05 1.93
401km10 1.92
Roth 1.94
Direct 1.93
Average 1.94

If there is a 401k or other tax differed investment option in which matching is available, I would suggest putting the maximum your finances allow. If matching funds are not available, in this simulation investing pretax income in the 401k equal to that you would invest in the Roth IRA appears to yield a slightly better outcome. This is generally because while these funds will be taxed when withdrawn, having that much more funds grows that much faster than the after tax funds otherwise available.

Finally, in almost all of the scenarios, there is a significant loss of effective income upon retirement. If you would like to mitigate the risk of that happening, combining investment mechanisms is likely to be the best strategy. This might mean relying upon direct investments if for instance you have already filled your Traditional IRA or Roth IRA and have no other investment mechanisms provided by your employer.

Feel free to explore the simulation on your own with your own earnings and investment options! Code that produced this analysis can be found on Github

Wednesday, September 12, 2018

Plotting the Impact of Atlantic Hurricanes on the US

With hurricane Florence bearing down with expected to be devastating force, it might be a good time to reflect on the history of Atlantic Hurricanes on the United States. As an easy if not full proof source for hurricane data I drew on public data collated through Wikipedia's list of costliest Atlantic hurricanes as well as linked articles. On the page it lists 45 hurricanes going back to Hurricane Betsy in 1965, with 30 of them since 2000 totaling 803 billion dollars in estimated damages and 6,591 US fatalities. While this is quite a large number of fatalities, the death toll of Atlantic Hurricanes outside of the US is significantly larger with nearly 30,000 deaths estimated.

Within the US, the track record for deaths from hurricanes tend to be quite low with all hurricanes recorded having a fatality rate of less than two hundred with the exception of Maria (2017 - Trump administration) and Katrina (2005 - GW Bush administration).

Figure 1: US Fatalities by Year.

I have also scaled the y-axis by log10 to make it easier to examine less deadly storms.

Figure 2: US Fatalities by Year Log10 scaled.

Maria and Katrina being the most deadly of storms does not indicate that they are necessarily the most damaging of storms. 

Figure 3: Damages in the Billions by Year.
Harvey and Katrina are reported to have done 125 billions dollars each followed by Maria doing damage of 91.2 billion, Sandy at 68.7, and Irma at 64.8.

So why is it that Harvey (2017 - Trump), a storm that did as much damage as Katrina resulted in only 106 deaths while Maria a storm that did 25% less damage resulted in a US citizen total death toll comparable to the sum of all 44 other Hurricanes on record (2982 vs. 3609)?

Figure 4: Total deaths by Year.

We might attempt to predict which storms might be the most deadly using the storm measurements highest wind speed recorded and lowest pressure recorded.

Figure 5: Wind speed by US_Fatalities
Figure 6: Lowest air pressure recorded in terms of mbar (whatever that is).

As we can see that Maria and Katrina are both on the higher end of  the wind speed distribution and the lower end of the air pressure distribution, we can assert that a more powerful storm is a necessary component for the kind of death tolls associated with these storms.

What factors must be present for a severe storm which otherwise would be associated with a death toll of less than 200 people to being a tremendous human tragedy, such as in the case of Katrina or even more so Maria are yet unknown given the sparsity of data available to the public.

From the data it appears that Atlantic hurricanes have grown more damaging and more deadly for US citizens.

Hopefully, despite the catastrophic blunders of the handling of Katrina and Maria by the Bush and Trump administrations respectively, future hurricanes such as Florence will be dealt with in a more responsible manner.

Code and data can be found here