Friday, March 22, 2019

Data Fun - Inspired by Darasaurus

After my recent post on Anscombe's Quartet, in which I demonstrated how to efficiently adjust any data set to match a target mean, variance, correlation(X, Y), and set of regression coefficients, Philip Waggoner pointed me to Justin Matejka and George Fitzmaurice's Datasaurus R package/paper, in which the authors demonstrate an alternative method of modifying existing data to fit a specified mean and variance for X and Y. Their method randomly applies small disturbances to individual observations, gradually moving the data toward a target parameter set.

Inspired by their use of point images that match a specific parameter set, I have generated some of my own. For all of them, X has a mean of 1 and a variance of 11, while Y has a mean of 1 and a variance of 1. The correlation between X and Y is set to 0, which causes some distortion in the images. More on that below.
Figure 1: Twelve data sets, each with 15 transitional data sets. The means, variances, and correlation of X and Y are held constant throughout the sets and transitions.

Data Source

I generated the data myself using Mobilefish's upload-photo-and-record-clicks web app. The source images were found online.

The only slight trick to using the data generated by Mobilefish is that the y coordinates are measured from the top of the image (as is typical for image software), while most statistical graphing software plots y starting from the bottom of the graph, so the y values need to be flipped.
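That flip is a one-liner. A minimal Python sketch (illustrative only; the post's own code is in R, and `height` here is an assumed image height in pixels):

```python
def flip_y(points, height):
    """Click-tracking tools like Mobilefish measure y from the top of the
    image; plotting software measures y from the bottom. Flipping with
    y' = height - y converts between the two conventions."""
    return [(x, height - y) for (x, y) in points]

# A click at the very top of a 100-pixel-tall image plots at the top of the graph.
print(flip_y([(0, 0), (5, 100)], 100))  # [(0, 100), (5, 0)]
```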

Raw Images

The raw data, when plotted, look like their source material.

New Images: Force Cor(X,Y)=0

When we force the correlation of X and Y to be zero, certain point distributions become distorted.

For Bart and the cat, forcing cor(X, Y) = 0 introduces noticeable distortions, while for the flower only minimal distortions seem to have been introduced.

New Images: Force Cor(X,Y) ≠ 0

It gets even worse when we impose a nonzero correlation between X and Y. The following shows the distortions to the flower when we change b1 while keeping Cor(X, Y) constant and fixing the Y plot limits.
Figure 8: The effect on Var(Y) of changing b1 when all other factors are held constant.

Slight changes to the Anscombe-Generator Code

In order to generate graphs with cor(X,Y) = 0, I had to modify my previous code to allow variation in Y that is completely independent of X. The problem with my code was that it used SSE = (b1^2 * var(X)) * n to infer how large the variation in u needed to be (varianceu = (SSE/corXY^2 - SSE)/n). This backwards inference does not work if b1 = 0, since SSE and corXY are then both zero.

So, just for the special case of corXY = 0, I have included an additional error term E, which is helpful in the event that b1 = 0.


The thought of using points to make recognizable images had not occurred to me until I saw Justin Matejka and George Fitzmaurice's Datasaurus work. I hope that by making a slightly more efficient distribution manipulator I can help new and better data sets be generated, which will help students understand the importance of graphical exploration of their data.


My code, as well as the slight change to the Anscombe-generator code, can be found here. The standardized data sets can be found here; they are the files ending in std.csv.

Tuesday, March 19, 2019

The Importance of Graphing Your Data - Anscombe's Clever Quartet!

Francis Anscombe's seminal paper "Graphs in Statistical Analysis" (American Statistician, 1973) effectively makes the case that looking at summary statistics is insufficient for identifying the relationship between variables. He demonstrates this by generating four different data sets (Anscombe's quartet) with nearly identical summary statistics: the same means and variances for x and y, the same correlation between x and y, and the same coefficients for the linear regression of y on x. (There are certainly less widely reported summary statistics, such as kurtosis or least-absolute-deviations/median-regression coefficients, that would have revealed differences between the data sets.) Yet despite these underlying differences, any analysis that does not graph the data would likely miss the mark.

I found myself easily convinced by the strength of his arguments, yet also curious how he produced sample data that fit his statistical argument so perfectly. Given that he had only 11 points per data set, I am inclined to think he played around with the data by hand until it fit his needs. This is suggested by the lack of precision in the summary statistics of the generated data (Anscombe's quartet).

If he could do it by hand, I should be able to do it through algorithm!

The benefit of having such an algorithm is the ability to generate an arbitrary number of data sets that exactly fit specific sample parameters. I tried a few different methods of producing the data I wanted.

Method 1 - randomly draw some points then select the remaining - fail

One method was to randomly draw all but the last point or two, then choose the remainder to hit the targets. Say I want to draw 11 X points with mean 9 and variance 11, as found in Anscombe's data. I draw 10 points, then try to fix the mean and variance by selectively choosing the 11th point. This approach quickly fails because it relies too heavily on that final point. Say the mean of the first 10 draws is unusually low, at 8: the 11th point would need to be 19 to pull the sample mean back to 9. Then you somehow have to manage the variance, which you know has already been blown up by the presence of that extreme 11th value.
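The failure is easy to see numerically. In this illustrative Python sketch, the 11th point nails the mean exactly, but nothing controls where the variance lands:

```python
import random
import statistics as st

random.seed(3)
target_mean, target_var, n = 9.0, 11.0, 11
x = [random.gauss(target_mean, target_var ** 0.5) for _ in range(n - 1)]
# Choose the 11th point so the sample mean is exactly 9...
x.append(n * target_mean - sum(x))
# ...but one degree of freedom cannot satisfy two moment
# conditions at once, so the variance is whatever it happens to be.
print(st.mean(x))      # exactly 9 (up to float rounding)
print(st.variance(x))  # almost surely not 11
```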

Method 2 - use optimization to select points which match the desired outcome - fail

Next I tried some search algorithms, using computation to hunt for candidate values that fit the target statistics. This attempt was highly problematic and failed to produce any useful results.

Method 3 - brute force, randomly generate data - fail

The intent of this approach was to randomly generate data close to the target parameters, then modify individual data points to match the desired properties exactly. It, too, failed.

Method 4 - modify random data to meet parameter specifications

Fortunately, after a little reflection I realized the smarter approach was to use what I know about means, variances, and correlations to modify the sample to fit my desired outcome. For instance, no matter what x I started with (so long as x had any variation), I could adjust it to fit my needs. If the mean of x needs to be $\mu_X$, we can force it to be so:
$$ (1) \quad X' = X - \bar{X} + \mu_X $$

Slightly more challenging, we can modify the variance of x by scaling the demeaned values of the sample. Since we know that
$$ (2) \quad Var(aX) = a^2 \, Var(X) $$
we define a to be the multiplicative scalar applied to the demeaned x:
$$ (3) \quad a = (\sigma^2_X / Var(X))^{1/2} $$
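In code, equations (1)-(3) amount to a two-line adjustment. A Python sketch (the post's own code is in R; the function name is mine):

```python
import statistics as st

def set_mean_var(x, mu, sigma2):
    """Recenter and rescale x so its sample mean is exactly mu and its
    sample variance is exactly sigma2, per equations (1)-(3)."""
    a = (sigma2 / st.variance(x)) ** 0.5   # equation (3)
    m = st.mean(x)
    return [(xi - m) * a + mu for xi in x]  # demean, scale, shift

z = set_mean_var([1.0, 2.0, 4.0, 8.0], 9.0, 11.0)
print(st.mean(z), st.variance(z))  # 9.0 11.0 (up to float rounding)
```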

Using these identities, we can figure out how to modify the error term u so that the procedure always returns the desired regression coefficients as well as the correct correlation (for more explanation, see the first fifty lines of notes in the coding file).

Through use of such an algorithm, we can feed in any draw of X and any dependency between X and U and still get the same regression results:
Mean(X) = 9, Var(X) = 7.5, B0 = 3, B1 = .5, COR(X,Y)=.8.

Sample Data - Using Anscombe's Parameters

Sample data drawn to generate the following graphs can be found here.

The statistical results in R are displayed as follows. These results are exactly identical regardless of how the underlying data were drawn.

Table 1:

Call: lm(formula = y ~ x, data = xy8)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.00000    0.51854   5.785 5.32e-07 ***
x            0.50000    0.05413   9.238 3.18e-12 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared:   0.64, Adjusted R-squared:  0.6325 
F-statistic: 85.33 on 1 and 48 DF,  p-value: 3.18e-12

Since we cannot see any difference by looking at standard descriptive statistics, let's see how the data look when graphed.
Figure 1: Graphs 1-4 are recreations of Anscombe's Quartet; graphs 5-8 are new.
Graphs 1-4 are recreations of Anscombe's Quartet. Graph 1 is what we often imagine our data should look like. Graph 2 shows a nonlinear relationship between x and y that should be examined. Graph 3 could present a problem, since there is no variation in x except for one observation, which drives all of the explanatory value of the regression. Graph 4 is similar, except now there is variation in both x and y; however, the relationship between the variables is distorted by the presence of a single powerful outlier.

Graphs 5-8 are ones I came up with. Graph 5 features a weak linear relationship between x and y that is exaggerated by a single outlier. Graph 6 is a negative log. Graph 7 is an example of heteroskedasticity. Graph 8 is an example of x taking only one of two values.

Anscombe emphasizes that the funkiness of the data does not necessarily mean the inference is invalid. That said, ideally removing a single data point should not significantly change the inference. Either way, researchers should know what their data look like.

As for graph 7, we generally do not expect heteroskedastic errors to bias our estimates. Rather, they suggest that using heteroskedasticity-robust (White-Huber) standard errors might improve our inference (generally speaking).

Sample Data - Using Negative Slope Parameters

Sample data is drawn from the same parameters except that now the slope is negative. 

Table 2: 
Call: lm(formula = y ~ x, data = xy1)

    Min      1Q  Median      3Q     Max 
-2.3862 -0.6586 -0.2338  0.5721  3.6159 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.00000    0.51854   5.785 5.32e-07 ***
x           -0.50000    0.05413  -9.238 3.18e-12 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared:   0.64, Adjusted R-squared:  0.6325 
F-statistic: 85.33 on 1 and 48 DF,  p-value: 3.18e-12

We can see that changing the slope to negative does not change any of the other statistics.

Figure 2: Same as figure 1 except B1 = -0.5


Graph your data! If you do not present graphs in your final analysis, at least graph the data in the exploration phase. Ideally, presenters of data and analysis have some mastery of tools for data exploration and interaction that can be presented alongside the data (such as interactive data interfaces like Shiny or Tableau).

Such supplementary graphs will likely not be the basis for whether the statistical arguments you are making are valid, but they will add credibility.


Find my code for generating exact linear relationships between X and Y, regardless of the dependency between the errors U and X (u|x), here.

Friday, March 8, 2019

Mass Shooting Data - Tableau

Some experimentation with mass shooting data visualization using Tableau.
Here is a related graph/animation generated in R