Friday, March 22, 2019

Data Fun - Inspired by Datasaurus


After my recent post on Anscombe's Quartet, in which I demonstrated how to efficiently adjust any data set to match a given mean, variance, correlation (x,y), and set of regression coefficients, Philip Waggoner turned me on to Justin Matejka and George Fitzmaurice's Datasaurus R package/paper, in which the authors demonstrate an alternative method of modifying existing data to fit a specified mean and variance for X and Y. Their method randomly applies small disturbances to individual observations to gradually move the data to match a target preference set.

Inspired by their use of point images which match a specific parameter set, I have generated some of my own. For all of them, X has a mean of 1 and a variance of 11, while Y has a mean of 1 and a variance of 1. The correlation between X and Y is set to 0, which causes some distortion in the images. More on that later in the post.
Figure 1: Shows a graph of 12 data sets each with 15 transitional data sets. The mean, variance, and correlations of X and Y are held constant throughout the sets and transitions.
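
To make the idea of pinning the mean and variance concrete, here is a minimal Python sketch. It is not the generator code from this post: the function name `rescale` and the sample values are made up for illustration, and it assumes the sample variance (dividing by n-1) is the one being targeted.

```python
from math import sqrt
from statistics import fmean, variance


def rescale(values, target_mean, target_var):
    """Linearly rescale values to a given mean and sample variance."""
    m = fmean(values)
    sd = sqrt(variance(values))  # sample standard deviation (n-1 denominator)
    scale = sqrt(target_var) / sd
    # Center, stretch to the target spread, then shift to the target mean.
    return [target_mean + (v - m) * scale for v in values]


# Hypothetical point coordinates, rescaled to the targets used in this post.
x = rescale([3.0, 8.0, 1.0, 9.0, 4.0], target_mean=1.0, target_var=11.0)
y = rescale([2.0, 2.5, 7.0, 1.0, 6.0], target_mean=1.0, target_var=1.0)
```

Because the transformation is linear, the shape of each variable's distribution is unchanged. Only its location and spread move.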

Data Source

I generated the data myself using Mobilefish's upload photo and record clicks webapp. The source images are ones I found online.

The only slight trick to using the data generated by Mobilefish is that software typically measures y coordinates from the top of the page, whereas most statistical graphing software plots y starting from the bottom of the graph, so the y values need to be flipped before plotting.
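
The flip itself is a one-liner. The sketch below is only an illustration: the function name `flip_y`, the click coordinates, and the image height of 120 are all hypothetical, and it assumes the clicks arrive as (x, y) pairs measured from the top-left corner.

```python
def flip_y(points, height=None):
    """Convert top-origin (x, y) click coordinates to bottom-origin ones."""
    if height is None:
        # Without the image height, fall back to the largest observed y.
        height = max(y for _, y in points)
    return [(x, height - y) for x, y in points]


clicks = [(10, 5), (20, 40), (30, 100)]  # hypothetical recorded clicks
flipped = flip_y(clicks, height=120)
```

Since the data are later rescaled to a fixed mean and variance, the exact height used matters little. What matters is that the direction of y is reversed.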

Raw Images

The raw data, when plotted, look like their source material.



New Images: Force Cor(X,Y)=0

When we force the correlation of X and Y to be zero, certain point distributions become distorted.



For Bart and the Cat, forcing cor(X,Y)=0 introduces noticeable distortions, while for the flower the distortions appear minimal.
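
One way to see where the distortion comes from is to remove the linear trend of Y on X by hand. The sketch below is an assumption about the mechanism, not the generator code from this post: `decorrelate` is a hypothetical helper that subtracts the least-squares fit of Y on X, which leaves the covariance at zero but shears the picture vertically.

```python
from statistics import fmean


def decorrelate(xs, ys):
    """Remove the linear dependence of ys on xs, leaving Cor(X, Y) = 0."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx  # least-squares slope of y on x
    # Subtracting the fitted trend moves each point up or down in
    # proportion to its distance from the mean of x (a vertical shear).
    return [y - slope * (x - mx) for x, y in zip(xs, ys)]


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up coordinates
ys = [2.0, 4.0, 5.0, 4.0, 9.0]
ys0 = decorrelate(xs, ys)
```

An image whose points already have little linear trend, which may be the case for the flower, is barely moved by this step. An image with a strong diagonal tilt is sheared visibly. The variance of Y also changes, so it would need to be rescaled afterwards.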

New Images: Force Cor(X,Y)<>0

It gets even worse when we impose a constant correlation between X and Y. The following shows the distortions to the flower when we change b1, keeping Cor(X,Y) constant and fixing the Y plot limits.
Figure 8: Shows the effect on Var(Y) of changing b1 when all other factors are held constant.
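
The size of this effect follows from the standard simple-regression identity Cor(X,Y)^2 = b1^2 * Var(X) / Var(Y), which rearranges to Var(Y) = b1^2 * Var(X) / Cor(X,Y)^2. The short sketch below tabulates it. Var(X)=11 comes from this post, while the correlation of 0.5 and the b1 values are picked purely for illustration.

```python
def var_y(b1, var_x, cor_xy):
    """Var(Y) implied by slope b1 when Cor(X, Y) is held at a nonzero cor_xy."""
    return (b1 ** 2 * var_x) / cor_xy ** 2


# With the correlation fixed, Var(Y) grows with the square of the slope.
for b1 in (0.1, 0.5, 1.0, 2.0):
    print(f"b1={b1:<4} Var(Y)={var_y(b1, var_x=11.0, cor_xy=0.5):.2f}")
```

With fixed Y plot limits, this quadratic growth is why the flower is squashed for small b1 and stretched off the plot for large b1.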

Slight changes to the Anscombe-Generator Code

In order to generate graphs that had cor(X,Y)=0, I had to modify my previous code to allow variation in Y that was completely independent of X. The problem with my code was that it worked backwards from the slope: for a nonzero slope such as b1=1, my calculation used SSE = (b1^2 * var(X))*n in order to infer how large the variation in u needed to be (varianceu = (SSE/corXY^2 - SSE)/n). This backwards inference does not work if b1=0, because SSE is then zero and the formula yields no usable variance for u.

So, just for the special case of corXY=0, I have included an additional error term E, which is helpful in the event that b1=0.
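
To see the failure concretely, the formula above can be transcribed into a few lines of Python. This is a transcription of the expression quoted in this section, not the actual generator code. The function name `variance_u` and the sample inputs (n=100, corXY=0.5) are assumptions for the example.

```python
def variance_u(b1, var_x, cor_xy, n):
    """Variance of u inferred backwards from the slope, as quoted above."""
    sse = (b1 ** 2 * var_x) * n
    return (sse / cor_xy ** 2 - sse) / n


# A nonzero slope gives a usable error variance.
ok = variance_u(b1=1, var_x=11.0, cor_xy=0.5, n=100)

# With b1=0 the inferred variance collapses to zero, so Y would not vary.
flat = variance_u(b1=0, var_x=11.0, cor_xy=0.5, n=100)

# With corXY=0 the formula divides by zero, whatever the slope.
try:
    variance_u(b1=0, var_x=11.0, cor_xy=0.0, n=100)
    undefined = False
except ZeroDivisionError:
    undefined = True
```

Neither degenerate case leaves any room for variation in Y, which is the gap the separate error term fills.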

Summary

The thought of using points to make recognizable images had not occurred to me until I viewed Justin Matejka and George Fitzmaurice's Datasaurus work. I hope that by making a slightly more efficient distribution manipulator, I will allow new and better datasets to be generated, which will help students understand the importance of graphical exploration of their data.

Code

My code, as well as the slight change to Anscombe's code, can be found here. The standardized data sets can be found here. They are the ones that end with std.csv.
