Tuesday, March 19, 2019

The Importance of Graphing Your Data - Anscombe's Clever Quartet!


Francis Anscombe's seminal paper "Graphs in Statistical Analysis" (The American Statistician, 1973) effectively makes the case that looking at summary statistics of data is insufficient to identify the relationship between variables. He demonstrates this by generating four different data sets (Anscombe's quartet) which have nearly identical summary statistics: the same means and variances for x and y, the same correlation between x and y, and the same coefficients from the linear regression of y on x. (There are certainly less widely reported summary statistics, such as kurtosis or least absolute deviations/median regression estimates, which would have indicated differences between the data sets.) Yet without graphing the data, any analysis would likely be missing the mark.

I found myself easily convinced by the strength of his arguments, yet I was also curious how he produced sample data that fit his statistical argument so perfectly. Given that he had only 11 data points per set, I am inclined to think he played around with the data by hand until it fit his needs. This is suggested by the lack of precision in the summary statistics of the quartet.

If he could do it by hand, I should be able to do it with an algorithm!

The benefit of having such an algorithm would be that I could generate an arbitrary number of datasets that exactly fit specified sample parameters. I tried a few different methods of producing the data that I wanted.

Method 1 - randomly draw some points then select the remaining - fail

One method was to select just the last point or two of a data set. Say I wanted to draw 11 x values with mean 9 and variance 11, as in Anscombe's data. I attempted to draw 10 points and then fix the mean and variance by selectively choosing the 11th point. This approach quickly fails because it relies too heavily on that last point. Say the mean of the first 10 draws was unusually low, at 8. In order to pull the sample mean back up to 9, the 11th point would need to be 19. Then you would somehow have to manage the variance, which you know is already going to be blown up by the presence of that 11th value (a quick illustration is sketched below).
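A quick R sketch of the problem (the numbers here are hypothetical, chosen only for illustration):

set.seed(1)
x10 <- rnorm(10, mean = 8, sd = 2)  # suppose the first 10 draws come out low, around 8
x11 <- 11 * 9 - sum(x10)            # the single value that forces mean(x) to be exactly 9
x <- c(x10, x11)
c(mean(x), var(x))                  # the mean is exactly 9, but the variance is now out of our control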

Method 2 - use optimization to select points which match the desired outcome - fail

Next I tried some search algorithms, using numerical optimization to look for values that matched my target statistics. This attempt was highly problematic and failed to produce any useful results.

Method 3 - brute force, randomly generate data - fail

The intent of this approach was to generate data close to the target parameters and then modify individual data points to match the desired properties.

Method 4 - modify random data to meet parameter specifications

Fortunately, after a little reflection I realized the smarter approach was to make use of what I know about means, variances, and correlations to modify the sample to fit my desired outcome. For instance, no matter what x I started with (so long as x had any variation), I could adjust it to fit my needs. If the mean of x needs to be mu_X, we can construct an adjusted X' that satisfies this:
$$ (1) X' = X - mean(X) + \mu_X $$

Slightly more challenging, we can also set the variance of x by scaling the demeaned values of the sample. Since we know that
$$ (2) Var(aX) = a^2 * Var(X) $$
define a, the multiplicative scaling factor for the demeaned x, as
$$ (3) a = (\sigma^2_X/Var(X))^{1/2} $$
so that the adjusted values a*(X - mean(X)) + mu_X have exactly the target mean and variance.
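As a concrete illustration, here is a minimal R sketch of steps (1)-(3) (my own toy example, not the full code referenced at the end of the post), using targets of mean 9 and variance 11:

x <- rnorm(50)              # any starting draw with some variation
a <- sqrt(11 / var(x))      # (3) the scaling factor for the demeaned values
x <- a * (x - mean(x)) + 9  # apply (2) to the demeaned values, then recenter as in (1)
c(mean(x), var(x))          # exactly 9 and 11 (up to floating point error)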

Using similar identities, we can figure out how to modify the error term u so that the regression always returns the desired coefficients as well as the correct correlation (for more explanation, see the first fifty lines of notes in the coding file).

Through use of such an algorithm we can feed in any draw of X and any dependency between X and U and we will always get the same regression results:
Mean(X) = 9, Var(X) = 11, B0 = 3, B1 = 0.5, COR(X,Y) = 0.8.
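Here is a minimal sketch of the idea in R (the full code with notes is referenced in the CODE section below; the helper make_exact_xy and its defaults here are just illustrative). It residualizes u on x so that the OLS fit returns B0 and B1 exactly, then rescales the residual so the correlation is exactly 0.8 in magnitude.

make_exact_xy <- function(x, u, mu_x = 9, var_x = 11,
                          b0 = 3, b1 = 0.5, rho = 0.8) {
  # (1)-(3): force x to the exact target mean and variance
  x <- sqrt(var_x / var(x)) * (x - mean(x)) + mu_x
  # Strip from u its sample mean and anything linearly related to x, so that
  # regressing y on x returns b0 and b1 exactly (u must not be exactly linear in x)
  e <- residuals(lm(u ~ x))
  # Rescale the residual so that cor(x, y) equals rho in magnitude:
  # Var(e) must equal b1^2 * Var(x) * (1 - rho^2) / rho^2
  e <- e * sqrt(b1^2 * var_x * (1 - rho^2) / rho^2 / var(e))
  data.frame(x = x, y = b0 + b1 * x + e)
}

set.seed(1)
x0 <- rexp(50)                  # any draw of x with some variation
u0 <- x0^2 + rnorm(50)          # u may depend on x however we like
xy8 <- make_exact_xy(x0, u0)
summary(lm(y ~ x, data = xy8))  # coefficients, SEs, and R-squared as in Table 1 below
c(mean(xy8$x), var(xy8$x), cor(xy8$x, xy8$y))  # 9, 11, 0.8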

Sample Data - Using Anscombe's Parameters


Sample data drawn to generate the following graphs can be found here.

The statistical results in R are displayed below. By construction, these results are exactly identical regardless of how the underlying data are generated.

Table 1:

Call: lm(formula = y ~ x, data = xy8)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.00000    0.51854   5.785 5.32e-07 ***
x            0.50000    0.05413   9.238 3.18e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared:   0.64, Adjusted R-squared:  0.6325 
F-statistic: 85.33 on 1 and 48 DF,  p-value: 3.18e-12


Since we cannot see any difference by looking at standard descriptive statistics, let's see what the data look like when graphed.
Figure 1: Graphs 1-4 are recreations of Anscombe's Quartet. Graphs 5-8 are new.
Graphs 1-4 are recreations of Anscombe's Quartet. Graph 1 is what we oftentimes imagine our data should look like. Graph 2 is a situation in which there is a nonlinear relationship between x and y which should be examined. Graph 3 could present a problem, since there is no variation in x except for one observation, which drives all of the explanatory value of the regression. Graph 4 is similar, except that now there is variation in both x and y; however, the relationship between them is distorted by the presence of a single powerful outlier.

Graphs 5-8 are ones I came up with. Graph 5 features a weak linear relationship between x and y which is exaggerated by a single outlier. Graph 6 is a negative logarithmic relationship. Graph 7 is an example of heteroskedasticity. Graph 8 is an example of x taking only one of two values.

Anscombe emphasizes that funkiness in the data does not necessarily mean the inference is invalid. That said, ideally removing a single data point should not significantly change the inference. Either way, researchers should know what their data look like.

As for graph 7, heteroskedastic errors do not generally bias our coefficient estimates. Rather, they make the usual standard errors unreliable, which suggests using heteroskedasticity-robust (Huber-White) standard errors for inference.

Sample Data - Using Negative Slope Parameters


Sample data is drawn using the same parameters, except that now the slope is negative.

Table 2: 
Call: lm(formula = y ~ x, data = xy1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3862 -0.6586 -0.2338  0.5721  3.6159 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.00000    0.51854   5.785 5.32e-07 ***
x           -0.50000    0.05413  -9.238 3.18e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared:   0.64, Adjusted R-squared:  0.6325 
F-statistic: 85.33 on 1 and 48 DF,  p-value: 3.18e-12

We can see that flipping the sign of the slope does not change any of the other reported statistics.
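With the sketch from above, this is just a matter of flipping the sign of the slope parameter:

xy1 <- make_exact_xy(x = rexp(50), u = rnorm(50), b1 = -0.5)
summary(lm(y ~ x, data = xy1))  # slope exactly -0.5; coefficients, SEs, and R-squared as in Table 2
                                # (the residual quantiles depend on the particular draw)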

Figure 2: Same as Figure 1 except B1 = -0.5.

Summary

Graph your data! If you do not present graphs in your final analysis, at least graph the data in the exploration phase. Ideally, presenters of data and analysis would have some mastery of tools for data exploration and interaction which can be presented alongside the data (such as interactive interfaces like Shiny or Tableau).

Such supplementary graphs will likely not be the basis for whether the arguments you are making through statistics are valid, but they will add credibility.

CODE

Find my code for generating exact linear relationships between X and Y, regardless of the dependency of the errors U on X (u|x).
