Francis Anscombe's seminal paper
on "Graphs in Statistical" analysis (American Statistician, 1973) effectively makes the case that looking at summary statistics of data is insufficient to identify the relationship between variables. He demonstrates this by generating four different data sets (Anscombe's quartet) which have nearly identical summary statistics. His data have the same mean and variance for x and y, same correlations between x and y, and same regression coefficients on the linear projection of x on y. (There are certainly additional summary statistics less widely reported such as kurtosis or least absolute deviations/median regression which were not reported which would have indicated differences between the data.) Yet even with these differences, without graphing the data, any analysis would likely be missing the mark.
I
found myself easily convinced by the strength of his arguments yet also curious
as to how he produced the sample data that fit his statistical argument so
perfectly. Given that he had only 11 points of data, I am drawn to think he played around with the data by hand till it fit his needs. This is suggested by the lack of precision on the statistics of the generated data (
Anscombe's quartet).
If he could do it by hand, I should be able to do it through algorithm!
The benefits of having such an algorithm would be that I generate an arbitrary number of datasets and data that exactly fit specific sample parameters.
I tried a few different methods of producing the data that I wanted.
Method 1 - randomly draw some points then select the remaining - fail
One method was to select just the last point or two from a set of data, say I wanted to draw 11 X points with with mean 9 and
variance 11 as found in the data. I attempted to draw 10 points then adjust the
mean and variance by selectively drawing the 11th point. This
approach however quickly fails as it relies too much on the 11th point. Say the
mean from the first draws was unusually low with a mean of 8. In order to
weight the sample mean back to 9 the 11th point would therefore need to be 19
in order to balance the x values at 9. Then you have to somehow figure out how
to manage the variance which you know is already going to be blown up by the
presence of my 11th value.
Method 2 - use optimization to select points which match the desired outcome - fail
Next I tried some search algorithms trying to use
computation to search for possible values that fit my needed data. This was a
highly problematic attempt that failed to produce any useful results.
Method 3 - brute force, randomly generate data - fail
The intent of the approach was to get data close to target parameters, then modifying individual data points to match desired properties.
Method 4 - modify random data to meet parameter specifications
Fortunately, after a little reflection I realized the
smarter approach was to make use of what I know about means and variances as
well as correlations to modifying the sample to fit my desired outcome. For
instance, no matter what x I started with (so long as x had any variation) I
could adjust it to fit my needs. If the mean of x needs to be mu_X. Then we
can force it to be that:
$$ (1) X = X-mean(X) + \mu_X $$
Slightly more challenging, we could modify the variance of x
by scaling the demeaned values of the sample. Since we know that
$$ (2) Var(aX)=a^2 * Var(X) $$
Define a to be a
multiplicative scalar for x
$$(3) a = (\sigma^2_X/Var(X))^{1/2}$$
Using identities we can figure out how to modify the
error term u in order to always return the desired regression values as well as
the correct correlations (for more explanation see first fifty lines of notes in
coding file).
Through use of such an algorithm we can feed in any draw of X and any dependency
between X and U and we will get the same regression results:
Mean(X) = 9, Var(X) = 7.5, B0 = 3, B1 = .5, COR(X,Y)=.8.
Sample Data - Using Ascombe's Parameters
Sample data drawn to generate the following graphs can be found
here.
The statistical results in R are displayed as follows. These results are designed to be exactly identical regardless of how the data is generated.
Table 1:
Call: lm(formula = y ~ x, data = xy8)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.00000 0.51854 5.785 5.32e-07 ***
x 0.50000 0.05413 9.238 3.18e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared: 0.64, Adjusted R-squared: 0.6325
F-statistic: 85.33 on 1 and 48 DF, p-value: 3.18e-12
Since we cannot see any difference from looking at standard descriptive statistics, lets see how the data looks when graphed.
|
Figure 1: Graphs 1-4 are recreations of Anscome's Quartet. 5-8 are new. |
Figures 1-4 are recreations of Anscombe's Quartet. Figure 1 is what we are often times thinking our data should look like in our heads. Figure 2 is a situation in which there is a nonlinear relationship between x and y which should be examined. Figure 3 could present a problem since there is no variation in x except one observation which drives all of the explanatory value of the regression. Figure 4 is similar except now there is variation in x and in y. However the relationship between the values is distorted by the presence of a single powerful outlier.
Figures 5-8 are figures I came up with. Figure 5 features a weak linear relationship between x and y which is exaggerated by a single outlier. Figure 6 is a negative log. Figure 7 is a example of heteroskedasticity. Figure 8 is an example of x taking only one of two values.
Anscome emphasizes that the funkiness of the data does not necessarily mean the inference is not valid. That said, ideally removing a single point of data should not significantly change inference. Yet, researchers should know what their data looks like.
As for figure 7, generally we do not expect heroskedastic errors to present inference bias. Rather they suggest that using heteroskedasticity robust or White-Huber standard errors might improve the efficiency of our estimates (generally speaking).
Sample Data - Using Negative Slope Parameters
Sample data is drawn from the same parameters except that now the slope is negative.
Table 2:
Call: lm(formula = y ~ x, data = xy1)
Residuals:
Min 1Q Median 3Q Max
-2.3862 -0.6586 -0.2338 0.5721 3.6159
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.00000 0.51854 5.785 5.32e-07 ***
x -0.50000 0.05413 -9.238 3.18e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared: 0.64, Adjusted R-squared: 0.6325
F-statistic: 85.33 on 1 and 48 DF, p-value: 3.18e-12
We can see that changing the slope to negative does not change any of the other statistics.
|
Figure 2: Same as figure 1 except B1 = -0.5 |
Summary
Graph your data! If not presenting graphs in your final analysis at least graph it in the exploration phase. Ideally, presenters of data and analysis have some mastery of tools of data exploration and interaction which can presented with data (such as interactive data interfaces Shiny or Tableau).
Such supplementary data found in graphs will likely not be the basis of whether the arguments you are making through statistics are valid, but they will add credibility.
CODE
Find my code for generating exact linear relationships between XY regardless of the dependency of the errors
U and X (u|x).