Thursday, October 9, 2014

Waterfall and 3D plotting exploration

Taking the very cool 'waterfall graph' code posted by Robert Grant, I have added some features (resistance to distributions with sparse data in some areas) as well as the ability to heat map the bivariate distribution based on a third variable z. Find the code on github.

Overall I find the graphs produced from this code to be beautiful and fascinating though I am not sure if I would really use them as a form of data exploration. In addition, I am not sure if I would expect anybody to be able to understand what I am communicating. But let us first see some graphs associated with different distributions before we jump to any conclusions.
Figure 1: Plotting data points.
The primary motivation for a density graph like this is Figure 1. We can see there is something going on between variables x and y, but it is really hard to see what it is. I could make the dots smaller (currently cex=.2) or more transparent (currently 10%), but there will always be the problem that I have 500,000 points, and plotting them will ultimately result in some loss of information, either in the core where the color becomes solid or near the tails where the color blends into the background.

Figure 2: A standard 3D density graph created with the MASS package.
Of course there is no way of plotting this much data without some loss of information. Plotting a 3D graph of densities is one way of reducing the information to a more manageable level. The MASS package provides some tools to do this (Figure 2: Reference).

Figure 3: Four different ways of slicing the same data.

Let us keep these previous graphs in mind as we explore Robert Grant's waterfall graph. The bivariate distribution of x and y plotted in the previous graphs can be seen in Figure 3, mapped from four different perspectives (the bottom two perspectives are in reverse order, so the bottom left hand panel is not the same). One immediately notices that while there are some similarities there are also a lot of differences.

Visibly the waterfall graph seems to have more information, with many more lines as well as some use of transparency. The four 3D renderings of the data are much easier to identify as the same information from different angles. The waterfall graph, however, seems to be less of a rotation or transposition of the same object. This puzzled me for a while until I realized that the waterfall graph, in its current form, is not communicating the same information as the 3D graph. Look at the top left graph. Notice how all of the peaks seem to be equivalent in the waterfall graph yet very pointed in the 3D rendering.

Figure 4: Distance between slices is proportional to number of observations.
This is because a new density curve is mapped for each level of the slice. However, slices are based on equal length cuts from the y range (or the x range). Thus some cuts, such as those near the center, have a lot of observations while those near the ends have very few. In order to adjust for this difference in information I have included, as an option, that the distance between the density curves be proportional to the number of observations within each slice (Figure 4). These figures do not have quite the same feel as the last set, probably because they do not look so much like mountains. However, they contain much more information: the density curves near the ends are compressed and those near the center are stretched apart, communicating that the extreme values are rare.
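The spacing idea can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the R code from the github repo; the function names `slice_counts` and `baseline_offsets` are hypothetical, and the rule "gap after a slice equals that slice's share of all observations" is an assumption about how the proportional spacing might be computed.

```python
import random

def slice_counts(values, n_slices):
    """Count observations in each of n_slices equal-width cuts of the range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_slices
    counts = [0] * n_slices
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last slice.
        idx = min(int((v - lo) / width), n_slices - 1)
        counts[idx] += 1
    return counts

def baseline_offsets(counts):
    """Baseline position of each slice's density curve on a 0-1 scale.

    Assumed rule: the gap after a slice is that slice's share of all
    observations, so crowded slices are spread apart and sparse slices
    are squeezed together.
    """
    total = sum(counts)
    offsets, position = [], 0.0
    for c in counts:
        offsets.append(position)
        position += c / total
    return offsets

random.seed(1)
y = [random.gauss(0, 1) for _ in range(50_000)]  # stand-in for the slicing variable
counts = slice_counts(y, 10)
offsets = baseline_offsets(counts)
```

With normally distributed data the first and last slices hold only a handful of points, so their baselines sit almost on top of each other, while the central slices take up most of the vertical room.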
Figure 5: Transparency is proportional to sparsity.
Figure 6: Height is proportional to number of observations.
I also wrote an option to allow the sparsely populated slices to be more transparent (Figure 5). In some ways this is a more intuitive graph. Conceptually you can think of each slice as being filled in, in a way, by its number of observations. Finally I wrote a different option to have the figure's height vary by population (Figure 6). Not surprisingly, this ended up producing a graph very similar to Figure 2. What was surprising was how completely different this rendering of the data appeared from the other graphs.
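Both options boil down to weighting each slice by how populated it is. The sketch below is a hedged Python illustration, not the repo's R code: the helper names, the `min_alpha` floor, and the slice counts are all made up for the example.

```python
def relative_weights(counts):
    """Scale slice counts so the most populated slice has weight 1.0."""
    biggest = max(counts)
    return [c / biggest for c in counts]

def slice_alpha(counts, min_alpha=0.05):
    """Opacity per slice: sparse slices fade toward min_alpha, the fullest is opaque.

    min_alpha is an assumed floor so that nearly empty slices stay faintly visible.
    """
    return [min_alpha + (1.0 - min_alpha) * w for w in relative_weights(counts)]

def scale_heights(density, count, biggest):
    """Shrink one slice's density curve by its share of the largest slice."""
    return [d * count / biggest for d in density]

counts = [12, 480, 2950, 6100, 2870, 455, 9]  # made-up slice counts
alphas = slice_alpha(counts)

# A sparse slice's curve is flattened relative to its unscaled density.
curve = scale_heights([0.1, 0.4, 0.1], counts[0], max(counts))
```

Scaling the heights this way turns each slice's conditional density into something closer to the joint density, which is one way to see why the Figure 6 style ends up resembling an ordinary 3D density plot.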

Now that we have a few different ways of communicating the same information, let us see which set of graphs best helps us understand the relationship between our two variables of interest. First let us note how I generated the data (y~N(0,1) and x~3*N(0,1)+y^2). Thus x and y are uncorrelated but still dependent. All of the figures communicate some of this information. I might personally prefer Figure 1 because it communicates the overall shape of the data well enough. I think the 3D graphs do not add significantly to Figure 1. But this is because everything peaks smoothly in the dark area. If, however, there were, say, a dip in density within the dark region, then Figure 1 would not be able to warn us of that dip unless it was severe.
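The claim that x and y are uncorrelated but dependent is easy to check by simulation. The following is a small Python sketch of the same data-generating process (the sample size, seed, and the `pearson` helper are choices made for this illustration, not taken from the original R code).

```python
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(42)
n = 100_000
y = [random.gauss(0, 1) for _ in range(n)]
x = [3 * random.gauss(0, 1) + v ** 2 for v in y]  # x = 3*N(0,1) + y^2

r_xy = pearson(x, y)                      # linear correlation of x with y
r_xy2 = pearson(x, [v ** 2 for v in y])   # correlation of x with y^2
```

In theory Cov(x, y) = E[y^3] = 0, so `r_xy` should sit near zero, while the correlation between x and y^2 is 2/sqrt(22), roughly 0.43. The dependence is real; it just does not show up as a linear trend.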

Comparing Figure 1, Figure 2, and Figures 3 through 6, I am not sure if I would favor the waterfall graphs (except Figure 6, which hardly counts). While beautiful, they do not communicate to me clearly what they are representing. However, perhaps this is just an issue of developing an internal intuition of what is going on. To that end, let us explore some less complex data relationships.
Figure 7: No relationship, positive correlation, negative correlation, linear dependency.


Figure 7 (TL, TR, BR, BL) shows what no correlation (x=N(0,1), y=N(0,1)), positive correlation (x=N(0,1)+.5y), negative correlation (x=N(0,1)-.5y), and perfect collinearity (x=y) look like. Does this help with our intuition? Maybe a little bit.
Figure 8: It is possible to include an additional variable z which is used to select color from multiple RGB specified colors.
As an additional option, I coded into the function the ability to take a third variable which acts as a color variable. In order for this to work properly you must specify a matrix with two or more rows, each holding the 3 or 4 values of an RGB color. As the third variable z varies from low to high it will automatically color the slice appropriately. I have set z as a function of x and y with noise (Figure 8). Figure 8 varies between three different colors (yellow, teal, darker blue). The top two graphs are of x and y with z as color, while the bottom two are instead of either z and x with y as color or z and y with x as color. Because this is made up data I do not get much out of it. However, I could imagine someone being able to find meaning in these graphs.
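One plausible way to turn a z value into a slice color is piecewise linear interpolation between consecutive rows of the color matrix. The Python sketch below illustrates that idea only; `color_for` is a hypothetical name, the RGB values are rough guesses at yellow, teal, and a darker blue, and the actual R function may blend colors differently.

```python
def color_for(z, z_min, z_max, palette):
    """Blend linearly between consecutive palette rows as z goes from z_min to z_max.

    palette needs two or more rows; each row holds 3 (RGB) or 4 (RGBA)
    values between 0 and 1.
    """
    if z_max == z_min:
        return tuple(palette[0])
    t = (z - z_min) / (z_max - z_min)
    t = min(max(t, 0.0), 1.0)            # clamp out-of-range z values
    pos = t * (len(palette) - 1)         # position along the palette rows
    i = min(int(pos), len(palette) - 2)  # index of the lower row of the segment
    frac = pos - i
    return tuple(a + (b - a) * frac for a, b in zip(palette[i], palette[i + 1]))

# Assumed colors, for illustration only.
palette = [
    (1.0, 1.0, 0.0),   # yellow
    (0.0, 0.5, 0.5),   # teal
    (0.0, 0.0, 0.55),  # darker blue
]

low = color_for(0.0, 0.0, 10.0, palette)    # lowest z: first row
mid = color_for(5.0, 0.0, 10.0, palette)    # halfway: middle row
high = color_for(10.0, 0.0, 10.0, palette)  # highest z: last row
```

In a waterfall plot the z passed in for each slice could be, for example, the mean of z among that slice's observations, though that choice is also an assumption here.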

Overall, I would have to say that I am unconvinced that the waterfall graph is an effective substitute for a 3D graph. However, there is no reason to believe that this should be the only criterion for judging such a graph! So far we have been acting as if both x and y were random variables.

But often one of the variables is not random, or we are comfortable acting as if it is not random when evaluating the other variable. In addition, we might not care how sparsely populated our slices are; we are mostly concerned with how our distributions change over time. Take income information and age. We might not care how many x year old people there are. But we may care how the distribution of income lies within each age category.
Figure 9: Log wages against age. Top left: youngest is closest. Top right: oldest is closest. Bottom left: age against log wages. Bottom right: same as top left except height is proportional to observations.
Figure 9 shows us exactly this information. Using IPUMS data (citation below) for 2005, looking only at age and log wage income and excluding zeros and non-reported values, we see the distribution of wage income for different ages, with the youngest in front and the oldest in back on the top left graph. On the top right the age order is reversed. We can see that as people get older wages tend to increase and tighten into middle age, then plateau before falling a bit and widening once again in older life. The bottom left graph is the same information except that now wage is the slice and age is the density. It looks like the wages of older folks tend to be higher, while younger ones heavily represent the lowest wage earning categories. The final panel shows us the same information as the first panel except that now height is proportional to the number of observations within the slice. We can see that the high representation of middle-aged baby boomers dominates this graph and makes other comparisons difficult.

From this real world example, I therefore think that the waterfall slice graphing framework is very useful. It is not a replacement for the 3D graph but rather an alternative representation of a different feature of the data. If you would like to find the code to create the graphs used in this article, please check out my github repo. And if you find this post helpful, please leave me some feedback!

IPUMS Citation:

Steven Ruggles, J. Trent Alexander, Katie Genadek, Ronald Goeken, Matthew B. Schroeder, and Matthew Sobek. Integrated Public Use Microdata Series: Version 5.0 [Machine-readable database]. Minneapolis, MN: Minnesota Population Center [producer and distributor], 2010.

1 comment:

  1. Hey, these graphs are great! Does anybody have the code for the waterfall graph (fig. 8)?
