Friday, February 6, 2015

Calculating the likelihood of item overlap with Base R Assessment

I recently received an email from someone who had taken my Base R Assessment twice. The test taker reported that a large portion of the items on his second attempt (around 50%) were duplicates of items from his first attempt.

I began wondering what the likelihood of this occurring would be.

The test selects between 10 and 20 items each time it is taken.

These items are selected randomly.

There are around 80 items available in total.

First, let us do some back-of-the-envelope calculations.

1. Let's say the average number of test items is 15. Thus the average test covers 15/80 = 18.75% of the item pool.
2. If the same person were to take the test again, we would expect around 18.75% * 15 ≈ 2.8 items to be duplicates (a quick check of this arithmetic is below).
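
As a quick check of that arithmetic (assuming, purely for illustration, a fixed 15-item test drawn from the 80-item pool):

# Back-of-the-envelope check with a hypothetical fixed 15-item test.
15 / 80           # 0.1875, the share of the pool seen on the first test
(15 / 80) * 15    # 2.8125, expected number of duplicate items on a retake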

We might want to know instead what the chance is of, say, 50% of the items administered in the second test being duplicates.

In order to do this calculation exactly we would basically need to list all of the combinations of items which result in 50% overlap.

For instance, suppose we have a ten-item exam in which items 1:5 overlap with the first test and items 6:10 do not.

The probability of this particular outcome occurring is {the items that overlap} * {the items that do not}: 1/80 * 1/79 * 1/78 * 1/77 * 1/76 * 74/75 * 73/74 * 72/73 * 71/72 * 70/71

Which results in a really small number. Then we would calculate the probability of the first four items matching and the last item matching: 1/80 * 1/79 * 1/78 * 1/77 * 75/76 * 74/75 * 73/74 * 72/73 * 71/72 * 1/71

Next the first three and the last two, and so on, repeating the calculation for every possible combination. This process becomes rather tedious and time consuming pretty quickly.
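
As an aside, for fixed test lengths there is a shortcut: the number of repeated items follows a hypergeometric distribution, which base R handles directly with dhyper/phyper. This is only a sketch under the simplifying assumption that both administrations contain exactly 15 items; the real test varies between 10 and 20 items, which the simulation below handles directly.

# Sketch: exact distribution of the number of repeated items, assuming both
# tests contain exactly 15 items drawn without replacement from a pool of 80.
repeats <- 0:15
p_rep   <- dhyper(repeats, m = 15, n = 65, k = 15)  # P(exactly that many repeats)

sum(repeats * p_rep)  # expected repeats, about 2.8, matching the rough calculation

# P(8 or more repeats, i.e. more than half of a 15-item test being duplicates)
phyper(7, m = 15, n = 65, k = 15, lower.tail = FALSE)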

However, if you are willing to accept a little approximation error, a much easier way to handle the varying test length is to put together a little simulation.

Let us instead simulate the process a large number of times, say 100,000.

Nsim    <- 10^5           # number of simulated test pairs
overlap <- rep(NA, Nsim)  # share of second-test items repeated from the first
testMin <- 10             # minimum test length
testAdd <- 10             # up to 10 additional items (so 10-20 items per test)
 
for (i in 1:Nsim) {
  # Draw a random length for each administration.
  testL1 <- testMin + sample(0:testAdd, 1)
  testL2 <- testMin + sample(0:testAdd, 1)
 
  # Randomly select items (without replacement) from the 80-item pool.
  first  <- sample(80, testL1)
  second <- sample(80, testL2)
 
  # Proportion of the second test's items that also appeared on the first test.
  overlap[i] <- mean(second %in% first)
}
 
# Generate a density curve of overlap. The mode is around .2 which is just around 
# where we expected the average to be.
plot(density(overlap)) 
 
  
mean(overlap)
# [1] 0.1877303
 
# Now we can use the built-in function 'ecdf' (empirical CDF) to process our 
# frequencies.
OverCDF <- ecdf(overlap)
 
# The ecdf function is pretty great! It can turn observations into probability data.
 
# We can plot our CDF curve. It is a step function because only a finite 
# number of overlap percentages are possible.
plot(OverCDF, verticals = TRUE, do.points = FALSE) 
 
  
# We can also use it to calculate the likelihood of a certain level of overlap or 
# less.
 
# Say what is the likelihood of 25% or less of the items being repeated?
OverCDF(.25)
# [1] 0.76224 This is a pretty high value.
 
# What about 18.75% or less of the items being repeated?
OverCDF(.1875)
# [1] 0.53936 Which is pretty close to 50%.
 
# Now we can ask the question, what is the likelihood that the user had 50% or 
# more overlap between exams.
1-OverCDF(.5)
 
#[1] 0.00212 This is a really small number, and looking at it alone we might think 
# this person must have been exaggerating.
 
# However, a lot of people have taken the exam (around 1,250 independent tries).
 
# Let's say each person attempted the test twice, giving us roughly 625 pairs of 
# attempts. What is the probability that during at least one of these pairs someone 
# received 50% or more of the same items?
1-(OverCDF(.5))^625
 
# Roughly 0.73, which is a surprisingly large number.
 
# However, saying everybody took the test twice is pretty unlikely. Let's instead 
# say 100 people took the test twice.
1-(OverCDF(.5))^100
 
# [1] 0.1912173 These are very respectable odds. They indicate that if 100 people 
# took the test twice, there is roughly a 20% chance that at least one of them 
# would end up seeing 50% of the same items on the second test.
 
# Conversely we might want to ask the likelihood that someone would not see any 
# of the same items.
OverCDF(0)
 
# Giving about 5%.

Tuesday, February 3, 2015

Base R Assessment!

Test your skills with this R-powered assessment of base R knowledge!

Built using the R-powered adaptive testing platform Concerto, this assessment provides a short but powerful tool for evaluating your understanding of base R relative to that of your peers.

Currently the assessment has over seventy items (questions) in the pool, while each individual is given fewer than twenty of these, selected randomly. So each test is unique.

Those who score well enough will be given the chance to contribute their own items to challenge other users.
 
http://concerto4.e-psychometrics.com/?wid=13&tid=14

Wednesday, November 12, 2014

Convergence of a Series

Let us explore using simulation some of the concepts of basic asymptotic theory as presented in Wooldridge 2012, Chapter 3.

Definition: A sequence of nonrandom numbers {a_N : N = 1, 2, ...} converges to a if, for all epsilon > 0, there exists N_epsilon such that if N > N_epsilon, then
$$|a_N - a|<\epsilon$$

Paraphrase: A sequence converges on a point a if, for any positive number epsilon you choose (any epsilon greater than zero), you can find some point in the series after which the difference between the series and a is always smaller than epsilon.

Let us try to make this more concrete.

Imagine the series:

  $$f(N)=N^2$$
 
Does this series converge on a value a?

Let us pick an epsilon, say 100.
## R ---------------------------
f = function(x) x^2
N = 1:30
plot(N, f(N), type='l', col='lightblue', lwd=4)
## -----------------------------
We can see that there is no a which could exist for which our series seems to converge within the positive bound epsilon=100.

That this series does not converge might have been obvious. Let us try a different series.
  $$g(N)=10N^{-99/100}\sin(N)+\pi$$
## R ---------------------------
g = function(x) 10*x^(-99/100) * sin(x) + pi
N = 1:100
plot(N, g(N), type='l', lwd=2)
## -----------------------------

The series g demonstrates a common convergence pattern: you can always pick an epsilon and find a point after which the difference between g(N) and the convergence point is less than epsilon. In this case the series converges on pi. Let us play around with a few epsilons to see if our hypothesis that the series converges seems to hold true.

## R ---------------------------
e1 = 5
abline(h=pi+e1, lwd=2, col='darkred')
abline(h=pi-e1, lwd=2, col='darkred')
## -----------------------------
We can see that at N<10 there are some points which are not bounded by |g(N)-pi|<e1. However, by the time we get to N>20, all points are easily bounded (at least those we have plotted).

How about a harder epsilon?
## R ---------------------------
e2 = .5
abline(h=pi+e2, lwd=2, col='darkblue')
abline(h=pi-e2, lwd=2, col='darkblue')
## -----------------------------

The series falls outside this tighter epsilon near N=20, but by the time we have reached N=40 it no longer strays outside our bounds.

## R ---------------------------
e3 = .1
abline(h=pi+e3, lwd=2, col=rgb(.3,.6,1,.5))
abline(h=pi-e3, lwd=2, col=rgb(.3,.6,1,.5))
## -----------------------------
We can see that with e3=.1 there is no point on the plotted range after which g stays bounded by our epsilon. However, plotting the series over a longer range of N will give us a value of N after which g is bounded.

## R ---------------------------
N = 100:200
plot(N, g(N), type='l', lwd=2)
e3 = .1
abline(h=pi+e3, lwd=2, col=rgb(.3,.6,1,.5))
abline(h=pi-e3, lwd=2, col=rgb(.3,.6,1,.5))
## -----------------------------

The graph clearly shows that, for our small value of epsilon (.1), the series is bounded around pi from roughly N=110 onward. For any even smaller epsilon there will always be an N after which this is true (even if our computers do not have sufficient accuracy to show it).

However, since we are working with an infinite series, there is always an infinite portion of the graph which we can never plot. We therefore cannot use graphs to prove that a series converges; in theory, a series can always break the bounds just beyond the area of inspection. The switch function
$$f(N) = 1[N>100]$$
is an extreme example: it appears to converge to 0 for N<100, but at N=100 and beyond the series is constant at 1, and in fact converges to 1.

Other series converge without oscillating:
$$f(N)=a+b/N$$

Some more challenging series which also converge include:
$$f(N)=bN/exp(N)$$
$$f(N)=N^{1/N}$$
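
A quick numeric check of these last examples is below (just a sketch; a and b are arbitrary constants chosen for illustration):
## R ---------------------------
# Evaluate each series at increasingly large N and watch where it settles.
a <- 2; b <- 5                 # arbitrary illustrative constants
N <- c(10, 100, 1000, 10000)

1 * (N > 100)   # the switch function 1[N > 100]: 0 0 1 1
a + b / N       # approaches a = 2
b * N / exp(N)  # approaches 0 very quickly
N^(1 / N)       # approaches 1
## -----------------------------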

Monday, November 3, 2014

Make your own hotly criticised circle graph!!!

Make your own hotly criticized circle graph! The recent critical post on R-bloggers has got me wanting to make my own bubble graphs despite the popular criticism of the graph. After some attempts I have produced something that looks similar but needs some more work.

I would like to say briefly that the R-bloggers post, while interesting and reasonable, seemed to miss the point. The graphic was not supposed to convey massive amounts of information but simply to make obvious the scope and degree of how the current Ebola epidemic compares with other epidemics.

There are many good reasons to critique this graph from an epidemiological perspective (forthcoming), yet from a graphing perspective I think this format is extremely efficient at conveying memorable information. While the graphs assembled by PirateGrunt were more detailed, they were unfortunately also entirely forgettable.

Here are a couple of examples. The first uses meaningless simulated data to demonstrate how large differences in scale may appear. The second uses actual HIV prevalence data to give a graphical representation of how such rates have changed over the years. You can find the code as well as the data on github.
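
Purely as a hedged sketch of the basic idea (not the code behind the tables below), base R's symbols() function can already draw circles whose area is proportional to the value being compared:

# Sketch with made-up counts: circle *area* proportional to the value shown.
cases <- c(100, 1000, 10000)
radii <- sqrt(cases / pi)

plot(1:3, rep(1, 3), xlim = c(0.5, 3.5), ylim = c(0, 2),
     type = "n", axes = FALSE, xlab = "", ylab = "")
symbols(1:3, rep(1, 3), circles = radii, inches = 0.5,
        bg = "firebrick", add = TRUE)
text(1:3, rep(0.25, 3), labels = cases)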

Table 1: Demonstrates how scale can be communicated by size of circle


Table 2: Demonstrates the relative stability of AIDS in terms of people ill. The top number is the year while the lower number is millions of cases.

Monday, October 27, 2014

Ebola in Liberia could be different than Ebola in New Jersey

http://www.pixton.com/uk/comic/8wlmhhc2

The Ebola outbreak in Western Africa has initiated panic throughout the world. Thirty-seven countries so far have implemented policies to restrict the international spread of Ebola. In the United States, Governor Chris Christie has implemented additional travel restrictions, imposing a mandatory quarantine of health workers returning from Liberia, Sierra Leone, and Guinea even when no symptoms are present.

Should Chris Christie be concerned about Ebola in New Jersey? In some ways New Jersey is not so different from Liberia which has already suffered 4,665 deaths according to the CDC.

New Jersey has a population of around 8.9 million people while Liberia has a population of 4.1 million, a little less than half that. Both were established by foreigners: New Jersey was first established by the Dutch in America, while Liberia was established by Americans in Africa. Both gained their current affiliations within 60 years of one another, with New Jersey entering the Union in 1787 and Liberia gaining independence in 1847. The official language of both New Jersey and Liberia is English, and both governments use the dollar, though one is the US Dollar and the other the Liberian Dollar. In both states people drive on the right side of the road, and the largest religious affiliation in both regions is Christianity.

I know, from this description thus far most people would find it very difficult to tell whether they were in Liberia or New Jersey. Under this reasoning it seems very important for New Jersey to implement stringent rules to keep any chance of Ebola from entering its borders. However, there are some minor differences between New Jersey and Liberia that may bear mentioning.

Overall, the United States is ranked in the top 4 on the United Nations' Human Development Index, while Liberia was ranked the 5th lowest in 2011 out of 187 countries. Within the United States, New Jersey is ranked the third most developed state. But what does this really mean?

The per capita earnings in Liberia were reported at 436 USD while in New Jersey per capita earnings were 54,699 USD. Thus, in a year a New Jersey resident could be expected to earn more than 125 Liberians combined. But does this rather noticeable difference in earnings translate into differences in medical services?

According to the US census there were about 311 doctors per 100,000 residents of New Jersey in 2006, translating to about 27,680 medical doctors. Liberia, on the other hand, reported having 51 doctors in the entire country in 2006. If doctors were distributed evenly throughout both states, you would find one medical doctor within every 0.31 square miles in New Jersey, while in Liberia you would find only one medical doctor every 811 square miles.

New Jersey residents can expect to live 80.3 years on average and struggle with different health-related issues than those faced by Liberians, who can only expect to live 57.4 years. In New Jersey, over-consumption presents a major concern: 60.7% of the population is overweight or obese, and the chief causes of death, including heart disease and diabetes, are related to over-consumption. In contrast, 38.5% of Liberians suffer from malnutrition, with deaths caused by easily treatable diseases such as malaria, pneumonia, and diarrhea, which are collectively responsible for 50% of the deaths in the country.

Overall, New Jersey is vastly wealthier in terms of both income and medical expertise, and it is vastly better prepared for Ebola than Liberia. Within Africa there have been over twenty outbreaks of Ebola, the majority of which were rapidly contained despite most of the affected countries being no better off than Liberia. Ebola is a reasonably well understood disease which can be rapidly controlled when appropriate steps are taken, even with infrastructure far worse than that available in New Jersey.

Despite the numerous similarities between New Jersey and Liberia, we should expect any outbreak of Ebola in New Jersey to be rapidly contained. It is unlikely that this outbreak of Ebola in Liberia will result in any significant outbreak in New Jersey or anywhere else in the United States, where even our poorest areas are much better equipped than any of the countries now suffering from the outbreak.

This post thus far has been meant as a jab at the hysteria and political maneuvering that has surrounded Ebola. However, fear of this disease is entirely appropriate, if not well placed. With its high mortality rate (~70%) and rapid transmission within these impoverished nations (Sierra Leone, Liberia, and Guinea), this disease has the potential, if not stopped, to cause as much loss of life and suffering as the worst communicable diseases currently plaguing humanity, such as malaria (627,000 annual deaths) or tuberculosis (1,460,000 annual deaths).

I believe it is still possible to turn the tide of Ebola if sufficient international aid is brought to bear in the afflicted nations. Becoming distracted with enacting useless policies which impede the ability of health care workers to travel between wealthy developed nations and West African nations not only misses the point, but actively undermines the ability of philanthropic health care workers to control this disease.

Thursday, October 9, 2014

Waterfall and 3D plotting exploration

Taking the very cool 'waterfall graph' code posted by Robert Grant, I have added some features (resistance to distributions with sparse data in some areas) as well as the ability to heat-map the bivariate distribution based on a third variable z. Find the code on github.

Overall I find the graphs produced from this code to be beautiful and fascinating, though I am not sure I would really use them as a form of data exploration. In addition, I am not sure I would expect anybody else to be able to understand what I am communicating. But let us first see some graphs associated with different distributions before we jump to any conclusions.
Figure 1: Plotting data points.
The primary motivation for a density graph like this is Figure 1. We can see there is something going on between variables x and y, but it is really hard to see what it is. I could make the dots smaller (currently cex=.2) or more transparent (currently 10% opacity), but there will always be the problem that I have 500,000 points, and plotting them will ultimately result in some loss of information, either in the core where the color becomes solid or near the tails where the color blends into the background.

Figure 2: A standard 3D density graph created with the MASS package.
Of course there is no way of plotting this much information without some kind of loss. Plotting a 3D graph of densities is one way of reducing the information to a more manageable level. The MASS package provides some tools to do this (Figure 2: Reference).
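
As a rough, hedged sketch of how figures like 1 and 2 can be reproduced (using the data-generating process described further below and MASS's kde2d; this is an approximation, not the exact code behind the figures, which is on github):

# Sketch: simulate the data described below, then draw a Figure 1 style
# scatter and a Figure 2 style 3D density surface.
library(MASS)

set.seed(1)
n <- 5 * 10^5
y <- rnorm(n)
x <- 3 * rnorm(n) + y^2

# Figure 1 style: half a million small, mostly transparent points.
plot(x, y, pch = 16, cex = .2, col = rgb(0, 0, 0, .1))

# Figure 2 style: bivariate kernel density estimate rendered as a surface.
dens <- kde2d(x, y, n = 50)
persp(dens$x, dens$y, dens$z, theta = 30, phi = 30,
      col = "lightblue", xlab = "x", ylab = "y", zlab = "density")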

Figure 3: Four different ways of slicing the same data.

Let us keep these previous graphs in mind as we explore Robert Grant's waterfall graph. The bivariate distribution of x and y plotted in the previous graphs can be seen in Figure 3, mapped from four different perspectives (the bottom two perspectives are in reverse order, so the bottom left-hand panel is not the same as the top left). One immediately notices that while there are some similarities, there are also a lot of differences.

Visually the waterfall graph seems to contain more information, with many more lines as well as some use of transparency. The four 3D renderings of the data are much easier to identify as the same information seen from different angles. The waterfall graph, however, seems to be less of a rotation or transposition of the same object. This puzzled me for a while until I realized that the waterfall graph, in its current form, is not communicating the same information as the 3D graph. Notice the top left graph: all of the peaks seem to be of equivalent height in the waterfall graph yet very pointed in the 3D rendering.

Figure 4: Distance between slices is proportional to number of observations.
This is because a new density curve is mapped for each slice. However, slices are based on equal-length cuts of the y range (or the x range). Thus some cuts, such as those near the center, have a lot of observations while those near the ends have very few. In order to adjust for this divergence of information I have included an option that makes the distance between the density curves proportional to the number of observations within each slice (Figure 4). These figures do not have quite the same feel as the last set, probably because they do not look so much like mountains; however, they contain much more information, with the density curves near the ends compressed and those near the center stretched apart, communicating that the extreme values are rare.
Figure 5: Transparency is proportional to sparsity.
Figure 6: Height is proportional to number of observations.
I also wrote an option to allow the sparsely populated slices to be more transparent (Figure 5). In some ways this is a more intuitive graph: conceptually you can think of the opacity of each slice as being filled in by the number of observations. Finally, I wrote a different option to have the figure's height vary by population (Figure 6). Not surprisingly this ended up producing a graph very similar to Figure 2. What was surprising was how completely different this rendering of the data appeared from the other graphs.

Now that we have a few different ways of communicating the same information, let us see which set of graphs gets us closest to understanding the relationship between our two variables of interest. First, note how I generated the data (y ~ N(0,1) and x ~ 3*N(0,1) + y^2). Thus x and y are uncorrelated but still dependent. All of the figures communicate some of this information. I might personally prefer Figure 1 because it sufficiently communicates a significant representation of the data. I do not think adding the 3D graphs contributes much beyond Figure 1, but this is because everything peaks smoothly in the dark area. If there were, say, a dip in density within the dark region, then Figure 1 would not be able to warn us of that dip unless it was severe.

Comparing Figure 1, Figure 2, and Figures 3, 4, 5, and 6, I am not sure I would favor the waterfall graphs (except 6, which hardly counts). While beautiful, they do not communicate clearly to me what they are representing. However, perhaps this is just an issue of developing an internal intuition of what is going on. To that end, let us explore some less complex data relationships.
Figure 7: No relationship, positive correlation, negative correlation, linear dependency.


Figure 7 (TL, TR, BR, BL) shows no correlation (x=N(0,1), y=N(0,1)), positive correlation (x=N(0,1)+.5y), negative correlation (x=N(0,1)-.5y), and perfect collinearity (x=y). Does this help with our intuition? Maybe a little bit.
Figure 8: It is possible to include an additional variable z which is used to select color from multiple RGB specified colors.
As an additional option, I coded into the function the ability to take a third variable which acts as a color variable. In order for this to work properly you must specify a matrix with two or more rows, each with 3 or 4 RGB color values. As the third variable z varies from low to high, the function automatically colors each slice appropriately. I have set z as a function of x and y with noise (Figure 8). Figure 8 varies among three different colors (yellow, teal, darker blue). The top two graphs plot x and y with z as color, while the bottom two instead plot either z and x with y as color or z and y with x as color. Because this is made-up data I do not get much out of it; however, I could imagine someone being able to find meaning in these graphs.
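
The color interpolation itself can be mimicked with base R's colorRamp(), which takes a set of anchor colors and returns a function mapping values in [0, 1] onto interpolated RGB values. This is only an illustrative sketch of the idea, not the internals of my function:

# Sketch: map a third variable z onto colors interpolated between anchors.
anchors <- c("yellow", "turquoise3", "darkblue")  # low, middle (teal-ish), high
ramp    <- colorRamp(anchors)

z        <- rnorm(100)
z_scaled <- (z - min(z)) / (max(z) - min(z))      # rescale z onto [0, 1]
rgb_vals <- ramp(z_scaled)                        # one RGB row per observation
cols     <- rgb(rgb_vals[, 1], rgb_vals[, 2], rgb_vals[, 3], maxColorValue = 255)

plot(seq_along(z), z, pch = 16, cex = 2, col = cols)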

Overall, I would have to say that I am unconvinced that the waterfall graph is an effective substitute for a 3D graph. However, there is no reason to believe that this should be the only criterion for evaluating such a graph! So far we have been acting as if both x and y were random variables.

But often one of the variables is not random, or we are comfortable acting as if it is not random when evaluating the other variable. In addition, we might not care how sparsely populated our slices are; we are mostly concerned with how our distributions change over time. Take income and age. We might not care how many x-year-old people there are, but we may care how the distribution of income lies within each age category.
Figure 9: Log wages against age. Top left: youngest is closest; top right: oldest is closest; bottom left: age against log wages; bottom right: same as top left except height is proportional to observations.
Figure 9 shows us exactly this information. Using IPUMS data (citation below) for 2005, looking just at age and log wage income and excluding zeros and non-reported values, we see the distribution of wage income for different ages, with the youngest in front and the oldest in back in the top left graph. On the top right the age order is reversed. We can see that as people get older, wages tend to increase and tighten into middle age, then plateau before falling a bit and widening once again in later life. The bottom left graph shows the same information except that now wage is the slice and age is the density. It looks like older folks' wages tend to be higher, while younger ones heavily represent the lowest wage-earning categories. The final panel shows the same information as the first panel except that height is now proportional to the number of observations within the slice. We can see that the high representation of middle-aged baby boomers dominates this graph and makes other comparisons difficult.

From this real world example, I therefore think that the waterfall slice graphing framework is very useful. It is not a replacement for the 3D graph but rather an alternative representation of a different feature of the data. If you would like the code to create the graphs used in this article, please check out my github repo. And if you find this post helpful, please leave me some feedback!

IPUMS Citation:

Steven Ruggles, J. Trent Alexander, Katie Genadek, Ronald Goeken, Matthew B. Schroeder, and Matthew Sobek. Integrated Public Use Microdata Series: Version 5.0 [Machine-readable database]. Minneapolis, MN: Minnesota Population Center [producer and distributor], 2010.