Friday, March 13, 2015

Open Data - Stack Exchange

As a fan of Stack Exchange I am very excited about the new exchange being made available. This exchange is called 'Open Data' and allows users to post questions on where to find publicly available data on a topic and other users will respond with one or more answers.

Open Data - Stack Exchange

Friday, February 6, 2015

Calculating the likelihood of item overlap with Base R Assessment

I have recently received an email from someone who had taken my Base R Assessment. In this email, the test taker reported that a large portion of the test items taken were duplicates (around 50%) when he took the test the second time.

I began wondering what the likelihood of this occurring would be.

The test selects between 10-20 items each time the test is taken.

These items are selected randomly.

There is around 80 items in total available.

First let us do some back of the envelope calculations.

1. Lets say the average number of test items is 15. Thus the average quantity of the test items taken is 18.75% for the first test.
2. If the same person were to take the test again then we would expect around 18.75% * 15 = 2.8 items to be duplicates.

We might want to know instead what the change of say 50% of the items administered in the second test being duplicates.

In order to do this calculation we would need to basically come up with the a list of all of the contamination of items which result in 50% overlap.

For instance, let us say that we have a ten item exam. The items that overlap are 1:5 and the items that do not are 6:10.

The probability of this outcome occurring is {first the overlap}*{those that do not overlap} : 1/80 * 1/79 * 1/78 * 1/77 * 1/76 * 74/75 * 73/74 * 72/73 * 71/72 * 70/71

Which results in a really small number. Then we would calculate the probability of the first 4 matching and last items matching: 1/80 * 1/79 * 1/78 * 1/77 * 75/76 * 74/75 * 73/74 * 72/73 * 71/72 * 1/71

Next the first three and last two and so on. You keep doing this with every possible combination. This process becomes rather tedious and time consuming pretty quickly.

However, if you are willing to undergo a little approximation bias then a much easier process is to put together a little simulation.

Let us simulate instead the process x number of times, say 100,000.

Nsim    <- 10^5
overlap <- rep(NA,Nsim)
testMin <- 10
testAdd <- 10
 
for (i in 1:Nsim) {
  testL1 <- testMin + sample(0:testAdd, 1)
  testL2 <- testMin + sample(0:testAdd, 1)
 
  first  <- sample(80, testL1)
  second <- sample(80, testL2)
 
  overlap[i] <- mean(second %in% first)
}
 
# Generate a density curve of overlap. The mode is around .2 which is just around 
# where we expected the average to be.
plot(density(overlap)) 
 
  
mean(overlap)
# [1] 0.1877303
 
# Now we can use a built in function 'ecdf' (empirical cdf) to process our 
# frequencies.
OverCDF <- ecdf(overlap)
 
# The ecdf function is pretty great! It can turn observations into probability data.
 
# We can plot our cdf curve. It is a step function because there is only a finite 
# number of possible overlap percentages possible.
plot(OverCDF, verticals = TRUE, do.points = FALSE) 
 
  
# We can also use it to calculate the likelihood of a certain level of overlap or 
# less.
 
# Say what is the likelihood of 25% or less of the items being repeated?
OverCDF(.25)
# [1] 0.76224 This is pretty high value.
 
# What about 18.75% or less of the items being repeated?
OverCDF(.1875)
# [1] 0.53936] Which is pretty close to
 
# Now we can ask the question, what is the likelihood that the user had 50% or 
# more overlap between exams.
1-OverCDF(.5)
 
#[1] 0.00212 This is a really small number and looking at it alone we might think 
# this person must have been exagerating.
 
# However, a lot of people have taken the exam (around 1,250 independent tries).
 
# Let's say each person attempts twice which gives us 625 attempts. What is the 
# probability that during one of these events someone recieved 50% or more of the 
# same items?
1-(OverCDF(.5))^650
 
# [1] 0.7482862 Which gives us a surprisingly large number.
 
# However, saying everybody took the test twice is pretty unlikely. Let's instead 
# say 100 people took the test twice.
1-(OverCDF(.5))^100
 
# [1] 0.1912173 Which is a very respectable odds. These numbers indicate that if 
# 100 people took the test twice there is a 20% chance that one person would end 
# up seeing on the second test 50% of the same items.
 
# Conversely we might want to ask the likelihood that someone would not see any 
# of the same items.
OverCDF(0)
 
# Giving about 5%.
Created by Pretty R at inside-R.org

Tuesday, February 3, 2015

Base R Assessment!

Test your skills with this R-powered R assessment of base R knowledge!

Built using the R powered adaptive testing platform Concerto, this assessment provides a short but powerful tool at evaluating your base R understanding relative to that of your peers.

Currently the assessment has over seventy items (questions) in the pool while each individual takes less than twenty of these selected randomly. So each test is unique.

Those who score well enough will be given the chance to contribute their own items to challenge other users.
 
http://concerto4.e-psychometrics.com/?wid=13&tid=14
http://concerto4.e-psychometrics.com/?wid=13&tid=14

Wednesday, November 12, 2014

Convergence of a Series

Let us explore using simulation some of the concepts of basic asymptotic theory as presented in Wooldridge 2012, Chapter 3.

Definition: A sequence of nonrandom numbers {a_N:N=1,2,...} converges to a if for all epsilon>0 there exists N_epsilon such that N>N_epsilon, then
$$|a_N - a|<\epsilon$$

Paraphrase: A sequence converges on point a, if you can choose any positive number (any epsilon greater than zero) and you will find some point in the series after which the difference between the series and the point it converges to a will always be smaller than epsilon.

Let us try to make this more concrete.

Imagine the series:

  $$f(N)=N^2$$
 
Does this series converge on a value a?

Let us pick an epsilon, say 100.
## R ---------------------------
f = function(x) x^2
N = 1:30
plot(N, f(N), type='l', col='lightblue', lwd=4)
## -----------------------------
We can see that there is no a which could exist for which our series seems to converge within the positive bound epsilon=100.

This concergence might have been obvious. Let us try a different series.
  $$g(N)=10x^{-99/100}sin(x)+\pi$$
## R ---------------------------
g = function(x) 10*x^(-99/100) * sin(x) + pi
N = 1:100
plot(N, g(N), type='l', lwd=2)
## -----------------------------

The series g demonstrates a common convergence pattern. That is you can alway pick a epsilon after which the difference between N and the convergence point is less then epsilon. In this case the series converges on pi. Let us play around with a few epsilons to see if our hypothesis that the series converges seems to hold true.

## R ---------------------------
e1 = 5
abline(h=pi+e1, lwd=2, col='darkred')
abline(h=pi-e1, lwd=2, col='darkred')
## -----------------------------
We can see that at N<10 there are some points which are not bounded by |N-pi|<e1. However by the time we get to N>20, all points are easily bound (at least those we have plotted).

How about a harder epsilon?
## R ---------------------------
e2 = .5
abline(h=pi+e2, lwd=2, col='darkblue')
abline(h=pi-e2, lwd=2, col='darkblue')
## -----------------------------

This epsilon is outside our bounds at 20 but by the time we have reached 40 the series does not reach outside our bounds.

## R ---------------------------
e3 = .1
abline(h=pi+e3, lwd=2, col=rgb(.3,.6,1,.5))
abline(h=pi-e3, lwd=2, col=rgb(.3,.6,1,.5))
## -----------------------------
We can see that with e3=.1 there is no point on the graph we have plotted which is bounded by our epsilon. However plotting the graph at a longer series time frame will give us a number N after which g is bounded.

## R ---------------------------
N = 100:200
plot(N, g(N), type='l', lwd=2)
e3 = .1
abline(h=pi+e3, lwd=2, col=rgb(.3,.6,1,.5))
abline(h=pi-e3, lwd=2, col=rgb(.3,.6,1,.5))
N= 1:30
plot(N, f(N), type='l', col='paste(readLines(paste0('http://web.archive.org/web/',
              fdate, '/http://www.npr.org/'), warn="F"), collapse='');lightblue', lwd=4)
## -----------------------------

We can see that the graph clearly shows that at our small value of epsilon (.1) we are bounded around pi as early as N=110. For all other even smaller epsilons there will always be an N for which
this is true (even if our computers do not have sufficient accuracy to show this to be the case).

However, since we are working with an infinite series, this also means that there is an infinite area of the graph for which we can never plot. We therefore cannot use graphs to prove that series converge. In theory a series can always break the bounds just beyond the area of inspection. The switch function
$$f(N) = 1[N>100]$$
is an extreme example in which the series seems to converge at 0 up to N<100 but at N=100 or greater the series now converges on 1.

Other series converge without oscillating:
$$f(N)=a+b/N$$

Some more challenging series which also converge include:
$$f(N)=bN/exp(N)$$
$$f(N)=N^{1/N}$$

Monday, November 3, 2014

Make your own hotly criticised circle graph!!!

Make your own hotly criticized circle graph! The recent critical post on R-bloggers has got me wanting to make my own bubble graphs despite the popular criticism of the graph. After some attempts I have produced something that looks similar but needs some more work.

I would like to say briefly that R-blogger post by while interesting and reasonable seemed to miss the point. The graphic was not supposed to convey massive amounts of information but simply to make obvious the scope and degree of how the current Ebola epidemic compares with other epidemics.

There are many good reasons to critique this graph from an epidemiological perspective (forthcoming), yet from a graphing perspective I think  this graph format is extremely efficient at conveying memorable information while the graphs assembled by PirateGrunt were more detailed they were also entirely forgettable unfortunately.

Here are a couple of examples. The first is with meaningless simulated data demonstrating how large differences in scale may appear. The second is actually using HIV prevalence data to give a graphical representation of how such rates have changed over the years. You can fine the code as well as the data on github.

Table 1: Demonstrates how scale can be communicated by size of circle


Table 2: Demonstrates the relative stability of AIDS in terms of people ill. Top number is  year while lower number is millions of cases.

Monday, October 27, 2014

Ebola in Liberia could be different than Ebola in New Jersey

http://www.pixton.com/uk/comic/8wlmhhc2

The Ebola outbreak in Western Africa has initiated panic throughout the world. Thirty-seven countries so far have implemented policies to restrict the international spread of Ebola. In the United States, governor Chris Christie has initiated additional travel restrictions implementing a mandatory quarantine of health workers returning from Liberia, Sierra Leone, and Guinea even when no symptoms are present.

Should Chris Christie be concerned about Ebola in New Jersey? In some ways New Jersey is not so different from Liberia which has already suffered 4,665 deaths according to the CDC.

New Jersey has a population of around 8.9 million people while Liberia has a population a little less than half that of 4.1 million. Both countries were established by foreigners. New Jersey was is first established by Dutch people in America while Liberia was established by Americans in Africa. Both states gained their current affiliation within 60 years ago with New Jersey entering the Union in 1787 and Liberia gaining independence in 1847. The official language of both New Jersey and Liberia is English and both governments use the dollar though one is the US Dollar and the other is the Liberian Dollar. In both states they drive on the right side of the road and the largest religious affiliation in both regions is Christianity.

I know, from this description thus far most people would find it very difficult to tell if they were in Liberia or New Jersey. Under this reasoning it seems very important for New Jersey to implement stringent rules to keep out any chance of Ebola entering its borders. However, there are some minor differences between New Jersey and Liberia that may bear mentioning.

Overall the United States is listed top 4 on the United Nation's Human Development Index while Liberia is ranked the 5th lowest in 2011 out 187 countries. Within the United States New Jersey is ranked the third most developed states. But what does this really mean?

The per capita earnings in Liberia was reported at 436 USD while in New Jersey the per capita was 54,699 USD. Thus in a year a New Jersey resident could be expected to earn more than 125 Liberians. But does this somewhat noticeable difference in earnings translate into differences in medical services?

According to the US census there was about 311 doctors per 100,000 residents of New Jersey in 2006 translating to about 27,680 medical doctors. Liberia on the other hand reported having 51 doctors in the entire country in 2006. If doctors are distributed evenly throughout both states then you are likely to find one medical doctor within every 0.31 square miles in New Jersey while in Liberia you are likely to find only one medical doctor every 811 square miles.

New Jersey residents can expect to live 80.3 years on average and struggle with different health related issues than those faced by Liberians who can only expect to live 57.4 years. In New Jersey where over-consumption presents a major concern with 60.7% of the population is overweight or obese and the chief causes of death is related to over-consumption including heart disease and diabetes. In contrast 38.5% of Liberians suffer from malnutrition with deaths caused from easily treatable diseases such as malaria, pneumonia, and diarrhea which are collectively responsible for 50% of the deaths in the country.

Overall, New Jersey is vastly more wealthy in terms of both income as well as medial expertise, New Jersey is vastly better prepared for Ebola than Liberia. Within Africa there have been over twenty outbreaks of Ebola, the majority of which have been rapidly contained despite most countries not being significantly better off than Liberia. Ebola is a reasonably well understood disease which can be rapidly controlled when appropriate steps are taken using infrastructure much worse off than that faced in New Jersey.

Despite the numerous similarities between New Jersey and Liberia, we should expect any outbreak of Ebola in New Jersey to be rapidly contained. It is unlikely that this outbreak of Ebola in Liberia will result in any significant outbreak in New Jersey or anywhere in the United States in which even our poorest areas are much better equipped than any of the countries now suffering from the outbreak.

This post thus far has been meant as a jab at the hysteria and political maneuvering that has surrounded Ebola. However, the fear of this disease is entirely appropriate if not well placed. This disease with its high mortality rate (~70%) and rapid transmission within these impoverished nations (Sierra Leone, Liberia, and Guinea) has the potential if not stopped to cause as much loss of life and suffering as the worse currently communicable diseases which plague humanity such as malaria (627,000 annual deaths) or tuberculosis (1,460,000 annual deaths).

I believe it is still possible to turn the tide of Ebola around if sufficient international aid is brought to bear in these afflicted nations. Becoming distracted with enacting useless policies which harm the ability of health care workers to travel between wealthy developed nations and West African nations not only misses the point, but actively undermines the ability of philanthropic health care workers to control this disease.