Monday, October 27, 2014

Ebola in Liberia could be different than Ebola in New Jersey

The Ebola outbreak in Western Africa has set off panic throughout the world. Thirty-seven countries so far have implemented policies to restrict the international spread of Ebola. In the United States, Governor Chris Christie has introduced additional travel restrictions, implementing a mandatory quarantine of health workers returning from Liberia, Sierra Leone, and Guinea even when no symptoms are present.

Should Chris Christie be concerned about Ebola in New Jersey? In some ways New Jersey is not so different from Liberia, which has already suffered 4,665 deaths according to the CDC.

New Jersey has a population of around 8.9 million people while Liberia has a little less than half that, at 4.1 million. Both were established by foreigners: New Jersey was first established by Dutch people in America, while Liberia was established by Americans in Africa. Both states gained their current affiliation within 60 years of each other, with New Jersey entering the Union in 1787 and Liberia gaining independence in 1847. The official language of both New Jersey and Liberia is English, and both governments use the dollar, though one is the US Dollar and the other is the Liberian Dollar. In both states they drive on the right side of the road, and the largest religious affiliation in both regions is Christianity.

I know, from this description thus far most people would find it very difficult to tell if they were in Liberia or New Jersey. Under this reasoning it seems very important for New Jersey to implement stringent rules to keep out any chance of Ebola entering its borders. However, there are some minor differences between New Jersey and Liberia that may bear mentioning.

Overall the United States is listed in the top 4 on the United Nations' Human Development Index, while Liberia was ranked 5th lowest out of 187 countries in 2011. Within the United States, New Jersey is ranked the third most developed state. But what does this really mean?

Per capita earnings in Liberia were reported at 436 USD, while in New Jersey the figure was 54,699 USD. Thus in a year a New Jersey resident could be expected to earn more than 125 Liberians combined. But does this somewhat noticeable difference in earnings translate into differences in medical services?

According to the US Census there were about 311 doctors per 100,000 residents of New Jersey in 2006, translating to about 27,680 medical doctors. Liberia, on the other hand, reported having 51 doctors in the entire country in 2006. If doctors are distributed evenly throughout both states, then you are likely to find one medical doctor within every 0.31 square miles in New Jersey, while in Liberia you are likely to find only one medical doctor every 811 square miles.
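The comparisons above are simple ratios, so they are easy to check. Here is a small sketch in R that uses only the figures quoted in this post (the variable names are mine, and the doctors-per-100,000 figure for Liberia is derived here rather than taken from a source):

```r
# Figures quoted in the post
income_nj        <- 54699   # per capita earnings, New Jersey (USD)
income_liberia   <- 436     # per capita earnings, Liberia (USD)
pop_nj           <- 8.9e6   # population of New Jersey
pop_liberia      <- 4.1e6   # population of Liberia
docs_per_100k_nj <- 311     # doctors per 100,000 residents, 2006
docs_liberia     <- 51      # doctors in all of Liberia, 2006

# How many Liberian incomes fit into one New Jersey income?
income_ratio <- income_nj / income_liberia

# Number of doctors in New Jersey implied by the rate and the population
docs_nj <- docs_per_100k_nj / 1e5 * pop_nj

# Liberia's doctors per 100,000 residents, for a like-for-like comparison
docs_per_100k_liberia <- docs_liberia / pop_liberia * 1e5

round(c(income_ratio = income_ratio,
        docs_nj = docs_nj,
        docs_per_100k_liberia = docs_per_100k_liberia), 2)
```

On these numbers Liberia has a little over one doctor per 100,000 residents, against 311 in New Jersey.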

New Jersey residents can expect to live 80.3 years on average and struggle with different health issues than those faced by Liberians, who can only expect to live 57.4 years. In New Jersey over-consumption is a major concern: 60.7% of the population is overweight or obese, and the chief causes of death, including heart disease and diabetes, are related to over-consumption. In contrast, 38.5% of Liberians suffer from malnutrition, and easily treatable diseases such as malaria, pneumonia, and diarrhea are collectively responsible for 50% of the deaths in the country.

Overall, New Jersey is vastly wealthier in terms of both income and medical expertise, and is therefore vastly better prepared for Ebola than Liberia. Within Africa there have been over twenty outbreaks of Ebola, the majority of which were rapidly contained even though most of the countries involved were not significantly better off than Liberia. Ebola is a reasonably well understood disease which can be rapidly controlled when appropriate steps are taken, even with infrastructure far weaker than New Jersey's.

Despite the numerous similarities between New Jersey and Liberia, we should expect any outbreak of Ebola in New Jersey to be rapidly contained. It is unlikely that this outbreak of Ebola in Liberia will result in any significant outbreak in New Jersey or anywhere else in the United States, where even our poorest areas are much better equipped than any of the countries now suffering from the outbreak.

This post thus far has been meant as a jab at the hysteria and political maneuvering that has surrounded Ebola. However, the fear of this disease is entirely appropriate, if not well placed. With its high mortality rate (~70%) and rapid transmission within these impoverished nations (Sierra Leone, Liberia, and Guinea), this disease has the potential, if not stopped, to cause as much loss of life and suffering as the worst communicable diseases currently plaguing humanity, such as malaria (627,000 annual deaths) or tuberculosis (1,460,000 annual deaths).

I believe it is still possible to turn the tide of Ebola around if sufficient international aid is brought to bear in these afflicted nations. Becoming distracted with enacting useless policies which harm the ability of health care workers to travel between wealthy developed nations and West African nations not only misses the point, but actively undermines the ability of philanthropic health care workers to control this disease.

Thursday, October 9, 2014

Waterfall and 3D plotting exploration

Taking the very cool 'waterfall graph' code posted by Robert Grant, I have added some features (robustness to distributions with sparse data in some areas) as well as the ability to heat map the bivariate distribution based on a third variable z. Find the code on GitHub.

Overall I find the graphs produced from this code to be beautiful and fascinating though I am not sure if I would really use them as a form of data exploration. In addition, I am not sure if I would expect anybody to be able to understand what I am communicating. But let us first see some graphs associated with different distributions before we jump to any conclusions.
Figure 1: Plotting data points.
The primary motivation for a density graph like this is Figure 1. We can see there is something going on between variables x and y, but it is really hard to see what it is. I could make the dots smaller (currently cex=.2) or more transparent (currently 10%), but there will always be the problem that I have 500,000 points, and plotting them will ultimately result in some loss of information, either in the core where the color becomes solid or near the tails where the color blends into the background.

Figure 2: A standard 3D density graph created with the MASS package.
Of course there is no way of plotting this much information without some loss of information. Plotting a 3D graph of densities is one way of reducing information to a more manageable level. The MASS package provides some tools for doing this (Figure 2).

Figure 3: Four different ways of slicing the same data.

Let us keep these previous graphs in mind as we explore Robert Grant's waterfall graph. The bivariate distribution of x and y plotted on the previous graphs can be seen in Figure 3, mapped from four different perspectives (the bottom two perspectives are in reverse order, so the bottom left hand panel is not the same). One immediately notices that while there are some similarities there are also a lot of differences.

Visibly the waterfall graph seems to have more information, with many more lines as well as some capitalization on transparency. The four 3D renderings of the data are much easier to identify as the same information from different angles. The waterfall graphs, however, seem to be less of a rotation or transposition of the same object. This puzzled me for a while until I realized that the waterfall graph, in its current form, is not communicating the same information as the 3D graph. Look at the top left graph: all of the peaks seem to be equivalent in the waterfall graph yet very pointed in the 3D rendering.

Figure 4: Distance between slices is proportional to number of observations.
This is because a new density curve is mapped for each level of the slice. However, slices are based on equal length cuts from the y range (or the x range). Thus some cuts, such as those near the center, have a lot of observations while those near the ends have very few and so carry very little information. To adjust for this divergence I have included an option to make the distance between the density curves proportional to the number of observations within each slice (Figure 4). These figures do not have quite the same feel as the last set, probably because they do not look so much like mountains. However, they contain much more information: the density curves near the ends are compressed and those near the center are stretched apart, communicating that the extreme values are rare.
Figure 5: Transparency is proportional to sparsity.
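To make the slicing concrete, here is a minimal sketch of the idea in base R. It is not the code from the repository: the number of slices, the variable names, and the spacing rule are assumptions chosen for illustration.

```r
set.seed(1)
n <- 50000
y <- rnorm(n)
x <- 3 * rnorm(n) + y^2

# Cut the range of y into equal-width slices
n_slices <- 10
slice <- cut(y, breaks = n_slices)

# Number of observations falling in each slice
counts <- as.vector(table(slice))

# One density curve of x per slice (skip slices too sparse to estimate)
dens <- lapply(split(x, slice), function(v) {
  if (length(v) >= 2) density(v) else NULL
})

# Vertical position of each curve: the cumulative share of observations,
# so sparse slices bunch together and dense slices spread apart
offset <- cumsum(counts) / sum(counts)

data.frame(slice = levels(slice), counts = counts, offset = round(offset, 3))
```

The same `counts` vector could equally be rescaled to drive transparency or curve height, the two other options discussed in this post.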
Figure 6: Height is proportional to number of observations.
I also wrote an option to allow the sparsely populated slices to be more transparent (Figure 5). In some ways this is a more intuitive graph. Conceptually you can think of the opacity of each slice as being filled in by the number of observations. Finally, I wrote a different option to have the figure's height vary by population (Figure 6). Not surprisingly this ended up producing a graph very similar to Figure 2. What was surprising was how completely different this rendering of the data appeared from the other graphs.

Now that we have a few different ways of communicating the same information, let us see which set of graphs gets us closest to understanding the relationship between our two variables of interest. First let us note how I generated the data (y~N(0,1) and x~3*N(0,1)+y^2). Thus x and y are uncorrelated but still dependent. All of the figures communicate some of this information. I might personally prefer Figure 1 because it gives a sufficient representation of the data on its own. I think adding the 3D graphs does not add significantly to Figure 1. But this is because everything peaks smoothly in the dark area. If, however, there were, say, a dip in density within the dark region, then Figure 1 would not be able to warn us of that dip unless it was severe.
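The claim that x and y are uncorrelated yet dependent can be checked by simulation. Since y is symmetric around zero, Cov(x, y) = E[y^3] = 0, while x clearly depends on y through y^2. A quick sketch (the seed and the sample size are arbitrary choices of mine):

```r
set.seed(42)
n <- 200000
y <- rnorm(n)
x <- 3 * rnorm(n) + y^2

cor_xy  <- cor(x, y)     # close to zero: no linear relationship
cor_xy2 <- cor(x, y^2)   # clearly positive: x depends on y through y^2

c(cor_xy = cor_xy, cor_xy2 = cor_xy2)
```

In theory the second correlation is 2/sqrt(22), or about 0.43, because Var(y^2) = 2 and Var(x) = 9 + 2 = 11.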

Comparing Figure 1, Figure 2, and Figure 3, 4, 5, 6 I am not sure if I would favor the waterfall graphs (except 6 which hardly counts). While beautiful they do not communicate to me clearly what they are representing. However, perhaps this is just an issue of developing an internal intuition of what is going on. To that end, let us explore some less complex data relationships.
Figure 7: No relationship, positive correlation, negative correlation, linear dependency.

Figure 7 (TL, TR, BR, BL) shows no correlation (x=N(0,1), y=N(0,1)), positive correlation (x=N(0,1)+.5y), negative correlation (x=N(0,1)-.5y), and perfect collinearity (x=y). Does this help with our intuition? Maybe a little bit.
Figure 8: It is possible to include an additional variable z which is used to select color from multiple RGB specified colors.
As an additional option, I coded into the function the ability to take a third variable which acts as a color variable. For this to work properly you must specify a matrix with two or more rows, each holding the 3 or 4 RGB values of one color. As the third variable z varies from low to high, the function will automatically color the slice appropriately. I have set z as a function of x and y with noise (Figure 8). Figure 8 varies between three different colors (yellow, teal, darker blue). The top two graphs plot x and y with z as color, while the bottom two instead plot either z and x with y as color or z and y with x as color. Because this is made up data I do not get much out of it. However, I could imagine someone being able to find meaning in these graphs.
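One way such a color lookup could work is sketched below using `colorRamp()` from base R's grDevices. This is an illustration under my own assumptions, not the implementation in the repository: the color matrix, the per-slice z values, and the rescaling step are all made up for the example.

```r
# Hypothetical color matrix: one row per color, columns are R, G, B in [0, 1]
cols <- rbind(c(1.0, 1.0, 0.0),   # yellow
              c(0.0, 0.5, 0.5),   # teal
              c(0.0, 0.0, 0.5))   # darker blue

# colorRamp() returns a function mapping [0, 1] onto the color sequence
ramp <- colorRamp(rgb(cols[, 1], cols[, 2], cols[, 3]))

# Made-up example: the mean of z within each of five slices
z_by_slice <- c(-2.1, -0.4, 0.3, 1.0, 2.5)

# Rescale z to [0, 1], then look up one color per slice
z01 <- (z_by_slice - min(z_by_slice)) / diff(range(z_by_slice))
slice_cols <- rgb(round(ramp(z01)), maxColorValue = 255)
slice_cols
```

The slice with the lowest z gets the first color in the matrix and the slice with the highest z gets the last, with the slices in between interpolated linearly.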

Overall, I would have to say that I am unconvinced that the waterfall graph is an effective substitute for a 3D graph. However, there is no reason to believe that this should be the only criterion for judging such a graph! So far we have been acting as if both x and y were random variables.

But often one of the variables is not random, or we are comfortable acting as if it is not random when evaluating the other variable. In addition, we might not care how sparsely populated our slices are; we are mostly concerned with how our distributions change over time. Take income information and age. We might not care how many x year old people there are. But we may care how the distribution of income lies within each age category.
Figure 9: Log wages against age. Top left youngest is closest, top right oldest is closest, bottom left age against log wages, bottom right is same as top left except height is proportional to observations.
Figure 9 shows us exactly this information. Using IPUMS data (citation below) for 2005, looking just at age and log wage income and excluding zeros and non-reported values, we see the distribution of wage income for different ages, with the youngest in front and the oldest in back on the top left graph. On the top right the age order is reversed. We can see that as people get older wages tend to increase and tighten into middle age, then plateau before falling a bit and widening once again in older life. The bottom left graph is the same information except now wage is the slice and age is the density. It looks like older folks tend to have higher wages, while younger ones are heavily represented in the lowest wage earning categories. The final panel shows us the same information as the first panel, except that now height is proportional to density within the slice. We can see that the high representation of middle-aged baby boomers dominates this graph and makes other comparisons difficult.

From this real world example, I therefore think that the waterfall slice graphing framework is very useful. It is not a replacement for the 3D graph but rather an alternative representation of a different feature of the data. If you would like to find the code to create the graphs used in this article, please check out my GitHub repo. And if you find this post helpful, please leave me some feedback!

IPUMS Citation:

Steven Ruggles, J. Trent Alexander, Katie Genadek, Ronald Goeken, Matthew B. Schroeder, and Matthew Sobek. Integrated Public Use Microdata Series: Version 5.0 [Machine-readable database]. Minneapolis, MN: Minnesota Population Center [producer and distributor], 2010.

Wednesday, October 8, 2014

Julia style string literal interpolation in R

I feel like a sculptor who has been using the same metal tools for the last four years and happened to look over at my comrades and find them sporting new, sleek electric tools. Suddenly all of the hard work put into maintaining and adapting my metal tools ends up looking like duct tape and bubble gum patches.

I hate to say it but I feel that I have become somewhat infatuated with Julia. And infatuation is the right word. I have not yet committed the time to fully immerse myself in the language, yet everything I know about it makes me want to learn more. The language is well known for its mind-blowing speed, accomplished through just-in-time compiling. It also has many features which enhance the efficiency and readability of its code (see previous post, and note that the documentation has greatly improved since that post).

However, though I very much want to, I cannot entirely switch my coding needs from R into Julia. This is primarily due to my ongoing usage of packages such as RStudio's "Shiny" and the University of Cambridge's server side software for building adaptive tests, "Concerto". And so with regret I will resign my Julia coding to probably a minor portion of my programming needs.

That does not mean, however, that I can't make some small changes to make R work more like Julia. To this end I have programmed a small function p which takes strings such as "Hello #(name), how are you?" and replaces each #(...) expression with its evaluated value. If there are nested parentheses then it is necessary to close the expression with ")#", for example "c=#(b^(1+a))#".

# Julia-like text concatenation function.
p <- function(..., sep="", esc="#") {
  # Change escape characters by specifying esc.
  # Break the input values into different strings cut at '#('
  x <- paste(..., sep=sep)
  x <- unlist(strsplit(x, paste0(esc,"("), fixed = TRUE))
  # The first element is never evaluated.
  out <- x[1]
  # Check if x has been split.
  if (length(x)>1) for (i in 2:length(x)) {
    # Prefer the explicit ')#' terminator; otherwise cut at the first ')'.
    y <- unlist(strsplit(x[i], paste0(")",esc), fixed = TRUE))
    if (x[i]==y[1])
      y <- unlist(regmatches(x[i], regexpr(")", x[i], fixed = TRUE),
                             invert = TRUE))
    # Evaluate in the caller's environment so the caller's variables are found.
    out <- paste0(out, eval(parse(text=y[1]), envir = parent.frame()), y[-1])
  }
  out
}

# Let's see it in action
name <- "Bob"
height <- 72
weight <- 230
p(sep=" ", "Hello #(name).",
  "My record indicates you are #(height) inches tall and weigh #(weight) pounds.",
  "Your body mass index is #(round(703*weight/height^2,1))#")
# [1] "Hello Bob. My record indicates you are 72 inches tall and weigh 230 pounds.
# Your body mass index is 31.2"

# The other nice thing about the p function is that it can be used to concat
# strings as a shortcut for paste0.
p("QRS", "TUV")
# [1] "QRSTUV"

Thank you SO community for your help.

Monday, October 6, 2014

Julia: The "Distributions" Package

This is a follow up to my post from a few days ago exploring random number generation in Julia's base system.  In this post I will explore the 'distributions' package.

You can find the excellent documentation on the "Distributions" package at:

# First let's set the current directory
cd("C:/Dropbox/Econometrics by Simulation/2014-10-October/")

# This post uses the following packages
using Distributions
using Gadfly

# I have got to say that I love the way Julia handles distributions
# as I discovered through this post.

# The Distributions package gives tremendous power to the user by
# providing a common framework to apply various functions.

# For instance let's say you want to draw 10 draws from a Binomial(n=10, p=.25) distribution

rand(Binomial(10, .25), 1, 10)
#  4  3  0  5  1  3  5  2  2  1

# Looks pretty standard right? Well, what if we want the mean?

mean(Binomial(10, .25))
# 2.5

# mode, skewness, kurtosis, median?

a = Binomial(10, .25)
println("mode:", mode(a), " skewness:", skewness(a),
        " kurtosis:", kurtosis(a), " median:", median(a))
# mode:3 skewness:0.3651483716701107 kurtosis:-0.06666666666666667 median:3

# Cool huh?

# Let's see how we can use Gadfly to easily plot some distributions:

# First we generate the plot (assign it to a 'p' for later drawing to disk)
#  In order to plot the different CDFs I will use an 'anonymous' function defined:
#  argument -> f(argument)
p=plot([x -> cdf(Normal(5,5^.5), x),
      x -> cdf(Gamma(3,1), x),
      x -> cdf(Exponential(2), x)]
      , 0, 10)

# Write the graph to disk
draw(PNG("2014-10-06CDF.png", 24cm, 12cm), p)

g = plot([x -> pdf(Normal(5,2), x),
      x -> pdf(Gamma(3,1), x),
      x -> pdf(Exponential(2), x)]
      , 0, 10)

draw(PNG("2014-10-06PDF.png", 24cm, 12cm), g)

Friday, October 3, 2014

Ebola: Beds, Labs, and Warnings? Can they help? (Shiny App)

A month ago, when the WHO was projecting that the current outbreak of Ebola could affect as many as 20,000 people, I ran some elementary modelling and found that these estimates were far too small given the trend at the time. The motivation for the post was to raise awareness that the situation could get far worse than anybody was talking about. Since then, most of my 'back of the envelope' estimates have ended up being disturbingly close to the reports the World Health Organization has been releasing.

Which frankly is extremely scary. Currently I am living in Mozambique in Southern Africa and though Mozambique is slightly more developed than Liberia, I have no reason to believe that things would be any different here than in Western Africa if an outbreak went undetected for a month as it did in Western Africa.

This among other things has made me wonder how inevitable such an outcome is. Should everybody who can leave Africa and find a nice little bunker to hide in until this whole thing passes? Well probably not but only if Ebola can be stopped.

Currently the world seems to be responding to the crisis in the affected countries in three major ways: 1. providing beds, 2. providing laboratory capabilities to diagnose Ebola, and 3. providing advertisements to increase awareness. But how can we know how effective these measures can be against such a seemingly unstoppable force?

Time to break out our models!

Unfortunately there is no really easy way to model this. However, by modifying the standard epidemiological SIR (susceptible, infected, recovered) model to include some additional parameters, I am able to create a model which looks to be functioning the way we would like it to. To see details of the model's construction, see the technical appendix.


The primary new parameter under consideration is 'beds', which represents the number of beds available as well as the food and supplies necessary to feed the people occupying them. Infected individuals, once detected, are transferred to quarantine if beds are available. If beds are not available then infected individuals remain contagious until they recover or die.

Social adaption
From a paper by Fisman, Khoo, and Tuite I have incorporated the idea of social adaption to the epidemic. This captures the concept that the infection could be naturally controlled to some extent by changes in the behavior of the susceptible population and that of the contagious population.

The Model
It becomes immediately clear that the model is extremely sensitive to just about every parameter included. If the infection rate is too high then everybody gets sick. If the rate is too low then the epidemic is quickly contained. However, for this exercise let us assume we cannot directly control in any way infection rates but we can choose how many beds, how effective we are at detecting new cases, and we have some influence on how people respond over time to Ebola by taking safety precautions such as not touching the sick or dying.

Figure 1: Base Model After 9 Months

Each of these interventions can have a significant effect on the outbreak. When looked at carefully, these interventions turn into two different strategies: 1. quarantining the infected by providing beds and provisions, and 2. increasing public awareness to reduce the probability of spread over time.

Providing Beds
The effect of a significant investment in beds (500 new beds) after seven months can abruptly turn around the spread of Ebola as the contagious population is rapidly shifted from free and dangerous to safely quarantined (assuming an effective mechanism exists for detecting those who are ill).

Figure 2: Base model after seven month beds intervention.
Changing Behavior
I have not included behavior change in the model quite as dramatically. Instead I have specified social behavior changes as a cumulative effect over time. In the base model individuals adapt to the disease by being .03% less likely each day to contract the disease. This is not much, though it does accumulate significantly over time. After six months of the epidemic individuals would be about 5% less likely to get Ebola when exposed to an individual with Ebola.

If we are able to increase awareness about prevention of contraction of the disease to say .06% increase per day then individuals are about 10% less likely to contract Ebola after six months. Though these numbers are not large the effect can be profound on our model.
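The arithmetic behind these percentages can be sketched as follows. I am assuming here that the daily reduction compounds multiplicatively and that six months is 180 days; a simple additive rule gives nearly the same answer at rates this small.

```r
# Share by which infection risk has fallen after `days` days,
# if the risk shrinks by `daily_rate` (a proportion) every day
risk_reduction <- function(daily_rate, days = 180) {
  1 - (1 - daily_rate)^days
}

# Base model (.03% per day) versus increased awareness (.06% per day)
round(100 * risk_reduction(c(base = 0.0003, more_awareness = 0.0006)), 1)
```

Under these assumptions the reductions come out at a little over 5% and a little over 10%, in line with the figures quoted above.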

Figure 3: Behavioral adaption can dramatically reduce the lifespan of the outbreak.

However, the significant problem with including social adaption in this way is that it is based on accumulated actions over time. If this is the case then Ebola should already be on its way out.

Sensitivity of the Model - the shiny app
As mentioned previously this model is extremely sensitive to parameter choices. It is therefore more of an illustrative tool than actually meant to exactly represent the situation in Western Africa. As a tool we can see under the right circumstances that beds and public information can have a dramatic effect on the spread of Ebola. However, don't take my word for it! Check out the app below and play around with the model yourself.

Technical Appendix
alpha is detection rate.
delta is transition rate to recovery or death.
mu is mortality rate.

State equations:
Change in susceptible population:
$$\dot S = -\frac{\gamma S_R S_t C_t}{S_t C_t}$$

Change in contagious population:
$$\dot C = -\dot S-\min[\alpha C_t , \max(beds-Q_t(1-\delta),0)]-\delta C_t$$

Change in the quarantined population:
$$\dot Q = \min[\alpha C_t , \max(beds-Q_t(1-\delta),0)] - \delta Q_t$$

Change in the recovered population:
$$\dot R = (1-\mu) \delta (Q_t+C_t)$$

Change in the deceased population:
$$\dot D = \mu \delta (Q_t+C_t)$$
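To show how the quarantine cap in these equations behaves, here is a minimal sketch of a single discrete time step for the C, Q, R, and D equations. It is not the code behind the app: the function name, the Euler update, and the example values are my own assumptions, and the number of new infections (the -dS/dt term) is passed in by the caller rather than computed.

```r
# One Euler step of the C, Q, R, D equations.
# new_infections stands in for -dS/dt and is supplied by the caller.
step_state <- function(state, new_infections, alpha, delta, mu, beds, dt = 1) {
  C <- state[["C"]]
  Q <- state[["Q"]]

  # Flow into quarantine: detected cases, capped by the beds that are
  # free once departures from quarantine are accounted for
  to_quarantine <- min(alpha * C, max(beds - Q * (1 - delta), 0))

  dC <- new_infections - to_quarantine - delta * C
  dQ <- to_quarantine - delta * Q
  dR <- (1 - mu) * delta * (Q + C)
  dD <- mu * delta * (Q + C)

  keys <- c("C", "Q", "R", "D")
  state[keys] <- state[keys] + dt * c(dC, dQ, dR, dD)
  state
}

# Illustrative values only, not the parameters used in the app
state <- c(C = 100, Q = 20, R = 0, D = 0)
new_state <- step_state(state, new_infections = 15,
                        alpha = 0.1, delta = 0.05, mu = 0.7, beds = 25)
new_state
```

In this example detection alone would move 10 people into quarantine, but only 6 beds are free, so the cap binds and quarantine fills to exactly 25. The four compartments together grow by exactly the number of new infections, which is a useful sanity check on the equations.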

R Code
The R code used to produce this app can be found on Github. If you prefer running the app from your computer, you can download server.R and ui.R and run the package from your own