Wednesday, November 12, 2014

Convergence of a Series

Let us explore using simulation some of the concepts of basic asymptotic theory as presented in Wooldridge 2012, Chapter 3.

Definition: A sequence of nonrandom numbers {a_N:N=1,2,...} converges to a if for all epsilon>0 there exists N_epsilon such that N>N_epsilon, then
$$|a_N - a|<\epsilon$$

Paraphrase: A sequence converges on point a, if you can choose any positive number (any epsilon greater than zero) and you will find some point in the series after which the difference between the series and the point it converges to a will always be smaller than epsilon.

Let us try to make this more concrete.

Imagine the series:

  $$f(N)=N^2$$
 
Does this series converge on a value a?

Let us pick an epsilon, say 100.
## R ---------------------------
f = function(x) x^2
N = 1:30
plot(N, f(N), type='l', col='lightblue', lwd=4)
## -----------------------------
We can see that there is no a which could exist for which our series seems to converge within the positive bound epsilon=100.

This concergence might have been obvious. Let us try a different series.
  $$g(N)=10x^{-99/100}sin(x)+\pi$$
## R ---------------------------
g = function(x) 10*x^(-99/100) * sin(x) + pi
N = 1:100
plot(N, g(N), type='l', lwd=2)
## -----------------------------

The series g demonstrates a common convergence pattern. That is you can alway pick a epsilon after which the difference between N and the convergence point is less then epsilon. In this case the series converges on pi. Let us play around with a few epsilons to see if our hypothesis that the series converges seems to hold true.

## R ---------------------------
e1 = 5
abline(h=pi+e1, lwd=2, col='darkred')
abline(h=pi-e1, lwd=2, col='darkred')
## -----------------------------
We can see that at N<10 there are some points which are not bounded by |N-pi|<e1. However by the time we get to N>20, all points are easily bound (at least those we have plotted).

How about a harder epsilon?
## R ---------------------------
e2 = .5
abline(h=pi+e2, lwd=2, col='darkblue')
abline(h=pi-e2, lwd=2, col='darkblue')
## -----------------------------

This epsilon is outside our bounds at 20 but by the time we have reached 40 the series does not reach outside our bounds.

## R ---------------------------
e3 = .1
abline(h=pi+e3, lwd=2, col=rgb(.3,.6,1,.5))
abline(h=pi-e3, lwd=2, col=rgb(.3,.6,1,.5))
## -----------------------------
We can see that with e3=.1 there is no point on the graph we have plotted which is bounded by our epsilon. However plotting the graph at a longer series time frame will give us a number N after which g is bounded.

## R ---------------------------
N = 100:200
plot(N, g(N), type='l', lwd=2)
e3 = .1
abline(h=pi+e3, lwd=2, col=rgb(.3,.6,1,.5))
abline(h=pi-e3, lwd=2, col=rgb(.3,.6,1,.5))
N= 1:30
plot(N, f(N), type='l', col='paste(readLines(paste0('http://web.archive.org/web/',
              fdate, '/http://www.npr.org/'), warn="F"), collapse='');lightblue', lwd=4)
## -----------------------------

We can see that the graph clearly shows that at our small value of epsilon (.1) we are bounded around pi as early as N=110. For all other even smaller epsilons there will always be an N for which
this is true (even if our computers do not have sufficient accuracy to show this to be the case).

However, since we are working with an infinite series, this also means that there is an infinite area of the graph for which we can never plot. We therefore cannot use graphs to prove that series converge. In theory a series can always break the bounds just beyond the area of inspection. The switch function
$$f(N) = 1[N>100]$$
is an extreme example in which the series seems to converge at 0 up to N<100 but at N=100 or greater the series now converges on 1.

Other series converge without oscillating:
$$f(N)=a+b/N$$

Some more challenging series which also converge include:
$$f(N)=bN/exp(N)$$
$$f(N)=N^{1/N}$$

Monday, November 3, 2014

Make your own hotly criticised circle graph!!!

Make your own hotly criticized circle graph! The recent critical post on R-bloggers has got me wanting to make my own bubble graphs despite the popular criticism of the graph. After some attempts I have produced something that looks similar but needs some more work.

I would like to say briefly that R-blogger post by while interesting and reasonable seemed to miss the point. The graphic was not supposed to convey massive amounts of information but simply to make obvious the scope and degree of how the current Ebola epidemic compares with other epidemics.

There are many good reasons to critique this graph from an epidemiological perspective (forthcoming), yet from a graphing perspective I think  this graph format is extremely efficient at conveying memorable information while the graphs assembled by PirateGrunt were more detailed they were also entirely forgettable unfortunately.

Here are a couple of examples. The first is with meaningless simulated data demonstrating how large differences in scale may appear. The second is actually using HIV prevalence data to give a graphical representation of how such rates have changed over the years. You can fine the code as well as the data on github.

Table 1: Demonstrates how scale can be communicated by size of circle


Table 2: Demonstrates the relative stability of AIDS in terms of people ill. Top number is  year while lower number is millions of cases.

Monday, October 27, 2014

Ebola in Liberia could be different than Ebola in New Jersey

http://www.pixton.com/uk/comic/8wlmhhc2

The Ebola outbreak in Western Africa has initiated panic throughout the world. Thirty-seven countries so far have implemented policies to restrict the international spread of Ebola. In the United States, governor Chris Christie has initiated additional travel restrictions implementing a mandatory quarantine of health workers returning from Liberia, Sierra Leone, and Guinea even when no symptoms are present.

Should Chris Christie be concerned about Ebola in New Jersey? In some ways New Jersey is not so different from Liberia which has already suffered 4,665 deaths according to the CDC.

New Jersey has a population of around 8.9 million people while Liberia has a population a little less than half that of 4.1 million. Both countries were established by foreigners. New Jersey was is first established by Dutch people in America while Liberia was established by Americans in Africa. Both states gained their current affiliation within 60 years ago with New Jersey entering the Union in 1787 and Liberia gaining independence in 1847. The official language of both New Jersey and Liberia is English and both governments use the dollar though one is the US Dollar and the other is the Liberian Dollar. In both states they drive on the right side of the road and the largest religious affiliation in both regions is Christianity.

I know, from this description thus far most people would find it very difficult to tell if they were in Liberia or New Jersey. Under this reasoning it seems very important for New Jersey to implement stringent rules to keep out any chance of Ebola entering its borders. However, there are some minor differences between New Jersey and Liberia that may bear mentioning.

Overall the United States is listed top 4 on the United Nation's Human Development Index while Liberia is ranked the 5th lowest in 2011 out 187 countries. Within the United States New Jersey is ranked the third most developed states. But what does this really mean?

The per capita earnings in Liberia was reported at 436 USD while in New Jersey the per capita was 54,699 USD. Thus in a year a New Jersey resident could be expected to earn more than 125 Liberians. But does this somewhat noticeable difference in earnings translate into differences in medical services?

According to the US census there was about 311 doctors per 100,000 residents of New Jersey in 2006 translating to about 27,680 medical doctors. Liberia on the other hand reported having 51 doctors in the entire country in 2006. If doctors are distributed evenly throughout both states then you are likely to find one medical doctor within every 0.31 square miles in New Jersey while in Liberia you are likely to find only one medical doctor every 811 square miles.

New Jersey residents can expect to live 80.3 years on average and struggle with different health related issues than those faced by Liberians who can only expect to live 57.4 years. In New Jersey where over-consumption presents a major concern with 60.7% of the population is overweight or obese and the chief causes of death is related to over-consumption including heart disease and diabetes. In contrast 38.5% of Liberians suffer from malnutrition with deaths caused from easily treatable diseases such as malaria, pneumonia, and diarrhea which are collectively responsible for 50% of the deaths in the country.

Overall, New Jersey is vastly more wealthy in terms of both income as well as medial expertise, New Jersey is vastly better prepared for Ebola than Liberia. Within Africa there have been over twenty outbreaks of Ebola, the majority of which have been rapidly contained despite most countries not being significantly better off than Liberia. Ebola is a reasonably well understood disease which can be rapidly controlled when appropriate steps are taken using infrastructure much worse off than that faced in New Jersey.

Despite the numerous similarities between New Jersey and Liberia, we should expect any outbreak of Ebola in New Jersey to be rapidly contained. It is unlikely that this outbreak of Ebola in Liberia will result in any significant outbreak in New Jersey or anywhere in the United States in which even our poorest areas are much better equipped than any of the countries now suffering from the outbreak.

This post thus far has been meant as a jab at the hysteria and political maneuvering that has surrounded Ebola. However, the fear of this disease is entirely appropriate if not well placed. This disease with its high mortality rate (~70%) and rapid transmission within these impoverished nations (Sierra Leone, Liberia, and Guinea) has the potential if not stopped to cause as much loss of life and suffering as the worse currently communicable diseases which plague humanity such as malaria (627,000 annual deaths) or tuberculosis (1,460,000 annual deaths).

I believe it is still possible to turn the tide of Ebola around if sufficient international aid is brought to bear in these afflicted nations. Becoming distracted with enacting useless policies which harm the ability of health care workers to travel between wealthy developed nations and West African nations not only misses the point, but actively undermines the ability of philanthropic health care workers to control this disease.

Thursday, October 9, 2014

Waterfall and 3D plotting exploration

Taking the very cool 'waterfall graph' code posted by Robert Grant I have added some features (resistance to distributions with sparse data at some areas) as well as the ability to heat map the bivariate distribution based on a third variable z. Find the code on github.

Overall I find the graphs produced from this code to be beautiful and fascinating though I am not sure if I would really use them as a form of data exploration. In addition, I am not sure if I would expect anybody to be able to understand what I am communicating. But let us first see some graphs associated with different distributions before we jump to any conclusions.
Figure 1: Plotting data points.
The primary motivation for a density graph like this is Figure 1. We can see there is something going on between variables x and y but it is really hard to be able to see what it is. I could make the dots smaller(currently cex=.2) or more transparent (currently=10%) but there will always be the problem that I have 500,000 points and plotting them ultimately will result in some loss of information either in the core where the color becomes solid or near the tails where the color will blend into the background.

Figure 2: A standard 3D density graph created with the MASS package.
Of course there is no way of plotting this much information without some kind of a loss of information. Plotting a 3D graph of densities is one way of reducing information to a more manageable level. The MASS package provides some tools to be able to do this (Figure 2: Reference).

Figure 3: For different ways of slicing the same data.

Let us keep these previous graphs in mind as we explore Robert Grants waterfall graph. The bivariate distribution x and y plotted on the previous graphs can be seen in Figure 3 mapped from four different perspectives (the bottom two of the perspectives are reverse order so the bottom left hand panel is not the same). One immediately notices that while there are some similarities there are also a lot of differences.

Visibly the waterfall graph seems to have more information with many more lines as some capitalization on transparencies. The four 3D renderings of the data is much easier to identify as the same information from different angles. However, the waterfall graph seems to be less of a rotation or transposition of the same object. This puzzled me for a while until I realized that the waterfall graph is not communicating the same information at the 3D graph in its current form. Notice the top left graph. Notice how all of the peaks seem to be equivalent in the waterfall graph yet very pointed in the 3D rendering.

Figure 4: Distance between slices is proportional to number of observations.
This is because a new density curve is mapped for each level of the slice. However, slices are based on equal length cuts from the y range (or the x). Thus some cuts such as those near the center have a lot of observations while those near the ends have very few observations information. In order to adjust for this divergence of information I have included as an option that the distance between the density curves be proportional to the number of observations within each slice (Figure 4). These Figures do not have quite the same feel as the last set probably because they do not look so much like mountains however these figures contain much more information with those density curves near the ends being compressed and those near the center being stretched apart communicating that these extreme values are rare.
Figure 5: Transparency is proportional to sparcity
Figure 6: Height is proportional to number of observations.
I also wrote an option to allow the sparsely populated slices to be more transparent (Figure 5). In some ways this is a more intuitive graph. Conceptually you can think of the opacity of each slide as being filled in a way by the number of observations. Finally I wrote a different option to have the figures height vary by population (Figure 6). Not surprisingly this ended up producing a graph very similar to Figure 2. That was surprising is noting how completely different this rendering of the data appeared than the other graphs.

Now that we have a few different ways of communicating the same information, let us see which set of graphs seems to get us to a place of understanding the relationship between our two variables of interest. First let us note how I generated the data (y~N(0,1) and x~3*N(0,1)+y^2). Thus x and y are uncorrelated but still dependent. All of the figures communicate some of this information. I might personally prefer Figure 1 because the figure sufficiently communicates a significant representation of the data. I think adding the 3d graphs does not add significantly to Figure 1. But this is because everything peaks smoothly in the dark area. If however there were say a dip in density within the dark region then Figure 1 would not be able to warn us of that dip unless it was severe.

Comparing Figure 1, Figure 2, and Figure 3, 4, 5, 6 I am not sure if I would favor the waterfall graphs (except 6 which hardly counts). While beautiful they do not communicate to me clearly what they are representing. However, perhaps this is just an issue of developing an internal intuition of what is going on. To that end, let us explore some less complex data relationships.
Figure 7: No relationship, positive correlation, negative correlation, linear dependency.


Figure 7 (TL, TR, BR, RL) shows what no=correlation (x=N(0,1),y=N(0,1)), positive (x=N(0,1)+.5y), negative (x=N(0,1)-.5y), and perfect coliniearity (x=y). Does this help with out intuition? Maybe a little bit.
Figure 8: It is possible to include an additional variable z which is used to select color from multiple RGB specified colors.
As an additional option, I coded into the function the option of taking a third variable which acts as a color variable. In order for this to work properly you must specify matrix with two or more rows each with 3 or 4 rgb value colors. As the third variable z varies from low to high it will automatically color the slice appropriately. I have set z as a function or x and y with noise (Figure 8). Figure 8 varriest between three different colors (yellow, teal, darker blue). The top two graphs are with x,y with z as color while the bottom two are instead with either z and x with y as color or z and y with x as color. Because this is made up data I do not get much out of it. However, I could imagine someone being able to find meaning in these graphs.

Overall, I would have to say that I am unconvinced that the waterfall graph is an effective substitute for a 3D graph. However, there is no reason to believe that this should be the only criteria for defining such a graph! So far we have been acting as if both x and y were random variables.

But often one of the variables in not random, or we are comfortable acting as if it not random when evaluating the other variable. In addition, we might not care how sparsely populated our slices are, we are mostly concerned with how our distributions change over time. Take income information and age. We might not care how many x year old people their are. But we may care at each age category how the distribution of income lays.
Figure 9: Loge wages against age. Top left youngest is closest, top right oldest is closest, bottom left age against log wages, bottom right is same as top left except height is proportional to observations.
Figure 9 shows us exactly this information. Using IPUMS data (citation below) for 2005 just looking at age and log wage income excluding zeros and non-reported values we see the distribution of wage income for different ages with the youngest in fron and oldest in back on the top left graph. On the top right age order is reversed. We can see that as people get older wages tend to increase and tighten into middle age and then plateau before falling a bit and widenning once again in older life. The bottom left graph is the same information except now wage is the slice and age is the density. It looks like among older folks their wages tend to be higher while younger ones seem to heavily represent the lowest wage earning categories. The final panel is showing us the same information as the first panel except that now height is proportional to density within the slice. We can see that the high representation of middle-aged baby boomers dominates this graph and makes other comparisons difficult.

From this real world example, I therefore think that the waterfall slice graphing framework is very useful. it is not a replacement for the 3D graph but rather an alternative representation of a different feature of the data. If you would like to find the code to create the graphs used in this article please check out my github repo. And if you find this post helpful, please leave me some feedback!

IPUMS Citation:

Steven Ruggles, J. Trent Alexander, Katie Genadek, Ronald Goeken, Matthew B. Schroeder, and Matthew Sobek. Integrated Public Use Microdata Series: Version 5.0 [Machine-readable database]. Minneapolis, MN: Minnesota Population Center [producer and distributor], 2010.

Wednesday, October 8, 2014

Julia style string literal interpolation in R

I feel like a sculptor who has been using the same metal tools for the last four years and happened to have looked at my comrades and found them sporting new, sleek electric tools. Suddenly all of the hard work put into maintaining and adapting my metal tools ends up looking like duck tape and bubble gum patches.

I hate to say it but I feel that I have become somewhat infatuated with Julia. And infatuation is the right word. I have not yet committed the time to fully immerse myself in the language, yet everything I know about it makes me want to learn more. The language is well known for its mind-blowingly speed accomplished through just-in-time compiling. It also has many features which enhance the efficiency and readability of its code (see previous post, note the documentation has greatly improved since posting).

However, though I very much want to, I cannot entirely switch my coding needs from R into Julia. This is primarily due to my ongoing usage of packages such as RStudio's "Shiny" and the University of Cambridge's server side software for building adaptive tests, "Concerto". And so with regret I will resign my Julia coding to probably a minor portion of my programming needs.

That does not mean however that I can't make some small changes to make R work more like Julia. To this end I have programmed a small function p which will replace string literals identified as "Hello #(name), how are you?" with their values being evaluated. If there are nested parenthesizes then it is necessary to close the literal with ")#", for example "c=#(b^(1+a))#".

# Julia like text concaction function.
p <- function(..., sep="", esc="#") { 
  # Change escape characters by specifying esc.
  # Break the input values into different strings cut at '#('
  x <- paste(..., sep=sep)
  x <- unlist(strsplit(x, paste0(esc,"("), fixed = TRUE))
 
  # The first element is never evaluated.
  out <- x[1]
  # Check if x has been split.
  if (length(x)>1) for (i in 2:length(x)) {
    y <- unlist(strsplit(x[i], paste0(")",esc), fixed = TRUE))
    if (x[i]==y[1])
      y <- unlist(regmatches(x[i], regexpr(")", x[i]), 
                             invert = TRUE))
    out <- paste0(out, eval(parse(text=y[1])), y[-1])
  }
  out
}
 
name="Bob"
height=72
weight=230
 
# Let's see it in action
p(sep=" ", "Hello #(name).",
  "My record indicates you are #(height) inches tall and weigh #(weight) pounds.",
  "Your body mass index is #(round(703*weight/height^2,1))#") 
# [1] "Hello Bob. My record indicates you are 72 inches tall and weigh 230 pounds. 
# Your body mass index is 31.2" 
 
# The other nice thing about the p function is that it can be used to concat
# strings as a shortcut for paste0.
p("QRS","TUV")
# [1] "QRSTUV"
Created by Pretty R at inside-R.org

Thank you SO community for your help.

Monday, October 6, 2014

Julia: The "Distributions" Package

This is a follow up to my post from a few days ago exploring random number generation in Julia's base system.  In this post I will explore the 'distributions' package.

You can find the excellent documentation on the "Distributions" package at:
http://distributionsjl.readthedocs.org/en/latest/index.html

# First let's set the current directory
cd("C:/Dropbox/Econometrics by Simulation/2014-10-October/")

# This post uses the following distributions
using Distributions
using Gadfly

# I have got to say that I love the way Julia handles distributions
# as I discovered through this post.

# The Distributions package gives trenendous power to the user by
# providing a common framework to apply various function.

# For instance let's say you want to draw 10 draws from a Binomial(n=10, p=.25) distribution

rand(Binomial(10, .25), 1, 10)
#  4  3  0  5  1  3  5  2  2  1

# Looks pretty standard right? Well, what if we want the mean?

mean(Binomial(10, .25))
# 2.5

# mode, skewness, kurtosis, median?

a = Binomial(10, .25)
println("mode:", mode(a), " skewness:", skewness(a),
        " kurtosis:", kurtosis(a), " median:", median(a))
# mode:3 skewness:0.3651483716701107 kurtosis:-0.06666666666666667 median:3
 

# Cool huh?

# Let's see how we can use Gadfly to easily plot some distributions:

# First we generate the plot (assign it to a 'p' for later drawing to disk)
#  In order to plot the different CDFs I will use an 'anonymous' function defined:
#  argument -> f(argument)
p=plot([x -> cdf(Normal(5,5^.5), x),
      x -> cdf(Gamma(3,1), x),
      x -> cdf(Exponential(2), x)]
      , 0, 10)

 # Write the graph to disk
draw(PNG("2014-10-06CDF.png", 24cm, 12cm), p)



g = plot([x -> pdf(Normal(5,2), x),
      x -> pdf(Gamma(3,1), x),
      x -> pdf(Exponential(2), x)]
      , 0, 10)

draw(PNG("2014-10-06PDF.png", 24cm, 12cm), g)