Saturday, April 11, 2020

Is COVID-19 as bad as all that? Yes it probably is.

As an economist without formal training in epidemiology, I have done my best to leave the modelling up to the experts. But the world has shut down around me, my life is suddenly so much more complicated, and I have to wonder: is this COVID thing as dangerous as it seems? When things got bad in Italy my optimistic friends said, “that’s just Italy”. When things got bad in Spain, they said the same. But now New York has more deaths per capita than either Italy or Spain and I am starting to sweat a little. Is there something particularly bad about the New York health care system that has made the state more vulnerable to this disease than others?

Looking at the mortality rates of previous flu seasons, New York has been among the 10 best-performing states for the last four years, according to the CDC. In 2017 New York had the 7th lowest death rate by state, beaten only by states with smaller shares of elderly residents (Utah, Alaska, California, Colorado, Texas, and Washington).

Does New York have a particularly large elderly population which has made it more vulnerable? Nope. New York State, with 14.66% of its population older than 65, ranks squarely toward the youngish center of states (29th youngest of 50), while New York City is younger still, with only 13% of the city population older than 65.

Well, maybe the mortality rate of COVID-19 just seems high because it is about to peak? After all, the flu kills somewhere between 12 and 70 thousand people in the US every year and 290 to 650 thousand globally. By comparison, COVID-19, with an estimated 19 thousand deaths so far in the US and 104 thousand globally, doesn’t seem that dangerous.

Yet, the very reasonable concern is that this disease is just getting started. Wikipedia numbers suggest the total number of cases globally is 1.7 million, which we know is a lower bound of the true number of cases, as many of those who have COVID-19 have not been tested.

We don’t know how many people currently have COVID but we can imagine a few different scenarios.

Scenario 1: COVID-19 reported cases are close to true cases

Let’s imagine that the number of people with COVID is approximately the number that we have record of. There are some unreported cases but not that many. If this is the case, we are in an extremely frightening world: so far the disease has killed about 104 thousand of the 1.7 million people it has infected, a 6% mortality rate. Almost all of those infected have not yet recovered, meaning some of them will still die, increasing the observed mortality rate. The small consolation under this scenario is that cases are largely detected, so with enough government and individual intervention ongoing transmission could likely be slowed and stopped through thorough and diligent contact tracing.

Scenario 2: COVID-19 reported cases are a reasonable fraction of true cases

Let’s imagine that the true number of cases is somewhere between 2 and 10 times the number reported. Under this scenario, the current mortality rate is calculated by dividing the observed mortality rate by the factor of unknown cases, so 6/2=3% for 2 times and 6/10=0.6% for 10 times. In this scenario contact tracing by and large will fail, as there are simply too many unknown cases. The best thing governments and individuals can do is shut off potential avenues of transmission between individuals until either a vaccine can be found or the number of new cases is so small that contact tracing becomes feasible. Sadly, even if the true number of cases is 10x the reported number, the mortality rate of COVID, at a minimum of 0.6%, is still much higher than that of the seasonal flu. If left unchecked until everyone had been infected, it would result in about 1.98 million fatalities in the US alone (0.6% * 330 million, the approximate US population), which is nearly as many as the top ten leading causes of death in the US combined:

Table 1: Top ten leading causes of death in the US (deaths per year)

    Heart disease: 647,457
    Cancer: 599,108
    Accidents (unintentional injuries): 169,936
    Chronic lower respiratory diseases: 160,201
    Stroke (cerebrovascular diseases): 146,383
    Alzheimer’s disease: 121,404
    Diabetes: 83,564
    Influenza and pneumonia: 55,672
    Nephritis, nephrotic syndrome, and nephrosis: 50,633
    Intentional self-harm (suicide): 47,173
    Total: about 2.08 million
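
The arithmetic behind Scenarios 1 and 2 can be sketched in a few lines. This is only an illustration in Python (the post's own graphs were made in R), and the function and variable names are made up for this example. It divides the reported deaths by the reported cases multiplied by an assumed undercount factor, then sums Table 1 for comparison.

```python
# Global figures quoted in the post (as of April 11, 2020).
REPORTED_DEATHS = 104_000
REPORTED_CASES = 1_700_000


def mortality_rate(deaths, reported_cases, undercount_factor=1):
    """Deaths divided by true cases, where true cases = reported cases * undercount_factor."""
    return deaths / (reported_cases * undercount_factor)


# A factor of 1 corresponds to Scenario 1; factors 2 and 10 bracket Scenario 2.
rates = {factor: mortality_rate(REPORTED_DEATHS, REPORTED_CASES, factor)
         for factor in (1, 2, 10)}
for factor, rate in rates.items():
    print(f"{factor:>2}x reported cases -> {rate:.1%} mortality")

# Table 1: annual US deaths from the ten leading causes.
TOP_TEN_CAUSES = {
    "Heart disease": 647_457,
    "Cancer": 599_108,
    "Accidents (unintentional injuries)": 169_936,
    "Chronic lower respiratory diseases": 160_201,
    "Stroke (cerebrovascular diseases)": 146_383,
    "Alzheimer's disease": 121_404,
    "Diabetes": 83_564,
    "Influenza and pneumonia": 55_672,
    "Nephritis, nephrotic syndrome, and nephrosis": 50_633,
    "Intentional self-harm (suicide)": 47_173,
}
top_ten_total = sum(TOP_TEN_CAUSES.values())
print(f"Top ten causes combined: {top_ten_total:,}")  # 2,081,531
```

Even at the optimistic end (a factor of 10), the rate stays at roughly 0.6%, or about 6,000 deaths for every million people infected.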

Scenario 3: COVID is already everywhere and most people have it or have already had it

Strangely, this is the best-case scenario. Under this scenario only those who have severe outcomes from COVID-19 are being reported, while the vast majority (something like 99%) of infected individuals are asymptomatic. Shutting down state, national, and international travel and social activities for any extended period of time would be futile, as the virus is already everywhere; we would just need to treat the severe cases that pop up as best we can and suck it up. This scenario is appealing because it means the worst has already come or soon will.

So which scenario are we in?

Reviewing the scenarios, it is impossible to know with certainty which one reflects reality. However, does the evidence point against any given scenario?

Scenario 1 seems unlikely to me because tens of thousands of new cases are popping up each day (Figure 5). This rate of new infections seems to indicate that there is a sizable infected population which has not yet been detected and has continued to spread the virus despite national, state, and local recommendations and mandates intended to limit spread.

Scenario 3, in which COVID-19 is already everywhere, seems unlikely due to the lumpiness of the mortality numbers. If COVID-19 were everywhere, we would expect people across all states and countries to be dying from the disease more or less proportionately, so mortality numbers would be mostly homogeneous across states. This is not what we are seeing: mortality numbers are highly heterogeneous across states and countries. New York currently has around 400 deaths per million, while New Jersey has 218, Michigan 108, Florida around 19, California 14, Texas 8, and Montana 6.

These numbers suggest that COVID is spreading from infected communities to non-infected communities in a hotspot community-spread pattern, rather than through the widespread dispersal characteristic of Scenario 3.
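
To put a number on that lumpiness, here is a minimal Python sketch using only the deaths-per-million figures quoted above. The choice of summary measures (the highest-to-lowest ratio and the coefficient of variation) is my own illustration, not part of the original analysis.

```python
from statistics import fmean, pstdev

# Deaths per million as quoted in the post (approximate, April 11, 2020).
deaths_per_million = {
    "New York": 400,
    "New Jersey": 218,
    "Michigan": 108,
    "Florida": 19,
    "California": 14,
    "Texas": 8,
    "Montana": 6,
}

highest = max(deaths_per_million, key=deaths_per_million.get)
lowest = min(deaths_per_million, key=deaths_per_million.get)
ratio = deaths_per_million[highest] / deaths_per_million[lowest]

# Coefficient of variation: spread relative to the mean (0 would mean perfectly even).
values = list(deaths_per_million.values())
coefficient_of_variation = pstdev(values) / fmean(values)

print(f"{highest} has {ratio:.0f}x the per-capita deaths of {lowest}")
print(f"Coefficient of variation: {coefficient_of_variation:.2f}")
```

If the virus were evenly spread, the ratio would be close to 1 and the coefficient of variation close to 0.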

But, one might ask, is it possible that deaths previously assigned to other causes were actually caused by COVID-19 before the virus was known and publicized? Yes, there are very likely deaths caused by COVID-19 which have not yet been correctly attributed to the disease. If accounted for, could these deaths correct the heterogeneity in the data and place us back in Scenario 3? Figure 1 shows the known deaths in New York from COVID-19 compared with flu mortality numbers from 2014-2017. Already, COVID-19 has doubled or will soon double the mortality of the flu for these years, and unfortunately the number of infections has continued to grow at an alarming pace (Figure 2).

Figure 1: Known COVID-19 deaths in New York compared with flu mortality numbers from 2014-2017.

So, while it is impossible to know, I believe it extremely unlikely that a disease twice as deadly as a typical flu (at least in New York) could go undetected in thousands of hospitals and laboratories across the United States.

Assuming Scenario 2

With some reports saying 80% of cases are asymptomatic, an estimate of 5x as many people infected with COVID-19 as have been reported might not be crazy. This would mean that the actual number of people infected with COVID in New York is something like 850,000, which is encouraging in that 9,000 deaths out of 850,000 (1%) is much better than 9,000 deaths out of 170,000 (5.3%).
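
One way to see where a 5x factor could come from: if 80% of cases are asymptomatic and only symptomatic cases ever get reported (a strong simplifying assumption of mine, not something the reports establish), then reported cases are 20% of the total. The Python sketch below works through that reasoning with the New York figures quoted above; the function name is made up for this example.

```python
def implied_undercount(asymptomatic_share):
    """True-to-reported case ratio, ASSUMING only symptomatic cases are reported."""
    return 1 / (1 - asymptomatic_share)


# New York figures quoted in the post.
REPORTED_CASES_NY = 170_000
DEATHS_NY = 9_000

factor = implied_undercount(0.80)
estimated_cases = REPORTED_CASES_NY * factor

naive_rate = DEATHS_NY / REPORTED_CASES_NY
adjusted_rate = DEATHS_NY / estimated_cases

print(f"Undercount factor: {factor:.1f}x")
print(f"Estimated infections: {estimated_cases:,.0f}")
print(f"Mortality using reported cases: {naive_rate:.1%}")
print(f"Mortality using estimated infections: {adjusted_rate:.1%}")
```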

The problem, of course, is that even the inflated number of 850 thousand is only 10% of the city’s population and 4.6% of the total population of the state, meaning we would still have a vast population left to infect. On top of that, somewhere between 6,000 and 10,000 new cases are popping up every day in the state, despite a ‘stay at home’ order having been in effect for two and a half weeks.

Looking at the graph (Figure 2), the number of cases in New York has grown very rapidly. Yet, presumably the number of cases would be even greater if the lockdown order were not in effect.

Yet most of us don’t live in New York. How much should we be worried?

As New York has an above-average health care system and a relatively low proportion of at-risk elderly residents, it could be seen as a lower-risk state than many. Yet New York City is also the densest city in the country, with perhaps the highest use of public transportation and a correspondingly high use of potential public infection points such as grocery stores, theaters, restaurants, etc.

Looking only at states which have reported more than 5,000 cases and scaling counts by log10, we get Figure 3. In Figure 3 it is hard to make out much except that the overall shape of the infection curve seems to be similar across states.

Figure 3: Total number of cases by state for states reporting at least 5,000 cases.

It is difficult to make comparisons between states and to make predictions from Figure 3. However, one technique often used is to pick a point in time with a certain number of cases, then compare how the growth rate in cases changed for other states after they reached the same point. In this case, I will pick the earliest date in my dataset, March 18th, when there were around 2,500 reported cases of COVID-19 in New York. This number was reached later by other states: New Jersey on the 23rd, California on the 25th, Washington on the 26th, Michigan, Florida, and Illinois on the 27th, and so on.

Plotting cases starting at this common point gives us a means of comparing case growth by state (Figure 4). Under this technique, New York appears to have the highest growth rate, followed by New Jersey, with Michigan, California, Louisiana, Massachusetts, Pennsylvania, Illinois, Texas, Georgia, and many other states following a less aggressive but still positive growth trajectory.

Figure 4: Day 0 is the first day a state passes 2,480 cases of reported COVID-19.
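
The alignment behind Figure 4 can be sketched as follows. This is a Python illustration rather than the R code used for the graphs, and the state names and case counts below are hypothetical placeholders, not the real data. Each series is trimmed so that day 0 is the first day on which cumulative cases reach the threshold.

```python
THRESHOLD = 2_480  # cumulative reported cases that define day 0 in Figure 4


def align_at_threshold(cumulative_cases, threshold):
    """Return the series from the first day with cases >= threshold (empty if never reached)."""
    for day, cases in enumerate(cumulative_cases):
        if cases >= threshold:
            return cumulative_cases[day:]
    return []


# HYPOTHETICAL cumulative case counts by calendar day, for illustration only.
series = {
    "State A": [900, 1_700, 2_500, 4_200, 7_100, 10_400],
    "State B": [300, 600, 1_100, 1_900, 2_600, 3_400],
    "State C": [50, 90, 160, 300, 520, 800],
}

aligned = {state: align_at_threshold(cases, THRESHOLD)
           for state, cases in series.items()}

for state, cases in aligned.items():
    if cases:
        print(f"{state}: day 0 = {cases[0]:,} cases, {len(cases) - 1} days of follow-up")
    else:
        print(f"{state}: has not reached {THRESHOLD:,} cases")
```

Once every series starts at day 0, the curves can be plotted on a shared axis and their slopes compared directly.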


COVID-19 appears to be really bad and New York has been hit the hardest - so far. How bad? We won’t know until after the crisis has passed. Fortunately, other states had lower rates around the time the country (the President) started taking this crisis seriously. Since then those states appear to be on a more gradual growth trajectory than New York.

Yet despite widespread concern over COVID-19 and instructions and mandates to help reduce the spread, new infections are still on the rise (Figure 5). And this is under conditions in which we have put a stop to in-person social gatherings, closed restaurants, and ordered residents to stay indoors in many states. If we were to, say, start withdrawing these restrictions, it would seem likely that growth rates of new infections would start rising rapidly once again.

Figure 5
Graphs created in R - code on GitHub

Friday, March 22, 2019

Data Fun - Inspired by Datasaurus

My recent post on Anscombe's Quartet demonstrated how to efficiently adjust any data set to match a given mean, variance, correlation (x,y), and set of regression coefficients. Afterwards, Philip Waggoner turned me on to Justin Matejka and George Fitzmaurice's Datasaurus R package/paper, in which the authors demonstrate an alternative method of modifying existing data to fit a specified mean and variance for X and Y. Their method randomly applies small disturbances to individual observations to gradually move the data to match a target preference set.

Inspired by their use of point images which match a specific parameter set, I have generated some of my own. For all of them, X has a mean of 1 and a variance of 11, while Y has a mean of 1 and a variance of 1. The correlation between X and Y is set to 0, which causes some distortion in the images. More on that later in the post.
Figure 1: Shows a graph of 12 data sets each with 15 transitional data sets. The mean, variance, and correlations of X and Y are held constant throughout the sets and transitions.
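
As a rough illustration of what matching a target mean and variance involves, the Python sketch below shifts and scales a list of values linearly. It is not the generator code from the earlier post (which is in R), the input values are hypothetical, and using the population variance rather than the sample variance is an assumption made for this example.

```python
from statistics import fmean, pstdev, pvariance


def rescale(values, target_mean, target_var):
    """Linearly shift and scale values so they have the target mean and (population) variance."""
    mean = fmean(values)
    scale = target_var ** 0.5 / pstdev(values)
    return [target_mean + (v - mean) * scale for v in values]


# HYPOTHETICAL raw x coordinates, e.g. pixel positions of clicked points.
raw_x = [12.0, 40.0, 55.0, 71.0, 90.0, 133.0]

x = rescale(raw_x, target_mean=1, target_var=11)
print(f"mean = {fmean(x):.3f}, variance = {pvariance(x):.3f}")
```

Because the transformation is linear, it stretches and moves the picture along one axis without changing the relative positions of the points.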

Data Source

I generated the data myself using Mobilefish's upload photo and record clicks webapp. The source images are ones I found online.

The only slight trick to using the data generated by Mobilefish was that y coordinates are typically tracked from the top of the page by software, yet most statistical graphing software plots with y starting from the bottom of the graph.
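
The fix for the coordinate mismatch is a one-line flip. Here is a minimal Python sketch with hypothetical click coordinates and a hypothetical image height; the function name is made up for this example.

```python
def flip_y(points, image_height):
    """Convert (x, y) points measured from the top of an image to y measured from the bottom."""
    return [(x, image_height - y) for x, y in points]


# HYPOTHETICAL recorded clicks on a 200-pixel-tall image, y measured from the top.
clicks = [(10, 5), (40, 80), (25, 120)]

flipped = flip_y(clicks, image_height=200)
print(flipped)  # [(10, 195), (40, 120), (25, 80)]
```

Without the flip, every image would be plotted upside down.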

Raw Images

The raw data when plotted look like their source material.

New Images: Force Cor(X,Y)=0

When we force the correlation of X and Y to be zero certain point distributions become distorted.

For Bart and the Cat, forcing cor(X,Y)=0 introduces noticeable distortions, while for the flower only minimal distortions seem to have been introduced.
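
One way to see why forcing zero correlation distorts some images more than others is to remove the linear trend of Y on X directly. The Python sketch below is an assumption about the mechanism, not the generator code from the earlier post, and the points are hypothetical. It subtracts the fitted slope times the centered X from each Y, which shears the picture vertically.

```python
from statistics import fmean


def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def force_zero_correlation(xs, ys):
    """Remove the linear trend of ys on xs, leaving zero sample correlation."""
    slope = covariance(xs, ys) / covariance(xs, xs)
    mx = fmean(xs)
    return [y - slope * (x - mx) for x, y in zip(xs, ys)]


# HYPOTHETICAL points with a clear upward trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

new_ys = force_zero_correlation(xs, ys)
print(f"covariance before: {covariance(xs, ys):.3f}")
print(f"covariance after:  {covariance(xs, new_ys):.3f}")
```

An image whose points already have little linear trend, as seems to be the case for the flower, has a slope near zero and is barely moved. An image with a strong diagonal trend is sheared visibly. The variance of Y also changes in this step, so it would need to be rescaled afterwards.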

New Images: Force Cor(X,Y)<>0

It gets even worse when we impose a constant correlation between X and Y. The following shows the distortions to the flower when we change b1, keeping Cor(X,Y) constant and fixing the Y plot limits.
Figure 8: Shows the effect on Var(Y) of changing b1 when all other factors are held constant.

Slight changes to the Anscombe-Generator Code

In order to generate graphs that had cor(X,Y)=0 I had to modify my previous code to allow variation in Y that was completely independent of X. The problem with my code was that if b1=1, my calculation used SSE = (b1^2 * var(X))*n in order to infer how large the variation in u needed to be (varianceu = (SSE/corXY^2 - SSE)/n). This backwards inference does not work if b1=0.

So, just for the special case of corXY=0, I have included an additional error term E, which is helpful in the event that b1=0.
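
The backwards inference described above can be written out to show where it breaks. This Python sketch transcribes the two formulas quoted in the text; the function name and example values are hypothetical, and the original generator is written in R.

```python
def variance_u(b1, var_x, cor_xy, n):
    """Infer the error variance needed to hit cor_xy, using the formulas quoted in the post."""
    sse = (b1 ** 2 * var_x) * n
    return (sse / cor_xy ** 2 - sse) / n


# Works when b1 and cor_xy are both non-zero: with var(X)=11 and cor=0.5,
# the explained variance is 11 and the error variance must be 33 (11 / 44 = 0.25 = 0.5^2).
normal_case = variance_u(b1=1, var_x=11, cor_xy=0.5, n=100)
print(normal_case)

# With b1=0 the explained variation is zero, so the inferred error variance
# is zero as well, and Y ends up with no variation at all.
flat_case = variance_u(b1=0, var_x=11, cor_xy=0.5, n=100)
print(flat_case)

# With cor_xy=0 the formula divides by zero.
try:
    variance_u(b1=1, var_x=11, cor_xy=0, n=100)
    zero_cor_failed = False
except ZeroDivisionError:
    zero_cor_failed = True
print(f"cor_xy=0 raises ZeroDivisionError: {zero_cor_failed}")
```

Both failure cases show why variation in Y that is independent of X has to be supplied separately when the target correlation is zero.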


The thought of using points to make recognizable images had not occurred to me until I viewed Justin Matejka and George Fitzmaurice's Datasaurus work. I hope that by making a slightly more efficient distribution manipulator I will allow new and better datasets to be generated, which will help students understand the importance of graphical exploration of their data.


My code, as well as the slight change to the Anscombe-Generator code, can be found here. The standardized data sets can be found here. They are the ones that end with std.csv.