## Wednesday, September 12, 2018

### Plotting the Impact of Atlantic Hurricanes on the US

With Hurricane Florence bearing down with what is expected to be devastating force, it seems a good time to reflect on the history of Atlantic hurricanes' impact on the United States. As an easy, if not foolproof, source for hurricane data I drew on public data collated through Wikipedia's list of costliest Atlantic hurricanes as well as linked articles. That page lists 45 hurricanes going back to Hurricane Betsy in 1965, 30 of them since 2000, totaling 803 billion dollars in estimated damages and 6,591 US fatalities. While this is quite a large number of fatalities, the death toll of Atlantic hurricanes outside of the US is significantly larger, with nearly 30,000 deaths estimated.

Within the US, death tolls from hurricanes tend to be quite low: every hurricane on record killed fewer than two hundred people, with the exceptions of Maria (2017 - Trump administration) and Katrina (2005 - GW Bush administration).

 Figure 1: US Fatalities by Year.

I have also scaled the y-axis by log10 to make it easier to examine less deadly storms.
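
For readers curious how such a log-scaled plot might be produced, here is a minimal ggplot2 sketch. The data frame `hurr` and its columns `Year` and `US_Fatalities` are assumptions standing in for the collated Wikipedia data, not the exact code used for the figures.

```r
# Minimal sketch, assuming a data frame 'hurr' with columns Year and US_Fatalities
library(ggplot2)

ggplot(hurr, aes(x = Year, y = US_Fatalities)) +
  geom_point() +
  scale_y_log10() +  # log10 scaling makes the less deadly storms easier to compare
  labs(y = "US Fatalities (log10 scale)")
```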

 Figure 2: US Fatalities by Year Log10 scaled.

That Maria and Katrina were the deadliest storms does not mean they were necessarily the most damaging.

 Figure 3: Damages in the Billions by Year.

Harvey and Katrina are each reported to have done 125 billion dollars in damage, followed by Maria at 91.2 billion, Sandy at 68.7 billion, and Irma at 64.8 billion.

So why is it that Harvey (2017 - Trump), a storm that did as much damage as Katrina, resulted in only 106 deaths, while Maria, a storm that did roughly 25% less damage, resulted in a US death toll comparable to the sum of all 44 other hurricanes on record (2,982 vs. 3,609)?

 Figure 4: Total deaths by Year.

We might attempt to predict which storms will be the most deadly using storm measurements such as the highest wind speed recorded and the lowest air pressure recorded.

 Figure 5: Wind speed by US Fatalities.
 Figure 6: Lowest air pressure recorded, in millibars (mbar).

Since Maria and Katrina are both on the higher end of the wind speed distribution and the lower end of the air pressure distribution, we can assert that a more powerful storm is a necessary component for the kind of death tolls associated with these storms.

What factors must be present for a severe storm, which would otherwise be associated with a death toll of fewer than 200 people, to become a tremendous human tragedy such as Katrina or, even more so, Maria remains unknown given the sparsity of data available to the public.

From the data it appears that Atlantic hurricanes have grown more damaging and more deadly for US citizens.

Hopefully, despite the catastrophic blunders of the handling of Katrina and Maria by the Bush and Trump administrations respectively, future hurricanes such as Florence will be dealt with in a more responsible manner.

Code and data can be found here

## Thursday, December 1, 2016

### Efficiently Saving and Sharing Data in R

After spending a day the other week struggling to make sense of a federal data set shared in an archaic format (an ASCII fixed-width .dat file), I decided to look at how the file formats available to R users compare in terms of disk space and speed.

It is essential for the effective distribution and sharing of data that it use the minimum amount of disk space and be rapidly accessible for use by potential users.

In this post I test four different file formats available to R users: comma separated values (write.csv()), an ASCII object representation written to a txt file (dput()), a serialized R object (saveRDS()), and a Stata file (write.dta() from the foreign package). For reference, rds files appear to be identical to Rdata files except that they hold a single object rather than potentially several.
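
For reference, a minimal sketch of writing the same data frame out in each of the four formats might look like the following; `df` and the file names are placeholders rather than the exact code behind the figures.

```r
library(foreign)  # provides write.dta() for Stata files

# df is any data frame to be written out
write.csv(df, "df.csv", row.names = FALSE)  # comma separated values
dput(df, "df.txt")                          # ASCII object representation
saveRDS(df, "df.rds")                       # serialized R object
write.dta(df, "df.dta")                     # Stata format
```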

In order to get an idea of how and where the different formats outperform each other, I simulated a dataset composed of different common data formats (a rough sketch of how such a dataset might be generated follows the list). These formats were the following:

Numeric Formats

• Index 1 to N - ex. 1,2,3,4,...
• Whole Numbers - ex. 30, 81, 73, 5, ...
• Big Numbers - ex. 36374.989943, 15280.050850, 5.908210, 79.890601, 2.857904, ...
• Continuous Numbers - ex. 1.1681155, 1.6963295, 0.8964436, -0.5227753, ...

Text Formats

• String coded factor variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd ...
• String coded factor variables with 16 characters coded as strings
• String coded factor variables with 64 characters coded as strings
• Factor coded variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd - coded as 1,2,4,3,2, ...
• Factor coded variables with 16 characters
• Factor coded variables with 64 characters
• String variables with random 4 characters - ex. jdhd, jdjj, ienz, lsdk, ...
• String variables with random 16 characters
• String variables with random 64 characters
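
As promised above, here is a rough sketch of how such a dataset might be simulated. The variable names, sample sizes, and generators are illustrative assumptions, not the exact simulation used for the figures.

```r
# Sketch of simulating the kinds of variables described above
N <- 10000

# Helper: n random lowercase strings of a given width
rand_string <- function(n, width) {
  vapply(seq_len(n),
         function(i) paste(sample(letters, width, replace = TRUE), collapse = ""),
         character(1))
}

sim <- data.frame(
  index      = 1:N,                                         # index 1 to N
  whole      = sample(1:100, N, replace = TRUE),             # whole numbers
  big        = round(rexp(N, rate = 1/5000), 6),             # big numbers with many digits
  continuous = rnorm(N),                                     # continuous numbers
  str4       = sample(rand_string(20, 4),  N, replace = TRUE),  # repetitive 4-character strings
  str16      = sample(rand_string(20, 16), N, replace = TRUE),  # repetitive 16-character strings
  fac4       = factor(sample(rand_string(20, 4), N, replace = TRUE)),  # factor coded
  rand64     = rand_string(N, 64),                           # fully random 64-character strings
  stringsAsFactors = FALSE
)
```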

Which format a variable is in is predictive of how much space that variable takes up and therefore how time consuming it is to read or write. Variables that are easy to describe tend to take up little space. An index variable is an extreme example: it can take up almost no space because it can be expressed in an extremely compact form (1:N).

In contrast, numbers which are very long or have a great degree of precision carry more information and therefore take more resources to store and access. String variables filled with truly random or unique values are some of the hardest data to compress, as each value may be drawn from the full character set. There is significant potential for compression when strings repeat within a variable. These repetitive entries can be coded either as a "factor" variable or as a string variable in R.

As part of this exploration, I look at how string data is stored and saved when coded as either a string or as a factor within R.
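
Before looking at files on disk, it can help to compare the two encodings in memory. A minimal sketch using object.size(), with a made-up repetitive string vector:

```r
# In-memory comparison of a repetitive string stored as character vs. factor
x_chr <- sample(c("fdsg", "jfkd", "jfht", "ejft"), 10000, replace = TRUE)
x_fac <- factor(x_chr)

object.size(x_chr)  # character vector: each element points to a string value
object.size(x_fac)  # factor: integer codes plus a small table of levels
```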

#### Raw Files

Let's first look at space taken when saving uncompressed files.
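
A minimal sketch of how the on-disk sizes might be collected, assuming files like those written in the earlier sketch are sitting in the working directory:

```r
# Compare on-disk sizes of the uncompressed files written earlier
files <- c("df.csv", "df.txt", "df.rds", "df.dta")
sizes <- file.size(files)  # size in bytes for each file
data.frame(file = files, MB = round(sizes / 1024^2, 2))
```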

Figure 1: File Size

Figure 1 shows the file size of each of the saved variables when 10,000 observations are generated. The dataframe object is the data.frame composed of all of the variables. From the height of the dataframe bars, we can see that rds is the overall winner. Looking at the individual variables, csv appears to consistently underperform for most variable types, with the exception of random strings.

Figure 2: File Size Log scaled

In Figure 2 we can see that rds consistently outperforms all of the other formats, with the one exception of the index variable, for which the txt encoding simply reads 1:10000. Apparently even serializing to bytes can't beat that.

Interestingly, there does not appear to be an effective size difference between repetitive strings encoded as factors, regardless of the length of the underlying strings (4, 16, or 64 characters). We can also see that the inability of csv to compress factor strings dramatically penalizes its efficiency relative to the other formats.

#### File Compression

But data is rarely shared in uncompressed formats. How does compression change things?
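
As a rough sketch, one way to check is to zip each saved file and compare the compressed sizes. Note that utils::zip() relies on a zip program being available on the system path, and the file names again follow the earlier sketch.

```r
# Zip each saved file and compare raw vs. compressed sizes
files <- c("df.csv", "df.txt", "df.rds", "df.dta")
for (f in files) {
  zip(zipfile = paste0(f, ".zip"), files = f)
}
data.frame(file   = files,
           raw_MB = round(file.size(files) / 1024^2, 2),
           zip_MB = round(file.size(paste0(files, ".zip")) / 1024^2, 2))
```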

Figure 3: Zipped File Sizes Logged

We can see from Figure 3 that if we zip our data after saving, every format ends up doing pretty much as well as rds. Comma delimited csv files are a bit of an exception, with factor variables still suffering under csv, though random strings perform slightly better under csv than under the other formats. Interestingly, rds files come out slightly larger than the other two file types. Overall though, it is pretty hard to see any significant difference in file size based on format after zipping.

So, should we stick with whatever format we prefer?

Not so fast. Sure, all of the files are similarly sized after zipping, which is useful for sharing. But having to keep large files on a hard drive is not ideal even if they can be compressed for distribution. There is finite space on any system, and some files run from hundreds of MB to hundreds of GB. Dealing with file formats and multiple file versions at that scale can easily drain the permanent storage capacity of most systems.

But an equally important concern is how long it takes to write and read different file formats.

In order to test reading speeds, I loaded each of the full dataframe files fifty times. I also tested how long it would take to unzip and then load each file.
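
A minimal sketch of such a timing loop, assuming the same file names as above and pairing each file with its matching reader (read.csv, dget, readRDS, and foreign::read.dta):

```r
# Average elapsed time (seconds) over fifty reads of each format
time_read <- function(f, reps = 50) {
  mean(replicate(reps, system.time(f())["elapsed"]))
}

read_times <- c(
  csv = time_read(function() read.csv("df.csv")),
  txt = time_read(function() dget("df.txt")),
  rds = time_read(function() readRDS("df.rds")),
  dta = time_read(function() foreign::read.dta("df.dta"))
)
read_times
```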

Figure 4: Reading and unzipping average speeds

From Figure 4, we can see that read speeds roughly correspond with the size of files. We can see that even a relatively small file (30 MB csv file) can take as long as 7 seconds to open. Working with large files saved in an inefficient format can be very frustrating.

In contrast, saving files in efficient formats can dramatically cut down on the time taken opening those files. Using the most efficient format (rds), files could be 100 times larger than those used in this simulation and still open in less than a minute.

#### Conclusions

Finding common file formats that any software can access is not easy. As a result many public data sets are provided in archaic formats which are poorly suited for end users.

This results in a wide pool of software suites having the ability to access these datasets. However, with inefficient file formats comes a higher demand on the hardware of end users. I am unlikely to be the only person struggling to open some of these large "public access" datasets.

Those maintaining these datasets will argue that sticking with the standard, inefficient format is the best of bad options. However, there is no reason they could not post datasets in rds format in addition to the outdated formats they currently exist in.

And no, we need not worry that saving data in one software language's format will bias access toward users of that language. Already many federal databases come with code supplements in Stata, SAS, or SPSS, and to access those supplements one is required to have paid access to that software.

Yet R is free and its data format is public domain. Any user could download R, open an rds or Rdata file, and then save that file in a format more suited to their purposes. None of these proprietary database formats can boast the same.
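
For example, a minimal sketch of what that round trip might look like; the file name here is hypothetical.

```r
# Anyone with R installed can open an rds file and re-export it
df <- readRDS("public_dataset.rds")                      # hypothetical file name
write.csv(df, "public_dataset.csv", row.names = FALSE)   # or any other format they prefer
```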