Sunday, May 5, 2013

Quandl Package - 5,000,000 free datasets at the tip of your fingers!

# Yes, you read that correctly, and no, Quandl (http://www.quandl.com/) did not pay me anything.

# Quandl is a new database management tool which seeks to become the place to find datasets.  They boast of having over 5x10^6 data sets available, though after examining them I have decided that they are not entirely what everybody might think of as data sets.  That is, each unique indicator is counted as an independent data set, which helps them appear to have a ginormous quantity of data sets.

# That said, they are not wrong in calling each indicator its own data set, since much of their data, like financial data or government data, is collected by disjoint teams.  The scope of their ambition is fantastic, yet it is doable, and frankly someone needed to do it.

# Currently, data seekers can turn to the Inter-University Consortium for Political and Social Research (ICPSR).  This great resource is composed mostly of cross-section and panel data sets, which are great for many kinds of analysis, but ICPSR restricts access to member universities.  Much of the data that Quandl is indexing would not show up in the ICPSR database.  In addition, Quandl is building an automated structure that will be self-updating.

# For an example of how Quandl is a good step ahead of the game, take a look at this search query:

http://www.quandl.com/search/lansing,%20michigan

# In this search, I looked up Lansing, Michigan, where I live, and got back data series starting a decade or more ago and running up to today, from sources such as the Federal Reserve and the US Energy Information Administration.

http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies?q=Lansing%2C+Michigan&permit%5B0%5D=AVAILABLE

# In contrast, when querying ICPSR, I found a few databases listed, but they were historical databases that generally reached back between 30 and 70 years.  That said, both sources could provide valuable information depending on what I am interested in modeling.

# Quandl is very clever for a number of reasons.  One of these reasons is that they have simultaneously released 8 interface packages for statistical environments such as R, Stata, and Excel.

# In order to demonstrate the use of Quandl, I will grab a few of the Federal Reserve data sets returned by the Lansing query.

install.packages("Quandl")
library(Quandl)

# Employment numbers (thousands of people) and per capita income for Lansing, Michigan
NonFarm = Quandl("FRED/LANS626NAN")
CivLaborForce = Quandl("FRED/LANS626LFN")
PerCapitaIncome = Quandl("FRED/LANS626PCPI")

# Now let's combine the data so that we can relate the values to one another.
Labor = merge(NonFarm, CivLaborForce, by="Date")
Combined = merge(Labor, PerCapitaIncome, by="Date")
colnames(Combined) = c("Date", "NonFarm", "CivLaborForce", "PerCapitaIncome")
# Notice that though our data had many more data points, the default option of merge only keeps dates that exist in both data sets.  In this case, it is per capita income that has the fewest data points.
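
# To make the merge behaviour described above concrete, here is a minimal sketch on two made-up toy data frames (the names a, b, inner and outer and all the values are assumptions for illustration, not Quandl data).  It compares the default merge with all=TRUE, which keeps every date and fills the gaps with NA.

```r
# Toy stand-ins for two series of different lengths (values are made up).
a <- data.frame(Date = as.Date(c("2010-01-01", "2011-01-01", "2012-01-01")),
                Value = c(1, 2, 3))
b <- data.frame(Date = as.Date(c("2011-01-01", "2012-01-01")),
                Value = c(10, 20))

# Default merge: only dates present in BOTH data frames survive.
inner <- merge(a, b, by = "Date")

# all = TRUE: every date is kept; the series missing that date gets NA.
outer <- merge(a, b, by = "Date", all = TRUE)

nrow(inner)  # number of dates shared by both series
nrow(outer)  # number of dates appearing in either series
print(outer)
```

# Whether to keep the extra rows depends on the analysis: lm() drops rows with NA by default, so for the regression in this post the two approaches should use the same observations.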

# Let's see if we can predict income as a function of employment:
summary(lm(PerCapitaIncome~NonFarm+CivLaborForce, data=Combined))

# Our naive prediction as a result of this is that as the civilian labor force increases, per capita income rises.  This is of course a naive example that completely ignores issues of causation and endogeneity, not to mention probable random walks and other challenging features of this kind of data.
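
# As a rough illustration of the random walk concern mentioned above, the sketch below uses simulated data only (not the Lansing series; the seed, sample size and variable names are arbitrary assumptions).  It regresses one independent random walk on another, first in levels and then in first differences, which is one common way of guarding against a spurious fit.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable
n <- 200     # arbitrary number of simulated periods

# Two independent random walks: by construction neither one drives the other.
x <- cumsum(rnorm(n))
y <- cumsum(rnorm(n))

# Regression in levels, which can show a strong fit purely by accident.
levels_fit <- lm(y ~ x)

# Regression in first differences, which removes the accumulated trend.
diff_fit <- lm(diff(y) ~ diff(x))

# Compare how much of the variation each model claims to explain.
summary(levels_fit)$r.squared
summary(diff_fit)$r.squared
```

# The same idea could be tried on the Combined data by wrapping each variable in diff(), assuming the rows are sorted by Date, though differencing alone does not address causation or endogeneity.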

# The overall takeaway, though, should be "cool", I think.  This data bank may not currently cover many of the issues that people looking for data care about, but it does make things easier and self-updating, which are great features.
