Tuesday, May 6, 2014

Survey of Code Written to Date

Using code from this great post I have been able to construct some figures illustrating how much code I have written for this blog since 2012 (note that install.packages("reshape") might be necessary).
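
For the curious, here is a minimal sketch of the kind of tabulation involved (my own reconstruction; the folder name "blog_code" is an assumption):

files <- list.files("blog_code", pattern = "\\.(do|r|jl)$",
                    ignore.case = TRUE, full.names = TRUE)
lines <- sapply(files, function(f) length(readLines(f, warn = FALSE)))
ext <- tolower(sub(".*\\.", "", files))
table(ext)               # number of files per language
tapply(lines, ext, sum)  # total lines of code per language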

The top left plot shows the number of files: over 200 written with a .do extension for Stata, over 100 with .r, and a small number in .jl for Julia.  The second graph shows file length by date.  One can see the general increase in R posts in both frequency and length.  The bottom left graph shows a density plot of files by file length.  The density peaks around 100 lines, with a long tail stretching out to some files of 400+ lines of code.  The final graph shows cumulative file length over time.

It is easy to see on this graph how coding and blogging come in fits and starts.  So far this graph seems to indicate that I have written more than 20 thousand lines of code in Stata and over 10 thousand in R.  Julia is just a speck on this graph.  I hope to post more in Julia in the future.

Monday, May 5, 2014

7 R Quirks That Will Drive You Nutty

Every language has its idiosyncrasies. Some “designer” type languages have fewer of them, thanks to the extreme thoughtfulness of their language engineers. I suspect Julia, for example, has many fewer quirks. However, despite its quirkiness, R has become an amazingly flexible resource for a diverse range of tasks, with thousands of packages and over 100,000 available commands (Rdocumentation.org) in subject matter as diverse as Pharmacokinetics, Medical Imaging, and Psychometrics, making it a quirky but effective standard in many research fields.

1. Vectors do not have defined “dimensions”

Strangely, the dim (“dimension”) function does not work on vectors, though it does work on matrices and higher-dimensional arrays.

dim(1:10)
## NULL

If you want to figure out how long a vector is you need to use the length function.

length(1:10)
## [1] 10

That would be fine, except that the length function also works on matrices, where it counts the total number of elements rather than returning a “length”, which is no longer well defined.

length(matrix(1, nrow = 10, ncol = 10))
## [1] 100

So: length with vectors and dim with matrices. Easy, right? Wrong. Which brings me to my next quirk.

2. Class Dropping

Matrices that are reduced to a single row are silently converted to vectors, and some matrix functions no longer work on them.

mymatrix <- matrix(rnorm(12), nrow = 3, ncol = 4)
mymatrix
##         [,1]    [,2]    [,3]    [,4]
## [1,]  1.3941 -0.7149 -1.7237 -1.6695
## [2,]  0.6882  1.4039 -2.2238 -0.3019
## [3,] -0.2032  1.3995 -0.3562 -0.3349
dim(mymatrix[1:2, ])
## [1] 2 4
dim(mymatrix[1, ])
## NULL

Fortunately there is a fix of sorts. Adding drop = F when subsetting a matrix will preserve its class as a matrix.

dim(mymatrix[1, , drop = F])
## [1] 1 4

You do not, however, need to worry about your matrix becoming a vector if you drop down to zero rows or columns.

dim(mymatrix[-1:-3, ])
## [1] 0 4
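
The drop = F pattern matters most inside functions, where you rarely know in advance whether a subset will contain a single row. A small defensive sketch (my own illustration, not from the original post):

getrows <- function(m, rows) m[rows, , drop = F]  # never drops the matrix class
class(getrows(mymatrix, 1))
## [1] "matrix"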

3. Negative subscripts

These are nothing to be bothered by, since they are entirely optional. However, if you run into them, they can certainly throw you for a loop. Negative subscripts subset a matrix or vector by removing the specified rows or columns.

For example:

myvect <- -4:5
myvect
##  [1] -4 -3 -2 -1  0  1  2  3  4  5

They can be a little tricky to work with. Say we wanted to remove every other number in our 10-element vector.

myvect[c(-1, -3, -5, -7, -9)]
## [1] -3 -1  1  3  5

Or more easily

myvect[-seq(1, 9, 2)]
## [1] -3 -1  1  3  5

4. NA Values Specifying Missing Data

These different special values can be quite challenging to work with. The biggest challenge for me is detecting and adjusting my code when one of these values pops up.

NA indicates a value that is “Not Available” or missing. Having an indicator for missing data is standard: Stata uses a “.”, while other programs use -9999 or some other unlikely number. The problem with NAs is not that they exist, since almost all data sets have some missing values at some point, but the frequency with which they interrupt normal function behavior. For instance, take a vector such as 1:20: if even one of its values is NA, most of the functions you apply to it will return NA.

a <- c(1:20, NA)
sum(a)
## [1] NA
max(a)
## [1] NA
min(a)
## [1] NA
cor(a, a)
## [1] NA

Once again this is not necessarily a problem, but it is annoying and can easily create unexpected problems. There are several solutions. One is to remove from your data any observations which have missing values. This can create its own issues, since NAs do not conform to logical operators (in contrast with Stata, SPSS, and all other statistical languages I know of): a comparison with NA yields NA rather than TRUE or FALSE. Thus

a2 <- a[a != NA]
a2
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Fails while

a3 <- a[!is.na(a)]
a3
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

does the job.
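
Base R's na.omit offers a shorthand for the same filtering, and it records which positions were dropped in an attribute:

na.omit(a)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
## attr(,"na.action")
## [1] 21
## attr(,"class")
## [1] "omit"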

Alternatively, many commonly used functions have special arguments expressly devoted to how to handle missing data.

sum(a, na.rm = T)
## [1] 210
max(a, na.rm = T)
## [1] 20
min(a, na.rm = T)
## [1] 1
cor(a, a, use = "pairwise.complete.obs")
## [1] 1

5. “Empty values”: NULL, integer(0), numeric(0), logical(0)

These four values are several of R's empty value indicators. I don't believe they exhaust the list, and I am sure each of them has its own unique properties which I cannot say I fully understand. The challenge usually lies in how to detect when they occur. As with NAs, they resist logical comparisons:

a == NULL
## logical(0)
logical(0)
## logical(0)
# Even:
(a == NULL) == logical(0)
## logical(0)

What we are forced to do instead is detect the vector length of these values.

length(NULL)
## [1] 0
length(integer(0))
## [1] 0
length(numeric(0))
## [1] 0
length(logical(0))
## [1] 0

This trick works for an empty matrix as well.

length(mymatrix[!1:3, !1:4])
## [1] 0

Though dim is not a bad choice.

dim(mymatrix[!1:3, !1:4])
## [1] 0 0
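
As an aside, for NULL specifically base R provides a dedicated test, though note that it does not treat the zero-length vectors above as NULL:

is.null(NULL)
## [1] TRUE
is.null(integer(0))
## [1] FALSE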

6. Attach Does not Do What You Want it to Do

Attach and detach are functions which ostensibly promise to let a user rapidly reference, use, and make changes to an active data set, in a similar fashion to Stata or SPSS.

However, relying on them is a mistake. IT IS NOT the solution to making R work more like Stata. If a data frame is “attached” in R, its columns become directly accessible by name.

mydata <- data.frame(a1 = 1:30, b1 = 31:60)
attach(mydata)
b1
##  [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## [24] 54 55 56 57 58 59 60

However, if you want to create a new variable in the data set, say

c1 <- a1 + b1

Then you are going to have a problem because, sure, you just created c1, but c1 is not part of mydata.

names(mydata)
## [1] "a1" "b1"

You may add c1 to mydata

mydata$c1 <- c1
names(mydata)
## [1] "a1" "b1" "c1"

But the “c1” in mydata is now a different object from the c1 in working memory.

mydata$c1 <- mydata$c1 * -1
head(mydata)
##   a1 b1  c1
## 1  1 31 -32
## 2  2 32 -34
## 3  3 33 -36
## 4  4 34 -38
## 5  5 35 -40
## 6  6 36 -42
head(c1)
## [1] 32 34 36 38 40 42

Which is not necessarily a problem since we can drop our working memory c1.

rm(c1)

But c1 is still not available even though mydata is still attached.

try(c1)
## Error in try(c1) : object 'c1' not found

Which means we “attached” our mydata object back when mydata did not include c1. To update the values we need to reattach it, but first we need to detach the old mydata:

detach(mydata)
attach(mydata)
head(c1)
## [1] -32 -34 -36 -38 -40 -42

Thus we realize that attach is not a workable solution for handling data. Instead we are forced to use R's subscripting to manipulate our data, such as

mydata$c1 <- mydata$a1 + mydata$b1

This is not unworkable, though it can be annoying. There are shortcuts in R for specifying a data frame or list to work from, but they face similar challenges to attach and ultimately create code which is unnecessarily complex.
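
Two such shortcuts are with and within, presumably the sort alluded to above; a minimal sketch (the new column name d1 is my own):

with(mydata, a1 + b1)                    # evaluate an expression inside mydata
mydata <- within(mydata, d1 <- a1 + b1)  # add a column to a copy, then reassign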

7. R Functional Help Is Grouped

For instance, ?gsub will return a single help file covering seven related functions (grep, grepl, sub, regexpr, gregexpr, and regexec, as well as gsub itself), which all share some commonalities with gsub, yet each do something somewhat different. I can understand why the documentation could have been organized this way originally, since I imagine it was rather terse at first (as some help files persist in being). As help files got fleshed out, it just never happened that functions were broken out of their clusters.

It is worth noting that I have never seen this kind of function clumping in the help files of contributed packages. For example, library(ggplot2); ?ggplot; ?geom_point all return help files specific to a single function apiece.


Friday, May 2, 2014

Why Blog?

The Blog Review Process
A series of events in my life has led me to reconsider the value of blogging.

The Back Story

Short story: I got fired.

Long story: Recently I was hired to write occasional blog posts for Quandl. They probably figured that, due to my large readership and specialization in econometrics, I could actually write something informative that would shed a good light on them.  I wrote this!  Given that I have never taken a class in time series econometrics and have barely touched it in my blog posts to date, this was an unwise move on my part.  But I figured I would post as a kind of exploratory exercise; what could it hurt?

The result: public defenestration and termination.

Quandl was right to fire me. I threw the results themselves together carelessly (not being used to working with data) and ended up spending the majority of my effort investigating the technical issues of managing Quandl data in R.  In addition, I did not spend enough time thinking through the quality issues of my post.  I consider my general posts to be of medium quality.  If instead I am posting for someone else, then I should really make my posts of the highest caliber possible.  I had not intended my post to be a serious econometric analysis but rather a demonstration of how easily Quandl data can be accessed and deployed.  Of course, I should have thought this through more.


Why Blog?
That said, this whole situation got me thinking about the value of blogging in contrast to the traditional academic publication establishment.  First off, I would like to say that despite it being in my own best interest, I find it difficult to be motivated to write academic papers for academic journals.  Why?

1. Because the average citation count of most papers is in the single digits, and I suspect that the actual readership for most articles might not be much larger than that. Yet even the most mundane of the posts on my blog tends to be read by hundreds of readers.  My blog itself, as of this week, has reached the half-million-hit mark!

2. Because I do not believe in the traditional peer review process.  a. The process typically takes a year or more for a paper to go from written to published.  b. The choice of "peer" reviewers is left entirely to the discretion of the editor, which creates massive variability in publication quality.  c. Most traditional journals have few mechanisms for researchers to point out research failures which were missed by the reviewers.  d. Journals are seen as canonical and often unassailable even though publications may very well be completely wrong.

3. Because I believe in the public review process in the form of comments.  On my blog I publish all comments which are written with regards to posts and only have an administrative review to limit robots from spamming the comment section.

Yet, despite or because of these reasons, posts go out which are just plain bad.  I know I am not the only one who sometimes produces junk.  This creates a situation in which blog posts are not respected as legitimate research and would likely never appear as an academic citation.  They could perhaps be compared with the prestige of an independent reporter.

There are things I love about this.  First off, I can stimulate discussion by writing my thoughts about different subjects.  Often, I know I am not an expert in the field I am writing on, yet my hope is that by writing something sufficiently clear, an expert will come in and correct whatever I got wrong through comments.  Thus even my bad blog posts, such as the one on Quandl, will become an asset to readers, even if it is through showing people what not to do.

Yet, I am unconvinced, as I would assume many readers are, that comments are sufficient to correct for such botched posts.

Thus the question arises: "How do we create an open exchange of discourse that, even though potentially starting out at low quality, might find its way to producing true gems?"

Enter: Recursive Blogging
This is a blog design concept inspired by Stack Exchange, in which users pose questions, other users answer those questions, and there is a reward system giving points as well as, critically, the ability to edit questions and answers.  As far as I am concerned, the questions and answers on SE are some of the highest quality that can be found anywhere.

In my Recursive Blogging web concept I suggest that, like on SE, blog posts could be editable, voted up or down, and open to comments and detailed responses from a larger community, in which the comments and responses might be just as valuable as the original blog post.

This ability to revise and update is the source of SE's amazing sticking power, and the reason why SE questions will continue to be referenced for years into the future.  In contrast, most blogs and blog posts in their current form will fade in relevance over time.  Recursive Blogging offers an alternative schema.  Like SE, I think Recursive Blogging should be built on reputation and user profiles, and to accomplish this it would need a framework that either supported multiple blogs or was shared across blogs.

This is my answer to the paradox.  Use recursive blogging as a mechanism for generating high-quality posts which have long-term value.  I do not have the skills or time right now to design such a system, but I think there is a good chance that it might take off if someone took the initiative.

Thursday, May 1, 2014

How to Code Something ‘New’ in R

Programming New Things
I have now been programming in R for more than half a decade and can fondly look back on the days when I went on spring break with Venables' little blue book (now out of print).  Back then I was entertained and wowed by R's object-oriented environment and its intuitive language, the way it seemed you could do so much with short, precise commands.  I remember being particularly impressed with how R could seamlessly manipulate matrices and vectors, and especially how easy it was to write new functions.

But all was not rainbows and sunshine.  I was less impressed by how challenging it was to browse data (for example, head(mydata)), to estimate common statistical models given the overly complex commands (for example, summary(glm(y ~ x1 + x2 + x3 - 1, family = binomial(link = "probit"), data = mydata))), and to modify elements within datasets.  However, over time I grew more comfortable with R, and Google seems to have also learned that R is a word.

In turn, on my blog, over time I began publishing more extensively in R and now have over 100 posts in the language.  Each of these posts is unique, and the majority required that I learn new techniques to accomplish the tasks I had set for myself.  As a result, I have undertaken an extensive and prolonged study of how to teach oneself new things in R.  In this post I will describe my process for learning new R techniques.

0. Think through the task
Imagine in your mind each step you want your code to accomplish.  If you can, write out a diagram or step-by-step process that you think will be sufficient to accomplish the task at hand.  Having a well-thought-out plan could save you many an hour of frustration.  Work from the diagram.  If you find that you are getting frustrated or sidetracked during your coding, go back to your diagram and investigate alternative paths to accomplish your goal.

1. Consult an R guru
If you are so fortunate as to have an R guru already in your life, do not neglect to use this incredible resource.  It is no act of humiliation, nor a demonstration of ignorance, to ask someone else for help with a problem in R, as R is used for so many things that it seems nearly impossible to me for anybody to have a monopoly on knowledge of it.  That said, do not use up your good graces with your R guru too quickly.  R gurus tend to be extremely well paid, and their time therefore carries a high opportunity cost.


2. Figure out what your question is
Oftentimes we have a question but have difficulty finding the right words to express it.  Usually the simplest questions are the most difficult to ask.  If you have an R guru available, he or she can probably tell you what your question is.  Absent such a guru, however, do your best to figure out what it is you want to ask.  Search online for words which are similar to how you would describe what you wish to accomplish.  If you want to join two datasets together by a common variable, look for “join”… no, “group”… no, “merge”… Bingo!  In this amazing age of information, figuring out how to express your question is 90% of solving it.
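
Once “merge” turns out to be the magic word, a toy illustration (my own, not part of the original post):

df1 <- data.frame(id = 1:3, x = c("a", "b", "c"))
df2 <- data.frame(id = 2:4, y = c(10, 20, 30))
merge(df1, df2, by = "id")  # keeps only the ids present in both
##   id x  y
## 1  2 b 10
## 2  3 c 20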

3. Search Google
Yup, almost all of my questions are directly answered by sticking them into a Google search.  If you are asking a question, you are probably not the first one to ask it.

4. Search R’s Interactive Documentation
If you know a command in R which is close to the command that you want, it is often fruitful to look that command up with help(command) or ?command, then scroll to the bottom and search through the list of related commands.
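
When you only half-remember a name, R's built-in search tools can help as well:

apropos("med")         # loaded objects whose names contain "med"
help.search("median")  # search installed documentation; ??median is equivalent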


5. Search and Possibly Post a Question on StackOverflow.com
The only reason I did not list searching StackOverflow earlier is that, if you ask your question sufficiently clearly, searching Google will probably bring you to a StackOverflow page which answers it.  However, if your other attempts to find a solution have failed, you should strongly consider posting your question on StackOverflow.

StackOverflow is a magical place in which the community of R users is so responsive that the average time I have experienced between posting a question and getting an answer is less than 10 minutes.  Not only are StackOverflow users extremely responsive, they are also generally very nice and helpful.  I believe this is because the reputation point system on the site rewards users for providing answers to your questions.

That said, take care that your question is not already answered on the site whenever possible.  Also, be sure to be as clear as possible when asking questions.  Giving clear examples of what you would like to accomplish is always encouraged.  Please read over the Asking Guidelines before posting.

There is an additional reason to be an active member on StackOverflow.  Though I have no data to back up this assertion, I believe having a good reputation on the site is a good indicator of programming ability and would make a good résumé item (though I do not have a particularly fantastic reputation, as I have only recently become active).

6. Contact Package Specific Help
There are several packages in R, such as shiny or ggplot2, which have active package-specific user groups.  These groups can be extremely helpful and timely in responding to questions.

7. Submit to an R Email List
This definitely ranks as the lowest option, as I have had mixed success with the R email lists.  Some questions which seem worthy are readily answered, while others seem to be ruthlessly ridiculed.  Unfortunately I cannot tell ex-ante how any question will be received, as it is entirely within the discretion of whoever seems inclined to respond to the request.  In addition, becoming a subscribing member seems to result in my mailbox quickly being overwhelmed with emails from the list.  That being said, some very generous folks on the R email lists have been so good as to answer some very unusual questions I have posed in the past.