Monday, May 5, 2014

7 R Quirks That Will Drive You Nutty

Every language has its idiosyncrasies. Some "designer"-type languages have fewer of them, thanks to the extreme thoughtfulness of their language engineers; I suspect Julia, for example, has many fewer quirks. Despite its quirkiness, however, R has become an amazingly flexible resource for a diverse range of tasks, with thousands of packages and over 100,000 available commands (Rdocumentation.org) in subject matter as diverse as Pharmacokinetics, Medical Imaging, and Psychometrics, making it a quirky but effective standard in many research fields.

1. Vectors do not have defined "dimensions"

Strangely, the dim ("dimension") function does not work on vectors, though it does work on matrices and higher-dimensional arrays.

dim(1:10)
## NULL

If you want to figure out how long a vector is, you need to use the length function.

length(1:10)
## [1] 10

Which would be fine, except that the length function also works on matrices, where it counts the total number of elements rather than measuring a "length", which is no longer well defined for a matrix.

length(matrix(1, nrow = 10, ncol = 10))
## [1] 100
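
Incidentally, base R's NROW and NCOL (note the capitals) treat a vector as a one-column matrix, so they give sensible answers for vectors and matrices alike:

NROW(1:10)
## [1] 10
NCOL(1:10)
## [1] 1
NROW(matrix(1, nrow = 10, ncol = 10))
## [1] 10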

So: length with vectors and dim with matrices. Easy, right? Wrong. Which brings me to my next quirk.

2. Class Dropping

Matrices that are reduced to a single row (or column) are silently demoted to vectors, and some matrix functions no longer work on them.

mymatrix <- matrix(rnorm(12), nrow = 3, ncol = 4)
mymatrix
##         [,1]    [,2]    [,3]    [,4]
## [1,]  1.3941 -0.7149 -1.7237 -1.6695
## [2,]  0.6882  1.4039 -2.2238 -0.3019
## [3,] -0.2032  1.3995 -0.3562 -0.3349
dim(mymatrix[1:2, ])
## [1] 2 4
dim(mymatrix[1, ])
## NULL
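
Here is the sort of breakage that causes; rowSums, for one, insists on getting an actual array:

rowSums(mymatrix[1, ])
## Error: 'x' must be an array of at least two dimensions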

Fortunately there is a fix for this: specifying drop = FALSE when taking a subset of a matrix will preserve its class as a matrix.

dim(mymatrix[1, , drop = FALSE])
## [1] 1 4

You do not, however, need to worry about your matrix becoming a vector if you subset down to zero rows or columns.

dim(mymatrix[-1:-3, ])
## [1] 0 4

3. Negative subscripts

These are nothing to be bothered by, since they are entirely optional, but if you run into them they can certainly throw you for a loop. Negative subscripts subset a matrix or vector by removing the specified rows, columns, or elements.

For example:

myvect <- -4:5
myvect
##  [1] -4 -3 -2 -1  0  1  2  3  4  5

They can be a little tricky to work with. Say we want to remove every other element of our 10-element vector.

myvect[c(-1, -3, -5, -7, -9)]
## [1] -3 -1  1  3  5

Or, more concisely:

myvect[-seq(1, 9, 2)]
## [1] -3 -1  1  3  5
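
One more wrinkle: positive and negative subscripts cannot be mixed in a single subsetting call.

myvect[c(-1, 2)]
## Error: only 0's may be mixed with negative subscripts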

4. NA Values Specifying Missing Data

These special values can be quite challenging to work with. The biggest challenge for me is detecting when one of them pops up and adjusting my code accordingly.

NA indicates a value that is "Not Available", or missing. Having an indicator for missingness is standard: Stata uses a "." while other programs use -9999 or some other unlikely number. The problem with NAs is not that they exist, since almost every data set has missing values at some point, but the frequency with which they interrupt normal function behavior. For instance, if even one value in a long vector such as 1:100 is NA, most of the functions you attempt on it will return NA.

a <- c(1:20, NA)
sum(a)
## [1] NA
max(a)
## [1] NA
min(a)
## [1] NA
cor(a, a)
## [1] NA

Once again this is not necessarily a problem, but it is annoying and can easily create unexpected issues. There are several solutions. One is to remove any observations from your data which have missing values. This can create its own issues, since NAs do not conform to logical operators (in contrast with Stata, SPSS, and every other statistical language I know of). Thus

a2 <- a[a != NA]
a2
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

fails, while

a3 <- a[!is.na(a)]
a3
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

does the job.
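
The same idea scales up to data frames: complete.cases flags the rows with no missing values, and na.omit drops the incomplete rows outright. A quick sketch:

mydf <- data.frame(x = c(1, 2, NA), y = c(NA, 5, 6))
mydf[complete.cases(mydf), ]
##   x y
## 2 2 5
na.omit(mydf)
##   x y
## 2 2 5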

Alternatively, many commonly used functions have special arguments expressly devoted to how to handle missing data.

sum(a, na.rm = T)
## [1] 210
max(a, na.rm = T)
## [1] 20
min(a, na.rm = T)
## [1] 1
cor(a, a, use = "pairwise.complete.obs")
## [1] 1

5. "Empty values": NULL, integer(0), numeric(0), logical(0)

These four values are several of R's empty-value indicators. I don't believe they exhaust the list, and I am sure each of them has unique properties which I cannot claim to fully understand. The challenge usually lies in detecting when they occur. As with NAs, they resist logical comparisons:

a == NULL
## logical(0)
logical(0)
## logical(0)
# Even:
(a == NULL) == logical(0)
## logical(0)
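
Base R does provide is.null, but it only catches NULL itself, not the zero-length vectors:

is.null(NULL)
## [1] TRUE
is.null(integer(0))
## [1] FALSE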

What we are forced to do instead is check the length of these values.

length(NULL)
## [1] 0
length(integer(0))
## [1] 0
length(numeric(0))
## [1] 0
length(logical(0))
## [1] 0
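
If you test for emptiness a lot, it may be worth wrapping this in a small helper. The name is_empty is my own invention, not a base R function:

# TRUE for NULL and for any zero-length vector
is_empty <- function(x) length(x) == 0
is_empty(NULL)
## [1] TRUE
is_empty(integer(0))
## [1] TRUE
is_empty(1:3)
## [1] FALSE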

This trick works for an empty matrix as well.

length(mymatrix[!1:3, !1:4])
## [1] 0

Though dim is not a bad choice.

dim(mymatrix[!1:3, !1:4])
## [1] 0 0

6. Attach Does Not Do What You Want It to Do

Attach and detach are functions which ostensibly promise to let the user rapidly reference, use, and modify an active data set, much as in Stata or SPSS.

However, this is a mistake. Attach IS NOT the solution to making R work more like Stata. When a data frame is "attached" in R, its columns become accessible directly by name.

mydata <- data.frame(a1 = 1:30, b1 = 31:60)
attach(mydata)
b1
##  [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## [24] 54 55 56 57 58 59 60

However, if you want to create a new variable in the data set, say

c1 <- a1 + b1

then you are going to have a problem, because sure, you just created c1, but c1 is not part of mydata.

names(mydata)
## [1] "a1" "b1"

You may add c1 to mydata:

mydata$c1 <- c1
names(mydata)
## [1] "a1" "b1" "c1"

But the c1 in mydata is now a different object from the c1 in working memory.

mydata$c1 <- mydata$c1 * -1
head(mydata)
##   a1 b1  c1
## 1  1 31 -32
## 2  2 32 -34
## 3  3 33 -36
## 4  4 34 -38
## 5  5 35 -40
## 6  6 36 -42
head(c1)
## [1] 32 34 36 38 40 42

Which is not necessarily a problem, since we can drop the working-memory c1.

rm(c1)

But c1 is still not available even though mydata is still attached.

try(c1)
## Error in try(c1) : object 'c1' not found

Which means we "attached" our mydata object back when mydata did not include c1. We need to reattach it to pick up the new column, but first we need to detach the old mydata:

detach(mydata)
attach(mydata)
head(c1)
## [1] -32 -34 -36 -38 -40 -42

Thus we realize that attach is not a workable solution for handling data. Instead we are forced to use R's more verbose subscripting to manipulate our data, such as

mydata$c1 <- mydata$a1 + mydata$b1

This is not unworkable, though it can be annoying. There are shortcuts in R for specifying a data frame or list to work from, but they face challenges similar to those of attach and can produce code which is unnecessarily complex.
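
For what it is worth, base R's with, within, and transform cover most of what people reach for attach to do (a commenter below makes the same point about within). A minimal sketch, using a fresh copy of the data:

mydata2 <- data.frame(a1 = 1:30, b1 = 31:60)
# read from the data frame without attaching it
with(mydata2, mean(a1 + b1))
## [1] 61
# create a new column inside the data frame
mydata2 <- within(mydata2, c1 <- a1 + b1)
# or equivalently: mydata2 <- transform(mydata2, c1 = a1 + b1)
head(mydata2, 3)
##   a1 b1 c1
## 1  1 31 32
## 2  2 32 33
## 3  3 33 34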

7. R Function Help Is Grouped

For instance, ?gsub returns a single help file covering seven related functions: grep, grepl, sub, regexpr, gregexpr, and regexec, as well as gsub itself. They all share some commonalities, yet each does something somewhat different. I can understand why the documentation might originally have been organized this way, since I imagine it was rather terse at first (as some help files persist in being). As help files got fleshed out, the functions just never got broken out of their clusters.

It is worth noting that I have never seen this kind of function clumping in the help files of contributed R packages. For example, library(ggplot2); ?ggplot; ?geom_point all return help files specific to a single function apiece.


9 comments:

  1. What gives me a headache from time to time is the inconsistent behaviour of some commands, like sample:

    When you sample from a vector of varying length, you get completely different behaviour when the vector happens to have length one:
    sample(9:10,10,replace=TRUE) # only values 9 and 10 are taken
    sample(10,10,replace=TRUE) # values from 1 to 10 are taken

    For a remedy see e.g. here: http://stackoverflow.com/questions/13990125/sampling-in-r-from-vector-of-varying-length
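
    The examples in ?sample suggest a wrapper along these lines to avoid the length-one surprise:

    # sample from the elements of x, never from 1:x
    resample <- function(x, ...) x[sample.int(length(x), ...)]
    resample(9:10, 10, replace = TRUE)  # only values 9 and 10 are taken
    resample(10, 10, replace = TRUE)    # only the value 10 is taken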

  2. I think (but am not 100% sure) the last one has to do with how the functions are organized with their documentation. It probably wouldn't be that hard to change: just split the functions into separate files, each with its own documentation.

    Replies
    1. This is just the distinction between methods and functions: methods are grouped in the documentation, functions are not.

  3. Hi Francis,

    I enjoyed reading your list of pet peeves. Here's mine:
    http://r4stats.com/2012/06/13/why-r-is-hard-to-learn/
    and I've got quite a few more on a list that I need to add to that.

    Cheers,
    Bob

  4. I have never had a problem with any of these, but then again I don't come from a Stata background.

  5. "sum(a, na.rm = T)"

    using T instead of TRUE (or F instead of FALSE) is something that may drive someone insane too, since R happily allows you to reassign the values of T and F.

    Like:
    T <- 0
    ....
    mean(c(1,2,3,4,5,NA),na.rm=T)

    And while it would be caught easily here, it may not be in other instances.

  6. I actually have noticed function clumping but only in one package: load dplyr and then look up documentation on one of the main functions—filter, summarise, mutate, arrange, and select—and you will notice some function clumping!

  7. There are a lot more quirks in 'The R Inferno' (and a bit of explanation of some of these):
    http://www.burns-stat.com/documents/books/the-r-inferno/

  8. Your idea of attach() functioning as "make anything not explicitly indexed into or out of a dataset affect this set" would have mostly unwanted side effects: everything you did not want to affect the "blessed" dataset would need to be wrapped in the equivalent of a "not within{}" statement.

    Given there are often dozens of data sets loaded, it's safer and easier to be explicit.

    When modifying a dataset, within() is your friend. e.g.:

    x <- data.frame(A = 1:5)
    x <- within(x, {
      B <- A + 3
    })

    x
    #   A B
    # 1 1 4
    # 2 2 5
    # 3 3 6
    # 4 4 7
    # 5 5 8
