1. Vectors Do Not Have Defined "Dimensions"
Strangely, the dim ("dimension") function does not work on vectors, though it does work on matrices and higher-dimensional arrays.
dim(1:10)
## NULL
If you want to figure out how long a vector is, you need to use the length function.
length(1:10)
## [1] 10
This would be fine, except that the length function also works on matrices, where it counts the total number of elements rather than returning a "length", which is no longer well defined.
length(matrix(1, nrow = 10, ncol = 10))
## [1] 100
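One workaround, if you want a single idiom that handles both cases: the NROW and NCOL functions (note the capitals) treat a vector as a one-column matrix, so they return sensible counts for vectors and matrices alike.
NROW(1:10)
## [1] 10
NCOL(1:10)
## [1] 1
NROW(matrix(1, nrow = 10, ncol = 10))
## [1] 10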
So it is length with vectors and dim with matrices. Easy, right? Wrong. Which brings me to my next quirk.
2. Class Dropping
Matrices that are reduced to a single row (or column) become vectors, and some matrix functions no longer work on them.
mymatrix <- matrix(rnorm(12), nrow = 3, ncol = 4)
mymatrix
## [,1] [,2] [,3] [,4]
## [1,] 1.3941 -0.7149 -1.7237 -1.6695
## [2,] 0.6882 1.4039 -2.2238 -0.3019
## [3,] -0.2032 1.3995 -0.3562 -0.3349
dim(mymatrix[1:2, ])
## [1] 2 4
dim(mymatrix[1, ])
## NULL
Fortunately there is a fix for this: specifying drop = F when subsetting a matrix will preserve its class as a matrix.
dim(mymatrix[1, , drop = F])
## [1] 1 4
You do not, however, need to worry about your matrix becoming a vector if you subset down to zero rows or columns.
dim(mymatrix[-1:-3, ])
## [1] 0 4
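The drop = F issue matters most inside functions, where a subset that is usually several rows can unexpectedly arrive as a single row. A minimal sketch (first_rows is a made-up helper for illustration); without drop = FALSE it would silently return a vector when k is 1.
first_rows <- function(m, k) m[seq_len(k), , drop = FALSE]
dim(first_rows(mymatrix, 2))
## [1] 2 4
dim(first_rows(mymatrix, 1))
## [1] 1 4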
3. Negative subscripts
These are nothing to be bothered by, since they are entirely optional, but if you run into them they can certainly throw you for a loop. Negative subscripts subset a matrix or vector by removing the specified rows or columns.
For example:
myvect <- -4:5
myvect
## [1] -4 -3 -2 -1 0 1 2 3 4 5
They can be a little tricky to work with. Say we wanted to remove every other number in our ten-element vector.
myvect[c(-1, -3, -5, -7, -9)]
## [1] -3 -1 1 3 5
Or more easily
myvect[-seq(1, 9, 2)]
## [1] -3 -1 1 3 5
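Negative subscripts also work on matrices, dropping the specified rows or columns, though note that you cannot mix positive and negative subscripts in a single index. For instance, dropping the first row and second column of mymatrix from above:
dim(mymatrix[-1, -2])
## [1] 2 3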
4. NA Values Specifying Missing Data
R's special values can be quite challenging to work with. The biggest challenge for me is detecting one of these values when it pops up and adjusting my code accordingly.
NA indicates a value that is "Not Available" or missing. Having an indicator for missing data is standard: Stata uses a "." while other programs use -9999 or some other unlikely number. The problem with NAs is not that they exist, since almost all data sets have missing values at some point, but the frequency with which they interrupt normal function behavior. For instance, if you have a long vector such as 1:100 and one of the values is NA, then most of the functions you attempt on it will fail.
a <- c(1:20, NA)
sum(a)
## [1] NA
max(a)
## [1] NA
min(a)
## [1] NA
cor(a, a)
## [1] NA
Once again this is not necessarily a problem, but it is annoying and can easily create unexpected issues. There are several solutions. One is to remove observations with missing values from your data. This can create its own issues, since NAs do not respond to logical operators (in contrast with Stata, SPSS, and all other statistical languages I know of). Thus
a2 <- a[a != NA]
a2
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Fails while
a3 <- a[!is.na(a)]
a3
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
does the job.
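The reason the first attempt fails is that any comparison involving NA evaluates to NA rather than TRUE or FALSE, and subscripting with NA returns NA instead of dropping the element:
NA == NA
## [1] NA
c(1, 2, NA) != NA
## [1] NA NA NA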
Alternatively, many commonly used functions have special arguments expressly devoted to how to handle missing data.
sum(a, na.rm = T)
## [1] 210
max(a, na.rm = T)
## [1] 20
min(a, na.rm = T)
## [1] 1
cor(a, a, use = "pairwise.complete.obs")
## [1] 1
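A few other tools are worth knowing when hunting for NAs (using the same vector a as above): is.na marks the missing entries, so summing it counts them, and na.omit returns a copy with the NAs dropped.
sum(is.na(a))
## [1] 1
mean(a, na.rm = TRUE)
## [1] 10.5
head(na.omit(a))
## [1] 1 2 3 4 5 6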
5. “Empty values”: NULL, integer(0), numeric(0), logical(0)
These are four of R's empty-value indicators. I don't believe they exhaust the list, and I am sure each has its own unique properties which I cannot say I fully understand. The challenge usually lies in detecting when they occur. As with NAs, they resist logical comparisons:
a == NULL
## logical(0)
logical(0)
## logical(0)
# Even:
(a == NULL) == logical(0)
## logical(0)
What we are forced to do instead is check the length of these values.
length(NULL)
## [1] 0
length(integer(0))
## [1] 0
length(numeric(0))
## [1] 0
length(logical(0))
## [1] 0
This trick works for an empty matrix as well.
length(mymatrix[!1:3, !1:4])
## [1] 0
Though dim is not a bad choice.
dim(mymatrix[!1:3, !1:4])
## [1] 0 0
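When you specifically care about NULL, as opposed to a zero-length vector, is.null is the more precise test, while a length-zero check catches both:
is.null(NULL)
## [1] TRUE
is.null(integer(0))
## [1] FALSE
x <- integer(0)
length(x) == 0
## [1] TRUE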
6. Attach Does Not Do What You Want It to Do
Attach and detach are functions which ostensibly promise to let you rapidly reference, use, and modify an active data set, much as you would in Stata or SPSS.
However, relying on them is a mistake. IT IS NOT the solution for making R work more like Stata. If a data frame is "attached" in R, its columns become accessible directly by name.
mydata <- data.frame(a1 = 1:30, b1 = 31:60)
attach(mydata)
b1
## [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## [24] 54 55 56 57 58 59 60
However, if you want to create a new variable in the data set, say
c1 <- a1 + b1
Then you are going to have a problem, because sure, you just created c1, but c1 is not part of mydata.
names(mydata)
## [1] "a1" "b1"
You may add c1 to mydata:
mydata$c1 <- c1
names(mydata)
## [1] "a1" "b1" "c1"
But the c1 in mydata is a different object from the c1 in working memory.
mydata$c1 <- mydata$c1 * -1
head(mydata)
## a1 b1 c1
## 1 1 31 -32
## 2 2 32 -34
## 3 3 33 -36
## 4 4 34 -38
## 5 5 35 -40
## 6 6 36 -42
head(c1)
## [1] 32 34 36 38 40 42
This is not necessarily a problem, since we can drop the c1 in working memory.
rm(c1)
But c1 is still not available, even though mydata is still attached.
try(c1)
This means we have "attached" our mydata object, but that was back when mydata did not include c1. We need to reattach it to update the values. But first we need to detach our old mydata:
detach(mydata)
attach(mydata)
head(c1)
## [1] -32 -34 -36 -38 -40 -42
Thus we realize that attach is not a workable solution for handling data. Instead we are forced to use R's elaborate subscripting to manipulate our data, such as
mydata$c1 <- mydata$a1 + mydata$b1
This is not unworkable, though it can be annoying. There are shortcuts in R for specifying a data frame or list to work from, but they end up facing challenges similar to those of attach and ultimately create code which is unnecessarily complex.
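For completeness, two of the shortcuts alluded to above are with() and within() from base R. A minimal sketch, starting from a fresh copy of mydata, so you can judge the trade-off yourself: with() evaluates an expression using the data frame's columns, while within() returns a modified copy of the data frame.
mydata <- data.frame(a1 = 1:30, b1 = 31:60)
with(mydata, head(a1 + b1))
## [1] 32 34 36 38 40 42
mydata <- within(mydata, c1 <- a1 + b1)
names(mydata)
## [1] "a1" "b1" "c1"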
7. R Function Help Is Grouped
For instance, ?gsub returns a single help page covering seven related functions: grep, grepl, sub, regexpr, gregexpr, and regexec, as well as gsub itself. These all share some commonalities with gsub, yet each does something somewhat different. I can understand why the documentation might have originally been organized this way, since I imagine it was rather terse at first (as some help files persist in being). As the help files got fleshed out, the functions simply never got broken out of their clusters.
It is worth noting that I have never seen this kind of function clumping in the help files of packages in R. For example, library(ggplot2); ?ggplot; ?geom_point all return help files specific to a single function apiece.
Comments

What gives me a headache from time to time is the inconsistent behaviour of some commands, e.g. sample: when you have a vector of varying length, you get completely different behaviour when the vector is of length one:
sample(9:10,10,replace=TRUE) # only values 9 and 10 are taken
sample(10,10,replace=TRUE) # values from 1 to 10 are taken
For a remedy see e.g. here: http://stackoverflow.com/questions/13990125/sampling-in-r-from-vector-of-varying-length
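The usual remedy, a variant of which appears in the examples of ?sample, is to index into the vector yourself so that a length-one x is never treated as an upper bound:
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(9:10, 10, replace = TRUE) # always draws from 9 and 10
resample(10, 10, replace = TRUE) # always returns 10, never 1:10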
I think (but am not 100% sure) the last one has to do with how the functions are organized with their documentation. It probably wouldn't be that hard to change: just split the functions into separate files, each with its own documentation.
This is just the distinction between methods and functions: methods are grouped in the documentation, functions are not.
Hi Francis,
I enjoyed reading your list of pet peeves. Here's mine:
http://r4stats.com/2012/06/13/why-r-is-hard-to-learn/
and I've got quite a few more on a list that I need to add to that.
Cheers,
Bob
I have never had a problem with any of these, but then again I don't come from a Stata background.
ReplyDelete"sum(a, na.rm = T)"
ReplyDeleteusing T instead of TRUE (or F instead of FALSE) is something that potentially may drive someone insane too, since R happily allows you to reassign the values of T and F.
Like:
T <- 0
....
mean(c(1,2,3,4,5,NA),na.rm=T)
And while it would be caught easily here, it may not in other instances.
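The contrast is that TRUE and FALSE are reserved words and cannot be reassigned, which is why spelling them out is the safer habit; continuing the sketch above:
T <- 0
mean(c(1, 2, 3, 4, 5, NA), na.rm = T) # na.rm is 0, treated as FALSE
## [1] NA
mean(c(1, 2, 3, 4, 5, NA), na.rm = TRUE)
## [1] 3
# TRUE <- 0 would fail with an error, since TRUE cannot be overwritten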
I actually have noticed function clumping, but only in one package: load dplyr and then look up the documentation on one of the main functions (filter, summarise, mutate, arrange, and select) and you will notice it!
ReplyDeleteThere are a lot more quirks in 'The R Inferno' (and a bit of explanation of some of these):
ReplyDeletehttp://www.burns-stat.com/documents/books/the-r-inferno/
Your idea of attach() functioning as "make anything not explicitly indexed into or out of a dataset affect this set" would have mostly unwanted side effects, requiring everything you don't want to affect the "blessed" dataset to be wrapped in the equivalent of a "not within{}" statement.
Given there are often dozens of loaded data sets, it's safer and easier to be explicit.
When modifying a dataset, within() is your friend. e.g.:
x = data.frame(A=1:5)
x = within(x, {
B = A+3
})
x
# A B
# 1 1 4
# 2 2 5
# 3 3 6
# 4 4 7
# 5 5 8