Friday, August 30, 2013

Making regex examples work for you!

Regular expressions (regex) are one of the most frequently used tools for string matching, and R implements them.  However, users are often frustrated that examples taken verbatim from sources such as Stack Overflow do not seem to work.  In my own experience, the largest issue is knowing which characters need to be escaped in R.

For example:

Listing all files whose names match a simple pattern.

Looking at "/^.*icon.*\.png$/i" from

I was able to get "^.*icon.*.png$" to work in R, though I lost the case insensitivity.  I think including the "^." ensures that only files in the current directory, not subdirectories, are matched, but I am not sure.

So, the following code returns the names of the files in the folder Clipart that match the pattern [anything]icon[anything].png:

list.files("C:/Clipart/", pattern="^.*icon.*.png$")
[1] "manicon.png"     "handicon.png"     "bookicon.png"

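Looking at the original entry, we can see that the problem was not the regex itself but the way it was written.  The surrounding slashes and the trailing "i" are delimiter and flag syntax from other languages, not part of the pattern, and the "\." escape has to be written as "\\." inside an R string.  The "^" does not need to be escaped in R.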
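To make this concrete, here is a minimal sketch of carrying a pattern written as /.../i over to R.  The file names are made up for illustration and are not the contents of the Clipart folder, and grepl is used so the example runs without any files on disk.

```r
# Hypothetical file names, made up for illustration only
files <- c("manicon.png", "ManIcon.PNG", "iconXpng", "notes.txt")

# "/^.*icon.*\.png$/i" uses slash delimiters and a trailing i flag.
# Those belong to languages such as Perl or JavaScript, not to the regex.
# In R: drop the delimiters, double the backslash, and pass the
# case-insensitivity flag as an argument instead.
pattern <- "^.*icon.*\\.png$"

matches_sensitive   <- grepl(pattern, files)
matches_insensitive <- grepl(pattern, files, ignore.case = TRUE)

print(files[matches_sensitive])    # "manicon.png"
print(files[matches_insensitive])  # "manicon.png" "ManIcon.PNG"
```

Note that "iconXpng" is rejected in both cases because the escaped period only matches a literal ".".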

Before looking at another example, let's modify the previous command slightly to show how we can make it match differently.

list.files("C:/Clipart/", pattern="^.*icon*.*.png$")
[1] "manicon.png"     "handicon.png"     "bookicon.png"    "iconnew.png"    
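What changed is the "*" after "icon".  A quantifier applies only to the character immediately before it, so "icon*" means "ico" followed by zero or more "n".  A small sketch of the difference, again with hypothetical file names rather than the real folder contents:

```r
# Hypothetical file names, made up for illustration only
files <- c("manicon.png", "ico.png", "image.png")

first_pattern  <- "^.*icon.*.png$"   # requires the full text "icon"
second_pattern <- "^.*icon*.*.png$"  # requires only "ico"; the "n" is optional

first_hits  <- grepl(first_pattern, files)
second_hits <- grepl(second_pattern, files)

print(files[first_hits])   # "manicon.png"
print(files[second_hits])  # "manicon.png" "ico.png"
```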

There are a lot of resources available for regex since it is really its own text matching language supported by many different programming languages.  A good introductory guide can be found:



  1. For insensitivity to case, either use the flag in list.files (i.e., use "ignore.case=TRUE"), or include the (?i) flag inside the regular expression. The ^. is just saying "match the start of the name" of the file. It's the lack of "recursive=TRUE" that's causing the restriction to the given directory.

  2. The "^" means that the regex must match starting at the beginning of the string you are testing. In the original example, "\.png$" means that you must explicitly match ".png" at the end of the string you are matching: the backslash indicates that the period is to be interpreted as a literal period, and not as a placeholder for any single character.

    In point of fact, the "^.*" bit is redundant and unnecessary. It just says that any characters (or none) may appear before the first real pattern, "icon", is matched. You should get the same results by leaving the "^.*" off, that is, with

    pattern = "icon.*.png$"

    although you really want

    pattern = "icon.*\\.png$"

    Here the double backslash before the period means to match a period explicitly. The doubling is needed because R itself treats the backslash as an escape character inside a string literal, so "\\." in R code is what hands the usual single-backslash "\." to the regex engine, as in your initial example. That way you get only matches that end with ".png". Without it, you'd also match (e.g.) "iconXpng", which you wouldn't want, though this is unlikely to cause you any problems in this context.
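The points raised in these comments can be checked with a short sketch.  It builds a throwaway folder under tempdir() instead of touching C:/Clipart/, and every file and folder name in it is hypothetical.

```r
# Build a throwaway folder; all names here are hypothetical.
root <- file.path(tempdir(), "clipart_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "sub"), recursive = TRUE)
file.create(file.path(root, c("manicon.png", "BookIcon.PNG", "iconXpng")))
file.create(file.path(root, "sub", "handicon.png"))

# Unescaped period: "." matches any character, so "iconXpng" slips through.
loose <- list.files(root, pattern = "icon.*.png$")

# Escaped period: only names that really end in ".png" match.
strict <- list.files(root, pattern = "icon.*\\.png$")

# Case insensitivity via the ignore.case argument of list.files.
any_case <- list.files(root, pattern = "icon.*\\.png$", ignore.case = TRUE)

# Subdirectories are searched only when recursive = TRUE.
nested <- list.files(root, pattern = "icon.*\\.png$", recursive = TRUE)

print(loose)     # "iconXpng" "manicon.png"
print(strict)    # "manicon.png"
print(any_case)  # "BookIcon.PNG" "manicon.png"
print(nested)    # "manicon.png" "sub/handicon.png"

# The R string "\\." holds two characters: a backslash and a period.
print(nchar("\\."))  # 2
```

None of these patterns start with "^.*", which illustrates that the prefix is not needed, and the file in "sub" shows up only in the recursive call.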