
Parsing a Simple Chemical Formula

On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
This can be done with strapply from the gsubfn package.  It matches
the regular expression against the target string and passes the back
references (the parenthesized portions of the regular expression) to
a specified function as successive arguments.

The first argument is form, your input string.  The second is the
regular expression, which matches an upper case letter optionally
followed by lower case letters, all of which is optionally followed
by digits.  The third is a function written in formula notation.
strapply passes the two back references (i.e. the portions within
parentheses) to this function as its two arguments, and the function
substitutes a count of 1 when no digits were matched.  The simplify
argument is another function in formula notation; it turns the
combined result into a two-column matrix and then into a data frame.
Finally, we make the second column of the data frame numeric.

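library(gsubfn)

# Assumed input: inferred from the result shown below
form <- "C5H11BrO"

DF <- strapply(form,
   "([A-Z][a-z]*)(\\d*)",                  # element symbol, optional count
   ~ c(..1, if (nchar(..2)) ..2 else 1),   # missing count defaults to 1
   simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
DF[[2]] <- as.numeric(DF[[2]])             # counts arrive as character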

DF looks like this:
  V1 V2
1  C  5
2  H 11
3 Br  1
4  O  1
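
For comparison, here is a minimal base R sketch of the same matching
logic, with no gsubfn dependency.  It is not the code from this
thread.  The input string "C5H11BrO" is an assumption inferred from
the result above, and the names pattern, tokens, parts and DF2 are
made up for illustration.

```r
# Assumed input: inferred from the printed result, not given in the post
form <- "C5H11BrO"

# Same idea as the regular expression above: an element symbol, then
# an optional run of digits
pattern <- "([A-Z][a-z]*)([0-9]*)"

# Step 1: pull out each element token ("C5", "H11", "Br", "O")
tokens <- regmatches(form, gregexpr(pattern, form))[[1]]

# Step 2: for each token, regexec returns the full match followed by
# the two back references
parts <- regmatches(tokens, regexec(pattern, tokens))

element <- vapply(parts, function(p) p[2], character(1))
count   <- vapply(parts, function(p) p[3], character(1))
count[count == ""] <- "1"   # no digits means a single atom

# Hypothetical name DF2, to keep it apart from DF above
DF2 <- data.frame(V1 = element, V2 = as.numeric(count),
                  stringsAsFactors = FALSE)
```

This takes two passes (gregexpr to find the tokens, regexec to split
each one), which is the bookkeeping that strapply does for you in a
single call.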