Skip to content

Parsing a Simple Chemical Formula

16 messages · jim holtman, David A. Johnston, Bryan Hanson +4 more

#
Hello R Folks...

I've been looking around the 'net and I see many complex solutions in  
various languages to this question, but I have a pretty simple need  
(and I'm not much good at regex).  I want to use a chemical formula as  
a function argument.  The formula would be in "Hill order" which is to  
list C, then H, then all other elements in alphabetical order.  My  
example will have only a limited number of elements, few enough that  
one can search directly for each element.  So some examples would be  
C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or  
Br, there is no following number meaning a 1 is implied).

Let's say

 > form <- "C5H11BrO"

I'd like to get the count of each element, so in this case I need to  
extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the  
molecular weight by mulitplying).  Sounds pretty simple, but my  
experiments with grep and strsplit don't immediately clue me into an  
obvious solution.  As I said, I don't need a general solution to the  
problem of calculating molecular weight from an arbitrary formula,  
that seems quite challenging, just a way to convert "form" into a list  
or data frame which I can then do the math on.

Here's hoping this is a simple issue for more experienced R users!   
TIA,  Bryan
***********
Bryan Hanson
Professor of Chemistry & Biochemistry
#
try this:
+ {
+     # pattern to match the initial chemical
+     # assumes chemical starts with an upper case and optional lower
case followed
+     # by zero or more digits.
+     first <- "^([[:upper:]][[:lower:]]?)([0-9]*).*"
+     # inverse of above to remove the initial chemical
+     last <- "^[[:upper:]][[:lower:]]?[0-9]*(.*)"
+     result <- list()
+     extract <- formula
+     # repeat as long as there is data
+     while ((start <- nchar(extract)) > 0){
+         chem <- sub(first, '\\1 \\2', extract)
+         extract <- sub(last, '\\1', extract)
+         # if the number of characters is the same, then there was an error
+         if (nchar(extract) == start){
+             warning("Invalid formula:", formula)
+             return(NULL)
+         }
+         # append to the list
+         result[[length(result) + 1L]] <- strsplit(chem, ' ')[[1]]
+     }
+     result
+ }
[[1]]
[1] "C" "5"

[[2]]
[1] "H"  "11"

[[3]]
[1] "Br"

[[4]]
[1] "O"
[[1]]
[1] "H" "2"

[[2]]
[1] "O"
[[1]]
[1] "C"

[[2]]
[1] "C"

[[3]]
[1] "C"
NULL
Warning message:
In f.extract("Crr") : Invalid formula:Crr
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:

  
    
#
There might be something simpler, but this is what I came up with:

form = "C5H11BrO"
ups = c(gregexpr("[[:upper:]]", form)[[1]], nchar(form) + 1)
seperated = sapply(1:(length(ups)-1), function(x) substr(form, ups[x],
ups[x+1] - 1))
elements =  gsub("[[:digit:]]", "", seperated)
nums = gsub("[[:alpha:]]", "", seperated)
ans = data.frame(element = as.character(elements),
  num = as.numeric(ifelse(nums == "", 1, nums)), stringsAsFactors = FALSE)
#
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
This can be done by strapply in gsubfn.  It matches the regular
expression to the target string passing the back references (the
parenthesized portions of the regular expression) through a specified
function as successive arguments.

Thus the first arg is form, your input string.  The second arg is the
regular expression which matches an upper case letter optionally
followed by lower case letters and all that is optionally followed by
digits.  The third arg is a function shown in a formula
representation. strapply passes the back references (i.e. the portions
within parentheses) to the function as the two arguments.  Finally
simplify is another function in formula notation which turns the
result into a matrix and then a data frame.  Finally we make the
second column of the data frame numeric.

library(gsubfn)

DF <- strapply(form,
   "([A-Z][a-z]*)(\\d*)",
   ~ c(..1, if (nchar(..2)) ..2 else 1),
   simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
DF[[2]] <- as.numeric(DF[[2]])

DF looks like this:
V1 V2
1  C  5
2  H 11
3 Br  1
4  O  1
#
On Dec 26, 2010, at 6:29 PM, Bryan Hanson wrote:

            
Well here's how I see it:

The "form" can be split with a regular expression:
Capital letter followed by zero or one lower, followeed by a various  
number of digits

greg <- gregexpr("[A-Z]{1}[a-z]?[0-9]*", form)

Append a number equal to one moe lan the ength for reasins that will  
become clear

ugreg <- c(unlist(greg), nchar(form)+1)

Then use substring function to serially pick from a split point to one  
minus the next split point (or in that case of the last element one  
minus the length of the string:

 > sapply(1:(length(ugreg)-1), function(z) substr(form, ugreg[z],  
ugreg[z+1]-1) )
[1] "C5"  "H11" "Br"  "O"

Then you can split these "triples" (cap,lower,n) and if n is absent  
assume 1.

 > sub("(\\d*)$", "", sapply(1:(length(ugreg)-1),   # blank out the  
digits
                 function(z) substr(form, ugreg[z], ugreg[z+1]-1) ) )
[1] "C"  "H"  "Br" "O"

sub("^$", "1", sub("([A-Za-z]*)", "",    # subst "1" for empty strings
                     sapply(1:(length(ugreg)-1),
                           function(z) substr(form, ugreg[z], ugreg[z 
+1]-1) ) ) )
[1] "5"  "11" "1"  "1"

If you limited the number of elements searched for, it might improve  
the error trapping, I suppose.
#
Well let me just say thanks and WOW!  Four great ideas, each worthy of  
study and I'll learn several things from each.  Interestingly, these  
solutions seem more general and more compact than the solutions I  
found on the 'net using python and perl.  More evidence for the power  
of R!  A big thanks to each of you!  Bryan
On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:

            
#
Have you considered the 'CHNOSZ' package?


 > makeup("C5H11BrO" )
    count
C      5
H     11
Br     1
O      1


       I found this using the 'sos' package as follows:


library(sos)
cf <- ???'chemical formula'
found 21 matches;  retrieving 2 pages
cf


       The print method for "cf" opened the results in a web browser, 
which showed that the "CHNOSZ" package had 14 of these 11 matches, and 
the other 7 were in 7 different packages.  Moreover, the "CHNOSZ" 
package is devoted to "Chemical Thermodynamics and Activity Diagrams" 
and provides many more capabilities that might interest you.


       Hope this helps.
       Spencer
On 12/26/2010 5:01 PM, Bryan Hanson wrote:

  
    
#
p.s.  help(pac=CHNOSZ) reveals that this package has 3 vignettes.  I 
have not looked at these vignettes, but most vignettes provide excellent 
introductions (though rarely with complete coverage) of important 
capabilities of the package.  (The 'sos' package includes a vignette, 
which exposes more capabilities than the example below.)


######################
       Have you considered the 'CHNOSZ' package?
count
C      5
H     11
Br     1
O      1


       I found this using the 'sos' package as follows:


library(sos)
cf <- ???'chemical formula'
found 21 matches;  retrieving 2 pages
cf


       The print method for "cf" opened the results in a web browser, 
which showed that the "CHNOSZ" package had 14 of these 11 matches, and 
the other 7 were in 7 different packages.  Moreover, the "CHNOSZ" 
package is devoted to "Chemical Thermodynamics and Activity Diagrams" 
and provides many more capabilities that might interest you.


       Hope this helps.
       Spencer
On 12/26/2010 5:01 PM, Bryan Hanson wrote:

  
    
#
Thanks Spencer, I'll definitely have a look at this package and it's  
vignettes.  I believe I have looked at it before, but didn't catch it  
on this particular search.  Bryan
On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:

            
#
I think the OP had a very limited need but there is something
more sophisticated that may be of larger insterest called "SMILES"
which attempts to capture some structural information about a molecule
in a text sting. Reducing pictures to tractable text is an important step
in many analysis efforts and i was curious what others may be able to say about
R support for things like this.

A quick google search turned up this, 

http://cran.r-project.org/web/packages/rpubchem/rpubchem.pdf

but I wasn't sure if there are more packages for manipulating
different ball and stick collections( the atom and bond descriptions
could just as easily represent any other collection of nodes
and connections).

You can get some idea what this does by typing your favorite chemical
name here,

http://pubchem.ncbi.nlm.nih.gov/

and the entries give something called "Canonical SMILES structures"
For example, 

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=8030&loc=ec_rcs


UPAC Name: thiophene
Canonical SMILES: C1=CSC=C1
InChI: InChI=1S/C4H4S/c1-2-4-5-3-1/h1-4H
InChIKey: YTPLMLYBLZKORZ-UHFFFAOYSA-N [Click for Info]
#
On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:

            
Using the thermo list that the makeup function accesses to get its  
valid atomic symbols one can arrive at the the answer you posited  
would be too difficult in you first posting, the atomic weight from  
the formulae:

 > str(thermo$element)
'data.frame':	130 obs. of  6 variables:
  $ element: chr  "Z" "O" "H" "He" ...
  $ state  : chr  "aq" "gas" "gas" "gas" ...
  $ source : chr  "CWM89" "CWM89" "CWM89" "CWM89" ...
  $ mass   : num  0 16 1.01 4 20.18 ...
  $ s      : num  -15.6 49 31.2 30.2 35 ...
  $ n      : int  1 2 2 1 1 1 1 1 2 2 ...

patts <- paste("^", rownames(makeup(form)), "$", sep="")
makuform<- makeup(form)
makuform$amass <- sapply(patts, function(x) {return( thermo 
$element[ grep(x, thermo$element[[1]])[1], "mass"])}  )
sum(makuform$amass *makuform$count)
# [1] 167.0457
David Winsemius, MD
West Hartford, CT
#
Hi David & others...

I did find the function you recommended, plus, it's even easier (but a  
little hidden in the doc): >element(form, "mass").  But, this uses the  
atomic masses from the periodic table, which are weighted averages of  
the isotopes of each element.  What I'm doing actually involves mass  
spectrometry, so I need the isotope masses, which are integers (think  
12C, 13C, 14C, but the periodic table says 12.011 reflecting the  
relative abundances).  I used Gabor's solution and got my little  
function humming.  Plus, I have several things to read through from  
the various recommendations.

Thanks again, Bryan
On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:

            
#
On Sun, Dec 26, 2010 at 7:26 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
Here is a variation that is slightly simpler. The function in the
third argument has been changed from c to paste so that it outputs
strings like "C 5".  With this form of output we can use read.table to
read it directly creating a data frame.
+   "([A-Z][a-z]*)(\\d*)",
+   ~ paste(..1, if (nchar(..2)) ..2 else 1),
+   simplify = ~ read.table(textConnection(..1)))
  V1 V2
1  C  5
2  H 11
3 Br  1
4  O  1
#
Mike Marchywka's post mentioned a CRAN package, "rpubchem", 
missed by my search for "chemical formula".  A further search for 
"chemical" and "chemistry" still missed it.  "compound" found it.  
Adding "compounds" and combining them with "union" produced a list of 
564 links in 219 packages;  7 of the help pages were for "rpubchem".  
The package with the most matches is "seacarb" (seawater carbonate 
chemistry with R:  21 matches), followed by "CHNOSZ", previously 
mentioned (19 matches).  " rpubchem" is the 22nd package on this list (5 
matches, with a max score of 32, less than the max score of 2 other 
packages with 5 matches).


       Spencer
On 12/26/2010 7:36 PM, Bryan Hanson wrote:
#
----------------------------------------
This is why I always like to have ASCII text help that I can throw into
a flat file and search myself with bash scripts outside of any program
I'm trying to figure out. These problems of looking
for things I don't know, like the guy who wanted to optimize his double
loop but apparently didn't know the names of equivalent matrix operations, 
had a more elaborate but somewhat similar problem. 

Generally for chem or med tools, ncbi is a good place to hunt for vocabulary
and facilities. I have various scripts for sorting vocabularies but still
need various vocabularies ( for example, I'd like to go to IUPAC and download
a list of systematic and trivial chemical names etc). In any case, this
can be a big step in finding things in help or "real" literature. 
catch ya on the flip flop good buddy LOL.
#
There are probably packages specialized to that field. For example, 

http://www.bioconductor.org/help/course-materials/2010/HeidelbergNovember2010/MSintro_LaurentGatto.pdf

which presumably includes relevant physical data files in some format.