Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say > form <- "C5H11BrO" I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA, Bryan *********** Bryan Hanson Professor of Chemistry & Biochemistry
Parsing a Simple Chemical Formula
16 messages · jim holtman, David A. Johnston, Bryan Hanson +4 more
try this:
f.extract <- function(formula)
+ {
+ # pattern to match the initial chemical
+ # assumes chemical starts with an upper case and optional lower
case followed
+ # by zero or more digits.
+ first <- "^([[:upper:]][[:lower:]]?)([0-9]*).*"
+ # inverse of above to remove the initial chemical
+ last <- "^[[:upper:]][[:lower:]]?[0-9]*(.*)"
+ result <- list()
+ extract <- formula
+ # repeat as long as there is data
+ while ((start <- nchar(extract)) > 0){
+ chem <- sub(first, '\\1 \\2', extract)
+ extract <- sub(last, '\\1', extract)
+ # if the number of characters is the same, then there was an error
+ if (nchar(extract) == start){
+ warning("Invalid formula:", formula)
+ return(NULL)
+ }
+ # append to the list
+ result[[length(result) + 1L]] <- strsplit(chem, ' ')[[1]]
+ }
+ result
+ }
f.extract("C5H11BrO")
[[1]] [1] "C" "5" [[2]] [1] "H" "11" [[3]] [1] "Br" [[4]] [1] "O"
f.extract("H2O")
[[1]] [1] "H" "2" [[2]] [1] "O"
f.extract("CCC")
[[1]] [1] "C" [[2]] [1] "C" [[3]] [1] "C"
f.extract("Crr") # bad
NULL
Warning message:
In f.extract("Crr") : Invalid formula:Crr
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). ?I want to use a chemical formula as a function argument. ?The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. ?My example will have only a limited number of elements, few enough that one can search directly for each element. ?So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). ?Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. ?As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! ?TIA, ?Bryan *********** Bryan Hanson Professor of Chemistry & Biochemistry
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Jim Holtman Data Munger Guru What is the problem that you are trying to solve?
There might be something simpler, but this is what I came up with:
form = "C5H11BrO"
ups = c(gregexpr("[[:upper:]]", form)[[1]], nchar(form) + 1)
seperated = sapply(1:(length(ups)-1), function(x) substr(form, ups[x],
ups[x+1] - 1))
elements = gsub("[[:digit:]]", "", seperated)
nums = gsub("[[:alpha:]]", "", seperated)
ans = data.frame(element = as.character(elements),
num = as.numeric(ifelse(nums == "", 1, nums)), stringsAsFactors = FALSE)
View this message in context: http://r.789695.n4.nabble.com/Parsing-a-Simple-Chemical-Formula-tp3164562p3164581.html Sent from the R help mailing list archive at Nabble.com.
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). ?I want to use a chemical formula as a function argument. ?The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. ?My example will have only a limited number of elements, few enough that one can search directly for each element. ?So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). ?Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. ?As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! ?TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
On Dec 26, 2010, at 6:29 PM, Bryan Hanson wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
Well here's how I see it:
The "form" can be split with a regular expression:
Capital letter followed by zero or one lower, followeed by a various
number of digits
greg <- gregexpr("[A-Z]{1}[a-z]?[0-9]*", form)
Append a number equal to one moe lan the ength for reasins that will
become clear
ugreg <- c(unlist(greg), nchar(form)+1)
Then use substring function to serially pick from a split point to one
minus the next split point (or in that case of the last element one
minus the length of the string:
> sapply(1:(length(ugreg)-1), function(z) substr(form, ugreg[z],
ugreg[z+1]-1) )
[1] "C5" "H11" "Br" "O"
Then you can split these "triples" (cap,lower,n) and if n is absent
assume 1.
> sub("(\\d*)$", "", sapply(1:(length(ugreg)-1), # blank out the
digits
function(z) substr(form, ugreg[z], ugreg[z+1]-1) ) )
[1] "C" "H" "Br" "O"
sub("^$", "1", sub("([A-Za-z]*)", "", # subst "1" for empty strings
sapply(1:(length(ugreg)-1),
function(z) substr(form, ugreg[z], ugreg[z
+1]-1) ) ) )
[1] "5" "11" "1" "1"
If you limited the number of elements searched for, it might improve
the error trapping, I suppose.
David. > > I'd like to get the count of each element, so in this case I need to > extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate > the molecular weight by mulitplying). Sounds pretty simple, but my > experiments with grep and strsplit don't immediately clue me into an > obvious solution. As I said, I don't need a general solution to the > problem of calculating molecular weight from an arbitrary formula, > that seems quite challenging, just a way to convert "form" into a > list or data frame which I can then do the math on. > > Here's hoping this is a simple issue for more experienced R users! > TIA, Bryan > *********** > Bryan Hanson > Professor of Chemistry & Biochemistry > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. David Winsemius, MD West Hartford, CT
Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan
On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
Have you considered the 'CHNOSZ' package?
> makeup("C5H11BrO" )
count
C 5
H 11
Br 1
O 1
I found this using the 'sos' package as follows:
library(sos)
cf <- ???'chemical formula'
found 21 matches; retrieving 2 pages
cf
The print method for "cf" opened the results in a web browser,
which showed that the "CHNOSZ" package had 14 of these 11 matches, and
the other 7 were in 7 different packages. Moreover, the "CHNOSZ"
package is devoted to "Chemical Thermodynamics and Activity Diagrams"
and provides many more capabilities that might interest you.
Hope this helps.
Spencer
On 12/26/2010 5:01 PM, Bryan Hanson wrote:
Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Spencer Graves, PE, PhD President and Chief Operating Officer Structure Inspection and Monitoring, Inc. 751 Emerson Ct. San Jos?, CA 95126 ph: 408-655-4567
p.s. help(pac=CHNOSZ) reveals that this package has 3 vignettes. I
have not looked at these vignettes, but most vignettes provide excellent
introductions (though rarely with complete coverage) of important
capabilities of the package. (The 'sos' package includes a vignette,
which exposes more capabilities than the example below.)
######################
Have you considered the 'CHNOSZ' package?
makeup("C5H11BrO" )
count
C 5
H 11
Br 1
O 1
I found this using the 'sos' package as follows:
library(sos)
cf <- ???'chemical formula'
found 21 matches; retrieving 2 pages
cf
The print method for "cf" opened the results in a web browser,
which showed that the "CHNOSZ" package had 14 of these 11 matches, and
the other 7 were in 7 different packages. Moreover, the "CHNOSZ"
package is devoted to "Chemical Thermodynamics and Activity Diagrams"
and provides many more capabilities that might interest you.
Hope this helps.
Spencer
On 12/26/2010 5:01 PM, Bryan Hanson wrote:
Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Spencer Graves, PE, PhD President and Chief Operating Officer Structure Inspection and Monitoring, Inc. 751 Emerson Ct. San Jos?, CA 95126 ph: 408-655-4567
Thanks Spencer, I'll definitely have a look at this package and it's vignettes. I believe I have looked at it before, but didn't catch it on this particular search. Bryan
On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
p.s. help(pac=CHNOSZ) reveals that this package has 3 vignettes. I
have not looked at these vignettes, but most vignettes provide
excellent introductions (though rarely with complete coverage) of
important capabilities of the package. (The 'sos' package includes
a vignette, which exposes more capabilities than the example below.)
######################
Have you considered the 'CHNOSZ' package?
makeup("C5H11BrO" )
count
C 5
H 11
Br 1
O 1
I found this using the 'sos' package as follows:
library(sos)
cf <- ???'chemical formula'
found 21 matches; retrieving 2 pages
cf
The print method for "cf" opened the results in a web browser,
which showed that the "CHNOSZ" package had 14 of these 11 matches,
and the other 7 were in 7 different packages. Moreover, the
"CHNOSZ" package is devoted to "Chemical Thermodynamics and Activity
Diagrams" and provides many more capabilities that might interest you.
Hope this helps.
Spencer
On 12/26/2010 5:01 PM, Bryan Hanson wrote:
Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Spencer Graves, PE, PhD President and Chief Operating Officer Structure Inspection and Monitoring, Inc. 751 Emerson Ct. San Jos?, CA 95126 ph: 408-655-4567
I think the OP had a very limited need but there is something more sophisticated that may be of larger insterest called "SMILES" which attempts to capture some structural information about a molecule in a text sting. Reducing pictures to tractable text is an important step in many analysis efforts and i was curious what others may be able to say about R support for things like this. A quick google search turned up this, http://cran.r-project.org/web/packages/rpubchem/rpubchem.pdf but I wasn't sure if there are more packages for manipulating different ball and stick collections( the atom and bond descriptions could just as easily represent any other collection of nodes and connections). You can get some idea what this does by typing your favorite chemical name here, http://pubchem.ncbi.nlm.nih.gov/ and the entries give something called "Canonical SMILES structures" For example, http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=8030&loc=ec_rcs UPAC Name: thiophene Canonical SMILES: C1=CSC=C1 InChI: InChI=1S/C4H4S/c1-2-4-5-3-1/h1-4H InChIKey: YTPLMLYBLZKORZ-UHFFFAOYSA-N [Click for Info]
From: hanson at depauw.edu To: ggrothendieck at gmail.com Date: Sun, 26 Dec 2010 20:01:45 -0500 CC: r-help at stat.math.ethz.ch Subject: Re: [R] Parsing a Simple Chemical Formula Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
Thanks Spencer, I'll definitely have a look at this package and it's vignettes. I believe I have looked at it before, but didn't catch it on this particular search. Bryan
Using the thermo list that the makeup function accesses to get its
valid atomic symbols one can arrive at the the answer you posited
would be too difficult in you first posting, the atomic weight from
the formulae:
> str(thermo$element)
'data.frame': 130 obs. of 6 variables:
$ element: chr "Z" "O" "H" "He" ...
$ state : chr "aq" "gas" "gas" "gas" ...
$ source : chr "CWM89" "CWM89" "CWM89" "CWM89" ...
$ mass : num 0 16 1.01 4 20.18 ...
$ s : num -15.6 49 31.2 30.2 35 ...
$ n : int 1 2 2 1 1 1 1 1 2 2 ...
patts <- paste("^", rownames(makeup(form)), "$", sep="")
makuform<- makeup(form)
makuform$amass <- sapply(patts, function(x) {return( thermo
$element[ grep(x, thermo$element[[1]])[1], "mass"])} )
sum(makuform$amass *makuform$count)
# [1] 167.0457
On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
p.s. help(pac=CHNOSZ) reveals that this package has 3 vignettes.
I have not looked at these vignettes, but most vignettes provide
excellent introductions (though rarely with complete coverage) of
important capabilities of the package. (The 'sos' package includes
a vignette, which exposes more capabilities than the example below.)
######################
Have you considered the 'CHNOSZ' package?
makeup("C5H11BrO" )
count
C 5
H 11
Br 1
O 1
I found this using the 'sos' package as follows:
library(sos)
cf <- ???'chemical formula'
found 21 matches; retrieving 2 pages
cf
The print method for "cf" opened the results in a web browser,
which showed that the "CHNOSZ" package had 14 of these 11 matches,
and the other 7 were in 7 different packages. Moreover, the
"CHNOSZ" package is devoted to "Chemical Thermodynamics and
Activity Diagrams" and provides many more capabilities that might
interest you.
Hope this helps.
Spencer
On 12/26/2010 5:01 PM, Bryan Hanson wrote:
Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Spencer Graves, PE, PhD President and Chief Operating Officer Structure Inspection and Monitoring, Inc. 751 Emerson Ct. San Jos?, CA 95126 ph: 408-655-4567
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT
Hi David & others... I did find the function you recommended, plus, it's even easier (but a little hidden in the doc): >element(form, "mass"). But, this uses the atomic masses from the periodic table, which are weighted averages of the isotopes of each element. What I'm doing actually involves mass spectrometry, so I need the isotope masses, which are integers (think 12C, 13C, 14C, but the periodic table says 12.011 reflecting the relative abundances). I used Gabor's solution and got my little function humming. Plus, I have several things to read through from the various recommendations. Thanks again, Bryan
On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:
On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
Thanks Spencer, I'll definitely have a look at this package and it's vignettes. I believe I have looked at it before, but didn't catch it on this particular search. Bryan
Using the thermo list that the makeup function accesses to get its valid atomic symbols one can arrive at the the answer you posited would be too difficult in you first posting, the atomic weight from the formulae:
str(thermo$element)
'data.frame': 130 obs. of 6 variables:
$ element: chr "Z" "O" "H" "He" ...
$ state : chr "aq" "gas" "gas" "gas" ...
$ source : chr "CWM89" "CWM89" "CWM89" "CWM89" ...
$ mass : num 0 16 1.01 4 20.18 ...
$ s : num -15.6 49 31.2 30.2 35 ...
$ n : int 1 2 2 1 1 1 1 1 2 2 ...
patts <- paste("^", rownames(makeup(form)), "$", sep="")
makuform<- makeup(form)
makuform$amass <- sapply(patts, function(x) {return( thermo
$element[ grep(x, thermo$element[[1]])[1], "mass"])} )
sum(makuform$amass *makuform$count)
# [1] 167.0457
On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
p.s. help(pac=CHNOSZ) reveals that this package has 3 vignettes. I have not looked at these vignettes, but most vignettes provide excellent introductions (though rarely with complete coverage) of important capabilities of the package. (The 'sos' package includes a vignette, which exposes more capabilities than the example below.) ###################### Have you considered the 'CHNOSZ' package?
makeup("C5H11BrO" )
count C 5 H 11 Br 1 O 1 I found this using the 'sos' package as follows: library(sos) cf <- ???'chemical formula' found 21 matches; retrieving 2 pages cf The print method for "cf" opened the results in a web browser, which showed that the "CHNOSZ" package had 14 of these 11 matches, and the other 7 were in 7 different packages. Moreover, the "CHNOSZ" package is devoted to "Chemical Thermodynamics and Activity Diagrams" and provides many more capabilities that might interest you. Hope this helps. Spencer On 12/26/2010 5:01 PM, Bryan Hanson wrote:
Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Spencer Graves, PE, PhD President and Chief Operating Officer Structure Inspection and Monitoring, Inc. 751 Emerson Ct. San Jos?, CA 95126 ph: 408-655-4567
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT
On Sun, Dec 26, 2010 at 7:26 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). ?I want to use a chemical formula as a function argument. ?The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. ?My example will have only a limited number of elements, few enough that one can search directly for each element. ?So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). ?Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. ?As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! ?TIA,
This can be done by strapply in gsubfn. ?It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. ?The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. ?The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. ?Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. ?Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, ? "([A-Z][a-z]*)(\\d*)", ? ~ c(..1, if (nchar(..2)) ..2 else 1), ? simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
?V1 V2 1 ?C ?5 2 ?H 11 3 Br ?1 4 ?O ?1
Here is a variation that is slightly simpler. The function in the third argument has been changed from c to paste so that it outputs strings like "C 5". With this form of output we can use read.table to read it directly creating a data frame.
strapply(form,
+ "([A-Z][a-z]*)(\\d*)", + ~ paste(..1, if (nchar(..2)) ..2 else 1), + simplify = ~ read.table(textConnection(..1))) V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
Mike Marchywka's post mentioned a CRAN package, "rpubchem",
missed by my search for "chemical formula". A further search for
"chemical" and "chemistry" still missed it. "compound" found it.
Adding "compounds" and combining them with "union" produced a list of
564 links in 219 packages; 7 of the help pages were for "rpubchem".
The package with the most matches is "seacarb" (seawater carbonate
chemistry with R: 21 matches), followed by "CHNOSZ", previously
mentioned (19 matches). " rpubchem" is the 22nd package on this list (5
matches, with a max score of 32, less than the max score of 2 other
packages with 5 matches).
Spencer
On 12/26/2010 7:36 PM, Bryan Hanson wrote:
Hi David & others... I did find the function you recommended, plus, it's even easier (but a little hidden in the doc): >element(form, "mass"). But, this uses the atomic masses from the periodic table, which are weighted averages of the isotopes of each element. What I'm doing actually involves mass spectrometry, so I need the isotope masses, which are integers (think 12C, 13C, 14C, but the periodic table says 12.011 reflecting the relative abundances). I used Gabor's solution and got my little function humming. Plus, I have several things to read through from the various recommendations. Thanks again, Bryan On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:
On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
Thanks Spencer, I'll definitely have a look at this package and it's vignettes. I believe I have looked at it before, but didn't catch it on this particular search. Bryan
Using the thermo list that the makeup function accesses to get its valid atomic symbols one can arrive at the the answer you posited would be too difficult in you first posting, the atomic weight from the formulae:
str(thermo$element)
'data.frame': 130 obs. of 6 variables:
$ element: chr "Z" "O" "H" "He" ...
$ state : chr "aq" "gas" "gas" "gas" ...
$ source : chr "CWM89" "CWM89" "CWM89" "CWM89" ...
$ mass : num 0 16 1.01 4 20.18 ...
$ s : num -15.6 49 31.2 30.2 35 ...
$ n : int 1 2 2 1 1 1 1 1 2 2 ...
patts <- paste("^", rownames(makeup(form)), "$", sep="")
makuform<- makeup(form)
makuform$amass <- sapply(patts, function(x) {return( thermo$element[
grep(x, thermo$element[[1]])[1], "mass"])} )
sum(makuform$amass *makuform$count)
# [1] 167.0457
On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
p.s. help(pac=CHNOSZ) reveals that this package has 3 vignettes. I have not looked at these vignettes, but most vignettes provide excellent introductions (though rarely with complete coverage) of important capabilities of the package. (The 'sos' package includes a vignette, which exposes more capabilities than the example below.) ###################### Have you considered the 'CHNOSZ' package?
makeup("C5H11BrO" )
count C 5 H 11 Br 1 O 1 I found this using the 'sos' package as follows: library(sos) cf <- ???'chemical formula' found 21 matches; retrieving 2 pages cf The print method for "cf" opened the results in a web browser, which showed that the "CHNOSZ" package had 14 of these 11 matches, and the other 7 were in 7 different packages. Moreover, the "CHNOSZ" package is devoted to "Chemical Thermodynamics and Activity Diagrams" and provides many more capabilities that might interest you. Hope this helps. Spencer On 12/26/2010 5:01 PM, Bryan Hanson wrote:
Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT
----------------------------------------
Date: Sun, 26 Dec 2010 20:24:23 -0800 From: spencer.graves at structuremonitoring.com To: hanson at depauw.edu CC: r-help at stat.math.ethz.ch Subject: Re: [R] Parsing a Simple Chemical Formula Mike Marchywka's post mentioned a CRAN package, "rpubchem", missed by my search for "chemical formula". A further search for "chemical" and "chemistry" still missed it. "compound" found it. Adding "compounds" and combining them with "union" produced a list of 564 links in 219 packages; 7 of the help pages were for "rpubchem". The package with the most matches is "seacarb" (seawater carbonate chemistry with R: 21 matches), followed by "CHNOSZ", previously mentioned (19 matches). " rpubchem" is the 22nd package on this list (5 matches, with a max score of 32, less than the max score of 2 other packages with 5 matches).
This is why I always like to have ASCII text help that I can throw into a flat file and search myself with bash scripts outside of any program I'm trying to figure out. These problems of looking for things I don't know, like the guy who wanted to optimize his double loop but apparently didn't know the names of equivalent matrix operations, had a more elaborate but somewhat similar problem. Generally for chem or med tools, ncbi is a good place to hunt for vocabulary and facilities. I have various scripts for sorting vocabularies but still need various vocabularies ( for example, I'd like to go to IUPAC and download a list of systematic and trivial chemical names etc). In any case, this can be a big step in finding things in help or "real" literature. catch ya on the flip flop good buddy LOL.
Spencer On 12/26/2010 7:36 PM, Bryan Hanson wrote:
Hi David & others... I did find the function you recommended, plus, it's even easier (but a little hidden in the doc): >element(form, "mass"). But, this uses the atomic masses from the periodic table, which are weighted averages of the isotopes of each element. What I'm doing actually involves mass spectrometry, so I need the isotope masses, which are integers (think 12C, 13C, 14C, but the periodic table says 12.011 reflecting the relative abundances). I used Gabor's solution and got my little function humming. Plus, I have several things to read through from the various recommendations. Thanks again, Bryan On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:
On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
Thanks Spencer, I'll definitely have a look at this package and it's vignettes. I believe I have looked at it before, but didn't catch it on this particular search. Bryan
Using the thermo list that the makeup function accesses to get its valid atomic symbols one can arrive at the the answer you posited would be too difficult in you first posting, the atomic weight from the formulae:
str(thermo$element)
'data.frame': 130 obs. of 6 variables:
$ element: chr "Z" "O" "H" "He" ...
$ state : chr "aq" "gas" "gas" "gas" ...
$ source : chr "CWM89" "CWM89" "CWM89" "CWM89" ...
$ mass : num 0 16 1.01 4 20.18 ...
$ s : num -15.6 49 31.2 30.2 35 ...
$ n : int 1 2 2 1 1 1 1 1 2 2 ...
patts <- paste("^", rownames(makeup(form)), "$", sep="")
makuform<- makeup(form)
makuform$amass <- sapply(patts, function(x) {return( thermo$element[
grep(x, thermo$element[[1]])[1], "mass"])} )
sum(makuform$amass *makuform$count)
# [1] 167.0457
On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
p.s. help(pac=CHNOSZ) reveals that this package has 3 vignettes. I have not looked at these vignettes, but most vignettes provide excellent introductions (though rarely with complete coverage) of important capabilities of the package. (The 'sos' package includes a vignette, which exposes more capabilities than the example below.) ###################### Have you considered the 'CHNOSZ' package?
makeup("C5H11BrO" )
count C 5 H 11 Br 1 O 1 I found this using the 'sos' package as follows: library(sos) cf <- ???'chemical formula' found 21 matches; retrieving 2 pages cf The print method for "cf" opened the results in a web browser, which showed that the "CHNOSZ" package had 14 of these 11 matches, and the other 7 were in 7 different packages. Moreover, the "CHNOSZ" package is devoted to "Chemical Thermodynamics and Activity Diagrams" and provides many more capabilities that might interest you. Hope this helps. Spencer On 12/26/2010 5:01 PM, Bryan Hanson wrote:
Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
From: hanson at depauw.edu To: dwinsemius at comcast.net Date: Sun, 26 Dec 2010 22:36:49 -0500 CC: r-help at stat.math.ethz.ch Subject: Re: [R] Parsing a Simple Chemical Formula Hi David & others... I did find the function you recommended, plus, it's even easier (but a little hidden in the doc): >element(form, "mass"). But, this uses the atomic masses from the periodic table, which are weighted averages of the isotopes of each element. What I'm doing actually involves mass spectrometry, so I need the isotope masses, which are integers (think
There are probably packages specialized to that field. For example, http://www.bioconductor.org/help/course-materials/2010/HeidelbergNovember2010/MSintro_LaurentGatto.pdf which presumably includes relevant physical data files in some format.
12C, 13C, 14C, but the periodic table says 12.011 reflecting the relative abundances). I used Gabor's solution and got my little function humming. Plus, I have several things to read through from the various recommendations. Thanks again, Bryan On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:
On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
Thanks Spencer, I'll definitely have a look at this package and it's vignettes. I believe I have looked at it before, but didn't catch it on this particular search. Bryan
Using the thermo list that the makeup function accesses to get its valid atomic symbols one can arrive at the the answer you posited would be too difficult in you first posting, the atomic weight from the formulae:
str(thermo$element)
'data.frame': 130 obs. of 6 variables:
$ element: chr "Z" "O" "H" "He" ...
$ state : chr "aq" "gas" "gas" "gas" ...
$ source : chr "CWM89" "CWM89" "CWM89" "CWM89" ...
$ mass : num 0 16 1.01 4 20.18 ...
$ s : num -15.6 49 31.2 30.2 35 ...
$ n : int 1 2 2 1 1 1 1 1 2 2 ...
patts <- paste("^", rownames(makeup(form)), "$", sep="")
makuform<- makeup(form)
makuform$amass <- sapply(patts, function(x) {return( thermo
$element[ grep(x, thermo$element[[1]])[1], "mass"])} )
sum(makuform$amass *makuform$count)
# [1] 167.0457
On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
p.s. help(pac=CHNOSZ) reveals that this package has 3 vignettes. I have not looked at these vignettes, but most vignettes provide excellent introductions (though rarely with complete coverage) of important capabilities of the package. (The 'sos' package includes a vignette, which exposes more capabilities than the example below.) ###################### Have you considered the 'CHNOSZ' package?
makeup("C5H11BrO" )
count C 5 H 11 Br 1 O 1 I found this using the 'sos' package as follows: library(sos) cf <- ???'chemical formula' found 21 matches; retrieving 2 pages cf The print method for "cf" opened the results in a web browser, which showed that the "CHNOSZ" package had 14 of these 11 matches, and the other 7 were in 7 different packages. Moreover, the "CHNOSZ" package is devoted to "Chemical Thermodynamics and Activity Diagrams" and provides many more capabilities that might interest you. Hope this helps. Spencer On 12/26/2010 5:01 PM, Bryan Hanson wrote:
Well let me just say thanks and WOW! Four great ideas, each worthy of study and I'll learn several things from each. Interestingly, these solutions seem more general and more compact than the solutions I found on the 'net using python and perl. More evidence for the power of R! A big thanks to each of you! Bryan On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson wrote:
Hello R Folks... I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order" which is to list C, then H, then all other elements in alphabetical order. My example will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number meaning a 1 is implied). Let's say
form <- "C5H11BrO"
I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by mulitplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me into an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, that seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on. Here's hoping this is a simple issue for more experienced R users! TIA,
This can be done by strapply in gsubfn. It matches the regular expression to the target string passing the back references (the parenthesized portions of the regular expression) through a specified function as successive arguments. Thus the first arg is form, your input string. The second arg is the regular expression which matches an upper case letter optionally followed by lower case letters and all that is optionally followed by digits. The third arg is a function shown in a formula representation. strapply passes the back references (i.e. the portions within parentheses) to the function as the two arguments. Finally simplify is another function in formula notation which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric. library(gsubfn) DF <- strapply(form, "([A-Z][a-z]*)(\\d*)", ~ c(..1, if (nchar(..2)) ..2 else 1), simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE)) DF[[2]] <- as.numeric(DF[[2]]) DF looks like this:
DF
V1 V2 1 C 5 2 H 11 3 Br 1 4 O 1 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Spencer Graves, PE, PhD President and Chief Operating Officer Structure Inspection and Monitoring, Inc. 751 Emerson Ct. San Jos?, CA 95126 ph: 408-655-4567
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.