Best way to test for numeric digits?
This seems unnecessarily complex. Or rather,
it pushes the complexity into an arcane notation.
What we really want is something that says "here is a string,
here is a pattern, give me all the substrings that match."
What we're given is a function that tells us where those
substrings are.
# greg.matches(pattern, text)
# accepts a POSIX regular expression, pattern
# and a text to search in. Both arguments must be character strings
# (length(...) = 1) not longer vectors of strings.
# It returns a character vector of all the (non-overlapping)
# substrings of text as determined by gregexpr.
greg.matches <- function (pattern, text) {
    if (length(pattern) > 1) stop("pattern has too many elements")
    if (length(text) > 1) stop("text has too many elements")
    match.info <- gregexpr(pattern, text)
    starts <- match.info[[1]]
    # gregexpr reports "no match" as a single start position of -1
    if (isTRUE(starts[1] == -1)) return(character(0))
    stops <- attr(starts, "match.length") - 1 + starts
    sapply(seq(along=starts), function (i) {
        substr(text, starts[i], stops[i])
    })
}
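For what it is worth, base R's regmatches() combined with gregexpr() also returns the matched substrings directly. The following is only a minimal sketch, assuming a single pattern and a single string; the name greg.matches2 is hypothetical and used purely for illustration.

```r
# Sketch: same idea as greg.matches, built on base R's regmatches().
# greg.matches2 is a hypothetical name used only for this illustration.
greg.matches2 <- function (pattern, text) {
    stopifnot(length(pattern) == 1, length(text) == 1)
    # regmatches() returns a list with one element per input string;
    # [[1]] extracts the character vector for our single string.
    regmatches(text, gregexpr(pattern, text))[[1]]
}

greg.matches2("[A-Z][a-z]*[0-9]*", "CCl3F")
# [1] "C"   "Cl3" "F"
greg.matches2("[0-9]+", "abc")
# character(0)
```

When nothing matches, this version returns an empty character vector, which the later steps handle without special cases.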
Given greg.matches, we can do the rest with very simple
and easily comprehended regular expressions.
# parse.chemical(formula)
# takes a simple chemical formula "<element><count>..." and
# returns a list with components
# $elements -- character -- the atom symbols
# $counts -- number -- the counts (missing counts taken as 1).
# BEWARE. This does not handle formulas like "CH(OH)3".
parse.chemical <- function (formula) {
    parts <- greg.matches("[A-Z][a-z]*[0-9]*", formula)
    elements <- gsub("[0-9]+", "", parts)
    counts <- as.numeric(gsub("[^0-9]+", "", parts))
    counts <- ifelse(is.na(counts), 1, counts)
    list(elements=elements, counts=counts)
}
parse.chemical("CCl3F")
$elements
[1] "C"  "Cl" "F"

$counts
[1] 1 3 1

parse.chemical("Li4Al4H16")
$elements
[1] "Li" "Al" "H"

$counts
[1]  4  4 16

parse.chemical("CCl2CO2AlPO4SiO4Cl")
$elements
[1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

$counts
[1] 1 2 1 2 1 1 4 1 4 1

On Thu, 19 Oct 2023 at 03:59, Leonard Mada via R-help <r-help at r-project.org> wrote:
Dear List members,
What is the best way to test for numeric digits?
suppressWarnings(as.double(c("Li", "Na", "K", "2", "Rb", "Ca", "3")))
# [1] NA NA NA 2 NA NA 3
The above requires the use of the suppressWarnings function. Are there
any better ways?
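One way to avoid suppressWarnings is to test the strings with a regular expression instead of attempting the conversion. This is a minimal sketch, assuming "numeric" means one or more ASCII digits (no sign, decimal point, or exponent); the variable names is.digits and num are illustrative only.

```r
x <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")
# TRUE where the whole string consists only of the digits 0-9
is.digits <- grepl("^[0-9]+$", x)
is.digits
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
# Convert only the digit strings; everything else stays NA, with no warning
num <- rep(NA_real_, length(x))
num[is.digits] <- as.numeric(x[is.digits])
num
# [1] NA NA NA  2 NA NA  3
```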
I was working to extract chemical elements from a formula, something
like this:
split.symbol.character = function(x, rm.digits = TRUE) {
    # Perl is partly broken in R 4.3, but this works:
    regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
    # stringi::stri_split(x, regex = regex);
    s = strsplit(x, regex, perl = TRUE);
    if(rm.digits) {
        s = lapply(s, function(s) {
            isNotD = is.na(suppressWarnings(as.numeric(s)));
            s = s[isNotD];
        });
    }
    return(s);
}
split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
Sincerely,
Leonard
Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)