
Best way to test for numeric digits?

At 17:24 on 18/10/2023, Leonard Mada wrote:
Hello,

You and Avi are right; my function's performance is terrible. The 
following is much faster.

As for how to keep the digits from throwing warnings, the lapply() call 
in the version of your function below solves it by setting grep's 
argument invert = TRUE. grep then returns the indices of all strings in 
which no digit occurs, and only those strings are kept.
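
To illustrate what invert = TRUE does on a single, already split formula, here is a small sketch. The token vector is written by hand for the illustration (it is an assumption, not the output of strsplit), so treat it as a demonstration of grep only, not of the whole function.

```r
# Tokens as they might look after splitting "CCl3F"
# (hand-written for this example, not produced by strsplit).
tokens <- c("C", "Cl", "3", "F")

# Default behaviour: indices of the elements that contain a digit.
grep("[[:digit:]]+", tokens)
#> [1] 3

# invert = TRUE: indices of the elements that do NOT contain a digit.
keep <- grep("[[:digit:]]+", tokens, invert = TRUE)
keep
#> [1] 1 2 4

# Subsetting with those indices drops the digit tokens. No numeric
# conversion takes place, so there is nothing that could raise a
# coercion warning.
tokens[keep]
#> [1] "C"  "Cl" "F"
```

A possible advantage over negative indexing such as tokens[-grep(...)] is that the inverted form also behaves sensibly when no token contains a digit: grep then returns every index, whereas a negative index built from an empty match would select nothing.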



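split_chem_elements <- function(x, rm.digits = TRUE) {
   # Zero-width split points at the boundaries between element symbols
   # and digit runs
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   if(rm.digits) {
     # Mark the split points with "#", split on "#" or on any digit,
     # then drop the empty strings left behind (needs package stringr)
     stringr::str_replace_all(x, regex, "#") |>
       strsplit("#|[[:digit:]]") |>
       lapply(\(x) x[nchar(x) > 0L])
   } else {
     strsplit(x, regex, perl = TRUE)
   }
}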

split.symbol.character <- function(x, rm.digits = TRUE) {
   # Perl is partly broken in R 4.3, but this works:
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   s <- strsplit(x, regex, perl = TRUE)
   if(rm.digits) {
     # invert = TRUE returns the indices of the pieces without digits,
     # so only the element symbols are kept
     s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
   }
   s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"
split.symbol.character(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

mol10000 <- rep(mol, 10000)

system.time(
   split_chem_elements(mol10000)
)
#>    user  system elapsed
#>    0.01    0.00    0.02
system.time(
   split.symbol.character(mol10000)
)
#>    user  system elapsed
#>    0.35    0.07    0.47



Hope this helps,

Rui Barradas