Message-ID: <4bc2d866-15fb-4fb0-801f-f6ad4c445280@sapo.pt>
Date: 2023-10-18T18:54:32Z
From: Rui Barradas
Subject: Best way to test for numeric digits?
In-Reply-To: <8314deb9-1e71-4a7d-ad0d-ade1f7c7008a@syonic.eu>

At 19:35 on 18/10/2023, Leonard Mada wrote:
> Dear Rui,
> 
> On 10/18/2023 8:45 PM, Rui Barradas wrote:
>> split_chem_elements <- function(x, rm.digits = TRUE) {
>>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>>   if(rm.digits) {
>>     stringr::str_replace_all(mol, regex, "#") |>
>>       strsplit("#|[[:digit:]]") |>
>>       lapply(\(x) x[nchar(x) > 0L])
>>   } else {
>>     strsplit(x, regex, perl = TRUE)
>>   }
>> }
>>
>> split.symbol.character = function(x, rm.digits = TRUE) {
>>   # Perl is partly broken in R 4.3, but this works:
>>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>>   s <- strsplit(x, regex, perl = TRUE)
>>   if(rm.digits) {
>>     s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
>>   }
>>   s
>> }
> 
> You have a glitch (mol is hardcoded) in the code of the first function. 
> The times are similar, after correcting for that glitch.
> 
> Note:
> - grep("[[:digit:]]", ...) is almost twice as slow as grep("[0-9]", 
> ...)!
> - corrected results below;
> 
> Sincerely,
> 
> Leonard
> #######
> 
> split_chem_elements <- function(x, rm.digits = TRUE) {
>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>   if(rm.digits) {
>     stringr::str_replace_all(x, regex, "#") |>
>       strsplit("#|[[:digit:]]") |>
>       lapply(\(x) x[nchar(x) > 0L])
>   } else {
>     strsplit(x, regex, perl = TRUE)
>   }
> }
> 
> split.symbol.character = function(x, rm.digits = TRUE) {
>   # Perl is partly broken in R 4.3, but this works:
>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>   s <- strsplit(x, regex, perl = TRUE)
>   if(rm.digits) {
>     s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
>   }
>   s
> }
> 
> mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
> mol10000 <- rep(mol, 10000)
> 
> system.time(
>   split_chem_elements(mol10000)
> )
> #   user  system elapsed
> #   0.58    0.00    0.58
> 
> system.time(
>   split.symbol.character(mol10000)
> )
> #   user  system elapsed
> #   0.67    0.00    0.67
> 
Hello,

You are right, sorry for the blunder :(.
In the code below I have replaced stringr::str_replace_all with 
stringi::stri_replace_all_regex, and the improvement is significant.


split_chem_elements <- function(x, rm.digits = TRUE) {
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   if(rm.digits) {
     # stri_replace_all_regex(str, pattern, replacement): pattern comes first
     stringi::stri_replace_all_regex(x, regex, "#") |>
       strsplit("#|[0-9]") |>
       lapply(\(x) x[nchar(x) > 0L])
   } else {
     strsplit(x, regex, perl = TRUE)
   }
}
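
Before comparing timings, it is worth confirming that the variants return the same split. The following is only a minimal sketch, not part of the timed code: it redefines the base R splitter from the quoted message, runs it on the three test molecules, and, if stringi happens to be installed, compares it with a stringi-based variant. The helper name split_stringi is hypothetical, introduced here for illustration. Note that stri_replace_all_regex() expects its arguments in the order str, pattern, replacement.

```r
# Base R splitter, as in the quoted message.
split.symbol.character <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if (rm.digits) {
    # Drop the tokens that contain digits (the atom counts).
    s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
  }
  s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
res <- split.symbol.character(mol)
res[[1]]  # "C"  "Cl" "F"
res[[2]]  # "Li" "Al" "H"

# Hypothetical stringi-based variant; compared only when stringi is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  split_stringi <- function(x) {
    regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
    # Mark every split position with "#", then split on "#" or on a digit.
    stringi::stri_replace_all_regex(x, regex, "#") |>
      strsplit("#|[0-9]") |>
      lapply(\(x) x[nchar(x) > 0L])
  }
  print(identical(split_stringi(mol), res))
}
```

If the two results differ, the timings are not comparing like with like, so the check should come before any benchmark.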

# system.time(
#   split_chem_elements(mol10000)
# )
#  user  system elapsed
#  0.06    0.00    0.09
# system.time(
#   split.symbol.character(mol10000)
# )
#  user  system elapsed
#  0.25    0.00    0.28
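
Leonard's note about "[[:digit:]]" versus "[0-9]" can also be checked in isolation. This is a rough sketch under the assumption of a synthetic token vector x (made up here, not data from this thread); the timings depend on the machine, the locale and the R version, so no numbers are shown.

```r
# Synthetic tokens: element symbols mixed with atom counts.
x <- rep(c("C", "Cl", "3", "F", "Li", "4", "Al", "16"), 50000)

# Time the POSIX character class against the explicit range.
t_posix <- system.time(i1 <- grep("[[:digit:]]", x, invert = TRUE))
t_range <- system.time(i2 <- grep("[0-9]", x, invert = TRUE))

# For ASCII input both patterns must select the same elements.
stopifnot(identical(i1, i2))

# Elapsed seconds for each pattern.
print(c(posix = t_posix[["elapsed"]], range = t_range[["elapsed"]]))
```

A single system.time() call is noisy; repeating the measurement a few times gives a more reliable picture.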



Hope this helps,

Rui Barradas



