
What is the fastest way to search for a pattern in a data frame with a few million entries?

Hi there,

I have a data frame DF with 40 million strings and their frequencies. I
am searching for strings that match a given pattern, and I am trying to
speed up this part of my code. I have tried many options, but so far I
am not satisfied:
- grepl and subset are equivalent in terms of processing time:
    grepl(paste0("^", pattern), df$Strings)
    subset(df, grepl(paste0("^", pattern), df$Strings))

- lookup(pattern, df) is not what I am looking for, since it only does 
exact matching

- I tried converting my data frame to a data.table, but it didn't improve 
things (reading/writing the data.table would probably be much faster, though)

- the only approach that helped was removing the third of the data frame 
containing the lowest-frequency strings, which sped up the search by a 
factor of about 10!

- I haven't tried parRapply yet; on a multicore machine that could buy 
another factor.
    I did use parLapply for some other code, but I had many memory 
issues (it crashed my Mac).
    I had to sub-divide the dataset to get it working correctly, and I 
never fully understood the issue.
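One idea worth testing (a sketch, not from the thread itself): since the pattern is anchored with "^", the search is really a prefix match, so the regex engine can be skipped entirely. Base R's startsWith() does a plain prefix comparison and is usually much faster than grepl() on large character vectors; grepl(..., perl = TRUE) is another commonly cited speedup when a regex is unavoidable. The names df, Strings, and pattern below follow the post; the toy data is made up for illustration.

```r
## Toy stand-in for the 40-million-row data frame described in the post.
set.seed(1)
df <- data.frame(
  Strings = paste0(sample(letters, 1e5, replace = TRUE),
                   sample(letters, 1e5, replace = TRUE),
                   sample(letters, 1e5, replace = TRUE)),
  Freq    = sample.int(1000, 1e5, replace = TRUE),
  stringsAsFactors = FALSE
)
pattern <- "ab"

## Regex version, as in the post.
hits_regex  <- grepl(paste0("^", pattern), df$Strings)

## Prefix version: same logical result, no regex engine involved.
hits_prefix <- startsWith(df$Strings, pattern)

stopifnot(identical(hits_regex, hits_prefix))

## Subset exactly as before.
matched <- df[hits_prefix, ]
```

Benchmarking both with system.time() (or the microbenchmark package) on the real data would show whether the gain carries over at 40 million rows.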

I am sure there is some smarter way to do this. Any good 
articles/blogs or suggestions that could give me some guidance?

Thanks a lot
Cheers
Fabien