
what is the fastest way to search for a pattern in a data frame with a few million entries?

8 messages · Duncan Murdoch, Fabien Tarrade, Bert Gunter +2 more

#
Hi there,

I have a data frame DF with 40 million strings and their frequencies. I 
am searching for strings with a given pattern and I am trying to speed 
up this part of my code. I have tried many options but so far I am not 
satisfied. I tried:
- grepl and subset are equivalent in terms of processing time (see the sketch after this list)
    grepl(paste0("^",pattern),df$Strings)
    subset(df, grepl(paste0("^",pattern), df$Strings))

- lookup(pattern,df) is not what I am looking for, since it does exact 
matching

- I tried converting my data frame to a data.table but it didn't improve 
things (though reading/writing this DT will probably be much faster)

- the only way I found was to remove the 1/3 of the data frame with the 
lowest-frequency strings, which speeds up the process by a factor of 10!

- I haven't tried parRapply yet; with a multicore machine I could gain 
another factor.
    I did use parLapply for some other code, but I had many issues with 
memory (crashing my Mac).
    I had to sub-divide the dataset to get it working correctly, but I 
didn't manage to fully understand the issue.
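For reference, a minimal sketch of the attempts above, under stated 
assumptions: the column names Strings/Freq, the sample data, and the 
chunk/worker counts are illustrative, not the original code.

    library(data.table)
    library(parallel)

    ## toy stand-in for the real 40-million-row frequency table
    df <- data.frame(Strings = c("some words here", "other text", "some words again"),
                     Freq    = c(3L, 1L, 2L),
                     stringsAsFactors = FALSE)

    pattern <- "some words"
    rx <- paste0("^", pattern)

    ## grepl and subset both reduce to one grepl pass over the column
    hits1 <- df[grepl(rx, df$Strings), ]
    hits2 <- subset(df, grepl(rx, Strings))

    ## data.table: the regex pass still dominates, so timings stay similar
    dt <- as.data.table(df)
    hits3 <- dt[grepl(rx, Strings)]

    ## chunked parLapply: splitting the column bounds per-worker memory
    cl <- makeCluster(2)
    chunks <- split(df$Strings, cut(seq_along(df$Strings), 2))
    flags  <- parLapply(cl, chunks, grepl, pattern = rx)
    stopCluster(cl)
    hits4 <- df[unlist(flags, use.names = FALSE), ]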

I am sure there is some other smart way to do this. Any good 
articles/blogs or suggestions that could give me some guidance?

Thanks a lot
Cheers
Fabien
#
On 10/04/2016 2:03 PM, Fabien Tarrade wrote:
Didn't you post the same question yesterday?  Perhaps nobody answered 
because your question is unanswerable.  You need to describe what the 
strings are like and what the patterns are like if you want advice on 
speeding things up.

Duncan Murdoch
#
Hi Duncan,
sorry, I got an email saying my message was waiting for approval, and when I 
looked at the forum I didn't see my message, which is why I sent it 
again; this time I did check that the format of my message was text 
only. Sorry for the noise.
My strings are 1-grams up to 5-grams (sequences of 1 word up to 5 words), 
and I am searching my DF for the frequencies of the strings starting 
with a given sequence of a few words.
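For illustration, a toy version of such a table and query (the column 
names and data here are assumptions, not the actual DF):

    df <- data.frame(Strings = c("the", "the cat", "the cat sat", "a dog", "a dog ran"),
                     Freq    = c(500L, 42L, 7L, 120L, 3L),
                     stringsAsFactors = FALSE)
    pattern <- "the cat"                           # the leading words to match
    df[grepl(paste0("^", pattern), df$Strings), ]  # rows "the cat" and "the cat sat"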

I guess these days it is standard to use data frames with millions of entries, so 
I was wondering how people do that in the fastest way.

Thanks
Cheers
Fabien
#
Fabien:

I was unable to make any sense of your latest response (maybe I'm just
dense). If others have similar difficulties, and you fail to get a
satisfactory response, I suggest that you read and follow the posting
guide's request for a *small, reproducible example* (perhaps the
first few dozen rows of your data frame) in which you show the code
you tried and your desired result.


Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Apr 10, 2016 at 12:27 PM, Fabien Tarrade
<fabien.tarrade at gmail.com> wrote:
#
Hi Fabien,
I was going to send this last night, but I thought it was too simple.
Runs in about one millisecond.

df <- data.frame(freq = runif(1000),
                 strings = apply(matrix(sample(LETTERS, 10000, TRUE), ncol = 10),
                                 1, paste, collapse = ""))  # 1000 random 10-letter strings
match.ind <- grep("DF", df$strings)  # rows whose string contains "DF"
match.ind
 [1]   2  11  91 133 169 444 547 605 734 943

Jim


On Mon, Apr 11, 2016 at 5:27 AM, Fabien Tarrade
<fabien.tarrade at gmail.com> wrote:
#
Hi Jim,

I didn't know about this one. I will have a look.

Thanks
Cheers
Fabien

#
On 04/10/2016 03:27 PM, Fabien Tarrade wrote:
I did this to generate and search 40 million unique strings

 > grams <- as.character(1:4e7)        ## a long time passes...
 > system.time(grep("^900001", grams)) ## similar times to grepl
    user  system elapsed
  10.384   0.168  10.543

Is that the basic task you're trying to accomplish? grep(l) goes quickly 
to C, so I don't think data.table or other packages will be markedly faster if 
you're looking for an arbitrary regular expression (use fixed=TRUE if 
you're looking for an exact match).
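A small illustration of the fixed=TRUE aside (the sample vector is made 
up):

    x <- c("900001", "1900001", "abc")
    grepl("900001", x, fixed = TRUE)  # literal substring match: TRUE  TRUE FALSE
    grepl("^900001", x)               # anchored regex:          TRUE FALSE FALSE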

If you're looking for strings that start with a pattern, then in R-3.3.0 
there is

 > system.time(res0 <- startsWith(grams, "900001"))
    user  system elapsed
   0.658   0.012   0.669

which returns the same result as grepl

 > identical(res0, res1 <- grepl("^900001", grams))
[1] TRUE

One can also parallelize the already vectorized grepl function with 
parallel::pvec, with some opportunity for gain (compared to grepl) on 
non-Windows

 > system.time(res2 <- pvec(seq_along(grams), function(i) 
grepl("^900001", grams[i]), mc.cores=8))
    user  system elapsed
  24.996   1.709   3.974
 > identical(res0, res2)
[1] TRUE

I think anything else would require pre-processing of some kind, and 
then some more detail about what your data looks like is required.
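As one sketch of such pre-processing (illustrative, not code from the 
thread): sort the strings once up front; every string sharing a given 
prefix then sits in one contiguous block, so a binary search for the 
first candidate plus a short forward scan replaces a full pass over all 
40 million entries.

    ## first index in a sorted character vector whose value is >= key
    bsearch_ge <- function(sorted, key) {
      lo <- 1L; hi <- length(sorted) + 1L
      while (lo < hi) {
        mid <- (lo + hi) %/% 2L
        if (sorted[mid] < key) lo <- mid + 1L else hi <- mid
      }
      lo
    }

    grams <- sort(as.character(1:4e7))  # pay the sorting cost once
    prefix <- "900001"
    first <- bsearch_ge(grams, prefix)  # matches, if any, start here
    last <- first
    while (last <= length(grams) && startsWith(grams[last], prefix)) {
      last <- last + 1L
    }
    hits <- if (last > first) grams[first:(last - 1L)] else character(0)

This costs O(log n) comparisons plus one startsWith per match, at the 
price of keeping the vector sorted; it assumes sort() and < use the same 
collation, which holds within a single locale.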

Martin Morgan
18 days later
#
Hi Martin and everybody,

sorry for the long delay, and thanks for all the suggestions. With my code 
and my training data I found numbers similar to those above.

Thanks

Cheers

Fabien