speed issue: gsub on large data frame

11 messages · Jeff Newmiller, Jim Holtman, Brian Ripley +3 more

#
Hi R'lers,

I'm running into speed issues performing a bunch of

  gsub(patternvector, "[token]", dataframe$text_column)

calls on a data frame containing > 4 million entries.

(The 'patternvectors' contain up to 500 elements each.)

Is there any better/faster way than performing like 20 gsub commands in a row?


Thanks!
Simon
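
For concreteness, a minimal sketch of the setup described above, assuming each ~500-element pattern vector is collapsed into one alternation first (gsub()'s 'pattern' must be a single string; elements beyond the first are ignored with a warning):

  patternvector <- c("AAPL", "EBAY", "GOOGL")      # stand-in for ~500 patterns
  pattern <- paste(patternvector, collapse = "|")  # "AAPL|EBAY|GOOGL"
  dataframe$text_column <- gsub(pattern, "[token]", dataframe$text_column)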
#
Example not reproducible. Communication fail. Please refer to Posting Guide.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.
Simon Pickert <simon.pickert at t-online.de> wrote:
#
How's that not reproducible?

1. Data frame, one column with text strings
2. Size of data frame = 4 million observations
3. A bunch of gsubs in a row ( gsub(patternvector, "[token]", dataframe$text_column) )
4. General question: how to speed up string operations on 'large' data sets?

Please let me know what more information you need in order to reproduce this example. It's more of a general question, though I think the description above gives you a specific picture of what I'm doing right now.

On 05.11.2013 at 06:59, Jeff Newmiller <jdnewmil at dcn.davis.CA.us> wrote:
#
It is not reproducible [1] because I cannot run your (representative) example. The type of regex pattern, the token, and even the character of the data you are searching can affect possible optimizations. Note that a non-memory-resident tool such as sed or perl may be an appropriate tool for a problem like this.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
Simon Pickert <simon.pickert at t-online.de> wrote:
#
What is missing is any idea of what the 'patterns' are that you are searching for. Regular expressions are very sensitive to how you specify the pattern. You indicated that you have up to 500 elements in the pattern vector, so what does it look like? Alternation and backtracking can be very expensive, so a lot more specificity is required. There are whole books written on how pattern matching works and what is hard and what is easy; this is true wherever regular expressions are used, not just in R. Also give some idea of the timing: are you talking about 1-10-100 seconds, minutes, or hours?
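
To put rough numbers on the cost of alternation, a toy benchmark on made-up data (the patterns and sizes here are stand-ins, not the poster's actual ones):

  x   <- rep("GOOGL announced a new partnership today", 1e5)
  big <- paste(sprintf("TICK%03d", 1:500), collapse = "|")   # a 500-way alternation
  system.time(gsub(big, "[ticker]", x))                      # wide alternation: slow
  system.time(gsub("GOOGL", "[ticker]", x, fixed = TRUE))    # one fixed string: fast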

Sent from my iPad
On Nov 5, 2013, at 3:13, Simon Pickert <simon.pickert at t-online.de> wrote:

#
But note too what the help says:

Performance considerations:

      If you are doing a lot of regular expression matching, including
      on very long strings, you will want to consider the options used.
      Generally PCRE will be faster than the default regular expression
      engine, and 'fixed = TRUE' faster still (especially when each
      pattern is matched only a few times).

(and there is more).  I don't see perl=TRUE here.
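
A hedged illustration of that help text (the URL pattern is a stand-in; actual timings depend heavily on the data):

  x <- rep("GOOGL announced new partnership www.url.com", 1e5)
  system.time(gsub("https?://\\S+|www\\.\\S+", "[url]", x))               # default engine
  system.time(gsub("https?://\\S+|www\\.\\S+", "[url]", x, perl = TRUE))  # PCRE
  system.time(gsub("GOOGL", "[sp500_ticker]", x, fixed = TRUE))           # fixed = TRUE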
On 05/11/2013 09:06, Jim Holtman wrote:
#
Thanks everybody! Now I understand the need for more details:

The patterns for the gsubs are of different kinds. First, I have character strings I need to replace: around 5000 stock ticker symbols (e.g. c('AAPL', 'EBAY', ...)) distributed across 10 vectors.
Second, I have four vectors with regular expressions, all similar to this one: replace_url <- c("https?://.*\\s|www.*\\s")

The text strings I perform the gsub commands on look like this (no string is longer than 200 characters):

'GOOGL announced new partnership www.url.com. Stock price is up +5%'

After performing several gsubs in a row, like

gsub(replace_url, "[url]", dataframe$text_column)
gsub(replace_ticker_sp500, "[sp500_ticker]", dataframe$text_column)
etc.

this string will look like this:

'[sp500_ticker] announced new partnership [url]. Stock price is up [positive_percentage]'


The dataset contains 4 million entries. The code works, but I cancelled the process after 1 day (my whole system was blocked while R was running). Performing the code on a smaller chunk of data (1 million entries) took about 12 hrs. As far as I can tell, replacing the ticker symbols takes the longest, while the regular expressions went quite fast.

Thanks!
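
Since the ticker symbols are fixed strings, one possible speed-up (a sketch; 'tickers' stands in for the ~5000 symbols spread over 10 vectors) is to collapse each ticker vector into a single word-bounded alternation, so each vector costs one gsub() pass over the data instead of many:

  tickers <- c("AAPL", "EBAY", "GOOGL")
  # one pattern matching any ticker as a whole word
  pat <- paste0("\\b(", paste(tickers, collapse = "|"), ")\\b")
  dataframe$text_column <- gsub(pat, "[sp500_ticker]",
                                dataframe$text_column, perl = TRUE)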



On 05.11.2013 at 11:31, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
#
My feeling is that the **result** you want is far more easily achievable via a substitution table or a hash table. Someone better versed in those areas may want to chime in. I'm thinking more or less of splitting your character strings into vectors (separate elements at whitespace) and chunking away.

Something like

  charvec[charvec == dataframe$text_column[k]] <- dataframe$replace_column[k]
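
Read concretely, the suggestion might look like this sketch with a named lookup vector (all names hypothetical):

  lookup <- c(AAPL  = "[sp500_ticker]", EBAY = "[sp500_ticker]",
              GOOGL = "[sp500_ticker]")
  tokens <- c("GOOGL", "announced", "new", "partnership")
  hit    <- tokens %in% names(lookup)      # match whole tokens, no regex at all
  tokens[hit] <- lookup[tokens[hit]]
  tokens   # "[sp500_ticker]" "announced" "new" "partnership"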




1 day later
SPi
#
Good idea! 

I'm trying your approach right now, but I am wondering whether str_split (package 'stringr') or strsplit is the right way to go in terms of speed. I ran str_split over the text column of the data frame and it's been processing for 2 hours now...

I did: 
splittedStrings <- str_split(dataframe$text, " ")

The $text column already contains cleaned text, so there are no double blanks or unnecessary symbols, just full words.
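
For the split itself, base strsplit() with fixed = TRUE avoids the regex engine entirely, and the round trip back to strings could look like this sketch (reusing a named 'lookup' table as in the earlier suggestion; names hypothetical):

  pieces <- strsplit(dataframe$text, " ", fixed = TRUE)
  dataframe$text <- vapply(pieces, function(tok) {
      hit <- tok %in% names(lookup)        # whole-token matches only
      tok[hit] <- lookup[tok[hit]]
      paste(tok, collapse = " ")           # reassemble the cleaned string
  }, character(1))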




#
If you could, please identify which responder's idea you used, as well as the strsplit-related code you ended up with. That may help someone who browses the mail archives in the future.

Carl

