speed issue: gsub on large data frame
It is not reproducible [1] because I cannot run your (representative) example. The type of regex pattern, token, and even the character of the data you are searching can affect possible optimizations. Note that a non-memory-resident tool such as sed or perl may be an appropriate tool for a problem like this.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

---------------------------------------------------------------------------
Jeff Newmiller
DCN:<jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
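For illustration, a runnable example of the kind asked for above might look like the sketch below. Everything in it is an assumption, since the real data and patterns were not posted: the word list, the scaled-down row count `n`, the two literal patterns, and the helper name `replace_loop` are all hypothetical. The object names `dataframe`, `text_column` and `patternvector` are taken from the question.

```r
# Hypothetical stand-in data; the real text column was not posted.
set.seed(42)
words <- c("alpha", "beta", "gamma", "delta", "epsilon")
n <- 1e4  # scaled down from the 4 million rows mentioned in the question

dataframe <- data.frame(
  text_column = replicate(
    n, paste(sample(words, 5, replace = TRUE), collapse = " ")
  ),
  stringsAsFactors = FALSE
)

# Hypothetical literal patterns (the real vectors hold up to 500 elements).
patternvector <- c("alpha", "gamma")

# Baseline: one gsub() call per pattern, applied in sequence.
# fixed = TRUE treats each pattern as a literal string, not a regex.
replace_loop <- function(x, patterns, token = "[token]") {
  for (p in patterns) {
    x <- gsub(p, token, x, fixed = TRUE)
  }
  x
}

timing <- system.time(
  result <- replace_loop(dataframe$text_column, patternvector)
)
print(timing)
```

With something like this, others can run the same code, change `n`, and compare timings of alternative approaches on identical input.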
Simon Pickert <simon.pickert at t-online.de> wrote:
How's that not reproducible?

1. Data frame, one column with text strings
2. Size of data frame = 4 million observations
3. A bunch of gsubs in a row ( gsub(patternvector, "[token]", dataframe$text_column) )
4. General question: How to speed up string operations on 'large' data sets?

Please let me know what more information you need in order to reproduce this example. It's more a general type of question, while I think the description above gives you a specific picture of what I'm doing right now.

On 05.11.2013 at 06:59, Jeff Newmiller <jdnewmil at dcn.davis.CA.us> wrote:
Example not reproducible. Communication fail. Please refer to Posting Guide.
Simon Pickert <simon.pickert at t-online.de> wrote:
Hi R'lers,

I'm running into speed issues performing a bunch of "gsub(patternvector, "[token]", dataframe$text_column)" calls on a data frame containing more than 4 million entries. (The "patternvectors" contain up to 500 elements.)

Is there any better/faster way than performing like 20 gsub commands in a row?

Thanks!
Simon
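One thing worth checking in the call above: base R's gsub() uses only the first element of a pattern vector of length greater than one (and warns about it), so a 500-element "patternvector" has to be either looped over or collapsed into a single pattern. The sketch below shows the second option, joining the elements with "|" so the text is scanned once per vector. It assumes the patterns are plain words with no regex metacharacters and that no pattern can match inside the replacement token; the sample patterns and text are hypothetical, and whether this is actually faster on the real data is something only a timing run can tell.

```r
# Hypothetical patterns and text, standing in for the real data.
patternvector <- c("foo", "bar", "baz")
text <- c("foo and bar", "nothing here", "bazbaz")

# Collapse the vector into one alternation: "foo|bar|baz".
# Assumes the elements contain no regex metacharacters.
combined <- paste(patternvector, collapse = "|")

# A single pass over the text instead of one gsub() call per pattern.
out <- gsub(combined, "[token]", text, perl = TRUE)

# Reference result from the one-call-per-pattern loop, for comparison.
ref <- text
for (p in patternvector) {
  ref <- gsub(p, "[token]", ref, fixed = TRUE)
}

print(out)
print(identical(out, ref))
```

If the patterns do contain metacharacters, they would need escaping before being joined, and overlapping patterns may give results that differ from the sequential loop.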
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.