
Message-ID: <e6ccc3b3-ea29-4809-be85-7d998abe3cb2@email.android.com>
Date: 2013-11-05T08:31:16Z
From: Jeff Newmiller
Subject: speed issue: gsub on large data frame
In-Reply-To: <2925DAD9-CD46-4303-973A-A8C5A5F12B9A@t-online.de>

It is not reproducible [1] because I cannot run your (representative) example. The type of regex pattern, the token, and even the character of the data you are searching can all affect possible optimizations. Note that a non-memory-resident tool such as sed or perl may be an appropriate tool for a problem like this.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
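A minimal sketch of the usual first optimization for this kind of task (the data frame, patterns, and token below are made-up stand-ins, since the original data was not posted): if many patterns map to the same token, collapse the pattern vector into a single alternation so each string is scanned once per token rather than once per pattern. This assumes the patterns are plain strings with no regex metacharacters; they would need escaping otherwise.

```r
# Stand-in data: the original 4-million-row data frame is not available,
# so a small data frame with the same shape is used here.
df <- data.frame(text_column = c("visit foo.com", "bar and baz", "nothing here"),
                 stringsAsFactors = FALSE)

# Stands in for one of the ~500-element pattern vectors from the post.
patternvector <- c("foo", "bar", "baz")

# One alternation pattern means 1 gsub call instead of 500;
# perl = TRUE often speeds up matching on large character vectors.
pattern <- paste(patternvector, collapse = "|")

df$text_column <- gsub(pattern, "[token]", df$text_column, perl = TRUE)
df$text_column
# c("visit [token].com", "[token] and [token]", "nothing here")
```

If each pattern instead maps to a different replacement, per-pattern gsub calls with fixed = TRUE, or stringi::stri_replace_all_fixed() with vectorize_all = FALSE, are common alternatives; and as noted above, streaming the file through sed or perl avoids holding all 4 million rows in memory at once.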
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

Simon Pickert <simon.pickert at t-online.de> wrote:
>How's that not reproducible?
>
>1. Data frame, one column with text strings
>2. Size of data frame= 4million observations
>3. A bunch of gsubs in a row (  gsub(patternvector,
>'[token]', dataframe$text_column)  )
>4. General question: How to speed up string operations on 'large' data
>sets?
>
>
>Please let me know what more information you need in order to reproduce
>this example.
>It's more a general type of question, while I think the description
>above gives you a specific picture of what I'm doing right now.
>
>
>
>
>
>
>On 05.11.2013 at 06:59, Jeff Newmiller
><jdnewmil at dcn.davis.CA.us> wrote:
>
>> Example not reproducible. Communication fail. Please refer to the
>> Posting Guide.
>>
>> 
>> Simon Pickert <simon.pickert at t-online.de> wrote:
>>> Hi R'lers,
>>> 
>>> I'm running into speed issues, performing a bunch of
>>> 
>>> gsub(patternvector, '[token]', dataframe$text_column)
>>> 
>>> on a data frame containing >4 million entries.
>>> 
>>> (The 'patternvectors' contain up to 500 elements) 
>>> 
>>> Is there any better/faster way than performing like 20 gsub commands
>>> in a row?
>>> 
>>> 
>>> Thanks!
>>> Simon
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>