
speed issue: gsub on large data frame

Thanks everybody! Now I understand the need for more details:

The patterns for the gsubs are of two kinds. First, I have character strings I need to replace: around 5000 stock ticker symbols (e.g. c("AAPL", "EBAY", …)) distributed across 10 vectors.
Second, I have four vectors with regular expressions, all similar to this one: replace_url <- c("https?://.*\\s|www.*\\s")

The text strings I perform the gsub commands on look like this (no string is longer than 200 characters):

'GOOGL announced new partnership www.url.com. Stock price is up +5%'

After performing several gsubs in a row, like

dataframe$text_column <- gsub(replace_url, "[url]", dataframe$text_column)
dataframe$text_column <- gsub(replace_ticker_sp500, "[sp500_ticker]", dataframe$text_column)
etc. 

this string will look like this:

'[sp500_ticker] announced new partnership [url]. Stock price is up [positive_percentage]'


The dataset contains 4 million entries. The code works, but I cancelled the process after one day (my whole system was blocked while R was running). Performing the code on a smaller chunk of data (1 million entries) took about 12 hours. As far as I can tell, replacing the ticker symbols takes the longest, while the regular expressions run quite fast.
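For concreteness, here is a minimal reproducible sketch of the replacements above on the example string. The vectors are made-up stand-ins, the URL pattern is tightened to \S+ so the match stops at the first whitespace (an assumption on my part), and all tickers are collapsed into a single alternation instead of one gsub per symbol:

```r
## Sketch of the replacements described above.
## 'tickers' stands in for the ~5000 symbols; names are made up.
tickers <- c("AAPL", "EBAY", "GOOGL")
text <- "GOOGL announced new partnership www.url.com. Stock price is up +5%"

## URL replacement; \S+ (instead of .*) stops at the first whitespace
text <- gsub("https?://\\S+|www\\.\\S+", "[url]", text)

## One alternation over all tickers instead of one gsub() call per
## symbol, so each string is scanned once rather than ~5000 times
ticker_pattern <- paste0("\\b(", paste(tickers, collapse = "|"), ")\\b")
text <- gsub(ticker_pattern, "[sp500_ticker]", text, perl = TRUE)

## Percentage replacement
text <- gsub("\\+[0-9]+%", "[positive_percentage]", text)

text
## "[sp500_ticker] announced new partnership [url] Stock price is up [positive_percentage]"
```

On my reading, looping gsub over thousands of individual ticker symbols is the likely bottleneck, since each call rescans all 4 million strings; a single precompiled alternation avoids that.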

Thanks!



On 05.11.2013 at 11:31, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote: