I am working with some large text files (up to 16 GBytes). I am interested
in extracting the words and counting how many times each word appears in the
text. I have written a very simple R program by following some suggestions
and examples I found online.

If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory
when executing the program on a 64-bit system running CentOS 6.3. Why is R
using so much memory? Is there a better way to do this that will minimize
memory usage?

I am very new to R, so I would appreciate some tips on how to improve my
program or a better way to do it.
R program:
# Read in the entire file and convert all words in text to lower case
words.txt<-tolower(scan("text_file","character",sep="\n"))
# Extract words
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,words.txt)
words.txt <- regmatches(words.txt,match)
# Create a vector from the list of words
words.txt<-unlist(words.txt)
# Calculate word frequencies
words.txt<-table(words.txt,dnn="words")
# Sort by frequency, not alphabetically
words.txt<-sort(words.txt,decreasing=TRUE)
# Put into some readable form: "Name of word" and "Number of times it occurs"
words.txt<-paste(names(words.txt),words.txt,sep="\t")
# Results to a file
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
Memory usage in R grows considerably while calculating word frequencies
7 messages · mcelis, arun, Milan Bouchet-Valat +2 more
Hi,
In a text file of 6834 words, I compared your program with a modified program.
sapply(strsplit(txt1," "),length)
#[1] 6834
#your program
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#   user  system elapsed
#  0.208   0.000   0.206
#Modified code
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
words.txt<-sort(table(strsplit(tolower(txt1),"\\s")),decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#   user  system elapsed
#  0.016   0.000   0.014
A.K.
Hi,
In the previous email, I forgot to add unlist().
With four paragraphs,
sapply(strsplit(txt1," "),length)
#[1] 4850 9072 6400 2071
#Your code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed
# 11.781   0.004  11.799
#Modified code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
words.txt<-sort(table(unlist(strsplit(tolower(txt1),"\\s"))),decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed
#  0.036   0.008   0.043
A.K.
On Monday, 24 September 2012 at 16:29 -0700, mcelis wrote:
> [...]
First, I think you should have a look at the tm package by Ingo Feinerer. It will help you to import the texts, optionally run processing steps on them, and then extract the words and create a document-term matrix counting their frequencies. No need to reinvent the wheel.

Second, there's nothing wrong with using RAM as long as it's available. If other programs need it, Linux will reclaim it. There's a problem only if R's memory use does not go down at that point. Use gc() to check whether the RAM allocated to R is really in use. But tm should improve the efficiency of the computations.

My two cents
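For concreteness, a minimal sketch along those lines (the function names assume a reasonably recent tm release, and nothing here has been tried on multi-gigabyte files):

library(tm)

corp <- VCorpus(VectorSource(readLines("text_file")))   # one document per line
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# keep 1- and 2-letter words too (the default drops words shorter than 3 characters)
dtm  <- DocumentTermMatrix(corp, control = list(wordLengths = c(1, Inf)))
freq <- sort(slam::col_sums(dtm), decreasing = TRUE)    # one total count per word

cat("Word\tFREQ", paste(names(freq), freq, sep = "\t"),
    file = "frequencies", sep = "\n")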
arun <smartpink111 at yahoo.com> on Mon, 24 Sep 2012 19:59:35 -0700 writes:
> [...]
Well, dear A.K., your definition of "word" is really different,
and in my view clearly much too simplistic, compared to what the
OP (= original poster) asked for.
E.g., from the paragraph above, your method will wrongly count
"A.K.,", "different,", or "(=" as words.
Martin Maechler, ETH Zurich
Dear Martin,
Thanks for testing the code. You are right. I modified the code.
If I test it with a sample text:
txt1<-"Romney A.K. different, (= than other people.? Is it?"
OP's code:
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt
#words
#        A different        Is        it         K     other    people    Romney      than
#        1         1         1         1         1         1         1         1         1
#My code:
words.txt1<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]))
#       ak different        is        it     other    people    romney      than
#        1         1         1         1         1         1         1         1
Here, as you can see, the OP's code splits "A.K." into two words, while my code joins it into "ak". I didn't fix that because the main concern is minimizing memory usage.
I again tested the new code with a text of 22393 words:
sapply(strsplit(txt1," "),length)
#[1] 4850 9072 6400 2071
sum(sapply(strsplit(txt1," "),length))
#[1] 22393
#OP's code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed
# 12.056   0.000  12.066
#My code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
words.txt<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]),decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed
#  0.148   0.000   0.150
There is an improvement in speed, and the output also looks similar. This code may still be improved.
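For instance, splitting directly on runs of non-letters would give the same letters-only words in a single pass (a sketch only, not benchmarked here):

words <- unlist(strsplit(tolower(txt1), "[^a-z]+"))
words <- words[words != ""]                 # drop empty strings left at the breaks
words.txt <- sort(table(words), decreasing=TRUE)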
A.K.
On 25/09/12 01:29, mcelis wrote:
> [...]
Just an idea (I have no experience with what you want to do, so it might not work): what about putting the text in a database (SQLite comes to mind) where each word is one entry? Then you could use SQL to query the database, which should need much less memory. In addition, it should make further processing much easier.

Cheers,
Rainer
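A rough sketch of that idea with a current DBI/RSQLite, reading the file in chunks so the whole text never has to sit in memory (the database file name, table name, and chunk size are only illustrative, and this has not been tried at the 16 GB scale):

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "words.db")
dbExecute(con, "CREATE TABLE IF NOT EXISTS words (word TEXT)")

fh <- file("text_file", "r")
repeat {
  lines <- readLines(fh, n = 10000)          # read the text in chunks
  if (length(lines) == 0) break
  w <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  w <- w[w != ""]                            # letters-only words, as in the OP's regex
  dbWriteTable(con, "words", data.frame(word = w), append = TRUE)
}
close(fh)

# let SQLite do the counting and the sorting
freq <- dbGetQuery(con,
  "SELECT word, COUNT(*) AS freq FROM words GROUP BY word ORDER BY freq DESC")
dbDisconnect(con)

write.table(freq, "frequencies", sep = "\t", quote = FALSE, row.names = FALSE)

Aggregating the counts per chunk (e.g. with table()) before inserting would keep the database much smaller than one row per word occurrence.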