Memory usage in R grows considerably while calculating word frequencies

Dear Martin,

Thanks for testing the code. You are right.
I have modified the code.

Testing it on a sample text:

txt1<-"Romney A.K. different, (= than other people. Is it?"
#OP's code:
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt
#words
#        A different        Is        it         K     other    people    Romney 
#        1         1         1         1         1         1         1         1 
#     than 
#        1 

#My code:

words.txt1<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]))
words.txt1
#       ak different        is        it     other    people    romney      than 
#        1         1         1         1         1         1         1         1 

As you can see, the OP's code splits "A.K." into two words ("A" and "K"), whereas my code joins it into one ("ak"). I did not fix this because the main concern here is to minimize memory usage.
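To make that difference easier to see, here is a minimal sketch that breaks my one-liner into separate steps on the same sample sentence. The variable names (tokens, cleaned, kept) are only for illustration and are not part of the code above, and nzchar() stands in for the grepl() filter, which I assume selects the same tokens once the non-word characters are gone.

```r
# Sample sentence from this message.
txt1 <- "Romney A.K. different, (= than other people. Is it?"

# Step 1: lower-case the text and split it on whitespace.
tokens <- unlist(strsplit(tolower(txt1), "\\s"))

# Step 2: delete every non-word character inside each token.
# "a.k." collapses to "ak" and "(=" becomes an empty string.
cleaned <- gsub("\\W", "", tokens)

# Step 3: drop the tokens that are now empty.
kept <- cleaned[nzchar(cleaned)]

print(tokens)
print(cleaned)
print(kept)
```

Because the dots are removed after the split, "a.k." can never be counted as two words here; the OP's pattern matches runs of letters directly, so it stops at each dot.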

I tested the new code again on a text file of four lines with these word counts per line:
sapply(strsplit(txt1," "),length)
#[1] 4850 9072 6400 2071
sum(sapply(strsplit(txt1," "),length))
#[1] 22393
That is 22393 words in total.

#OP's code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed 
# 12.056   0.000  12.066 

#My code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
words.txt<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]),decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed 
#  0.148   0.000   0.150 

There is a clear improvement in speed (about 0.15 s elapsed versus about 12 s). The output also looked similar. The code could probably still be improved.
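One possible improvement, offered only as a sketch: the one-liner runs strsplit() and gsub() twice on the whole text, once to build the tokens and once to build the filter. Storing the cleaned tokens once should avoid that repeated work. The function name word_freq and the two test sentences are hypothetical, and I have not timed or memory-profiled this version.

```r
# Hypothetical helper (not from the thread): count words in a character vector.
word_freq <- function(txt) {
  # Split on whitespace once and strip non-word characters once.
  tokens <- gsub("\\W", "", unlist(strsplit(tolower(txt), "\\s")))
  # After stripping, a token is either empty or made only of word
  # characters, so nzchar() is assumed to keep the same tokens as the
  # grepl("\\b\\w+\\b", ...) filter in the one-liner.
  tokens <- tokens[nzchar(tokens)]
  sort(table(tokens, dnn = "words"), decreasing = TRUE)
}

# Small made-up input, just to show the call.
freq <- word_freq(c("The cat saw the dog.", "The dog ran."))
print(freq)
```

The result is a sorted table like the one produced above, so the paste() and cat() lines that write the "frequencies" file should work on it unchanged.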
A.K.

----- Original Message -----
From: Martin Maechler <maechler at stat.math.ethz.ch>
To: arun <smartpink111 at yahoo.com>
Cc: mcelis <mcelis at lightminersystems.com>; R help <r-help at r-project.org>
Sent: Tuesday, September 25, 2012 9:07 AM
Subject: Re: [R] Memory usage in R grows considerably while calculating word frequencies
