Message-ID: <1348597697.86677.YahooMailNeo@web142605.mail.bf1.yahoo.com>
Date: 2012-09-25T18:28:17Z
From: arun
Subject: Memory usage in R grows considerably while calculating word frequencies
In-Reply-To: <20577.44149.10365.349840@stat.math.ethz.ch>
Dear Martin,
Thanks for testing the code. You are right.
I modified the code:
If I test it for a sample text,
txt1<-"Romney A.K. different, (= than other people. Is it?"
OP's code:
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt
#words
#        A different        Is        it         K     other    people    Romney      than
#        1         1         1         1         1         1         1         1         1
#My code:
words.txt1<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]))
#       ak different        is        it     other    people    romney      than
#        1         1         1         1         1         1         1         1
Here, as you can see, the OP's code splits A.K. into two words, but my code joins it into a single token. I didn't fix that, because the main concern is minimizing memory usage.
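To make the difference between the two tokenizations concrete, here is a minimal side-by-side comparison on just the abbreviation (variable names here are illustrative, not from the thread):

```r
txt <- "Romney A.K. different"

# OP's approach: runs of letters only, so "A.K." becomes two words "A" and "K"
m <- gregexpr("\\b[A-Za-z]+\\b", txt)
op_words <- unlist(regmatches(txt, m))
# "Romney" "A" "K" "different"

# My approach: split on whitespace, then strip non-word characters,
# so "A.K." collapses into the single token "ak"
my_words <- gsub("\\W", "", unlist(strsplit(tolower(txt), "\\s")))
# "romney" "ak" "different"
```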
I then tested the new code again with a larger text (22393 words in total):
sapply(strsplit(txt1," "),length)
#[1] 4850 9072 6400 2071
sum(sapply(strsplit(txt1," "),length))
#[1] 22393
#OP's code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed
# 12.056   0.000  12.066
#My code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
words.txt<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]),decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed
#  0.148   0.000   0.150
There is a clear improvement in speed, and the output also looks similar. The code could still be improved further.
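Since the OP's real input is a 16 GByte file, one further direction worth sketching is to read the file in fixed-size chunks and accumulate the counts incrementally, so the whole text never has to sit in memory at once. This is only a sketch under the same simple whitespace tokenization as above; the function name, chunk size, and merging strategy are my own assumptions, not code from the thread:

```r
# Sketch: count word frequencies chunk by chunk to bound memory use.
count_words_chunked <- function(file, chunk_lines = 10000L) {
  con <- file(file, open = "r")
  on.exit(close(con))
  total <- integer(0)                      # running named count vector
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break
    # same simple tokenization as above: split on whitespace, strip \W
    words <- gsub("\\W", "", unlist(strsplit(tolower(lines), "\\s+")))
    words <- words[nzchar(words)]
    tab <- table(words)
    # merge this chunk's counts into the running total
    idx <- match(names(tab), names(total))
    new <- is.na(idx)
    total[idx[!new]] <- total[idx[!new]] + as.integer(tab[!new])
    total <- c(total, setNames(as.integer(tab[new]), names(tab)[new]))
  }
  sort(total, decreasing = TRUE)
}
```

Only one chunk of lines plus the (much smaller) count vector is held in memory at any time, which should avoid the blow-up the OP saw from materializing the full word vector.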
A.K.
----- Original Message -----
From: Martin Maechler <maechler at stat.math.ethz.ch>
To: arun <smartpink111 at yahoo.com>
Cc: mcelis <mcelis at lightminersystems.com>; R help <r-help at r-project.org>
Sent: Tuesday, September 25, 2012 9:07 AM
Subject: Re: [R] Memory usage in R grows considerably while calculating word frequencies
>>>>> arun <smartpink111 at yahoo.com>
>>>>>     on Mon, 24 Sep 2012 19:59:35 -0700 writes:
    > Hi,
    > In the previous email, I forgot to add unlist().
    > With four paragraphs,
    > sapply(strsplit(txt1," "),length)
    > #[1] 4850 9072 6400 2071
    > #Your code:
    > system.time({
    > txt1<-tolower(scan("text_file","character",sep="\n"))
    > pattern <- "(\\b[A-Za-z]+\\b)"
    > match <- gregexpr(pattern,txt1)
    > words.txt <- regmatches(txt1,match)
    > words.txt<-unlist(words.txt)
    > words.txt<-table(words.txt,dnn="words")
    > words.txt<-sort(words.txt,decreasing=TRUE)
    > words.txt<-paste(names(words.txt),words.txt,sep="\t")
    > cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
    > })
    > #Read 4 items
    > #   user  system elapsed
    > # 11.781   0.004  11.799
    > #Modified code:
    > system.time({
    > txt1<-tolower(scan("text_file","character",sep="\n"))
    > words.txt<-sort(table(unlist(strsplit(tolower(txt1),"\\s"))),decreasing=TRUE)
    > words.txt<-paste(names(words.txt),words.txt,sep="\t")
    > cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
    > })
    > #Read 4 items
    > #   user  system elapsed
    > #  0.036   0.008   0.043
    > A.K.
Well, dear A.K., your definition of "word" is really different,
and in my view clearly much too simplistic, compared to what the
OP (= original poster) asked for.
E.g., from the above paragraph, your method will get words such as
"A.K.,", "different,", or "(="
clearly wrongly.
Martin Maechler, ETH Zurich
    > ----- Original Message -----
    > From: mcelis <mcelis at lightminersystems.com>
    > To: r-help at r-project.org
    > Cc:
    > Sent: Monday, September 24, 2012 7:29 PM
    > Subject: [R] Memory usage in R grows considerably while calculating word frequencies
    > I am working with some large text files (up to 16 GBytes). I am interested
    > in extracting the words and counting the number of times each word appears
    > in the text. I have written a very simple R program by following some
    > suggestions and examples I found online.
    > If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory
    > when executing the program on a 64-bit system running CentOS 6.3.
    > Why is R using so much memory? Is there a better way to do this that will
    > minimize memory usage?
    > I am very new to R, so I would appreciate some tips on how to improve my
    > program or a better way to do it.
    > R program:
    > # Read in the entire file and convert all words in text to lower case
    > words.txt<-tolower(scan("text_file","character",sep="\n"))
    > # Extract words
    > pattern <- "(\\b[A-Za-z]+\\b)"
    > match <- gregexpr(pattern,words.txt)
    > words.txt <- regmatches(words.txt,match)
    > # Create a vector from the list of words
    > words.txt<-unlist(words.txt)
    > # Calculate word frequencies
    > words.txt<-table(words.txt,dnn="words")
    > # Sort by frequency, not alphabetically
    > words.txt<-sort(words.txt,decreasing=TRUE)
    > # Put into some readable form: "Name of word" and "Number of times it occurs"
    > words.txt<-paste(names(words.txt),words.txt,sep="\t")
    > # Results to a file
    > cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
    > --
    > View this message in context: http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-tp4644053.html
    > Sent from the R help mailing list archive at Nabble.com.
    > ______________________________________________
    > R-help at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-help
    > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
    > and provide commented, minimal, self-contained, reproducible code.