Message-ID: <20091113152609.M20029@pueo-owl.ch>
Date: 2009-11-13T15:26:19Z
From: Richard R. Liu
Subject: How can this code be improved?
In-Reply-To: <644e1f320911121553h6c42da32y1d85e10f4fa89e57@mail.gmail.com>

Jim, Dennis,

Once again, thanks for all your suggestions.  After developing a more R-like
version of the script I terminated the running one after 976 (of 1697) reports
had been processed.  At that point, the script had been running for approx.
33.5 hours!  Here is the new version:

library(filehash)
db <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_TXT", type = "RDS")
dbLoad(db)
dba <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_Aux", type = "RDS")
dbLoad(dba)

tokens <- sentences.all.tokenized
stopwords <- stopwords.pubmed

# Convert to lowercase, remove beginning and end punctuation, tabulate
my.func <- function(sent, stop, ...){
	list(
		freq.table = (temp.table <- table(
			sub(
				"[[:punct:]]*$", "", sub(
					"^[[:punct:]]*", "", tolower(sent)
				)
			)
		)),
		stopword.matches = (temp.matches <- match(names(temp.table), stop)),
		stopword.summary = array(tapply(temp.table, !is.na(temp.matches), sum),
			dim = 2, dimnames = list(c("no.non.stopwords", "no.stopwords")))
	)
}

cat("Beginning at ", date(), ".\n", sep = "")
token.tables <- 
	lapply(1:length(tokens),
		function(i.d, doc, stop, func, ...){
			if ((i.d - 1) %% 10 == 0)
				cat((i.d - 1), " report(s) completed at ", date(), ".\n", sep = "")
			lapply(1:length(doc[[i.d]]),
				function(i.s, sent, stop, func, ...){
					func(sent[[i.s]], stop, ...)
				}
				, sent = doc[[i.d]], stop = stop, func = func, ...
			)
		}
		,
		doc = tokens, stop = stopwords, func = my.func
	)
cat("Terminating at ", date(), ".\n", sep = "")

This script reaches the same point in approx. 1:09 hours, a little under 70
minutes!
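
A caveat on stopword.summary in my.func above: tapply() returns only the groups that actually occur, so for a sentence containing no stopwords (or only stopwords) it yields a single value, and array(..., dim = 2) would recycle that value into both cells.  A minimal sketch of one way around this, assuming the same inputs as my.func, is to compute the two totals directly.  The name summarize.sentence and the toy data are hypothetical, not part of the script above.

```r
# Hypothetical helper: same cleaning steps as my.func, but the two
# totals are computed separately so neither can be recycled.
summarize.sentence <- function(sent, stop) {
  cleaned <- sub("[[:punct:]]*$", "", sub("^[[:punct:]]*", "", tolower(sent)))
  freq.table <- table(cleaned)
  is.stop <- names(freq.table) %in% stop
  list(
    freq.table = freq.table,
    stopword.matches = match(names(freq.table), stop),
    stopword.summary = c(
      no.non.stopwords = sum(freq.table[!is.stop]),
      no.stopwords     = sum(freq.table[is.stop])
    )
  )
}

# Toy data (made up for illustration): a sentence with no stopwords
toy.stop <- c("the", "of")
res <- summarize.sentence(c("Lesion", "lesion,", "(MRI)"), toy.stop)
res$stopword.summary
```

Here the stopword count comes out as zero instead of repeating the non-stopword total.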

What I am noticing now is a severe lack of real memory.  Activity Monitor
shows about 20MB of real memory free.  R, running in 64-bit mode, is using
6.75GB of real and 10GB of virtual memory.  I see lots of disk activity.  This
is undoubtedly the swapping between real and virtual memory.  CPU activity is
very low.  I suppose I could run the script twice, each time on half the
tokens.  That would give me two lists, which I would have to combine into a
single one.
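
Splitting the run and recombining could look roughly like the sketch below.  It is only an illustration on toy data: toy.tokens, count.tokens and process.block are hypothetical names, and count.tokens stands in for my.func.  Since the results are plain lists, c() is enough to concatenate them in report order.

```r
# Toy stand-in for the real tokens list (hypothetical data)
toy.tokens <- list(
  list(c("The", "scan"), c("No", "lesion.")),
  list(c("Normal", "study")),
  list(c("Follow-up", "in", "6", "months"))
)

# Hypothetical per-sentence function; my.func would take its place
count.tokens <- function(sent) table(tolower(sent))

# Process one contiguous block of reports
process.block <- function(idx, doc, func) {
  lapply(doc[idx], function(report) lapply(report, func))
}

# Split the report indices into two contiguous halves
n <- length(toy.tokens)
halves <- split(seq_len(n), ceiling(seq_len(n) / ceiling(n / 2)))

first  <- process.block(halves[[1]], toy.tokens, count.tokens)
second <- process.block(halves[[2]], toy.tokens, count.tokens)

# c() on two lists concatenates them, preserving report order
combined <- c(first, second)
length(combined)
```

To actually relieve the memory pressure, each half would presumably have to be written to disk (for example with save()) and removed from the session before the next half is processed.  Holding both halves in one session, as this sketch does, uses as much memory as a single run.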

Regards,
Richard


On Thu, 12 Nov 2009 18:53:34 -0500, jim holtman wrote
> Run the script on a small subset of the data and use Rprof to profile
> the code.  This will give you an idea of where time is being spent 
> and where to focus for improvement.  I would suggest that you do not 
> convert the output of 'table(t)' to a data frame.  You can just 
> extract the 'names' to get the words.  You might be spending some of 
> the time accessing the information in the data frame, which is 
> really not necessary for your code.
> 
> On Thu, Nov 12, 2009 at 2:12 AM, Richard R. Liu <richard.liu at pueo-owl.ch>
> wrote:
> > I am running the following code on a MacBook Pro 17" Unibody early 2009 with
> > 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in 64-bit mode.
> >
> > freq.stopwords <- numeric(0)
> > freq.nonstopwords <- numeric(0)
> > token.tables <- list(0)
> > i.ss <- c(0)
> > cat("Beginning at ", date(), ".\n")
> > for (i.d in 1:length(tokens)) {
> >     tt <- list(0)
> >     for (i.s in 1:length(tokens[[i.d]])) {
> >         t <- tolower(tokens[[i.d]][[i.s]])
> >         t <- sub("^[[:punct:]]*", "", t)
> >         t <- sub("[[:punct:]]*$", "", t)
> >         t <- as.data.frame(table(t))
> >         i.m <- match(t$t, stopwords)
> >         i.m.is.na <- is.na(i.m)
> >         i.ss <- i.ss + 1
> >         freq.stopwords[i.ss] <- sum(t$Freq * !i.m.is.na)
> >         freq.nonstopwords[i.ss] <- sum(t$Freq * i.m.is.na)
> >         tt[[i.s]] <- data.frame(token = t$t, freq = t$Freq,
> >             matches.stopword = i.m)
> >     }
> >     token.tables[[i.d]] <- tt
> >     if (i.d %% 5 == 0) cat(i.d, "reports completed at ", date(), ".\n")
> > }
> > cat("Terminating at ", date(), ".\n")
> >
> > The objects in the innermost loop are:
> > * tokens:  a list of lists.  In the expression tokens[[i.d]][[i.s]], the
> > first index runs over 1697 reports, the second over the sentences in the
> > report, each of which consists of a vector of tokens, i.e., the character
> > strings between the white spaces in the sentence.  One of the largest
> > reports takes up 58MB on the hard disk.  Thus, the number of sentences can
> > be quite large, and some of the sentences are quite long (measured in
> > tokens as well as in characters).
> > * stopwords:  a vector of 571 words that occur very often in written
> > English.
> >
> > The code operates on sentences, converting each token in the sentence to
> > lowercase, removing punctuation at the beginning and end of the token,
> > tabulating the frequency of the unique tokens, and generating an array that
> > indicates which tokens correspond to stopwords.  It also sums the
> > frequencies of the stopwords and that of the non-stopwords.  The result is a
> > list of lists of data frames.
> >
> > I began running on Thursday Nov. 12, 2009 at 01:56:36.  As of 7:52:00, 510
> > reports had been processed.  The Activity Monitor indicates no memory
> > bottleneck.  R is using 4.31 GB of real memory, 7.23 GB of virtual memory,
> > and 1.67 GB of real memory are free.
> >
> > I admit that I am an R newbie.  From my understanding of the "apply"
> > functions (e.g., lapply), I see no way to use them to simplify the loops.  I
> > would appreciate any suggestions about making the code more "R-like" and,
> > above all, much faster.
> >
> > Regards,
> > Richard
> 
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
> 
> What is the problem that you are trying to solve?


--
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.:  +41 61 331 10 47
Email:  richard.liu at pueo-owl.ch