
How can this code be improved?

Jim, Dennis,

Once again, thanks for all your suggestions.  After developing a more R-like
version of the script, I terminated the original one after it had processed
976 of the 1697 reports.  At that point, it had been running for approx.
33.5 hours!  Here is the new version:

library(filehash)
db <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_TXT", type = "RDS")
dbLoad(db)
dba <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_Aux", type = "RDS")
dbLoad(dba)

tokens <- sentences.all.tokenized
stopwords <- stopwords.pubmed

# Convert to lowercase, remove beginning and end punctuation, tabulate
my.func <- function(sent, stop, ...){
	list(
		freq.table = (temp.table <- table(
			sub(
				"[[:punct:]]*$", "", sub(
					"^[[:punct:]]*", "", tolower(sent)
				)
			)
		)),
		stopword.matches = (temp.matches <- match(names(temp.table), stop)),
		stopword.summary = array(
			tapply(temp.table, !is.na(temp.matches), sum),
			dim = 2, dimnames = list(c("no.non.stopwords", "no.stopwords"))
		)
	)
}

cat("Beginning at ", date(), ".\n", sep = "")
token.tables <- 
	lapply(1:length(tokens),
		function(i.d, doc, stop, func, ...){
			if ((i.d - 1) %% 10 == 0) cat((i.d - 1), " report(s) completed at ", date(), ".\n", sep = "")
			lapply(1:length(doc[[i.d]]),
				function(i.s, sent, stop, func, ...){
					func(sent[[i.s]], stop, ...)
				}
				, sent = doc[[i.d]], stop = stop, func = func, ...
			)
		}
		,
		doc = tokens, stop = stopwords, func = my.func
	)
cat("Terminating at ", date(), ".\n", sep = "")

This script reaches the same point (976 reports) in approx. 1 hour 9 minutes,
a little under 70 minutes!
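
One detail of my.func that may be worth checking: tapply() returns a single cell when a sentence contains only stopwords or only non-stopwords, and array(..., dim = 2) then recycles that one value into both positions.  The following is a minimal sketch of a summary step that avoids this by indexing explicitly.  The toy objects `sent` and `stop` are hypothetical stand-ins for one tokenized sentence and the stopword list, and `%in%` is swapped in for match(); it is not the script's actual code.

```r
# Toy data (hypothetical): one tokenized sentence and a tiny stopword list.
sent <- c("The", "patient", "(the", "patient)", "improved.")
stop <- c("the", "a", "of")

# Lowercase, then strip leading and trailing punctuation, as in my.func.
clean <- sub("[[:punct:]]*$", "", sub("^[[:punct:]]*", "", tolower(sent)))
freq.table <- table(clean)

# Flag each distinct token as a stopword or not.
is.stop <- names(freq.table) %in% stop

# Summing over explicit logical indices yields 0 for an empty group,
# instead of a length-one tapply() result being recycled.
stopword.summary <- c(
	no.non.stopwords = sum(freq.table[!is.stop]),
	no.stopwords     = sum(freq.table[is.stop])
)
print(stopword.summary)
```

With this toy input the three distinct tokens are "improved", "patient" and "the", so the two counts add up to the five tokens in the sentence.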

What I am noticing now is a severe lack of real memory.  Activity Monitor
shows about 20MB of real memory free.  R, running in 64-bit mode, is using
6.75GB of real and 10GB of virtual memory.  I see lots of disk activity, which
is undoubtedly swapping between real and virtual memory, and CPU activity is
very low.  I suppose I could run the script twice, each time on half the
tokens.  That would give me two lists, which I would have to combine into a
single one.
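
A minimal sketch of that split-and-combine idea, under stated assumptions: `tokens` here is a small made-up list, `process.doc` is a hypothetical stand-in for the per-report work done in the script, and the intermediate results go to temporary files via saveRDS() rather than into the filehash database.

```r
# Hypothetical stand-ins so the sketch runs on its own.
tokens <- list(
	list(c("A", "first", "sentence."), c("And", "a", "second.")),
	list(c("Another", "report.")),
	list(c("The", "third", "report."))
)
process.doc <- function(doc) lapply(doc, function(sent) table(tolower(sent)))

# Assign report indices to contiguous chunks (two halves here).
n.chunks <- 2
chunk.of <- sort(rep(seq_len(n.chunks), length.out = length(tokens)))
chunks <- split(seq_along(tokens), chunk.of)

# Process one chunk at a time and write each result to disk, so that
# only one chunk's tables have to be held in memory at once.
chunk.files <- character(n.chunks)
for (k in seq_len(n.chunks)) {
	result <- lapply(tokens[chunks[[k]]], process.doc)
	chunk.files[k] <- tempfile(fileext = ".rds")
	saveRDS(result, chunk.files[k])
	rm(result)
}

# Later: read the pieces back and concatenate them into a single list.
token.tables <- do.call(c, lapply(chunk.files, readRDS))
```

Because c() on lists concatenates them without flattening their elements, the combined list has one entry per report, in the original order.  Whether the combined list fits in real memory is a separate question; if it does not, the pieces could stay on disk and be read back one at a time.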

Regards,
Richard


On Thu, 12 Nov 2009 18:53:34 -0500, jim holtman wrote
--
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.:  +41 61 331 10 47
Email:  richard.liu at pueo-owl.ch