I am running the following code on a MacBook Pro 17" Unibody early
2009 with 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in
64-bit mode.
freq.stopwords <- numeric(0)
freq.nonstopwords <- numeric(0)
token.tables <- list(0)
i.ss <- c(0)
cat("Beginning at ", date(), ".\n")
for (i.d in 1:length(tokens)) {
    tt <- list(0)
    for (i.s in 1:length(tokens[[i.d]])) {
        t <- tolower(tokens[[i.d]][[i.s]])
        t <- sub("^[[:punct:]]*", "", t)
        t <- sub("[[:punct:]]*$", "", t)
        t <- as.data.frame(table(t))
        i.m <- match(t$t, stopwords)
        i.m.is.na <- is.na(i.m)
        i.ss <- i.ss + 1
        freq.stopwords[i.ss] <- sum(t$Freq * !i.m.is.na)
        freq.nonstopwords[i.ss] <- sum(t$Freq * i.m.is.na)
        tt[[i.s]] <- data.frame(token = t$t, freq = t$Freq, matches.stopword = i.m)
    }
    token.tables[[i.d]] <- tt
    if (i.d %% 5 == 0) cat(i.d, "reports completed at ", date(), ".\n")
}
cat("Terminating at ", date(), ".\n")
The objects in the innermost loop are:
* tokens: a list of lists. In the expression tokens[[i.d]][[i.s]],
the first index runs over 1697 reports, the second over the sentences
in the report, each of which consists of a vector of tokens, i.e., the
character strings between the white spaces in the sentence. One of
the largest reports takes up 58 MB on the hard disk. Thus, the number
of sentences can be quite large, and some of the sentences are quite
long (measured in tokens as well as in characters).
* stopwords: a vector of 571 words that occur very often in
written English.
The code operates on sentences, converting each token in the sentence
to lowercase, removing punctuation at the beginning and end of the
token, tabulating the frequency of the unique tokens, and generating
an array that indicates which tokens correspond to stopwords. It also
sums the frequencies of the stopwords and those of the non-stopwords.
The result is a list of lists of data frames.
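For illustration, here is the per-sentence processing applied to one
made-up sentence and a tiny stopword vector (not my real data):

sent <- c("The", "cat,", "the", "dog,", "and", "the", "mouse.")
stop.small <- c("the", "and", "of")
t <- tolower(sent)                  # lowercase each token
t <- sub("^[[:punct:]]*", "", t)    # strip leading punctuation
t <- sub("[[:punct:]]*$", "", t)    # strip trailing punctuation
t <- as.data.frame(table(t))        # frequency of each unique token
i.m <- match(t$t, stop.small)       # position in the stopword list, NA if none
data.frame(token = t$t, freq = t$Freq, matches.stopword = i.m)
#   token freq matches.stopword
# 1   and    1                2
# 2   cat    1               NA
# 3   dog    1               NA
# 4 mouse    1               NA
# 5   the    3                1
sum(t$Freq * !is.na(i.m))           # 4 stopword tokens ("the" x 3, "and" x 1)
sum(t$Freq * is.na(i.m))            # 3 non-stopword tokens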
I began running on Thursday, Nov. 12, 2009 at 01:56:36. As of 07:52:00,
510 reports had been processed. The Activity Monitor indicates no
memory bottleneck. R is using 4.31 GB of real memory and 7.23 GB of
virtual memory, and 1.67 GB of real memory is free.
I admit that I am an R newbie. From my understanding of the "apply"
functions (e.g., lapply), I see no way to use them to simplify the
loops. I would appreciate any suggestions about making the code more
"R-like" and, above all, much faster.
Regards,
Richard
How can this code be improved?
4 messages · jim holtman, Richard R. Liu
Run the script on a small subset of the data and use Rprof to profile the code. This will give you an idea of where time is being spent and where to focus for improvement. I would suggest that you do not convert the output of 'table(t)' to a data frame. You can just extract the 'names' to get the words. You might be spending some of the time accessing the information in the data frame, which is really not necessary for your code.
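Something along these lines (a minimal sketch; it assumes your 'tokens' and 'stopwords' objects, and the subset size of 10 reports and the output file name are arbitrary):

Rprof("token-profile.out")               # start the profiler
for (i.d in 1:10) {                      # small subset: the first 10 reports only
    for (s in tokens[[i.d]]) {
        t <- tolower(s)
        t <- sub("^[[:punct:]]*", "", t)
        t <- sub("[[:punct:]]*$", "", t)
        tab <- table(t)                  # a named table; no data frame needed
        i.m <- match(names(tab), stopwords)
    }
}
Rprof(NULL)                              # stop profiling
summaryRprof("token-profile.out")        # see which calls dominate the run time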
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
Jim and Dennis,

Thanks for your suggestions. Almost 24 hours later, the script has finished a bit more than half the reports. Free RAM varies between 1.2 GB and a few MB. I hesitate to interrupt it in order to implement the improvements that you have suggested, in case they do not decrease the execution time by at least an order of magnitude; however, I will definitely implement and test your improvements as well as my own.

Regards,
Richard
Jim, Dennis,
Once again, thanks for all your suggestions. After developing a more R-like
version of the script, I terminated the running one after 976 (of 1697) reports
had been processed. At that point, the script had been running for approx.
33.5 hours! Here is the new version:
library(filehash)
db <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_TXT", type = "RDS")
dbLoad(db)
dba <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_Aux", type = "RDS")
dbLoad(dba)
tokens <- sentences.all.tokenized
stopwords <- stopwords.pubmed

# Convert to lowercase, remove beginning and end punctuation, tabulate
my.func <- function(sent, stop, ...) {
    list(
        freq.table = (temp.table <- table(
            sub("[[:punct:]]*$", "",
                sub("^[[:punct:]]*", "", tolower(sent))))),
        stopword.matches = (temp.matches <- match(names(temp.table), stop)),
        stopword.summary = array(tapply(temp.table, !is.na(temp.matches), sum),
                                 dim = 2,
                                 dimnames = list(c("no.non.stopwords", "no.stopwords")))
    )
}

cat("Beginning at ", date(), ".\n", sep = "")
token.tables <-
    lapply(1:length(tokens),
           function(i.d, doc, stop, func, ...) {
               if ((i.d - 1) %% 10 == 0)
                   cat((i.d - 1), " report(s) completed at ", date(), ".\n", sep = "")
               lapply(1:length(doc[[i.d]]),
                      function(i.s, sent, stop, func, ...) {
                          func(sent[[i.s]], stop, ...)
                      },
                      sent = doc[[i.d]], stop = stop, func = func, ...)
           },
           doc = tokens, stop = stopwords, func = my.func)
cat("Terminating at ", date(), ".\n", sep = "")
This script reaches the same point in approx. 1:09 hours, a little under 70
minutes!
What I am noticing now is a severe lack of real memory. Activity Monitor
shows about 20 MB of real memory free. R, running in 64-bit mode, is using
6.75 GB of real and 10 GB of virtual memory. I see lots of disk activity,
which is undoubtedly the swapping between real and virtual memory. CPU
activity is very low. I suppose I could run the script twice, each time on
half the tokens. That would give me two lists, which I would then have to
combine into a single one, along the lines sketched below.
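Something like the following sketch, where process.reports is just a
hypothetical wrapper around the nested lapply call above:

# Hypothetical wrapper: apply my.func to every sentence of every report in doc.
process.reports <- function(doc, stop, func) {
    lapply(doc, function(report) lapply(report, func, stop = stop))
}
half <- ceiling(length(tokens) / 2)
part1 <- process.reports(tokens[1:half], stopwords, my.func)
# If memory is tight: save(part1, file = "part1.RData"); rm(part1); gc() here,
# then load("part1.RData") again before combining.
part2 <- process.reports(tokens[(half + 1):length(tokens)], stopwords, my.func)
token.tables <- c(part1, part2)    # combine the two lists into one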
Regards,
Richard
--
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland
Tel.: +41 61 331 10 47
Email: richard.liu at pueo-owl.ch