Hi,
I'm using the XML package to scrape data and I'm trying to figure out
how to eliminate the memory leak I'm currently experiencing. In the
searches I've done, it sounds like the existence of the leak is fairly
well known. What isn't as clear is exactly how to solve it. The
general process I'm using is this:
require(XML)

myFunction <- function(URL) {
    html <- readLines(URL)
    tables <- readHTMLTable(html, stringsAsFactors = FALSE)
    myData <- data.frame(Value = tables[[1]][, 2],
                         row.names = make.unique(tables[[1]][, 1]),
                         stringsAsFactors = FALSE)
    rm(list = c("html", "tables"))  # here, and
    free(tables)                    # here, my attempt to solve the memory leak
    return(myData)
}

x <- lapply(myURLs, myFunction)

I've tried using rm() and free() to release the memory each time the
function is called, but it hasn't worked as far as I can tell. By the
time lapply has finished working through my list of URLs, I'm swapping
about 3GB of memory.
I've also tried using gc(), but that also seems to have no effect on
the problem.
I'm running RStudio 0.96.330 and the latest version of XML.
R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
Any suggestions on how to solve this memory issue? Thanks.
James
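
For reference, free() in the XML package is documented for internal (C-level) document objects, whereas in the function above it is called on a list of data frames that rm() has already removed. A minimal sketch of the pattern that free() is meant for follows, assuming the page is parsed explicitly with htmlParse() so that there is a document object to release. The helper name scrapeFirstTable and the inline test page are hypothetical, and the thread does not establish whether this alone cures the leak on a given version of the package.

```r
library(XML)

# Hypothetical helper (not from the thread): parse explicitly so that the
# internal document can be released once the tables have been copied out.
scrapeFirstTable <- function(html) {
    # Collapse the lines returned by readLines() into a single string.
    doc <- htmlParse(paste(html, collapse = "\n"), asText = TRUE)
    on.exit(free(doc), add = TRUE)  # release the C-level document on exit
    tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
    data.frame(Value = tables[[1]][, 2],
               row.names = make.unique(tables[[1]][, 1]),
               stringsAsFactors = FALSE)
}

# Small inline page standing in for readLines(URL); an assumption for the demo.
page <- c("<html><body><table>",
          "<tr><th>Name</th><th>Value</th></tr>",
          "<tr><td>a</td><td>1</td></tr>",
          "<tr><td>b</td><td>2</td></tr>",
          "</table></body></html>")
result <- scrapeFirstTable(page)
print(result)
```

The data frames returned by readHTMLTable() are ordinary R objects, so they remain usable after the document has been freed.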
memory leak using XML readHTMLTable
6 messages · James, Duncan Temple Lang, Yihui Xie
Hi James
Unfortunately, I am not certain if the "latest version"
of the XML package has the garbage collection activated for the nodes.
It is quite complicated and that feature was turned off in some versions
of the package. I suggest that you install the version of the package on github
git at github-omg:omegahat/XML.git
I believe that will handle the garbage collection of nodes, and I'd like
to know if it doesn't.
Best,
D.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
I think the correct address for GIT should be
git://github.com/omegahat/XML.git :)
Or just https://github.com/omegahat/XML
Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Phone: 515-294-2465 Web: http://yihui.name
Department of Statistics, Iowa State University
2215 Snedecor Hall, Ames, IA
Hi,
Thanks for your response, and I'm sorry, I should have been more
specific regarding the version of XML. I'm using XML 3.9-4.
As a follow-on question: is there a preferable way to install this
version of XML from github? Do I have to use git to clone it, or can I
use the install_github function from Hadley's devtools package?
I note that the README indicates that:
"This R package is not in the R package format in the github
repository. It was initially developed in 1999 and was intended for
use in both S-Plus and R and so requires a different structure for
each."
So I was wondering what the general procedure is and whether there's
anything special I need to do to install it.
Thanks.
James
Thanks Yihui for normalizing my customized git URL.
The version of the package on github is in the standard R format
and that part of the README is no longer relevant. Sorry for the confusion.
It might be simplest to pick up a tar.gz file of the source at
http://www.omegahat.org/RSXML/XML_3.94-0.tar.gz
D
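
Installing from a source tarball like the one mentioned above can be sketched as follows. The URL is the one given in the message and may no longer be reachable. The function name install_xml_from_tarball is hypothetical, and building from source assumes a working compiler toolchain and the libxml2 development files on the machine.

```r
# Hypothetical helper (not from the thread): download a source tarball
# and install it as a local source package.
install_xml_from_tarball <- function(url) {
    tarball <- file.path(tempdir(), basename(url))
    download.file(url, destfile = tarball, mode = "wb")
    # repos = NULL tells install.packages() to treat the path as a local file.
    install.packages(tarball, repos = NULL, type = "source")
    # Report the version that is now installed.
    packageVersion("XML")
}

# Usage, left commented out because it needs network access and a compiler:
# install_xml_from_tarball("http://www.omegahat.org/RSXML/XML_3.94-0.tar.gz")
```

Restart R after installing so that the new version is the one loaded by require(XML).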
1 day later
Duncan,
Thank you for the suggestion to use the alternative version. That
version worked perfectly. Going forward, is there any way to know by
the version number which versions of XML have the garbage collection
turned on?
Thanks again for the package, and the help in sorting out the memory
issue.
Best,
James