Failure message in R on Mac with xmlTreeParse

2 messages · Armin Goralczyk, Duncan Temple Lang

#
Hello

In the following thread (R-help) the possibilities of analyzing
publications from pubmed via XML were discussed:

http://www.nabble.com/Analyzing-Publications-from-Pubmed-via-XML-to14328779.html#a14343090

Using xmlTreeParse in a function produces an error message on my
Mac that does not occur in R for Windows:
+ 	srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
+ 	srch.mode <- "db=pubmed&retmax=10000&retmode=xml&term="
+ 	doc <-xmlTreeParse(paste(srch.stem,srch.mode,term,sep=""),isURL = TRUE,
+ 		useInternalNodes = TRUE)
+ 	sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue)
+ 	}
Error in .Call("RS_XML_ParseTree", as.character(file), handlers,
as.logical(ignoreBlanks),  :
  error in creating parser for
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=10000&retmode=xml&term=meyer[au]
I/O warning : failed to load external entity
"http%3A//eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi%3Fdb=pubmed&retmax=10000&retmode=xml&term=meyer%5Bau%5D"
The problem seems to be the search tag [au].
I am not very familiar with XML or the xmlTreeParse function, so I
don't know what is wrong. Can anybody help?

Thanks

My version:
$platform
[1] "powerpc-apple-darwin8.10.1"

$arch
[1] "powerpc"

$os
[1] "darwin8.10.1"

$system
[1] "powerpc, darwin8.10.1"

$status
[1] "Patched"

$major
[1] "2"

$minor
[1] "6.0"

$year
[1] "2007"

$month
[1] "11"

$day
[1] "09"

$`svn rev`
[1] "43408"

$language
[1] "R"

$version.string
[1] "R version 2.6.0 Patched (2007-11-09 r43408)"
#
The [au] portion seems to be causing the problem.
Escape the [ and ] by mapping them to %5B and %5D respectively
_before_ handing the URL string to xmlTreeParse().  (The error message
suggests that the internals have already performed this conversion, but
doing it yourself should still work: I can reproduce your error message,
and I get the desired result by escaping the [ and ] first.)
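
A minimal sketch of that escaping step, assuming the URL is assembled from
the same srch.stem and srch.mode strings as in the original post. The helper
names escape_brackets() and build_search_url() are made up for illustration
and are not part of the XML package:

```r
# Base URL pieces copied from the original post.
srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
srch.mode <- "db=pubmed&retmax=10000&retmode=xml&term="

# Hypothetical helper: percent-encode only the square brackets.
# fixed = TRUE makes gsub() treat "[" and "]" as literal characters.
escape_brackets <- function(x) {
  x <- gsub("[", "%5B", x, fixed = TRUE)
  gsub("]", "%5D", x, fixed = TRUE)
}

# Hypothetical helper: build the full query URL for one search term.
build_search_url <- function(term) {
  paste(srch.stem, srch.mode, escape_brackets(term), sep = "")
}

url <- build_search_url("meyer[au]")
print(url)
```

The resulting string can then be handed to xmlTreeParse(url, isURL = TRUE,
useInternalNodes = TRUE) as in the original function. If the search term may
contain other special characters, utils::URLencode(term, reserved = TRUE) is
probably a more general choice than replacing the brackets by hand.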

There is more information about what needs to be escaped at
http://publib.boulder.ibm.com/infocenter/discover/v8r4/index.jsp?topic=/com.ibm.discovery.ds.ref.doc/t_RG_Escape_Sequences.htm

The HTTP/FTP code built into xmlTreeParse(), htmlTreeParse() and
xmlEventParse() (it comes from libxml2) is minimalistic.
For better or worse, it is the same code that R uses to implement
url() connections.  It does not handle any aspect of HTTP beyond simple
requests.  So when I run into problems with xmlTreeParse() and a URL,
I first fetch the content of the document using the RCurl package.

For example,

library(RCurl)
getURL("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=10000&retmode=xml&term=meyer[au]")

fetches the document, and the result can be passed directly to
xmlTreeParse().
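
As a rough sketch of the fetch-then-parse approach, assuming the RCurl and
XML packages are both installed: the helper names ids_from_xml() and
fetch_ids() are made up for illustration, the sample XML string is invented
test data rather than a real PubMed response, and passing asText = TRUE is my
own addition to state explicitly that the argument is the document text
rather than a file name.

```r
# Hypothetical helper: pull the <Id> values out of XML held in a string.
ids_from_xml <- function(txt) {
  # asText = TRUE tells the parser that txt is the document itself,
  # not the name of a file or a URL.
  doc <- XML::xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
  XML::xpathSApply(doc, "//Id", XML::xmlValue)
}

# Hypothetical helper: download the document with RCurl, then parse the text.
fetch_ids <- function(url) {
  ids_from_xml(RCurl::getURL(url))
}

# Invented sample input (not real PubMed output) for trying the parser offline.
sample_xml <- "<eSearchResult><IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
```

Calling ids_from_xml(sample_xml) exercises the parsing step without network
access; fetch_ids() with the query URL would do the same against the live
service.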

RCurl is an interface to libcurl, a very solid, stable and
feature-rich library for performing HTTP, HTTPS, FTP, ... client
queries.  It lets us do in R, programmatically, pretty much anything
a Web browser can do.

 D.