Analyzing Publications from Pubmed via XML
26 messages · Farrel Buchinsky, Gabor Grothendieck, Rajarshi Guha +5 more
On Dec 13, 2007, at 9:03 PM, Farrel Buchinsky wrote:
I would like to track in which journals articles about a particular disease are being published. Creating a pubmed search is trivial. The search provides data but obviously not as an R dataframe. I can get the search to export the data as an xml feed and the xml package seems to be able to read it.

xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-", isURL=TRUE)

But getting from there to a dataframe in which one column would be the name of the journal and another column would be the year (to keep things simple) seems to be beyond my capabilities.
If you're comfortable with Python (or Perl, Ruby, etc.), it'd be easier to just extract the required stuff from the raw feed; using ElementTree in Python makes this a trivial task. Once you have the raw data you can read it into R.

Rajarshi Guha <rguha at indiana.edu>
On Dec 13, 2007 9:03 PM, Farrel Buchinsky <fjbuch at gmail.com> wrote:
I would like to track in which journals articles about a particular disease
are being published. Creating a pubmed search is trivial. The search
provides data but obviously not as an R dataframe. I can get the search to
export the data as an xml feed and the xml package seems to be able to read
it.
xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-", isURL=TRUE)
But getting from there to a dataframe in which one column would be the name
of the journal and another column would be the year (to keep things simple)
seems to be beyond my capabilities.
Has anyone ever done this and could you share your script? Are there any
published examples where the end result is a dataframe?
I guess what I am looking for is an easy and simple way to parse the feed
and extract the data. Alternatively how does one turn an RSS feed into a CSV
file?
Try this:
library(XML)
doc <-
xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
isURL = TRUE, useInternalNodes = TRUE)
sapply(c("//author", "//category"), xpathApply, doc = doc, fun = xmlValue)
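[As a cross-check on the XPath approach, the same "//author" and "//category" extraction can be sketched in Python. The two-item feed below is a made-up, self-contained stand-in for the PubMed RSS structure, not a live NCBI response:]

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed mimicking the PubMed RSS layout.
feed = """<rss><channel>
  <item><author>Smith J, Jones K</author><category>J Laryngol</category></item>
  <item><author>Brown A</author><category>Pediatr Nephrol</category></item>
</channel></rss>"""

root = ET.fromstring(feed)
# ".//author" is ElementTree's rough analogue of the XPath "//author".
authors = [e.text for e in root.findall(".//author")]
journals = [e.text for e in root.findall(".//category")]
rows = list(zip(authors, journals))
print(rows)
```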
On Dec 13, 2007, at 9:16 PM, Farrel Buchinsky wrote:
I am afraid not! The only thing I know about Python (or Perl, Ruby etc) is that they exist and that I have been able to download some amazing freeware or open source software thanks to their existence. The XML package and specifically the xmlTreeParse function looks as if it is begging to do the task for me. Is that not true?
Certainly - probably as a better Python programmer than an R programmer, it's faster and neater for me to do it in Python:

from elementtree.ElementTree import XML
import urllib

url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-'
con = urllib.urlopen(url)
dat = con.read()
root = XML(dat)
items = root.findall("channel/item")
for item in items:
    category = item.find("category")
    print category.text

The problem is that the RSS feed you linked to does not contain the year of the article in an easily accessible XML element. Rather, you have to process the HTML content of the description element - which is something R could do, but you'd be using the wrong tool for the job.

In general, if you're planning to analyze article data from Pubmed I'd suggest going through the Entrez CGIs (ESearch and EFetch), which will give you all the details of the articles in an XML format which can then be easily parsed in your language of choice. This is something that can be done in R (the rpubchem package contains functions to process XML files from Pubchem, which might provide some pointers).

Rajarshi Guha <rguha at indiana.edu>
or just try looking in the annotate package from Bioconductor
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Robert Gentleman, PhD Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 PO Box 19024 Seattle, Washington 98109-1024 206-667-7700 rgentlem at fhcrc.org
The problem is that the RSS feed you linked to, does not contain the year of the article in an easily accessible XML element. Rather you have to process the HTML content of the description element - which, is something R could do, but you'd be using the wrong tool for the job.
Yes. I have noticed that there are two sorts of xml that pubmed will provide. The kind I had hooked into was an rss feed which provides a lot of the information simply as a formatted table for viewing in an rss reader. There is another way to get the xml to come out with more tags. However, I found the best way to do this is probably through the bioconductor annotate package:
x <- pubmed("18046565", "17978930", "17975511")
a <- xmlRoot(x)
numAbst <- length(xmlChildren(a))
absts <- list()
for (i in 1:numAbst) {
absts[[i]] <- buildPubMedAbst(a[[i]])
}
I am now trying to work through that approach to see what I can come up with.
Farrel Buchinsky
On Dec 13, 2007 11:35 PM, Robert Gentleman <rgentlem at fhcrc.org> wrote:
or just try looking in the annotate package from Bioconductor
Yip. annotate seems to be the most streamlined way to do this.

1) How does one turn the list that is created into a dataframe whose column names are along the lines of date, title, journal, authors, etc.?

2) I have already created a standing search in pubmed using MyNCBI. There are many ways I can feed those results to the pubmed() function. The most brute-force way of doing it is by running the search and outputting the data as a UI List and getting that into the pubmed brackets. A way that involved more finesse would allow me to create an rss feed based on my search and then give the rss feed url to the pubmed function. Or perhaps one could just plop the query inside the pubmed function:
pubmed(somefunction("Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH])
OR ((("recurrence"[TIAB] NOT Medline[SB]) OR "recurrence"[MeSH Terms]
OR recurrent[Text Word]) AND respiratory[All Fields] AND
(("papilloma"[TIAB] NOT Medline[SB]) OR "papilloma"[MeSH Terms] OR
papillomatosis[Text Word])))
Does "somefunction" exist?
If there are any further questions do you think I should migrate this
conversation to the bioconductor mailing list?
Farrel Buchinsky
On Dec 14, 2007 3:04 PM, Farrel Buchinsky <fjbuch at gmail.com> wrote:
Note that the lines after a <- xmlRoot(x) could be reduced to:

xmlSApply(a, buildPubMedAbst)
Farrel Buchinsky wrote:
You can simplify the final 5 lines to

absts = xmlApply(a, buildPubMedAbst)

which is shorter, fractionally faster, and handles cases where there are no abstracts.
I am now trying to work through that approach to see what I can come up with.
"Farrel Buchinsky" <fjbuch at gmail.com> wrote in news:bd93cdad0712141216s23071d27n17d87a487ad06950 at mail.gmail.com:
On Dec 13, 2007 11:35 PM, Robert Gentleman <rgentlem at fhcrc.org> wrote:
or just try looking in the annotate package from Bioconductor
Yip. annotate seems to be the most streamlined way to do this. 1) How does one turn the list that is created into a dataframe whose column names are along the lines of date, title, journal, authors etc
Gabor's example already did that task.
2) I have already created a standing search in pubmed using MyNCBI.
There are many ways I can feed those results to the pubmed() function.
The most brute force way of doing it is by running the search and
outputing the data as a UI List and getting that into the pubmed
brackets. A way that involved more finesse would allow me to create a
rss feed based on my search and then give the rss feed url to the
pubmed function. Or perhaps once could just plop the query inside the
pubmed functions
pubmed(somefunction("Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH])
OR ((("recurrence"[TIAB] NOT Medline[SB]) OR "recurrence"[MeSH Terms]
OR recurrent[Text Word]) AND respiratory[All Fields] AND
(("papilloma"[TIAB] NOT Medline[SB]) OR "papilloma"[MeSH Terms] OR
papillomatosis[Text Word])))
Does "somefunction" exist?
I could not find it. The pubmed function appears to assume that you will already have a list of PMIDs. When I set up a function to take an arbitrary PubMed search string (quoted by the user) and return the PMIDs, I had success by following Gabor's example:
pm.srch <- function () {
  srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
  query <- as.character(scan(file = "", what = "character"))
  doc <- xmlTreeParse(paste(srch.stem, query, sep = ""), isURL = TRUE, useInternalNodes = TRUE)
  sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue)
}
pm.srch()
1: "laryngeal neoplasms[mh]"
2:
Read 1 item
//Id
[1,] "18042931"
[2,] "18038886"
[3,] "17978930"
[4,] "17974987"
[5,] "17972507"
[6,] "17970149"
[7,] "17967299"
[8,] "17962724"
[9,] "17954109"
[10,] "17942038"
[11,] "17940076"
[12,] "17848290"
[13,] "17848288"
[14,] "17848287"
[15,] "17848278"
[16,] "17938330"
[17,] "17938329"
[18,] "17918311"
[19,] "17910347"
[20,] "17908862"
Emboldened by that minor success, I pushed on. Pubmed said your example
was malformed and I took their suggested modification:
("Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH]) OR (("recurrence"[TIAB] NOT Medline[SB]) OR "recurrence"[MeSH Terms] OR recurrent[Text Word]) AND respiratory[All Fields] AND (("papilloma"[TIAB] NOT Medline[SB]) OR "papilloma"[MeSH Terms] OR papillomatosis[Text Word])
That returned 400+ citations, and I put it into a text file.
After quite a bit of hacking (in the sense of ineffective chopping with
a dull ax), I finally came up with:
pm.srch<- function (){
srch.stem<-"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
query<-readLines(con=file.choose())
query<-gsub("\\\"","",x=query)
doc<-xmlTreeParse(paste(srch.stem,query,sep=""),isURL = TRUE,
useInternalNodes = TRUE)
return(sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue) )
}
pm.srch() #choosing the search-file
//Id
[1,] "18046565"
[2,] "17978930"
[3,] "17975511"
[4,] "17935912"
[5,] "17851940"
[6,] "17765779"
[7,] "17688640"
[8,] "17638782"
[9,] "17627059"
[10,] "17599582"
[11,] "17589729"
[12,] "17585283"
[13,] "17568846"
[14,] "17560665"
[15,] "17547971"
[16,] "17428551"
[17,] "17419899"
[18,] "17419519"
[19,] "17385606"
[20,] "17366752"
David Winsemius
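[The gsub step that strips quotes from the saved search hints at a more general point: the query has to be URL-encoded before being appended to the esearch stem. A hedged Python sketch of the same URL construction (the URL is only built here, no request is sent):]

```python
from urllib.parse import quote_plus

srch_stem = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
query = '"laryngeal neoplasms"[mh]'

# quote_plus percent-encodes the quotes and brackets and turns spaces into '+',
# so the full MeSH syntax survives in the query string.
url = srch_stem + quote_plus(query)
print(url)
```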
David Winsemius <dwinsemius at comcast.net> wrote in news:Xns9A077F740B4A0dNOTwinscomcast at 80.91.229.13:
"Farrel Buchinsky" <fjbuch at gmail.com> wrote in news:bd93cdad0712141216s23071d27n17d87a487ad06950 at mail.gmail.com:
On Dec 13, 2007 11:35 PM, Robert Gentleman <rgentlem at fhcrc.org> wrote:
or just try looking in the annotate package from Bioconductor
Yip. annotate seems to be the most streamlined way to do this. 1) How does one turn the list that is created into a dataframe whose column names are along the lines of date, title, journal, authors etc
Gabor's example already did that task.
Actually the object returned by Gabor's method was a list of lists. Here
is one way (probably very inefficient) of getting "doc" into a
data.frame:
colvals <-sapply(c("//title", "//author", "//category"), xpathApply,
doc = doc, fun = xmlValue)
titles=as.vector(unlist(colvals[1])[3:17])
# needed to drop extraneous titles for search name and an NCBI header
#>str(colvals)
#List of 3
# $ //title :List of 17
# ..$ : chr "PubMed: (\"Laryngeal Neoplasm..."
# ..$ : chr "NCBI PubMed"
authors=colvals[[2]]
jrnls=colvals[[3]]
# not sure why, but trying to do it in one step failed:
# cites<-data.frame(titles=as.vector(unlist(colvals[1])[3:17]),
# authors=colvals[[2]],jnrls=colvals[[3]])
# Error in data.frame(titles = as.vector(unlist(colvals[1])[3:17]),
# authors = colvals[[2]], :
# arguments imply differing number of rows: 15, 1
# but the following worked
cites<-data.frame(titles=as.vector(titles))
cites$author<-authors
cites$jrnls<-jrnls
cites
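[The [3:17] slicing above is needed because "//title" matches the channel title and the NCBI header as well as the per-item titles. Scoping the query to channel/item avoids that. A Python sketch of the same idea on a hypothetical, made-up feed:]

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the channel itself has a <title>, and so does an
# <image> block, which is why "//title" returns more values than items.
feed = """<rss><channel>
  <title>PubMed: search name</title>
  <image><title>NCBI PubMed</title></image>
  <item><title>Paper one.</title><author>A B</author><category>J One</category></item>
  <item><title>Paper two.</title><author>C D</author><category>J Two</category></item>
</channel></rss>"""

root = ET.fromstring(feed)
# Searching only inside channel/item skips the two extraneous titles,
# so no positional slicing is needed afterwards.
items = root.findall("channel/item")
rows = [(i.findtext("title"), i.findtext("author"), i.findtext("category"))
        for i in items]
print(rows)
```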
I am still wondering how to extract material that does not have an XML
tag. Each item looks like:
<item>
<title>Gastroesophageal reflux in patients with recurrent laryngeal
papillomatosis.</title>
<link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
tmpl=NoSidebarfile&db=PubMed&cmd=Retrieve&list_uids=17589729
&dopt=Abstract</link>
<description>
<![CDATA[
<table border="0" width="100%"><tr><td align="left"><a
href="http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0034-
72992007000200011&lng=en&nrm=iso&tlng=en"><img
src="http://www.ncbi.nlm.nih.gov/entrez/query/egifs/http:--www.scielo.br-
img-scielo_en.gif" border="0"/></a> </td><td align="right"><a
href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
db=PubMed&cmd=Display&dopt=PubMed_PubMed&from_uid=17589729">
Related Articles</a></td></tr></table>
<p><b>Gastroesophageal reflux in patients with recurrent
laryngeal papillomatosis.</b></p>
<p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4
</p>
<p>Authors: Pignatari SS, Liriano RY, Avelino MA, Testa JR,
Fujita R, De Marco EK</p>
<p>Evidence of a relation between gastroesophaeal reflux and
pediatric respiratory disorders increases every year. Many respiratory
symptoms and clinical conditions such as stridor, chronic cough, and
recurrent pneumonia and bronchitis appear to be related to
gastroesophageal reflux. Some studies have also suggested that
gastroesophageal reflux may be associated with recurrent laryngeal
papillomatosis, contributing to its recurrence and severity. AIM: the aim
of this study was to verify the frequency and intensity of
gastroesophageal reflux in children with recurrent laryngeal
papillomatosis. MATERIAL AND METHODS: ten children of both genders, aged
between 3 and 12 years, presenting laryngeal papillomatosis, were
included in this study. The children underwent 24-hour double-probe pH-
metry. RESULTS: fifty percent of the patients had evidence of
gastroesophageal reflux at the distal sphincter; 90% presented reflux at
the proximal sphincter. CONCLUSION: the frequency of proximal
gastroesophageal reflux is significantly increased in patients with
recurrent laryngeal papillomatosis.</p>
<p>PMID: 17589729 [PubMed - in process]</p> ]]>
</description>
<author>Pignatari SS, Liriano RY, Avelino MA, Testa JR, Fujita R, De
Marco EK</author>
<category>Rev Bras Otorrinolaringol (Engl Ed)</category>
<guid isPermaLink="false">PubMed:17589729</guid>
</item>
I would like to access, for instance, the PMID or the abstract within the <description> element, but I do not think that they have names in the same way that <author> or <category> have xml named nodes. I suspect that getting the output in a different format, say as MEDLINE, might produce output that was tagged more completely.
David Winsemius
If we can assume that the abstract is always the 4th paragraph then we
can try something like this:
library(XML)
doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)
out <- cbind(
Author = unlist(xpathApply(doc, "//author", xmlValue)),
PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
Abstract = unlist(xpathApply(doc, "//description",
function(x) {
on.exit(free(doc2))
doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
useInternalNodes = TRUE, trim = TRUE)
xpathApply(doc2, "//p[4]", xmlValue)
}
)))
free(doc)
substring(out, 1, 25) # display first 25 chars of each field
The last line produces (it may look messed up in this email):
substring(out, 1, 25) # display it
      Author                      PMID       Abstract
 [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
 [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
 [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
 [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
 [5,] " Hopp R, Natarajan N, Lew" "17908862" ""
 [6,] " Preuss SF, Klussmann JP," "17851940" "CONCLUSIONS: The presente"
 [7,] " Mouadeb DA, Belafsky PC"  "17765779" "OBJECTIVES: The 585nm pul"
 [8,] " Thompson L"               "17702311" ""
 [9,] " Schaffer A, Brotherton J" "17688640" ""
[10,] " Stephen JK, Vaught LE, C" "17638782" "OBJECTIVE: To investigate"
[11,] " Shah KV, Westra WH"       "17627059" ""
[12,] " Koufman JA, Rees CJ, Fra" "17599582" "BACKGROUND: Unsedated off"
[13,] " Akst LM, Broadhurst MS, " "17592395" ""
[14,] " Pignatari SS, Liriano RY" "17589729" "Evidence of a relation be"
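[The "abstract is the 4th paragraph" heuristic can also be sketched outside R. The description string below is a trimmed, hypothetical stand-in for the CDATA payload quoted in this thread; the PMID is pulled out with a regex, which is sturdier than relying on paragraph position:]

```python
import re

# Hypothetical, trimmed description payload in the PubMed RSS shape:
# title, citation, authors, abstract, PMID -- each in its own <p>.
desc = ("<p><b>Some title.</b></p>"
        "<p>Rev Bras Otorrinolaringol. 2007 Mar-Apr;73(2):210-4</p>"
        "<p>Authors: Pignatari SS</p>"
        "<p>Evidence of a relation between reflux and papillomatosis.</p>"
        "<p>PMID: 17589729 [PubMed - in process]</p>")

paragraphs = re.findall(r"<p>(.*?)</p>", desc, re.S)
abstract = re.sub(r"<[^>]+>", "", paragraphs[3])   # 4th <p>, any tags stripped
pmid = re.search(r"PMID:\s*(\d+)", desc).group(1)  # regex beats position here
print(pmid, abstract)
```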
It looked beautifully regular in my newsreader. It is helpful to see an
example showing the indexed access to nodes. It was also helpful to see the
example of substring for column display. Thank you (for this and all of
your other contributions.)
I find upon further browsing that the pmfetch access point is obsolete. Experimentation with the PubMed eFetch server access point returns fully xml-tagged results:
e.fetch.doc<- function (){
fetch.stem <-
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
src.mode <- "db=pubmed&retmode=xml&"
request <- "id=11045395"
doc<-xmlTreeParse(paste(fetch.stem,src.mode,request,sep=""),
isURL = TRUE, useInternalNodes = TRUE)
}
# in the debugging phase I needed to set useInternalNodes = TRUE to see the
# tags. Never did find a way to "print" them when internal.
doc<-e.fetch.doc()
get.info<- function(doc){
df<-cbind(
Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
Journal = unlist(xpathApply(doc, "//Title", xmlValue)),
Pmid = unlist(xpathApply(doc, "//PMID", xmlValue))
)
return(df)
}
# this works
substring(get.info(doc), 1, 25)
     Abstract                    Journal                     Pmid
[1,] "We studied the prevalence" "Pediatric nephrology (Ber" "11045395"
David Winsemius
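[For comparison, the same three fields can be pulled from an efetch-style document in Python. The fragment below is a hypothetical, heavily trimmed record (the element names follow the PubmedArticleSet schema; the content is made up), not a live efetch response:]

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed efetch-style record.
doc = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
  <PMID>11045395</PMID>
  <Article>
    <Journal><Title>Pediatric nephrology</Title></Journal>
    <Abstract><AbstractText>We studied the prevalence of X.</AbstractText></Abstract>
  </Article>
</MedlineCitation></PubmedArticle></PubmedArticleSet>"""

root = ET.fromstring(doc)
record = {
    "Pmid": root.findtext(".//PMID"),
    # ".//Journal/Title" dodges the bare "//Title" ambiguity noted in the thread.
    "Journal": root.findtext(".//Journal/Title"),
    "Abstract": root.findtext(".//AbstractText"),
}
print(record)
```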
On Dec 16, 2007 2:53 PM, David Winsemius <dwinsemius at comcast.net> wrote:
# in the debugging phase I needed to set useInternalNodes = TRUE to see the tags. Never did find a way to "print" them when internal.
I assume you mean FALSE. See: ?saveXML
"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in news:971536df0712161226j2cddb7c6qa99992ae7366ed63 at mail.gmail.com:
On Dec 16, 2007 2:53 PM, David Winsemius <dwinsemius at comcast.net> wrote:
# in the debugging phase I needed to set useInternalNodes = TRUE to see the tags. Never did find a way to "print" them when internal.
I assume you mean FALSE. See: ?saveXML
You're correct, yet again; I did a copy/paste/forget-to-edit. And thanks for the further tip.
David
"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in news:971536df0712161226j2cddb7c6qa99992ae7366ed63 at mail.gmail.com:
On Dec 16, 2007 2:53 PM, David Winsemius <dwinsemius at comcast.net> wrote:
# Never did find a way to "print" them when internal.
?saveXML
And now I understand where that odd "\n <text>" originated before I changed the searched-for node name from \\Abstract to \\AbstractText. It's a remnant from the pretty-printing of the XML tree after excising the intervening node name.
David Winsemius
David Winsemius wrote:
# in the debugging phase I needed to set useInternalNodes = TRUE to see the
tags. Never did find a way to "print" them when internal.
saveXML(node) will return a string giving the XML content of that node as a tree.
On Dec 15, 2007 6:31 PM, David Winsemius <dwinsemius at comcast.net> wrote:
After quite a bit of hacking (in the sense of ineffective chopping with
a dull ax), I finally came up with:
pm.srch<- function (){
srch.stem<-"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
query<-readLines(con=file.choose())
query<-gsub("\\\"","",x=query)
doc<-xmlTreeParse(paste(srch.stem,query,sep=""),isURL = TRUE,
useInternalNodes = TRUE)
return(sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue) )
}
pm.srch() #choosing the search-file
//Id
[1,] "18046565"
[2,] "17978930"
[3,] "17975511"
[4,] "17935912"
[5,] "17851940"
[6,] "17765779"
[7,] "17688640"
[8,] "17638782"
[9,] "17627059"
[10,] "17599582"
[11,] "17589729"
[12,] "17585283"
[13,] "17568846"
[14,] "17560665"
[15,] "17547971"
[16,] "17428551"
[17,] "17419899"
[18,] "17419519"
[19,] "17385606"
[20,] "17366752"
I tried the example above, but only the first 20 PMIDs are returned. How can I circumvent this (I guess it's a restriction imposed by PubMed)?
Armin Goralczyk, M.D. -- Universitätsmedizin Göttingen Abteilung Allgemein- und Viszeralchirurgie Rudolf-Koch-Str. 40 39099 Göttingen -- Dept. of General Surgery University of Göttingen Göttingen, Germany -- http://www.chirurgie-goettingen.de
Hi Armin --

See the help page for esearch
http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html
especially the 'retmax' key.

A couple of other thoughts on this thread...

1) Using the full path, e.g.,

   ids <- xpathApply(doc, "/eSearchResult/IdList/Id", xmlValue)

is likely to lead to less grief in the long run, as you'll only select
elements of the node you're interested in, rather than any element,
anywhere in the document, labeled 'Id'.

2) From a different post in the thread, things like
On Dec 16, 2007 2:53 PM, David Winsemius <dwinsemius at comcast.net> wrote:
[snip]
get.info<- function(doc){
df<-cbind(
Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
Journal = unlist(xpathApply(doc, "//Title", xmlValue)),
Pmid = unlist(xpathApply(doc, "//PMID", xmlValue))
)
return(df)
}
will lead to more trouble, because they assume that AbstractText, etc
occur exactly once in each record. It would seem better to extract the
relevant node, and query that, probably defining appropriate
defaults. I started with
xpath_or_na <- function(doc, q) {
res <- xpathApply(doc, q, xmlValue)
if (length(res)==1) res[[1]]
else NA_character_
}
citn <- function(citation){
Abstract <- xpath_or_na(citation,
"/MedlineCitation/Article/Abstract/AbstractText")
Journal <- xpath_or_na(citation,
"/MedlineCitation/Article/Journal/Title")
Pmid <- xpath_or_na(citation,
"/MedlineCitation/PMID")
c(Abstract=Abstract, Journal=Journal, Pmid=Pmid)
}
medline_q <- "/PubmedArticleSet/PubmedArticle/MedlineCitation"
res <- xpathApply(doc, medline_q, citn)
One would still have to coerce res into a data.frame. Also worth
thinking about each of the lines in citn -- e.g., the Title query clearly
only applies to Journals. Eventually one wants to consult the DTD
(basically, the contract spelling out the content of the document),
confirm that the xpath queries will perform correctly, and verify that
the document actually conforms to its DTD.
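The coercion step might look like this (a sketch, not from the thread; it assumes res is the list produced by the xpathApply(doc, medline_q, citn) call above, each element being the named character vector that citn returns):

```r
# rbind the per-citation vectors into a matrix, then coerce to a data.frame
df <- as.data.frame(do.call(rbind, res), stringsAsFactors = FALSE)
```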
Following my own advice, I quickly found that doing things 'more
right' becomes quite complicated, and suddenly became satisfied with
the information I can get out of the 'annotate' package.
Martin
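A sketch of the 'retmax' suggestion (the function name pm.srch2 and the default of 100 are illustrative, not from the thread):

```r
pm.srch2 <- function (term, retmax = 100) {
  stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed"
  url  <- paste(stem, "&term=", term, "&retmax=", retmax, sep = "")
  doc  <- xmlTreeParse(url, isURL = TRUE, useInternalNodes = TRUE)
  # full path rather than "//Id", per the advice above
  unlist(xpathApply(doc, "/eSearchResult/IdList/Id", xmlValue))
}
```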
"Armin Goralczyk" <agoralczyk at gmail.com> writes:
On Dec 15, 2007 6:31 PM, David Winsemius <dwinsemius at comcast.net> wrote:
After quite a bit of hacking (in the sense of ineffective chopping with
a dull ax), I finally came up with:
pm.srch<- function (){
srch.stem<-"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
query<-readLines(con=file.choose())
query<-gsub("\\\"","",x=query)
doc<-xmlTreeParse(paste(srch.stem,query,sep=""),isURL = TRUE,
useInternalNodes = TRUE)
return(sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue) )
}
pm.srch() #choosing the search-file
//Id
[1,] "18046565"
[2,] "17978930"
[3,] "17975511"
[4,] "17935912"
[5,] "17851940"
[6,] "17765779"
[7,] "17688640"
[8,] "17638782"
[9,] "17627059"
[10,] "17599582"
[11,] "17589729"
[12,] "17585283"
[13,] "17568846"
[14,] "17560665"
[15,] "17547971"
[16,] "17428551"
[17,] "17419899"
[18,] "17419519"
[19,] "17385606"
[20,] "17366752"
I tried the example above, but only the first 20 PMIDs will be returned. How can I circumvent this (I guesss its a restraint from pubmed)? -- Armin Goralczyk, M.D. -- Universit?tsmedizin G?ttingen Abteilung Allgemein- und Viszeralchirurgie Rudolf-Koch-Str. 40 39099 G?ttingen -- Dept. of General Surgery University of G?ttingen G?ttingen, Germany -- http://www.chirurgie-goettingen.de
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793
On Dec 15, 2007 6:31 PM, David Winsemius <dwinsemius at comcast.net> wrote:
pm.srch <- function () {
  srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
  query <- as.character(scan(file = "", what = "character"))
  doc <- xmlTreeParse(paste(srch.stem, query, sep = ""), isURL = TRUE,
                      useInternalNodes = TRUE)
  sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue)
}
pm.srch()
1: "laryngeal neoplasms[mh]"
2:
Read 1 item
//Id
[1,] "18042931"
[2,] "18038886"
[3,] "17978930"
[4,] "17974987"
[5,] "17972507"
[6,] "17970149"
[7,] "17967299"
[8,] "17962724"
[9,] "17954109"
[10,] "17942038"
[11,] "17940076"
[12,] "17848290"
[13,] "17848288"
[14,] "17848287"
[15,] "17848278"
[16,] "17938330"
[17,] "17938329"
[18,] "17918311"
[19,] "17910347"
[20,] "17908862"
I tried the above function with simple search terms and it worked fine for me (also more output, thanks to Martin's post), but when I use search terms qualified by certain field tags, i.e. with [au] or [ta], I get the following error message:
pm.srch()
1: "laryngeal neoplasms[mh]"
2:
Read 1 item
Error in .Call("RS_XML_ParseTree", as.character(file), handlers,
  as.logical(ignoreBlanks), :
  error in creating parser for
  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=laryngeal neoplasms[mh]
I/O warning : failed to load external entity
"http%3A//eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi%3Fdb=pubmed&term=laryngeal%20neoplasms%5Bmh%5D"
What's wrong? Thanks for any help
Armin Goralczyk, M.D. -- Universitätsmedizin Göttingen Abteilung Allgemein- und Viszeralchirurgie Rudolf-Koch-Str. 40 39099 Göttingen -- Dept. of General Surgery University of Göttingen Göttingen, Germany -- http://www.chirurgie-goettingen.de
"Armin Goralczyk" <agoralczyk at gmail.com> wrote in news:a695fbee0712171238g4995040x579e58f52f83376e at mail.gmail.com:
On Dec 15, 2007 6:31 PM, David Winsemius <dwinsemius at comcast.net> wrote:
pm.srch <- function () {
  srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
  query <- as.character(scan(file = "", what = "character"))
  doc <- xmlTreeParse(paste(srch.stem, query, sep = ""), isURL = TRUE,
                      useInternalNodes = TRUE)
  sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue)
}
pm.srch()
1: "laryngeal neoplasms[mh]"
2:
Read 1 item
//Id
[1,] "18042931"
snipped list of IDs
I tried the above function with simple search terms and it worked fine for me (also more output thanks to Martin's post) but when I use search terms attributed to certain fields, i.e. with [au] or [ta], I get the following error message:
pm.srch()
1: "laryngeal neoplasms[mh]"
2:
Read 1 item
Error in .Call("RS_XML_ParseTree", as.character(file), handlers,
  as.logical(ignoreBlanks), :
  error in creating parser for
  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=laryngeal neoplasms[mh]
I/O warning : failed to load external entity
"http%3A//eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi%3Fdb=pubmed&term=laryngeal%20neoplasms%5Bmh%5D"
What's wrong?
I'm not sure. You included my simple example rather than the search string that provoked the error. This is an example search that one can find on the how-to page for literature searches with esearch:

http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=PNAS[ta]+AND+97[vi]&retstart=6&retmax=6&tool=biomed3

I am wondering if you used spaces rather than "+"'s? If so, you may want your function to do more gsub-processing of the input string. When I use the search terms in NCBI's example I get:
> pm.srch <- function () {
+   srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
+   query <- as.character(scan(file = "", what = "character"))
+   doc <- xmlTreeParse(paste(srch.stem, query, sep = ""), isURL = TRUE,
+                       useInternalNodes = TRUE)
+   sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue)
+ }
doc.xml<-pm.srch()
1: "PNAS[ta]+AND+97[vi]" 2: Read 1 item
doc.xml
      //Id
 [1,] "16578858"
 [2,] "11186225"
 [3,] "11121081"
 [4,] "11121080"
 [5,] "11121079"
 [6,] "11121078"
 [7,] "11121077"
 [8,] "11121076"
 [9,] "11121075"
[10,] "11121074"
[11,] "11121073"
[12,] "11121072"
[13,] "11121071"
[14,] "11121070"
[15,] "11121069"
[16,] "11121068"
[17,] "11121067"
[18,] "11121066"
[19,] "11121065"
[20,] "11121064"
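The extra gsub-processing suggested here might be as simple as replacing spaces with "+" before pasting the query into the URL (a sketch, not from the thread):

```r
# turn a space-separated search term into the "+"-joined form eutils expects
query <- gsub(" ", "+", "PNAS[ta] AND 97[vi]")
# query is now "PNAS[ta]+AND+97[vi]"
```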
David Winsemius, MD
David Winsemius <dwinsemius at comcast.net> wrote in news:Xns9A09CA51DB1E4dNOTwinscomcast at 80.91.229.13:
"Armin Goralczyk" <agoralczyk at gmail.com> wrote in news:a695fbee0712171238g4995040x579e58f52f83376e at mail.gmail.com:
I tried the above function with simple search terms and it worked fine for me (also more output thanks to Martin's post) but when I use search terms attributed to certain fields, i.e. with [au] or [ta], I get the following error message:
pm.srch()
1: "laryngeal neoplasms[mh]" 2:
I am wondering if you used spaces, rather than "+"'s? If so then you may want your function to do more gsub-processing of the input string.
I tried my theory that one would need "+"'s instead of spaces, but disproved it. Spaces in the input string seems to produce acceptable results on my WinXP/R.2.6.1/RGui system even with more complex search strings.
David Winsemius
On 12/18/07, David Winsemius <dwinsemius at comcast.net> wrote:
[snip]
It's not the spaces; the problem is the field tag (sorry that I didn't
specify this), or maybe the bracket characters []. I am working on Mac
OS X 10.4 with R version 2.6. Is it maybe a string conversion problem?
In the following warning the strings in the URL seem to be different:
Error in .Call("RS_XML_ParseTree", as.character(file), handlers,
  as.logical(ignoreBlanks), :
  error in creating parser for
  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=laryngeal neoplasms[mh]
I/O warning : failed to load external entity
"http%3A//eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi%3Fdb=pubmed&term=laryngeal%20neoplasms%5Bmh%5D"
Armin Goralczyk, M.D. -- Universitätsmedizin Göttingen Abteilung Allgemein- und Viszeralchirurgie Rudolf-Koch-Str. 40 39099 Göttingen -- Dept. of General Surgery University of Göttingen Göttingen, Germany -- http://www.chirurgie-goettingen.de
"Armin Goralczyk" <agoralczyk at gmail.com> wrote in news:a695fbee0712180702k1a351b5cxca54d45b81096166 at mail.gmail.com:
On 12/18/07, David Winsemius <dwinsemius at comcast.net> wrote:
[snip]
I do not have an up-to-date version of R on my Mac, since I have not yet upgraded to OSX10.4. I can try with my older version of R, but failure (or even success) with versions OSX-10.2/R-2.0 is not likely to be very informative.

If you will post an example of the input that is resulting in the error, I can try it on my WinXP machine. If we cannot reproduce it there, then it may be more appropriate to take further questions to the Mac-R mailing list. The error message suggests to me that the fault lies in the connection phase of the task.
David Winsemius
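One way to rule out escaping problems with the brackets in [mh] (a sketch, not from the thread; srch.stem is the URL stem defined in pm.srch above) is to percent-encode the term before pasting it into the URL:

```r
# percent-encode spaces and brackets before building the request URL
query <- URLencode("laryngeal neoplasms[mh]", reserved = TRUE)
# query is now "laryngeal%20neoplasms%5Bmh%5D"
doc <- xmlTreeParse(paste(srch.stem, query, sep = ""), isURL = TRUE,
                    useInternalNodes = TRUE)
```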