parsing Google search results

3 messages · Philip Leifeld, Barry Rowlingson, Tony

Mon, Nov 16, 2009 11:29 AM #

Hi,

How can I parse Google search results? The following code returns
"integer(0)" instead of "1", although the results of the query clearly
contain a match for the regex "cran".

####
address <- url("http://www.google.com/search?q=cran")
open(address)
lines <- readLines(address)
grep("cran", lines[3])
####

Thanks

Philip

Philip Leifeld

Barry Rowlingson

Tue, Nov 17, 2009 12:17 AM #

On Mon, Nov 16, 2009 at 7:29 PM, Philip Leifeld <Leifeld at coll.mpg.de> wrote:

Hmmm how could that be? It's not like you're getting any warnings or
anything...

 Or are you? I get a couple:

 > address <- url("http://www.google.com/search?q=cran")
 > open(address)
 > lines <- readLines(address)
 Warning message:
 In readLines(address) :
   incomplete final line found on 'http://www.google.com/search?q=cran'

 - but that's probably because there's no newline at the end of the
data. Ignore that.
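
A small offline sketch of that first warning, assuming a temporary file stands in for the downloaded page (the file and its contents are made up for illustration). readLines() still returns the last line; it only warns that the end-of-line marker is missing, and warn = FALSE silences it:

```r
# Write two lines, the second without a trailing newline.
tmp <- tempfile(fileext = ".txt")
cat("first line\nlast line without newline", file = tmp)

# Default behaviour: the data is read in full, with a warning
# about the incomplete final line.
with_warning <- readLines(tmp)

# Same data, no warning.
quiet <- readLines(tmp, warn = FALSE)
length(quiet)  # both lines are present

unlink(tmp)
```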

 > grep("cran",lines[3])
 integer(0)
 Warning message:
 In grep("cran", lines[3]) : input string 1 is invalid in this locale

 Oh now that looks serious. And relevant. Did you get this warning?
You didn't say. I'll assume you didn't, because otherwise you surely
would have mentioned it. So I won't waste my time typing my solution
in now.

 Oh alright. You may need to set the encoding to 'latin1' when you open the URL, then open the connection and read the lines again as before:

 > address <- url("http://www.google.com/search?q=cran",encoding="latin1")

 > grep("cran",lines[3])
 [1] 1

So is that the problem? Did you get the warning message and not show
us? Transcripts (inputs and outputs) are good.
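
The same effect can be reproduced without a network connection. This is only a sketch: the byte string below is a made-up stand-in for a latin1-encoded line of the page, and validUTF8() assumes a reasonably recent R (3.3.0 or later). It shows two ways around the "invalid in this locale" warning, and why grep() on lines[3] gives 1 rather than 3:

```r
# "cran " followed by byte 0xE9 ("e" with acute accent in latin1).
# On its own, 0xE9 is not valid UTF-8.
latin1_line <- rawToChar(as.raw(c(0x63, 0x72, 0x61, 0x6e, 0x20, 0xe9)))
validUTF8(latin1_line)                      # FALSE

# Option 1: declare the source encoding and convert to UTF-8 before matching.
fixed <- iconv(latin1_line, from = "latin1", to = "UTF-8")
lines <- c("<html>", "<head>", fixed)       # hypothetical stand-in for the page
grep("cran", lines)                         # 3: index of the matching element
grep("cran", lines[3])                      # 1: index within the length-one subset

# Option 2: skip the conversion and match byte-wise.
grep("cran", latin1_line, useBytes = TRUE)  # 1
```

grep() returns the positions of the matching elements in whatever vector it is given, so integer(0) simply means "no element matched"; grepl() returns TRUE/FALSE per element instead.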

Barry

Tony

Tue, Nov 17, 2009 8:54 AM #

Hi Philip,

If I understood correctly, you just want to get the URLs from a given
Google search? I have some old code you could adapt, which extracts the
main links from a Google search. It makes use of XPath expressions
using the lovely XML and RCurl packages:

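> library(XML)
> library(RCurl)
>
> ## NOTE: the function headers and the final call were lost from this
> ## archived copy. The signatures, the name getGoogleLinks, the default
> ## arguments and the call at the end are assumptions.
> getGoogleURL <- function(search.term, domain = '.com', quotes = TRUE) {
+   search.term <- gsub(' ', '%20', search.term)
+   if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
+   paste('http://www.google', domain, '/search?q=', search.term, sep='')
+ }
>
> getGoogleLinks <- function(google.url) {
+   doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
+   html <- htmlTreeParse(doc, asText = TRUE, useInternalNodes = TRUE,
+                         error = function(...){})
+   ## the next line is very important to parse the html ##
+   nodes <- getNodeSet(html, "//a[@href][@class='l']")
+   ## take the href attribute by name rather than by position
+   return(sapply(nodes, function(x) xmlGetAttr(x, "href")))
+ }
>
> getGoogleLinks(getGoogleURL("cran", quotes = FALSE))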

 [1] "http://cran.r-project.org/"
 [2] "http://cran.r-project.org/web/packages/"
 [3] "http://www.cranmusic.com/"
 [4] "http://www.sizes.com/units/cran.htm"
 [5] "http://www.r-project.org/"
 [6] "http://www.myspace.com/cranmusic"
 [7] "http://www.rozcran.co.uk/"
 [8] "http://www.cherylcran.com/"
 [9] "http://www.chriscran.com/"
[10] "http://www.cranhillranch.com/"
[11] "http://www.yumsugar.com/6262265"
[12] "http://www.yumsugar.com/6262259"
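
The XPath step can be tried without hitting Google at all. The snippet below is a sketch on a made-up HTML fragment; the class name 'l' is taken from the code above, and real result pages use different markup that changes over time, so the expression will need adjusting. It assumes the XML package is installed:

```r
library(XML)

# Hypothetical stand-in for a results page.
page <- '<html><body>
  <a class="l" href="http://cran.r-project.org/">CRAN</a>
  <a class="nav" href="/preferences">Settings</a>
  <a class="l" href="http://www.r-project.org/">R</a>
</body></html>'

# Parse the string (asText = TRUE) into an internal document.
doc <- htmlParse(page, asText = TRUE)

# Select <a> elements that have an href and class "l",
# then pull out the href attribute of each one by name.
links <- xpathSApply(doc, "//a[@href][@class='l']", xmlGetAttr, "href")

free(doc)  # release the C-level document
links
```

Only the two anchors with class "l" are selected; the navigation link is skipped because its class does not match.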

Hope that helps a little,
Tony Breyal

On 16 Nov, 19:29, Philip Leifeld <Leif... at coll.mpg.de> wrote:

______________________________________________
R-h... at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.