
R help - Web Scraping of Google News using R

2 messages · Kumar Gauraw, Bob Rudis

Hello Experts,

I am trying to scrape data from Google News for a particular topic using the
XML and RCurl packages in R. I am able to extract the summary part of the news
through *XPath*, but when I try to extract the titles and links of the news
items in the same way, it does not work. Please note this work is just for POC
purposes, and I would make a maximum of 500 requests per day so that the Google
TOS remains intact.


library(XML)
library(RCurl)

getGoogleURL <- function(search.term, domain = '.co.in', quotes = TRUE)
{
  search.term <- gsub(' ', '%20', search.term)
  if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  paste('http://www.google', domain,
        '/search?hl=en&gl=in&tbm=nws&authuser=0&q=', search.term, sep = '')
}

search.term <- "IPL 2016"

quotes <- FALSE   # use a logical, not the string "FALSE"

search.url <- getGoogleURL(search.term = search.term, quotes = quotes)
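As an aside, base R's URLencode() handles all reserved characters, not just
spaces, so it is a more robust alternative to the gsub('%20') substitution in
getGoogleURL above. A minimal sketch (the URL layout mirrors the function
above; nothing here is specific to Google's API):

```r
search.term <- "IPL 2016"
# reserved = TRUE percent-encodes spaces and other reserved characters
encoded <- URLencode(search.term, reserved = TRUE)
# %22 is the percent-encoded double quote used for exact-phrase search
url <- paste0("http://www.google.co.in/search?hl=en&gl=in&tbm=nws&q=%22",
              encoded, "%22")
```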

getGoogleSummary <- function(google.url) {
  doc   <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
  html  <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
  nodes <- getNodeSet(html, "//div[@class='st']")
  sapply(nodes, xmlValue)
}

# Problem is with this part of the code

getGoogleTitle <- function(google.url) {
  doc   <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
  html  <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
  nodes <- getNodeSet(html, "//a[@class='l _HId']")
  sapply(nodes, xmlValue)
}
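One likely culprit is the XPath itself: @class='l _HId' requires an exact
match of the whole class attribute, and Google's generated class names change
frequently, so contains(@class, ...) is more tolerant. A self-contained
sketch against a static snippet (the markup and class names here are
assumptions standing in for a live result page, not Google's actual HTML):

```r
library(XML)

# Static stand-in for one news result; real class names will differ.
doc <- htmlParse('<html><body>
  <a class="l _HId" href="http://example.com/story">Story title</a>
</body></html>')

# contains() matches even when the element carries extra classes,
# unlike the exact test @class="l _HId" used above.
nodes  <- getNodeSet(doc, "//a[contains(@class, '_HId')]")
titles <- sapply(nodes, xmlValue)             # link text
links  <- sapply(nodes, xmlGetAttr, "href")   # href attribute
```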

Kindly help me understand where I am going wrong so that I can rectify
the code and get the correct output.

Thank you.

With Regards,
Kumar Gauraw
What you are doing wrong is both trying yourself and asking others to
violate Google's Terms of Service and (amongst other things) get your
IP banned along with anyone who aids you (or worse). Please don't.
Just because something can be done does not mean it should be done.
On Tue, May 24, 2016 at 11:21 AM, Kumar Gauraw <string.gauraw at gmail.com> wrote: