URL Scan - R-help | R Mailing Lists

jmsc

Sun, Apr 17, 2011 1:40 PM

I am wondering why when I try to input data from the first site listed below
into R using the scan() function, a different page is read in instead (the
second site listed):

http://data.visionappraisal.com/CanterburyCT/parcel.asp?pid=1242

http://www.visionappraisal.com/databases/

I am wondering if this is an issue with R or something in the source code of
the web page that I am not familiar with. Since I can access the first site
directly, I assume it is not within the source code. Any help would be
appreciated.


Barry Rowlingson

Sun, Apr 17, 2011 3:32 PM

On Sun, Apr 17, 2011 at 9:40 PM, jmsc <michaelfpage at gmail.com> wrote:

I can't access the first URL directly, even from my web browser
without R being involved at all. Is that "pid" a parcel ID that you
need to be logged in to see? Or is it no longer a valid parcel ID?

If you want to access a web site from R that needs a login/password
then you need to send the appropriate login form info from R and keep
the cookie session info that gets returned. Web sessions from R and
from a web browser are independent.
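
As a rough sketch of that, assuming the RCurl package is installed: one
curl handle has its cookie engine switched on, the login form is posted
through it, and the same handle is reused for the later request. The
function name login_and_fetch, the example.com URLs and the form field
names in the usage comment are hypothetical placeholders; the real field
names would have to be read from the site's login form.

```r
library(RCurl)

# Hypothetical helper: post a login form, then fetch a page in the same session.
login_and_fetch <- function(login_url, page_url, fields) {
  # One handle = one session. cookiefile = "" switches on libcurl's
  # in-memory cookie engine, so cookies set by the server are sent back
  # on every later request made through this handle.
  h <- getCurlHandle(cookiefile = "", followlocation = TRUE)

  # Send the form fields as an ordinary URL-encoded POST.
  invisible(postForm(login_url, .params = fields, style = "POST", curl = h))

  # This request carries whatever session cookie the login returned.
  getURL(page_url, curl = h)
}

# Usage (not run; URLs and field names are made up):
# html <- login_and_fetch("http://www.example.com/login",
#                         "http://www.example.com/members/data.html",
#                         list(username = "me", password = "secret"))
```

A fresh call to getURL() without the handle would start a new,
cookie-less session, which is the R equivalent of opening the page in a
different browser.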

Barry

jmsc

Sun, Apr 17, 2011 3:56 PM

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110417/71cf4f85/attachment.pl>

Barry Rowlingson

Mon, Apr 18, 2011 2:26 AM

On Sun, Apr 17, 2011 at 11:56 PM, jmsc <michaelfpage at gmail.com> wrote:

The site doesn't require a login or password, but it uses session cookies
to simulate a logged-in user (there's even a log-out button that clears
the session).

I had a quick look for R-help posts on this (RSiteSearch("cookies"),
RSiteSearch("session"), etc.) but didn't find much. You probably want to
install RCurl and look at the examples.
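
A minimal sketch of what that might look like, assuming RCurl is
installed: a single curl handle with cookies enabled is used first for
an entry page and then for the page that is actually wanted. The helper
name fetch_in_session is made up for illustration, and which page has to
be visited first to obtain the cookie is an assumption; it may well be a
page on data.visionappraisal.com rather than the www front page.

```r
library(RCurl)

# Hypothetical helper: visit an entry page to obtain a session cookie,
# then request the target page through the same handle.
fetch_in_session <- function(entry_url, target_url) {
  # cookiefile = "" enables libcurl's in-memory cookie handling for this handle.
  h <- getCurlHandle(cookiefile = "", followlocation = TRUE)

  # First request: the result is discarded, only the cookie matters.
  invisible(getURL(entry_url, curl = h))

  # Second request: the cookie from the first request is sent along.
  getURL(target_url, curl = h)
}

# Usage with the URLs from this thread (not run here):
# html <- fetch_in_session(
#   "http://www.visionappraisal.com/databases/",
#   "http://data.visionappraisal.com/CanterburyCT/parcel.asp?pid=1242")
# page_lines <- strsplit(html, "\r?\n")[[1]]
```

scan() opens a new connection on every call and keeps no cookies, so it
cannot carry a session from one request to the next.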

Generally what happens is that a successful login, or in this case
just visiting the database front page, causes the web server to send
back a 'cookie' with a long ID number in it. For every further access
to that web site your browser includes the cookie. The server then
looks up the ID, goes 'yup, this is a valid session', and sends you
the page you want. If the cookie isn't there, or the ID isn't valid
(and the ID numbers are big enough to make guessing impractical), then
you get the default page.
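
The server-side logic just described can be mimicked with a toy model in
base R. This is only an illustration of the idea, not how this
particular site is implemented; the names sessions, visit_front_page
and request_page are invented for the sketch.

```r
# Toy model of the server: a store of valid session IDs.
sessions <- new.env()

# Visiting the front page creates a session and hands back its ID (the "cookie").
visit_front_page <- function() {
  id <- paste(sample(c(0:9, letters[1:6]), 32, replace = TRUE), collapse = "")
  assign(id, TRUE, envir = sessions)
  id
}

# Any other page is served only if the request carries a known session ID;
# otherwise the server falls back to the default page.
request_page <- function(page, cookie = NULL) {
  valid <- !is.null(cookie) && exists(cookie, envir = sessions, inherits = FALSE)
  if (valid) page else "default page"
}

cookie <- visit_front_page()
request_page("parcel 1242", cookie)        # valid session: page is served
request_page("parcel 1242")                # no cookie: default page
request_page("parcel 1242", "made-up-id")  # unknown ID: default page
```

A plain scan() of the parcel URL corresponds to the second call: no
cookie is sent, so the default page comes back.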

Barry

jmsc

Mon, Apr 18, 2011 11:32 AM

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110418/b15f4beb/attachment.pl>