Skip to content

read.csv fails to read a CSV file from google docs

8 messages · Tal Galili, David Winsemius, William Dunlap +2 more

#
On Apr 29, 2011, at 11:19 AM, Tal Galili wrote:

            
I am always amused at such claims. Occasionally they are correct, but  
more often a crucial step has been omitted. In this case you have at a  
minimum embedded line-feeds in your URL string and have not  
established a connection, so it could not possibly have succeeded as  
presented.

But now it's time to admit I do not know why it is not succeeding when  
I correct those flaws.

 > closeAllConnections()
 > data_url <- url("http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv 
")
 > read.csv(data_url)
Error in open.connection(file, "rt") : cannot open the connection

 > closeAllConnections()
 > dd <- read.csv(con <-  url("http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv 
"))
Error in open.connection(file, "rt") : cannot open the connection


So, I guess I'm not reading the help pages for `url` and `read.csv` as  
well I thought I was.

  
    
#
With S+ I get:
 S+>
download.file("http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=
en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&ou
tput=csv", destfile="e:/temp/splus")
 Problem in
download.file("http://spreadsheets0.google.com/spreadsheet/pu..: Could
not get url: un
 supported protocol, libcurl was built with SSL disabled, https: not
supported!
and with cygwin's wget I get
 E:\temp\jnk>wget
"http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDT
Vek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&outpu
t=csv"
 --2011-04-29 11:00:10--
http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTV
ek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=
0&output=csv
 Resolving spreadsheets0.google.com... 74.125.224.73, 74.125.224.71,
74.125.224.64, ...
 Connecting to spreadsheets0.google.com|74.125.224.73|:80... connected.
 HTTP request sent, awaiting response... 302 Moved Temporarily
 Location:
https://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDT
Vek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv [
following]
 --2011-04-29 11:00:11--
https://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDT
Vek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid
=0&output=csv
 Connecting to spreadsheets0.google.com|74.125.224.73|:443... connected.
 ERROR: cannot verify spreadsheets0.google.com's certificate, issued by
`/C=US/O=Google Inc/CN=Google Internet Authority':
   Unable to locally verify the issuer's authority.
 To connect to spreadsheets0.google.com insecurely, use
`--no-check-certificate'.
 Unable to establish SSL connection.

so I suspect that the SLL/certifcate business may also be the problem
when
using R to get the document.  The R error message is not very
illuminating.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Thanks David for fixing the early issues.

The reason for the failure is that the response
from the Web server is a to redirect the requester
to another page, specifically

 https://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv

Note that this is https, not http, and the built-in URL reading facilities in R don't suport https.


One way to see this is to use look at the headers in your browser (e.g. Live HTTP Headers),
or to use curl, or the RCurl package

tt = getForm("http://spreadsheets0.google.com/spreadsheet/pub",
              hl ="en", key = "0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE",
              single = "true", gid ="0",
              output = "csv",
             .opts = list(followlocation = TRUE, verbose = TRUE))


The verbose option shows the entire dialog, and tt contains the
text of the CSV document.

 read.csv(textConnection(tt))

then yields the data frame

  D.
On 4/29/11 10:36 AM, David Winsemius wrote:
#
On Fri, Apr 29, 2011 at 06:19:24PM +0300, Tal Galili wrote:
I get the same error (R 2.11.1, Debian LINUX) and don't have a
solution. But I did some tests and found the origin of the problem

I can download the file from google with wget but get some interesting
?information in the process:

------------------------------------------------------------
$ wget -v 'http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv'
--2011-04-29 20:07:40--  http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv
Resolving spreadsheets0.google.com... 209.85.148.139, 209.85.148.113, 209.85.148.138, ...
Connecting to spreadsheets0.google.com|209.85.148.139|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv [following]
--2011-04-29 20:07:41--  https://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv
Connecting to spreadsheets0.google.com|209.85.148.139|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ?pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv.1?

    [ <=>                                                                                                       ] 41          --.-K/s   in 0s      

2011-04-29 20:07:42 (342 KB/s) - ?pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv.1? saved [41]
------------------------------------------------------------

The message that caught my attention was the http redirection: "302 Moved
Temporarily".

If you try again with the new url you get this:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") : unsupported URL scheme

?url told me "Note that ?https://? connections are not supported."
Case closed, problem unsolved...

Dirty workaround: use system() and wget or whatever command is available on
Windows for this.

cu
	Philipp
#
Hi Tal

You can add

  ssl.verifypeer = FALSE

in the .opts list so that the certificate is simply accepted.

Alternatively, you can tell libcurl where to find the certification
authority file containing signatures. This can be done via the cainfo
option, e.g.

   cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"),

Often such a collection of certificates is installed with the ssl library.

  D.
On 4/29/11 2:42 PM, Tal Galili wrote: