Yahoo bug in tseries::get.hist.quote and its::priceIts

4 messages · Douglas Bates, Peter Dalgaard, Dirk Eddelbuettel

Sat, Apr 24, 2004 5:32 PM

Both get.hist.quote, and its derivative priceIts, rely on download.file() to
fetch financial data series from Yahoo! in .csv format. They allow for nice
interactive demonstrations of what one can do with R.

Unfortunately, both are currently broken as Yahoo! decided to add a somewhat
useless html comment at the end of the csv 'stream', breaking the regular
format of n rows with k columns.  Here is an example for the S&P500 index
since the beginning of the month (to keep it compact):

Date,Open,High,Low,Close,Volume,Adj. Close*
23-Apr-04,1140.81,1141.75,1134.89,1140.60,1820460032,1140.60
22-Apr-04,1122.01,1142.53,1121.98,1139.93,2147280000,1139.93
21-Apr-04,1119.24,1125.66,1116.07,1124.09,1995879936,1124.09
20-Apr-04,1137.60,1139.27,1118.09,1118.15,1806850048,1118.15
19-Apr-04,1132.81,1136.17,1129.87,1135.82,1374380032,1135.82
16-Apr-04,1133.86,1136.75,1126.92,1134.61,1723180032,1134.61
15-Apr-04,1130.45,1133.72,1120.85,1128.84,1895289984,1128.84
14-Apr-04,1122.44,1132.47,1122.33,1128.17,1682800000,1128.17
13-Apr-04,1145.20,1147.73,1127.72,1129.44,1616720000,1129.44
12-Apr-04,1141.98,1147.24,1139.32,1145.20,1194080000,1145.20
9-Apr-04,1149.73,1139.32,1139.32,1139.32,0,1139.32
8-Apr-04,1140.53,1148.91,1134.54,1139.32,1435520000,1139.32
7-Apr-04,1146.25,1148.16,1138.48,1140.53,1658200064,1140.53
6-Apr-04,1144.26,1150.57,1143.35,1148.16,1551449984,1148.16
5-Apr-04,1141.81,1150.57,1141.63,1150.57,1614749952,1150.57
2-Apr-04,1144.15,1144.73,1132.17,1141.81,2134489984,1141.81
1-Apr-04,1128.14,1135.53,1126.21,1132.17,1765560064,1132.17
<!-- chart2.finance.scd.yahoo.com uncompressed Sat Apr 24 15:27:40 PDT 2004 -->

Is there an _elegant and portable_ way of reading this despite the last line?
I needed this, and used the somewhat clunky

    data <- read.csv(destfile)
    unlink(destfile)
    data <- data[-(nlines-1),]          # skip very last line with comment

which uses nlines, which had already been computed (and has an offset of one
because of the header line).

I'd be happy to send this as a patch to tseries and its, but I have the
feeling we could do better.  How?

Thanks,  Dirk

The relationship between the computed price and reality is as yet unknown.  
                                             -- From the pac(8) manual page

Douglas Bates

Sat, Apr 24, 2004 5:50 PM

Dirk Eddelbuettel <edd@debian.org> writes:

If you do not expect to encounter the "<" character in your data you
could try adding comment.char = "<" to your call to read.csv.
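
A minimal sketch of that suggestion, using a few rows copied from the sample
earlier in the thread. The temporary file only stands in for the file that
download.file() would have written, and the variable names are illustrative:

```r
# Sample rows from the thread, followed by the trailing HTML comment.
csv_lines <- c(
  "Date,Open,High,Low,Close,Volume,Adj. Close*",
  "23-Apr-04,1140.81,1141.75,1134.89,1140.60,1820460032,1140.60",
  "22-Apr-04,1122.01,1142.53,1121.98,1139.93,2147280000,1139.93",
  "<!-- chart2.finance.scd.yahoo.com uncompressed Sat Apr 24 15:27:40 PDT 2004 -->"
)
destfile <- tempfile(fileext = ".csv")   # stand-in for the downloaded file
writeLines(csv_lines, destfile)

# Everything from "<" to the end of a line is treated as a comment, so the
# last line becomes empty and is skipped along with other blank lines.
data <- read.csv(destfile, comment.char = "<")
unlink(destfile)

str(data)
```

This only holds as long as no data field ever contains "<", which is the
caveat stated above.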

Peter Dalgaard

Sat, Apr 24, 2004 6:19 PM

Dirk Eddelbuettel <edd@debian.org> writes:

Er, how does this affect get.hist.quote? I see some flakiness, but the
basic conversion appears to work:

trying URL
`http://chart.yahoo.com/table.csv?s=spc&a=0&b=01&c=1998&d=3&e=24&f=2004&g=d&q=q&y=0&z=spc&x=.csv'
Error in download.file(url, destfile, method = method) :
        cannot open URL
`http://chart.yahoo.com/table.csv?s=spc&a=0&b=01&c=1998&d=3&e=24&f=2004&g=d&q=q&y=0&z=spc&x=.csv'
In addition: Warning message:
cannot open: HTTP status was `404 Not Found'

trying URL
`http://chart.yahoo.com/table.csv?s=spc&a=0&b=01&c=1998&d=3&e=24&f=2004&g=d&q=q&y=0&z=spc&x=.csv'
Content type `application/octet-stream' length unknown
opened URL
.......... .......... .......... .......... ..........
.......... .......... ..
downloaded 72Kb

time series starts 1998-01-02
time series ends   2004-04-01

(Yes, that's the same URL, a few seconds later!)

How about this?

`data.frame':   1586 obs. of  7 variables:
 $ Date       : Factor w/ 1586 levels "1-Apr-02","1-Ap..",..: 786 732 681 629 524 368 315 263 210 157 ...
 $ Open       : num  91.0 90.5 91.2 92.0 91.9 ...
 $ High       : num  91.6 91.5 91.4 92.5 92.3 ...
 $ Low        : num  90.4 89.7 90.7 90.7 91.7 ...
 $ Close      : num  91.3 90.7 91.3 90.7 91.9 ...
 $ Volume     : int  5063200 7988000 4623400 4260200 4159400 1111800 6844200 5316300 5013600 3112600 ...
 $ Adj..Close.: num  91.3 90.7 91.3 90.7 91.9 ...

O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907

Dirk Eddelbuettel

Sat, Apr 24, 2004 6:51 PM

On Sun, Apr 25, 2004 at 01:16:15AM +0200, Peter Dalgaard wrote:

Ah, yes, my bad. I was working with priceIts, which flakes out because its
fails on the NA that the comment turns into:

Error in validObject(.Object) : Invalid "its" object: Missing values in dates
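
To see where such an NA can come from, here is a small sketch that reads a
few of the sample rows without any comment handling. It does not call its at
all; the as.Date() conversion and its format string are my own stand-in for
whatever date parsing priceIts performs, so treat that part as an assumption:

```r
# Sample rows from the thread, followed by the trailing HTML comment.
csv_lines <- c(
  "Date,Open,High,Low,Close,Volume,Adj. Close*",
  "23-Apr-04,1140.81,1141.75,1134.89,1140.60,1820460032,1140.60",
  "22-Apr-04,1122.01,1142.53,1121.98,1139.93,2147280000,1139.93",
  "<!-- chart2.finance.scd.yahoo.com uncompressed Sat Apr 24 15:27:40 PDT 2004 -->"
)

# With the defaults, the comment is read as one more data row: its text lands
# in the first column and the remaining columns are padded with NA.
data <- read.csv(text = csv_lines)
last <- data[nrow(data), ]
print(last)

# Hypothetical date conversion. Matching "%b" depends on the locale, but the
# comment text is not a date in any locale, so the last element is NA.
parsed <- as.Date(as.character(data$Date), format = "%d-%b-%y")
is.na(parsed[nrow(data)])
```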

Doug's suggestion of simply using '<' as the comment char is good. I was so
fixated on specifying '<!--' as one that I didn't think of '<'.

But ...

that wins the prize. That is pretty much what I was thinking of. Neato.
Doesn't rely on the position of '<!--' within the file, and is less likely
to trigger a false positive than hunting for '<' is.
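
The code being praised here is not part of this message. Purely as an
assumption about the general idea (drop any line that starts with '<!--',
wherever it sits, then parse the rest), a sketch could look like this, with
the temporary file standing in for the downloaded one:

```r
# Sample rows from the thread, followed by the trailing HTML comment.
csv_lines <- c(
  "Date,Open,High,Low,Close,Volume,Adj. Close*",
  "23-Apr-04,1140.81,1141.75,1134.89,1140.60,1820460032,1140.60",
  "22-Apr-04,1122.01,1142.53,1121.98,1139.93,2147280000,1139.93",
  "<!-- chart2.finance.scd.yahoo.com uncompressed Sat Apr 24 15:27:40 PDT 2004 -->"
)
destfile <- tempfile(fileext = ".csv")   # stand-in for the downloaded file
writeLines(csv_lines, destfile)

raw <- readLines(destfile)
unlink(destfile)

# Keep every line that does not begin with the HTML comment opener. A lone
# "<" inside a data field would not match this pattern.
keep <- !grepl("^<!--", raw)
data <- read.csv(text = raw[keep])

str(data)
```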

Thanks!

Dirk
