scan html: sep = "<td>"

entry from html:

  <tr bgcolor=#9090f0><td align="right"><b>BM</b></td><td> 
0.952</td><td> 0.136</td><td> 6.984</td><td>0.000000</td></tr>
  <tr bgcolor=#9090f0><td align="right"><b>BH</b></td><td> 
1.338</td><td> 0.136</td><td> 9.821</td><td>0.000000</td></tr>

 using
left.data<- scan(paste(path, left.file, sep = ""), what = 'character',
               sep=c("<td>", "</td>"))

yields

 > left.data
 [1] "  "                  "tr bgcolor=#9090f0>" "td align=right>"
 [4] "b>BM"                "/b>"                 "/td>"
 [7] "td> 0.952"           "/td>"                "td> 0.136"
[10] "/td>"                "td> 6.984"           "/td>"
[13] "td>0.000000"         "/td>"                "/tr>"
[16] "  "                  "tr bgcolor=#9090f0>" "td align=right>"
[19] "b>BH"                "/b>"                 "/td>"
[22] "td> 1.338"           "/td>"                "td> 0.136"
[25] "/td>"                "td> 9.821"           "/td>"
[28] "td>0.000000"         "/td>"                "/tr>"

Why doesn't it treat the whole '<td>' as the separator?

Uwe Ligges wrote:

Christoph Lehmann wrote:

Hi
I am trying to import HTML text and need to split the fields at each <td>
or </td> entry.

How can I do that? sep = '<td>' doesn't yield the right result.

If it fits pairwise together, use
  sep=c("<td>", "</td>")
Apologies, one should not send untested code.
"sep" must be a single character, not a string containing more than one
character.

So you may want to try out my second suggestion.

Uwe Ligges
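
Since "sep" has to be a single character, one possible workaround is to replace every <td ...> and </td> tag with a single character first and then split on that. The following is only a sketch under assumptions: the sample line is taken from the post above, the tab is assumed not to occur in the data, and the regular expression is an illustration rather than a general HTML parser.

```r
# One line of cells, as in the post above.
line <- "<td> 0.952</td><td> 0.136</td><td> 6.984</td>"

# Replace each opening or closing td tag with a tab, a single character
# that scan() accepts as sep (assumption: no tabs in the data itself).
tabbed <- gsub("</?td[^>]*>", "\t", line)

# text= lets scan() read from a string instead of a file.
fields <- scan(text = tabbed, what = "character", sep = "\t",
               strip.white = TRUE, quiet = TRUE)

# Adjacent tags leave empty fields behind; drop them.
fields <- fields[nzchar(fields)]
print(fields)
```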
If not, you can read the whole lot with readLines() and then use
strsplit() on both patterns, for example.

Uwe Ligges
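
A minimal sketch of the readLines/strsplit route, under assumptions: the two rows from the post are held in a character vector so the example runs without a file (readLines() on the real file would return the same kind of vector), and split_row is a hypothetical helper name, not something from the thread. The pattern only covers simple tables like the one shown.

```r
# Stand-in for readLines(paste(path, left.file, sep = "")).
html_lines <- c(
  '  <tr bgcolor=#9090f0><td align="right"><b>BM</b></td><td> ',
  '0.952</td><td> 0.136</td><td> 6.984</td><td>0.000000</td></tr>',
  '  <tr bgcolor=#9090f0><td align="right"><b>BH</b></td><td> ',
  '1.338</td><td> 0.136</td><td> 9.821</td><td>0.000000</td></tr>'
)

# Join the physical lines, because a cell may be broken across two lines.
html_text <- paste(html_lines, collapse = "")

# One string per table row.
rows <- strsplit(html_text, "</tr>", fixed = TRUE)[[1]]

# Hypothetical helper: split one row at <td ...> and </td>, then tidy up.
split_row <- function(row) {
  cells <- strsplit(row, "</?td[^>]*>")[[1]]
  cells <- gsub("<[^>]*>", "", cells)  # drop remaining tags such as <tr>, <b>
  cells <- trimws(cells)               # remove surrounding whitespace
  cells[nzchar(cells)]                 # drop the empty pieces between tags
}

fields <- lapply(rows, split_row)

# Character matrix with one row per <tr>; the first column is the label.
tab <- do.call(rbind, fields)
values <- matrix(as.numeric(tab[, -1]), nrow = nrow(tab),
                 dimnames = list(tab[, 1], NULL))
print(values)
```

Unlike scan(), strsplit() takes a regular expression, so both the opening and the closing tag can be matched as whole strings in one pattern.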

Thanks for the hints.

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! 
http://www.R-project.org/posting-guide.html