
rowspan and readHTMLTable

2 messages · Chris Stubben

#
I'm trying to read HTML tables that have many rowspan attributes, for 
example:

library(XML)

x <- htmlParse("<table>
   <tr><td rowspan=2>ab</td><td>X</td></tr>
   <tr><td rowspan=2>YZ</td></tr>
   <tr><td>c</td></tr>
</table>", asText = TRUE)

 readHTMLTable(x, which=1)
  V1   V2
1 ab    X
2 YZ <NA>
3  c <NA>

Does anyone know how to use the rowspan attributes to repeat cell 
values down the spanned rows, so that the table looks like this?

  V1   V2
1 ab    X
2 ab   YZ
3  c   YZ

Also, the actual tables I'm using are large. For example, this one has 
206 rows, with rowspan values from 2 to 14 scattered across all 8 
columns, so the shifted rows in t1 are not very useful right now.

t1 <- readHTMLTable("http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3544749/table/T1", which=1)
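
One way to see where the shifted rows come from is to tabulate the rowspan values and count the cells in each row. The following is a minimal sketch on the toy table from this post, assuming the XML package is installed; the variable names spans and cells_per_row are illustrative only. A large table could presumably be inspected the same way after parsing it with htmlParse.

```r
library(XML)

# toy table from the post, parsed from a string
x <- htmlParse("<table>
   <tr><td rowspan=2>ab</td><td>X</td></tr>
   <tr><td rowspan=2>YZ</td></tr>
   <tr><td>c</td></tr>
</table>", asText = TRUE)

# rowspan of every cell, defaulting to 1 when the attribute is absent
spans <- as.numeric(xpathSApply(x, "//td", xmlGetAttr, "rowspan", 1))
table(spans)

# number of <td> cells physically present in each <tr>;
# rows with fewer cells than columns are the ones that end up shifted
cells_per_row <- sapply(getNodeSet(x, "//tr"),
                        function(tr) length(getNodeSet(tr, "./td")))
cells_per_row   # 2 1 1 for the toy table
```

Rows 2 and 3 each contain a single cell, so a reader that ignores rowspan puts both of them in the first column.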

Thanks,
Chris

1 day later
#
Sorry to answer my own question - I guess here's one way to read this 
table.  Other suggestions are still welcome.

Chris

------

library(XML)

x <- htmlParse("<table>
<tr><td rowspan=2>ab</td><td>X</td></tr>
<tr><td rowspan=2>YZ</td></tr>
<tr><td>c</td></tr>
</table>", asText = TRUE)

# split by rows
z <- getNodeSet(x, "//tr")

# create empty data.frame - probably not the best solution...
# (one row per <tr>; 2 columns in this example)
t1 <- data.frame(matrix(NA, nrow = length(z), ncol = 2))

for (i in seq_along(z)) {
   # rowspan of each cell in this row, defaulting to 1 when absent
   rowspan <- as.numeric(xpathSApply(z[[i]], ".//td", xmlGetAttr, "rowspan", 1))
   val <- xpathSApply(z[[i]], ".//td", xmlValue)

   # fill values into the cells not already filled by a rowspan from above
   n <- which(is.na(t1[i, ]))
   t1[i, n] <- val

   for (j in which(rowspan > 1)) {
      ## repeat value down column for the remaining spanned rows
      t1[(i + 1):(i + rowspan[j] - 1), n[j]] <- val[j]
   }
}


t1
  X1 X2
1 ab  X
2 ab YZ
3  c YZ


If you are interested, I used this code in the pmcTable function at
https://github.com/cstubben/pubmed
To get Table 1, this now works:

doc <- pmc("PMC3544749")  # downloads XML from OAI service
t1 <- pmcTable(doc, 1)    # parse table... also saves caption and footnotes to attributes
t1[1:4,1:4]
                          Category Gen Name Rv number                                      Description
1 Lipids and Fatty Acid Metabolism     kasB    Rv2246 3-oxoacyl-[acyl-carrier protein] synthase 2 kasb
2           Mycolic acid synthesis    mmaA4   Rv0642c                  Methoxy mycolic acid synthase 4
3           Mycolic acid synthesis     pcaA   Rv0470c    Mycolic acid synthase (cyclopropane synthase)
4           Mycolic acid synthesis     pcaA   Rv0470c    Mycolic acid synthase (cyclopropane synthase)
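
The row-by-row fill from the reply in this thread can also be wrapped in a function that sizes the grid itself. The following is only a sketch under stated assumptions: expand_spans is a hypothetical helper name (it is not part of the XML package and is not the pmcTable code), it assumes no nested tables, and it goes beyond the original loop by also repeating values across colspan and by reading th cells.

```r
library(XML)

# Hypothetical helper (illustration only): repeat each cell's text over
# every row and column that its rowspan/colspan attributes cover.
expand_spans <- function(table_node) {
  rows <- getNodeSet(table_node, ".//tr")
  nr <- length(rows)
  vals   <- matrix(NA_character_, nrow = nr, ncol = 0)  # cell text
  filled <- matrix(FALSE, nrow = nr, ncol = 0)          # grid positions already taken

  for (i in seq_len(nr)) {
    cells <- getNodeSet(rows[[i]], "./td | ./th")
    k <- 1
    for (idx in seq_along(cells)) {
      cell <- cells[[idx]]
      rs <- as.integer(xmlGetAttr(cell, "rowspan", 1))
      cs <- as.integer(xmlGetAttr(cell, "colspan", 1))
      # skip columns already claimed by a rowspan from an earlier row
      while (k <= ncol(filled) && filled[i, k]) k <- k + 1
      cols <- k:(k + cs - 1)
      # grow the grid when this cell reaches past the current last column
      extra <- max(cols) - ncol(vals)
      if (extra > 0) {
        vals   <- cbind(vals,   matrix(NA_character_, nr, extra))
        filled <- cbind(filled, matrix(FALSE, nr, extra))
      }
      rws <- i:min(nr, i + rs - 1)   # do not run past the last row
      vals[rws, cols]   <- xmlValue(cell)
      filled[rws, cols] <- TRUE
      k <- max(cols) + 1
    }
  }
  as.data.frame(vals, stringsAsFactors = FALSE)
}

# usage on the toy table from this thread
x <- htmlParse("<table>
<tr><td rowspan=2>ab</td><td>X</td></tr>
<tr><td rowspan=2>YZ</td></tr>
<tr><td>c</td></tr>
</table>", asText = TRUE)
out <- expand_spans(getNodeSet(x, "//table")[[1]])
out
```

Tracking filled positions in a separate logical matrix, rather than testing for NA, keeps empty cells from being mistaken for unfilled ones.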