as.Date (and strptime?) does not recognize "&nbsp;" as a blank
Depends a bit on what you mean by "automatically". This seems to work for me (note that this has NOT been extensively tested on different OSes, or even in different locales/encodings):

library(XML)

# Two one-cell rows: the first holds a no-break space, the second a plain space.
myhtml <- "<html><body><table id='hiya'><tr><th>colname</th></tr><tr><td>&nbsp;</td></tr><tr><td> </td></tr></table></body></html>"
doc <- htmlParse(myhtml, asText = TRUE)

# Default behaviour: the two cells come back as different strings.
oldway <- readHTMLTable(doc, trim = FALSE)
identical(oldway$hiya$colname[1], oldway$hiya$colname[2])  # FALSE :(

# Replace the UTF-8 bytes of the no-break space (c2 a0) with a plain space.
decode_nbsp <- function(x) gsub(rawToChar(as.raw(c(0xc2, 0xa0))), " ", x, fixed = TRUE, useBytes = TRUE)

# Per-cell extractor passed to readHTMLTable via elFun.
fancypants <- function(node) decode_nbsp(xmlValue(node))
newandfancy <- readHTMLTable(doc, trim = FALSE, elFun = fancypants)
identical(newandfancy$hiya$colname[1], newandfancy$hiya$colname[2])  # TRUE :D

Best,
~G

On Fri, Jun 24, 2022 at 11:48 PM Spencer Graves <spencer.graves at prodsyse.com> wrote:
p.s. Is there a way to get XML::readHTMLTable to automatically convert "&nbsp;" to a normal blank space?

On 6/25/22 1:37 AM, Spencer Graves wrote:
Hello, All:
When is a space not a space?
Consider the following:
> (pblmDate <- textutils::HTMLdecode("&nbsp;2 Mar 2018"))
[1] " 2 Mar 2018"
> as.Date(pblmDate, format='%e %b %Y')
[1] NA
> as.Date(' 2 Mar 2018', format='%e %b %Y')
[1] "2018-03-02"
Is this a feature or a bug?
I can work around it, now that I know what it is, but it took me
a few hours to diagnose.
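
For anyone hitting the same thing, here is a minimal sketch of one way to see the difference. It assumes the decoded character is the Unicode no-break space (U+00A0), which is what "&nbsp;" stands for; the variable names are only illustrative.

```r
# Build the two strings explicitly; "\u00a0" is the no-break space (U+00A0).
nbsp_date  <- "\u00a02 Mar 2018"
plain_date <- " 2 Mar 2018"

# The two strings print alike, but they are not the same string.
identical(nbsp_date, plain_date)

# Code point of the first character: 160 (U+00A0) versus 32 (ASCII space).
utf8ToInt(nbsp_date)[1]
utf8ToInt(plain_date)[1]

# In UTF-8 the no-break space is the two-byte sequence c2 a0.
charToRaw(enc2utf8("\u00a0"))
```

Looking at code points or raw bytes avoids relying on how the console renders the character.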
Thanks,
Spencer Graves
p.s. I got this from scraping a website with code that had worked for
me roughly 20 months ago. I suspect that in the interim, someone
probably replaced ' 2 Mar 2018' with "&nbsp;2 Mar 2018".
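
One possible workaround, sketched under the assumption that the only offending character is the no-break space (U+00A0): replace it with an ordinary space before parsing. The helper name normalize_nbsp is hypothetical, not part of any package, and %b depends on the locale, so this assumes English month abbreviations are recognised.

```r
# Hypothetical helper: replace every no-break space (U+00A0) with a plain space.
normalize_nbsp <- function(x) gsub("\u00a0", " ", x, fixed = TRUE)

# Same content as the decoded string in the example above.
pblm_date <- "\u00a02 Mar 2018"

# After normalisation, the string is the plain-space input that parsed above.
clean_date <- normalize_nbsp(pblm_date)
as.Date(clean_date, format = "%e %b %Y")
```

Applying the replacement to the whole scraped column before any date conversion keeps the parsing code itself unchanged.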
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel