as.Date (and strptime?) does not recognize " " as a blank
There is some misunderstanding here. The space is part of the format
specified by SG to as.Date(), which passes it to strptime(). So SG asked
to match a space and complained that a different character is not matched!
Reading the documentation of strptime shows
'%n' Newline on output, arbitrary whitespace on input.
'%t' Tab on output, arbitrary whitespace on input.
so one might hope that one could use those to specify whitespace instead
of ASCII space in the format. But unfortunately whether a Unicode
no-break space (U+00A0) is whitespace is a matter of opinion -- for
example the PCRE author changed his a few years back.
We don't have a reproducible example, but my attempt at reproduction
suggests that U+00A0 is not regarded as whitespace on the system I used.
We know this to be platform-specific (it uses the C function
iswspace): glibc does not regard this as whitespace and the replacement
functions used by R on macOS and Windows have followed suit.
In short, ASCII space matches only itself, and the interpretation of
'blank' (in regexps) or 'whitespace' (in strptime or regexps) is
platform-specific and liable to change.
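A self-contained reproduction that does not depend on textutils could look like the sketch below. It is an illustration rather than the original poster's code: the variable names are made up, and a numeric month (%m) replaces %b so that the result does not depend on the LC_TIME locale. Whether the U+00A0 inputs parse is platform-specific, so no outcome is claimed for them.

```r
# Two inputs that print alike but differ in their first character.
nbspDate  <- "\u00a02 03 2018"   # starts with U+00A0 (no-break space)
asciiDate <- " 2 03 2018"        # starts with an ASCII space (0x20)

fmt <- "%e %m %Y"

# The ASCII-space input is the case reported as working in this thread.
resAscii <- as.Date(asciiDate, format = fmt)

# The U+00A0 input: the outcome depends on how the platform classifies U+00A0.
resNbsp <- as.Date(nbspDate, format = fmt)

# Same input with a leading %t ("arbitrary whitespace on input");
# again platform-specific, so inspect rather than assume.
resNbspT <- as.Date(nbspDate, format = paste0("%t", fmt))

print(resAscii)
print(resNbsp)
print(resNbspT)
```

An NA in the second or third result would indicate that U+00A0 is not being treated as whitespace on the system at hand.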
On 25/06/2022 14:13, Spencer Graves wrote:
Hi, Maxim et al.:

On 6/25/22 6:10 AM, Maxim Nazarov wrote:
Hello,
When is a space not a space?
I guess the answer is when it is a non-breaking one?..
We can observe:
  > charToRaw(textutils::HTMLdecode("&nbsp;"))
  [1] c2 a0
  > charToRaw(" ")
  [1] 20
So one can argue that everything works correctly - the `textutils`
function converts HTML's non-breaking space '&nbsp;' into R's
non-breaking space '\xa0', while the %e format of as.Date expects a
'normal' space.
But this is obviously not user-friendly especially since both symbols
are displayed the same way on the console.
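Since the two characters look identical when printed, one way to tell them apart is to inspect code points instead of the printed string. The following is only a sketch; the helper name firstCodePoint is made up for illustration.

```r
# Two strings that print alike: the first starts with U+00A0, the second with 0x20.
x <- c("\u00a02 Mar 2018", " 2 Mar 2018")

# Code point of the first character of a string (enc2utf8 guards against
# a non-UTF-8 native encoding before calling utf8ToInt).
firstCodePoint <- function(s) utf8ToInt(enc2utf8(substr(s, 1, 1)))

codes <- vapply(x, firstCodePoint, integer(1), USE.NAMES = FALSE)
print(codes)   # 160 32

# Flag entries whose first character is not an ASCII space.
suspicious <- codes != 32L
print(suspicious)
```

Code point 160 is U+00A0, which corresponds to the bytes c2 a0 shown by charToRaw() above.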
So your options might be to either:
  * manually change all 'weird' spaces into normal ones with something
like gsub("\\h", " ", ..., perl = TRUE) - for the list of other weird
spaces see
https://www.pcre.org/original/doc/html/pcrepattern.html#genericchartypes
  * persuade textutils author to change &nbsp; into a normal space
(they seem to be working with a simple lookup table -
https://github.com/enricoschumann/textutils/blob/b813c7bd4b55daef5fa7612e3fbfe82962711940/R/char_refs.R#L1465-L1466)
  * persuade R-Core (or submit a PR) to relax expectations of
as.Date/strptime
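The first option might be sketched as follows. The input is built with "\u00a0" rather than textutils so that the example is self-contained, and the variable names are illustrative. Note that %b matches month names in the current locale, so the final call assumes an English (or C) LC_TIME setting.

```r
# String as it might come back from HTML decoding: leading U+00A0.
pblmDate <- "\u00a02 Mar 2018"

# With perl = TRUE, \h matches horizontal whitespace, including U+00A0;
# every such character is replaced by an ASCII space.
cleanDate <- gsub("\\h", " ", pblmDate, perl = TRUE)

# The first byte is now 0x20 instead of c2 a0.
print(charToRaw(substr(cleanDate, 1, 1)))

# Parse with the format used earlier in the thread
# (assumes English month abbreviations).
print(as.Date(cleanDate, format = "%e %b %Y"))
```

Replacing only U+00A0, for instance with gsub("\u00a0", " ", x, fixed = TRUE), would be a narrower alternative if other horizontal whitespace such as tabs should be left untouched.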
Thanks for the reply. Since "this is obviously not user-friendly", as you noted, I felt a need to bring it to the attention of this group, and let them decide what if anything they would want to do about it.

In any event, I found a fix for my immediate problem. It's not as elegant as yours, but it works.

Best Wishes,
Spencer
Kind regards,
Maxim Nazarov

----- On Jun 25, 2022, at 8:37 AM, Spencer Graves spencer.graves at prodsyse.com wrote:
Hello, All:

When is a space not a space?

Consider the following:
(pblmDate <- textutils::HTMLdecode("&nbsp;2 Mar 2018"))
[1] " 2 Mar 2018"
as.Date(pblmDate, format='%e %b %Y')
[1] NA
as.Date(' 2 Mar 2018', format='%e %b %Y')
[1] "2018-03-02"

Is this a feature or a bug?

I can work around it, now that I know what it is, but it took me a few hours to diagnose.

Thanks,
Spencer Graves

p.s. I got this from scraping a website with code that had worked for me roughly 20 months ago. I suspect that in the interim, someone probably replaced ' 2 Mar 2018' with "&nbsp;2 Mar 2018".
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Brian D. Ripley, ripley at stats.ox.ac.uk Emeritus Professor of Applied Statistics, University of Oxford