readLines interaction with gsub different in R-dev
Thank you for the report and analysis. Now fixed in R-devel.
Tomas
On 02/17/2018 08:24 PM, William Dunlap via R-devel wrote:
I think the problem in R-devel happens when there are non-ASCII characters
in any of the strings passed to gsub.

txt <- vapply(list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),
                   as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))),
              rawToChar, "")
txt
#[1] "Am?lie" "Amelia"
Encoding(txt)
#[1] "unknown" "unknown"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt)
#[1] "<a" "<a"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[1])
#[1] "<a"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[2])
#[1] "<aM><eL><iA>"

I can change the Encoding to "latin1" or "UTF-8" and get similar results
from gsub.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Feb 17, 2018 at 7:35 AM, Hugh Parsonage <hugh.parsonage at gmail.com> wrote:
| Confirmed for R-devel (current) on Ubuntu 17.10. But ... isn't the regexp
| you use wrong, i.e. isn't R-devel giving the correct answer?
No, I don't think R-devel is correct (or at least consistent with the
documentation). My interpretation of gsub("(\\w)", "\\U\\1", entry,
perl = TRUE) is "Take every word character and replace it with itself,
converted to uppercase."
Perhaps my example was too minimal. Consider the following:
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"
R> gsub("(\\w)", "\\1", entry, perl = TRUE)
[1] "author: Amélie" # OK, but very different to 'A', despite only not specifying uppercase
R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
[1] "AUTHOR: AMELIE" # OK, but very different to 'A'
R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
"AUTHOR" # Where did everything after the first group go?
I should note the following example too:
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
[1] "AUTHOR: AM??LIE" # latin1 encoding
A call to `readLines` (and possibly `scan()`, `read.table` and friends)
is essential to reproduce the problem.
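The expectation described above ("take every word character and replace it with itself, converted to uppercase") implies an invariant that can be checked without relying on how non-ASCII characters print: the result should have the same number of characters as the input. A minimal sketch of such a check follows. The helper name `keeps_length` is made up for illustration, and the string is written with a `\u` escape (an assumption, so the example does not depend on the encoding of the script file).

```r
# Hypothetical helper: TRUE where upper-casing each word character
# leaves the number of characters unchanged (the documented expectation).
keeps_length <- function(x) {
  out <- gsub("(\\w)", "\\U\\1", x, perl = TRUE)
  nchar(out, type = "chars") == nchar(x, type = "chars")
}

# Same round trip through a file as in the report, with an explicit escape.
tempf <- tempfile()
writeLines(enc2utf8("author: Am\u00e9lie"), con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")
unlink(tempf)

ascii_ok <- keeps_length("author: Amelie")  # plain ASCII input
utf8_ok  <- keeps_length(entry)             # FALSE would indicate the truncation reported here
```

On a build showing the truncated "A" result, `utf8_ok` would be FALSE while `ascii_ok` stays TRUE.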
On 18 February 2018 at 02:15, Dirk Eddelbuettel <edd at debian.org> wrote:
On 17 February 2018 at 21:10, Hugh Parsonage wrote:
| I was told to re-raise this issue with R-dev:
|
| In the documentation of R-dev and R-3.4.3, under ?gsub
|
| > replacement
| > ... For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion.
|
| However, the following code runs differently:
|
| tempf <- tempfile()
| writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
| entry <- readLines(tempf, encoding = "UTF-8")
| gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
|
|
| "AUTHOR: AM?LIE" # R-3.4.3
|
| "A" # R-dev
Confirmed for R-devel (current) on Ubuntu 17.10. But ... isn't the regexp
you use wrong, i.e. isn't R-devel giving the correct answer?
R> tempf <- tempfile()
R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
R> entry <- readLines(tempf, encoding = "UTF-8")
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"
R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR"
R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR: AM?LIE"
R>
Dirk
--
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
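For reference, the documented meaning of "\U", "\L" and "\E" in the replacement (with perl = TRUE) can be illustrated on pure-ASCII input, where none of the encoding questions raised in this thread arise. This is only a sketch; the input string and variable names are chosen for illustration.

```r
x <- "author: amelie"

# \U upper-cases the rest of the replacement: each matched word is upper-cased.
all_upper <- gsub("(\\w+)", "\\U\\1", x, perl = TRUE)
# "AUTHOR: AMELIE"

# \E ends the case conversion, so only the first group is upper-cased.
key_upper <- gsub("^(\\w+): (\\w+)", "\\U\\1\\E: \\2", x, perl = TRUE)
# "AUTHOR: amelie"

# \U and \L can be combined: first letter upper, remainder lower.
title_case <- gsub("\\b(\\w)(\\w*)", "\\U\\1\\L\\2", x, perl = TRUE)
# "Author: Amelie"
```

The report in this thread is that input containing non-ASCII characters did not follow the same pattern in R-devel at the time.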
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel