readLines interaction with gsub different in R-dev

I was told to re-raise this issue with R-dev:

In the documentation of R-dev and R-3.4.3, under ?gsub

> replacement
>    ... For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion.

However, the following code runs differently:

tempf <- tempfile()
writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")
gsub("(\\w)", "\\U\\1", entry, perl = TRUE)

"AUTHOR: AMÉLIE"  # R-3.4.3

"A"               # R-dev

Best,

Hugh Parsonage.

Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
you use wrong, i.e. isn't R-devel giving the correct answer?

R> tempf <- tempfile()
R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
R> entry <- readLines(tempf, encoding = "UTF-8")
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"
R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR"
R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
[1] "AUTHOR: AMÉLIE"

Dirk
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
| Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
| you use wrong, i.e. isn't R-devel giving the correct answer?

No, I don't think R-devel is correct (or at least consistent with the
documentation). My interpretation of gsub("(\\w)", "\\U\\1", entry,
perl = TRUE) is "Take every word character and replace it with itself,
converted to uppercase."
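
On ASCII-only input, where R-3.4.3 and R-devel agree, this reading can be
checked directly. The following is only a minimal sketch: the sample strings
are illustrative, and the second and third calls show the documented "\E" and
"\L" escapes.

```r
x <- "author: Amelie"

# \U upper-cases each captured word character.
upper <- gsub("(\\w)", "\\U\\1", x, perl = TRUE)
upper                          # "AUTHOR: AMELIE"
identical(upper, toupper(x))   # TRUE for this ASCII-only input

# \E ends the conversion, so only the first group is upper-cased.
first_only <- gsub("^(\\w+): (\\w+)", "\\U\\1\\E: \\2", x, perl = TRUE)
first_only                     # "AUTHOR: Amelie"

# \L lower-cases the rest of the replacement.
titled <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "a TEST of capitalizing",
               perl = TRUE)
titled                         # "A Test Of Capitalizing"
```

The conversion state starts afresh for every match, which is why each word
character (or each word) is converted independently.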

Perhaps my example was too minimal. Consider the following:

R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"

R> gsub("(\\w)", "\\1", entry, perl = TRUE)
[1] "author: Amélie"  # OK, but very different to 'A', even though the
                      # only change is not specifying uppercase

R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
[1] "AUTHOR: AMELIE"  # OK, but very different to 'A'

R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
[1] "AUTHOR"  # Where did everything after the first group go?

I should note the following example too:
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
[1] "AUTHOR: AMÃ©LIE"  # latin1 encoding

A call to `readLines` (or possibly `scan()`, `read.table()` and friends)
is essential to reproduce this.
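
One way to see what `readLines` contributes is to inspect the encoding mark it
leaves on the string. This is a minimal sketch, assuming the file holds the
UTF-8 bytes written as in the example above; "\u00e9" is used only so that the
script itself stays ASCII.

```r
tempf <- tempfile()
writeLines(enc2utf8("author: Am\u00e9lie"), con = tempf, useBytes = TRUE)
entry <- readLines(tempf, encoding = "UTF-8")
unlink(tempf)

Encoding(entry)                # "UTF-8": the mark requested from readLines
Encoding("author: Amelie")     # "unknown": ASCII strings are never marked
validUTF8(entry)               # TRUE
nchar(entry, type = "chars")   # 14
nchar(entry, type = "bytes")   # 15, since the accented letter takes two bytes
```

The ASCII literal carries no mark and no multi-byte characters, which may be
why the same gsub() call on "author: Amelie" is unaffected.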

I think the problem in R-devel happens when there are non-ASCII characters
in any of the strings passed to gsub.

txt <- vapply(list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),
                   as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))), rawToChar, "")
txt
#[1] "Amélie" "Amelia"
Encoding(txt)
#[1] "unknown" "unknown"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt)
#[1] "<a" "<a"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[1])
#[1] "<a"
gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[2])
#[1] "<aM><eL><iA>"

I can change the Encoding to "latin1" or "UTF-8" and get similar results
from gsub.
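
To check which elements of a vector contain non-ASCII bytes, and what changing
the Encoding actually marks, something like the following could be used. It is
only a sketch: `has_non_ascii` is a hypothetical helper written for this
illustration, not a base R function.

```r
txt <- vapply(list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),
                   as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))),
              rawToChar, "")

# Hypothetical helper: TRUE for elements containing any byte above 0x7F.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

has_non_ascii(txt)   # TRUE FALSE

# Declaring an encoding only marks the non-ASCII element.
utf8 <- txt; Encoding(utf8) <- "UTF-8"
lat1 <- txt; Encoding(lat1) <- "latin1"
Encoding(utf8)       # "UTF-8"  "unknown"
Encoding(lat1)       # "latin1" "unknown"
```

Only the first element has bytes above 0x7F, so it is the only one whose mark
can change.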

Bill Dunlap
TIBCO Software
wdunlap tibco.com

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Thank you for the report and analysis. Now fixed in R-devel.
Tomas