Feature Request: Allow Underscore Separated Numbers

20 messages · Devin Marlin, avi.e.gross at gmail.com, Ben Bolker +7 more

Devin Marlin

Thu, Jul 14, 2022 12:53 PM #

Hello,

After using R for a number of years, and venturing into other languages,
I've noticed the ones with the ability to enter numbers separated by
underscores for readability (like 100000 as 100_000) make life a whole lot
easier, especially when debugging. Is this a feature that could be
implemented in R?

Regards,

*Devin Marlin*


avi.e.gross at gmail.com

Thu, Jul 14, 2022 5:21 PM #

Devin,

I cannot say anyone wants to tweak R after the fact to accept numeric items
with underscores as that might impact all kinds of places.

Can I suggest a workaround that allows you to enter your integer (or
floating point which gets truncated) using this:

underint <- function(text) as.integer(gsub("_+", "", text))

Use a call to that anywhere you want an int like:

result <- underint("1_000_000") + underint("6___6__6_6") - 6000

results in: 1000666

If you want to see the result with underscores, use something like
scales::comma.
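
A base-R sketch of that display step (an assumption: format() with big.mark stands in for scales::comma here, and show_underscored is a made-up name, not part of any package):

```r
# Hypothetical helper: render a number with "_" between groups of three digits.
show_underscored <- function(x) {
  format(x, big.mark = "_", scientific = FALSE, trim = TRUE)
}

show_underscored(1234567)
# [1] "1_234_567"
```

The result is a character string meant for display only; it would have to go back through something like underint() to be used as a number again.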

You can also make similar functions that use as.numeric() and as.double()
but note that this allows you to enter data at somewhat greater expense and
as text/strings. Obviously a similar technique can be used with regular
expressions of many kinds to wipe out or replace anything, including commas
with this:

undernumeric <- function(text) as.numeric(gsub("[,_]+", "", text))

undernumeric("123,456.789_012")
[1] 123456.8

Yes, it truncated it but I am sure any combo of underscores and commas will
be removed. It also truncates the same thing with all numerals and a period.



______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Ben Bolker

Thu, Jul 14, 2022 5:29 PM #

On 2022-07-14 8:21 p.m., avi.e.gross at gmail.com wrote:

It's not really 'truncated', it's just printed with limited 
precision.  (Sorry if I'm telling you something you already know ...)

options(digits = 22)
undernumeric("123,456.789_012")
[1] 123456.7890119999938179

(and there's floating point inaccuracy rearing its ugly head again; 
options(digits=16) works well for this example ...)
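
A small sketch of the same point that avoids changing the global option, using only base R. The stored value is the same in every line; only the number of digits requested for display differs.

```r
x <- as.numeric("123456.789012")

# Default printing uses 7 significant digits, so the value only looks shortened.
format(x, digits = 7)

# Asking for more digits shows that nothing was lost when the string was parsed.
format(x, digits = 16)

# The parsed value compares equal to the literal written out in full.
x == 123456.789012
```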

Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics

avi.e.gross at gmail.com

Thu, Jul 14, 2022 5:53 PM #

Yes, Ben, your point (way below) is correct. As I noted, as.numeric() also
truncated a normal notation so I did not worry about it as I could tweak the
system and both versions (underscores too) would now show more precision. 

I can think of oodles more ways to allow showing big numbers as readable
such as writing them in segments and concatenating them with paste0() as in:

assembleint <- function(...) as.integer(paste(..., sep=""))

[1] 12345678
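
The call that produced this output did not survive in the archived message. A call of the following shape gives the same result; the particular grouping of the arguments is a guess.

```r
assembleint <- function(...) as.integer(paste(..., sep = ""))

# Assumed grouping: the pieces are pasted together before conversion.
assembleint(12, 345, 678)
# [1] 12345678

# Groups with leading zeros have to be passed as strings,
# because the numeric literal 000 is just 0.
assembleint(1, "000", "000")
# [1] 1000000
```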

But it really at some point is not very readable. I was a bit annoyed at the
underscore method used in other languages I know as for me the comma is the
normal separator but commas are so deeply embedded for various uses in just
about any language, that they could not be allowed within a grouping of
digits. Few things can be but "_" maybe could as it is allowed in other
identifiers and in Python, even at the start in some places.

But now I realize that others have different methods. I recently saw someone
using a CSV file with numbers that use comma as a decimal delimiter and thus
they use semicolon to keep the fields apart.  But we have R functions that
easily handle importing from that as long as once inside, we deal with them
without seeing them again unless needed.

I am thinking of all the regular expressions that would break badly if
underscores in digits are allowed. All the [0-9] constructs might need to be
[0-9_] and \d might need to be redefined. The end of a number might be
undefined if it bumped up against something else with an underscore at the
edge.
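
A small illustration of that concern, using made-up input text. It only applies to code that runs regular expressions over text containing such numbers, not to values that have already been parsed.

```r
txt <- "total = 1_000_000 items"

# A digits-only pattern stops at the first underscore.
regmatches(txt, regexpr("[0-9]+", txt))
# [1] "1"

# Allowing "_" after the first digit captures the whole literal.
regmatches(txt, regexpr("[0-9][0-9_]*", txt))
# [1] "1_000_000"
```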

If we were re-inventing everything today, I suspect we might have started
with something like UNICODE with lots more symbols than ASCII or EBCDIC had
and that might include a globally defined comma-separator symbol that was
never used except with a number so it would be part of the definition of
what numeric digits are. But that is not going to happen.


GILLIBERT, Andre

Thu, Jul 14, 2022 11:31 PM #

On 2022-07-14 8:21 p.m., avi.e.gross at gmail.com wrote:

I am not sure that the feature request of Devin Marlin was correctly understood.
I guess that he thought about adding syntactic sugar to numeric literals in the language.
Functions such as as.numeric(), or read.csv() would not be changed.

The main difference would be to make valid code that currently is a "syntax error", such as:

> 3*100_000
Error: unexpected input in "3*100_"

Breaking code with that feature is possible but improbable.
Indeed, code that expects str2lang("3*100_000") to raise a syntax error (catching the error with try) would break.
Most code generating other code then parsing it with str2lang() should be fine, because it would generate old-style code with normal numeric constants.
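
A minimal sketch of the kind of check that would change behaviour. The helper name parses() is made up for illustration; str2lang() and tryCatch() are base R.

```r
# Hypothetical helper: TRUE if the string parses as a single R expression.
parses <- function(code) {
  tryCatch({
    str2lang(code)
    TRUE
  }, error = function(e) FALSE)
}

parses("3*100000")

# FALSE on R versions whose parser rejects underscores inside numeric
# literals; it would become TRUE if the proposed syntax were accepted.
parses("3*100_000")
```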

--
Sincerely
André GILLIBERT

Jan van der Laan

Fri, Jul 15, 2022 1:27 AM #

Another R-solution would be:

`%,%` <- function(a, b) a*1000 + b

which would allow one to write large numbers as

 > 100%,%123

Resulting in 100123.

Not sure if this really helps with readability.
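
For what it is worth, the operator chains, because user-defined %op% operators associate left to right. A short sketch with values chosen for illustration:

```r
`%,%` <- function(a, b) a * 1000 + b

# Evaluated as (1 %,% 234) %,% 567.
1 %,% 234 %,% 567
# [1] 1234567

# Each group is an ordinary number, so a group written as 005 is just 5.
1 %,% 005
# [1] 1005
```

Nothing checks that a group has at most three digits, so 1 %,% 2345 silently gives 3345.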


I actually think this could better be handled by the IDE one is working 
in. Most IDEs already do syntax highlighting, and when a suitable font 
is used text such as '!=' is displayed as '≠' (the unequal sign). So I guess 
they can also apply special formatting to large numbers such as grouping 
the numbers, underlining groups of three, using colour, ...


Jan


Duncan Murdoch

Fri, Jul 15, 2022 3:01 AM #

On 14/07/2022 3:53 p.m., Devin Marlin wrote:

I think this could be done, but I doubt if anyone who could make the 
change would think the arbitrary decisions and added complexity in the 
parser was worth the effort.  (Would it be a thousands separator, or 
could it separate any number of digits?  Could a number start with an 
underscore or end with one?)

Instead, I'd suggest creating variables with meaningful names to hold 
big numeric constants, e.g.

     AU_in_km <- 149598262  # or 149.598262e6

     Jupiter <- 5.2 * AU_in_km

rather than

     Jupiter <- 5.2 * 149_598_262

Duncan Murdoch

Gabor Grothendieck

Fri, Jul 15, 2022 6:12 AM #

It would be best to simply ignore any embedded _ in numbers rather than only
accept them at fixed locations since it isn't always
a thousands separator.  For example, 1 lakh is sometimes written as
1,00,000 rupees.
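
The string-based workaround from earlier in the thread already behaves this way, which gives a rough feel for the "ignore them wherever they appear" rule. This is only the helper function, not a parser change:

```r
underint <- function(text) as.integer(gsub("_+", "", text))

underint("1_00_000")  # lakh-style grouping
underint("100_000")   # thousands grouping
```

Both calls return the same integer, 100000.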


Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

avi.e.gross at gmail.com

Fri, Jul 15, 2022 8:25 AM #

André,

I am not saying a change cannot be done and am not familiar enough with the
internals of R. If you just want the interpreter to evaluate CONSTANTS in
the code as what you consider syntactic sugar and replace 1_000 with 1000,
that sounds superficially possible. But is it?

R normally delays evaluation so chunks of code are handed over untouched to
functions that often play with the text directly without evaluating it
until, perhaps, much later. And I have pointed out how much work is done
with things like regular expressions or reading things in from a file that
is not done in the REPL but in functions behind the scene. So if there is
any way for a number to slide in without being modified, or places where you
want the darn underscores preserved, you may well cause a glitch.

Languages that design in the ability have obviously dealt with issues and
presumably anyone writing code anew can use a new definition in their work
so they handle such numbers. I am not saying such a change cannot be done,
simply that existing languages are careful about making changes as they
strive to retain compatibility.

So even assuming your statement about not needing to change as.numeric or
read.csv functions is true, aren't you introducing a change in which the
users will inadvertently use the feature in strings or files and assume it
is a globally recognized feature? I use CSV files and other such formats
quite a bit as a way to exchange data between R and other environments and
unless they all change and allow underscores in numbers, there can be
issues. So, yes, you are suggesting nothing in R will write out numbers with
underscores. But if others do and you import the data into R with a reader
that does not understand, we have anomalies.

I am not arguing with anyone about this. Like many proposed features, it
sounds reasonable just by itself. But for a language that was crafted and
then modified many times, the burden is often on those wanting a change to
convince us that it can be done benignly, effectively and cheaply AND that
it is more worthwhile than a thousand other pending ideas already submitted.

I have never used str2lang() in my life directly so would changing that
really help if as.numeric() and other such functions were left alone and did
not call it? What if I read in a .CSV a line at a time and use various
methods including regular expressions to split the line into parts and then
make the parts into numbers based on some primitive algorithm that maps
digits 0-9 into small integers 0-9 and then positionally multiplies digits
to the left by 10 for each level and adds them up. Will that algorithm know
about underscores and not only ignore them but keep track of how many times
it multiplies the other parts by 10? Sure, we can write a new algorithm with
added complexity but in my view, we can solve the problem in the few cases
it matters without such a change.

Had this been built in originally, maybe not a problem. But consider the
enormous expense of UNICODE and the truly major upheaval needed to get it
working  at a time when lots of code using pointers had a reasonable
expectation that all characters took up the same number of bytes, and
calculating the length of a string could be done by simply subtracting one
pointer from another. Now, you actually have to read the entire string and
count code points, or keep the length as a part of the structure that is
changed any time it changes and so on.

But arguably UNICODE support is now required in many cases. So, yes,
underscores in numbers may become commonplace and cause headaches for a
while. But mathematically, I don't see them as needed and see many ways to
allow a programmer to see what a number is without any problems in the few
times they want it. Cut and paste in code can easily take out any snippet
accurately and pluck it into a function that displays it with commas or
whatever. But definitely, lazy humans constantly make mistakes and even with
this would still make some.

But if R developers seem confident this change can be done, go for it!
Numeric literals, like other constants, have often been something compiled
languages have optimized out of the way, such as combining multiple
instances of the same one into one memory location.

Avi

avi.e.gross at gmail.com

Fri, Jul 15, 2022 8:56 AM #

Jan,

Many ideas like yours are a solution that delivers something like syntactic sugar, albeit yours may be too specific. It would not be useful for say recording a social security number or other identifier that is normally clumped (albeit it is not always treated as a number) when someone wants to accurately record 123-45-6789 but must write 123456789 and possibly not chunk it right and get a wrong number. Multiplying by a thousand here won't get the right result.

But a low tech solution at any level is to use a polynomial format of sorts so 123456789 is written as:

123 * 10^6 + 456 * 10^3 + 789

Or some similar notation.
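
With each group multiplied by its own power of ten, the sum can be checked directly in R:

```r
# Each three-digit group is scaled by the power of ten for its position.
n <- 123 * 10^6 + 456 * 10^3 + 789

n == 123456789
# [1] TRUE
```

Note that the result is a double rather than an integer, since ^ returns a double.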

And, can you add a new number designator along the lines of 0x12ff meaning hexadecimal so writing some prefix like "00_" might mean what follows contains underscores you can ignore when putting it into a number format?

The main reason I guess for the change seems to be for following along what others have chosen.

Avi


Ivan Krylov

Fri, Jul 15, 2022 10:21 AM #

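On Fri, 15 Jul 2022 11:25:32 -0400
<avi.e.gross at gmail.com> wrote:

> R normally delays evaluation so chunks of code are handed over
> untouched to functions that often play with the text directly without
> evaluating it until, perhaps, much later.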

Do they play with the text, or with the syntax tree after it went
through the parser? While it's true that R saves the source text of the
functions for ease of debugging, it's not guaranteed that a given
object will have source references, and typical NSE functions operate
on language objects which are tree-like structures containing R values,
not source text.
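
A small sketch of that distinction with syntax that already exists today: two different spellings of the same number become the same value once parsed, so code working on language objects cannot tell them apart.

```r
a <- quote(1e6)
b <- quote(1000000)

# Both literals are stored as the same numeric value in the language object.
identical(a, b)
# [1] TRUE

# Deparsing works from the value, not from the characters that were typed.
deparse(b)
# [1] "1e+06"
```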

You are, of course, right that any changes to the syntax of the
language must be carefully considered, but if anyone wants to play with
this idea, it can be implemented in a very simple manner:

--- src/main/gram.y	(revision 82598)
+++ src/main/gram.y	(working copy)
@@ -2526,7 +2526,7 @@
     YYTEXT_PUSH(c, yyp);
     /* We don't care about other than ASCII digits */
     while (isdigit(c = xxgetc()) || c == '.' || c == 'e' || c == 'E'
-	   || c == 'x' || c == 'X' || c == 'L')
+	   || c == 'x' || c == 'X' || c == 'L' || c == '_')
     {
 	count++;
 	if (c == 'L') /* must be at the end.  Won't allow 1Le3 (at present). */
@@ -2533,6 +2533,9 @@
 	{   YYTEXT_PUSH(c, yyp);
 	    break;
 	}
+	if (c == '_') { /* allow an underscore anywhere inside the literal */
+	    continue;
+	}
 	
 	if (c == 'x' || c == 'X') {
 	    if (count > 2 || last != '0') break;  /* 0x must be first */

To an NSE function, the underscored literals are indistinguishable from
normal ones, because they don't see the literals:

stopifnot(all.equal(\() 1000000, \() 1_000_000))
f <- function(x, y) stopifnot(all.equal(substitute(x), substitute(y)))
f(1e6, 1_000_000)

Although it's true that the source references change as a result:

lapply(
 list(\() 1000000, \() 1_000_000),
 \(.) as.character(getSrcref(.))
)
# [[1]]
# [1] "\\() 1000000"
# 
# [[2]]
# [1] "\\() 1_000_000"

This patch is somewhat simplistic: it allows both multiple underscores
in succession and underscores at the end of the number literal. Perl
does so too, but with a warning:

perl -wE'say "true" if 1__000_ == 1000'
# Misplaced _ in number at -e line 1.
# Misplaced _ in number at -e line 1.
# true

Best regards,
Ivan

Jim Hester

Fri, Jul 15, 2022 10:58 AM #

Allowing underscores in numeric literals is becoming a very common
feature in computing languages. All of these languages (and more) now
support it:

python: https://peps.python.org/pep-0515/
javascript: https://v8.dev/features/numeric-separators
julia: https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/#Floating-Point-Numbers
java: https://docs.oracle.com/javase/7/docs/technotes/guides/language/underscores-literals.html#:~:text=In%20Java%20SE%207%20and,the%20readability%20of%20your%20code.
ruby: https://docs.ruby-lang.org/en/2.0.0/syntax/literals_rdoc.html#label-Numbers
perl: https://perldoc.perl.org/perldata#Scalar-value-constructors
rust: https://doc.rust-lang.org/rust-by-example/primitives/literals.html
C#: https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/floating-point-numeric-types#real-literals
go: https://go.dev/ref/spec#Integer_literals

Its use in this context also dates back to at least Ada 83
(http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#:~:text=A%20decimal%20literal%20is%20a,the%20base%20is%20implicitly%20ten).&text=An%20underline%20character%20inserted%20between,value%20of%20this%20numeric%20literal.)

Many other communities see the benefit of this feature; I think R's
community would benefit from it as well.


Duncan Murdoch

Fri, Jul 15, 2022 11:26 AM #

Thanks for posting that list.  The Python document is the only one I've 
read so far; it has a really nice summary 
(https://peps.python.org/pep-0515/#prior-art) of the differences in 
implementations among 10 languages.  Which choice would you recommend, 
and why?

  - I think Ivan's quick solution doesn't quite match any of them.
  - C, Fortran and C++ have special support in R, but none of them use 
underscore separators.
  - C++ does support separators, but uses "'", not "_", and some ancient 
forms of Fortran ignore embedded spaces.

Duncan Murdoch


Jim Hester

Fri, Jul 15, 2022 12:25 PM #

I think keeping it simple and less restrictive is the best approach,
for ease of implementation, limiting future maintenance, and so users
have the flexibility to format these however they wish. So I would
probably lean towards allowing multiple delimiters anywhere (including
trailing) or possibly just between digits.
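
To make the two options concrete, here is a rough sketch that expresses each rule as a regular expression over decimal integer literals. Both function names are made up, and reading "just between digits" as "one or more underscores, but only with digits on both sides" is an assumption.

```r
# Hypothetical validators, for comparison only (decimal integers, no exponent).
anywhere <- function(s) grepl("^[0-9][0-9_]*$", s)
between_digits <- function(s) grepl("^[0-9]+(_+[0-9]+)*$", s)

lits <- c("1_000_000", "1__000", "1000_", "1_00_000")
data.frame(
  literal = lits,
  anywhere = anywhere(lits),
  between_digits = between_digits(lits)
)
```

Under these assumptions the two rules differ only on the trailing underscore in "1000_".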

On Fri, Jul 15, 2022 at 2:26 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

Thanks for posting that list.  The Python document is the only one I've
read so far; it has a really nice summary
(https://peps.python.org/pep-0515/#prior-art) of the differences in
implementations among 10 languages.  Which choice would you recommend,
and why?

  - I think Ivan's quick solution doesn't quite match any of them.
  - C, Fortran and C++ have special support in R, but none of them use
underscore separators.
  - C++ does support separators, but uses "'", not "_", and some ancient
forms of Fortran ignore embedded spaces.

Duncan Murdoch

On 15/07/2022 1:58 p.m., Jim Hester wrote:

Allowing underscores in numeric literals is becoming a very common
feature in computing languages. All of these languages (and more) now
support it

python: https://peps.python.org/pep-0515/
javascript: https://v8.dev/features/numeric-separators
julia: https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/#Floating-Point-Numbers
java: https://docs.oracle.com/javase/7/docs/technotes/guides/language/underscores-literals.html#:~:text=In%20Java%20SE%207%20and,the%20readability%20of%20your%20code.
ruby: https://docs.ruby-lang.org/en/2.0.0/syntax/literals_rdoc.html#label-Numbers
perl: https://perldoc.perl.org/perldata#Scalar-value-constructors
rust: https://doc.rust-lang.org/rust-by-example/primitives/literals.html
C#: https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/floating-point-numeric-types#real-literals
go: https://go.dev/ref/spec#Integer_literals

Its use in this context also dates back to at least Ada 83
(http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#:~:text=A%20decimal%20literal%20is%20a,the%20base%20is%20implicitly%20ten).&text=An%20underline%20character%20inserted%20between,value%20of%20this%20numeric%20literal.)

Many other communities see the benefit of this feature, I think R's
community would benefit from it as well.

On Fri, Jul 15, 2022 at 1:22 PM Ivan Krylov <krylov.r00t at gmail.com> wrote:

On Fri, 15 Jul 2022 11:25:32 -0400
<avi.e.gross at gmail.com> wrote:

R normally delays evaluation so chunks of code are handed over
untouched to functions that often play with the text directly without
evaluating it until, perhaps, much later.

Do they play with the text, or with the syntax tree after it went
through the parser? While it's true that R saves the source text of the
functions for ease of debugging, it's not guaranteed that a given
object will have source references, and typical NSE functions operate
on language objects which are tree-like structures containing R values,
not source text.

You are, of course, right that any changes to the syntax of the
language must be carefully considered, but if anyone wants to play with
this idea, it can be implemented in a very simple manner:

--- src/main/gram.y     (revision 82598)
+++ src/main/gram.y     (working copy)
@@ -2526,7 +2526,7 @@
      YYTEXT_PUSH(c, yyp);
      /* We don't care about other than ASCII digits */
      while (isdigit(c = xxgetc()) || c == '.' || c == 'e' || c == 'E'
-          || c == 'x' || c == 'X' || c == 'L')
+          || c == 'x' || c == 'X' || c == 'L' || c == '_')
      {
         count++;
         if (c == 'L') /* must be at the end.  Won't allow 1Le3 (at present). */
@@ -2533,6 +2533,9 @@
         {   YYTEXT_PUSH(c, yyp);
             break;
         }
+       if (c == '_') { /* allow an underscore anywhere inside the literal */
+           continue;
+       }

         if (c == 'x' || c == 'X') {
             if (count > 2 || last != '0') break;  /* 0x must be first */

To an NSE function, the underscored literals are indistinguishable from
normal ones, because they don't see the literals:

stopifnot(all.equal(\() 1000000, \() 1_000_000))
f <- function(x, y) stopifnot(all.equal(substitute(x), substitute(y)))
f(1e6, 1_000_000)

Although it's true that the source references change as a result:

lapply(
  list(\() 1000000, \() 1_000_000),
  \(.) as.character(getSrcref(.))
)
# [[1]]
# [1] "\\() 1000000"
#
# [[2]]
# [1] "\\() 1_000_000"

This patch is somewhat simplistic: it allows both multiple underscores
in succession and underscores at the end of the number literal. Perl
does so too, but with a warning:

perl -wE'say "true" if 1__000_ == 1000'
# Misplaced _ in number at -e line 1.
# Misplaced _ in number at -e line 1.
# true

--
Best regards,
Ivan

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Bill Dunlap

Fri, Jul 15, 2022 12:34 PM #

The token '._1' (period underscore digit) is currently parsed as a symbol
(name).  It would become a number if underscore were ignored as in the
first proposal.  The just-between-digits alternative would avoid this
change.
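
The difference between the two rules can be sketched outside R. The following is a minimal Python model, not a port of R's lexer: the helper names and the simplified notion of "looks numeric" are assumptions made for illustration only.

```python
import re

# Hypothetical helpers; this models the two rules abstractly and does not
# reproduce the logic in src/main/gram.y.
PLAIN_NUMBER = re.compile(r"(?:\d+\.?\d*|\.\d+)")

def ignore_everywhere(token: str) -> str:
    """First proposal: drop every underscore in the token."""
    return token.replace("_", "")

def between_digits_only(token: str) -> str:
    """Alternative: drop an underscore only when a digit sits on both sides."""
    return re.sub(r"(?<=\d)_(?=\d)", "", token)

def looks_numeric(token: str) -> bool:
    """Simplified check: digits with an optional decimal point."""
    return PLAIN_NUMBER.fullmatch(token) is not None

for rule in (ignore_everywhere, between_digits_only):
    for token in ("1_000", "._1", "1__000"):
        print(rule.__name__, repr(token), looks_numeric(rule(token)))
```

Under the first rule '._1' collapses to '.1' and reads as a number; under the second it is left alone and stays a name.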

-Bill

On Fri, Jul 15, 2022 at 12:26 PM Jim Hester <james.f.hester at gmail.com>
wrote:

I think keeping it simple and less restrictive is the best approach,
for ease of implementation, limiting future maintenance, and so users
have the flexibility to format these however they wish. So I would
probably lean towards allowing multiple delimiters anywhere (including
trailing) or possibly just between digits.

On Fri, Jul 15, 2022 at 2:26 PM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:

Thanks for posting that list.  The Python document is the only one I've
read so far; it has a really nice summary
(https://peps.python.org/pep-0515/#prior-art) of the differences in
implementations among 10 languages.  Which choice would you recommend,
and why?

  - I think Ivan's quick solution doesn't quite match any of them.
  - C, Fortran and C++ have special support in R, but none of them use
underscore separators.
  - C++ does support separators, but uses "'", not "_", and some ancient
forms of Fortran ignore embedded spaces.

Duncan Murdoch

On 15/07/2022 1:58 p.m., Jim Hester wrote:

Allowing underscores in numeric literals is becoming a very common
feature in computing languages. All of these languages (and more) now
support it

python: https://peps.python.org/pep-0515/
javascript: https://v8.dev/features/numeric-separators
julia: https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/#Floating-Point-Numbers
java: https://docs.oracle.com/javase/7/docs/technotes/guides/language/underscores-literals.html#:~:text=In%20Java%20SE%207%20and,the%20readability%20of%20your%20code.
ruby: https://docs.ruby-lang.org/en/2.0.0/syntax/literals_rdoc.html#label-Numbers
perl: https://perldoc.perl.org/perldata#Scalar-value-constructors
rust: https://doc.rust-lang.org/rust-by-example/primitives/literals.html
C#: https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/floating-point-numeric-types#real-literals
go: https://go.dev/ref/spec#Integer_literals

Its use in this context also dates back to at least Ada 83
(http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#:~:text=A%20decimal%20literal%20is%20a,the%20base%20is%20implicitly%20ten).&text=An%20underline%20character%20inserted%20between,value%20of%20this%20numeric%20literal.)

Many other communities see the benefit of this feature; I think R's
community would benefit from it as well.


avi.e.gross at gmail.com

Fri, Jul 15, 2022 2:05 PM #

Yes, Ivan, obviously someone can try out a change and check whether it
causes problems.

And although I would think most delayed evaluation either is never invoked
or works as you describe, on the parsed tree using internal functions, I
suspect some code does not work that way.

For example, I can write code in a mini-language of my own that I will
then analyze. What stops me from leaving the quotes off a regular
expression, given that I am going to read the text exactly as is and
manipulate it, as long as the RE contains nothing (such as a comma) that
keeps it from being accepted as an argument to a function? Inside I may
have something like a pattern to match a file name starting with anything,
then an underscore, then a digit or two, and finally a file suffix. I would
not want anything to parse that and remove the underscore that is part of
a filename. The argument is meant to be atomic.

Many things in the tidyverse do variations on delayed evaluation, and some
seem to work a piece at a time. An example is how mutate() allows multiple
clauses of the form new=f(old), where later clauses use columns created by
earlier ones; those columns did not exist before and can only be used if
the preceding part went well. I may be wrong about how it is done, but it
strikes me as possible that they read the raw text until they match an end
of some kind, like a top-level comma or a top-level close-parenthesis. My
GUESS is that only then might they evaluate that chunk, after substitutions
or other ploys to use some namespace. Will a column name like evil__666__
survive?

Again, I am not AGAINST any proposal, but the people who would pay the
price, by arranging or paying for development, documentation, testing and
so on, are the ones who need to be convinced. My point is that in some ways
R is a different kind of programming language than, say, python. I
experimented briefly in python and note that its implementation of this
feature is fairly robust. For instance, casting a string to an int works as
expected:
a = int("1" "_" "122") returns 1122.

Be warned though that the current python implementation generates an error
if you have two or more underscores in a row as in:

a=1__1
SyntaxError: invalid decimal literal
a=1___1
SyntaxError: invalid decimal literal

It does not tolerate one or more underscores at the end either, giving the
same error. It really gets mad at an initial underscore like _1, where it
asks if you mean "_". A single underscore is a valid variable name, as are
multiple consecutive underscores, and "_" is often used as an I DON'T CARE
in code like this (any variable could be used, as the last assignment
determines the value it keeps):

(_,_,a) = (1,2,3)
_
2
a
3

(In the above, you are seeing commands and output alternating, if not
clear.)
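
The string-conversion side of this behavior can be checked with a short script. This is a minimal sketch using only the built-in int(); the helper name try_int is made up for illustration. Note that converting a string raises ValueError, whereas the same text typed as a literal in source code is rejected earlier, as a SyntaxError (or, for _1, treated as a variable name).

```python
def try_int(text: str):
    """Return int(text), or the string 'ValueError' if Python rejects it."""
    try:
        return int(text)
    except ValueError:
        return "ValueError"

# One accepted form, then doubled, trailing and leading underscores.
samples = ["1_122", "1__1", "1_", "_1"]
results = {text: try_int(text) for text in samples}
print(results)
```

Only the first sample converts; the other three are rejected.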

As it happens, half of python variables contain runs of underscores, to
the point where members like __name__ and __init__ are read aloud as
"dunder name" and "dunder init", dunder being short for double underscore.
Note also that python is not that much younger than R/S, and this feature
was added fairly late, in version 3.6, about 5 years ago and long after
version 3.0 made many programs written for version 2.x incompatible.

My point is not python itself: someone may want to see how the
underscore-in-a-number feature is actually implemented in any of the
languages that now allow it, and then carefully document exactly in what
circumstances it is allowed in R and where, if anywhere, R differs from
those other languages.

If it can be done with a very few localized changes, great. My objection
that regular expressions become more complex when they need to handle
underscores is likely not a major obstacle, as python supports those too.

Luckily, my opinion is just my own, as I have no direct stake in the outcome.
I personally handle large numbers fine.

Avi





Ivan Krylov

Sat, Jul 16, 2022 2:24 AM #

On Fri, 15 Jul 2022 12:34:24 -0700

Bill Dunlap <williamwdunlap at gmail.com> wrote:

Thanks for spotting this! Here's a patch that allows underscores
only between digits and only inside the significand of a number:

--- src/main/gram.y	(revision 82598)
+++ src/main/gram.y	(working copy)
@@ -2526,7 +2526,7 @@
     YYTEXT_PUSH(c, yyp);
     /* We don't care about other than ASCII digits */
     while (isdigit(c = xxgetc()) || c == '.' || c == 'e' || c == 'E'
-	   || c == 'x' || c == 'X' || c == 'L')
+	   || c == 'x' || c == 'X' || c == 'L' || c == '_')
     {
 	count++;
 	if (c == 'L') /* must be at the end.  Won't allow 1Le3 (at present). */
@@ -2538,11 +2538,16 @@
 	    if (count > 2 || last != '0') break;  /* 0x must be first */
 	    YYTEXT_PUSH(c, yyp);
 	    while(isdigit(c = xxgetc()) || ('a' <= c && c <= 'f') ||
-		  ('A' <= c && c <= 'F') || c == '.') {
+		  ('A' <= c && c <= 'F') || c == '.' || c == '_') {
 		if (c == '.') {
 		    if (seendot) return ERROR;
 		    seendot = 1;
 		}
+		if (c == '_') {
+		    /* disallow underscores following 0x or followed by non-digit */
+		    if (nd == 0 || typeofnext() >= 2) break;
+		    continue;
+		}
 		YYTEXT_PUSH(c, yyp);
 		nd++;
 	    }
@@ -2588,6 +2593,11 @@
 		break;
 	    seendot = 1;
 	}
+	/* underscores in significand followed by a digit must be skipped */
+	if (c == '_') {
+	    if (seenexp || typeofnext() >= 2) break;
+	    continue;
+	}
 	YYTEXT_PUSH(c, yyp);
 	last = c;
     }

Best regards,
Ivan

Duncan Murdoch

Sat, Jul 16, 2022 8:17 AM #

On 16/07/2022 5:24 a.m., Ivan Krylov wrote:

I think there's an issue with hex values.  For example:

 > 0xa_2
[1] 162
 > 0x2_a
Error: unexpected input in "0x2_"

So "a" counts as a digit in 0xa_2, but not as a digit in 0x2_a.
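
The asymmetry can be reproduced with a small model of the lookahead check. This is a Python sketch under stated assumptions, not the C code from the patch: scan_hex, two_way and three_way are hypothetical names, the input is assumed to hold only hex digits and underscores, and a rejected underscore is reported as None rather than as a parse error.

```python
DEC = "0123456789"
HEX_LETTERS = "abcdefABCDEF"

def two_way(ch: str) -> int:
    """Lookahead classes as in the patch above: 1 = decimal digit, 2 = other."""
    return 1 if len(ch) == 1 and ch in DEC else 2

def three_way(ch: str) -> int:
    """One possible refinement: 1 = decimal digit, 2 = hex letter, 3 = other."""
    if len(ch) == 1 and ch in DEC:
        return 1
    if len(ch) == 1 and ch in HEX_LETTERS:
        return 2
    return 3

def scan_hex(body: str, classify, reject_from: int):
    """Scan the part of a hex literal that follows '0x'.

    Returns the digits with separators dropped, or None when an underscore
    comes first or is followed by a character of class >= reject_from.
    """
    digits = []
    for i, ch in enumerate(body):
        if ch == "_":
            nxt = body[i + 1] if i + 1 < len(body) else ""
            if not digits or classify(nxt) >= reject_from:
                return None
            continue
        digits.append(ch)
    return "".join(digits)

print(scan_hex("a_2", two_way, 2))    # separator followed by a decimal digit
print(scan_hex("2_a", two_way, 2))    # separator followed by a hex letter
print(scan_hex("2_a", three_way, 3))  # same input with the finer classes
```

With only two classes, the letter after the underscore in 2_a falls into "other" and the scan stops there; with hex letters given their own class, both inputs are accepted.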

Duncan Murdoch

Duncan Murdoch

Sat, Jul 16, 2022 3:19 PM #

So far I would say we've had some good contributions on this thread. 
Ivan's suggested patches show that the change isn't completely trivial, 
but is doable.

However, we haven't had any input from an R Core member, so I consider 
the proposal to be essentially dead.

If an R Core member decides to resurrect it, here's what I'd suggest is 
still needed:

  - a formal definition of where the separator may occur, a
justification for that choice, and a comparison to other languages.

  - patches to documentation for the changes.  These include the manuals 
and the ?NumericConstants help topic, and probably others.

  - tests to add to "R CMD check" for packages to see if the new syntax 
is being used without specifying that "R >= 4.3.0" is a requirement.
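
As an illustration of what the first item might look like, here is one possible rule written as a regular expression and exercised in Python. It is a sketch only: it assumes separators are allowed solely between digits of the significand, never doubled and never in the exponent, and it covers less than R's real grammar (no hex floats with a p exponent, no imaginary suffix). The names DEC_DIGITS, HEX_DIGITS and NUMERIC_LITERAL are invented here.

```python
import re

# A run of digits with single underscores allowed only between digits.
DEC_DIGITS = r"[0-9]+(?:_[0-9]+)*"
HEX_DIGITS = r"[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*"

# Either a hex integer, or a decimal significand with an optional
# underscore-free exponent; both may carry the integer suffix L.
NUMERIC_LITERAL = re.compile(
    rf"(?:0[xX]{HEX_DIGITS}"
    rf"|(?:{DEC_DIGITS}(?:\.(?:{DEC_DIGITS})?)?|\.{DEC_DIGITS})(?:[eE][+-]?[0-9]+)?"
    r")L?"
)

def is_literal(text: str) -> bool:
    """True when the whole string matches the sketched rule."""
    return NUMERIC_LITERAL.fullmatch(text) is not None

accepted = ["100_000", "1_000.5", "1e6", "0x2_a", "0xa_2", "1_000L", ".5"]
rejected = ["1__000", "1000_", "_1", "1e1_0", "0x_ff", "1_.5", "1._5"]
for text in accepted + rejected:
    print(repr(text), is_literal(text))
```

Writing the rule down this way makes the comparison to other languages mechanical: each language's choice is a different pattern run over the same list of test strings.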

Duncan Murdoch

Ivan Krylov

Sat, Jul 16, 2022 11:57 PM #

On Sat, 16 Jul 2022 11:17:17 -0400

Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

You're right, thanks! Should have checked for hex-digits when in
hex-literal mode.

One last try because I don't want to leave a bug I had introduced
myself, but as you say in your other message, this is both incomplete
without tests and documentation and effectively dead unless an R Core
member picks the whole proposal up:

--- src/main/gram.y	(revision 82598)
+++ src/main/gram.y	(working copy)
@@ -2091,7 +2091,9 @@
     int k, c;
 
     c = xxgetc();
-    if (isdigit(c)) k = 1; else k = 2;
+    if (isdigit(c)) k = 1;
+    else if (('a' <= c && c <= 'f') || ('A' <= c && c <= 'F')) k = 2;
+    else k = 3;
     xxungetc(c);
     return k;
 }
@@ -2526,7 +2528,7 @@
     YYTEXT_PUSH(c, yyp);
     /* We don't care about other than ASCII digits */
     while (isdigit(c = xxgetc()) || c == '.' || c == 'e' || c == 'E'
-	   || c == 'x' || c == 'X' || c == 'L')
+	   || c == 'x' || c == 'X' || c == 'L' || c == '_')
     {
 	count++;
 	if (c == 'L') /* must be at the end.  Won't allow 1Le3 (at present). */
@@ -2538,11 +2540,16 @@
 	    if (count > 2 || last != '0') break;  /* 0x must be first */
 	    YYTEXT_PUSH(c, yyp);
 	    while(isdigit(c = xxgetc()) || ('a' <= c && c <= 'f') ||
-		  ('A' <= c && c <= 'F') || c == '.') {
+		  ('A' <= c && c <= 'F') || c == '.' || c == '_') {
 		if (c == '.') {
 		    if (seendot) return ERROR;
 		    seendot = 1;
 		}
+		if (c == '_') {
+		    /* disallow underscores following 0x or followed by non-hexdigit */
+		    if (nd == 0 || typeofnext() >= 3) break;
+		    continue;
+		}
 		YYTEXT_PUSH(c, yyp);
 		nd++;
 	    }
@@ -2588,6 +2595,11 @@
 		break;
 	    seendot = 1;
 	}
+	/* underscores in significand followed by a digit must be skipped */
+	if (c == '_') {
+	    if (seenexp || typeofnext() >= 2) break;
+	    continue;
+	}
 	YYTEXT_PUSH(c, yyp);
 	last = c;
     }

I won't be sending any more unsolicited patches for this proposal.

Best regards,
Ivan