the pipe |> and line breaks in pipelines

Timothy Goodman

Wed, Dec 9, 2020 1:56 PM

I'm thrilled to hear it!  Thank you!

- Tim

P.S. I re-added the r-devel list, since Kevin's reply was sent just to me,
but I thought there might be others interested in knowing about those work
items.  (I hope that's OK, email-etiquette-wise.)

On Wed, Dec 9, 2020 at 1:10 PM Kevin Ushey <kevinushey at gmail.com> wrote:

You might be surprised to learn that the RStudio IDE engineers might
be receptive to such a feature request. :-)

https://github.com/rstudio/rstudio/issues/8589
https://github.com/rstudio/rstudio/issues/8590

(Spoiler alert: I am one of the RStudio IDE engineers, and I think
this would be worth doing.)

Best,
Kevin

On Wed, Dec 9, 2020 at 12:16 PM Timothy Goodman <timsgoodman at gmail.com>
wrote:

Since my larger concern is being able to conveniently select and re-run
part of a multiline pipeline, I don't think wrapping in parentheses will
help.  I'd have to add a closing paren at the end of the selection, which
is no more convenient than having to highlight all but the last pipe.
(Admittedly, wrapping in parens would allow my preferred syntax of having
pipes at the start of the line, but I don't think that's worth the cost of
having to constantly move the trailing paren around.)

My back-up plan if I fail to persuade you all is indeed to beg the
developers of RStudio to add an option to do the transformation I would
want when executing notebook code, but I'm anticipating the objection of "R
Notebooks shouldn't transform invalid R code into valid R code."  I was
hoping "Let's make this new pipe |> work differently in a case that's
currently an error" would be an easier sell.

Also, just to reiterate: Only one of my two suggestions really requires
caring about newlines.  (That's my preferred solution, but I understand
it'd be the bigger change.)  The other suggestion just amounts to ignoring
a final |> when code is submitted for execution.

 -Tim

On Wed, Dec 9, 2020 at 11:58 AM Kevin Ushey <kevinushey at gmail.com>
wrote:

I agree with Duncan that the right solution is to wrap the pipe
expression with parentheses. Having the parser treat newlines
differently based on whether the session is interactive, or on what
type of operator happens to follow a newline, feels like a pretty big
can of worms.

I think this (or something similar) would accomplish what you want
while still retaining the nice aesthetics of the pipe expression, with
a minimal amount of syntax "noise":

result <- (
  data
    |> op1()
    |> op2()
)

For interactive sessions where you wanted to execute only parts of the
pipeline at a time, I could see that being accomplished by the editor
-- it could transform the expression so that it could be handled by R,
either by hoisting the pipe operator(s) up a line, or by wrapping the
to-be-executed expression in parentheses for you. If such a style of
coding became popular enough, I'm sure the developers of such editors
would be interested and willing to support this ...

Perhaps more importantly, it would be much easier to accomplish than a
change to the behavior of the R parser, and it would be work that
wouldn't have to be maintained by the R Core team.
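
A minimal sketch of the editor-side wrapping described above, assuming R 4.1.0 or later (where `|>` is available). The helper name `wrap_selection` is hypothetical and is not part of R or RStudio; it only shows that wrapping a selection in parentheses lets lines begin with a pipe.

```r
# Hypothetical editor-side helper (not part of R or RStudio): wrap the
# selected text in parentheses so a leading |> continues the previous line.
wrap_selection <- function(selection) {
  paste0("(\n", selection, "\n)")
}

# A selection written with pipes at the start of each line.
selection <- "c(3, 1, 2)\n  |> sort()\n  |> rev()"

# Evaluate the wrapped text, much as an editor might send it to the session.
result <- eval(parse(text = wrap_selection(selection)))
print(result)
```

Because the wrapping happens on the text sent for execution, the selection itself never needs a closing parenthesis.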

Best,
Kevin

On Wed, Dec 9, 2020 at 11:34 AM Timothy Goodman <timsgoodman at gmail.com>
wrote:

If I type my_data_frame_1 and press Enter (or Ctrl+Enter to execute the
command in the Notebook environment I'm using) I certainly *would* expect R
to treat it as a complete statement.

But what I'm talking about is a different case, where I highlight a
multi-line statement in my notebook:

    my_data_frame_1
        |> filter(some_conditions_1)

and then press Ctrl+Enter.  Or, I suppose the equivalent would be to run an
R script containing those two lines of code, or to run a multi-line
statement like that from the console (which in RStudio I can do by pressing
Shift+Enter between the lines.)

In those cases, R could either (1) Give an error message [the current
behavior], or (2) understand that the first line is meant to be piped to
the second.  The second option would be significantly more useful, and is
almost certainly what the user intended.
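
The current behavior in case (1) can be checked by parsing the text without evaluating it. This is a small sketch assuming R 4.1.0 or later; `try_parse` is a helper made up for this illustration, and base functions stand in for the dplyr verbs.

```r
# Helper for this illustration only: report whether a string parses as R code.
try_parse <- function(code) {
  tryCatch(
    {
      parse(text = code)
      "parsed"
    },
    error = function(e) "parse error"
  )
}

# Pipe at the start of the second line: the first line is already a
# complete expression, so the pipe has nothing to attach to.
leading <- try_parse("c(3, 1, 2)\n  |> sort()")

# Pipe at the end of the first line: the parser keeps reading.
trailing <- try_parse("c(3, 1, 2) |>\n  sort()")

print(c(leading = leading, trailing = trailing))
```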

(For what it's worth, there are some languages, such as JavaScript, that
consider the first token of the next line when determining if the previous
line was complete.  JavaScript's rules around this are overly complicated,
but a rule like "a pipe following a line break is treated as continuing the
previous line" would be much simpler.  And while it might be objectionable
to treat the operator %>% differently from other operators, the addition of
|>, which isn't truly an operator at all, seems like the right time to
consider it.)
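
The point that |> "isn't truly an operator" can be seen by inspecting what the parser produces. This sketch assumes R 4.1.0 or later; `x`, `f` and `y` are placeholder names that are quoted, never evaluated, so neither magrittr nor any definition of `f` is needed.

```r
# The native pipe is rewritten by the parser into an ordinary call.
piped <- quote(x |> f(y))
print(piped)
same_as_call <- identical(piped, quote(f(x, y)))
print(same_as_call)

# By contrast, %>% survives parsing as a call to a function named `%>%`.
magrittr_style <- quote(x %>% f(y))
operator_name <- as.character(magrittr_style[[1]])
print(operator_name)
```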

-Tim

On Wed, Dec 9, 2020 at 3:12 AM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:

The requirement for operators at the end of the line comes from the
interactive nature of R.  If you type

     my_data_frame_1

how could R know that you are not done, and are planning to type the
rest of the expression

       %>% filter(some_conditions_1)
       ...

before it should consider the expression complete?  The way languages
like C do this is by requiring a statement terminator at the end.  You
can also do it by wrapping the entire thing in parentheses ().

However, be careful: Don't use braces:  they don't work.  And parens
have the side effect of removing invisibility from the result (which is
a design flaw or bonus, depending on your point of view).  So I actually
wouldn't advise this workaround.
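
Both caveats can be illustrated in base R. The sketch below uses `+` instead of a pipe, as an assumption made for illustration: `+` can also be a unary operator, so the braced version runs and shows the newline rule, whereas a line starting with a pipe inside braces would simply fail to parse.

```r
# Inside braces a newline still ends the expression, so this is two
# statements, `1` and `+ 2`, and the block returns the last one.
braced <- eval(parse(text = "{\n  1\n  + 2\n}"))

# Inside parentheses the newline is ignored, so this is `1 + 2`.
parenthesized <- eval(parse(text = "(\n  1\n  + 2\n)"))

print(c(braced = braced, parenthesized = parenthesized))

# Parentheses also make an invisible result visible again.
hidden <- withVisible(invisible(5))$visible
shown <- withVisible((invisible(5)))$visible
print(c(without_parens = hidden, with_parens = shown))
```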

Duncan Murdoch


On 09/12/2020 12:45 a.m., Timothy Goodman wrote:

Hi,

I'm a data scientist who routinely uses R in my day-to-day work, for tasks
such as cleaning and transforming data, exploratory data analysis, etc.
This includes frequent use of the pipe operator from the magrittr and dplyr
libraries, %>%.  So, I was pleased to hear about the recent work on a
native pipe operator, |>.

This seems like a good time to bring up the main pain point I encounter
when using pipes in R, and some suggestions on what could be done about
it.  The issue is that the pipe operator can't be placed at the start of a
line of code (except in parentheses).  That's no different than any binary
operator in R, but I find it's a source of difficulty for the pipe because
of how pipes are often used.

[I'm assuming here that my usage is fairly typical of a lot of users; at
any rate, I don't think I'm *too* unusual.]

=== Why this is a problem ===

It's very common (for me, and I suspect for many users of dplyr) to write
multi-step pipelines and put each step on its own line for readability.
Something like this:

   ### Example 1 ###
   my_data_frame_1 %>%
     filter(some_conditions_1) %>%
     inner_join(my_data_frame_2, by = some_columns_1) %>%
     group_by(some_columns_2) %>%
     summarize(some_aggregate_functions_1) %>%
     filter(some_conditions_2) %>%
     left_join(my_data_frame_3, by = some_columns_3) %>%
     group_by(some_columns_4) %>%
     summarize(some_aggregate_functions_2) %>%
     arrange(some_columns_5)

[I guess some might consider this an overly long pipeline; for me it's
pretty typical.  I *could* split it up by assigning intermediate results to
variables, but much of the value I get from the pipe is that it lets my
code communicate which results are temporary, and which will be used again
later.  Assigning variables for single-use results would remove that
expressiveness.]

I would prefer (for reasons I'll explain) to be able to write the above
example like this, which isn't valid R:

   ### Example 2 (not valid R) ###
   my_data_frame_1
     %>% filter(some_conditions_1)
     %>% inner_join(my_data_frame_2, by = some_columns_1)
     %>% group_by(some_columns_2)
     %>% summarize(some_aggregate_functions_1)
     %>% filter(some_conditions_2)
     %>% left_join(my_data_frame_3, by = some_columns_3)
     %>% group_by(some_columns_4)
     %>% summarize(some_aggregate_functions_2)
     %>% arrange(some_columns_5)

One (minor) advantage is obvious: It lets you easily line up the pipes,
which means that you can see at a glance that the whole block is a single
pipeline, and you'd immediately notice if you inadvertently omitted a pipe,
which otherwise can lead to confusing output.  [It's also aesthetically
pleasing, especially when %>% is replaced with |>, but that's subjective.]

But the bigger issue happens when I want to re-run just *part* of the
pipeline.  I do this often when debugging: if the output of the pipeline
seems wrong, I re-run the first few steps and check the output, then
include a little more and re-run again, etc., until I locate my mistake.
Working in an interactive notebook environment, this involves using the
cursor to select just the part of the code I want to re-run.

It's fast and easy to select *entire* lines of code, but unfortunately with
the pipes placed at the end of the line I must instead select everything
*except* the last three characters of the line (the last two characters for
the new pipe).  Then when I want to re-run the same partial pipeline with
the next line of code included, I can't just press SHIFT+Down to select it
as I otherwise would, but instead must move the cursor horizontally to a
position three characters before the end of *that* line (which is generally
different due to varying line lengths).  And so forth each time I want to
include an additional line.

Moreover, with the staggered positions of the pipes at the end of each
line, it's very easy to accidentally select the final pipe on a line, and
then sit there for a moment wondering if the environment has stopped
responding before realizing it's just waiting for further input (i.e., for
the right-hand side).  These small delays and disruptions add up over the
course of a day.

This desire to select and re-run the first part of a pipeline is also the
reason why it doesn't suffice to achieve syntax like my "Example 2" by
wrapping the entire pipeline in parentheses.  That's of no use if I want to
re-run a selection that doesn't include the final close-paren.

=== Possible Solutions ===

I can think of two, but maybe there are others.  The first would make
"Example 2" into valid code, and the second would allow you to run a
selection that included a trailing pipe.

   Solution 1: Add a special case to how R is parsed, so if the first
(non-whitespace) token after an end-line is a pipe, that pipe gets moved to
before the end-line.
     - Argument for: This lets you write code like example 2, which
addresses the pain point around re-running part of a pipeline, and has
advantages for readability.  Also, since starting a line with a pipe
operator is currently invalid, the change wouldn't break any working code.
     - Argument against: It would make the behavior of %>% inconsistent with
that of other binary operators in R.  (However, this objection might not
apply to the new pipe, |>, which I understand is being implemented as a
syntax transformation rather than a binary operator.)

   Solution 2: Ignore the pipe operator if it occurs as the final token of
the code being executed.
     - Argument for: This would mean the user could select and re-run the
first few lines of a longer pipeline (selecting *entire* lines), avoiding
the difficulties described above.
     - Argument against: This means that %>% would be valid even if it
occurred without a right-hand side, which is inconsistent with other
operators in R.  (But, as above, this objection might not apply to |>.)
Also, this solution still doesn't enable the syntax of "Example 2", with
its readability benefit.
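
Both solutions can be approximated as text transformations applied before code is parsed. The sketch below is only that: an approximation, not the parser change proposed above. The function names `hoist_leading_pipes` and `drop_trailing_pipe` are hypothetical, the matching is purely textual (it does not account for pipes inside strings or comments), and R 4.1.0 or later is assumed for the final evaluation.

```r
# Solution 1, approximated: move a pipe that starts a line to the end of
# the previous line.
hoist_leading_pipes <- function(code) {
  gsub("[ \t]*\n[ \t]*(\\|>|%>%)[ \t]*", " \\1\n  ", code)
}

# Solution 2, approximated: drop a pipe that is the final token of the code.
drop_trailing_pipe <- function(code) {
  trimmed <- trimws(code, which = "right")
  for (pipe in c("|>", "%>%")) {
    if (endsWith(trimmed, pipe)) {
      kept <- substr(trimmed, 1, nchar(trimmed) - nchar(pipe))
      return(trimws(kept, which = "right"))
    }
  }
  code  # no trailing pipe: leave the code untouched
}

hoisted <- hoist_leading_pipes("c(3, 1, 2)\n  |> sort()\n  |> rev()")
cat(hoisted, "\n")

# Whole-line selection of the first two lines of a trailing-pipe pipeline.
partial <- drop_trailing_pipe("c(3, 1, 2) |>\n  sort() |>\n")
cat(partial, "\n")

print(eval(parse(text = hoisted)))
```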

Thanks for reading this and considering it.

- Tim Goodman

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
