
write.csv problems

8 messages · Ivan Krylov, Rui Barradas, Spencer Graves +2 more

#
Hello, All:


	  I'm getting strange errors with write.csv with some objects of class 
c('findFn', 'data.frame'). Consider the following:


df1 <- data.frame(x=1)
class(df1) <- c('findFn', 'data.frame')
write.csv(df1, 'df1.csv')
# Error in x$Package : $ operator is invalid for atomic vectors

df2 <- data.frame(a=letters[1:2],
       b=as.POSIXct('2024-06-28'))
class(df2) <- c('findFn', 'data.frame')
write.csv(df2, 'df1.csv')
# Error in tapply(rep(1, nrow(x)), xP, length) :
#  arguments must have same length


	  "write.csv" works with some objects of class c('findFn', 
'data.frame') but not others. I have a 'findFn' object with 5264 rows 
that fails with the following error:


Error in `[<-.data.frame`(`*tmp*`, needconv, value = list(Count = 
c("83",  :
   replacement element 1 has 526 rows, need 5264


	  I have NOT yet been able to reproduce this error with a smaller 
example. However, starting 'write.csv' with something like the following 
should fix all these problems:


if(is.data.frame(x)) class(x) <- 'data.frame'
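A hedged sketch of that workaround as a small wrapper (write_csv_plain is a hypothetical name, not part of any package):

```r
# Hypothetical helper: strip any data.frame subclass before writing, so
# S3 methods such as `[.findFn` never get a chance to dispatch.
write_csv_plain <- function(x, file, ...) {
  if (is.data.frame(x)) class(x) <- "data.frame"
  utils::write.csv(x, file, ...)
}

df1 <- data.frame(x = 1)
class(df1) <- c("findFn", "data.frame")
f <- tempfile(fileext = ".csv")
write_csv_plain(df1, f)   # the subclass is dropped before write.csv runs
read.csv(f)
```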


	  Comments?
	  Thanks for all your work to help improve the quality of statistical 
software available to the world.


	  Spencer Graves
#
On Fri, 28 Jun 2024 11:02:12 -0500,
Spencer Graves <spencer.graves at prodsyse.com> wrote:
Judging by the traceback, only data frames that have a Package column
should have a findFn class:

9: PackageSummary(xi)
8: `[.findFn`(x, needconv)
7: x[needconv]
6: lapply(x[needconv], as.character)
5: utils::write.table(df1, "df1.csv", col.names = NA, sep = ",",
       dec = ".", qmethod = "double")

write.table sees columns that aren't of type character yet and tries to
convert them one by one, subsetting the data frame as a list. The call
lands in sos:::`[.findFn`

    if (missing(j)) {
        xi <- x[i, ]
        attr(xi, "PackageSummary") <- PackageSummary(xi)
        class(xi) <- c("findFn", "data.frame")
        return(xi)
    }

Subsetting methods are hard. For complex structures like data frames,
`[.class` must handle all of x[rows,cols]; x[rows,]; x[,cols];
x[cols]; and x[], and also respect the drop argument:
https://stat.ethz.ch/pipermail/r-help/2021-December/473207.html

I think that the `[.findFn` method mistakes x[needconv] for
x[needconv,] when it should instead perform x[,needconv].
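A self-contained sketch of how `[.findFn` might tell the two calls apart, using the nargs() idiom that `[.data.frame` itself relies on (hypothetical: PackageSummary here is a stub standing in for the real sos function):

```r
# Stub so the example runs without sos; the real PackageSummary differs.
PackageSummary <- function(x) table(x$Package)

`[.findFn` <- function(x, i, j, ...) {
  if (nargs() < 3L) {
    # x[i] with no comma: list-style *column* subsetting, as in
    # lapply(x[needconv], as.character) inside write.table()
    return(`[.data.frame`(x, i))
  }
  # x[i, j] / x[i, ]: matrix-style subsetting; keep the findFn extras
  xi <- `[.data.frame`(x, i, j, ...)
  if (is.data.frame(xi)) {
    attr(xi, "PackageSummary") <- PackageSummary(xi)
    class(xi) <- c("findFn", "data.frame")
  }
  xi
}

df <- data.frame(Package = c("a", "a", "b"), Score = 3:1)
class(df) <- c("findFn", "data.frame")
df["Score"]   # column subsetting no longer hits PackageSummary()
df[1:2, ]     # row subsetting still refreshes the summary attribute
```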
1 day later
#
At 17:02 on 28/06/2024, Spencer Graves wrote:
Hello,

I don't know if this answers the question. I wasn't able to reproduce 
the errors, but I did get warnings.

A way of not giving errors or warnings is to call write.csv at the end 
of a pipe such as the following.


df1 <- findFn("mean")
df1 |> as.data.frame() |> write.csv("df1.csv")


This solution is equivalent to the code proposed in the OP without the 
need for a change in base R.

Hope this helps,

Rui Barradas
#
Hi, Rui et al.:
On 6/29/24 14:24, Rui Barradas wrote:
Thanks for this. Ivan Krylov informed me that this was NOT a problem 
with base R but with "[.findFn". I fixed that and got help from Ivan 
fixing another problem with "sos". Now it is officially "on its way to 
CRAN."
Yes. I'm not yet facile with "|>", but I'm learning.


	  Spencer Graves
#
There's very little to know.  This:

      x |> f() |> g()

is just a different way of writing

     g(f(x))

If f() or g() have extra arguments, just add them afterwards:

     x |> f(a = 1) |> g(b = 2)

is just

     g(f(x, a = 1), b = 2)
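A quick way to confirm that for the base pipe this is a pure parse-time rewrite:

```r
f <- function(x, a = 0) x + a
g <- function(x, b = 1) x * b

# the two spellings produce identical results
identical(2 |> f(a = 1) |> g(b = 2), g(f(2, a = 1), b = 2))
#> [1] TRUE

# the parser has already rewritten the pipe before anything is evaluated
quote(x |> f(a = 1) |> g(b = 2))
#> g(f(x, a = 1), b = 2)
```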

This isn't quite true of the magrittr pipe, but it is exactly true of 
the base pipe.

Duncan Murdoch
#
Hi, Duncan:
On 6/29/24 17:24, Duncan Murdoch wrote:
Agreed. If I understand correctly, the supporters of the pipe form think 
it's easier to highlight and execute a subset of the pipe expression, 
e.g., "x |> f(a = 1)", than the corresponding subset of the nested form, 
"f(x, a = 1)". I remain unconvinced.


	  For debugging, I prefer the following:


	  fx1 <- f(x, a = 1)
	  g(fx1, b=2)


	  Yes, "fx1" occupies storage space that the other two do not. If you 
are writing code for an 8086, the difference is important. However, for 
my work, ease of debugging is important, which is why I prefer "fx1 <- 
f(x, a = 1); g(fx1, b=2)".


	  Thanks, again, for the reply.
	  Spencer Graves
#
I agree with you (I think we may be similarly aged), but there is the 
`magrittr::debug_pipe()` function, which can be inserted anywhere into 
either kind of pipe.  It will call `debug()` at that point, and let you 
examine the current value, before passing it on to the next entry.

You can't single step through a pipe (as far as I know), but with that 
modification, you can see what you've got at any point.

Duncan Murdoch
On 2024-06-29 6:57 p.m., Spencer Graves wrote:
#
I suggest there is actually quite a lot to know about piping, albeit you can use it fine while knowing little.

For those who can happily write complex lines of code containing nested function calls and never have to explain them to anyone, feel free. I can do that, and sometimes, months later, it takes me ten minutes just to figure out what I did, and then I check to see if I got it right!

But for people who are used to features vaguely similar in other languages, pipes are a great way to visualize data and process flow as they show a sort of sequence.

No, they are not at all the same as a UNIX pipe, but that is not a bad model: a shell pipeline does one conceptual step at a time, passing data to the input of another program that processes it further and passes it along until you reach some goal.
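The shell model in miniature: three small programs, each doing one step and handing its output to the next (here, counting duplicate lines and listing the most frequent first):

```shell
# sort groups identical lines; uniq -c counts each group; sort -rn
# orders the counts, most frequent first
printf 'b\na\nb\n' | sort | uniq -c | sort -rn
```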

Many languages, such as object-oriented ones, have a sort of pipeline that can look like:

a.method_a(args).method_b(args)

And in some languages, that can be spread across multiple lines to look a bit more like a pipeline. This too is an inexact analogy: what really happens is that calling a method on an object can return another object, on which you then call a method, and so on. This can make it limited in some ways or quite powerful.

The many versions that have been created of an R pipe can be variations on many themes. As an example, you could take the multiple lines in a pipeline and rearrange them to look like the nested code with function calls as arguments in other functions and then evaluate it. It would, in effect, be a sort of syntactic sugar that makes it easier for SOME programmers.

But the topic now shifts to debugging, and indeed the underlying implementation of a pipeline can affect how one debugs.

The simplest case is trivial to debug. No visible pipes:

Temp1 <- f1(x, args)
Temp2 <- f2(Temp1,  args)
Result <- f3(Temp2, args)
rm(Temp1, Temp2)

So one form of piping does something like this under the table:

For code like:
X PIPED f1(args) PIPED f2(args) PIPED f3(args) -> Result

It simply does something like this:

. <- x
. <- f1(., args)
.  <- f2(.,  args)
Result <- f3(., args)

The variable "." just gets re-used repeatedly. But since this code swap happens outside normal view, can a debugger follow it? And "." keeps changing. As a nice feature, some implementations also check whether "." appears in a later argument position, as in f3(args, ., more_args), and let you pipe into something other than the first argument, for the many functions that want the data second or third or ...

There are other implementations possible that allow the syntactic sugar without necessarily being run as shown. I am not sure how the native pipe that was added is implemented, but it seems quite a bit faster than many other implementations and has some quirks, such as requiring all functions to include parentheses (even if empty, as when piping to head()), and the way to do some things using anonymous functions is a tad annoying.
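For example (the `_` placeholder needs R >= 4.2; `seq()` is just an arbitrary function whose data argument is not first):

```r
x <- c(3, 1, 2)

x |> head()            # parentheses required even with no extra arguments
# x |> head            # parse error without them

# piping into a non-first argument: the "_" placeholder, passed by name
3 |> seq(1, to = _)
#> [1] 1 2 3

# an anonymous function needs the trailing () so the lambda is called
x |> (\(v) v[v > 1])()
#> [1] 3 2
```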

I think the focus for many people is the HUMAN who is programming and sees a logical way to describe what they want without much ambiguity. Of course, if you want to keep playing with your code, don't use pipes except perhaps when it is pretty much done.

An analogy to consider is another variant of piping used by ggplot where "+" is overloaded and:

ggplot(args) +
  geom_point(args) +
  geom_line(args) +
  xlab(args) +
  theme_bw() +
  coord_flip() +
  ...

Is a common way of writing a fairly complex set of operations. But what is being piped there is a growing object that each step modifies; at the end, the object is rendered into a graph based on whatever complex contents it contains. And, yes, that can be painful to debug, and a simple option is:

P <- ggplot(args)
P <- P + geom_point(args)
P <- P + geom_line(args)
...
print(P)

Being able to declare incremental changes and layers to a graph this way is more intuitive to some. Not using a pipelined approach lets you comment out parts easily, such as temporarily dropping the black/white theme, albeit you can just as easily comment out lines in the "+" version.

What some people need to understand is that adding pipes of any of the varieties has never taken away the ability to write the code in other ways. It is not in any way required. And for some people, it aligns better with how they reason. Yet, if your programs need lots of debugging, writing them differently may be a better idea, at least until they are debugged.

I have written code for my clients with quite elegant pipelines, as well as functions like the dplyr mutate() that allow me to do many things in one function call, and formatted it beautifully with varying levels of indentation so you can see at a glance where things line up. Parts of the code are nested function calls, and when it all leads to a ggplot structure like the above, it can be a tad hard for many people to appreciate what it is doing.

But then I get requests to change things, add or subtract features, allow some parts to be commented/documented close to where the code does things, or allow parameters to be set next to where they are called. What I sometimes do is go back to the linear style of code above, where each section does mostly one thing, with a comment before it and a setting of changeable parameters, like colors, that the customer can tune. The code gets much longer but can be absorbed step by step, and unless we remove variables no longer needed, it can have some performance issues if it is processing lots of data! LOL!

There is plenty more to know, but unless you have to read other people's code and modify it, it may be optional.

