
[Rcpp-devel] Performance question about DataFrame

13 messages · Yan Zhou, Dirk Eddelbuettel, John Merrill +3 more

#
It appears that DataFrame::create is a thin layer on top of the R
data.frame call.  That guarantees correctness, but also means the performance
of an Rcpp routine which returns a large data frame is limited by the
performance of data.frame -- which is utterly horrible.

In the current version of R, there's a trivial, but borderline evil,
workaround: build a list of lists meeting the basic requirements of a data
frame (they all need to be of the same length, and each component list
needs to be named) and set the class of the object to "data.frame".

I have two questions:
(1) Is it reasonable to anticipate that this hack will continue to work for
the near future in R?
(2) If so, would a patch to that effect be of interest to the developers?
#
I am curious what usage of data.frame gives you the conclusion that it is slow. You must know that a data.frame IS a list of variables, which can be vectors (though not always), and so can only be faster than a list of lists.

Best,

Yan
_______________________________________________
Rcpp-devel mailing list
Rcpp-devel at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
#
On 15 January 2013 at 07:20, John Merrill wrote:
| It appears that DataFrame::create is a thin layer on top of the R data.frame
| call.  The guarantee correctness, but also means the performance of an Rcpp
| routine which returns a large data frame is limited by the performance of
| data.frame -- which is utterly horrible.

All correct. It is really mostly a convenience layer.  When we use R, we think
of data.frame objects as accessible by row -- which is not something we can
easily do at the C++ layer.  So the DataFrame class is really mostly a
wrapper around a list (as it is internally) with a call to R to set it.
 
| In the current version of R, there's a trivial, but borderline evil, work
| around: build a list of lists meeting the basic requirements of a data frame
| (they all need to be of the same length, and each component list needs to be
| named) and set the type of the object to "data.frame".
| 
| I have two questions:
| (1) Is it reasonable to anticipate that this hack will continue to work for the
| near future in R?

We cannot speak for R Core.  

But this is so fundamental to so many things that I (personally speaking) am
inclined to say yes.

(Or did you mean Rcpp instead of R?  If so, example code?)

| (2) If so, would a patch to that effect be of interest to the developers?

We are always open to reasonable patches that bring improvements (and come with
test cases demonstrating usefulness and a testing framework).

As I recall there is also an open bug in our DataFrame right now, so if you want
to work on it, great :)

Dirk
#
You're confusing a data frame object with the data.frame coercion function.
 Data frames themselves are fast to access.  The coercion function is not.
#
On Jan 15, 2013, at 03:38 PM, John Merrill <john.merrill at gmail.com> wrote:
You're confusing a data frame object with the data.frame coercion function.  Data frames themselves are fast to access.  The coercion function is not.

Ah, I see what you mean.
2 days later
#
On Tue, Jan 15, 2013 at 9:20 AM, John Merrill <john.merrill at gmail.com> wrote:
Are you certain that this claim is still true?

I was shocked/surprised by the package "dataframe" and the commentary
about it. Its description reads: "This contains versions of standard
data frame functions in R, modified to avoid making extra copies of
inputs. This is faster, particularly for large data."

The author said that data.frame was slow because it was repeatedly
copying some objects, and he provided a substantially faster approach.

In the release notes for R-2.15.1, I recall seeing a note that R Core
had responded by integrating several of those changes. But still
data.frame is not fast for you?

If they didn't make the core data.frame as fast, would you care to
enlighten us by installing the dataframe package and letting us know
if it is still faster?

Or perhaps you are way ahead of me and you've already imitated
Hesterberg's algorithms in your C++ design?

pj
#
As of 2.15.1, data.frame appears to no longer be O(n^2) in the number of
columns in the frame.  That's certainly an improvement, yes.

However, by eliminating calls to data.frame and replacing them with direct
class modifications, I can take a routine which takes minutes and reduce it
to a routine which takes seconds.  So, pragmatically, in Rcpp, I can get a
rough factor of sixty, it appears.
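
A minimal R sketch of the kind of comparison described above, assuming a list of equal-length numeric columns. The names cols, via_data_frame and via_class are illustrative and do not come from the thread, and the timings will differ by machine and R version, so none are quoted here.

```r
# Illustrative sketch: compare data.frame() with direct attribute assignment.
n_rows <- 1000L
n_cols <- 200L
cols <- lapply(seq_len(n_cols), function(j) runif(n_rows))
names(cols) <- paste0("X.", seq_len(n_cols))

# Path 1: the regular constructor, which validates and copies its inputs.
t_constructor <- system.time(via_data_frame <- data.frame(cols))

# Path 2: set only the attributes that make a list a data frame.
t_direct <- system.time({
  via_class <- cols
  attr(via_class, "row.names") <- c(NA_integer_, -n_rows)
  class(via_class) <- "data.frame"
})

print(t_constructor)
print(t_direct)

# Both routes should describe the same data.
stopifnot(identical(dim(via_data_frame), dim(via_class)))
stopifnot(identical(via_data_frame$X.1, via_class$X.1))
```

Note that the direct route skips every check data.frame() performs, so it is only safe when the columns are already known to have equal length.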
#
On Thu, Jan 17, 2013 at 9:54 PM, John Merrill <john.merrill at gmail.com> wrote:
Wow.

When you have this written out, will you post links to it?  I can
learn from your examples, I think.

pj

  
    
#
Sure.  I'll write something up for the gallery, but here's the crude
outline.

Here's the C++ code:

#include <Rcpp.h>
#include <cstdio>

using namespace Rcpp;

// [[Rcpp::export]]
List BuildCheapDataFrame(List a) {
  // Copy the input so the caller's list is not modified.
  List returned_frame = clone(a);
  // Use the first column (index 0) to determine the number of rows.
  // Assumes the list has at least one column.
  GenericVector sample_row = returned_frame(0);

  // Row names "0", "1", ...; the buffer is large enough for any int.
  StringVector row_names(sample_row.length());
  for (int i = 0; i < sample_row.length(); ++i) {
    char name[16];
    std::snprintf(name, sizeof(name), "%d", i);
    row_names(i) = name;
  }
  returned_frame.attr("row.names") = row_names;

  // Column names "X.0", "X.1", ...
  StringVector col_names(returned_frame.length());
  for (int j = 0; j < returned_frame.length(); ++j) {
    char name[16];
    std::snprintf(name, sizeof(name), "X.%d", j);
    col_names(j) = name;
  }
  returned_frame.attr("names") = col_names;
  returned_frame.attr("class") = "data.frame";

  return returned_frame;
}

There are some subtleties in this code:

* The row and column names are formatted into fixed-size buffers, so those
buffers must be large enough for the largest index.  With buffers of only 5
or 6 bytes, sprintf overflows once a frame has 10,000 or more rows (or 1,000
or more columns), which makes super-large data frames fail; snprintf with a
larger buffer avoids that.  This routine uses Rcpp::export in order to use
sourceCpp.
* Notice the invocation of clone() in the first line of the code.  If you
don't do that, you wind up side-effecting the parameter, which is not what
most people would expect.

Here's the timing, as measured on an AWS node:
   user  system elapsed
  3.890   0.000   3.892
   user  system elapsed
  0.020   0.000   0.022

Yes, that really is a speedup of nearly a factor of 200 (about 177x on
elapsed time).
12 days later
#
On 15/01/13 16:20, John Merrill wrote:
The reason we used a callback to data.frame is close to laziness on our
part. With the R function, for example, we know that columns of different
sizes will be handled properly, with recycling, etc.

Just making a named list of vectors is not enough. We have to make sure
they all have the same length.

Perhaps it would be worth checking this and making better
DataFrame::create functions.



Also, you can use a shortcut to assign row names, i.e. mimic this in C++ 
(the second line contains the magic):

> d <- list( x = 1:10, y = 1:10 )
> attr( d, "row.names" ) <- c( NA, -10L )
> attr( d, "class" ) <- "data.frame"
> d
    x  y
1   1  1
2   2  2
3   3  3
4   4  4
5   5  5
6   6  6
7   7  7
8   8  8
9   9  9
10 10 10
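
The c( NA, -10L ) value is R's compact encoding for automatic row names 1..n. A small sketch, using only base R, of how one might check that the shortcut yields the same row names as the regular constructor; the object names d and ref are illustrative.

```r
# Build a data frame by hand using the compact row-name encoding.
d <- list(x = 1:10, y = 1:10)
attr(d, "row.names") <- c(NA_integer_, -10L)
attr(d, "class") <- "data.frame"

# Build the same thing through the regular constructor for comparison.
ref <- data.frame(x = 1:10, y = 1:10)

# A negative value from .row_names_info() signals automatic row names.
stopifnot(.row_names_info(d) < 0L)
stopifnot(.row_names_info(d, 2L) == 10L)
stopifnot(identical(rownames(d), as.character(1:10)))
stopifnot(identical(dim(d), dim(ref)))
stopifnot(identical(d$x, ref$x))
```

The compact form avoids allocating one string per row, which matters for frames with many rows.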


Romain
#
On Fri, Jan 18, 2013 at 6:25 PM, John Merrill <john.merrill at gmail.com> wrote:
Is C++ really necessary here?  I have the following R function in plyr:

quickdf <- function(list) {
  rows <- unique(unlist(lapply(list, NROW)))
  stopifnot(length(rows) == 1)

  names(list) <- make_names(list, "X")
  class(list) <- "data.frame"
  attr(list, "row.names") <- c(NA_integer_, -rows)

  list
}

which is basically equivalent (although I do some tricks with
rownames).  It's even more efficient if you copy and paste the
contents instead of calling the function because then that avoids
duplicating the input list, and instead modifies it in place.

Hadley
#
And S3 methods (as.data.frame) will be dispatched upon correctly etc.

Hadley
#
I agree that this is not a complete implementation; it isn't meant to be,
although it might still be worth incorporating this into Rcpp with the
appropriate fixes in place.

For instance, the vector recycling issue is far from the greatest
limitation of this code: it handles character vectors wrong.  The R routine
converts character vectors into factors unless overridden; I wrote the
precursor of this particular routine because I wanted to handle strings
faithfully, and so ended up writing a stupid R routine to coerce lists of
lists of constant length to data frames.
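
The string-handling difference can be sketched in R as follows. This assumes the factor conversion is requested explicitly with stringsAsFactors = TRUE, which was the default at the time of this thread and has been FALSE by default since R 4.0.0; the object names are illustrative.

```r
cols <- list(id = 1:3, label = c("a", "b", "c"))

# Constructor with factor conversion switched on.
converted <- data.frame(cols, stringsAsFactors = TRUE)

# Direct attribute assignment leaves every column untouched.
faithful <- cols
attr(faithful, "row.names") <- c(NA_integer_, -3L)
class(faithful) <- "data.frame"

stopifnot(is.factor(converted$label))
stopifnot(is.character(faithful$label))
stopifnot(identical(faithful$label, cols$label))
```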

