It appears that DataFrame::create is a thin layer on top of the R data.frame call. That guarantees correctness, but also means the performance of an Rcpp routine which returns a large data frame is limited by the performance of data.frame -- which is utterly horrible.

In the current version of R, there's a trivial, but borderline evil, workaround: build a list of lists meeting the basic requirements of a data frame (they all need to be of the same length, and each component list needs to be named) and set the type of the object to "data.frame".

I have two questions:
(1) Is it reasonable to anticipate that this hack will continue to work for the near future in R?
(2) If so, would a patch to that effect be of interest to the developers?
[Rcpp-devel] Performance question about DataFrame
13 messages · Yan Zhou, Dirk Eddelbuettel, John Merrill +3 more
I am curious what usage of data.frame gives you the conclusion that it is slow. You must know that a data.frame IS a list of variables, which can be vectors (though not always), and can only be faster than a list of lists.

Best, Yan
On 15 January 2013 at 07:20, John Merrill wrote:
| It appears that DataFrame::create is a thin layer on top of the R data.frame
| call. That guarantees correctness, but also means the performance of an Rcpp
| routine which returns a large data frame is limited by the performance of
| data.frame -- which is utterly horrible.

All correct. It really is mostly a convenience layer. When we use R, we think of data.frame objects as accessible by row -- which is not something we can easily do at the C++ layer. So the DataFrame class is really mostly a wrapper around a list (as it is internally) with a call to R to set it.

| In the current version of R, there's a trivial, but borderline evil,
| workaround: build a list of lists meeting the basic requirements of a data frame
| (they all need to be of the same length, and each component list needs to be
| named) and set the type of the object to "data.frame".
|
| I have two questions:
| (1) Is it reasonable to anticipate that this hack will continue to work for the
| near future in R?

We cannot speak for R Core. But this is so fundamental to so many things that I (personally speaking) am inclined to say yes. (Or did you mean Rcpp instead of R? If so, example code?)

| (2) If so, would a patch to that effect be of interest to the developers?

We are always open to reasonable patches that bring improvements (and come with test cases demonstrating usefulness and a testing framework). As I recall there is also an open bug in our DataFrame right now, so if you want to work on it, great :)

Dirk
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
You're confusing a data frame object with the data.frame coercion function. Data frames themselves are fast to access. The coercion function is not.
On Jan 15, 2013, at 03:38 PM, John Merrill <john.merrill at gmail.com> wrote:
You're confusing a data frame object with the data.frame coercion function. Data frames themselves are fast to access. The coercion function is not.

Ah, I see what you mean.
2 days later
On Tue, Jan 15, 2013 at 9:20 AM, John Merrill <john.merrill at gmail.com> wrote:
It appears that DataFrame::create is a thin layer on top of the R data.frame call. That guarantees correctness, but also means the performance of an Rcpp routine which returns a large data frame is limited by the performance of data.frame -- which is utterly horrible.
Are you certain that this claim is still true?

I was shocked/surprised by the package "dataframe" and the commentary about it. The author said that data.frame was slow because it was repeatedly copying some objects ("This contains versions of standard data frame functions in R, modified to avoid making extra copies of inputs. This is faster, particularly for large data."), and he demonstrated a substantially faster approach. In the release notes for R-2.15.1, I recall seeing a note that R Core had responded by integrating several of those changes.

But still data.frame is not fast for you?

If they didn't make the core data.frame as fast, would you care to enlighten us by installing the dataframe package and letting us know if it is still faster? Or perhaps you are way ahead of me and you've already imitated Hesterberg's algorithms in your C++ design?

pj
Paul E. Johnson
Professor, Political Science, University of Kansas
1541 Lilac Lane, Room 504
http://pj.freefaculty.org
Assoc. Director, Center for Research Methods, University of Kansas
http://quant.ku.edu
As of 2.15.1, data.frame appears to no longer be O(n^2) in the number of columns in the frame. That's certainly an improvement, yes. However, by eliminating calls to data.frame and replacing them with direct class modifications, I can take a routine which takes minutes and reduce it to a routine which takes seconds. So, pragmatically, in Rcpp, I can get a rough factor of sixty, it appears.
On Thu, Jan 17, 2013 at 9:54 PM, John Merrill <john.merrill at gmail.com> wrote:
As of 2.15.1, data.frame appears to no longer be O(n^2) in the number of columns in the frame. That's certainly an improvement, yes. However, by eliminating calls to data.frame and replacing them with direct class modifications, I can take a routine which takes minutes and reduce it to a routine which takes seconds. So, pragmatically, in Rcpp, I can get a rough factor of sixty, it appears.
Wow. When you have this written out, will you post links to it? I can learn from your examples, I think. pj
Sure. I'll write something up for the gallery, but here's the crude
outline.
Here's the C++ code:
#include <Rcpp.h>
#include <cstdio>
using namespace Rcpp;

// [[Rcpp::export]]
List BuildCheapDataFrame(List a) {
  // Copy the input so that the caller's list is not modified in place.
  List returned_frame = clone(a);

  // Columns are assumed to share one length, so the first gives the row count.
  GenericVector sample_row = returned_frame(0);
  StringVector row_names(sample_row.length());
  for (int i = 0; i < sample_row.length(); ++i) {
    char name[32];  // room for any int plus the terminating NUL
    snprintf(name, sizeof(name), "%d", i);
    row_names(i) = name;
  }
  returned_frame.attr("row.names") = row_names;

  StringVector col_names(returned_frame.length());
  for (int j = 0; j < returned_frame.length(); ++j) {
    char name[32];  // "X." plus any int plus the terminating NUL
    snprintf(name, sizeof(name), "X.%d", j);
    col_names(j) = name;
  }
  returned_frame.attr("names") = col_names;

  // Setting the class attribute directly skips the call to data.frame.
  returned_frame.attr("class") = "data.frame";
  return returned_frame;
}

There are some subtleties in this code:

* The name buffers have to be large enough for the longest name. With sprintf
into 5- or 6-byte buffers, anything past 10,000 rows or 1,000 columns
overflows, so super-large data frames were a problem; snprintf with 32-byte
buffers avoids that. I've never seen that problem when I've written Rcpp
functions which exchanged SEXPs with R, but this one uses Rcpp::export in
order to use sourceCpp.

* Notice the invocation of clone() in the first line of the code. If you
don't do that, you wind up side-effecting the parameter, which is not what
most people would expect.
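
Index-based names can also be built without any fixed-size buffer. A minimal sketch in plain C++ (no Rcpp needed to compile it), where make_index_names is a hypothetical helper and not anything provided by Rcpp:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp): builds names such as "0", "1", ...
// or, with a prefix, "X.0", "X.1", ...
// std::to_string returns a string sized for the number it formats, so there
// is no fixed-size buffer to overflow, however many names are requested.
std::vector<std::string> make_index_names(std::size_t count,
                                          const std::string& prefix = "") {
  std::vector<std::string> names;
  names.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    names.push_back(prefix + std::to_string(i));
  }
  return names;
}
```

Inside an Rcpp function, each resulting std::string can be assigned to an element of a StringVector, for example row_names(i) = names[i].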
Here's the timing, as measured on an AWS node:

> sourceCpp('/tmp/test_adf.cc')
> a <- replicate(250, 1:100, simplify=FALSE)
> system.time(replicate( { as.data.frame(a) ; NULL }, n=100))
   user  system elapsed
  3.890   0.000   3.892
> system.time(replicate( { BuildCheapDataFrame(a) ; NULL }, n=100))
   user  system elapsed
  0.020   0.000   0.022

Yes, that really is a speedup of nearly a factor of 200 (about 195x in user time, about 177x in elapsed time).
12 days later
The reason we used a callback to data.frame is close to laziness on our
part. With the R function, for example, we know that columns of different
sizes will be handled properly, with recycling, etc.

Just making a named list of vectors is not enough. We have to make sure
they all have the same length.

Perhaps it would be worth checking this and making better
DataFrame::create functions.
Also, you can use a shortcut to assign row names, i.e. mimic this in C++
(the second line contains the magic):
> d <- list( x = 1:10, y = 1:10 )
> attr( d, "row.names" ) <- c( NA, -10L )
> attr( d, "class" ) <- "data.frame"
> d
    x  y
1   1  1
2   2  2
3   3  3
4   4  4
5   5  5
6   6  6
7   7  7
8   8  8
9   9  9
10 10 10
Romain
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
R Graph Gallery: http://gallery.r-enthusiasts.com
blog: http://romainfrancois.blog.free.fr
|- http://bit.ly/RE6sYH : OOP with Rcpp modules
`- http://bit.ly/Thw7IK : Rcpp modules more flexible
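
Romain's point that the columns must all have the same length can be checked before the class attribute is set. A minimal sketch in plain C++ (no Rcpp), assuming the whole-number recycling rule described for data.frame; common_row_count and recycle_column are hypothetical helpers, and in an Rcpp function the lengths would be read from the columns of the list:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: given the length of each column, return the number of
// rows the data frame would have. Assumed rule: a column may be shorter than
// the longest one only if its length divides the longest length evenly.
std::size_t common_row_count(const std::vector<std::size_t>& column_lengths) {
  std::size_t rows = 0;
  for (std::size_t len : column_lengths) {
    if (len > rows) rows = len;
  }
  for (std::size_t len : column_lengths) {
    // A zero-length column only fits a zero-row frame.
    const bool fits = (len == 0) ? (rows == 0) : (rows % len == 0);
    if (!fits) {
      throw std::invalid_argument("columns imply differing numbers of rows");
    }
  }
  return rows;
}

// Hypothetical helper: repeat a column until it has `rows` elements.
// Precondition: common_row_count accepted this column's length for `rows`.
std::vector<int> recycle_column(const std::vector<int>& column,
                                std::size_t rows) {
  std::vector<int> out;
  out.reserve(rows);
  for (std::size_t i = 0; i < rows; ++i) {
    out.push_back(column[i % column.size()]);
  }
  return out;
}
```

Setting the class attribute directly does no recycling on its own, so shorter columns would need to be expanded this way first, or rejected outright if only equal lengths are to be accepted.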
On Fri, Jan 18, 2013 at 6:25 PM, John Merrill <john.merrill at gmail.com> wrote:
Sure. I'll write something up for the gallery, but here's the crude outline. Here's the C++ code:
Is C++ really necessary here? I have the following R function in plyr:
quickdf <- function(list) {
  # All columns must have the same number of rows.
  rows <- unique(unlist(lapply(list, NROW)))
  stopifnot(length(rows) == 1)

  names(list) <- make_names(list, "X")
  class(list) <- "data.frame"
  # Compact row names: c(NA, -n) stands for rows 1..n.
  attr(list, "row.names") <- c(NA_integer_, -rows)

  list
}
which is basically equivalent (although I do some tricks with
rownames). It's even more efficient if you copy and paste the
contents instead of calling the function because then that avoids
duplicating the input list, and instead modifies it in place.
Hadley
Chief Scientist, RStudio http://had.co.nz/
The reason we used a callback to data.frame is close to laziness on our part. With the R function, for example, we know that columns of different sizes will be handled properly, with recycling, etc.

And S3 methods (as.data.frame) will be dispatched upon correctly, etc.

Hadley
I agree that this is not a complete implementation; it isn't meant to be, although it might still be worth incorporating this into Rcpp with the appropriate fixes in place.

For instance, the vector recycling issue is far from the greatest limitation of this code: it handles character vectors wrong. The R routine converts character vectors into factors unless overridden; I wrote the precursor of this particular routine because I wanted to handle strings faithfully, and so wrote a stupid R routine to coerce lists of lists of constant length to data frames.