Hello,
I have a list, shown below, whose elements are data frames (please see
the attached file "try.dat"). I want to apply a complicated function to
each row of each data frame; the function returns a single value. For
simplicity, you can assume this function is ma(x), where x is a row of
the data frame.
[[1]]
   class_id student_id  1  2
1         1          1  9 14
2         1          2  4  1
3         1          3 10  8
4         1          4  7  7
5         1          5  6 11
6         1          6  1  3
7         1          7 14 10
8         1          8 13 12
9         1          9 12  2
10        1         10  3  9
11        1         11  8  4
12        1         12 11  6
13        1         13  2 13
14        1         14  5  5
[[2]]
   class_id student_id  1  2
15        2          1 11  3
16        2          2  7 10
17        2          3  2  2
18        2          4  6  6
19        2          5 13  8
20        2          6 12 13
21        2          7  8 14
22        2          8  1  9
23        2          9  3  1
24        2         10  4 11
25        2         11  5  4
26        2         12  9 12
27        2         13 10  7
28        2         14 14  5
[[3]]
   class_id student_id  1  2
29        3          1 12  6
30        3          2  1  3
31        3          3  8  2
32        3          4  9 10
33        3          5 11  7
34        3          6 14  4
35        3          7  2 14
36        3          8 13 13
37        3          9  3  8
38        3         10  5 11
39        3         11  4 12
40        3         12  7  1
41        3         13 10  5
42        3         14  6  9
In the real situation the list will be very long and the data frames
much wider. That's why I want to use Rcpp to improve the speed.
I got stuck at the very beginning: I failed to import this list into
Rcpp, let alone the data frames inside it.
I've checked the book Seamless R and C++ Integration with Rcpp but
found NO example that deals with such a case.
Thank you very much for your support!
Best regards,
Sky
[Rcpp-devel] Please help! A list containing dataframe
5 messages · sky Xue, Mark Clements, Romain Francois +1 more
On 27/09/13 12:11, sky Xue wrote:
[...]
Rcpp has the Rcpp::DataFrame class which might help you but it does not
do much.
A data.frame is merely a list of vectors of the same size, but of
arbitrary types. This makes it difficult to process rows of a data frame.
So you have to do some work to grab a row of a data frame and apply
something to it. The code below assumes that you have a data frame that
contains only numeric vectors.
#include <Rcpp.h>
using namespace Rcpp;

// Function applied to each row; here simply the sum of its values.
double fun( NumericVector x ){
    return sum(x) ;
}

// Copy row i of the data frame (held as n column vectors) into 'row'.
void fill_row( NumericVector& row, const std::vector<NumericVector>& vectors, int i, int n ){
    for( int j=0; j<n; j++ ){
        row[j] = vectors[j][i] ;
    }
}

// [[Rcpp::export]]
NumericVector apply_row_df( DataFrame df ){
    int n = df.size() ;        // number of columns
    int nrows = df.nrows() ;   // number of rows

    // Grab the columns once; this assumes every column is numeric.
    std::vector<NumericVector> vectors(n) ;
    for( int i=0; i<n; i++ ) vectors[i] = df[i] ;

    NumericVector row(n) ;     // buffer reused for every row
    NumericVector results(nrows) ;
    for( int i=0; i<nrows; i++ ){
        fill_row( row, vectors, i, n ) ;
        results[i] = fun(row) ;
    }
    return results ;
}

// [[Rcpp::export]]
List apply_all( List list ){
    return lapply( list, apply_row_df ) ;
}

/*** R
df <- data.frame( x = seq(0, 10, .1), y = seq(0, 10, .1), z = seq(0, 10, .1) )
apply_row_df( df )
list_of_df <- rep( list(df), 10 )
apply_all( list_of_df )
*/
The function apply_row_df works on a single data frame: it calls the
fun function on each row of the data frame. Before each call, the
vector "row" is filled with that row's values by the fill_row function.
The rest is just a loop over the rows.
apply_all is just a convenience that applies apply_row_df to each
item of a list.
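To try the same looping logic outside of R, here is a minimal sketch in plain C++. It is not the Rcpp code above: std::vector<double> stands in for NumericVector, and the names Columns, row_fun, apply_rows and apply_all_frames are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stand-in for a numeric data frame: one std::vector<double> per column.
using Columns = std::vector<std::vector<double>>;

// Hypothetical per-row function; the row sum, like fun() in the message above.
double row_fun(const std::vector<double>& row) {
    return std::accumulate(row.begin(), row.end(), 0.0);
}

// Apply row_fun to every row, reusing a single row buffer.
std::vector<double> apply_rows(const Columns& cols) {
    const std::size_t ncol = cols.size();
    const std::size_t nrow = ncol == 0 ? 0 : cols[0].size();
    for (const auto& c : cols) assert(c.size() == nrow);  // equal-length columns

    std::vector<double> row(ncol);      // filled again for each row
    std::vector<double> results(nrow);
    for (std::size_t i = 0; i < nrow; ++i) {
        for (std::size_t j = 0; j < ncol; ++j) row[j] = cols[j][i];
        results[i] = row_fun(row);
    }
    return results;
}

// Counterpart of apply_all: one result vector per data frame in the list.
std::vector<std::vector<double>> apply_all_frames(const std::vector<Columns>& frames) {
    std::vector<std::vector<double>> out;
    out.reserve(frames.size());
    for (const auto& f : frames) out.push_back(apply_rows(f));
    return out;
}
```

Only one row is held in the buffer at a time, so the extra memory is proportional to the number of columns rather than to the size of the data frame.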
Hope this helps.
Romain
Romain Francois Professional R Enthusiast +33(0) 6 28 91 30 30
This can be done more generally.
Following an earlier suggestion from Romain, we can use boost::tuple from the BH package - for a row of fixed size with general types. Then we can use a template to read in the data-frame and work with the set of rows.
Variadic templates would be nice here, rather than needing to enumerate for tuples of different lengths.
Out of interest, is this poor style for Rcpp?
Sincerely, Mark.
require(inline)
testReadDf <-
  rcpp(signature(df="data.frame"),
       includes="
#include <boost/tuple/tuple.hpp>
#include <vector>
#include <algorithm>

// general function to read a data-frame
template <class T1, class T2, class T3, class T4>
std::vector<boost::tuple<T1,T2,T3,T4> > read_df( DataFrame df ){
    typedef boost::tuple<T1,T2,T3,T4> Row;
    int n = df.nrows() ;
    std::vector<Row> rows(n) ;
    Vector<traits::r_sexptype_traits<T1>::rtype> df0 = df[0];
    Vector<traits::r_sexptype_traits<T2>::rtype> df1 = df[1];
    Vector<traits::r_sexptype_traits<T3>::rtype> df2 = df[2];
    Vector<traits::r_sexptype_traits<T4>::rtype> df3 = df[3];
    for( int i=0; i<n; i++)
        rows[i] = Row(df0[i],df1[i],df2[i],df3[i]);
    return rows ;
}

// example function
typedef boost::tuple<int,int,int,int> MyRow;
int fun(MyRow row) {
    return boost::get<0>(row)+2*boost::get<1>(row)+3*boost::get<2>(row)+4*boost::get<3>(row);
}
",
       body="
// read in the data-frame as a vector of rows
std::vector<MyRow> v = read_df<int,int,int,int>(df);
int n = v.size();
std::vector<int> out(n);
std::transform(v.begin(),v.end(),out.begin(),fun);
return wrap(out);
")
testReadDf(data.frame(1,2,3,4))
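As a rough illustration of the variadic-template idea mentioned above, the sketch below builds row tuples from any number of typed columns without enumerating T1..T4. It is plain C++ (C++11 or later) using std::tuple, with std::vector<T> standing in for the Rcpp column vectors; read_columns and weighted_sum are hypothetical names, not part of Rcpp or BH. Like the original, it copies all the data into the vector of rows.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Build a vector of row tuples from any number of typed columns.
// std::vector<T> is an assumed stand-in for an Rcpp column vector.
template <typename... Ts>
std::vector<std::tuple<Ts...>> read_columns(const std::vector<Ts>&... cols) {
    static_assert(sizeof...(Ts) > 0, "at least one column is required");
    const std::size_t sizes[] = {cols.size()...};
    const std::size_t n = sizes[0];
    for (std::size_t s : sizes) assert(s == n);  // all columns share one length

    std::vector<std::tuple<Ts...>> rows;
    rows.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        rows.emplace_back(cols[i]...);  // expands to col0[i], col1[i], ...
    return rows;
}

// Same weighted sum as the fun() example above, written for std::tuple.
using MyRow = std::tuple<int, int, int, int>;
int weighted_sum(const MyRow& row) {
    return std::get<0>(row) + 2 * std::get<1>(row)
         + 3 * std::get<2>(row) + 4 * std::get<3>(row);
}
```

The column types are deduced from the arguments, so read_columns(a, b) and read_columns(a, b, c, d) both work without separate overloads.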
Hello,
Storing the data frame as a vector<tuple<...>> feels very inefficient:
in essence you are copying all the data to another structure, which is
not much easier to use anyway. The fun implementation feels like
boilerplate:
    int fun(MyRow row) {
        return boost::get<0>(row)+2*boost::get<1>(row)+3*boost::get<2>(row)+4*boost::get<3>(row);
    }
The version I proposed is not restricted to 4 columns and will be more
efficient since it does not need to copy all the data into a second
structure. It just stores one row at a time and processes it.
As for variadic templates: yes, they can definitely help. In Rcpp11 I
use them extensively, and that allowed me to reduce the code size
dramatically (Rcpp11 is about 40% the size of Rcpp).
See for example:
https://github.com/romainfrancois/Rcpp11/blob/master/inst/include/Rcpp/sugar/functions/replicate.h
This is used to implement this feature:
    double fun( double x, double y, int z ){
        return x + y + z ;
    }
    NumericVector x = replicate( 10, call( fun, 1.0, 2.0, 3 ) ) ;
Another example is this 75-line file:
https://github.com/romainfrancois/Rcpp11/blob/master/inst/include/Rcpp/module/FunctionInvoker.h
which replaces a file that weighs in at about 14666 lines in Rcpp.
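To give an idea of why variadic templates shrink this kind of code, here is a small standalone sketch in plain C++11. It is not the Rcpp11 implementation: replicate_n is a hypothetical helper that only mimics the replicate( n, call( fun, ... ) ) idiom, and fun3 is the same example function under a different name.

```cpp
#include <cstddef>
#include <vector>

// Same example function as in the message above.
double fun3(double x, double y, int z) {
    return x + y + z;
}

// Hypothetical helper: call fun(args...) n times and collect the results.
// One template covers every number and type of arguments.
template <typename F, typename... Args>
auto replicate_n(std::size_t n, F fun, Args... args)
    -> std::vector<decltype(fun(args...))> {
    std::vector<decltype(fun(args...))> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(fun(args...));
    }
    return out;
}
```

A call such as replicate_n(10, fun3, 1.0, 2.0, 3) yields a std::vector<double> of length 10. Without parameter packs, a separate overload would be needed for each number of arguments, which is where the large generated files come from.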
Romain
On 27/09/13 16:12, Mark Clements wrote:
[...]
This is not an Rcpp solution, but be advised that as long as your
function "ma" does not require values from previous rows, you might
want to look at the data.table package. This package provides
ultra-fast operations on data frames. Of course, nothing prevents you
from writing "ma" in C/C++ and using data.table to apply the function
quickly and to insert the new data into the data.table by reference.