
[Rcpp-devel] efficient ingestion of "sparse csv"

5 messages · Dirk Eddelbuettel, Vincent Carey, Serguei Sokol

#
This problem has been discussed in various places but I don't
see a clear solution.  Certain applications are generating
large comma-delimited files with mostly zero entries.  The aim
is to ingest efficiently, converting to sparse representation
a record at a time.  Presumably a triplet format would be the
initial internal representation, with an aim to convert at
the end to Matrix dgCMatrix format.  Has anyone tackled this
in Rcpp or RcppArmadillo?
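
For concreteness, the triplet-to-compressed-column step mentioned above can be sketched in plain C++ as below. This is only an illustration under stated assumptions: the `Triplets` and `Csc` structs and the `triplets_to_csc` function are hypothetical names, not part of Matrix, Rcpp or Armadillo; indices are zero-based; and no (row, col) pair is assumed to occur twice. The `i`, `p`, `x` members mirror the slot names of a dgCMatrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical containers for illustration only.
struct Triplets {
    std::vector<int> row, col;   // zero-based row and column indices
    std::vector<double> val;     // nonzero values
};

struct Csc {
    std::vector<int> i;     // row index of each stored value
    std::vector<int> p;     // column pointers, length ncol + 1
    std::vector<double> x;  // stored values, column by column
};

// Convert triplets to compressed sparse column form with a counting sort.
// Assumes no duplicate (row, col) pairs.
Csc triplets_to_csc(const Triplets& t, int ncol) {
    assert(t.row.size() == t.col.size() && t.col.size() == t.val.size());
    const std::size_t nnz = t.val.size();

    Csc out;
    out.i.resize(nnz);
    out.x.resize(nnz);
    out.p.assign(static_cast<std::size_t>(ncol) + 1, 0);

    // Count the entries in each column, then prefix-sum into column pointers.
    for (std::size_t k = 0; k < nnz; ++k) ++out.p[t.col[k] + 1];
    for (int j = 0; j < ncol; ++j) out.p[j + 1] += out.p[j];

    // Scatter each triplet into the next free slot of its column.
    std::vector<int> next(out.p.begin(), out.p.end() - 1);
    for (std::size_t k = 0; k < nnz; ++k) {
        const int dest = next[t.col[k]]++;
        out.i[dest] = t.row[k];
        out.x[dest] = t.val[k];
    }
    return out;
}
```

Because the scatter preserves input order within each column, triplets collected record by record (that is, in row order) come out with ascending row indices inside every column.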
#
Vincent,

In the broad terms of the question the best answer may be a simple "sure".
More seriously, there have been many approaches.  Consider for example the
recent Rcpp Gallery post led by Zach (with some edits by me):
  https://gallery.rcpp.org/articles/sparse-matrix-class/

Its focus is on not copying <i,p,x> again if we already have them as R vectors,
which is a fair point. If the goal is to get to SuperLU via (Rcpp)Armadillo
then I do not think you can avoid the (internal) copies.  As always, the
answer may be "it depends".

Hope this helps, happy to refine,  Dirk
#
Thanks Dirk, lots of useful information there.  I wonder whether the sparse
ingestion problem would best be solved with multiple passes -- it seems one
would want to learn the dimensions and the number of nonzero elements per row
to allocate the index vectors, and then populate them and the data vector with
a final pass.
Or one could use a buffering strategy to grow the index vectors as needed in a
one-pass approach.
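
A rough sketch of the one-pass variant in plain C++, letting std::vector handle the buffer growth, might look like this. The names `SparseTriplets` and `read_sparse_csv` are hypothetical, and the sketch assumes purely numeric fields, no header row, no row names and no quoting; fields that do not parse as numbers are silently treated as zero.

```cpp
#include <cassert>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for illustration only.
struct SparseTriplets {
    std::vector<int> row, col;   // zero-based indices of nonzero entries
    std::vector<double> val;     // the nonzero values
    int nrow = 0, ncol = 0;      // dimensions learned while reading
};

// Read a numeric CSV stream one record at a time, keeping only nonzeros.
// The vectors grow geometrically via push_back, so no counting pass is needed.
SparseTriplets read_sparse_csv(std::istream& in) {
    SparseTriplets t;
    std::string line, field;
    while (std::getline(in, line)) {
        if (line.empty()) continue;          // skip blank lines
        std::istringstream fields(line);
        int j = 0;
        while (std::getline(fields, field, ',')) {
            // strtod returns 0 for fields it cannot parse (assumption: acceptable here)
            const double v = std::strtod(field.c_str(), nullptr);
            if (v != 0.0) {
                t.row.push_back(t.nrow);
                t.col.push_back(j);
                t.val.push_back(v);
            }
            ++j;
        }
        if (j > t.ncol) t.ncol = j;          // widest record seen so far
        ++t.nrow;
    }
    assert(t.row.size() == t.col.size() && t.col.size() == t.val.size());
    return t;
}
```

The two-pass alternative would trade the reallocation cost for reading the file twice; which is cheaper probably depends on file size and storage speed.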
On Mon, May 10, 2021 at 11:19 PM Dirk Eddelbuettel <edd at debian.org> wrote:

            

  
    
15 days later
#
On this theme, the following proved sufficient to ingest and
convert sparse csv without column headers or row names:

#include "RcppArmadillo.h"

using namespace Rcpp;

// [[Rcpp::depends(RcppArmadillo)]]
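// [[Rcpp::export]]
List parse_sparse_csv_impl(SEXP fname) {
    // Convert the R character scalar to a C++ string
    std::string v = Rcpp::as<std::string>(fname);
    // Let Armadillo read the csv directly into a sparse matrix
    arma::sp_mat D;
    D.load(v, arma::csv_ascii);
    // Return the sparse matrix as the list element "sp"
    return Rcpp::List::create(Rcpp::Named("sp")=D);
}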
#
On 26/05/2021 at 16:36, Vincent Carey wrote:
Thanks for sharing your final solution, which could be shortened further to
something like:

#include "RcppArmadillo.h"
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::sp_mat parse_sparse_csv_short(std::string fname) {
    arma::sp_mat D;
    D.load(fname, arma::csv_ascii);
    return D;
}

Best,
Serguei.