[Bioc-devel] any interest in a BiocMatrix core package?

7 messages · Tim Triche, Jr., Vincent Carey, Aaron Lun +2 more

#
Hi everyone,

I just attended the Human Cell Atlas meeting at Stanford, and people were talking about gene expression matrices for >1 million cells. If we assume that we can get non-zero expression profiles for ~5000 genes, we'd be talking about a 5000 x 1 million matrix for the raw count data. Stored densely, this would be 20-40 GB in size (depending on whether each entry is a 4-byte integer or an 8-byte double), which would clearly benefit from sparse (via Matrix) or disk-backed representations (bigmemory, BufferedMatrix, rhdf5, etc.).
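As a rough check of those figures, here is a minimal C++ sketch of the arithmetic. The function names are hypothetical, and the sparse estimate assumes a compressed sparse column layout with one stored value and one row index per non-zero plus one pointer per column (the layout used by the Matrix package's dgCMatrix class); the fraction of non-zero entries is something you would have to measure on real data.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store every entry of an nrow x ncol dense matrix.
inline std::uint64_t dense_bytes(std::uint64_t nrow, std::uint64_t ncol,
                                 std::uint64_t bytes_per_value) {
    return nrow * ncol * bytes_per_value;
}

// Bytes for a compressed sparse column (CSC) layout: one value and one row
// index per non-zero entry, plus (ncol + 1) column pointers.
// Hypothetical helper; index and pointer widths are assumed to be equal.
inline std::uint64_t csc_bytes(std::uint64_t nrow, std::uint64_t ncol,
                               std::uint64_t nnz,
                               std::uint64_t bytes_per_value,
                               std::uint64_t bytes_per_index) {
    assert(nnz <= nrow * ncol);  // cannot have more non-zeros than entries
    return nnz * (bytes_per_value + bytes_per_index)
         + (ncol + 1) * bytes_per_index;
}
```

With these helpers, dense_bytes(5000, 1000000, 4) gives 20 GB and dense_bytes(5000, 1000000, 8) gives 40 GB, matching the range above. Under an assumed 10% non-zero rate (5e8 non-zeros), csc_bytes with 8-byte values and 4-byte indices comes to roughly 6 GB.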

I'm wondering whether there is any appetite amongst us for making a consistent BioC API to handle these matrices, sort of like what BiocParallel does for multicore and snow. It goes without saying that the different matrix representations should have consistent functions at the R level (rbind/cbind, etc.) but it would also be nice to have an integrated C/C++ API (accessible via LinkingTo). There are many non-trivial things that can be done with this type of data, and it is often faster and more memory efficient to do these complex operations in compiled code.

I was thinking of something where you could supply any supported matrix representation to a registered function via .Call; the C++ constructor would recognise the type of matrix during class instantiation; and operations (row/column/random read access, also possibly various ways of writing a matrix) would be overloaded to behave as required for the class. Only the implementation of the API would need to care about the nitty gritty of each representation, and we would all be free to write code that actually does the interesting analytical stuff.
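To make the idea concrete, here is a minimal C++ sketch of what such an interface might look like. Everything in it is an assumption for illustration: the names MatrixView, DenseMatrix, CscMatrix and col_sums are hypothetical, it does not use the R headers, and the step that would inspect the R object passed through .Call and pick the right backend is left out (the caller constructs the concrete class directly).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical common interface: analysis code only sees this class.
class MatrixView {
public:
    virtual ~MatrixView() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Copy column c into out, which is resized to nrow().
    virtual void get_col(std::size_t c, std::vector<double>& out) const = 0;
};

// Dense backend, column-major storage (as in an ordinary R matrix).
class DenseMatrix : public MatrixView {
    std::size_t nr_, nc_;
    std::vector<double> x_;
public:
    DenseMatrix(std::size_t nr, std::size_t nc, std::vector<double> x)
        : nr_(nr), nc_(nc), x_(std::move(x)) { assert(x_.size() == nr_ * nc_); }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, std::vector<double>& out) const override {
        assert(c < nc_);
        out.resize(nr_);
        for (std::size_t r = 0; r < nr_; ++r) out[r] = x_[c * nr_ + r];
    }
};

// Sparse backend, compressed sparse column: p = column pointers (ncol + 1),
// i = row index of each non-zero, x = value of each non-zero.
class CscMatrix : public MatrixView {
    std::size_t nr_, nc_;
    std::vector<std::size_t> p_, i_;
    std::vector<double> x_;
public:
    CscMatrix(std::size_t nr, std::size_t nc, std::vector<std::size_t> p,
              std::vector<std::size_t> i, std::vector<double> x)
        : nr_(nr), nc_(nc), p_(std::move(p)), i_(std::move(i)), x_(std::move(x)) {
        assert(p_.size() == nc_ + 1 && i_.size() == x_.size());
        assert(p_.back() == x_.size());
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, std::vector<double>& out) const override {
        assert(c < nc_);
        out.assign(nr_, 0.0);  // entries that are not stored are zero
        for (std::size_t k = p_[c]; k < p_[c + 1]; ++k) out[i_[k]] = x_[k];
    }
};

// "Interesting analytical stuff", written once against the interface.
std::vector<double> col_sums(const MatrixView& m) {
    std::vector<double> sums(m.ncol(), 0.0), buf;
    for (std::size_t c = 0; c < m.ncol(); ++c) {
        m.get_col(c, buf);
        for (double v : buf) sums[c] += v;
    }
    return sums;
}
```

The point of the sketch is that col_sums never needs to know which backend it was given; a disk-backed class would only have to implement the same three methods. Virtual dispatch per column is one possible design; templates would be another.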

Anyway, just throwing some thoughts out there. Any comments appreciated.

Cheers,

Aaron
#
yes

the DelayedArray framework that handles HDF5Array, etc. seems like the
right choice?

--t
On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <alun at wehi.edu.au> wrote:

            

  
  
#
It's a good place to start, though it would be very handy to have a C(++) API that can be linked against. I'm not sure how much work that would entail but it would give downstream developers a lot more options. Sort of like how we can link to Rhtslib, which speeds up a lot of BAM file processing, instead of just relying on Rsamtools.


-Aaron
#
What is the data type for an expression value?  Is it assumed that double
precision will be needed?
On Fri, Feb 24, 2017 at 4:50 PM, Aaron Lun <alun at wehi.edu.au> wrote:

            

  
  
#
Yes, I think double-precision would be necessary for general use. Only the raw count data would be integer, and even then that's not guaranteed (e.g., if people are using kallisto or salmon for quantification).


-Aaron
#
On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <alun at wehi.edu.au> wrote:

            
This seems (at least moderately) related to the alternative atomic-vector
representation work I have been doing with R-core. See
https://www.r-project.org/dsc/2016/slides/customvectors.html
and  ALTREP.md in https://svn.r-project.org/R/branches/ALTREP/ (not
necessarily fully up-to-date, you can also look at src/main/altrep.c for
implementation).

I'd also say there may be a pretty strong impedance mismatch if you want
something customizable at both the R and C levels. That's just a suspicion
at this point, though. I'm writing very quickly and will send out a more
reasoned response later, because I don't have the time to do so right this
second.

Best,
~G

  
    
#
It's not there yet, but I plan to expose a C++ API for my disk-backed matrix objects in the next version of my "matter" package.

It's getting easier to interchange matter/HDF5Array/bigmemory/etc. objects at the R level, especially if using a frontend like DelayedArray on top of them, but it would be nice to have a common C++ API that I could hook into as well (a la Rcpp), so new C/C++ code could be re-used across various backends more easily.

Kylie

~~~
Kylie Ariel Bemis
Future Faculty Fellow
College of Computer and Information Science
Northeastern University
https://kuwisdelu.github.io
On Feb 24, 2017, at 4:50 PM, Aaron Lun <alun at wehi.edu.au> wrote: