
[Bioc-devel] any interest in a BiocMatrix core package?

Hi all,

To continue a variant of this conversation: with the latest BioC release, we now have quite a few packages implementing various matrix-related S4 generic functions, many of them relying on matrixStats as a template.

I was wondering if there is any interest or intention to create a common MatrixGenerics/ArrayGenerics package on which we can depend to import the relevant S4 generic functions. Although BiocGenerics has a few, like rowSums() and colMeans(), etc., there are many more that are implemented across DelayedArray, DelayedMatrixStats, my own package matter, etc., including apply(), rowSds(), colVars(), and so forth.

It would be nice to have a single package with minimal additional dependencies (à la BiocGenerics) where we could import the various S4 generics and avoid unwanted namespace collisions.
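For concreteness, a minimal sketch of what such a package might export -- the name 'MatrixGenerics' is just the proposal above, and the matrix fallback methods shown here are my own illustrative assumptions, not an existing API:

    ## Hypothetical shared-generics package: define each S4 generic once,
    ## with a default matrix method falling back to matrixStats.
    setGeneric("rowSds", function(x, ...) standardGeneric("rowSds"))
    setGeneric("colVars", function(x, ...) standardGeneric("colVars"))

    setMethod("rowSds", "matrix",
              function(x, ...) matrixStats::rowSds(x, ...))
    setMethod("colVars", "matrix",
              function(x, ...) matrixStats::colVars(x, ...))

Backend packages (DelayedArray, matter, ...) would then only need to import these generics and register their own methods.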

Have there been any thoughts on this?

Many thanks,
Kylie

~~~
Kylie Ariel Bemis
Future Faculty Fellow
College of Computer and Information Science
Northeastern University
https://kuwisdelu.github.io
On Mar 3, 2017, at 11:27 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:

On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:

On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
Some comments on Aaron's stuff:

One possibility for doing things like this is if your code can be done in C++ using a subset of rows or columns. That can sometimes give the necessary speed-up. What I mean is this:

Say you can safely process 1000 cells (not matrix cells, but biological cells, a.k.a. columns) at a time in RAM.

iterate in R:
  get chunk i containing 1000 cells from the backend data storage
  do something on this submatrix, where everything is an ordinary in-memory matrix and you can just use C++
  write results out to whatever backend you're using

Then, with a million cells, you iterate over 1000 chunks in R, and you never need to "touch" the full dataset, which can be stored on an arbitrary backend.
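A minimal R sketch of that loop, assuming hypothetical backend accessors read_chunk() and write_chunk() and a (say, Rcpp-backed) process_chunk() kernel -- none of these are real APIs; the shape of the iteration is the point:

    ## Chunked iteration; read_chunk(), process_chunk(), and write_chunk()
    ## are hypothetical stand-ins for a real backend API.
    n_cells   <- 1e6      # columns (biological cells) in the backend
    per_chunk <- 1000     # columns that safely fit in RAM at once
    for (i in seq_len(ceiling(n_cells / per_chunk))) {
      cols <- ((i - 1) * per_chunk + 1):min(i * per_chunk, n_cells)
      m <- read_chunk(backend, cols)    # ordinary in-memory matrix
      res <- process_chunk(m)           # e.g. a C++ kernel via Rcpp
      write_chunk(backend, cols, res)   # persist results chunk by chunk
    }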

you "touch" it, but you never ingest the whole thing at any time, is that what you mean?

Yes, you load the chunk into RAM and then just deal with it.

Think of doing 10^10 linear models. If this were 10^6, I would just use lmFit. But 10^10 doesn't fit into memory. So I load 10^7 rows into memory, run lmFit, store the results, and repeat. This is bound to be much more efficient than loading a single row into memory and calling lm() 10^10 times, because lmFit is written to fit many linear models at the same time.
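A hedged sketch of that recipe (limma::lmFit is real; read_rows(), 'backend', and the 'group' covariate are assumptions standing in for a real accessor and design):

    library(limma)

    ## 10^10 models fit in row chunks of 10^7; read_rows() is a
    ## hypothetical accessor for whatever backend holds the full matrix.
    n_models  <- 1e10
    per_chunk <- 1e7
    design <- model.matrix(~ group)   # 'group' is an assumed covariate
    for (i in seq_len(n_models / per_chunk)) {
      rows  <- ((i - 1) * per_chunk + 1):(i * per_chunk)
      chunk <- read_rows(backend, rows)  # 10^7 x n_samples matrix
      fit   <- lmFit(chunk, design)      # many models in one vectorized call
      saveRDS(fit$coefficients, sprintf("coef_chunk_%04d.rds", i))
    }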

I am suggesting that this is a potential general strategy.


And this approach could even (potentially) be run with different chunks on different nodes.

that seems to me to be an important if not essential desideratum.

what then is the role of C++?  extracting a chunk?  preexisting utilities?

When I say C++, I just mean: write an efficient implementation that works on a chunk, like lmFit. It is true that anything that works on a chunk will also work on a single row/column (like lmFit), but there are possibilities for optimization when you work at the chunk level.
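An illustrative example (mine, not from the thread) of that chunk-level optimization: one vectorized matrixStats call over a chunk versus one R call per row:

    library(matrixStats)

    ## Same result, very different cost: one C pass over the chunk
    ## versus 10^4 separate calls to var().
    chunk <- matrix(rnorm(1e6), nrow = 1e4)
    system.time(v_chunk <- rowVars(chunk))        # vectorized over the chunk
    system.time(v_rows  <- apply(chunk, 1, var))  # one call per row
    stopifnot(all.equal(v_chunk, v_rows))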

Obviously not all computations can be done chunkwise.  But for those that can, this is a strategy which is independent of the data backend.
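For what it's worth, DelayedArray's block-processing API expresses exactly this backend independence; a minimal sketch using blockApply() and colAutoGrid() from recent versions of that package:

    library(DelayedArray)

    ## Backend-independent chunkwise column sums: 'x' could just as well
    ## be an HDF5-backed array; only the block machinery sees the backend.
    x <- DelayedArray(matrix(rnorm(1e6), nrow = 1000))
    grid <- colAutoGrid(x, ncol = 100)           # blocks of 100 columns each
    block_sums <- blockApply(x, colSums, grid = grid)
    col_totals <- unlist(block_sums)             # one total per column
    stopifnot(length(col_totals) == ncol(x))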

I wonder whether this "obviously not" needs to be rethought. Algorithms that are implemented to work with data holistically may need to be re-expressed so that they can succeed with chunkwise access. Is this a new mindset needed for holist developers, or can the effective data decompositions occur autonomously?

Well, I would say it is obvious that not all computations can be done chunkwise. But of course, in the limit of extremely large data, algorithms which need to cycle over everything no longer scale. So in that case, all practical computations can be done chunkwise, out of necessity. For single-cell right now, where it is just millions of cells on the horizon, people will think that they can get "standard" holistic approaches to work (and that is probably true). If they had a billion cells, they probably wouldn't think about that.

Kasper

If you need direct access to the data in the backend from C++, then what is fast and how to do it will be extremely backend-dependent. That doesn't mean we shouldn't do it, though.

Best,
Kasper
On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
Kylie, thanks for reminding us of matter -- I saw you speak about this at the first Bioconductor Boston Meetup, but it went like lightning. For developers contemplating an approach to representing high-volume rectangular data, where there is no dominant legacy format, it is natural to wonder whether HDF5 would be adequate, and, further, to wonder how to demonstrate that it is or is not dominated by some other approach for a given set of tasks. Should we devise a set of bioinformatic benchmark problems to foster comparison and informed decision-making? @becker.gabe: is ALTREP far enough along that one could contemplate benchmarking with it?

On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie <k.bemis at northeastern.edu> wrote: