
[Bioc-devel] Reading and storing single cell ATAC data

1 message · Andrew McDavid

Hi Caleb,
Hopefully Herve will chime in regarding SummarizedExperiment, but yes, I
think you can and should inherit from that. The `assays` slot must be an
object of type `Assays`, but that does appear to include a sparse Matrix.
See the comments at the top of Assays-class.R in the tarball for
SummarizedExperiment.  For example:

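library(SummarizedExperiment)
library(Matrix)
library(GenomicRanges)
# 1e6 rows (genomic ranges) by 1e4 columns
Nrow <- 1e6
Ncol <- 1e4
# All-zero sparse matrix, so only the (absent) non-zero entries are stored
assay <- Matrix::Matrix(0, nrow=Nrow, ncol=Ncol, sparse=TRUE)
# One 10-bp range per row, all on chr2
gr <- GRanges(Rle("chr2", Nrow),
              IRanges(seq_len(Nrow), width=10))
se <- SummarizedExperiment(assays=assay, rowRanges=gr)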
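To check that the assay really stays sparse after it goes into the container, something like the following could be used. This is only a sketch on toy dimensions; the assay name "counts" and the few non-zero entries are made up for illustration.

```r
library(SummarizedExperiment)
library(Matrix)
library(GenomicRanges)

# Toy dimensions so the check runs instantly (illustrative values only)
n_row <- 1000
n_col <- 50

# Build a sparse matrix from (row, column, value) triplets
counts <- sparseMatrix(i = c(1, 10, 500), j = c(1, 2, 50), x = c(1, 2, 1),
                       dims = c(n_row, n_col))

gr <- GRanges(Rle("chr2", n_row), IRanges(seq_len(n_row), width = 10))
se <- SummarizedExperiment(assays = list(counts = counts), rowRanges = gr)

# TRUE if the assay is still a sparse Matrix rather than a dense base matrix
is_sparse <- is(assay(se, "counts"), "sparseMatrix")
# Number of stored non-zero entries
n_nonzero <- nnzero(assay(se, "counts"))
```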

As far as out-of-core storage of sparse matrices goes, I do not know of any
good (portable) solutions.  If it makes more sense to chunk the matrix along
some dimension, you could always serialize the chunked (sparse) Matrix
objects to disk, e.g. with saveRDS(). In my experience, the decision to
adopt sparse vs out-of-core dense arrays has often required empirical
testing to determine what is fastest/most scalable, since you lose caching
benefits from sequential memory access once you go sparse.  I know there has
been talk of extending SummarizedExperiment to easily permit the Assays to
be HDF5-based.
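A rough sketch of the chunk-and-serialize idea, assuming that saveRDS()/readRDS() is an acceptable R stand-in for pickling and that chunking by column is what makes sense for the data. The toy matrix, chunk size, and file names are all made up for illustration.

```r
library(Matrix)

set.seed(1)
# Toy sparse matrix standing in for the real data (illustrative only)
m <- rsparsematrix(nrow = 200, ncol = 30, density = 0.05)

# Split the column indices into consecutive blocks of chunk_size columns
chunk_size <- 10
col_chunks <- split(seq_len(ncol(m)), ceiling(seq_len(ncol(m)) / chunk_size))

# Write each column block to its own .rds file in a temporary directory
out_dir <- tempfile("sparse_chunks_")
dir.create(out_dir)
chunk_files <- vapply(names(col_chunks), function(k) {
  path <- file.path(out_dir, sprintf("chunk_%s.rds", k))
  saveRDS(m[, col_chunks[[k]], drop = FALSE], path)
  path
}, character(1))

# Load only the chunk that is needed; it is still a sparse Matrix
second <- readRDS(chunk_files[["2"]])

# Or reassemble the full matrix from all chunks, in order
restored <- do.call(cbind, unname(lapply(chunk_files, readRDS)))
```

Only one chunk has to be in memory at a time, at the cost of re-reading files whenever an operation spans chunks.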

Is disk space really going to be a limiting factor? If so, then you will
probably be IO-bound, so you will need to distribute the data across
computing nodes for your analysis to scale anyway, which suggests some
sort of map-reduce formalism, something that to my knowledge no one has
considered yet in Bioconductor.  But unless you are generating > 1 TB of
semi-processed data, maybe you don't need to go there?

-Andrew