[Bioc-devel] Base class for interaction data - expressions of interest
While I'm on this point, there's another, more subtle issue with using sparseMatrix(). Specifically, there's a distinction between zeros and missing values when considering a ContactMatrix. For example, in Hi-C data, a zero in the matrix means there aren't any read pairs mapping between the corresponding bins. A missing value means that the count for the bin pair is unknown, e.g., because that particular pairwise interaction was missing from the InteractionSet during conversion.

This difference may be important in calculating correct statistics; one can imagine situations where assuming all missing values are zero would not be appropriate. In general, I would expect that missing values would take up most of the matrix entries after conversion from an InteractionSet.

sparseMatrix() doesn't seem to support setting "NA" as the default value to collapse a sparse matrix; it's fixed at zero, which makes mathematical sense but isn't quite right for our purposes.

Now, this might not be so bad for count data, depending on how you counted the reads into bin pairs; converting all NA's to zeros might be okay in such circumstances, if the occurrence of those NA's in the first place was due to the lack of reads. However, if you fill the contact matrix with other metrics (e.g., log-FCs, average log-CPMs), assuming that all missing values are zero would probably be incorrect.

Anyway, food for thought.

- Aaron
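To make the zero-versus-missing distinction concrete, here is a minimal sketch in R. The bin indices and counts are made-up toy values, not output from any package in this thread; the only library call assumed is sparseMatrix() from the Matrix package.

```r
library(Matrix)

# Hypothetical toy data: three observed bin pairs among 4 bins.
# The second pair was observed but has a genuine count of zero.
anchor1 <- c(1, 2, 4)
anchor2 <- c(1, 3, 2)
counts  <- c(5, 0, 2)

# In a sparse matrix, every entry that is not supplied is implicitly zero.
sp <- sparseMatrix(i = anchor1, j = anchor2, x = counts, dims = c(4, 4))

# A dense matrix can use NA for "unknown" and 0 for "observed, no reads".
dense <- matrix(NA_real_, nrow = 4, ncol = 4)
dense[cbind(anchor1, anchor2)] <- counts

sp[2, 3]     # observed zero
sp[3, 4]     # never observed, but also reads as zero
dense[2, 3]  # observed zero
dense[3, 4]  # NA: the count is unknown

# Summary statistics differ depending on how missing entries are treated.
mean_observed  <- mean(dense, na.rm = TRUE)  # averages the 3 observed pairs
mean_zero_fill <- mean(as.matrix(sp))        # averages all 16 entries
```

With the sparse representation the two kinds of entry cannot be told apart after construction, so any statistic that should only use observed pairs would need the observed indices to be tracked separately.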
On 16/11/15 10:31, Aaron Lun wrote:
Thanks for the comment Nadhir. I had considered the use of a sparse matrix class. The reason I didn't implement it originally is because truly sparse interaction data would be better represented by just working with the pairwise format in the InteractionSet. You need the row/column indices to pass to the sparseMatrix constructor anyway; a memory-efficient algorithm to do, for example, compartment identification could just use those indices directly.

Most existing algorithms for doing this (e.g., k-means/hierarchical clustering) won't operate natively on a sparseMatrix, and I suspect they'll just run as.matrix() and convert it to a full matrix. Obviously, this would defeat the purpose of using a sparse matrix. So, if you have to rewrite the algorithms anyway, you might as well rewrite them in a manner that avoids needing sparseMatrix() as a middleman.

Nonetheless, it's a good point about memory usage. I'll have a think about it; sparseMatrix() would help a bit, but as coverage increases for these experiments, the matrix will probably become fairly dense (even if it's just counts of 1 for some bin pairs). Even now, for compartment detection, the bins involved are large enough that sparseness usually isn't observed. Perhaps big.matrix() might be a better choice.

Cheers,
Aaron

On 16/11/15 09:58, DJEKIDEL MOHAMED NADHIR wrote:
Hi Aaron,
Sounds like a great initiative.
I just have some comments about the ContactMatrix-Class.
I think with increasing Hi-C resolution the usage of the matrix class
will consume a lot of memory.
Maybe using sparseMatrix from the Matrix package would have a smaller
footprint.
It can also be manipulated in C++ using RcppEigen, if for example you
plan some functionalities such as AB domains or insulation scores,
etc.
Regards,
- Nadhir
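As a rough illustration of the memory trade-off under discussion, the sketch below compares the size of a dense matrix with its sparse counterpart at a few fill levels. The matrix size and densities are arbitrary assumptions chosen for the example; Matrix() with sparse = TRUE is used for the conversion.

```r
library(Matrix)

set.seed(1)
n <- 500  # hypothetical number of bins

# Fraction of non-zero entries to try; the values are arbitrary.
densities <- c(low = 0.01, mid = 0.5, high = 0.9)

sizes <- sapply(densities, function(density) {
  m <- matrix(0, nrow = n, ncol = n)
  filled <- sample(n * n, round(density * n * n))
  m[filled] <- 1  # counts of 1 are enough to make an entry non-zero
  c(dense  = as.numeric(object.size(m)),
    sparse = as.numeric(object.size(Matrix(m, sparse = TRUE))))
})

# Rows are the representation, columns are the fill level (bytes).
sizes
```

In the compressed column format, each stored entry needs a value plus a row index, so the sparse form only pays off while most entries are zero. Once the matrix is mostly filled, the dense form is the smaller of the two.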
On Mon, Nov 16, 2015 at 5:33 PM, Aaron Lun <alun at wehi.edu.au> wrote:
Hello all,
I thought I might give an update on the state of affairs for the
InteractionSet package. Currently, there are three classes:
- the GInteractions class, inheriting from Vector and intended to
represent pairwise interactions between genomic regions (based on
suggestions from Malcolm Perry and Liz Ing-Simmons).
- the InteractionSet class, inheriting from SummarizedExperiment0
and containing a GInteractions object; intended to store
experimental data about pairwise interactions (one interaction per
row).
- the ContactMatrix class, inheriting from Annotated and storing
data in matrix form (where rows/columns represent genomic regions).
Getters, setters, conversion methods between classes, distance
calculation methods and overlap methods have been implemented. Man
pages and "testthat" scripts have also been written. Still missing a
vignette, though it should be easy enough to write one.
All in all, I think it's a solid first draft. Any comments would be
appreciated.
Cheers,
Aaron
On 08/11/15 19:31, Aaron Lun wrote:
Okay, some meat and bones are on GitHub now:
https://github.com/LTLA/InteractionSet
The idea is to represent genomic interactions as pairs of genomic
regions, using indices to point to a common GRanges object (a la Hits,
though I haven't used that explicitly due to the presence of additional
constraints on the indices). Data for each interaction is stored using a
SummarizedExperiment framework (one row per interaction).
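The index-based layout described above can be sketched in base R. This is not the package's implementation: a plain data frame stands in for the GRanges object, the validity check is a guess at the simplest possible constraint, and all names and values are hypothetical.

```r
# Common set of regions; a data frame stands in for a GRanges object here.
regions <- data.frame(
  chr   = c("chrA", "chrA", "chrB"),
  start = c(1L, 101L, 1L),
  end   = c(100L, 200L, 100L),
  stringsAsFactors = FALSE
)

# Each interaction is a pair of integer indices into 'regions'.
anchor1 <- c(2L, 3L, 3L)
anchor2 <- c(1L, 1L, 3L)

# Assumed minimal constraint: indices are non-missing and in range.
valid <- function(idx, n) all(!is.na(idx) & idx >= 1L & idx <= n)
stopifnot(valid(anchor1, nrow(regions)),
          valid(anchor2, nrow(regions)),
          length(anchor1) == length(anchor2))

# Per-interaction data: one row per interaction, one column per sample.
counts <- matrix(c(10, 3, 7, 12, 0, 5), ncol = 2,
                 dimnames = list(NULL, c("sample1", "sample2")))

# The regions for each anchor are recovered by indexing, not by copying.
first  <- regions[anchor1, ]
second <- regions[anchor2, ]
same_chr <- first$chr == second$chr
```

Storing indices rather than two full sets of coordinates means that the region set is kept once and that both anchors are guaranteed to refer to the same coordinates.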
With regards to the methods, most of the low-hanging fruit has been
implemented, courtesy of inheriting from SummarizedExperiment0. I'll add
proper unit tests over the coming week. It currently passes through R
CMD check okay, except for a warning about ":::" in the cbind/rbind
definitions (callNextMethod() didn't seem to work inside those methods,
and I didn't want to rewrite the SE0 binding methods).
Any thoughts appreciated.
- Aaron
On 07/11/15 19:33, Morgan, Martin wrote:
Just to say that this is a great idea. If this starts as a github
package (or in svn, we can create a location for you if you'd like) I
and others would I am sure be happy to try to provide any guidance /
insight. The main design principles are probably to reuse as much as
possible from existing classes, especially the S4Vectors / GRanges
world, and to integrate metadata as appropriate (like
SummarizedExperiment, for instance).
Martin
________________________________________
From: Bioc-devel [bioc-devel-bounces at r-project.org] on behalf of Aaron Lun [alun at wehi.edu.au]
Sent: Thursday, November 05, 2015 12:27 PM
To: bioc-devel at r-project.org
Subject: Re: [Bioc-devel] Base class for interaction data - expressions of interest
There's a growing number of Bioconductor packages dealing with
interaction data; diffHic, GenomicInteractions, HiTC, to name a few (and
probably more in the future). Each of these packages defines its own
class to store interaction data - DIList for diffHic,
GenomicInteractions for GenomicInteractions, and HTClist for HiTC.
These classes seem to share a lot of features, which suggests that they
can be (easily?) replaced with a common class. This would have two
advantages - one, developers of new and existing packages don't have to
continually write and maintain new classes; and two, it provides users
with a consistent user experience across the relevant packages.
My question is, does anybody have anything in the pipeline with respect
to a base package for an interaction class? If not, I'm planning to put
something together for the next BioC release. To this end, I'd welcome
any ideas/input/code; the aim is to make a drop-in replacement (insofar
as that's possible) for the existing classes in each package.
Cheers,
Aaron
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel