[Bioc-devel] SummarizedExperiments not equal after serialisation
Interesting detective work. This is nasty. Best, Kasper
On Thu, May 16, 2019 at 2:19 AM Pages, Herve <hpages at fredhutch.org> wrote:
Let's try to get to the bottom of this. But let's leave
SummarizedExperiment objects out of the picture for now and focus on what
happens with a very simple reference object.
When you create 2 instances of a reference class with the same content:
A <- setRefClass("A", fields=c(stuff="ANY"))
a0 <- A(stuff=letters)
a1 <- A(stuff=letters)
the .xData slot (which is an environment) is "different" between the 2
instances in the sense that the 2 environments live at different addresses
in memory:
a0@.xData # <environment: 0x3812150>
a1@.xData # <environment: 0x381c7e0>
identical(a0@.xData, a1@.xData) # FALSE
However their **content** is the same:
all.equal(a0@.xData, a1@.xData) # TRUE
and the 2 objects are considered equal:
all.equal(a0, a1) # TRUE
When the **content** of the 2 objects differs, all.equal() sees 2
environments with different contents:
b <- A(stuff=LETTERS)
isTRUE(all.equal(a0@.xData, b@.xData)) # FALSE
and no longer considers the 2 objects equal:
all.equal(a0, b) # "Component “stuff”: 26 string mismatches"
So far so good.
When an object goes thru a serialization/deserialization cycle:
saveRDS(a0, "a0.rds")
a2 <- readRDS("a0.rds")
the .xData slot of the restored object also lives at a different address:
a2@.xData # <environment: 0x3944668>
identical(a0@.xData, a2@.xData) # FALSE
(This is what serialization/deserialization does on environments so is
expected.)
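The same effect can be seen on a plain environment, without any reference class involved. This is only a minimal sketch: it uses an in-memory serialize()/unserialize() round trip instead of saveRDS()/readRDS(), and the variable names are made up for illustration.

```r
# Round-trip a plain environment through serialization, in memory.
e <- new.env()
assign("stuff", letters, envir = e)
e2 <- unserialize(serialize(e, NULL))

# identical() compares the environments themselves (i.e. their addresses),
# while all.equal() compares what they contain.
same_address <- identical(e, e2)          # FALSE
same_content <- isTRUE(all.equal(e, e2))  # TRUE
```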
So in that aspect 'a2' is no different from 'a1'. However for 'a2' now we
have:
all.equal(a0, a2) # "Class definitions are not identical"
So why is 'all.equal(a0, a2)' doing this? This cannot be explained only by
the fact that 'a0@.xData' and 'a2@.xData' are non-identical environments.
Looking at the source code for all.equal.envRefClass(), we see something
like this (slightly simplified here):
...
if (!identical(target$getClass(), current$getClass())) {
...
return(sprintf("Class definitions are not identical%s", ...))
}
...
So let's try this:
identical(a0$getClass(), a1$getClass()) # TRUE
identical(a0$getClass(), a2$getClass()) # FALSE
Note that 'x$getClass()' is not the same as 'class(x)'. The latter returns
the **class name** while the former returns the **class definition** (which
is represented by a complicated object of class refClassRepresentation).
'a0' and 'a2' have identical class names:
class(a0)
# [1] "A"
# attr(,"package")
# [1] ".GlobalEnv"
class(a2)
# [1] "A"
# attr(,"package")
# [1] ".GlobalEnv"
identical(class(a0), class(a2))
# [1] TRUE
So now the question is: even though 'a0' and 'a2' have identical **class
names**, how come they do NOT have identical **class definitions**?
The big surprise (at least to me) is that reference objects, unlike
traditional S4 objects, CARRY THEIR OWN COPY OF THE CLASS DEFINITION! This
copy is stored in the '.refClassDef' variable stored in the .xData
environment of the object:
ls(a0@.xData, all=TRUE)
# [1] ".refClassDef" ".self" "getClass" "stuff"
ls(a2@.xData, all=TRUE)
# [1] ".refClassDef" ".self" "getClass" "stuff"
This private copy of the class definition is actually what 'x$getClass()'
returns:
identical(a0$getClass(), get(".refClassDef", envir=a0@.xData)) # TRUE
identical(a2$getClass(), get(".refClassDef", envir=a2@.xData)) # TRUE
The problem is that for 'a2' this copy of the class definition is not
identical to the **original** class definition:
identical(getClass("A"), a0$getClass()) # TRUE
identical(getClass("A"), a2$getClass()) # FALSE
And this in turn is because the complicated object that represents the
class definition also contains environments (e.g.
'getClass("A")@refMethods' is an environment) so going thru a
serialization/deserialization cycle is not a **strict no-op** on it (from
an identical() perspective).
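This can be checked on the class definition alone, leaving the objects aside. A minimal sketch, assuming an in-memory serialize()/unserialize() round trip behaves like saveRDS()/readRDS() here; 'def' and 'def2' are names made up for illustration.

```r
library(methods)

A <- setRefClass("A", fields = c(stuff = "ANY"))
def <- getClass("A")

# The class definition holds at least one environment.
has_env <- is.environment(def@refMethods)

# Round-trip the class definition itself, in memory.
def2 <- unserialize(serialize(def, NULL))

# The restored 'refMethods' environment is a fresh one, so identical()
# sees a difference, both on the slot and on the whole definition.
env_same <- identical(def@refMethods, def2@refMethods)  # FALSE
def_same <- identical(def, def2)                        # FALSE
```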
Replacing the copy of the class definition stored in 'a2' with the
original class definition makes the problem go away:
rm(".refClassDef", envir=a2@.xData)
assign(".refClassDef", getClass("A"), envir=a2@.xData)
all.equal(a0, a2) # TRUE
Bottom line: the test 'identical(target$getClass(), current$getClass())'
performed by all.equal.envRefClass() seems too stringent. It should
probably be replaced with something a little bit more tolerant i.e.
something that considers environments that live at different addresses but
have the same content to be equal. Looks like
'isTRUE(all.equal(target$getClass(), current$getClass()))' could do the job.
Finally note that, in addition to the above test, all.equal.envRefClass()
also does this test (slightly simplified here):
if (!isTRUE(all.equal(class(target), class(current))))
return(sprintf("Classes differ: %s", ...))
Maybe that's all it needs to do to compare the classes of the 2
objects? (Ironically this test uses all.equal() when it could use
identical().)
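To make the idea concrete, here is a rough sketch of a comparison that only looks at the class name and at the field contents, and ignores the private copy of the class definition. This is NOT what all.equal.envRefClass() does, and not a proposed patch: 'ref_objects_match', the class 'Demo' and the objects 'd0', 'd1', 'd2' are hypothetical names used for illustration only, and the sketch assumes the fields are plain variables in the .xData environment.

```r
library(methods)

# Hypothetical helper (not part of R): compare 2 reference objects by
# class name and field contents only.
ref_objects_match <- function(target, current) {
  # Compare class names with identical(), not class definitions.
  if (!identical(class(target), class(current)))
    return(FALSE)
  # Field names are taken from the object's own copy of the class definition.
  field_names <- names(target$getClass()@fieldClasses)
  for (f in field_names) {
    same <- all.equal(get(f, envir = target@.xData),
                      get(f, envir = current@.xData))
    if (!isTRUE(same))
      return(FALSE)
  }
  TRUE
}

Demo <- setRefClass("Demo", fields = c(stuff = "ANY"))
d0 <- Demo(stuff = letters)
d1 <- unserialize(serialize(d0, NULL))  # in-memory round trip
d2 <- Demo(stuff = LETTERS)

match_roundtrip <- ref_objects_match(d0, d1)  # TRUE
match_other <- ref_objects_match(d0, d2)      # FALSE
```

Note that this ignores fields that are themselves reference objects (all.equal() on those would hit the same issue), so it is only a starting point.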
Michael?
H.
On 5/11/19 15:09, Aaron Lun wrote:
I would say it's much worse than mismatching class definitions.
https://github.com/Bioconductor/SummarizedExperiment/issues/16
-A
On 5/11/19 5:07 AM, Martin Morgan wrote:
I think it has to do with the use of reference classes in the assay slot,
which have different environments
se = SummarizedExperiment()
saveRDS(se, fl <- tempfile())
se1 = readRDS(fl)
and then
all.equal(se@assays, se1@assays)
[1] "Class definitions are not identical"
all.equal(se@assays@.xData, se1@assays@.xData)
[1] "Component \".self\": Class definitions are not identical"
se@assays@.xData
<environment: 0x7fb1de1ede90>
se1@assays@.xData
<environment: 0x7fb1fc2bca78>
Martin
On 5/11/19, 6:38 AM, "Bioc-devel on behalf of Laurent Gatto" <bioc-devel-bounces at r-project.org on behalf of laurent.gatto at uclouvain.be> wrote:
I would appreciate some background about the following:
> suppressPackageStartupMessages(library("SummarizedExperiment"))
> set.seed(1L)
> m <- matrix(rnorm(16), ncol = 4, dimnames = list(letters[1:4],
LETTERS[1:4]))
> rowdata <- DataFrame(X = 1:4, row.names = letters[1:4])
> se1 <- SummarizedExperiment(m, rowData = rowdata)
> se2 <- SummarizedExperiment(m, rowData = rowdata)
> all.equal(se1, se2)
[1] TRUE
But after serialising and reading se2, the two instances aren't
equal any more:
> saveRDS(se2, file = "se2.rds")
> rm(se2)
> se2 <- readRDS("se2.rds")
> all.equal(se1, se2)
[1] "Attributes: < Component “assays”: Class definitions are not identical >"
Session information provided below.
Thank you in advance,
Laurent
R version 3.6.0 RC (2019-04-21 r76417)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils
datasets
[8] methods base
other attached packages:
[1] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
[3] BiocParallel_1.18.0 matrixStats_0.54.0
[5] Biobase_2.44.0 GenomicRanges_1.36.0
[7] GenomeInfoDb_1.20.0 IRanges_2.18.0
[9] S4Vectors_0.22.0 BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] lattice_0.20-38 bitops_1.0-6 grid_3.6.0
[4] zlibbioc_1.30.0 XVector_0.24.0 Matrix_1.2-17
[7] tools_3.6.0 RCurl_1.95-4.12 compiler_3.6.0
[10] GenomeInfoDbData_1.2.1
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319