
[Bioc-devel] SummarizedExperiments not equal after serialisation

5 messages · Laurent Gatto, Martin Morgan, Aaron Lun +2 more

#
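I would appreciate some background about the following:

> suppressPackageStartupMessages(library("SummarizedExperiment"))
> set.seed(1L)
> m <- matrix(rnorm(16), ncol = 4, dimnames = list(letters[1:4], LETTERS[1:4]))
> rowdata <- DataFrame(X = 1:4, row.names = letters[1:4])
> se1 <- SummarizedExperiment(m, rowData = rowdata)
> se2 <- SummarizedExperiment(m, rowData = rowdata)
> all.equal(se1, se2)
[1] TRUE

But after serialising and reading se2, the two instances aren't equal any more:

> saveRDS(se2, file = "se2.rds")
> rm(se2)
> se2 <- readRDS("se2.rds")
> all.equal(se1, se2)
[1] "Attributes: < Component “assays”: Class definitions are not identical >"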

Session information provided below.

Thank you in advance,

Laurent


R version 3.6.0 RC (2019-04-21 r76417)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
 [3] BiocParallel_1.18.0         matrixStats_0.54.0         
 [5] Biobase_2.44.0              GenomicRanges_1.36.0       
 [7] GenomeInfoDb_1.20.0         IRanges_2.18.0             
 [9] S4Vectors_0.22.0            BiocGenerics_0.30.0        

loaded via a namespace (and not attached):
 [1] lattice_0.20-38        bitops_1.0-6           grid_3.6.0            
 [4] zlibbioc_1.30.0        XVector_0.24.0         Matrix_1.2-17         
 [7] tools_3.6.0            RCurl_1.95-4.12        compiler_3.6.0        
[10] GenomeInfoDbData_1.2.1
#
I think it has to do with the use of reference classes in the assay slot, which have different environments

  se = SummarizedExperiment()
  saveRDS(se, fl <- tempfile())
  se1 = readRDS(fl)

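and then

  > all.equal(se@assays, se1@assays)
  [1] "Class definitions are not identical"
  > all.equal(se@assays@.xData, se1@assays@.xData)
  [1] "Component \".self\": Class definitions are not identical"
  > se@assays@.xData
  <environment: 0x7fb1de1ede90>
  > se1@assays@.xData
  <environment: 0x7fb1fc2bca78>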

Martin

On 5/11/19, 6:38 AM, "Bioc-devel on behalf of Laurent Gatto" <bioc-devel-bounces at r-project.org on behalf of laurent.gatto at uclouvain.be> wrote:

    I would appreciate some background about the following:
    
    > suppressPackageStartupMessages(library("SummarizedExperiment"))
    > set.seed(1L)
    > m <- matrix(rnorm(16), ncol = 4, dimnames = list(letters[1:4], LETTERS[1:4]))
    > rowdata <- DataFrame(X = 1:4, row.names = letters[1:4])
    > se1 <- SummarizedExperiment(m, rowData = rowdata)
    > se2 <- SummarizedExperiment(m, rowData = rowdata)
    > all.equal(se1, se2)
    [1] TRUE
    
    But after serialising and reading se2, the two instances aren't equal any more:
    
    > saveRDS(se2, file = "se2.rds")
    > rm(se2)
    > se2 <- readRDS("se2.rds")
    > all.equal(se1, se2)
    [1] "Attributes: < Component “assays”: Class definitions are not identical >"
    
#
I would say it's much worse than mismatching class definitions.

https://github.com/Bioconductor/SummarizedExperiment/issues/16

-A
On 5/11/19 5:07 AM, Martin Morgan wrote:
#
Let's try to get to the bottom of this. But let's leave SummarizedExperiment objects out of the picture for now and focus on what happens with a very simple reference object.

When you create 2 instances of a reference class with the same content:

  A <- setRefClass("A", fields=c(stuff="ANY"))
  a0 <- A(stuff=letters)
  a1 <- A(stuff=letters)


the .xData slot (which is an environment) is "different" between the 2 instances in the sense that the 2 environments live at different addresses in memory:

  a0@.xData                        # <environment: 0x3812150>
  a1@.xData                        # <environment: 0x381c7e0>
  identical(a0@.xData, a1@.xData)  # FALSE


However their **content** is the same:

  all.equal(a0@.xData, a1@.xData)  # TRUE


and the 2 objects are considered equal:

  all.equal(a0, a1)                # TRUE


When the **content** of the 2 objects differs, all.equal() sees 2 environments with different contents:

  b <- A(stuff=LETTERS)
  isTRUE(all.equal(a0@.xData, b@.xData)) # FALSE

and no longer considers the 2 objects equal:

  all.equal(a0, b)                 # "Component “stuff”: 26 string mismatches"


So far so good.

When an object goes thru a serialization/deserialization cycle:

  saveRDS(a0, "a0.rds")
  a2 <- readRDS("a0.rds")


the .xData slot of the restored object also lives at a different address:

  a2@.xData                        # <environment: 0x3944668>
  identical(a0@.xData, a2@.xData)  # FALSE


(This is what serialization/deserialization does on environments so is expected.)

So in that aspect 'a2' is no different from 'a1'. However for 'a2' now we have:

  all.equal(a0, a2)                # "Class definitions are not identical"


So why is 'all.equal(a0, a2)' doing this? This cannot be explained only by the fact that 'a0@.xData' and 'a2@.xData' are non-identical environments.

Looking at the source code for all.equal.envRefClass(), we see something like this (slightly simplified here):

  ...
  if (!identical(target$getClass(), current$getClass())) {
      ...
      return(sprintf("Class definitions are not identical%s", ...))
  }
  ...


So let's try this:

  identical(a0$getClass(), a1$getClass())  # TRUE
  identical(a0$getClass(), a2$getClass())  # FALSE

Note that 'x$getClass()' is not the same as 'class(x)'. The latter returns the **class name** while the former returns the **class definition** (which is represented by a complicated object of class refClassRepresentation).

'a0' and 'a2' have identical class names:

  class(a0)
  # [1] "A"
  # attr(,"package")
  # [1] ".GlobalEnv"

  class(a2)
  # [1] "A"
  # attr(,"package")
  # [1] ".GlobalEnv"

  identical(class(a0), class(a2))
  # [1] TRUE


So now the question is: even though 'a0' and 'a2' have identical **class names**, how come they do NOT have identical **class definitions**?

The big surprise (at least to me) is that reference objects, unlike traditional S4 objects, CARRY THEIR OWN COPY OF THE CLASS DEFINITION! This copy is kept in the '.refClassDef' variable in the .xData environment of the object:

  ls(a0@.xData, all=TRUE)
  # [1] ".refClassDef" ".self"        "getClass"     "stuff"

  ls(a2@.xData, all=TRUE)
  # [1] ".refClassDef" ".self"        "getClass"     "stuff"

This private copy of the class definition is actually what 'x$getClass()' returns:

  identical(a0$getClass(), get(".refClassDef", envir=a0@.xData))  # TRUE
  identical(a2$getClass(), get(".refClassDef", envir=a2@.xData))  # TRUE


Problem is that for 'a2' this copy of the class definition is not identical to the **original class** definition:

  identical(getClass("A"), a0$getClass())  # TRUE
  identical(getClass("A"), a2$getClass())  # FALSE


And this in turn is because the complicated object that represents the class definition also contains environments (e.g. 'getClass("A")@refMethods' is an environment) so going thru a serialization/deserialization cycle is not a **strict no-op** on it (from an identical() perspective).
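
The effect on a plain environment can be seen in isolation. A minimal sketch in base R, assuming an in-memory serialize()/unserialize() round trip is an acceptable stand-in for saveRDS()/readRDS(); the variable names are only illustrative:

```r
# An environment holding a single variable.
e0 <- new.env()
assign("stuff", letters, envir = e0)

# In-memory round trip, used here in place of saveRDS()/readRDS().
e1 <- unserialize(serialize(e0, NULL))

# identical() asks whether both names refer to the very same environment;
# all.equal() compares the values stored in the two environments.
same_address <- identical(e0, e1)
same_content <- isTRUE(all.equal(e0, e1))
print(c(same_address = same_address, same_content = same_content))
```

Any object that embeds environments, such as a class definition, inherits this behaviour: the restored copy can hold equal content and still fail an identical() test.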

Replacing the copy of the class definition stored in 'a2' with the original class definition makes the problem go away:

  rm(".refClassDef", envir=a2@.xData)
  assign(".refClassDef", getClass("A"), envir=a2@.xData)
  all.equal(a0, a2)  # TRUE
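
The same replacement can be wrapped in a small function. This is only a sketch of a possible workaround: 'refresh_class_def' is a hypothetical name, not an existing function in R or Bioconductor, and it assumes the class is already defined in the current session. Because reference objects have reference semantics, it modifies its argument in place.

```r
library(methods)

A <- setRefClass("A", fields = c(stuff = "ANY"))

# Hypothetical helper: swap the private '.refClassDef' copy carried by a
# reference object for the class definition currently known to the session.
refresh_class_def <- function(x) {
    env <- x@.xData
    rm(".refClassDef", envir = env)
    assign(".refClassDef", getClass(class(x)), envir = env)
    invisible(x)
}

a0 <- A(stuff = letters)
a2 <- unserialize(serialize(a0, NULL))  # stand-in for saveRDS()/readRDS()

refresh_class_def(a2)
print(identical(a2$getClass(), getClass("A")))
```

This treats the symptom on one object at a time; it does not change how all.equal() compares reference objects.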


Bottom line: the test 'identical(target$getClass(), current$getClass())' performed by all.equal.envRefClass() seems too stringent. It should probably be replaced with something a little bit more tolerant i.e. something that considers environments that live at different addresses but have the same content to be equal. Looks like 'isTRUE(all.equal(target$getClass(), current$getClass()))' could do the job.
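
Until something like that is in place, a comparison that skips the class definition altogether can be written by hand. A minimal sketch, assuming that comparing the class name and the field values is enough for the use case; 'ref_fields_equal' is a hypothetical helper, not an existing function:

```r
library(methods)

A <- setRefClass("A", fields = c(stuff = "ANY"))

# Hypothetical helper: two reference objects are considered equal when their
# class names are identical and every declared field compares equal.
# The private '.refClassDef' copy is deliberately ignored.
ref_fields_equal <- function(target, current) {
    if (!identical(class(target), class(current)))
        return(FALSE)
    for (name in names(target$getClass()@fieldClasses)) {
        x <- get(name, envir = target@.xData)
        y <- get(name, envir = current@.xData)
        if (!isTRUE(all.equal(x, y)))
            return(FALSE)
    }
    TRUE
}

a0 <- A(stuff = letters)
a2 <- unserialize(serialize(a0, NULL))  # stand-in for saveRDS()/readRDS()
b  <- A(stuff = LETTERS)

res_roundtrip <- ref_fields_equal(a0, a2)
res_different <- ref_fields_equal(a0, b)
print(c(roundtrip = res_roundtrip, different = res_different))
```

Fields that themselves hold reference objects are still compared with all.equal(), so the strict class-definition test can reappear one level down.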

Finally note that, in addition to the above test, all.equal.envRefClass() also does this test (slightly simplified here):

  if (!isTRUE(all.equal(class(target), class(current))))
      return(sprintf("Classes differ: %s", ...))


Maybe that's all it needs to do to compare the classes of the 2 objects? (Ironically this test uses all.equal() when it could use identical().)

Michael?

H.
On 5/11/19 15:09, Aaron Lun wrote:
I would say it's much worse than mismatching class definitions.

https://github.com/Bioconductor/SummarizedExperiment/issues/16

-A
On 5/11/19 5:07 AM, Martin Morgan wrote:
I think it has to do with the use of reference classes in the assay slot, which have different environments

   se = SummarizedExperiment()
   saveRDS(se, fl <- tempfile())
   se1 = readRDS(fl)

and then

   > all.equal(se@assays, se1@assays)
   [1] "Class definitions are not identical"
   > all.equal(se@assays@.xData, se1@assays@.xData)
   [1] "Component \".self\": Class definitions are not identical"
   > se@assays@.xData
   <environment: 0x7fb1de1ede90>
   > se1@assays@.xData
   <environment: 0x7fb1fc2bca78>

Martin


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
#
Interesting detective work. This is nasty.

Best,
Kasper
On Thu, May 16, 2019 at 2:19 AM Pages, Herve <hpages at fredhutch.org> wrote: