Skip to content

[Bioc-devel] BiocParallel and AnnotationDbi: database disk image is malformed

8 messages · Ludwig Geistlinger, Vincent Carey, Gabriel Becker +1 more

#
Hi,

Within a package I am developing, I would like to enable parallel probe to gene mapping for a compendium of microarray datasets.

This accordingly makes use of annotation packages such as hgu133a.db, which in turn connect to the SQLite database via AnnotationDbi.

When running in multi-core mode (i.e. using a MulticoreParam with BiocParallel) using more than 2 cores, this causes the error:

database disk image is malformed


In a very similar problem:

https://support.bioconductor.org/p/38541/

Adi Tarca and Dan Tenenbaum identified and resolved this problem by ensuring that each process has its own unique database connection, i.e. AnnotationDbi is not loaded before sending the job to the workers.

This solution was easily realized as this analysis was carried out within a script and not a package.

However, within my package, AnnotationDbi is loaded as a dependency of my package's imports.

How to resolve this here?
I am not sure whether I perfectly understand the underlying mechanisms, but is there a way to make my workers load their own version of AnnotationDbi instead of using the one of the parent process?
Or am I supposed to unload all packages depending on AnnotationDbi, and AnnotationDbi itself, before sending the job to the workers (and reload all of them after the job has finished?)

Thanks a lot,
Ludwig



--
Dr. Ludwig Geistlinger
CUNY School of Public Health
#
good question

some of the discussion on

http://sqlite.1065341.n5.nabble.com/Parallel-access-to-read-only-in-memory-database-td91814.html

seems relevant.

converting the relatively small annotation package content to pure R
read-only tables on the master before parallelizing
might be very simple?

On Fri, Jan 19, 2018 at 11:43 AM, Ludwig Geistlinger <
Ludwig.Geistlinger at sph.cuny.edu> wrote:

            

  
  
#
IT seems like you could also force a copy of the reference object via
<dbobject>$copy() and then force a refresh of the conn slot by assigning a
new db connection into it.

I'm having trouble confirming that this would work, however, because I
actually can't reproduce the error. The naive way works for me on my mac
laptop (which is running an old R and Bioconductor) and on the linux
cluster I have access to (running Bioc 3.6):


(cluster)
+ return( AnnotationDbi::mget( x , hgu95av2SYMBOL ) )

+ }
[[1]]

[[1]]$`36090_at`

[1] "TBL2"



[[2]]

[[2]]$`38785_at`

[1] "MUC1"
R version 3.4.3 (2017-11-30)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)


Matrix products: default

BLAS:
/gnet/is2/p01/apps/R/3.4.3-20171201-current/x86_64-linux-2.6-rhel6/lib64/R/lib/libRblas.so

LAPACK:
/gnet/is2/p01/apps/R/3.4.3-20171201-current/x86_64-linux-2.6-rhel6/lib64/R/lib/libRlapack.so


locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C

 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8

 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8

 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C

 [9] LC_ADDRESS=C               LC_TELEPHONE=C

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C


attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils     datasets

[8] methods   base


other attached packages:

[1] hgu95av2.db_3.2.3    org.Hs.eg.db_3.5.0   AnnotationDbi_1.40.0

[4] IRanges_2.12.0       S4Vectors_0.16.0     Biobase_2.38.0

[7] BiocGenerics_0.24.0


loaded via a namespace (and not attached):

 [1] Rcpp_0.12.14    digest_0.6.14   DBI_0.7         RSQLite_2.0

 [5] pillar_1.1.0    rlang_0.1.6     blob_1.1.0      bit64_0.9-8

 [9] bit_1.1-13      compiler_3.4.3  pkgconfig_2.0.1 memoise_1.1.0

[13] tibble_1.4.1
~G

On Fri, Jan 19, 2018 at 9:23 AM, Vincent Carey <stvjc at channing.harvard.edu>
wrote:

  
    
#
On 01/19/2018 12:23 PM, Vincent Carey wrote:
Can you be a little more specific here? The problem isn't likely with 
AnnotationDbi per se, but with the annotation package you use. Also, the 
connection on the worker is bad, but could be re-created (using, e.g., 
dbfile(org.Hs.eg.db)...) but probably a toy example would help.

Martin
This email message may contain legally privileged and/or...{{dropped:2}}
#
On 01/19/2018 12:37 PM, Gabe Becker wrote:
pass the data base connection to the function

getSymbol <- function ( x, db )
     ## olde schoole
     AnnotationDbi::mget(x, db)
     ## AnnotationDbi::mapIds(db, x, "SYMBOL", "PROBEID")

and arrange for the general case, i.e., distinct processes with data 
serialized between them

 > cl = parallel::makePSOCKcluster(2)
 > parLapply(cl, x, getSymbol, hgu95av2SYMBOL)
Error in checkForRemoteErrors(val) :
   2 nodes produced errors; first error: external pointer is not valid

(getSymbol would fail as originally written in the serial case, since 
the workers would not have access to hgu95av2SYMBOL

The workaround is to open the connection on the node, e.g.,

getSymbol <- function ( x, dbname ) {
     nmspc <- loadNamespace(dbname)
     db <- get(dbname, nmspc)
     AnnotationDbi::mapIds(db, x, "SYMBOL", "PROBEID")
}

lapply(x, getSymbol, "hgu95av2.db")
bplapply(x, getSymbol, "hgu95av2.db")
bplapply(x, getSymbol, "hgu95av2.db", BPPARAM = SnowParam())

Martin
This email message may contain legally privileged and/or...{{dropped:2}}
#
I apologize if I haven't been specific enough - however, I am also having trouble to reliably reproduce the error.
It does not seem to be exclusively related to the combination of AnnotationDbi and parallel computation, but also with some other packages I load.

While still trying to produce a minimal reproducible example,  here is what returns the error quite reliably on an 8-core Linux machine:
 
# loading some dependencies of my package
library(org.Hs.eg.db)
library(pathview)
library(graph)
library(BiocParallel)

# annotation packages for the datasets in the compendium 
pkgs <- rep(c("hgu133plus2.db","hgu133a.db"), 42)

#
getSymbols <- function ( anno.pkg )
{ 
    require(anno.pkg, character.only=TRUE)
    anno.pkg <- get(anno.pkg)
    syms <- AnnotationDbi::mapIds(anno.pkg, keys=keys(anno.pkg), 
                                    keytype="PROBEID", column="ENTREZID")
    return(syms)
}
Loading required package: hgu133plus2.db

Error: BiocParallel errors
  element index: 29, 30, 31, 32, 33, 34, ...
  first error: database disk image is malformed
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Desktop 12 SP3

Matrix products: default
BLAS: /mnt/raidbio/biosoft/software/R/R-3.4.1/lib/libRblas.so
LAPACK: /mnt/raidbio/biosoft/software/R/R-3.4.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] BiocParallel_1.12.0  graph_1.56.0         pathview_1.18.0     
[4] org.Hs.eg.db_3.5.0   AnnotationDbi_1.40.0 IRanges_2.12.0      
[7] S4Vectors_0.16.0     Biobase_2.38.0       BiocGenerics_0.24.0 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14      KEGGgraph_1.38.1  XVector_0.18.0    zlibbioc_1.24.0  
 [5] bit_1.1-12        R6_2.2.2          rlang_0.1.6       blob_1.1.0       
 [9] httr_1.3.1        tools_3.4.1       grid_3.4.1        png_0.1-7        
[13] DBI_0.7           bit64_0.9-7       digest_0.6.13     tibble_1.4.1     
[17] Rgraphviz_2.22.0  KEGGREST_1.18.0   memoise_1.1.0     RSQLite_2.0      
[21] compiler_3.4.1    pillar_1.1.0      Biostrings_2.46.0 XML_3.98-1.9     
[25] pkgconfig_2.0.1  


--
Dr. Ludwig Geistlinger
CUNY School of Public Health
#
On 01/19/2018 02:24 PM, Ludwig Geistlinger wrote:
My guess is that the database is being accessed by multiple processes 
simultaneously and, even though the data bases are opened read-only, 
this causes a corruption in the access of some sort. You can avoid 
multiple processes accessing the database at the same time by using a 'lock'

getSymbols <- function ( anno.pkg, id )
{
     nmspc <- loadNamespace(anno.pkg)
     anno.pkg <- get(anno.pkg, nmspc)

     BiocParallel::ipclock(id)
     syms <- suppressMessages({
         AnnotationDbi::mapIds(
             anno.pkg, keys=keys(anno.pkg), keytype="PROBEID",
             column="ENTREZID"
         )
     })
     BiocParallel::ipcunlock(id)

     length(syms)
}

x <- bplapply(pkgs , getSymbols, ipcid())

There are two additional considerations here.

The first is that one wants to worry about the amount of data transfered 
between worker and manager compared to the amount of time spent in 
computation. So in your previous formulation you sent back all the 
symbols -- this will be relatively expensive compared to the amount of 
work done in the function (reading the ids from the database), and you 
would rather do more work and transmit less (both to and from the 
worker) in each call to getSymbol().

The second is similar, but from the lock perspective -- since the lock 
imposes essential serial evaluation through that portion of the code, 
you'd like the locked portion of the worker's task to be just a small 
portion of the total work done by the worker.

I guess a more clever use of locks would be one per data base (generate 
two ipcid()'s in the manager, and pass these to the worker in such a way 
that the worker uses the same lock for each database.

Martin
This email message may contain legally privileged and/or...{{dropped:2}}
#
Thanks for these explanations, Martin.

What I actually want to do is sending each expression dataset on probe level to a worker, where probe level expression is summarized to gene level expression according to a chosen summarization function (such as 'mean').

The worker should then return the expression dataset on gene level back to the manager.

Taking also Vince' suggestion and your considerations on work load into account, it seems best if I first collect the probe2gene mappings for the unique set of annotation packages (which would only be 2 in the example) in serial mode on the manager.
Then, I can send the datasets together with the respective mapping as a simple table to the workers, which carry out the actual computational part, ie. the summarization.   

This would avoid parallel access on the database and would also spare retrieving the mappings on each worker over in over again, as it is likely that some datasets of the compendium share the corresponding annotation db.

Thanks,
Ludwig
 
--
Dr. Ludwig Geistlinger
CUNY School of Public Health