[Bioc-devel] Remote BigWig file access
Thanks for sharing, Leo. This does interest me, especially since so much is built on BigWig access via rtracklayer, at least in the recount2 ecosystem.

As you alluded to, Megadepth currently supports remote access to BigWigs (and BAMs) over HTTPS on all platforms (Linux, macOS, and Windows). It fetches only the byte ranges overlapping the requested regions, so it should work for at least recount2/recount3 and anything else served over HTTP(S).

I'd be open to exploring updates to the Megadepth C/C++ code to support Rle, if it makes sense for that to replace rtracklayer. But to do that, you'd need to be involved in updating all the R packages, if you're willing (both megadepth and those that currently rely on rtracklayer for this functionality).

Let me know if you want to chat about this over Zoom,
Chris

On Tue, May 21, 2024 at 2:41 PM Leonardo Collado Torres <lcolladotor at gmail.com> wrote:
Hi Bioc-devel,

As some of you are aware, rtracklayer::import() has long provided access to import BigWig files. Those files can be shared on servers and accessed remotely thanks to all the effort from many of you in building and maintaining rtracklayer.

From my side, derfinder::loadCoverage() relies on rtracklayer::import.bw(), and recount::expressed_regions() + recount::coverage_matrix() use derfinder::loadCoverage(). recountWorkflow showcases those recount functions on larger datasets. brainflowprobes by Amanda Price, Nina Rajpurohit and others also ends up relying on rtracklayer::import.bw() through these functions.

At https://github.com/lawremi/rtracklayer/issues/83 I initially reported some issues once our recount2/3 data host changed, but previously Brian Schilder also reported that one could no longer read remote files https://github.com/lawremi/rtracklayer/issues/73. https://github.com/lawremi/rtracklayer/issues/63 and/or https://github.com/lawremi/rtracklayer/issues/65 might have been related.

Yesterday I updated https://github.com/lawremi/rtracklayer/issues/83#issuecomment-2121313270 with a comment showing some small reproducible code, and that the workaround of downloading the data first, then using rtracklayer::import() on the local data does work. However, this workaround does involve a lot of, hmm, wasteful data transfer.

On the recount vignette at some point I access just chrY of a bigWig file that is about 1300 MB. On the recountWorkflow vignette I do something similar for a 7 GB bigWig file. Previously accessing just chrY on these files was a small data transfer. On recountWorkflow version 1.29.2 https://github.com/LieberInstitute/recountWorkflow, I've included pre-computed results (~2 MB) to avoid downloading tons of data, though the vignette code shows how to actually fully reproduce the results if you don't mind downloading those large files.
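[Editorial sketch] The download-first workaround described above might look roughly like the following. This is a minimal sketch and not the code from the linked issue comment or from the recount vignettes: the helper names fetch_bigwig() and import_chr_coverage() are hypothetical, as is the URL in the usage comment. It assumes rtracklayer and GenomicRanges are installed and that the file contains the requested chromosome.

```r
library(rtracklayer)
library(GenomicRanges)

# Hypothetical helper: download the whole BigWig once and reuse the local copy.
fetch_bigwig <- function(bw_url, dest_dir = tempdir()) {
    dest <- file.path(dest_dir, basename(bw_url))
    if (!file.exists(dest)) {
        # mode = "wb" keeps the binary file intact on Windows.
        # Large files may need a higher options(timeout = ...) value.
        download.file(bw_url, destfile = dest, mode = "wb")
    }
    dest
}

# Hypothetical helper: import one chromosome from a local BigWig as an Rle.
import_chr_coverage <- function(bw_path, chr = "chrY") {
    bwf <- BigWigFile(bw_path)
    # Chromosome lengths are stored in the BigWig header.
    chr_len <- seqlengths(seqinfo(bwf))[[chr]]
    region <- GRanges(chr, IRanges(start = 1, end = chr_len))
    cov <- import(bwf, selection = BigWigSelection(region), as = "RleList")
    cov[[chr]]
}

# Usage (placeholder URL, not a real recount file):
# bw_local <- fetch_bigwig("https://example.org/path/to/sample.bw")
# cov_chrY <- import_chr_coverage(bw_local, "chrY")
```

Restricting the import to one chromosome keeps memory use down, but the full file is still transferred, which is the wasteful part noted above.
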
I also implemented some workarounds on recount, though I haven't yet gone the full route of including pre-computed results. I have yet to try implementing a workaround for brainflowprobes.

My understanding is that rtracklayer's root issues are elsewhere, and that changes in rtracklayer's dependencies have likely created these problems. These problems are not always in the control of rtracklayer authors to resolve, and also create an unexpected burden on them.

If one considers alternatives to rtracklayer, I see that there's a new package https://github.com/PoisonAlien/trackplot/tree/master that uses bwtool (a system dependency), an older alternative https://github.com/andrelmartins/bigWig that hasn't had updates in 4 years, and a CRAN package (https://cran.r-project.org/web/packages/wig/readme/README.html) that recommends using rtracklayer for larger files. I guess that I could also try using megadepth https://research.libd.org/megadepth/, though derfinder::loadCoverage uses rtracklayer::import(as = "RleList") for efficiency https://github.com/lcolladotor/derfinder/blob/f9cd986e0c1b9ea6551d0d8d2077d4501216a661/R/loadCoverage.R#L401 and lots of functions in that package were built for that structure (RleList objects). I likely missed other alternatives.

My current line of thought is to keep implementing workarounds using local data (sometimes with pre-computed results) for recount, recountWorkflow, and brainflowprobes (derfinder only has tests with local bigWig files) without really altering the internals of those packages. That is, assume that remote BigWig file access via rtracklayer will be suspended indefinitely, though it could be supported again at some point, and when it is, those packages will work again with remote BigWig files as if nothing ever happened.

But I wanted to check in to see if this is what others who use BigWig files are thinking of doing.

Thanks!

Best,
Leo

Leonardo Collado Torres, Ph. D.
Investigator, LIEBER INSTITUTE for BRAIN DEVELOPMENT
Assistant Professor, Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health
855 N. Wolfe St., Room 382
Baltimore, MD 21205
lcolladotor.github.io
lcolladotor at gmail.com