[Bioc-devel] Incremental writing to HDF5 / DelayedMatrix
That seems to solve my problem. I will try it this way, thank you very much.
Francesco
On Thu, Dec 21, 2017 at 1:16 PM, Martin Morgan <martin.morgan at roswellpark.org> wrote:
On 12/21/2017 06:22 AM, Francesco Napolitano wrote:
Hi, I need to deal with very large matrices and I was thinking of using HDF5-based data models. However, from the documentation and examples that I have been looking at, I'm not quite sure how to do this. My use case is as follows. I want to build a very large matrix one column at a time, and I need to write columns directly to disk since I would otherwise run out of memory. I need a format that, afterwards, will allow me to extract subsets of rows or columns and rank them. The subsets will be small enough to be loaded in memory. Can I achieve this with current HDF5 support in R?
This is basically straightforward in rhdf5. The idea is to create a dataset
large enough to contain all of your data
library(rhdf5)
fl <- tempfile()
h5createFile(fl)
nrow <- 10000
ncol <- 100
h5createDataset(fl, "big", c(nrow, ncol), showWarnings = FALSE)
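As a side note on the dataset layout: h5createDataset also accepts 'storage.mode', 'chunk' and 'level' arguments. The following is only a minimal sketch, assuming that column-wise access is the main pattern; the file 'fl2', the dataset name "big_chunked" and the chunk shape are illustrative choices, not something required by the approach above.

```r
library(rhdf5)

fl2 <- tempfile(fileext = ".h5")   # hypothetical second file for this sketch
h5createFile(fl2)

nrow <- 10000
ncol <- 100

# One HDF5 chunk per column, so reading or writing a single column
# touches a single chunk. storage.mode and level are spelled out only
# to make the choices visible.
h5createDataset(
    fl2, "big_chunked",
    dims = c(nrow, ncol),
    storage.mode = "double",
    chunk = c(nrow, 1),
    level = 6
)

h5ls(fl2)   # lists the datasets in the file with their dimensions
```

Whether one chunk per column is a good shape depends on how the data are read back; if rows are extracted as often as columns, a more square chunk may be a better compromise.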
Then fill it in chunks by specifying the start row / column you'd like
to write to and the 'count' of data points to write in each direction
chunk_ncol <- ncol / 10
j <- 1 # which column to start writing?
while (j <= ncol) {  # '<=' so the last column is written even if chunk_ncol is 1
    m <- matrix(seq(1, length.out = nrow * chunk_ncol), nrow)
    h5write(m, fl, "big", start = c(1, j), count = c(nrow, chunk_ncol))
    j <- j + chunk_ncol
}
You can read arbitrary 'slabs'
h5read(fl, "big", start = c(1, 1), count = c(5, 5))
h5read(fl, "big", start = c(1, 9), count = c(5, 2))
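For the "extract a subset of rows or columns and rank them" part of the question, h5read also takes an 'index' argument for selections that are not contiguous. A minimal sketch on a small throwaway file; the file 'fl_idx', the dataset name "small" and the toy values are made up for illustration.

```r
library(rhdf5)

# Small self-contained file so the sketch runs on its own.
fl_idx <- tempfile(fileext = ".h5")
h5createFile(fl_idx)
m <- matrix(as.numeric(20:1), nrow = 5)   # 5 x 4, values 20..1 filled column-wise
h5write(m, fl_idx, "small")

# Rows 1-3 of columns 2 and 4. A NULL list element would mean
# "everything along that dimension".
sub <- h5read(fl_idx, "small", index = list(1:3, c(2, 4)))

# Rank the values within each extracted column, in memory.
ranks <- apply(sub, 2, rank)
ranks
```

Only the selected block is read from disk; the ranking itself is ordinary in-memory R.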
Probably you don't want to write 1 column at a time, but as many columns as
comfortably fit into memory. This minimizes the number of R function calls
needed to write / read the data.
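If the columns really are produced one at a time, one way to follow this advice is to collect them in an in-memory buffer and write the buffer whenever it is full. This is only a sketch under assumptions: 'make_column' is a hypothetical stand-in for whatever computes a column, and 'buf_ncol' should be tuned to the available memory.

```r
library(rhdf5)

fl_buf <- tempfile(fileext = ".h5")
h5createFile(fl_buf)
nrow <- 1000
ncol <- 25
h5createDataset(fl_buf, "big", c(nrow, ncol), showWarnings = FALSE)

# Hypothetical stand-in for the real per-column computation.
make_column <- function(j) rep(as.numeric(j), nrow)

buf_ncol <- 10                     # number of columns held in memory at once
buf <- matrix(0, nrow, buf_ncol)
filled <- 0                        # columns currently sitting in the buffer
start_col <- 1                     # file column where the buffer will be written

for (j in seq_len(ncol)) {
    filled <- filled + 1
    buf[, filled] <- make_column(j)
    # Flush when the buffer is full, or when the last column has arrived.
    if (filled == buf_ncol || j == ncol) {
        h5write(buf[, seq_len(filled), drop = FALSE], fl_buf, "big",
                start = c(1, start_col), count = c(nrow, filled))
        start_col <- start_col + filled
        filled <- 0
    }
}
```

With these toy sizes the file is written in three calls (10, 10 and 5 columns) instead of 25.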
The HDF5Array package provides an easy abstraction for reading (probably
writing is possible too, but it might be easier to understand the building
blocks first).
library(HDF5Array)
hdf <- HDF5Array(fl, "big")
hdf
HDF5Matrix object of 10000 x 100 doubles:
            [,1]   [,2]   [,3] ...  [,99] [,100]
    [1,]      1  10001  20001   .  80001  90001
    [2,]      2  10002  20002   .  80002  90002
    [3,]      3  10003  20003   .  80003  90003
    [4,]      4  10004  20004   .  80004  90004
    [5,]      5  10005  20005   .  80005  90005
     ...      .      .      .   .      .      .
 [9996,]   9996  19996  29996   .  89996  99996
 [9997,]   9997  19997  29997   .  89997  99997
 [9998,]   9998  19998  29998   .  89998  99998
 [9999,]   9999  19999  29999   .  89999  99999
[10000,]  10000  20000  30000   .  90000 100000
hdf[1:5, 1:5]
DelayedMatrix object of 5 x 5 doubles:
     [,1]  [,2]  [,3]  [,4]  [,5]
[1,]    1 10001 20001 30001 40001
[2,]    2 10002 20002 30002 40002
[3,]    3 10003 20003 30003 40003
[4,]    4 10004 20004 30004 40004
[5,]    5 10005 20005 30005 40005
as.matrix(hdf[1:5, 1:5])
     [,1]  [,2]  [,3]  [,4]  [,5]
[1,]    1 10001 20001 30001 40001
[2,]    2 10002 20002 30002 40002
[3,]    3 10003 20003 30003 40003
[4,]    4 10004 20004 30004 40004
[5,]    5 10005 20005 30005 40005
rowSums(hdf[1:5, 1:5])
[1] 100005 100010 100015 100020 100025
Martin
Any help greatly appreciated. Thank you,
Francesco
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel