Quickest way to make a large "empty" file on disk?
2 messages · Jonathan Greenberg, Simon Urbanek
On Sep 28, 2012, at 12:44 PM, Jonathan Greenberg wrote:
Rui: Quick follow-up -- it looks like seek does do what I want (I see Simon suggested it some time ago) -- what do you mean by "trash your disk"?
I can't speak for Rui, but the difference between seeking and an explicit write is that the FS can optimize the former by not actually writing anything to disk (which is why it's so fast on some OS/FS combos). However, this means that the layout on disk may not be sequential, depending on the write patterns of the actual data blocks, because the FS may keep a mask of unused blocks and not write them. But that is just an FS issue and thus varies vastly by OS and FS. For your use this probably doesn't matter, as you probably don't need to stream the resulting file at the end.
What I'm trying to accomplish is getting parallel, asynchronous writes to a large binary image (just a binary file) working. Each node writes to a different sector of the file via mmap, "filling in" the values as the process runs, but the file needs to be pre-created before I can mmap it. Running a writeBin with a bunch of 0s would mean I'd basically have to write the file twice, but the seek/ff trick seems to be much faster. Do I risk doing some damage to my filesystem if I use seek? I see there is a strongly worded warning in the help for ?seek: "Use of seek on Windows is discouraged. We have found so many errors in the Windows implementation of file positioning that users are advised to use it only at their own risk, and asked not to waste the *R* developers' time with bug reports on Windows' deficiencies." --> there's no detail here on which errors people have experienced, so I'm not sure if doing something as simple as just "creating" a file using seek falls under the "discouraging" category.
Quick search in my mail shows issues that were related to what Windows reports as the seek location on text files when querying. AFAICS it did not affect the side-effect of seek which is what you're interested in. Cheers, Simon
As a note, we are trying to work this up on both Windows and *nix systems, hence our wanting to have a single approach that works on both OSs.
--j
On Thu, Sep 27, 2012 at 3:49 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello, If you really need to trash your disk, why not use seek()?
fl <- file("Test.txt", open = "wb")
seek(fl, where = 1024, origin = "start", rw = "write")
[1] 0
writeChar(character(1), fl, nchars = 1, useBytes = TRUE)
Warning message: In writeChar(character(1), fl, nchars = 1, useBytes = TRUE) : writeChar: more characters requested than are in the string - will zero-pad
close(fl)
File "Test.txt" is now 1Kb in size.
Hope this helps,
Rui Barradas
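A minimal sketch generalizing the seek approach above. In the session shown, the bytes are written starting at offset 1024, so the result is slightly larger than 1 KB; seeking to the last byte instead should give an exact size. The helper name `create_empty_file` and the temporary path are hypothetical, and whether the file ends up sparse on disk depends on the OS and file system, as Simon notes.

```r
# Hypothetical helper: create a file of exactly `nbytes` bytes by seeking
# to the last byte and writing a single zero byte there.
create_empty_file <- function(path, nbytes) {
  stopifnot(is.numeric(nbytes), nbytes >= 1)
  con <- file(path, open = "wb")
  on.exit(close(con))
  # 'where' is passed as a double, so offsets larger than
  # .Machine$integer.max can be expressed.
  seek(con, where = nbytes - 1, origin = "start", rw = "write")
  writeBin(raw(1), con)  # one zero byte at offset nbytes - 1
  invisible(path)
}

# Room for 10,000 doubles (8 bytes each)
tmp_seek <- tempfile(fileext = ".bin")
create_empty_file(tmp_seek, 10000 * 8)
file.info(tmp_seek)$size  # 80000
```

The skipped region reads back as zeros, which is what a later mmap of the file would see before any values are filled in.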
On 27-09-2012 20:17, Jonathan Greenberg wrote:
Folks:
Asked this question some time ago, and found what appeared (at first) to be
the best solution, but I'm now finding a new problem. First off, it seemed
like ff as Jens suggested worked:
# outdata_ncells = the number of rows * number of columns * number of bands in an image:
out<-ff(vmode="double",length=outdata_ncells,filename=filename)
finalizer(out) <- close
close(out)
This was working fine until I attempted to set length to a VERY large
number: outdata_ncells = 17711913600. This would create a file that is
131.964GB. Big, but not obscenely so (and certainly not larger than the
filesystem can handle). However, length appears to be restricted
by .Machine$integer.max (I'm on a 64-bit windows box):
.Machine$integer.max
[1] 2147483647
Any suggestions on how to solve this problem for much larger file sizes?
--j
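The size arithmetic in this message can be checked directly. This is only a sketch of the numbers quoted above, assuming 8 bytes per double; the variable names other than `outdata_ncells` are made up for illustration, and it says nothing about what ff itself accepts as a length.

```r
outdata_ncells <- 17711913600   # held as a double; too large for an R integer
bytes_per_cell <- 8             # assumption: vmode "double" uses 8 bytes per value

# The cell count exceeds the largest representable R integer
outdata_ncells > .Machine$integer.max   # TRUE

# Total file size in bytes and in GiB (1024^3 bytes)
total_bytes <- outdata_ncells * bytes_per_cell
total_gib <- total_bytes / 1024^3
format(total_bytes, scientific = FALSE)  # "141695308800"
round(total_gib, 3)                      # 131.964
```

Because byte offsets like `total_bytes` fit comfortably in a double, an approach that takes the size as a numeric offset (such as seek) is not bound by the integer limit in the same way.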
On Thu, May 3, 2012 at 10:44 AM, Jonathan Greenberg <jgrn at illinois.edu> wrote:
Thanks, all! I'll try these out. I'm trying to work up something that is
platform independent (if possible) for use with mmap. I'll do some tests
on these suggestions and see which works best. I'll try to report back in a
few days. Cheers!
--j
2012/5/3 "Jens Oehlschlägel" <jens.oehlschlaegel at truecluster.com>
Jonathan,
On some filesystems (e.g. NTFS, see below) it is possible to create
'sparse' memory-mapped files, i.e. reserving the space without the cost of
actually writing initial values.
Package 'ff' does this automatically and also allows accessing the file
in parallel. Check the example below and see how creating a big file is
immediate.
Jens Oehlschlägel
library(ff)
library(snowfall)
ncpus <- 2
n <- 1e8
system.time(
+ x <- ff(vmode="double", length=n, filename="c:/Temp/x.ff")
+ )
user system elapsed
0.01 0.00 0.02
# check finalizer, with an explicit filename we should have a 'close' finalizer
finalizer(x)
[1] "close"
# if not, set it to 'close' in order to not let slaves delete x on slave shutdown
finalizer(x) <- "close"
sfInit(parallel=TRUE, cpus=ncpus, type="SOCK")
R Version: R version 2.15.0 (2012-03-30)
snowfall 1.84 initialized (using snow 0.3-9): parallel execution on 2
CPUs.
sfLibrary(ff)
Library ff loaded.
Library ff loaded in cluster.
Warning message:
In library(package = "ff", character.only = TRUE, pos = 2, warn.conflicts = TRUE, :
'keep.source' is deprecated and will be ignored
sfExport("x") # note: do not export the same ff multiple times
# explicitly opening avoids a gc problem
# opening with 'mmeachflush' instead of 'mmnoflush' is a bit slower but
# prevents OS write storms when the file is larger than RAM
sfClusterEval(open(x, caching="mmeachflush"))
[[1]]
[1] TRUE
[[2]]
[1] TRUE
system.time(
+ sfLapply( chunk(x, length=ncpus), function(i){
+ x[i] <- runif(sum(i))
+ invisible()
+ })
+ )
user system elapsed
0.00 0.00 30.78
system.time(
+ s <- sfLapply( chunk(x, length=ncpus), function(i) quantile(x[i],
c(0.05, 0.95)) )
+ )
user system elapsed
0.00 0.00 4.38
# for completeness
sfClusterEval(close(x))
[[1]]
[1] TRUE
[[2]]
[1] TRUE
csummary(s)
5% 95%
Min. 0.04998 0.95
1st Qu. 0.04999 0.95
Median 0.05001 0.95
Mean 0.05001 0.95
3rd Qu. 0.05002 0.95
Max. 0.05003 0.95
# stop slaves
sfStop()
Stopping cluster
# with the close finalizer we are responsible for deleting the file explicitly (unless we want to keep it)
delete(x)
[1] TRUE
# remove r-side metadata
rm(x)
# truly free memory
gc()
*Sent:* Thursday, 03 May 2012 at 00:23
*From:* "Jonathan Greenberg" <jgrn at illinois.edu>
*To:* r-help <r-help at r-project.org>, r-sig-hpc at r-project.org
*Subject:* [R-sig-hpc] Quickest way to make a large "empty" file on disk?
R-helpers:
What would be the absolute fastest way to make a large "empty" file (e.g.
filled with all zeroes) on disk, given a byte size and a given number
of empty values? I know I can use writeBin, but the "object" in
this case may be far too large to store in main memory. I'm asking because
I'm going to use this file in conjunction with mmap to do parallel writes
to this file. Say, I want to create a blank file of 10,000 floating point
numbers.
Thanks!
--j
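For reference, the explicit-write route mentioned in this question does not need the whole object in memory if the zeros are written in chunks. The following is a rough sketch under that assumption; `write_zeros` and its `chunk` parameter are hypothetical names, and this is the slow baseline the thread is trying to beat, not a recommendation.

```r
# Hypothetical helper: write `n` zero-valued doubles to `path` in chunks,
# so that at most `chunk` values are held in memory at once.
write_zeros <- function(path, n, chunk = 1e6) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  remaining <- n
  while (remaining > 0) {
    k <- min(chunk, remaining)
    writeBin(numeric(k), con, size = 8)  # k doubles, all 0
    remaining <- remaining - k
  }
  invisible(path)
}

# A blank file of 10,000 floating point numbers, as in the question
tmp_zeros <- tempfile(fileext = ".bin")
write_zeros(tmp_zeros, 10000, chunk = 4096)
file.info(tmp_zeros)$size  # 80000
```

Unlike a seek-based approach, every byte is physically written here, so the cost grows with the file size.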
--
Jonathan A. Greenberg, PhD
Assistant Professor
Department of Geography and Geographic Information Science
University of Illinois at Urbana-Champaign
607 South Mathews Avenue, MC 150
Urbana, IL 61801
Phone: 415-763-5476
AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007
http://www.geog.illinois.edu/people/JonathanGreenberg.html
_______________________________________________ R-sig-hpc mailing list R-sig-hpc at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-hpc