Skip to content

Fixing the HDF5 package: the on.exit mystery

4 messages · Brian Ripley, H C Pumphrey

#
Dear all,

I'm trying to fix a subtle bug in the hdf5 package. This package provides an 
interfaces to the HDF5 library and hence allows one to load data into R from 
files in the HDF5 format. The bug appeared during a period in which R changed 
but the package did not.

I include below both the R and C code, stripped of everything except what is 
needed to show the bug. What is supposed to happen is

(*) the user calls R function hdf5load()
(*) hdf5load() calls C function do_hdf5load()
(*) do_hdf5load() opens the HDF5 file recording its HDF5 file id (fid)
(*) do_hdf5load() calls C function setup_onexit, passing fid to it
(*) setup_onexit sets up the on.exit call to be R function hdf5cleanup with 
fid as its argument
(*)  C function do_hdf5load() walks the HDF5 file's tree structure of groups 
of groups of [...] of datasets, mapping them to an R list of lists of [...] of 
array variables. This recursive procedure may have a variety of exit points 
buried inside itself.
(*) C function do_hdf5load() exits for some reason. R function hdf5load() 
therefore exits but before doing so it calls its on.exit code (which is 
hdf5cleanup(fid) with the right value of fid), closing the file.

The problem is that when do_hdf5load() and hdf5load() exit, hdf5cleanup() is 
usually not called, meaning that the file is left open. You might not notice 
this, but if you are processing a few year's worth of data, which is stored at 
1 file per day, you may end up with the system limit number of files open and 
be unable to open any more.

I have a suspicion that the problem dates to a change in R at 2.8.0. If you do 
  help(on.exit) it notes under "Details" that: "Where ?expr? was evaluated 
changed in R 2.8.0 ..." But it is not clear how I should modify the C code to 
force hdf5cleanup() to be reliably called when do_hdf5load() exits.

Any help appreciated.

Hugh (possibly the nearest thing to a maintainer that the hdf5 package 
currently has)

(R and C code follow)

#----------------------------------------------------------------
"hdf5load" <-  function (file, load = TRUE, verbosity = 0, tidy = FALSE)
{
   call <- sys.call()
   .External("do_hdf5load", call, sys.frame(sys.parent()), file, load,
             as.integer (verbosity), as.logical(tidy),
             PACKAGE="hdf5")
}

"hdf5cleanup" <- function (fid)
{
   call <- sys.call()
   print("In hdf5cleanup: calling do_hdf5cleanup")
   invisible(.External("do_hdf5cleanup", call, sys.frame(sys.parent()), fid,
             PACKAGE="hdf5"))
}
#----------------------------------------------------------------


/*---------------------------------------------------------------*/
SEXP do_hdf5load (SEXP args)
{
/* Code to process args snipped */
  if ((fid = H5Fopen (path, H5F_ACC_RDONLY, H5P_DEFAULT)) < 0)
     errorcall (call, "unable to open HDF file: %s", path);

   setup_onexit (fid, env);
   /* Messy code to walk tree structure of file snipped */
}

/* The following function shown in its entirety */
setup_onexit (hid_t fid, SEXP env)
{
   eval (lang2 (install ("on.exit"),
                lang2 (install ("hdf5cleanup"),
                       ScalarInteger (fid))),
         env);
}

SEXP
do_hdf5cleanup (SEXP args)
{
/* Code to process args snipped */
/* various cleanup things done including this: */
H5Fclose(fid)
}
/*---------------------------------------------------------------*/
#
Looks like you should be using finalizers instead.  See the RODBC 
package for an example of this.
On Fri, 4 Mar 2011, H C Pumphrey wrote:

            

  
    
10 days later
#
H C Pumphrey wrote:

            
I considered Prof. Ripley's suggestion to use finalizers but I'm not a good 
enough C programmer and rapidly became bogged down. For the moment I have 
fixed the bug in a more crude and obvious manner, by putting the relevant 
clean-up code at the end of both hdf5save() and hdf5load() and removing all 
references to the on.exit mechanism. This now means that if things work 
properly you can open and close as many HDF5 files as you like. (I imagine 
that the downside is that your file may be left open if some sorts of errors 
occur.  But I think this was probably happening anyway, given the fact that 
the on.exit mechanism wasn't working. )

I have also altered the code behind hdf5load() so that it ignores soft links 
inside a HDF5 file that don't have a target. Ideally your file should not 
contain such things. But if it does, then it is better to skirt round them 
than to fail whenever you meet one.

I will try uploading the package to CRAN at some point, but I think I'll use 
it myself for a bit to see whether it has any other nasties. If anyone wants 
to try it in the meantime you can get it at http://xweb.geos.ed.ac.uk/~hcp 
(scroll down to the bottom). I'd be particularly interested to know if it can 
be built on Windows/MacOS and, if not, why not.

Hugh
2 days later
#
H C Pumphrey wrote:
> I'd be particularly interested to know if it can be built on Windows/MacOS
 > and, if not, why not.

Two users report that it installs OK in MacOSX: many thanks.
Two users report that it doesn't build on Windows. Brian Ripley provided fixes 
for this. I have incorporated these and bumped the package version to 1.7.1. 
(I'm afraid I don't know much about building R packages with compiled code on 
Windows: I am grateful for any help that others can provide on this.)

As the link I gave was rather buried in a long web page I have moved it to a 
shorter one that contains only relevant information: 
http://xweb.geos.ed.ac.uk/~hcp/Rhdf5.html

best wishes

Hugh