
[Bioc-devel] BiocFileCache for developers

12 messages · Shepherd, Lori, Sean Davis, Martin Morgan +2 more

#
hi,

I'm writing a function which currently uses BiocFileCache to store a
small data.frame and one or more TxDb objects, so that these objects
are persistent and available across sessions (or possibly available to
multiple users).

In the simplest case, I would call

bfc <- BiocFileCache()

inside my function, which will check the default location:

user_cache_dir(appname = "BiocFileCache")

In general, should developers also support the user specifying a
specific location for the BiocFileCache? So functions using
BiocFileCache should have an argument that overrides the above
location?

thanks,
Mike
#
If you are only using it in a helper function, exposing the location may be more than you need, and you may just want the cache to work behind the scenes in the default location; but it could also be offered to the user as an option. I think it comes down to coding preference. If a user-specified directory is used, the user will have to remember to supply it each time they use your package, or the resources will be downloaded again.


There shouldn't be a concern of overwriting files in the default cache location, as files added to the cache get a random identifier to try to avoid overwriting and to allow for essentially duplicate entries.


You can always get the cache location of a bfc object by calling bfccache(bfc), in case a user-specified directory is used.
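A minimal sketch of the optional-argument approach, assuming a hypothetical helper named `open_cache()` (not part of BiocFileCache). Passing `ask = FALSE` creates the directory without prompting, and `bfccache()` reports the location in use:

```r
library(BiocFileCache)

# Hypothetical helper: `cache` is optional; NULL means "use the default location".
open_cache <- function(cache = NULL) {
  if (is.null(cache)) BiocFileCache() else BiocFileCache(cache, ask = FALSE)
}

# Example with an explicit location (a session-specific temporary path here).
bfc <- open_cache(file.path(tempdir(), "my-bfc"))
bfccache(bfc)  # path of the cache directory actually in use
```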



Lori Shepherd

Bioconductor Core Team

Roswell Park Cancer Institute

Department of Biostatistics & Bioinformatics

Elm & Carlton Streets

Buffalo, New York 14263
#
On Fri, Dec 1, 2017 at 10:28 AM, Michael Love <michaelisaiahlove at gmail.com>
wrote:
On some systems, the user home directory is not large (such as on HPC
systems) or has strong quotas. The default user_cache_dir may not be the
best choice there.

Sean

  
    
#
So having a user argument might be best.  Or defining a unique cache location for your package would be another option.


Lori Shepherd

#
On Fri, Dec 1, 2017 at 11:16 AM, Shepherd, Lori <
Lori.Shepherd at roswellpark.org> wrote:

            
The R package development policies actually have a statement that may be
helpful in thinking about this. Your mileage may vary in the
interpretation...
https://cran.r-project.org/web/packages/policies.html

Sean

  
    
#
Unfortunately, I think a number of packages don't necessarily adhere to this.


For Bioconductor packages, we try to always make sure that any example or vignette code follows this policy.


I think an exception may be made if the behavior is part of the package's main functionality and is noted prominently in the package documentation.



Lori Shepherd

#
One solution if a developer really wants to make sure the user knows
that the function will store a cache somewhere would be to leave the
BiocFileCache location argument without a default value.
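As a rough base-R illustration of that idea, the hypothetical function below (`load_annotation()` and its arguments are invented for this sketch) refuses to run until the caller names a location:

```r
# Hypothetical wrapper: `cache_dir` deliberately has no default, so the
# caller must state where persistent files may be written.
load_annotation <- function(gtf_name, cache_dir) {
  if (missing(cache_dir)) {
    stop("Please supply 'cache_dir': the directory where cached files ",
         "will be stored.", call. = FALSE)
  }
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  # A real function would open BiocFileCache(cache_dir) here; this sketch
  # only returns the path that would be used for the cached resource.
  file.path(cache_dir, gtf_name)
}
```

Calling `load_annotation("x.gtf.gz")` without `cache_dir` stops with the message above, so the user cannot be surprised by files appearing in a location they never chose.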
#
On 12/01/2017 11:23 AM, Sean Davis wrote:
Actually, CRAN policies.

The CRAN policy is definitely appropriate for vignette and example code, 
and certainly functions by default should not write to locations where 
they will potentially overwrite existing resources. The policy makes it 
impossible to write files that persist across sessions, which is the 
objective for BiocFileCache.

For the original question, I think there's often a case for 
user_cache_dir(appname="mikes-package-name")

Martin
#
user_cache_dir(appname="mikes-package-name")

wow, how did you guess it?

I'm storing TxDb's for use across sessions with `rname` set to the
basename of the GTF file, e.g. "gencode.v27.annotation.gtf.gz". I want
to encourage the serendipitous case that there is already a
BiocFileCache entry with this `rname` created outside of the use of my
package. I can see this happening, especially if I mention this naming
pattern in the vignette.
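The lookup-by-`rname` pattern could look roughly like the sketch below. The helper `cached_path()` is hypothetical; it only uses `bfcquery()`, `bfcrpath()` and `bfcnew()` from BiocFileCache, and filters for exact matches by hand because `bfcquery()` treats the query as a pattern:

```r
library(BiocFileCache)

# Hypothetical helper: return the path of an existing entry named `rname`,
# or reserve a new entry that the caller still has to write to.
cached_path <- function(bfc, rname, ext = ".sqlite") {
  hits <- bfcquery(bfc, rname, field = "rname")
  hits <- hits[hits$rname == rname, ]  # keep exact matches only
  if (nrow(hits) > 0) {
    list(path = unname(bfcrpath(bfc, rids = hits$rid[1])), is_new = FALSE)
  } else {
    list(path = unname(bfcnew(bfc, rname, ext = ext)), is_new = TRUE)
  }
}
```

When `is_new` is TRUE, the TxDb would be built and written to `path` (for example with `AnnotationDbi::saveDb()`); otherwise it could be read back with `AnnotationDbi::loadDb()`.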

I'm thinking I will encourage the user to pick a good BiocFileCache
location by not setting a default value. Potentially multiple users
could be sharing the same BiocFileCache location, e.g. a lab space on
HPC.

And then actively specifying NULL for the location (or something like
this) could switch the location to:

user_cache_dir(appname = "BiocFileCache")
#
R.cache (>= 0.6.0) does the following to acquire a persistent cache
(root) folder.  This behavior was introduced after getting prompted by
CRAN not to write to disk by default (because they found "funny"
folders on their check servers) and a following email conversation
with CRAN (2011-12-29), and getting an "ok with me" from Uwe at CRAN:

1. When loaded (not only attached) it checks for the existence of a
cache folder (defaults to ~/.Rcache unless an R option or an
env var is set).  If it exists, then we're good to go.

2. If the cache folder does not exist, and in a non-interactive
session, then a temporary cache folder specific to that R session is
used.

3. If the cache folder does not exist, and in an interactive session,
then the user will be queried whether they'd like to create ~/.Rcache
(the default choice) or whether they like to use a temporary folder
(just as in the non-interactive case).  If accepting ~/.Rcache, then
that will be available across sessions (Step 1 above).

The gist is: Make sure to get the user's approval before storing
anything permanently and don't do anything that surprises the user,
risks overwriting their files, etc.
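The three steps can be sketched in base R as follows. This is not R.cache's actual code: the function name, its arguments, and the `ask` callback (there so the prompt can be replaced in scripts and tests) are assumptions, and the R option / env var override from step 1 is left out:

```r
# Sketch of the three-step decision described above.
choose_cache_root <- function(default_dir = "~/.Rcache",
                              is_interactive = interactive(),
                              ask = function(msg) readline(msg)) {
  temp_dir <- file.path(tempdir(), ".Rcache")

  # Step 1: an existing persistent folder is used without prompting.
  if (dir.exists(default_dir)) return(default_dir)

  # Step 2: without a user to approve, never create a persistent folder.
  approved <- FALSE
  if (is_interactive) {
    # Step 3: ask first; an empty answer accepts the default choice (yes).
    answer <- ask(sprintf("Create '%s'? [Y/n]: ", default_dir))
    approved <- tolower(trimws(answer)) %in% c("", "y", "yes")
  }

  chosen <- if (approved) default_dir else temp_dir
  dir.create(chosen, recursive = TRUE, showWarnings = FALSE)
  chosen
}
```

Because a folder is always returned, callers never hit a missing-cache error; the only difference between the branches is whether the folder survives the session.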

Here is a real-world user example on a "fresh" user account:

# Non-interactive sessions or user does not approve

$ Rscript -e "R.cache::getCacheRootPath()"
[1] "/tmp/RtmpzIZT4o/.Rcache"

$ R --vanilla
The R.cache package needs to create a directory that will hold cache
files. It is convenient to use one in the user's home directory,
because it remains also after restarting R. Do you wish to create the
'~/.Rcache/' directory? If not, a temporary directory
(/tmp/RtmpMA4LTF/.Rcache) that is specific to this R session will be
used. [Y/n]: n
[1] "/tmp/Rtmp0Ic5zQ/.Rcache"
$ R --vanilla
The R.cache package needs to create a directory that will hold cache
files. It is convenient to use one in the user's home directory,
because it remains also after restarting R. Do you wish to create the
'~/.Rcache/' directory? If not, a temporary directory
(/tmp/RtmpzSJd3d/.Rcache) that is specific to this R session will be
used. [Y/n]: n
[1] "/tmp/RtmpzSJd3d/.Rcache"
$ Rscript -e "R.cache::getCacheRootPath()"
[1] "/tmp/Rtmpq1nx0H/.Rcache"


# User approves or already approved

$ R --vanilla
The R.cache package needs to create a directory that will hold cache
files. It is convenient to use one in the user's home directory,
because it remains also after restarting R. Do you wish to create the
'~/.Rcache/' directory? If not, a temporary directory
(/tmp/RtmpMA4LTF/.Rcache) that is specific to this R session will be
used. [Y/n]: Y
[1] "~/.Rcache/"
$ Rscript -e "R.cache::getCacheRootPath()"
[1] "~/.Rcache/"

$ R --vanilla
[1] "~/.Rcache/"

The same applies when using library("R.cache") as well as when the
R.cache namespace is imported by another package.

This behavior also plays well with 'R CMD check' and 'R CMD check
--as-cran' where the cache folder will default to a temporary folder.
It will also prevent run-time errors since there will always be a
cache folder available (although it'll only survive the current
session).  R.cache works the same on all OSes.  To further lower the
risk for "what is this ~/.Rcache folder doing here?", R.cache also
adds a ~/.Rcache/README.txt file explaining what that folder is and
what created it.

About what the default location should be:
On Fri, Dec 1, 2017 at 8:06 AM, Sean Davis <seandavi at gmail.com> wrote:
[...]
I agree with this but it's hard to find a solid simple alternative to
the user's home folder.  However, and on my todo list to investigate,
https://cran.r-project.org/package=rappdirs may provide a better
approach because it follows OS-specific recommendations.  Back to
writing to user's home folder: in HPC environments with limited home
quota, I simply do things like ln -s /scratch/$USER/.Rcache ~/.Rcache.

/Henrik

8 days later
#
thanks Henrik,

I like the explicitness of the `R.cache` approach and I copied it for
my current implementation.

For the BiocFileCache location that should be used for this package
I'm developing, `tximeta`, I'm now using the following logic:

* If run non-interactively, `tximeta` uses a temporary directory.
* If run interactively, and a location has not been previously saved,
  the user is asked whether she wants to use (1) the default directory
  or (2) a temporary directory.
    - If (1), then use the default directory, and save this choice.
    - If (2), then use a temporary directory for the rest of this R
      session, and ask again next R session.
* The prompt above also mentions that a specific function can be used
  to manually set the directory at any time, and this choice is
  saved.
* The default directory is given by `rappdirs::user_cache_dir("BiocFileCache")`.
* The choice of which BiocFileCache directory `tximeta` should use is
  itself saved in a JSON file under
  `rappdirs::user_cache_dir("tximeta")`.
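A sketch of the logic in the list above, not `tximeta`'s actual implementation. The function names, the `bfcloc.json` file name, the `bfc_dir` field, and the `ask` callback are all assumptions made for illustration; only `rappdirs::user_cache_dir()` and the jsonlite read/write calls are real APIs:

```r
# Return the BiocFileCache directory to use, following the rules listed above.
get_bfc_location <- function(
    config_dir = rappdirs::user_cache_dir("tximeta"),
    default_dir = rappdirs::user_cache_dir("BiocFileCache"),
    is_interactive = interactive(),
    ask = function(msg) readline(msg)) {
  config_file <- file.path(config_dir, "bfcloc.json")

  # Non-interactive runs use a temporary directory, as listed.
  if (!is_interactive) return(tempdir())

  # A previously saved choice is reused without prompting.
  if (file.exists(config_file)) {
    return(jsonlite::read_json(config_file)$bfc_dir)
  }

  answer <- ask(sprintf(
    "Use (1) the default directory '%s' or (2) a temporary directory? ",
    default_dir))
  if (trimws(answer) == "1") {
    set_bfc_location(default_dir, config_dir)
    return(default_dir)
  }
  # Choice (2) is not saved, so the question comes back next session.
  tempdir()
}

# Manually set (and save) the directory at any time.
set_bfc_location <- function(
    bfc_dir, config_dir = rappdirs::user_cache_dir("tximeta")) {
  dir.create(config_dir, recursive = TRUE, showWarnings = FALSE)
  jsonlite::write_json(list(bfc_dir = bfc_dir),
                       file.path(config_dir, "bfcloc.json"),
                       auto_unbox = TRUE)
  invisible(bfc_dir)
}
```

For brevity the sketch prompts again on every call after choice (2); a fuller version would remember that answer for the rest of the session, for example in a package-level environment.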
12 days later
#
BiocFileCache has been updated to follow this type of behavior:


- if the location exists, use it without prompting (default user_cache_dir())

- if it doesn't exist

    - prompt the user to create it

    - if the user responds N, or the session is not interactive, use a temporary directory


This is reflected in devel version 1.3.8





Lori Shepherd
