[R-pkg-devel] Ensuring permanence and SHA consistency of released CRAN packages for validated software

13 messages · Borini, Stefano, Henrik Bengtsson, Dirk Eddelbuettel +2 more

#
Hello,

Validated software needs to ensure the consistency and reproducibility of its environment, potentially years later, when the audit comes. For this reason, we record the SHA of every package we download from CRAN so that we can check that the package has not changed after the fact. A change may signal that the package has been corrupted or that malicious code has been added, while an unchanged SHA assures the auditors that the packages are exactly as they were at the time of release.

Currently I am dealing with a package that I downloaded once in the past, MASS_7.3-54. This package used to have SHA256

    b800ccd5b5c2709b1559cf5eab126e4935c4f8826cf7891253432bb6a056e821  MASS_7.3-54.tar.gz

The current package instead has SHA256:

    eb644c0e94b447c46387aa22436ef5a43192960ee9cfd0df2940f4a4116179ae  MASS_7.3-54.tar.gz
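
For readers who want to reproduce this kind of check, here is a minimal sketch in Python. It is not the tooling used for the numbers above; the function names and the local file path are assumptions made for illustration. It hashes a tarball in chunks and compares the result with the digest recorded when the package was first locked.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, recorded: str) -> bool:
    """True if the file on disk still matches the digest recorded at lock time."""
    return sha256_of(path) == recorded.lower()


if __name__ == "__main__":
    # Digest quoted in this message for the first download of MASS_7.3-54.
    recorded = "b800ccd5b5c2709b1559cf5eab126e4935c4f8826cf7891253432bb6a056e821"
    tarball = Path("MASS_7.3-54.tar.gz")  # assumed location of the download
    if tarball.exists():
        print("unchanged" if verify(tarball, recorded) else "CHANGED since lock")
```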

This triggers all sorts of alarms. Replacing a package after the fact is widely regarded as poor practice, for exactly these reasons. Once a package is released, it should remain immutable. Subsequent builds can be introduced with a different build number.

The change appears to be due to the fact that CRAN occasionally rebuilds packages, for reasons unknown to me. Diffing the old and the new MASS_7.3-54.tar.gz reveals the change to be due to this:

    $ diff -Naur MASS_1/ MASS_2/
    diff -Naur MASS_1/DESCRIPTION MASS_2/DESCRIPTION
    --- MASS_1/DESCRIPTION  2021-05-03 10:03:00.000000000 +0100
    +++ MASS_2/DESCRIPTION  2021-05-03 10:03:50.000000000 +0100
    @@ -33,4 +33,4 @@
       David Firth [ctb]
     Maintainer: Brian Ripley <ripley at stats.ox.ac.uk>
     Repository: CRAN
    -Date/Publication: 2021-05-03 09:03:00 UTC
    +Date/Publication: 2021-05-03 09:03:50 UTC
    diff -Naur MASS_1/MD5 MASS_2/MD5
    --- MASS_1/MD5  2021-05-03 10:03:00.000000000 +0100
    +++ MASS_2/MD5  2021-05-03 10:03:50.000000000 +0100
    @@ -1,4 +1,4 @@
    -560f72bfd93ac57532d2cf113078d2e7 *DESCRIPTION
    +ecf84f78aac3c625898be45513307d79 *DESCRIPTION
     35aff05a505ecf7e81e0473767794ca9 *INDEX
     c7acdc0fa828f781a0a5586ab9d4fa1b *LICENCE.note
     0ac7b30ad35a4c19ea69d76a6a366b02 *NAMESPACE
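
The same comparison can be done without unpacking the tarballs by hand. The sketch below is an illustration only (the function names are made up here): it hashes every regular file inside two `.tar.gz` archives and lists the members whose contents differ.

```python
import hashlib
import tarfile


def member_digests(tarball_path: str) -> dict[str, str]:
    """Map each regular file inside a .tar.gz to the SHA-256 of its contents."""
    digests = {}
    with tarfile.open(tarball_path, "r:gz") as archive:
        for member in archive:
            if member.isfile():
                data = archive.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def changed_members(old_path: str, new_path: str) -> list[str]:
    """Names of files that were added, removed, or altered between two tarballs."""
    old, new = member_digests(old_path), member_digests(new_path)
    return sorted(name for name in old.keys() | new.keys()
                  if old.get(name) != new.get(name))
```

Given the diff above, one would expect a call such as `changed_members("old/MASS_7.3-54.tar.gz", "new/MASS_7.3-54.tar.gz")` (paths assumed) to report only the DESCRIPTION and MD5 members.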

Please prevent SHA changes of released packages on CRAN. Once a package is released, it should not be touched again.

--

Stefano Borini
Principal Analytical Tools Developer
AstraZeneca R&D BioPharmaceuticals | Data Science & AI | Early Biometrics & Statistical Innovation



________________________________


AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA.

This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com<https://www.astrazeneca.com>
#
On 16/03/2022 2:51 p.m., Borini, Stefano wrote:
I don't know the reason that MASS was built again 50 seconds after the 
first build, and it would be more convenient for you and some other 
people if it hadn't been, but your request comes across as unreasonably 
demanding.

You work for a company with a very large budget.  CRAN is run by 
volunteers, and as far as I know, your company has not contributed 
financially to running it.

If you want to guarantee that a CRAN package can be re-installed years 
from now, *you* should be archiving a copy of it.  You may be negligent 
by not doing so:  there's no guarantee that CRAN will still be 
distributing *any* version of MASS when the auditors show up.

Duncan Murdoch
#
Hi,

I think this is a valid concern and feature request, and I believe it
has been raised by others previously on one of our mailing lists.

Related to this, there's also been discussion (here or on R-devel), of
having `R CMD build` produce identical tarballs when the input doesn't
change, but the injection of `Packaged: <timestamp>; <user>` to the
`DESCRIPTION` file prevents this. If I recall correctly, there was at
least some discussion on being able to control, or anonymize, the
<user> part.

MRAN (https://mran.microsoft.com/timemachine) provides a daily
snapshot of CRAN, and it goes back several years, but I'm not sure if
that would solve your problem. It's only stable for a particular date,
but I'd guess that in this case it could pick up one build one day,
and the other one the next day.

There are a few working groups over at the R Consortium
(https://www.r-consortium.org/projects/isc-working-groups) who are
interested in reproducibility of R packages. I suspect the 'R
Validation Hub' working group (https://www.pharmar.org/overview/)
would be interested in these types of hiccups, even if it's just to
collect rare "incidents" like this one. I suggest you ping them as
well.

/Henrik

#
On 16/03/2022 5:01 p.m., Henrik Bengtsson wrote:
And what solution or resources for producing one did they offer?

Here's a trivial solution that could even be implemented by a 
pharmaceutical company:  rename the file to include its SHA when you 
download it, and keep a copy and a record of the new name as part of any 
document that is produced with it.

There, it's solved.
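
A minimal sketch of that idea in Python, for illustration only: the archive directory layout, the `<sha>_<name>` naming scheme, and the `manifest.jsonl` record file are all assumptions, not an existing tool.

```python
import hashlib
import json
import shutil
from pathlib import Path


def archive_with_sha(download: Path, archive_dir: Path) -> Path:
    """Copy a downloaded file into archive_dir under a name that embeds its SHA-256."""
    digest = hashlib.sha256(download.read_bytes()).hexdigest()
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{digest}_{download.name}"
    if not target.exists():  # identical content is stored only once
        shutil.copy2(download, target)
    # Keep a record of the new name (manifest file name is hypothetical).
    manifest = archive_dir / "manifest.jsonl"
    with manifest.open("a", encoding="utf-8") as log:
        record = {"file": download.name, "sha256": digest, "stored_as": target.name}
        log.write(json.dumps(record) + "\n")
    return target
```

Two different builds of the same `MASS_7.3-54.tar.gz` would then sit side by side under different names instead of one silently replacing the other.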

Duncan Murdoch
#
On 16 March 2022 at 14:01, Henrik Bengtsson wrote:
| Related to this, there's also been discussion (here or on R-devel), of
| having `R CMD build` produce identical tarballs when the input doesn't
| change, but the injection of `Packaged: <timestamp>; <user>` to the
| `DESCRIPTION` file prevents this. If I recall correctly, there was at
| least some discussion on being able to control, or anonymize, the
| <user> part.

It's much bigger than R:  https://reproducible-builds.org/

Started within Debian, but grew fairly quickly beyond one distribution to
many. We patched the build to use the (fixed) time from debian/changelog
(rather than current build time) and a few more things and were at some point
compliant, but there is still more to do, and the package I stand behind as
far as Debian is concerned currently fails this goal of reproducible (i.e.
binary identical) builds [1] (and I have limited time to chase this, but the
initiative is very very good).

If someone wants to help please get in touch off-list. It should just require
some patience and diligence and I may teach you Debian builds in the
process.  The r-cran-* packages generally pass which is good.
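
To make the role of the fixed time concrete, here is a small Python sketch. It is an illustration under stated assumptions, not what Debian or `R CMD build` actually does, and the function name is made up. It builds a `.tar.gz` in memory with a fixed modification time, fixed ownership, and a stable member order, so that the archive bytes depend only on file names and contents.

```python
import gzip
import hashlib
import io
import tarfile


def deterministic_targz(files: dict[str, bytes], fixed_mtime: int = 0) -> bytes:
    """Build a .tar.gz whose bytes depend only on the file names and contents."""
    raw = io.BytesIO()
    # mtime=0 and an empty filename keep the gzip header free of build-time data.
    with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w", format=tarfile.USTAR_FORMAT) as tar:
            for name in sorted(files):          # stable member order
                info = tarfile.TarInfo(name)
                info.size = len(files[name])
                info.mtime = fixed_mtime        # fixed, not the current time
                info.uid = info.gid = 0         # no builder identity
                info.uname = info.gname = ""
                info.mode = 0o644
                tar.addfile(info, io.BytesIO(files[name]))
    return raw.getvalue()


# Hypothetical package contents, used only to exercise the function.
sources = {"pkg/DESCRIPTION": b"Package: pkg\nVersion: 1.0\n",
           "pkg/R/f.R": b"f <- function() 1\n"}
first = hashlib.sha256(deterministic_targz(sources)).hexdigest()
second = hashlib.sha256(deterministic_targz(sources)).hexdigest()
print(first == second)
```

Any timestamp written into a file's contents (a `Packaged:` or `Date/Publication:` field, for instance) still changes the digest, which is why such fields have to be fixed or derived from the source as well.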

Dirk

[1] https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/r-base.html
#
Sure, but why rebuild the package that has already been built?
Alternatively, would it be possible to have an index containing the SHA of the packages, both current and archived? It doesn't necessarily solve the problem (someone hacking CRAN to inject a package would certainly make sure to update the SHA as well), but at least I would have information on integrity.

And while I am here, would it be possible to have a PACKAGES index equivalent also for the Archive? I wrote my own package resolver, here

https://github.com/AstraZeneca/roo/

to create my environment. It's similar to Python Poetry, but I currently can't do backtracking when a constraint is not respected, PubGrub style, or I would have to download a lot of stuff.
If I had an index covering both the current and the archive packages, I would be able to evaluate the dependency tree without downloading the package and inspecting DESCRIPTION for constraints, which would allow me to pubgrub it more efficiently.
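
As a rough illustration of what such an index would enable, the Python sketch below parses records laid out like the existing PACKAGES file (blank-line separated `Field: value` records with indented continuation lines) and extracts the version constraints without touching any tarball. The archive-wide index itself is hypothetical, and the package names and versions in the sample are invented.

```python
import re

# Invented sample records; an archive-wide index in this layout is an assumption.
SAMPLE_INDEX = """\
Package: pkgA
Version: 1.2-3
Depends: R (>= 3.5.0), pkgB (>= 0.4)
Imports: pkgC,
        pkgD (>= 2.0)

Package: pkgB
Version: 0.4-1
Depends: R (>= 3.1.0)
"""


def parse_index(text: str) -> dict[str, dict[str, str]]:
    """Parse blank-line separated 'Field: value' records, keyed by package name."""
    packages = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        record, field = {}, None
        for line in block.splitlines():
            if line[:1].isspace() and field:      # continuation of previous field
                record[field] += " " + line.strip()
            elif ":" in line:
                field, _, value = line.partition(":")
                record[field] = value.strip()
        if "Package" in record:
            packages[record["Package"]] = record
    return packages


def requirements(record: dict[str, str],
                 fields: tuple[str, ...] = ("Depends", "Imports", "LinkingTo")):
    """Return (name, operator, version) triples; the last two are None if unconstrained."""
    pattern = re.compile(r"([A-Za-z0-9.]+)\s*(?:\(\s*([<>=]+)\s*([^)\s]+)\s*\))?")
    found = []
    for field in fields:
        for item in (part.strip() for part in record.get(field, "").split(",")):
            match = pattern.fullmatch(item) if item else None
            if match:
                found.append(match.groups())
    return found


index = parse_index(SAMPLE_INDEX)
print(requirements(index["pkgA"]))
```

A resolver could walk these triples for every candidate version and backtrack on conflicts using the index alone.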

If you want to talk about this in more detail, I have some experience with this issue in Python (I worked for a major scientific Python distributor, and I had to learn my fair dose of pain). I would not mind setting up a broader conversation, mostly referring to PEP and PyPA approaches.

--
Stefano Borini
Principal Analytical Tools Developer
AstraZeneca R&D BioPharmaceuticals | Data Science & AI | Early Biometrics & Statistical Innovation




#
    If you want to guarantee that a CRAN package can be re-installed years
    from now, *you* should be archiving a copy of it.

We do, in fact, but that's beside the point. The success of an open-source project depends on its user base. I don't control the budget of the company I work for, or how that money is allocated. All I can say is that I found an issue and I am reporting it, and it's an issue that has already been dealt with in the Python world. It does not require more effort. It actually requires less. Just don't rebuild a package that has already been built.
That said, I do have some budget of my own time, which I can use (and in fact do use) to collaborate with open-source projects during my working hours, but as I don't have the keys to the CRAN build system I can't really fix the issue myself.

    You may be negligent
    by not doing so:  there's no guarantee that CRAN will still be
    distributing *any* version of MASS when the auditors show up.

As I said, we do, but when you decide to host what is basically the official package index for a language, you acquire some responsibilities (if not contractual, at least moral), regardless of whether you are an open-source developer or not.


#
On Thu, 17 Mar 2022 at 10:08, Borini, Stefano
<stefano.borini at astrazeneca.com> wrote:
Because the rest of the stack evolves and changes (compilers, shared
libraries, other packages), so you need to periodically (or, better
and more efficiently, each time a dependency changes) rebuild stuff to
check that it still works. Linux distributions have dedicated services
for this (see e.g. [1]).

[1] https://koschei.fedoraproject.org/
#
On 17/03/2022 5:14 a.m., Borini, Stefano wrote:
It's hard to convey tone in an email, but to me your post read more like
a demand than a report of an issue.  I apologize for my misreading if
that wasn't your intention.

Offering to track down the issue and fix it is a good thing.  You can't
commit your change, but you could write it.  However, I'd guess it's not
as easy as you suggest:  the build time entry is not the only place a
timestamp could slip into a package.  From Dirk's message, it sounds as
though he knows a lot about this, so you could work with him to propose
a change to the R build process.

Now it sounds as if you are accusing CRAN of shirking its
responsibilities.  CRAN is not responsible for your workflow, you are.
If your workflow doesn't fit with CRAN's practices, you could fix your
workflow.

As I said before, I don't know how it happened that there were two 
builds of MASS on CRAN, built 50 seconds apart.  But a guess is that it 
was built and published, but something appeared to indicate that things 
failed, or someone accidentally repeated some keystrokes, and the 
process was repeated.  You were unlucky enough to download it during 
that 50 second window.  It is not reasonable to suggest that errors like 
that should be impossible, but Dirk's project seems intended to reduce 
their impact.

Duncan Murdoch
#
    It's hard to convey tone in an email, but to me your post read more like
    a demand than a report of an issue.  I apologize for my misreading if
    that wasn't your intention.

No problem. I just wanted to point out that it is a problem. A lot of people use R and CRAN for development in regulated environments. Inside our company, we do everything we can and more to ensure reproducibility and auditability of our results, but of course people may decide to migrate to other languages and environments if guarantees are hard to obtain in these respects. I've been following the EMA recommendations for validation, and the issue is becoming more and more prevalent. As an individual working for my company, all I can do is safeguard the code I produce and put into production, and monitor for such events. It is my responsibility, and I do everything I can to protect the integrity of the environment my users run calculations on.

    From Dirk's message, it sounds as
    though he knows a lot about this, so you could work with him to propose
    a change to the R build process.

We could. However, be aware that my expertise in R is very limited. I've been a Python developer for a long time. All I do is carry over my Python experience and apply it to R, but the deep technicalities of R and CRAN are unknown to me.

    Now it sounds as if you are accusing CRAN of shirking its
    responsibilities.  CRAN is not responsible for your workflow, you are.

No, but CRAN is responsible for hosting packages and for their integrity. I am quite sure that if CRAN were to go away, there would be a complete uproar from the whole R community. Similarly, if CRAN were compromised and packages were modified to inject malicious code, people would be _very_ angry about it. Python has a few PEPs on PyPI integrity, e.g.:

https://peps.python.org/pep-0458/

https://peps.python.org/pep-0480/

and a lot more on the Python Packaging Authority site.
The workflow is to download packages. How I download packages is a different story, but I am assuming that most R users don't give much thought to package SHAs. I'm not sure whether packrat or renv check the SHA. I don't use them. As I said, I found them inadequate and built my own solution.

    As I said before, I don't know how it happened that there were two
    builds of MASS on CRAN, built 50 seconds apart.  But a guess is that it
    was built and published, but something appeared to indicate that things
    failed, or someone accidentally repeated some keystrokes, and the
    process was repeated.  You were unlucky enough to download it during
    that 50 second window.

Highly unlikely. That day was a bank holiday in the UK:

Early May Bank Holiday  Mon, 3 May 2021

And I am certainly not working during a bank holiday, let alone re-running the locking of packages. I should also add that this is not the first time this has happened. I've experienced it many, many times in the past 2 years. This is the first time I happened to have both the old package (in the roo package cache) and the new one (freshly downloaded), so I could compare the two.



#
Then I argue that the model is wrong. Platforms change all the time, but package release and package testing are two separate operations. I also suspect it hardly scales. If the number of packages were to increase, you can't rebuild and retest them all every time a Linux distribution changes something and you want to retest the whole lot against it.


--
Stefano Borini
Principal Analytical Tools Developer
AstraZeneca R&D BioPharmaceuticals | Data Science & AI | Early Biometrics & Statistical Innovation



From: Iñaki Ucar <iucar at fedoraproject.org>
Date: Thursday, 17 March 2022 at 10:16
To: "Borini, Stefano" <stefano.borini at astrazeneca.com>
Cc: Dirk Eddelbuettel <edd at debian.org>, Henrik Bengtsson <henrik.bengtsson at gmail.com>, "r-package-devel at r-project.org" <r-package-devel at r-project.org>
Subject: Re: [R-pkg-devel] Ensuring permanence and SHA consistency of released CRAN packages for validated software

On Thu, 17 Mar 2022 at 10:08, Borini, Stefano
<stefano.borini at astrazeneca.com> wrote:
Because the rest of the stack evolves and changes (compilers, shared
libraries, other packages), so you need to periodically (or, better
and more efficiently, each time a dependency changes) rebuild stuff to
check that it still works. Linux distributions have dedicated services
for this (see e.g. [1]).

[1] https://koschei.fedoraproject.org/

--
Iñaki Úcar

#
    Related to this, there's also been discussion (here or on R-devel), of
    having `R CMD build` produce identical tarballs when the input doesn't
    change, but the injection of `Packaged: <timestamp>; <user>` to the
    `DESCRIPTION` file prevents this. If I recall correctly, there was at
    least some discussion on being able to control, or anonymize, the
    <user> part.

Yes, you can't have timestamps in metadata if you want a reproducible build.
Well, you can, provided that you use different strategies. For example, have a superformat (e.g. like a Python wheel) instead of a plain tar.gz, and have the metainfo inside that. Or you can provide a GPG signature. Even if the SHA were to change, one can check for integrity against the signature and know it hasn't been messed with.


    MRAN (https://mran.microsoft.com/timemachine) provides a daily
    snapshot of CRAN, and it goes back several years, but I'm not sure if
    that would solve your problem. It's only stable for a particular date,
    but I'd guess that in this case it could pick up one build one day,
    and the other one the next day.

I don't believe in the snapshot model. It doesn't scale, and the reality of development, especially agile development, is that I have to mix and match depending on what users require me to add.
The library I need may not be present in the snapshot, or may be present in a version that is too old, or constraints may not be satisfied (I find it hard to believe one can ensure a consistent dependency tree, with all constraints respected, across more than 9000 packages and counting).


    There are a few working groups over at the R Consortium
    (https://www.r-consortium.org/projects/isc-working-groups) who are
    interested in reproducibility of R packages. I suspect the 'R
    Validation Hub' working group (https://www.pharmar.org/overview/)
    would be interested in these types of hiccups, even if it's just to
    collect rare "incidents" like this one. I suggest you ping them as
    well.

Will do. Thanks.

As I said, I have some time available for open-source projects and external collaborations, and I can help somehow, but it really depends on what one can do. I know from experience that package management and dependencies are a really hard battle.



#
On Thu, 17 Mar 2022 at 11:40, Borini, Stefano
<stefano.borini at astrazeneca.com> wrote:
I answered the question "why rebuild the package that has already been
built?". I didn't say that the rebuilt package had to be republished.
In fact, they're not in general. And when it's really needed because
something in the packaging changed but the upstream source didn't
change, then there's a "release" tag that is incremented and added as
a suffix to the version.
Well, we do.