
[R-pkg-devel] ORCID ID finder via tools::CRAN_package_db() ?

20 messages · Kurt Hornik, Chris Evans, Dirk Eddelbuettel +2 more

#
Has anybody written a quick helper function that extracts the Authors@R field
from tools::CRAN_package_db() and 'stems' it into 'Name, Firstname, ORCID'
which one could use to look up ORCID IDs at CRAN? The lookup at orcid.org
sometimes gives us 'private entries' that make it harder / impossible to
confirm a match. Having a normalised matrix or data.frame (or ...) would also
make it easier to generate Authors@R.

Cheers, Dirk
#
Dear Dirk,

Maybe checklist:::author2df() might be useful. It is an unexported function
from my checklist package. It converts a person() object to a dataframe.
https://github.com/inbo/checklist/blob/5649985b58693acb88337873ae14a7d5bc018d96/R/store_authors.R#L38

df <- tools::CRAN_package_db()
lapply(
  df$`Authors@R`[df$Package %in% c("git2rdata", "qrcode")],
  function(x) {
    parse(text = x) |>
      eval() |>
      vapply(checklist:::author2df, vector(mode = "list", 1)) |>
      do.call(what = rbind)
  }
)

[[1]]
    given       family                       email               orcid affiliation usage
1 Thierry     Onkelinx    thierry.onkelinx at inbo.be 0000-0001-8804-4216        <NA>     1
2  Floris Vanderhaeghe floris.vanderhaeghe at inbo.be 0000-0002-6378-6229        <NA>     1
3   Peter       Desmet        peter.desmet at inbo.be 0000-0002-8442-8025        <NA>     1
4     Els     Lommelen        els.lommelen at inbo.be 0000-0002-3481-5684        <NA>     1

[[2]]
    given   family                 email               orcid affiliation usage
1 Thierry Onkelinx qrcode at muscardinus.be 0000-0001-8804-4216        <NA>     1
2  Victor      Teh   victorteh at gmail.com


ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
*Postal address:* Koning Albert II-laan 15 bus 186, 1210 Brussel
*Mail sent to this address is scanned and delivered digitally to the
addressee. This allows the Government of Flanders to handle its files
fully digitally. Mail marked "confidential" is not scanned, but delivered
unopened to the addressee.*
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


On Mon 19 Aug 2024 at 14:54, Dirk Eddelbuettel <edd at debian.org> wrote:

  
  
#
On 19 August 2024 at 15:15, Thierry Onkelinx wrote:
| Maybe checklist:::author2df() might be useful. It is an unexported function
| from my checklist package. It converts a person() object to a dataframe.
| https://github.com/inbo/checklist/blob/5649985b58693acb88337873ae14a7d5bc018d96/R/store_authors.R#L38

That's a very nice start, thank you. (Will also look more closely at
checklist.)  It needs an `na.omit()` or the like, and even with that `rbind`
still barked at a few entries (i = 19 if you select the full vector right now).

But definitely something to play with and possibly build upon. Thanks!  (And
the IDs of Floris and you were two of the ones I 'manually' added to a
DESCRIPTION file ;-)

Best,  Dirk
#
Hi Dirk,

Happy to help. I'm working on a new version of the checklist package. I
could export the function if that makes it easier for you.

Best regards,

Thierry



On Mon 19 Aug 2024 at 15:39, Dirk Eddelbuettel <edd at debian.org> wrote:

  
  
#
Salut Thierry,
On 20 August 2024 at 13:43, Thierry Onkelinx wrote:
| Happy to help. I'm working on a new version of the checklist package. I could
| export the function if that makes it easier for you.

Would be happy to help / iterate. Can you take a stab at making the
per-column split more robust so that we can bulk-process all non-NA entries
of the returned db?

Best, Dirk
#
Dirk et al,

Sorry for not replying any sooner :-)

I think for now you could use something like what I attach below.

Not ideal: not too long ago I started adding orcidtools.R to tools,
which e.g. has .persons_from_metadata(), but that works on the unpacked
sources and not the CRAN package db.  Need to think about that ...

Best
-k

********************************************************************
x <- tools::CRAN_package_db()
## Parse each non-NA Authors@R entry into a person object; entries that
## are NA or fail to parse become NULL.
a <- lapply(x[["Authors@R"]],
            function(a) {
                if(!is.na(a)) {
                    a <- tryCatch(utils:::.read_authors_at_R_field(a),
                                  error = identity)
                    if (inherits(a, "person"))
                        return(a)
                }
                NULL
            })
a <- do.call(c, a)
## Keep only persons with an ORCID comment, one row per person.
a <- lapply(a,
            function(e) {
                if(is.null(o <- e$comment["ORCID"]) || is.na(o))
                    return(NULL)
                cbind(given = paste(e$given, collapse = " "),
                      family = paste(e$family, collapse = " "),
                      oid = unname(o))
            })
a <- as.data.frame(do.call(rbind, a))
********************************************************************
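For trying the pattern above without downloading the CRAN package db, here is a minimal offline sketch. The Authors@R strings are made-up placeholders (the second iD is the one ORCID documents for its fictitious test person Josiah Carberry), parsing is done with eval(parse()) as in Thierry's example rather than with the internal utils:::.read_authors_at_R_field(), and rows are built with data.frame() instead of cbind().

```r
# Made-up Authors@R strings standing in for x[["Authors@R"]].
aar <- c(
  'c(person("Jane", "Doe", role = c("aut", "cre"),
            comment = c(ORCID = "0000-0001-2345-6789")),
     person("John", "Roe", role = "aut"))',
  NA_character_,
  'person("Josiah Carberry", role = "aut",
          comment = c(ORCID = "https://orcid.org/0000-0002-1825-0097"))'
)

# Parse each non-NA entry; NA or unparsable entries become NULL and are dropped.
persons <- lapply(aar, function(s) {
  if (is.na(s)) return(NULL)
  p <- tryCatch(eval(parse(text = s)), error = function(e) NULL)
  if (inherits(p, "person")) p else NULL
})
persons <- do.call(c, Filter(Negate(is.null), persons))

# One row per person that carries an ORCID comment.
rows <- lapply(seq_along(persons), function(i) {
  e <- persons[i]
  o <- e$comment["ORCID"]
  if (is.null(o) || is.na(o)) return(NULL)
  data.frame(given  = paste(e$given, collapse = " "),
             family = paste(e$family, collapse = " "),
             oid    = unname(o))
})
a <- do.call(rbind, Filter(Negate(is.null), rows))
print(a)
```

The third entry deliberately puts the whole name into 'given', so its 'family' column comes out empty.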
#
Hi Kurt,
On 20 August 2024 at 14:29, Kurt Hornik wrote:
| I think for now you could use something like what I attach below.
| 
| Not ideal: I had not too long ago starting adding orcidtools.R to tools,
| which e.g. has .persons_from_metadata(), but that works on the unpacked
| sources and not the CRAN package db.  Need to think about that ...

We need something like that too as I fat-fingered the string 'ORCID'. See
fortune::fortunes("Dirk can type").

Will try the function below later. Many thanks for sending it along.

Dirk

#
As I think that should be

fortunes::fortune("Dirk can type")

rather than

fortune::fortunes("Dirk can type")

I think that has become both recursive and demonstrating excellent 
test-retest stability!? Oh boy do I know that issue!

Chris
On 20/08/2024 14:57, Dirk Eddelbuettel wrote:
#
On 20 August 2024 at 15:13, Chris Evans wrote:
| As I think that should be
| 
| fortunes::fortune("Dirk can type")
| 
| rather than
| 
| fortune::fortunes("Dirk can type")

Yes, thank you. I also failed to run that post through CI and testing before
sending.  Doing too many things at once...

Dirk
 
| -- 
| Chris Evans (he/him)
| Visiting Professor, UDLA, Quito, Ecuador & Honorary Professor, 
| University of Roehampton, London, UK.
| Work web site: https://www.psyctc.org/psyctc/
| CORE site: http://www.coresystemtrust.org.uk/
| Personal site: https://www.psyctc.org/pelerinage2016/
| Emeetings (Thursdays): 
| https://www.psyctc.org/psyctc/booking-meetings-with-me/
| (Beware: French time, generally an hour ahead of UK)
| <https://ombook.psyctc.org/book>
| ______________________________________________
| R-package-devel at r-project.org mailing list
| https://stat.ethz.ch/mailman/listinfo/r-package-devel
#
On 20 August 2024 at 07:57, Dirk Eddelbuettel wrote:
| Will try the function below later. Many thanks for sending it along.

Very nice. Resisted my common impulse to make it a data.table for easy
sorting via keys etc.  After running your code the line

   head(with(a, sort_by(a, ~ family + given)), 100)

shows that we need a bit more QA: some person entries are not properly split
between 'family' and 'given', some ORCID entries use the URL form, and we
have repeats. Excluding those is next.

Dirk
 
#
Right.  One should canonicalize the ORCID iD (accepting the URL form is just
being nice) and then do unique() ...

Best
-k
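That clean-up step could look roughly like the sketch below. It assumes a data frame with the columns given, family and oid from the earlier code. The regular expression is a hand-written approximation of the accepted spellings (bare iD, orcid.org URL, optional angle brackets), not the internal one in tools, and orcid_canonicalize() is a hypothetical helper name. The rows are made-up.

```r
# Hypothetical helper: strip an optional <...> wrapper and orcid.org URL prefix,
# leaving the bare iD. Strings that do not match are returned unchanged.
orcid_canonicalize <- function(x) {
  id <- "([0-9]{4}-){3}[0-9]{3}[0-9X]"
  sub(sprintf("^<?(https?://)?(orcid\\.org/)?(%s)>?$", id), "\\3", x)
}

# Made-up rows: the same person appears twice with two spellings of one iD.
a <- data.frame(
  given  = c("Jane", "Jane", "Josiah Carberry"),
  family = c("Doe", "Doe", ""),
  oid    = c("0000-0001-2345-6789",
             "https://orcid.org/0000-0001-2345-6789",
             "<https://orcid.org/0000-0002-1825-0097>")
)

a$oid <- orcid_canonicalize(a$oid)  # every iD now in bare form
a <- unique(a)                      # duplicates collapse once spellings agree
print(a)
```

Canonicalizing before unique() matters: the two Jane Doe rows differ only in how the iD is spelled, so they would survive de-duplication otherwise.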
#
The variant attached drops the URL and does unique().

Hmm, the ones in

  head(with(a, sort_by(a, ~ family + given)), 100)

without a family look suspicious ...

Best
-k


Attachment: orcid.R
URL: <https://stat.ethz.ch/pipermail/r-package-devel/attachments/20240820/76546959/attachment.ksh>
#
Looking into one particular example,

https://github.com/seabbs/idmodelr/blob/master/DESCRIPTION

this appears to be the authors' fault:

Authors@R: c(
     person(given = "Sam Abbott",
            role = c("aut", "cre"),
            email = "contact at samabbott.co.uk",
            comment = c(ORCID = "0000-0001-8057-8037")),
     person(given = "Akira Endo",
            role = c("aut"),
            email = "akira.endo at lshtm.ac.uk",
            comment = c(ORCID = "0000-0001-6377-7296")))

   Maybe CRAN should start checking for missing 'family' fields in
Authors@R ... ???

   cheers
    Ben Bolker
On 2024-08-20 9:47 a.m., Kurt Hornik wrote:

  
    
#
Dear Ben,

This is as simple as making the given and family fields mandatory.
checklist::check_description() ensures that given and family are set unless
the role is "cph" or "fnd", which allows organisations to be listed with
only the given field.
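The rule described above could be approximated in base R roughly as follows. This is a sketch of the idea only, not the code in checklist::check_description(); missing_family() is a hypothetical helper name, and reading "the role is cph or fnd" as "all roles of the entry are cph or fnd" is an assumption.

```r
# Hypothetical helper: flag person entries without a family name, unless every
# role of the entry is "cph" or "fnd" (e.g. an organisation listed by 'given' only).
missing_family <- function(p) {
  vapply(seq_along(p), function(i) {
    e <- p[i]
    no_family <- is.null(e$family) || !any(nzchar(e$family))
    org_only  <- length(e$role) > 0L && all(e$role %in% c("cph", "fnd"))
    no_family && !org_only
  }, logical(1))
}

# Made-up entries for illustration.
p <- c(
  person("Jane", "Doe", role = c("aut", "cre")),
  person("Jane Doe", role = "aut"),                     # family missing: flagged
  person("Example Institute", role = c("cph", "fnd"))   # organisation: allowed
)
flagged <- missing_family(p)
print(flagged)
```

Applied to the idmodelr example quoted earlier in the thread, both entries would be flagged, since each puts the full name into 'given'.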

The 0.4.1 branch of checklist
<https://github.com/inbo/checklist/tree/0.4.1> now
exports the author2df() function, which can now handle objects of class
person, list, logical (NA) and NULL. Feedback is welcome.

library(checklist)
df <- tools::CRAN_package_db()
vapply(
  df$`Authors@R`[df$Package %in% c("git2rdata", "A3", "digest", "abe")],
  function(x) {
    parse(text = x) |>
      eval() |>
      list()
  },
  vector(mode = "list", 1)
) |>
  unname() |>
  author2df()

Best regards,

ir. Thierry Onkelinx


On Tue 20 Aug 2024 at 15:59, Ben Bolker <bbolker at gmail.com> wrote:

  
  
#
On 20 August 2024 at 15:47, Kurt Hornik wrote:
| The variant attached drops the URL and does unique().

Nice. Alas, some of us default to r-release as the daily driver and then

  Error in unname(tools:::.ORCID_iD_canonicalize(o)) : 
    object '.ORCID_iD_canonicalize' not found
  > 

Will play with my 'RD' which I keep approximately 'weekly-current'. Quick
rebuild first.

Dirk
#
Meanwhile, I am close to committing a change to R-devel which adds
tools::CRAN_authors_db() with docs

  \code{CRAN_authors_db()} returns information on the authors of the
  current CRAN packages extracted from the \samp{Authors@R} fields in
  the package \file{DESCRIPTION} files, as a data frame with character
  columns giving the given and family names, email addresses,
  \abbr{ORCID} identifier, roles, and comments of the person entries,
  and the corresponding package.

Once make check-all is done ...

Best
-k

PS.  Sorry about tools:::.ORCID_iD_canonicalize(), had run into the same
issue when building the authors db on the CRAN master (which uses
current R release ...)
#
On 21 August 2024 at 07:43, Dirk Eddelbuettel wrote:
| Nice. Alas, some of us default to r-release as the daily driver and then
| 
|   Error in unname(tools:::.ORCID_iD_canonicalize(o)) : 
|     object '.ORCID_iD_canonicalize' not found
|   > 
| 
| Will play with my 'RD' which I keep approximately 'weekly-current'. Quick
| rebuild first.

As simple as adding

  .ORCID_iD_canonicalize <- function (x) sub(tools:::.ORCID_iD_variants_regexp, "\\3", x)

and making the call (or maybe making it a lambda anyway ...)

  oid = unname(.ORCID_iD_canonicalize(o)))

After adding

  a <- sort_by(a, ~ a$family + a$given)

the first 48 out of a (currently) total of 6465 are empty for family.

  > sum(a$family == "")
  [1] 48
  > 

Rest is great!

Dirk
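Since sort_by() only arrived with R 4.4.0, the same ordering and count can be sketched with order() on older versions. The rows below are made-up, and only the name columns are needed for this step.

```r
# Made-up rows; one entry has an empty family name.
a <- data.frame(given  = c("Jane", "Josiah Carberry", "John"),
                family = c("Doe", "", "Abel"))

# Base R equivalent of sorting by family, then given.
a <- a[order(a$family, a$given), ]

# Entries with an empty family name sort to the top; count them.
n_empty <- sum(a$family == "")
print(head(a))
print(n_empty)
```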
#
Committed now.

Best
-k
#
On 21 August 2024 at 15:47, Kurt Hornik wrote:
| Committed now.

That is just *lovely*:

   > aut <- tools::CRAN_authors_db()
   > dim(aut)
   [1] 47433     7
   > head(aut)
         given  family                     email               orcid     role comment       package
   1    Martin   Bladt    martinbladt at math.ku.dk                <NA> aut, cre    <NA> AalenJohansen
   2 Christian  Furrer         furrer at math.ku.dk                <NA>      aut    <NA> AalenJohansen
   3    Sercan Kahveci sercan.kahveci at plus.ac.at                <NA> aut, cre    <NA>      AATtools
   4    Andrew   Pilny        andy.pilny at uky.edu 0000-0001-6603-5490 aut, cre    <NA>   abasequence
   5   Sigbert  Klinke      sigbert at hu-berlin.de                <NA> aut, cre    <NA>    abbreviate
   6  Csillery Katalin   kati.csillery at gmail.com                <NA>      aut    <NA>           abc
   > 

Can we possibly get this into r-patched and the next r-release?

Dirk
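One way such a data frame might then serve the lookup that started this thread is sketched below. The toy rows only mimic the column names printed above, and find_orcid() is a hypothetical helper, not part of tools. On an R version that provides it, aut <- tools::CRAN_authors_db() could replace the toy data.

```r
# Toy stand-in with the columns printed above; all rows are made-up.
aut <- data.frame(
  given   = c("Jane", "Jane", "John", "Josiah Carberry"),
  family  = c("Doe", "Doe", "Roe", NA),
  email   = NA_character_,
  orcid   = c("0000-0001-2345-6789", "0000-0001-2345-6789", NA,
              "0000-0002-1825-0097"),
  role    = "aut",
  comment = NA_character_,
  package = c("pkgA", "pkgB", "pkgA", "pkgC")
)

# Hypothetical helper: case-insensitive match on family name, returning the
# unique (given, family, orcid) rows that actually carry an ORCID iD.
find_orcid <- function(db, family) {
  hit <- !is.na(db$orcid) & !is.na(db$family) &
    tolower(db$family) == tolower(family)
  unique(db[hit, c("given", "family", "orcid")])
}

res <- find_orcid(aut, "doe")
print(res)
```

Dropping the package column before unique() collapses authors who list the same iD in several packages into a single row.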
#
Possibly yes, if there is enough "need" :-)

Best
-k