
Question about quantile fuzz and GPL license

5 messages · Abel AOUN, Ben Bolker, Andre GILLIBERT, Martin Maechler

#
Hello,

I'm currently working on the Python NumPy package to develop linear interpolation methods for quantiles.
Currently, NumPy only supports type 7 of Hyndman & Fan, and I have implemented the 8 other methods so that it can do as much as R's quantile().

As you may guess, I was inspired by the R implementation as well as by other sources, which leads to my questions:

About the fuzz (see the first reference below for the source code):
fuzz <- 4 * .Machine$double.eps
I think I understand why the machine epsilon is used: to correct some edge cases where float comparisons would fail.
However, I don't get why epsilon is multiplied by 4 instead of simply using epsilon itself.
Can someone explain this 4?
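For illustration, here is a minimal Python sketch of what the fuzz accomplishes (the variable name nppm follows R's quantile.R; 49 * (1/49) is just a convenient input whose product rounds one ulp below 1.0 — this shows the mechanism, not why the factor is 4):

```python
import math
import sys

# Machine epsilon for double precision (same value as R's .Machine$double.eps)
EPS = sys.float_info.epsilon
FUZZ = 4 * EPS

# An input that should be exactly 1.0 but rounds one ulp below it
nppm = 49 * (1 / 49)          # 0.9999999999999999

# Without fuzz, floor() lands on the wrong integer
j_raw = math.floor(nppm)      # 0

# With fuzz, the nearly-integer value is pushed over the boundary
j = math.floor(nppm + FUZZ)   # 1
h = nppm - j                  # tiny negative residual
if abs(h) < FUZZ:
    h = 0.0                   # snap the interpolation weight to zero
```

This mirrors the floor(nppm + fuzz) pattern around the fuzz line in the quantile.R source referenced below; it shows what the fuzz does, not how the constant 4 was chosen.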

About the licence:
NumPy is under the BSD license and R is under the GPL.
The only thing I really cherry-picked and rewrote for NumPy is the fuzz part.
I'm quite new to open source development. We are wondering whether doing this breaks the GPL, and whether I can credit the original authors.
Also, I'm not quite sure this is the right place to ask; if not, sorry for the noise.
The relevant discussion on the NumPy PR is here: https://github.com/numpy/numpy/pull/19857#discussion_r706019184


Thank you for your time. 

Regards, 
Abel Aoun 


References:
The source code for R's quantile (fuzz is at line 82): https://github.com/wch/r-source/blob/79298c499218846d14500255efd622b5021c10ec/src/library/stats/R/quantile.R
R documentation for quantile: https://www.rdocumentation.org/packages/stats/versions/3.5.0/topics/quantile
The ongoing NumPy PR: https://github.com/numpy/numpy/pull/19857
#
On 9/14/21 9:22 AM, Abel AOUN wrote:
No, but doing a bit of archaeology,

https://github.com/wch/r-source/blame/trunk/src/library/stats/R/quantile.R

gives the commit message for these lines as "add (modified) version of
quantile.default from Rob Hyndman" (17 years ago). This commit was made
by Brian Ripley.

However, the code from Rob Hyndman here:

https://stat.ethz.ch/pipermail/r-devel/2004-July/030204.html

does **not** have the lines with the fuzz. So my guess would be that
Brian Ripley is the author of that particular bit of code.

I can't say, myself, what the logic behind 4 * .Machine$double.eps is ...

#
On 9/14/21 9:22 AM, Abel AOUN wrote:
.Machine$double.eps is the "precision" of floating point values for values close to 1.0 (between 0.5 and 2.0).

Using fuzz = .Machine$double.eps would have no effect if nppm is greater than or equal to 2.
Using fuzz = 4 * .Machine$double.eps can fix rounding errors for nppm < 8; for greater nppm, it has no effect.

Indeed:
2 + .Machine$double.eps == 2
8 + 4*.Machine$double.eps == 8

Since nppm is approximately equal to the quantile multiplied by the sample size, it can be much greater than 2 or 8.

Maybe the rounding errors are only problematic for small nppm, or maybe only that case is taken into account.

Moreover, if rounding errors are cumulative, they can be much greater than the precision of the floating point value. I do not know how this constant was chosen and what the use-cases were.
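These spacing claims are easy to check directly. A small Python sketch (sys.float_info.epsilon is the same quantity as R's .Machine$double.eps):

```python
import sys

eps = sys.float_info.epsilon   # 2**-52 ~ 2.22e-16, ulp spacing just above 1.0

# Near 1.0, one epsilon is visible
assert 1.0 + eps != 1.0

# Near 2.0 the ulp spacing doubles, so a single epsilon is lost to rounding
assert 2.0 + eps == 2.0

# Near 8.0 the spacing is 8*eps, so even 4*eps rounds away ...
assert 8.0 + 4 * eps == 8.0

# ... while 8*eps survives (it is exactly one ulp)
assert 8.0 + 8 * eps != 8.0
```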

--
Sincerely
Andre GILLIBERT
#

> On 9/14/21 9:22 AM, Abel AOUN wrote:
>> However I don't get why epsilon is multiplied by 4 instead of simply using epsilon.
>> Is there someone who can explain this 4 ?

> .Machine$double.eps is the "precision" of floating point values for values close to 1.0 (between 0.5 and 2.0).

> Using fuzz = .Machine$double.eps would have no effect if nppm is greater than or equal to 2.
> Using fuzz = 4 * .Machine$double.eps can fix rounding errors for nppm < 8; for greater nppm, it has no effect.

> Indeed:
> 2 + .Machine$double.eps == 2
> 8 + 4*.Machine$double.eps == 8

> Since nppm is approximately equal to the quantile multiplied by the sample size, it can be much greater than 2 or 8.

Hmm: not the "quantile":
it is approximately equal to the *'prob'* multiplied by the sample size
{the quantiles themselves can be on any scale anyway, but fortunately
they don't matter yet in these parts of the calculations}.

But you're right on the main point: they are
approximately proportional to n.

> Maybe the rounding errors are only problematic for small nppm; or only that case is taken in account.

> Moreover, if rounding errors are cumulative, they can be much greater than the precision of the floating point value. I do not know how this constant was chosen and what the use-cases were.

I vaguely remember wondering about this too (back at the time).

Experiential wisdom would tell us to take such fuzz values as
*relative* to the magnitude of the values they are added to,
here 'nppm' (which is always >= 0, hence no need for abs(.) as usual).

So, instead of

    j <- floor(nppm + fuzz)
    h <- nppm - j
    if(any(sml <- abs(h) < fuzz, na.rm = TRUE)) h[sml] <- 0

it would be (something like)

    j <- floor(nppm*(1 + fuzz))
    h <- nppm - j
    if(any(sml <- abs(h) < fuzz*nppm, na.rm = TRUE)) h[sml] <- 0

or rather we would define fuzz as
   nppm * (k * .Machine$double.eps) 
for a small k.
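This point can be illustrated numerically. In the Python sketch below, 0.29 * 100 is an input that rounds one ulp below 29; at that magnitude an absolute fuzz of 4*eps is smaller than half an ulp and vanishes on addition, while a relative fuzz survives (nppm follows the naming in quantile.R; this only contrasts the two schemes, it is not the actual R code):

```python
import math
import sys

eps = sys.float_info.epsilon
fuzz = 4 * eps

# Should be exactly 29, but rounds one ulp below it
nppm = 0.29 * 100             # 28.999999999999996

# Absolute fuzz: 4*eps is tiny compared with the ulp spacing near 29,
# so the addition rounds straight back down and floor() still misses
j_abs = math.floor(nppm + fuzz)          # 28

# Relative fuzz: scaling by (1 + 4*eps) moves nppm by about 29 * 4*eps,
# which is enough to cross the integer boundary
j_rel = math.floor(nppm * (1 + fuzz))    # 29
```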

- - -

OTOH, type=7 is the default, and I guess it is used in 99.9% of
all uses of quantile, *and* it never uses any fuzz ....

Martin

> --
> Sincerely
> Andre GILLIBERT


#
Martin Maechler wrote:
Indeed. This also implies that this default should be well thought out when creating a new implementation of the quantile() procedure for a new programming language or library.
Most of the time, users use the default procedure and do not report which procedure was used in their statistical analyses, whether in reports or in scientific or non-scientific articles.
The differences between the quantile procedures are minor, unless they are used in extreme scenarios such as a sample of size 2, or probs=0.001 for a sample of size 1000.
But standardization of procedures is desirable for reproducibility of analyses, as well as for teaching (see https://doi.org/10.1080/10691898.2006.11910589).

Hyndman and Fan wanted software packages to standardize on their definition, but to no avail:
see https://robjhyndman.com/hyndsight/sample-quantiles-20-years-later/

In the absence of a standard, my personal advice would be to use the same default as a popular statistical software package, such as R or SAS.

R, Julia, and NumPy (Python) use type 7 as the default.
Microsoft Excel and LibreOffice Calc use type 7 as the default (although Excel versions >= 2010 have new procedures).
SAS uses type 3 as the default, unless prob=0.50.
Stata uses type 2 or type 6, depending on the procedure (https://data.princeton.edu/stata/markdown/quantiles.htm).
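To see how much the defaults can differ, here is a minimal pure-Python sketch of two of the Hyndman & Fan definitions (type 7, the R/NumPy default, and type 6, one of the types Stata uses); hf_quantile is a hypothetical helper written for this illustration, not code from any of the packages mentioned:

```python
import math

def hf_quantile(xs, p, qtype=7):
    """Linear-interpolation sample quantile for Hyndman & Fan types 6 and 7.

    xs must be sorted; p is a probability in [0, 1].
    """
    n = len(xs)
    if qtype == 7:
        h = (n - 1) * p + 1      # 1-based position, type 7
    elif qtype == 6:
        h = (n + 1) * p          # 1-based position, type 6
    else:
        raise ValueError("only types 6 and 7 are sketched here")
    h = min(max(h, 1.0), float(n))   # clamp the position to the sample range
    j = math.floor(h)
    g = h - j                        # interpolation weight
    if j >= n:
        return float(xs[n - 1])
    return xs[j - 1] + g * (xs[j] - xs[j - 1])

data = list(range(1, 11))               # sample 1..10

q7 = hf_quantile(data, 0.25, qtype=7)   # 3.25
q6 = hf_quantile(data, 0.25, qtype=6)   # 2.75
```

For moderate probabilities the two answers are close (3.25 vs 2.75 here), but with probs=0.001 on a sample of size 1000 they disagree far more strongly, matching the extreme-scenario caveat above.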