complex NA's match(), etc: not back-compatible change proposal
Suharto Anggono
on Sat, 28 May 2016 09:34:08 +0000 writes:
> On 'factor', I meant the case where 'levels' is not
> specified, where 'unique' is called.
I see, thank you.
>> factor(c(complex(real=NaN), complex(imaginary=NaN)))
> [1] NaN+0i <NA>
> Levels: NaN+0i
> Look at <NA> in the result above. Yes, it happens in
> earlier versions of R, too.
Yes; let's call this "problem 1"
> On matching both NA and NaN, another consequence is that
> length(unique(.)) may depend on order.
> Example using R devel r70604:
>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>> (z <- z[is.na(z)])
> [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
> [9]   0+NaNi   1+NaNi       NA NaN+NaNi
>> length(print(unique(z)))
> [1] NA NaN+0i
> [1] 2
>> length(print(unique(c(z[8], z[-8]))))
> [1] NA
> [1] 1
> --------------------------------------------
Thank you, Suharto. I agree these are even more convincing
reasons to consider changing.
Let's call this ("matching both NA and NaN") "problem 2".
I think we agree that R-devel -- compared to previous
versions -- *is* consistent in its (C level) functions cequal()
and chash() and also is consistent with the documentation
of match()/unique()/duplicated().
Hence I think a change would have to affect all of the above,
including a change of documentation.
Also, the resolutions of "problem 1" and "problem 2" are related, but
--I think-- almost separate.
For the following, let's use a vector notation for complex
numbers, say
(a, b) := complex(real = a, imaginary = b)
With R (showing relevant examples):
##------------------------------------------------------------------------------
options(width = max(85, getOption("width"))) # so 'z' prints in one line
p.z <- function(z) print(noquote(paste0("(",Re(z),",",Im(z),")")))
z <- c(1,NA,NaN); z <- outer(z,z, complex, length.out=1); (z <- z[is.na(z)])
## NA NaN+ 1i NA NA NA 1+NaNi NA NaN+NaNi
p.z(z)
## (NA,1) (NaN,1) (1,NA) (NA,NA) (NaN,NA) (1,NaN) (NA,NaN) (NaN,NaN)
length(p.z(unique(z[ 1:8 ])))
## [1] (NA,1) (NaN,1)
## [1] 2
length(p.z(unique(z[ c(8,1:7) ])))
## [1] (NaN,NaN) (NA,1)
## [1] 2
length(p.z(unique(z[ c(7:8,1:6) ])))
## [1] (NA,NaN)
## [1] 1
##------------------------------------------------------------------------------
Problem 1:
To me, at the moment, it would seem most "natural" to consider a
change where the match()/unique()/duplicated() behavior matched
the behavior of print()/format()/as.character() for such
complex vectors.
I think this would automatically solve the issue that sometimes
length(unique(as.character(x))) > length(unique(x))
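To make the comparison above easy to try on any vector, here is a minimal sketch. The helper name `n_unique_both` is made up for this illustration and is not part of R. The counts for complex vectors containing NA/NaN depend on the R version, so no output is shown for that case.

```r
## Hypothetical helper (not part of base R): compare the number of unique
## values with the number of unique character representations.
n_unique_both <- function(x) {
    c(values     = length(unique(x)),
      characters = length(unique(as.character(x))))
}

## Plain numeric input: both counts agree.
n_unique_both(c(1, 1, 2))

## The complex case discussed in this thread; the result is version dependent.
z2 <- c(complex(real = NaN), complex(imaginary = NaN))
n_unique_both(z2)
```

Whenever "characters" exceeds "values", code such as factor() that assumes the opposite can end up with an <NA> entry.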
There are principally two solutions to this:
A: change match()/unique()/duplicated()
B: change print()/format()/as.character()
For A -- which seems "less disruptive" and more desirable to
me -- we would have to change cequal() {and chash()!} and say
that complex numbers with NA|NaN "match" if they have any NA, but
otherwise, both the regular (r,i) and the NaN must be at the
exact same places (and *different* NaNs should match, of course).
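The rule described for 'A' can be sketched at R level. This is only an illustration of the intended equivalence classes, not the C code of cequal()/chash(); the names `key1A` and `unique1A` are hypothetical. It also assumes that comparing the non-NA parts via their character representation is good enough, which only holds up to the precision of as.character().

```r
## Sketch of the equivalence rule of proposal 'A' (R level, hypothetical names):
## every value with an NA in either part gets one common key; all other values
## are keyed by their (Re, Im) pair, so the positions of NaN matter.
key1A <- function(z) {
    re <- Re(z); im <- Im(z)
    hasNA <- (is.na(re) & !is.nan(re)) | (is.na(im) & !is.nan(im))
    ifelse(hasNA, "NA", paste0("(", re, ",", im, ")"))
}
unique1A <- function(z) z[!duplicated(key1A(z))]

## The same 8 values as in the example above, built without relying on is.na(<complex>)
x0 <- c(1, NA, NaN)
g  <- expand.grid(re = x0, im = x0)
z  <- complex(real = g$re, imaginary = g$im)[is.na(g$re) | is.na(g$im)]

length(unique1A(z))               # 4: one NA, (NaN,1), (1,NaN), (NaN,NaN)
length(unique1A(z[c(7:8, 1:6)]))  # 4 again: the count is independent of the order
```

Because all NaNs print as "NaN", different NaN bit patterns share a key here, which is the "different NaNs should match" part of the rule.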
Problem 2: unique(z[i]) depends on the permutation 'i'
What should a change be here ... notably after the "proposed"
(rather only "considered") change '1 A' above ?
Can "the" new behavior easily be described in words (if '1 A'
above is already assumed)?
At the moment, I would not tackle Problem 2.
It would become less problematic once Problem 1 is solved
according to '1 A', because at least length(unique(.)) would
not change: It would contain *one* z[] with an NA, and all the
other z[]s.
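A cheap way to look for order dependence is to compute length(unique(.)) for every rotation of a vector. The sketch below does that; the helper name `n_unique_rotations` and its `uniq` argument are assumptions made for this illustration. Rotations are only a subset of all permutations, so equal counts here do not prove order independence.

```r
## Hypothetical helper: length(uniq(.)) for each rotation of 'z'.
## 'uniq' defaults to unique(), but any alternative can be plugged in.
n_unique_rotations <- function(z, uniq = unique) {
    n <- length(z)
    vapply(seq_len(n), function(k) {
        i <- c(k:n, seq_len(k - 1L))   # rotation of 1:n starting at position k
        length(uniq(z[i]))
    }, integer(1))
}

## Ordinary numbers: the count never depends on the order.
n_unique_rotations(c(3, 1, 3, 2))

## The complex NA/NaN values from this thread; the result is version dependent.
x0 <- c(1, NA, NaN)
z  <- outer(x0, x0, complex, length.out = 1)
z  <- z[is.na(z)]
n_unique_rotations(z)
```

If the returned counts are not all equal, unique() depends on the order of its input for that vector.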
Opinions ? Thank you in advance for chiming in..
Martin Maechler,
ETH Zurich
> On Mon, 23/5/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
> Cc: R-devel at r-project.org
> Date: Monday, 23 May, 2016, 11:06 PM
> >>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
> >>>>>     on Fri, 13 May 2016 16:33:05 +0000 writes:
>     > That, for example, complex(real=NaN) and complex(imaginary=NaN)
>     > are regarded as equal makes it possible that
>     >     length(unique(as.character(x))) > length(unique(x))
>     > (current code of function 'factor' doesn't expect it).
> Thank you, that is an interesting remark - but is already true,
> in [[elided Yahoo spam]]
> ..
> and of course this is because we do *print*  0+NaNi  etc,
> i.e., we differentiate the non-NA-but-NaN complex values in
> formatting / printing but not in match(), unique() ...
> and indeed, with the 'z' example below,
>     fz <- factor(z,z)
> gives a warning about duplicated levels and gives such warnings
> also in current (and previous) versions of R, at least for the slightly
> larger z I've used in the tests/reg-tests-1c.R example.
> For the moment I can live with that warning, as I don't think
> factor()s are constructed from complex numbers "often"...
> and the performance of factor() in the more regular cases is important.
>     > Yes, an argument for the behavior is that NA and NaN are of one kind.
>     > On my system, using 32-bit R for Windows from binary from CRAN,
>     > the result of sapply(z, match, table = z) (not in current R-devel)
>     > may be different from below:
>     > 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.10.1, different from below
>     > 1 2 3 4 1 3 7 8 2 4 8 12  # R 3.2.5, different from below
> interesting, thank you... and another reason why the change
> (currently only in R-devel) may have been a good one: More uniformity.
>     > I noticed that, by function 'cequal' in unique.c, a complex number
>     > that has both NA and NaN matches NA and also matches NaN.
>     >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>     >> (z <- z[is.na(z)])
>     > [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
>     > [9]   0+NaNi   1+NaNi       NA NaN+NaNi
>     >> sapply(z, match, table = z[8])
>     > [1] 1 1 1 1 1 1 1 1 1 1 1 1
>     >> match(z, z[8])
>     > [1] 1 1 1 1 1 1 1 1 1 1 1 1
> Yes, I see the same. But isn't it what we expect:
> All of our z[] entries have at least one NA or a NaN in their real
> or imaginary part, and since z[8] has both, it does match with all z[]'s
> either because of the NA or because of the NaN in common.
> Hence, currently, I don't think this needs to be changed...
> but if there are other reasons / arguments ...
> Thank you again,
> Martin Maechler
>     >> sessionInfo()
>     > R Under development (unstable) (2016-05-12 r70604)
>     > Platform: i386-w64-mingw32/i386 (32-bit)
>     > Running under: Windows XP (build 2600) Service Pack 2
>     > locale:
>     > [1] LC_COLLATE=English_United States.1252
>     > [2] LC_CTYPE=English_United States.1252
>     > [3] LC_MONETARY=English_United States.1252
>     > [4] LC_NUMERIC=C
>     > [5] LC_TIME=English_United States.1252
>     > attached base packages:
>     > [1] stats     graphics  grDevices utils     datasets  methods   base
>     > -----------------
> >>>>> Martin Maechler <maechler at stat.math.ethz.ch>
> >>>>>     on Tue, 10 May 2016 16:08:39 +0200 writes:
>     >> This is an RFC / announcement related to the 2nd part of PR#16885
>     >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
>     >> about complex NA's.
>     >> The (somewhat rare) incompatibility in R's 3.3.0 match() behavior for the
>     >> case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0
>     >> patched in the mean time} triggered some more comprehensive "research".
>     >> I found that we have had a long-standing inconsistency at least between the
>     >> documented and the real behavior.  I am claiming that the documented
>     >> behavior is desirable and hence R's current "real" behavior is buggy, and
>     >> I am proposing to change it, in R-devel (to be 3.4.0) for now.
>     > After the "roaring unanimous" assent (one private msg
>     > encouraging me to go forward, no dissenting voice, hence an
>     > "odds ratio" of +Inf in favor ;-)
>     > I have now committed my proposal to R-devel (svn rev. 70597) and
>     > some of us will be seeing the effect in package space within a
>     > day or so, in the CRAN checks against R-devel (not for
>     > bioconductor AFAIK; their checks use R-devel only when it is less
>     > than ca 6 months from release).
>     > It's still worthwhile to discuss the issue, if you come late
>     > to it, notably as ---paraphrasing Dirk on the R-package-devel list---
>     > the release of 3.4.0 is almost a year away, and so now is the
>     > best time to tinker with the API, in other words, consider breaking
>     > rarely used legacy APIs..
>     > Martin
>     >> In help(match) we have been saying
>     >> |  Exactly what matches what is to some extent a matter of definition.
>     >> |  For all types, \code{NA} matches \code{NA} and no other value.
>     >> |  For real and complex values, \code{NaN} values are regarded
>     >> |  as matching any other \code{NaN} value, but not matching \code{NA}.
>     >> for at least 10 years.  But we don't do that at all in the
>     >> complex case (and AFAIK never got a bug report about it).
>     >> Also, e.g., print(.) or format(.) do simply use "NA" for all
>     >> the different complex NA-containing numbers, where OTOH,
>     >> non-NA NaN's { <=> !is.nan(z) & is.na(z) }
>     >> in format() or print() do show the NaN in real and/or imaginary
>     >> parts; for an example, look at the "format" column of the matrix
>     >> below, after 'print(cbind' ...
>     >> The current match()---and duplicated(), unique() which are based on the same
>     >> C code---*do* distinguish almost all complex NA / NaN's which is
>     >> NOT according to documentation. I have found that this is just because
>     >> our hashing function for the complex case, chash() in R/src/main/unique.c,
>     >> is bogus in the sense that it is not compatible with the above documentation
>     >> and also not with the cequal() function (in the same file unique.c) for checking
>     >> equality of complex numbers.
>     >> As I have found, a *simplified* version of the chash() function
>     >> to make it compatible with cequal() does solve all the problems I've
>     >> indicated, and the current plan is to commit that change --- after some
>     >> discussion time, here on R-devel --- to the code base.
>     >> My change passes 'make check-all' fine, but I'm 100% sure that there will
>     >> be effects in package-space. ... one reason for this posting.
>     >> As mentioned above, note that the chash() function has been in
>     >> use for all three functions
>     >>     match()
>     >>     duplicated()
>     >>     unique()
>     >> and the change will affect all three --- but just for the case of complex
>     >> vectors with NA or NaN's.
>     >> To show more, a small R session -- using my version of R-devel
>     >> == the proposition:
>     >> The R script ('complex-NA-short.R') for (a bit more than) the
>     >> session is attached {{you can attach text/plain easily}}:
>     >>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>     >>> ##          --- = NA_real_  but that does not exist e.g., in R 2.3.1
>     >>> ##                similarly, '1L', '2L', .. do not exist e.g., in R 2.3.1
>     >>> (z <- z[is.na(z)])
>     >> [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
>     >> [9]   0+NaNi   1+NaNi       NA NaN+NaNi
>     >>> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
>     >> +     r <- matrix( , length(x), length(y))
>     >> +     for(i in seq(along=x))
>     >> +         for(j in seq(along=y))
>     >> +             r[i,j] <- identical(z[i], z[j], ...)
>     >> +     r
>     >> + }
>     >>> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
>     >>> ## a version that works in older versions of R, where identical() had fewer arguments!
>     >>> outerID.picky <- function(x,y) {
>     >> +     nF <- length(formals(identical)) - 2
>     >> +     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
>     >> + }
>     >>> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is a wild guess
>     >>> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
>     >>
>     >>  [1,] | . . . . . . . . . . .
>     >>  [2,] . | . . . . . . . . . .
>     >>  [3,] . . | . . . . . . . . .
>     >>  [4,] . . . | . . . . . . . .
>     >>  [5,] . . . . | . . . . . . .
>     >>  [6,] . . . . . | . . . . . .
>     >>  [7,] . . . . . . | . . . . .
>     >>  [8,] . . . . . . . | . . . .
>     >>  [9,] . . . . . . . . | . . .
>     >> [10,] . . . . . . . . . | . .
>     >> [11,] . . . . . . . . . . | .
>     >> [12,] . . . . . . . . . . . |
>     >>> try(# for older R versions
>     >> +     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
>     >> + )
>     >>> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
>     >> [1] 1 2 1 2 1 1 1 1 2 2 1 2
>     >>> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
>     >>> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
>     >>       format   Re   Im   mz
>     >>  [1,]       NA <NA> 0    1
>     >>  [2,] NaN+  0i NaN  0    2
>     >>  [3,]       NA <NA> 1    1
>     >>  [4,] NaN+  1i NaN  1    2
>     >>  [5,]       NA 0    <NA> 1
>     >>  [6,]       NA 1    <NA> 1
>     >>  [7,]       NA <NA> <NA> 1
>     >>  [8,]       NA NaN  <NA> 1
>     >>  [9,]   0+NaNi 0    NaN  2
>     >> [10,]   1+NaNi 1    NaN  2
>     >> [11,]       NA <NA> NaN  1
>     >> [12,] NaN+NaNi NaN  NaN  2
>     >>>
>     >> -------------------------------
>     >> Note that 'mz <- match(z, z)' and hence the last column of the matrix above
>     >> are very different in current R,
>     >> distinguishing most kinds of NA / NaN against the documentation (and the
>     >> real/numeric case).
>     >> Martin Maechler
>     >> R Core Team
>     >> ### Basically a shortened version of the PR#16885 -- complex part b)
>     >> ### of R/tests/reg-tests-1c.R
>     >> ## b) complex 'x' with different kinds of NaN
>     >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>     >> ##          --- = NA_real_  but that does not exist e.g., in R 2.3.1
>     >> ##                similarly, '1L', '2L', .. do not exist e.g., in R 2.3.1
>     >> (z <- z[is.na(z)])
>     >> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
>     >>     r <- matrix( , length(x), length(y))
>     >>     for(i in seq(along=x))
>     >>         for(j in seq(along=y))
>     >>             r[i,j] <- identical(z[i], z[j], ...)
>     >>     r
>     >> }
>     >> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
>     >> ## a version that works in older versions of R, [[elided Yahoo spam]]
>     >> outerID.picky <- function(x,y) {
>     >>     nF <- length(formals(identical)) - 2
>     >>     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
>     >> }
>     >> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is a wild guess
>     >> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
>     >> try(# for older R versions
>     >>     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
>     >> )
>     >> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
>     >> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
>     >> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
>     >> ## compute  match(z[i], z) , for  i = 1,2,..,12  :
>     >> (m1z <- sapply(z, match, table = z))
>     >> ## 1 2 1 2 2 2 1 2 2 2 1 2   # R 1.2.3  (2001-04-26)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 7   # R 1.4.1  (2002-01-30)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.5.1  (2002-06-17)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.8.1  (2003-11-21)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.0.1  (2004-11-15)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.1.1  (2005-06-20)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.3.1  (2006-06-01)
>     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.5.1  (2007-06-27)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.10.1 (2009-12-14)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.1.1  (2014-07-10)
>     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.2.5 -- and 3.3.0 patched
>     >> ## 1 2 1 2 1 1 1 1 2 2 1 2   # <<-- Martin's R-devel and proposed future R
>     >> if(!exists("anyNA", mode="function")) anyNA <- function(x) any(is.na(x))
>     >> stopifnot(apply(zRI, 2, anyNA)) # *all* are  NA *or* NaN (or both)
>     >> is.NA <- function(.) is.na(.) & !is.nan(.)
>     >> (iNaN <- apply(zRI, 2, function(.) any(is.nan(.))))
>     >> (iNA <-  apply(zRI, 2, function(.) any(is.NA (.)))) # has non-NaN NA's
>     >> ## In Martin's version of R-devel :
>     >> stopifnot(identical(m1z == 1, iNA),
>     >>           identical(m1z == 2, !iNA))
>     >> ## m1z uses match(x, *) with length(x) == 1 and failed in R 3.3.0
>     >> stopifnot(identical(m1z, mz))
>     >> ______________________________________________
>     >> R-devel at r-project.org mailing list
>     >> https://stat.ethz.ch/mailman/listinfo/r-devel