duplicated.data.frame() is broken on data frames containing \r

2 messages · Hervé Pagès

#
Hi,

The trick used by duplicated.data.frame() is to transform the supplied
data.frame into a character vector by pasting together the columns using
"\r" as separator. But no precautions are taken to deal with "\r" in
the supplied data.frame. As a consequence it's easy to imagine
situations where duplicated.data.frame() returns an incorrect answer:

   > df <- data.frame(a=c("AA", "AA\r"), b=c("\rBBB", "BBB"))
   > df
        a     b
   1   AA \rBBB
   2 AA\r   BBB
   > duplicated(df)
   [1] FALSE  TRUE
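
To make the collision visible, here is a minimal sketch that mimics the pasting step described above. It assumes the method boils down to a paste() call with sep = "\r"; the variable name keys is only illustrative.

```r
# Illustrative only: reproduce the "paste the columns with \r" step by hand.
df <- data.frame(a = c("AA", "AA\r"), b = c("\rBBB", "BBB"),
                 stringsAsFactors = FALSE)

# One string per row, columns joined with "\r" as the separator.
keys <- do.call(paste, c(df, sep = "\r"))

# Row 1 is "AA" + "\r" + "\rBBB", row 2 is "AA\r" + "\r" + "BBB":
# both collapse to the same key, "AA\r\rBBB".
identical(keys[1], keys[2])   # TRUE
duplicated(keys)              # FALSE TRUE
```

The two rows differ in both columns, yet their keys are identical, which is why the second row is reported as a duplicate.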

Cheers,
H.

 > sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
#
OK, it's actually documented:

      The data frame method works by pasting together a character
      representation of the rows separated by ‘\r’, so may be imperfect
      if the data frame has characters with embedded carriage returns or
      columns which do not reliably map to characters.

But what about fixing it? One possible fix is to use "\r\r" as
separator and to substitute user-supplied "\r" with, say, "#\r#".
Just an example.
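
A rough sketch of that idea, for illustration only. The helper name row_keys is hypothetical and not part of base R, and the code assumes every column can be converted with as.character().

```r
# Hypothetical helper (not in base R): build one key per row using the
# escaping scheme suggested above.
row_keys <- function(df) {
  # Replace every "\r" in each cell with "#\r#". After this step an encoded
  # cell never starts or ends with "\r" and never contains "\r\r".
  cols <- lapply(df, function(col)
    gsub("\r", "#\r#", as.character(col), fixed = TRUE))
  # Join the columns with "\r\r", which can now only occur at a column
  # boundary.
  do.call(paste, c(unname(cols), sep = "\r\r"))
}

df <- data.frame(a = c("AA", "AA\r"), b = c("\rBBB", "BBB"),
                 stringsAsFactors = FALSE)
duplicated(row_keys(df))   # FALSE FALSE
```

This only addresses embedded carriage returns. Columns that do not reliably map to characters, the other caveat in the documentation, would still be a problem.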

Thanks,
H.
On 07/29/2013 11:52 AM, Hervé Pagès wrote: