
Corrupt data frame construction - bug?

6 messages · Steven McKinney, Duncan Murdoch, Wacek Kusnierczyk

#
Hi useRs,

A recent coding infelicity along these lines
yielded a corrupt data frame.

foo <- matrix(1:12, nrow = 3)
bar <- data.frame(foo)
bar$NewCol <- foo[foo[, 1] == 4, 4]
bar
lapply(bar, length)
  X1 X2 X3 X4 NewCol
1  1  4  7 10   <NA>
2  2  5  8 11   <NA>
3  3  6  9 12   <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs
$X1
[1] 3

$X2
[1] 3

$X3
[1] 3

$X4
[1] 3

$NewCol
[1] 0


Is this a bug in the data.frame machinery?
If an attempt is made to add a new column
to a data frame, and the new object does
not have length = number of rows of data frame,
or cannot be made to have such length via recycling,
shouldn't an error be thrown?

Instead in this example I end up with a
"corrupt data frame" having one zero-length column.


Should this be reported as a bug, or did I misinterpret
the documentation?
R version 2.9.0 (2009-04-17) 
powerpc-apple-darwin8.11.1 

locale:
en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] nlme_3.1-90

loaded via a namespace (and not attached):
[1] grid_2.9.0      lattice_0.17-22 tools_2.9.0

Also occurs on a Windows box with R 2.8.1



Steven McKinney

Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: smckinney +at+ bccrc +dot+ ca

tel: 604-675-8000 x7561

BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C. 
V5Z 1L3
Canada
#
On 29/04/2009 6:41 PM, Steven McKinney wrote:
I don't think "$" uses any data.frame machinery.  You are working at a 
lower level.

If you had added the new column using

bar <- data.frame(bar, NewCol=foo[foo[, 1] == 4, 4])

you would have seen the error:

Error in data.frame(bar, NewCol = foo[foo[, 1] == 4, 4]) :
   arguments imply differing number of rows: 3, 0
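
The constructor route can be tried without aborting a script by wrapping it in tryCatch(). This is only a sketch using base R; the variable names `new_col` and `res` are illustrative.

```r
foo <- matrix(1:12, nrow = 3)
bar <- data.frame(foo)
new_col <- foo[foo[, 1] == 4, 4]  # integer(0): no row has 4 in column 1

# data.frame() validates row counts, so the ragged column is rejected.
res <- tryCatch(
  data.frame(bar, NewCol = new_col),
  error = function(e) e
)

inherits(res, "error")  # TRUE
conditionMessage(res)   # the "differing number of rows" message
ncol(bar)               # bar itself is unchanged: still 4 columns
```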

But since you treated it as a list, it let you go ahead and create 
something that was labelled as a data.frame but wasn't.  This is one of 
the reasons some people prefer S4 methods:  it's easier to protect 
against people who mislabel things.

Duncan Murdoch
#
Thanks Duncan,

Comments and a proposed bug fix in-line below:
I did some more digging on '$' - there is a data.frame method for it,
as getAnywhere("$<-.data.frame") shows:
A single object matching '$<-.data.frame' was found
It was found in the following places
  package:base
  registered S3 method for $<- from namespace base
  namespace:base
with value

function (x, i, value) 
{
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows) 
            stop(gettextf("replacement has %d rows, data has %d", 
                N, nrows), domain = NA)
        if (N < nrows && N > 0L) 
            if (nrows%%N == 0L && length(dim(value)) <= 1L) 
                value <- rep(value, length.out = nrows)
            else stop(gettextf("replacement has %d rows, data has %d", 
                N, nrows), domain = NA)
        if (is.atomic(value)) 
            names(value) <- NULL
    }
    x[[i]] <- value
    class(x) <- cl
    return(x)
}
<environment: namespace:base>

I placed a browser() command before return(x) and did some poking
around.

It seems to me there's a bug in this function.  It should be able to
detect the problem I threw at it, and throw an error as you point out is
thrown by the other data.frame assign method.


I modified the lines
        if (N < nrows && N > 0L) 
            if (nrows%%N == 0L && length(dim(value)) <= 1L)
to read
        if (N < nrows) 
            if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L)

as in

"$<-.data.frame" <-
function (x, i, value) 
{
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows) 
            stop(gettextf("replacement has %d rows, data has %d", 
                N, nrows), domain = NA)
        if (N < nrows) 
            if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L) 
                value <- rep(value, length.out = nrows)
            else stop(gettextf("replacement has %d rows, data has %d", 
                N, nrows), domain = NA)
        if (is.atomic(value)) 
            names(value) <- NULL
    }
    x[[i]] <- value
    class(x) <- cl
    return(x)
} 

Now it detects the problem I created, in the fashion you demonstrated
above for the replacement using data.frame().
Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) : 
  replacement has 0 rows, data has 3
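
The effect of moving the `N > 0L` test can be tabulated with a small sketch. The helper `outcome` is hypothetical: it mirrors only the length checks of the two versions listed above, with `ndim` standing in for `length(dim(value))`, and it returns a label instead of modifying anything.

```r
# Hypothetical helper mirroring the length checks only.
outcome <- function(N, nrows, patched, ndim = 0L) {
  if (N > nrows) return("error")
  # Original: the inner test is entered only if N > 0.
  # Patched: it is entered whenever N < nrows.
  enter <- if (patched) N < nrows else (N < nrows && N > 0L)
  if (enter) {
    # N > 0L is redundant for the original version, since the outer
    # test already guarantees it there.
    if (N > 0L && nrows %% N == 0L && ndim <= 1L) return("recycle")
    return("error")
  }
  "assign"
}

N <- 0:4
cmp <- data.frame(
  N = N,
  original = vapply(N, outcome, character(1), nrows = 3L, patched = FALSE),
  patched  = vapply(N, outcome, character(1), nrows = 3L, patched = TRUE)
)
cmp
```

For a 3-row data frame the two versions differ only in the N = 0 row, where the original falls through to plain assignment and the patched version stops.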

It doesn't appear to stumble on weird data frames (these from the
?data.frame help page)
Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) : 
  replacement has 0 rows, data has 10

### Catches this problem above alright.
[1] x      y      fac    NewCol
<0 rows> (or 0-length row.names)

### Lets the above one through alright.
[1] NewCol
<0 rows> (or 0-length row.names)
### Lets the above one through alright.
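
A sketch of how checks like these might be rerun. It assumes the test objects follow the `d`, `d.0` and `d00` examples on the ?data.frame help page, and it keeps the patched logic in a local function under the hypothetical name `assign_col`, so that base R's own method is left untouched.

```r
# Local copy of the patched logic (hypothetical name).
assign_col <- function(x, i, value) {
  cl <- oldClass(x)
  class(x) <- NULL
  nrows <- .row_names_info(x, 2L)
  if (!is.null(value)) {
    N <- NROW(value)
    if (N > nrows)
      stop(gettextf("replacement has %d rows, data has %d", N, nrows),
           domain = NA)
    if (N < nrows) {
      if (N > 0L && nrows %% N == 0L && length(dim(value)) <= 1L)
        value <- rep(value, length.out = nrows)
      else
        stop(gettextf("replacement has %d rows, data has %d", N, nrows),
             domain = NA)
    }
    if (is.atomic(value)) names(value) <- NULL
  }
  x[[i]] <- value
  class(x) <- cl
  x
}

empty <- integer(0)
d <- data.frame(x = 1, y = 1:10,
                fac = sample(LETTERS[1:3], 10, replace = TRUE))

# 10 rows, 0-length replacement: expected to be rejected.
r10 <- tryCatch(assign_col(d, "NewCol", empty),
                error = function(e) conditionMessage(e))

# 0 rows, 3 columns: a 0-length column is consistent.
r0 <- assign_col(d[FALSE, ], "NewCol", empty)

# 0 rows, 0 columns: likewise.
r00 <- assign_col(d[, FALSE][FALSE, ], "NewCol", empty)

names(r0)
names(r00)
```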


Would the above modification work to fix this problem?
#
On 29/04/2009 9:21 PM, Steven McKinney wrote:
Thanks; sorry for the misinformation about the $ method.

I'm not going to have time today to look at the patch, but will check it 
out tomorrow, unless someone else gets there first.

Duncan Murdoch
#
Duncan Murdoch wrote:
well, there is the function `$<-.data.frame`.  why does

    bar$NewCol <- ...

*not* dispatch to $<-.data.frame?  $<- is used on bar, and bar is a data
frame:

    is(bar)
    # "data.frame" ...

    trace('$<-.data.frame')
    bar$foo <- 1
    # no output

    trace('$<-')
    bar$foo <- 1
    # trace: `$<-`(`*tmp*`, foo, value = 1)

(still with the ugly *tmp*-hack)

and, actually, ?'$<-.data.frame' says:

"     ## S3 replacement method for class 'data.frame':
     x$i <- value"

he has *not*:  he has used the "S3 replacement method for class
'data.frame'".  the fact that it didn't work as expected seems to be a
consequence of a bug in the dispatch mechanism.
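
one way to probe dispatch without trace(), as a sketch only.  getS3method() looks up the method that would be selected for class "data.frame"; the tryCatch() part assumes an R whose method rejects zero-length replacements, in which case the call recorded in the error names the method (as in the error output steven posted for his patched version).  on an R that accepts the assignment, `res` is simply NULL.

```r
bar <- data.frame(matrix(1:12, nrow = 3))

# the class attribute is what S3 dispatch on `$<-` looks at
inherits(bar, "data.frame")

# the method registered for that class
m <- getS3method("$<-", "data.frame")
is.function(m)

# if the assignment fails, inspect which function raised the error
res <- tryCatch({
  bar$NewCol <- integer(0)
  NULL
}, error = function(e) e)

if (inherits(res, "error")) print(conditionCall(res))
```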

wasn't?  what wasn't what?  after bar$NewCol <- integer(0), bar is
labelled as a data frame, and it seems to actually *be* a data frame; 
data frame operations seem to work on bar, and the warning from printing
bar talks about a corrupt data frame, not a non-data frame. 

or do you mean that bar is not a data frame internally?  that would be a
semantic weirdo where a user successfully performs an operation on a
data frame and gets a zombie.  in any case, looks like a bug.

it's *R* that mislabels things here.  i can't see the user doing any
explicit labelling;  the only stuff used was data.frame() and '$<-',
which should dispatch to '$<-.data.frame'.  the resulting zombie object
is clearly R's, not the user's, fault.

vQ
#
Duncan Murdoch wrote:
maybe it's a good idea to change your strategy and avoid blaming users
for faults that lie on the software's side.  r is buggy, and you might
well be more open to admitting there is a bug in a particular function
instead of suggesting that "people mislabel things" and the like.

best,
vQ