Wrong length of POSIXt vectors (PR#10507)

12 messages · simecek at gmail.com, Peter Dalgaard, Duncan Murdoch +4 more

#
Full_Name: Petr Simecek
Version: 2.5.1, 2.6.1
OS: Windows XP
Submission from: (NULL) (195.113.231.2)


Several times I have found that the length of a POSIXt vector is not
computed correctly.

Example:

tv<-structure(list(sec = c(50, 0, 55, 12, 2, 0, 37, NA, 17, 3, 31
), min = c(1L, 10L, 11L, 15L, 16L, 18L, 18L, NA, 20L, 22L, 22L
), hour = c(12L, 12L, 12L, 12L, 12L, 12L, 12L, NA, 12L, 12L, 
12L), mday = c(13L, 13L, 13L, 13L, 13L, 13L, 13L, NA, 13L, 13L, 
13L), mon = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, NA, 5L, 5L, 5L), year = c(105L, 
105L, 105L, 105L, 105L, 105L, 105L, NA, 105L, 105L, 105L), wday = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L), yday = c(163L, 163L, 
163L, 163L, 163L, 163L, 163L, NA, 163L, 163L, 163L), isdst = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, -1L, 1L, 1L, 1L)), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXt", "POSIXlt"))

print(tv)
# prints 11 time points (right)

length(tv)
# returns 9 (wrong)

I have tried this on several computers, with and without switching to an English
locale, i.e. Sys.setlocale("LC_TIME", "en"). I have searched the help pages but I
cannot see how this could be correct.
1 day later
#
simecek at gmail.com wrote:
Given the way you define it, you should be able to imagine it!

It's a list of length 9:  sec, min, hour,..., isdst.
#
On 12/11/2007 6:20 AM, simecek at gmail.com wrote:
tv is a list of length 9.  The answer is right, your expectation is wrong.
See this in ?POSIXt:

Class '"POSIXlt"' is a named list of vectors...

You could define your own length measurement as

length.POSIXlt <- function(x) length(x$sec)

and you'll get the answer you expect, but be aware that length.XXX 
methods are quite rare, and you may surprise some of your users.
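
A minimal sketch of what such a method changes, using a made-up list-based class rather than POSIXlt so that it does not depend on how a particular R version stores date-times. The class name toytime and its three fields are assumptions for illustration only:

```r
# Hypothetical toy class "toytime": a named list of equal-length vectors,
# laid out like the list of fields that ?POSIXt describes for POSIXlt.
x <- structure(list(hour = c(12L, 12L, 12L, 13L),
                    min  = c(1L, 10L, 11L, 15L),
                    sec  = c(50, 0, 55, 12)),
               class = "toytime")

# Without a length method, length() counts the list components (3 fields).
n_fields <- length(x)

# S3 method in the spirit of the length.POSIXlt definition above.
length.toytime <- function(x) length(unclass(x)$sec)

# With the method in place, length() counts the time points (4 of them);
# the underlying list is still reachable through unclass().
n_points    <- length(x)
n_unclassed <- length(unclass(x))
```

Any existing code that relied on length() returning the number of fields would see a different value once such a method is defined, which is the surprise mentioned above.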

Duncan Murdoch
1 day later
#
Duncan Murdoch wrote:
On the other hand, isn't the fact that length() currently always returns 9 
for POSIXlt objects likely to be a surprise to many users of POSIXlt?

The back of "The New S Language" says "Easy-to-use facilities allow you to 
organize, store and retrieve all sorts of data. ... S functions and data 
organization make applications easy to write."

Now, POSIXlt has methods for c() and vector subsetting "[" (and many other 
vector-manipulation methods - see methods(class="POSIXlt")).  Hence, from 
the point of view of intending to supply "easy-to-use facilities ... [for] 
all sorts of data", isn't it a little incongruous that length() is not also 
provided, since these 3 functions (any others?) comprise a core set of 
vector-manipulation functions?

Would it make sense to have an informal prescription (e.g., in R-exts) that 
a class that implements a vector-like object and provides at least one 
of the functions 'c', '[' and 'length' should provide all three?  It would also 
be easy to describe a test suite that should be included in the 'tests' 
directory of a package implementing such a class, with some tests of 
the basic vector-manipulation functionality, such as:

 > # at this point, x0, x1, x3, & x10 should exist, as vectors of the
 > # class being tested, of length 0, 1, 3, and 10, and they should
 > # contain no duplicate elements
 > length(x0)
[1] 0
 > length(c(x0, x1))
[1] 1
 > length(c(x1,x10))
[1] 11
 > all(x3 == x3[seq(len=length(x3))])
[1] TRUE
 > all(x3 == c(x3[1], x3[2], x3[3]))
[1] TRUE
 > length(c(x3[2], x10[5:7]))
[1] 4
 >
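
The checks listed above could be wrapped into a small helper along the following lines. This is only a sketch: the function name check_vector_like is made up, the lengths 0, 1, 3 and 10 are taken from the comment at face value, and Date is used as the class under test simply because it is a base class expected to pass:

```r
# Hypothetical helper: runs the checks listed above on four vectors of the
# class under test, assumed to have lengths 0, 1, 3 and 10.
check_vector_like <- function(x0, x1, x3, x10) {
  c(length_zero    = length(x0) == 0L,
    concat_0_1     = length(c(x0, x1)) == 1L,
    concat_1_10    = length(c(x1, x10)) == 11L,
    full_subset    = all(x3 == x3[seq_len(length(x3))]),
    rebuild_by_elt = all(x3 == c(x3[1], x3[2], x3[3])),
    mixed_concat   = length(c(x3[2], x10[5:7])) == 4L)
}

# Ten distinct dates; d[0] is a zero-length Date vector.
d <- as.Date("2005-06-13") + 0:9
res <- check_vector_like(d[0], d[1], d[1:3], d)
print(res)
```

Returning a named logical vector instead of stopping at the first failure makes it easy to see which of the rules a given class violates.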

It would also be possible to describe a larger set of vector manipulation 
functions that should be implemented together, including e.g., 'rep', 
'unique', 'duplicated', '==', 'sort', '[<-', 'is.na', head, tail ... (many 
of which are provided for POSIXlt).

Or is there some good reason that length() cannot be provided (while 'c' 
and '[' can) for some vector-like classes such as "POSIXlt"?

-- Tony Plate
#
On 12/13/2007 1:59 PM, Tony Plate wrote:
What you say sounds good in general, but the devil is in the details. 
Changing the meaning of length(x) for some objects has fairly widespread 
effects.  Are they all positive?  I don't know.

Adding a prescription like the one you suggest would be good if it's 
easy to implement, but bad if it's already widely violated.  How many 
base or CRAN or Bioconductor packages violate it currently?   Do the 
ones that provide all 3 methods do so in a consistent way, i.e. does 
"length(x)" mean the same thing in all of them?

I agree that the current state is less than perfect, but making it 
better would really be a lot of work.  I suspect there are better ways 
to spend my time, so I'm not going to volunteer to do it.  I'm not even 
going to invite someone else to do it, or offer to review your work if 
you volunteer.  I think this falls into the class of "next time we write 
a language, let's handle this better" problems.

Duncan Murdoch
1 day later
#
Duncan Murdoch wrote:
I'm not sure doing something like this would be so bad even if it is 
already widely violated.  R has evolved significantly over time, and 
many rough edges have been cleaned up, sometimes in ways that were not 
backward compatible.  This is a great thing & my thanks go to the people 
working on R.

If some base or CRAN or Bioconductor packages currently don't implement 
vector operations consistently, wouldn't it be good to know that?  
Wouldn't it be useful to have an automatic way of determining whether a 
particular vector-like class is consistent with a generally agreed set of 
principles for how basic vector operations should work -- things like 
length(x)+length(y)==length(c(x,y))?  This could help developers check, 
document & improve their code, and it could help users understand how to 
use a class, and to evaluate the software quality of a class 
implementation and whether or not it provides the functionality they need.
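
As a rough sketch of such an automatic check, the rule length(x)+length(y)==length(c(x,y)) can be written as a one-line function and tried on a class that defines 'c' but not 'length'. The class hourmin, its constructor and the helper length_additive are all hypothetical names invented for this illustration:

```r
# Hypothetical check of the rule length(x) + length(y) == length(c(x, y)).
length_additive <- function(x, y) length(x) + length(y) == length(c(x, y))

# Made-up class "hourmin": a list of equal-length field vectors.
hourmin <- function(hour, min) {
  structure(list(hour = hour, min = min), class = "hourmin")
}

# c() method that concatenates field by field, as a vector of times would.
c.hourmin <- function(...) {
  parts  <- lapply(list(...), unclass)
  fields <- names(parts[[1]])
  out    <- lapply(fields, function(f) unlist(lapply(parts, `[[`, f)))
  names(out) <- fields
  structure(out, class = "hourmin")
}

x <- hourmin(hour = c(12L, 12L), min = c(1L, 10L))
y <- hourmin(hour = c(12L, 12L, 13L), min = c(11L, 15L, 0L))

# Only c() is defined: length() still counts the 2 fields, so the rule fails.
before <- length_additive(x, y)   # 2 + 2 == 2 is FALSE

# Adding a matching length() method restores the rule.
length.hourmin <- function(x) length(unclass(x)$hour)
after <- length_additive(x, y)    # 2 + 3 == 5 is TRUE
```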

Thanks very much for the thoughtful (and honest) feedback!  I suspect 
that the current state could be improved with just a little work, and 
without forcing anyone to do any work they don't want to do.  I'll think 
about this more and try to come back with a better & more concrete 
suggestion.

-- Tony Plate
1 day later
#

        
Good. From "the outside" (i.e. superficial gut feeling :-)),
I've sympathized with your suggestion, Tony, quite a bit.
Further, my own taste would probably also have led me to define
length.POSIXlt differently ..
OTOH, I agree with Duncan that it may be too late to change it,
and even more so to enforce the consistency rules you propose.
If with a small bit of code (and some patience) we could check
all of CRAN and hopefully Bioconductor packages and find only a
very few where it was violated, the whole endeavor may be worth it
... for the sake of making R more consistent, easier to teach, etc.

Unfortunately I don't remember now what happened many months ago
when I indeed did experiment with having something like

  length.POSIXlt <- function(x) length(x$sec)

Martin Maechler
#
If it were simply deprecated and then changed, everyone using it
would get a warning during the period of deprecation, so it would
not be so bad.  Given that its current behavior is not very useful,
I suspect it's not widely used anyway.
I haven't followed the whole discussion, so sorry if these
points have already been made.
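
A sketch of what such a deprecation period could look like, on a made-up class so as not to touch POSIXlt itself. The class name oldtime and the warning text are assumptions; a plain warning() is used here, although base R also offers .Deprecated() for this purpose:

```r
# Made-up list-based class "oldtime" with two fields and three time points.
x <- structure(list(min = c(1L, 10L, 11L), sec = c(50, 0, 55)),
               class = "oldtime")

# Transitional method: keeps returning the old value (number of fields)
# but warns callers that the meaning is going to change.
length.oldtime <- function(x) {
  warning("length() of an 'oldtime' object will return the number of ",
          "time points in a future version", call. = FALSE)
  length(unclass(x))
}

# Capture the warning text instead of letting it print.
msg <- tryCatch(length(x), warning = function(w) conditionMessage(w))

# The value itself is unchanged during the transition.
n_old <- suppressWarnings(length(x))
```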
On Dec 15, 2007 5:17 PM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
#
On 15/12/2007 5:17 PM, Martin Maechler wrote:
One reason I don't want to work on this is because the appropriate 
action depends on what "length(x)" is intended to mean.  Currently for 
POSIXlt objects, it gives the physical length of the underlying basic 
type (the list).  This is the same behaviour as we have for matrices, 
data frames and every other object without a specific length method, so 
it's not outrageous.
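
For comparison, a small illustration of that behaviour using base R objects only:

```r
m  <- matrix(1:6, nrow = 2)                         # 2 x 3 integer matrix
df <- data.frame(a = 1:4, b = c("w", "x", "y", "z"))

length(m)    # 6: the length of the underlying integer vector
length(df)   # 2: the number of columns, i.e. of list components
dim(m)       # 2 3
nrow(df)     # 4
```

In both cases length() reports the size of the underlying storage, not the number of rows or observations.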

The proposed change is to have it return the logical length of the 
object, which also seems quite reasonable.  I don't think matrices and 
data frames have a "logical length", so there would be no contradiction 
in those examples.  The thing that worries me is that there are probably 
objects in packages where both logical length and physical length make 
sense but are different.  I don't have any expectation that length(x) on 
those currently is consistent in which type of value it returns.

If we were to decide that "length(x)" *always* meant logical length, 
then we would have a problem:  matrices and data frames don't have a 
logical length, so we shouldn't be getting an answer there.  Changing 
length(x) for those is not acceptable.

On the other hand, if we decide that "length(x)" *always* means physical 
length, we don't need to do anything to the POSIXlt or matrices or data 
frames, but there may well be other kinds of objects out there that 
violate this rule.

We could leave the meaning of length(x) ambiguous.  If you want to know 
what it does for a POSIXlt object, you need to read the documentation or 
look at the source code.  As a policy, this isn't particularly 
appealing, but I could probably live with it if someone else did the 
research and showed that current usage is ambiguous.

Duncan Murdoch
#
Duncan Murdoch wrote:
Leaving the meaning of length(x) ambiguous seems reasonable to me (as 
the meanings of 'c' and '[' already are).

I was thinking more in terms of consistency of either supplying all or 
none of the tightly related group of functions 'c', '[', and 'length'. 
It seems diabolically confusing that 'c' and '[' exist for POSIXlt and 
do the expected things in terms of the vector-of-dates interpretation, 
but length does something completely different.  (And this is not 
mentioned in ?POSIXlt).

Coding & documentation guidelines & tools could help R to move towards 
more consistency with regard to this kind of behavior.

-- Tony Plate
#
Duncan Murdoch <murdoch at stats.uwo.ca> writes:
Physical length and logical length are, as you say, two different things.  So
why not two functions?  Keep length() for physical length, as it is now, and
maybe Length() for logical length.  The latter could be defined as

Length <- function(x, ...) UseMethod("Length")

Length.default <- function(x, ...) length(x)

and then add methods for classes that want something else.
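
A minimal sketch of how that could look in use, with one method added for a made-up list-based class. The class name minsec and its fields are hypothetical; only the Length generic and its default come from the definitions above:

```r
Length <- function(x, ...) UseMethod("Length")
Length.default <- function(x, ...) length(x)

# Hypothetical method for a made-up list-based class "minsec".
Length.minsec <- function(x, ...) length(unclass(x)$sec)

x <- structure(list(min = c(1L, 10L, 11L, 15L),
                    sec = c(50, 0, 55, 12)),
               class = "minsec")

length(x)      # physical length: 2 list components
Length(x)      # logical length: 4 time points
Length(1:10)   # no method for integers, falls through to length(): 10
```

Because length() itself is left alone, existing code keeps its current behaviour and only callers that opt in to Length() see the logical length.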
#
Jeffrey J. Hallman wrote:
A very reasonable suggestion, but I'd also put this in the "next time we 
design a language" category.

The current system in R seems workable to me, if one knows that 
vector-like classes that have an S3 list-based implementation need to 
have methods defined for 'c', 'length', '[', etc., and that if these 
methods aren't defined, then you'll be operating on the underlying list 
structure.  Where these methods are defined, one can get at the 
underlying structure by unclassing first, and that's OK.  However, 
classes that have some of these methods defined but not others seem to 
me to be needlessly confusing.  It's not as if there is any great benefit 
in length() always returning the length of the underlying list for 
POSIXlt: if there were a length() method, one could still get at the 
underlying length using length(unclass(x)).  It just seems like a design 
oversight that makes using such classes unnecessarily difficult and 
error-prone.

Hence my proposal (in a new thread) for coding & documentation 
guidelines that would:
(1) suggest consistency is a good thing
(2) suggest compliance or deviation should be documented
(3) define what consistency is (and here it's not so important to get 
absolutely the right set of consistency definitions as it is to get a 
reasonable set that people agree on.)

-- Tony Plate