Suggestion: Dimension-sensitive attributes

6 messages · Laurent Gautier, Bengoechea Bartolomé Enrique (SIES 73)

Thu, Jul 9, 2009 5:15 AM

Starting by working on an interface for such object(s) is probably the 
first step toward a unified solution, and this before deciding if and 
how R attributes are used.

It would also help to ensure a smooth transition from the existing 
classes implementing a similar solution (first the interface is added to 
those classes, then after a grace period the classes are eventually 
refactored).

Dimension-level is what seems to be the most needed... but I am not 
convinced of the practicality of the object-level and cell-level 
schemes proposed:

- Object-level, if not linked to any dimension attribute, is much like 
saying that one wants to attach anything to any object. That's what 
attr() is already doing.

- Cell-level is maybe out of scope for a first trial (but maybe I 
missed the use cases for it)



If starting with behaviour, it seems to boil down to having "["/"[<-" 
and "dimmeta()"/"dimmeta<-()":

- extract "[" / replace "[<-" :

   * keeps working the way it already does

   * extracts a subset of the object as well as a subset of the 
dimension-associated metadata.

   * departing too much from the way "[" works and adding 
behind-the-curtain name matching will only compromise the chances of 
adoption.

   * forget about the bit about which metadata is kept and which one 
isn't when using "[". Make a function "unmeta()" (similar behavior to 
"unname()") to drop them all, or work it out with something like
 > dimmeta(x, 1) <- NULL # drop the metadata associated with dimension 1

- access the dimension-associated metadata:

   * maybe a function called "dimmeta()" (for consistency with 
"dimnames()")? The signature could be dimmeta(x, i), with x the object 
and i the dimension requested. A replacement function "dimmeta<-"(x, i, 
value) would be provided.
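
A minimal sketch of how such an accessor pair might look. It assumes, purely for illustration, that the metadata live in an attribute named "dimmeta" holding a list with one slot per dimension; neither that storage choice nor the row-count check is part of the proposal above.

```r
# Hypothetical storage: attribute "dimmeta", a list with one slot per dimension.
dimmeta <- function(x, i) {
  meta <- attr(x, "dimmeta")
  if (is.null(meta)) NULL else meta[[i]]
}

`dimmeta<-` <- function(x, i, value) {
  meta <- attr(x, "dimmeta")
  if (is.null(meta)) meta <- vector("list", length(dim(x)))
  # Assumed rule: table-like metadata has one row per element of dimension i.
  if (!is.null(value) && NROW(value) != dim(x)[i])
    stop("metadata must have one row per element of dimension ", i)
  meta[i] <- list(value)   # list(NULL) empties the slot but keeps its position
  attr(x, "dimmeta") <- meta
  x
}

# Drop all dimension metadata, in the spirit of unname().
unmeta <- function(x) {
  attr(x, "dimmeta") <- NULL
  x
}

x <- matrix(1:6, nrow = 2)
dimmeta(x, 1) <- data.frame(dose = c(10, 20))
dimmeta(x, 1)$dose
dimmeta(x, 1) <- NULL    # drop the metadata associated with dimension 1
```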


In the abstract, the "names" associated with a given dimension are just 
one possible kind of metadata, but I'd keep away from meddling with 
them for a start.


It would seem natural that the metadata associated with one dimension 
would be a table-like object (data.frame seems natural in R, and 
unfortunately there is no data.frame-like structure in R).



L.

Bengoechea Bartolomé Enrique (SIES 73)

Thu, Jul 9, 2009 7:14 AM

Very good points. They closely match the current prototype I have written...

Agree. Getting a good API is always the most important step.

True, and that was Henrik's original suggestion. But I find all three are closely related to the same topic (metadata) and as such deserve to be worked out together, but if most people agree otherwise, the direction is clear.

Except that plain attributes are dropped when subsetting. I've found myself dozens of times creating classes just to create a `[` method for them that preserves some attributes. This looks like such a common situation that having a mechanism to avoid the user programming the same stuff again and again would be handy.
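
A small illustration of the situation described, with a made-up attribute "units" and a made-up class "keepunits". The method is the kind of boilerplate one ends up rewriting; it is only a sketch, not code from the prototype.

```r
x <- structure(matrix(1:6, nrow = 2), units = "cm")
names(attributes(x))                    # "dim" and "units"
names(attributes(x[1, 1:3, drop = FALSE]))  # only "dim": "units" was dropped

# Hypothetical class whose only purpose is to carry "units" through "[".
`[.keepunits` <- function(x, ...) {
  out <- NextMethod("[")                # default subsetting drops the attribute
  attr(out, "units") <- attr(x, "units")
  class(out) <- oldClass(x)
  out
}

z <- structure(x, class = "keepunits")
attr(z[1, 1:3, drop = FALSE], "units")  # still "cm"
```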

Although I agree that cell-level is far less common, here are a couple of use cases I've hit recently:

1) the array represents time series in columns. The original data comes in a different frequency for each column, with some data missing. When aligning to a common frequency and interpolating missing values, I needed a factor array of the same dimension as the data array identifying whether each observation corresponded to the actual original series, or had been interpolated, and whether interpolation was due to missing data or to frequency alignment. Of course, I needed the factor array to be subsetted together with the array.

2) the array is a table representing data to be formatted by a reporting system (Sweave, R2HTML, etc), similar to the 'xtable' class. So I needed to associate formatting information to each individual "cell" (font, color, borders...), as well as to each dimension and to the whole table.

Anyway, it's far easier to add "cell-level" metadata on top of the other features with a new class: for `[` subscripting just call NextMethod() and then apply the same indexes to the object storing the cell-level metadata. But I still think it's useful to work out data object's metadata at all possible levels with a unified interface.
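
A rough sketch of that idea, under assumptions that are mine rather than the prototype's: the class name "cellmeta" and the attribute name "cells" are invented, and a character matrix stands in for the factor array of the time-series use case.

```r
# Hypothetical constructor: a matrix plus a same-shaped "cells" attribute.
cellmeta <- function(data, cells) {
  stopifnot(identical(dim(data), dim(cells)))
  structure(data, cells = cells, class = "cellmeta")
}

`[.cellmeta` <- function(x, ...) {
  out <- NextMethod("[")                       # default subsetting of the data
  attr(out, "cells") <- attr(x, "cells")[...]  # same indexes on the metadata
  class(out) <- oldClass(x)
  out
}

values <- matrix(c(1.0, 1.5, 2.0, 4.0, 4.5, 5.0), nrow = 3)
origin <- matrix(c("observed", "interpolated", "observed",
                   "observed", "observed", "interpolated"), nrow = 3)
series <- cellmeta(values, origin)

sub <- series[2:3, 1:2, drop = FALSE]
attr(sub, "cells")   # the matching 2 x 2 block of 'origin'
```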

About the subscripting `[` methods, I don't see the need to modify `[<-` for arrays, as out-of-bound indexes generate errors with arrays (unlike vectors or data frames), so `[<-` would only replace data and leave metadata untouched. Am I missing something?

I'm using 'dimdata' in my current prototype, and Henrik suggested 'dimattr', but I really like your proposal more. 

Wrappers to the first two elements of 'dimmeta' for 2-dim arrays could be added in the same vein as 'rownames' and 'colnames': 'rowmeta' and 'colmeta'.

For consistency with 'dimnames', the 'i' argument could be dropped, using dimmeta(x)[[i]] instead...
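
For illustration only, this is roughly what the 'dimnames'-style variant with its two wrappers could look like. The attribute name "dimmeta" and the helper set_slot() are assumptions made for the sketch.

```r
# dimnames-like interface: dimmeta(x) returns the whole list, one slot per dimension.
dimmeta <- function(x) attr(x, "dimmeta")

`dimmeta<-` <- function(x, value) {
  stopifnot(is.null(value) || length(value) == length(dim(x)))
  attr(x, "dimmeta") <- value
  x
}

# Hypothetical helper shared by the wrappers below.
set_slot <- function(x, i, value) {
  meta <- dimmeta(x)
  if (is.null(meta)) meta <- vector("list", length(dim(x)))
  meta[i] <- list(value)
  dimmeta(x) <- meta
  x
}

rowmeta <- function(x) dimmeta(x)[[1L]]
colmeta <- function(x) dimmeta(x)[[2L]]
`rowmeta<-` <- function(x, value) set_slot(x, 1L, value)
`colmeta<-` <- function(x, value) set_slot(x, 2L, value)

m <- matrix(1:6, nrow = 2)
rowmeta(m) <- data.frame(dose = c(10, 20))
colmeta(m) <- data.frame(day = 1:3)
dimmeta(m)[[2]]$day
```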


Other standard generics to be affected would be:

 * rbind & cbind for 2-dim arrays/matrices: they should combine the metadata, and for dimension-sensitive metadata can be modelled upon what is done with dimnames: use rowmeta (colmeta) of the first object with them in cbind (rbind), and combine colmeta (rowmeta) of all objects with them, filling with NAs/NULLs/.. for non metadata-sensitive objects being combined. An issue of coercing dimmeta of different classes may arise.

 * `dim<-`, but this may raise the same problem of coercing dimmeta of different classes.


...and I agree with the rest of your comments.

Best,

Enrique

-----Original Message-----
From: Laurent Gautier [mailto:lgautier at gmail.com] 
Sent: Thursday, July 9, 2009 14:15
Cc: Heinz Tuechler; Bengoechea Bartolomé Enrique (SIES 73); Tony Plate; Henrik Bengtsson; r-devel at r-project.org
Subject: Re: [Rd] Suggestion: Dimension-sensitive attributes


Bengoechea Bartolomé Enrique (SIES 73)

Thu, Jul 9, 2009 7:56 AM

Forgot to answer this one:

Right. A data frame has the problem that for most use cases one would want each dimension's length to match the number of *rows* of the data frame instead of the columns, but it is the columns that we would get "for free" when allowing "dimmeta" elements to be lists...

Enrique


Laurent Gautier

Thu, Jul 9, 2009 8:33 AM

Bengoechea Bartolomé Enrique (SIES 73) wrote:

[thanks for reading through what seems much like telescoped sentences]

Think of one data.frame per dimension and each data.frame having its 
rows aligned along that dimension.

In the case of a matrix, the dim-1 data.frame would have as many rows as 
rows in the matrix and the dim-2 data.frame would have as many rows as 
columns in the matrix.

When thinking in terms of generalization, one can also note that the 
one-dimension case can already be modelled by a data.frame.
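
As a rough sketch of that layout for a matrix: the helper subset_with_meta() is hypothetical and returns a plain list rather than a classed object. The point is only that the index used on a dimension of the data is the row index used on that dimension's data.frame.

```r
# Hypothetical helper: subset a matrix together with the data frames that
# describe its rows (dim 1) and its columns (dim 2).
subset_with_meta <- function(x, rowmeta, colmeta, i, j) {
  stopifnot(nrow(rowmeta) == nrow(x), nrow(colmeta) == ncol(x))
  list(
    data    = x[i, j, drop = FALSE],
    rowmeta = rowmeta[i, , drop = FALSE],   # rows aligned along dimension 1
    colmeta = colmeta[j, , drop = FALSE]    # rows aligned along dimension 2
  )
}

counts  <- matrix(1:12, nrow = 3)
genes   <- data.frame(chromosome = c(1, 1, 2))   # one row per matrix row
samples <- data.frame(dose = c(0, 0, 5, 5))      # one row per matrix column

res <- subset_with_meta(counts, genes, samples, i = 2:3, j = c(1, 4))
res$rowmeta$chromosome   # metadata for rows 2 and 3
res$colmeta$dose         # metadata for columns 1 and 4
```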


L.


Laurent Gautier

Thu, Jul 9, 2009 11:41 PM

Bengoechea Bartolomé Enrique (SIES 73) wrote:

I see. I never faced the issue, but I agree that this can be somewhat 
counter-intuitive.
Thinking about it, it seems natural nowadays to consider 
attribute-associated objects as a kind of prototype-based programming 
(and "[" to keep the attributes - although it does somehow consider 
special attributes such as "dim", "names", "dimnames").

In that respect, and as you outline it, this is then like 
"stacking"/"putting side-by-side" arrays of identical dimensions. Your 
time series data is in one array, the origin of the observation in 
another...

I would see that as a separate data structure (that could implement the 
metadata interface we are discussing).

I understand the use cases, but I can't stop thinking that this 
should be separated from the dimension-associated metadata.

In the examples above, the data structures are two-dimensional and 
therefore dimension-associated metadata will be for "rows" and for 
"columns"; all the cells in a table/array as a sequence are not mapped 
to any *dimension*.

That's what I am thinking.
I bundled "[" with "[<-" to specify that the way indexing is done would 
remain the same (for a second I considered that someone thought of 
somehow indexing on the names of the dimensions, or on the metadata).

the colour of the bikeshed

Yes. That's the spirit.

I thought about that, but also thought that it could have implications 
on the actual storage of those metadata. In the case the metadata are 
stored in a list, that interface enforces the building of a list.
(I said to ignore implementation for now, but paradoxically this made me 
consider possible implementations).

Let's ignore that and go for consistency first (there will always be 
time to come back on that and make backward compatible changes, such as 
dimmeta(x, i=NULL) # return the list if i is NULL ).

It may be good to be trigger-happy for a first pass ( stop("mismatching 
meta data - sorry") )... and mix-and-match use cases might be fewer.
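
A sketch of what such a strict first pass could look like for cbind. Objects are represented here as plain lists with 'data', 'rowmeta' and 'colmeta' elements, which is an assumption made only to keep the example short; cbind_meta() is a hypothetical name.

```r
# Hypothetical strict cbind: refuse to combine unless the row metadata agree,
# then stack the column metadata of both objects.
cbind_meta <- function(a, b) {
  if (!identical(a$rowmeta, b$rowmeta))
    stop("mismatching meta data - sorry")
  list(
    data    = cbind(a$data, b$data),
    rowmeta = a$rowmeta,
    colmeta = rbind(a$colmeta, b$colmeta)  # one row per column of the result
  )
}

rows <- data.frame(site = c(1, 2))
a <- list(data = matrix(1:4, nrow = 2), rowmeta = rows,
          colmeta = data.frame(year = c(2007, 2008)))
b <- list(data = matrix(5:6, nrow = 2), rowmeta = rows,
          colmeta = data.frame(year = 2009))

ab <- cbind_meta(a, b)
dim(ab$data)       # 2 rows, 3 columns
ab$colmeta$year    # years of a followed by the year of b
```

Coercion between metadata of different classes is deliberately not attempted; anything that does not match exactly is an error.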

Disabling "dim<-" is, I think, choosing sanity for now.

Same for me (about your comments).
This thread seems to be leading to something great.


L.


Bengoechea Bartolomé Enrique (SIES 73)

Fri, Jul 10, 2009 1:58 AM

Creating the list on the fly if it's not stored internally as a list should be cheap. For example, this is done with data frames, which store "dimnames" in two separate attributes, "names" and "row.names".
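
This can be checked directly on a small data frame:

```r
df <- data.frame(a = 1:2, b = 3:4)

names(attributes(df))   # includes "names" and "row.names", no "dimnames"
dimnames(df)            # list assembled on the fly from row.names and names
```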