Best method to add unit information to dataframe ?

6 messages · bruno Piguet, Marc Schwartz, Joshua Wiley +3 more

#
Dear all,

  I'd like to have a dataframe store information about the units of
the data it contains.

  Below is a minimal example of my current approach: I add a
"units" attribute to the dataframe. But I don't like the long syntax
needed to access the unit of a given variable, namely something
like:
   var_unit <- attr(my_frame, "units")[[match(var_name, attr(my_frame, "names"))]]

  Can anybody point me to a better solution ?

Thanks in advance,

Bruno.


# Dataframe creation
x <- c(1:10)
y <- c(11:20)
z <- c(101:110)
my_frame <- data.frame(x, y, z)
attr(my_frame, "units") <- c("x_unit", "y_unit")

#
# later on, using dataframe
for (var_name in c("x", "y")) {
   idx <- match(var_name, attr(my_frame, "names"))
   var_unit <- attr(my_frame, "units")[[idx]]
   print (paste("max ", var_name, ": ", max(my_frame[[var_name]]), var_unit))
}
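
One way to shorten the lookup, sketched here in base R, is to store the units as a named character vector so that a unit can be fetched by variable name instead of by position. The helper name `unit_of` is hypothetical and not part of base R; it returns NA when no unit was recorded for a variable.

```r
# Same toy data as above, with the units stored as a *named* character vector
my_frame <- data.frame(x = 1:10, y = 11:20, z = 101:110)
attr(my_frame, "units") <- c(x = "x_unit", y = "y_unit")

# unit_of() is a hypothetical helper: look a unit up by variable name,
# returning NA when the variable has no recorded unit (here: z)
unit_of <- function(df, var_name) {
  u <- attr(df, "units")
  if (is.null(u) || !(var_name %in% names(u))) NA_character_ else u[[var_name]]
}

for (var_name in c("x", "y")) {
  print(paste("max", var_name, ":", max(my_frame[[var_name]]),
              unit_of(my_frame, var_name)))
}
```

The named vector also removes the need for the units to be listed in the same order as the columns.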
#
On Oct 3, 2011, at 9:35 AM, bruno Piguet wrote:

            
The problem is that there are operations on data frames (e.g. subset()) that will end up stripping your attributes.

str(my_frame)
'data.frame':	10 obs. of  3 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10
 $ y: int  11 12 13 14 15 16 17 18 19 20
 $ z: int  101 102 103 104 105 106 107 108 109 110
 - attr(*, "units")= chr  "x_unit" "y_unit"

newDF <- subset(my_frame, x <= 5)
str(newDF)
'data.frame':	5 obs. of  3 variables:
 $ x: int  1 2 3 4 5
 $ y: int  11 12 13 14 15
 $ z: int  101 102 103 104 105


You might want to look at either ?comment or the ?label function in Frank's Hmisc package on CRAN, either to use or for example code on how he handles this.
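
If the data-frame-level attribute is kept anyway, one possible workaround for the stripping shown above is to copy the attribute back after subsetting rows. This is only a base R sketch: `subset_keep_units` is a hypothetical wrapper, and it assumes the units are stored as a character vector named by column.

```r
my_frame <- data.frame(x = 1:10, y = 11:20, z = 101:110)
attr(my_frame, "units") <- c(x = "x_unit", y = "y_unit", z = "z_unit")

# subset_keep_units() is a hypothetical wrapper, not a base R function:
# it subsets rows, then re-attaches the units of the surviving columns
subset_keep_units <- function(df, rows) {
  out <- df[rows, , drop = FALSE]
  attr(out, "units") <- attr(df, "units")[names(out)]
  out
}

newDF <- subset_keep_units(my_frame, my_frame$x <= 5)
attr(newDF, "units")
```

Every other function that rebuilds the data frame would need the same treatment, which is why a dedicated class or per-column storage may be the more robust route.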

HTH,

Marc Schwartz
#
Hi Bruno,

It sounds like what you want is really a separate class, one that
stores information about units for each variable.  This is far from an
elegant example, but depending on your situation may be useful.  I
create a new class inheriting from the data frame class.  This is
likely fraught with problems because a formal S4 class is inheriting
from an informal S3.  Then a data frame can be stored in the .Data
slot (special---I did not make it), but character data can also be
stored in the units slot (which I did define).  You could get fancier
imposing constraints that the length of units be equal to the number
of columns in the data frame or the like.  S3 methods for data frames
should still mostly work, but you also have the ability to access the
new units slot.  You could define special S4 methods to do the
extraction then, if you wanted, so that your ultimate syntax to get
the units of a particular variable would be shorter.

setOldClass("data.frame")

setClass("mydf", representation(units = "character"),
  contains = "data.frame", S3methods = TRUE)

tmp <- new("mydf")

tmp@.Data <- mtcars
tmp@row.names <- rownames(mtcars)
tmp@units <- c("x", "y")

## data frameish
colMeans(tmp)
tmp + 10

# but
tmp@units

Cheers,

Josh

N.B. I've read once and skimmed again Chambers' book, but I still do
not have a solid grasp on S4 so I may have made some fundamental
blunder in the example.
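
The two extensions mentioned above (a length constraint on the units slot and a shorter accessor) might look roughly like the following sketch. The class name `unitsFrame` and the generic `unitOf` are hypothetical names chosen for illustration, and the same caveats about an S4 class inheriting from an S3 class apply.

```r
library(methods)

# Hypothetical class: a data frame plus one unit string per column
setClass("unitsFrame", representation(units = "character"),
  contains = "data.frame",
  validity = function(object) {
    if (length(object@units) != length(names(object)))
      "need exactly one unit per column"
    else
      TRUE
  })

# Hypothetical accessor: unit of a single variable, looked up by column name
setGeneric("unitOf", function(object, var) standardGeneric("unitOf"))
setMethod("unitOf", "unitsFrame", function(object, var) {
  object@units[[match(var, names(object))]]
})

tmp <- new("unitsFrame", data.frame(x = 1:10, y = 11:20),
           units = c("x_unit", "y_unit"))
unitOf(tmp, "y")
```

With the accessor in place, the lookup from the original question shrinks to a single call.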
On Mon, Oct 3, 2011 at 7:35 AM, bruno Piguet <bruno.piguet at gmail.com> wrote:

  
    
#
Hi,

If you want to take advantage of Josh's example below (using an S4
subclass of data.frame), perhaps you might be interested in taking
advantage of the multitude of useful objects/classes defined in the
bioconductor IRanges package:

http://www.bioconductor.org/packages/release/bioc/html/IRanges.html

It has no other bioconductor dependencies, so it's a "slim" install,
in that respect. It defines a DataFrame class which keeps "metadata"
around as you subset/index/etc. it, e.g.:

R> library(IRanges)
R> DF <- DataFrame(a=1:10, b=letters[1:10])
R> metadata(DF) <- list(units=list(a=NA, b='inches'))

R> sub.1 <- subset(DF, a %% 2 == 0)
R> sub.1
DataFrame with 5 rows and 2 columns
          a           b
  <integer> <character>
1         2           b
2         4           d
3         6           f
4         8           h
5        10           j

R> metadata(sub.1)
$units
$units$a
[1] NA

$units$b
[1] "inches"

(although I noticed that transform,DataFrame isn't defined actually ...)

Anyway, HTH.

-steve
On Mon, Oct 3, 2011 at 11:15 AM, Joshua Wiley <jwiley.psych at gmail.com> wrote:

  
    
#
On Mon, Oct 3, 2011 at 10:35 AM, bruno Piguet <bruno.piguet at gmail.com> wrote:
The Hmisc package has some support for this:

library(Hmisc)

DF <- data.frame(x, y, z)
units(DF$x) <- "my x units"
units(DF$y) <- "my y units"

units(DF$x)
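
For readers without Hmisc, the same per-column idea can be approximated in base R by setting a "units" attribute on each column directly. This is a sketch of the general technique rather than Hmisc's own code, and `column_units` is a hypothetical helper. Note that plain attributes on a column are not guaranteed to survive row subsetting either.

```r
DF <- data.frame(x = 1:10, y = 11:20, z = 101:110)
attr(DF$x, "units") <- "my x units"
attr(DF$y, "units") <- "my y units"

# column_units() is a hypothetical helper: collect the unit of every column,
# using NA for columns that have no "units" attribute (here: z)
column_units <- function(df) {
  vapply(df, function(col) {
    u <- attr(col, "units")
    if (is.null(u)) NA_character_ else u
  }, character(1))
}

attr(DF$x, "units")
column_units(DF)
```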