[Bioc-devel] read.AnnotatedDataFrame - Bioc-devel

Mon, Jan 8, 2007 9:23 AM #

Hi Martin, all

I tried to adopt the read.AnnotatedDataFrame method on files that I was
able to import with read.phenoData before and got the following error
message:

Error in data.frame(labelDescription = varLabels, row.names =
names(varLabels)) :
        row names supplied are of the wrong length

After taking a look at the code and changing the line
    varLabels <- as.list(rep("read from file", ncol(pData)))
to
    varLabels <- rep("read from file", ncol(pData))
the function created the AnnotatedDataFrame.

Not sure if this is a bug or if my phenoData files should be formatted in
another way, but I strongly doubt that the original version will work in
any case, since the implicit coercion in the following
varMetadata = data.frame(labelDescription = varLabels,
                row.names = names(varLabels))))
creates a data frame with 1 row and length(varLabels) columns, hence the
row.names=names(varLabels) argument will cause an error.
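
The coercion described above can be reproduced in plain R without Biobase. This is a minimal sketch, not the Biobase source: the column names are made up, and it assumes that varLabels carries names taken from the columns of pData, which is not shown in this message.

```r
# Hypothetical column names standing in for names(pData).
column_names <- c("sex", "age", "batch")

# Original form: a named list of labels.
labels_list <- as.list(rep("read from file", length(column_names)))
names(labels_list) <- column_names

# data.frame() turns each list element into its own column,
# so the result has 1 row and 3 columns.
one_row <- data.frame(labelDescription = labels_list)
dim(one_row)

# Supplying three row names for a one-row data frame is an error.
failed <- tryCatch(
  data.frame(labelDescription = labels_list,
             row.names = names(labels_list)),
  error = function(e) e
)
conditionMessage(failed)

# Modified form: a named character vector gives one row per variable.
labels_chr <- rep("read from file", length(column_names))
names(labels_chr) <- column_names
fixed <- data.frame(labelDescription = labels_chr,
                    row.names = names(labels_chr))
dim(fixed)
```

With the list, the number of rows no longer matches the number of names, which is consistent with the error message quoted at the top.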

Am I wrong here?
Florian


 sessionInfo()
R version 2.5.0 Under development (unstable) (2006-05-24 r38188)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=de_DE.ISO-8859-1;LC_NUMERIC=C;LC_TIME=de_DE.ISO-8859-1;LC_COLLATE=de_DE.ISO-8859-1;LC_MONETARY=de_DE.ISO-8859-1;LC_MESSAGES=de_DE.ISO-8859-1;LC_PAPER=de_DE.ISO-8859-1;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=de_DE.ISO-8859-1;LC_IDENTIFICATION=C

attached base packages:
 [1] "splines"   "grid"      "tools"     "methods"   "stats"     "graphics"
 [7] "grDevices" "utils"     "datasets"  "base"

other attached packages:
  genefilter     survival    rnaiUtils      flomisc        RODBC        prada
    "1.13.5"       "2.29"        "1.0"      "1.0.2"      "1.1-7"     "1.11.3"
RColorBrewer      Biobase
     "0.2-3"    "1.13.29"

Florian Hahne
Abt. Molekulare Genomanalyse (B050)
Deutsches Krebsforschungszentrum (DKFZ)
Im Neuenheimer Feld 580
D-69120 Heidelberg
phone: 0049 6221 424764
fax: 0049 6221 423454
web: www.dkfz.de/mga

Martin Morgan

Mon, Jan 8, 2007 10:45 AM #

Thanks Florian -- oddly, Crispin Miller sent email earlier today about
this same issue; it's fixed in R-devel. 

read.AnnotatedDataFrame was introduced to accommodate modifications to
affy; is this (affy) where the problem came from? I'm not really sure
how people get info into ExpressionSets, and would be happy to make
that process easier / more robust.

Martin

Martin T. Morgan
Bioconductor / Computational Biology
http://bioconductor.org

Martin Morgan

Mon, Jan 8, 2007 12:56 PM #

Sorry, I meant that it is fixed in the development version of
Biobase, not in R-devel.

Martin

_______________________________________________
Bioc-devel at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Florian Hahne

Tue, Jan 9, 2007 3:39 AM #

Hi Martin,

I stumbled across this because I formerly used the phenoData class and
have now switched to AnnotatedDataFrames in objects of class cytoSet in
my prada package, and also in the flowSets of the new flowCore package.
So for me the problem is unrelated to affy.

I cannot speak for all ExpressionSet users, but in my use cases I usually
have a data frame (or some table in a file) with all the necessary
metadata for each sample. I guess reading in such files is the most
common way people get the info into their data structures; in the end,
nobody wants to build a data frame of possibly hundreds of rows
interactively in R or via a widget.

I'm not sure about having the sample names as row.names, though. I think
there used to be a mandatory column "name" to store them, which I
personally liked better (in many spreadsheet programs the concept of row
names is somewhat vague). It might be helpful to improve the
documentation for read.AnnotatedDataFrame a bit, maybe by adding an
example file so people can see what the input is supposed to look like.
Apart from that, it might be hard to make this procedure easier or more
robust, since use cases and the background and expertise of users differ
a lot.

Hope these thoughts help a bit,
Florian
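
The two layouts discussed above, a "name" column versus row names, can be compared with a small base R sketch. The file contents and column names are invented for illustration, and only read.table is used, so this says nothing about how read.AnnotatedDataFrame itself parses files.

```r
# Write a small tab-delimited sample table to a temporary file.
# The first column is a hypothetical "name" column holding sample identifiers.
table_lines <- c(
  "name\tsex\tage",
  "sampleA\tfemale\t34",
  "sampleB\tmale\t51",
  "sampleC\tfemale\t47"
)
path <- tempfile(fileext = ".txt")
writeLines(table_lines, path)

# Layout 1: keep the identifiers as an ordinary column.
with_column <- read.table(path, header = TRUE, sep = "\t",
                          stringsAsFactors = FALSE)

# Layout 2: let read.table use the first column as row names.
with_rownames <- read.table(path, header = TRUE, sep = "\t",
                            row.names = 1, stringsAsFactors = FALSE)

# Converting layout 1 into layout 2 after the fact.
converted <- with_column[, setdiff(colnames(with_column), "name"),
                         drop = FALSE]
rownames(converted) <- with_column$name

with_column
with_rownames
```

Either layout carries the same information; the difference is whether the identifier is visible as a named column in the file or implied by its position.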

Seth Falcon

Tue, Jan 9, 2007 7:15 AM #

Interesting.  The row names are special since they must often be aligned
with other objects and can be used for subsetting.

I have no problem with a "name" column being recognized by an import
tool (aside from issues of name collisions -- what if I have a
variable named "name").  But I think performance concerns will move us
away from having such a column in the actual representation of the
object.

What we are moving towards is a setup where the row names are stored
in a separate slot and may eventually be an external vector that can
be shared among other objects that need to align on that vector.
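
The role of row names in alignment and subsetting mentioned above can be illustrated in base R. This is only a sketch with invented sample and gene names; it uses a plain matrix and data frame, not the eSet classes or the separate slot described here.

```r
# Expression values: genes in rows, samples in columns.
expr <- matrix(1:6, nrow = 2,
               dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))

# Sample metadata, deliberately stored in a different sample order.
pheno <- data.frame(group = c("control", "treated", "control"),
                    row.names = c("s3", "s1", "s2"),
                    stringsAsFactors = FALSE)

# Align the metadata rows to the matrix columns by name.
aligned <- pheno[colnames(expr), , drop = FALSE]

# Subset both objects with the same vector of sample names.
keep <- c("s1", "s3")
expr_sub <- expr[, keep, drop = FALSE]
pheno_sub <- aligned[keep, , drop = FALSE]
```

Because both objects are indexed by the same names, one shared character vector is enough to keep them in step.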

+ seth

Florian Hahne

Tue, Jan 9, 2007 7:44 AM #

Hi Seth,

internal representation is one part of the story, and I agree that row
names are the way to go there. Another point, however, is how the user
gets the information into R. At some point we need to match sample names
and the sample metadata, and IMO this should already happen at the level
of the text file.

The closest thing to the row names idea is probably to take the first
column in the file as the sample identifier, but this imposes a pretty
strict layout on the files (for some users the first column may already
be the row numbering). As far as I understand the current
implementation, the default is to take the first column, and you can
pass row.names=x to read.AnnotatedDataFrame, but there is also this
additional sampleNames parameter, and I find this pretty confusing. So
currently you can do almost everything with the function, which is good
in one sense but, on the other hand, might cause mix-ups and confusion
for the user.

If the mapping is already clear at the level of the text file, we can
sit back and tell people to check their files when something isn't
showing up as they expect. Currently, however, you can do pretty stupid
stuff just by setting a wrong argument, without even realizing it.

I had the impression at the Bressanone courses that, for the average
user, the biggest obstacle is getting all the necessary data from files
somewhere on the hard disk into R, and that it is important to provide a
straightforward default way of doing that.

Best,
Florian
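
One way to make a wrong identifier column fail loudly instead of silently is to check the identifiers while reading. The helper below is a hypothetical sketch in base R; read_sample_table, id_column and expected are invented names and are not part of Biobase or of read.AnnotatedDataFrame.

```r
# Hypothetical helper: read a tab-delimited sample table and insist that
# the chosen identifier column is unique and covers the expected samples.
read_sample_table <- function(path, id_column = 1, expected = NULL) {
  tab <- read.table(path, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
  ids <- as.character(tab[[id_column]])
  if (anyNA(ids) || anyDuplicated(ids)) {
    stop("sample identifiers must be unique and non-missing")
  }
  if (!is.null(expected)) {
    missing_ids <- setdiff(expected, ids)
    if (length(missing_ids) > 0) {
      stop("samples not found in file: ",
           paste(missing_ids, collapse = ", "))
    }
  }
  rownames(tab) <- ids
  tab[[id_column]] <- NULL
  if (!is.null(expected)) {
    tab <- tab[expected, , drop = FALSE]
  }
  tab
}

# Example file whose first column is a running number, not the sample name.
path <- tempfile(fileext = ".txt")
writeLines(c("nr\tname\tbatch",
             "1\ts1\tA",
             "2\ts2\tA",
             "3\ts3\tB"), path)

# Naming the identifier column explicitly works and reorders the rows.
ok <- read_sample_table(path, id_column = "name", expected = c("s3", "s1"))

# Relying on the first column fails, because "1", "2", "3" are not samples.
wrong_default <- tryCatch(
  read_sample_table(path, expected = c("s3", "s1")),
  error = function(e) e
)

# A column with repeated values is rejected as an identifier.
wrong_column <- tryCatch(
  read_sample_table(path, id_column = "batch"),
  error = function(e) e
)
```

The design choice here is to take the expected sample names from the data object rather than trusting the file layout, so a mismatch stops the import.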


Wolfgang Huber

Tue, Jan 9, 2007 12:16 PM #

Hi,

just to add to Florian's comment about the user interface: I think the
AnnotatedDataFrame and the new eSet classes are beautiful and elegant,
and much better than what we had.

Yet I now find it quite complex and unintuitive to construct an
AnnotatedDataFrame or an ExpressionSet from scratch. IMHO, anything that
makes it simple to convert a plain data frame or Excel table into a
valid AnnotatedDataFrame will make many users happy.

  Best wishes
  Wolfgang
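
As a rough sketch of the conversion described above: a plain data frame plus a small table of variable descriptions is what an AnnotatedDataFrame needs. The sample data and descriptions are invented, and the Biobase call is guarded so that the base R part runs even where Biobase is not installed.

```r
# A plain data frame of sample metadata, with sample names as row names.
pd <- data.frame(
  sex = c("female", "male", "female"),
  age = c(34, 51, 47),
  row.names = c("sampleA", "sampleB", "sampleC"),
  stringsAsFactors = FALSE
)

# One description per variable, keyed by the column names of pd.
meta <- data.frame(
  labelDescription = c("Sex of the donor", "Age in years"),
  row.names = colnames(pd),
  stringsAsFactors = FALSE
)

# Build the AnnotatedDataFrame only if Biobase is available.
adf <- NULL
if (requireNamespace("Biobase", quietly = TRUE)) {
  adf <- Biobase::AnnotatedDataFrame(data = pd, varMetadata = meta)
  print(Biobase::pData(adf))
}
```

The row names of meta must match the column names of pd, which is why they are derived from colnames(pd) rather than typed by hand.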
