[Bioc-devel] rownames in SummarizedExperiments

9 messages · Simon Anders, Michael Lawrence, Martin Morgan +1 more

Simon Anders

Sat, Apr 5, 2014 8:39 AM #

Hi Martin et al.

When I use "mcols" on a SummarizedExperiment object, I get a DataFrame
with the row metadata, but without rownames. This is quite annoying if I
want to select specific rows using my feature identifiers.

Would it be possible to change that?

  Simon



Demonstration of the issue:

[...]

         Sample_1    Sample_2   Sample_3    Sample_4   Sample_5
Gene_G -0.3877467 -1.70879454  0.7939223 -2.34550441  0.8595643
Gene_K -0.2552112  0.08670308 -1.4158207  0.66415623 -2.3311998
Gene_M -0.4514518  0.30546322 -0.1799235  1.32129088  0.5253143
Gene_T  0.7403792  0.22984996  0.1972806  1.67471472  0.7371430
Gene_Z  1.0290360 -0.49934034  2.5813815  0.60148770  0.3925438
Gene_B -0.6196811 -0.23720962  1.6226878 -0.39416017 -0.9792215
Gene_W  0.6341272  0.70582774  0.8372586  2.20678476 -1.5472927
Gene_L  0.4160801 -0.60180955  0.5366517 -2.41960274 -0.1754423
Gene_U  1.1331233  0.19707903 -1.1297945  0.03272385 -2.6627403
Gene_A -0.9800668 -0.61572952 -1.4320614  0.16594756 -1.4636233

 [1] "Gene_G" "Gene_K" "Gene_M" "Gene_T" "Gene_Z" "Gene_B" "Gene_W" "Gene_L"
 [9] "Gene_U" "Gene_A"

DataFrame with 10 rows and 2 columns
   yellowness  greenness
    <numeric>  <numeric>
1   0.0772510 0.77249283
2   0.9760104 0.01193128
3   0.8420076 0.61835914
4   0.2859476 0.97866574
5   0.8153054 0.91518594
6   0.4111647 0.06934432
7   0.9445907 0.39559884
8   0.7793326 0.58575488
9   0.4379305 0.53325324
10  0.2076865 0.46240856

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicRanges_1.14.4 XVector_0.2.0        IRanges_1.20.6
[4] BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] stats4_3.0.2

Simon Anders

Sat, Apr 5, 2014 8:42 AM #

Hi

On 05/04/14 17:39, Simon Anders wrote:

Okay, I should have read the help page for "mcols" before posting.
Hence, I amend my question to: Is there a reason why "use.names"
defaults to FALSE?

  Simon

1 day later

Michael Lawrence

Sun, Apr 6, 2014 2:32 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140406/0b60e317/attachment.pl>

Simon Anders

Sun, Apr 6, 2014 2:48 PM #

Hi Michael

On 06/04/14 23:32, Michael Lawrence wrote:

Thanks for the response, but I'm not sure I understand it. I thought
"use.names=TRUE" instructs "mcols" to use the rownames of the
SummarizedExperiment object as rownames for the returned DataFrame. Now,
as the rownames of the SummarizedExperiment have to be unique anyway (at
least, I suppose they have to -- they are names, too, after all, and not
just an arbitrary vector), how can it happen that duplicate names might
appear?

The use case: I have a SummarizedExperiment object with gene IDs in the
rownames. Let's say I want to get the value in the meta-data column
"yellowness" for "gene_D".

With an ExpressionSet, I could write:
   fData(es)["gene_D","yellowness"]

With SummarizedExperiment, it has to be:
   mcols(se,use.names=TRUE)["gene_D","yellowness"]

Of course, it's no big deal, but I find it quite clumsy, and I wonder
why it has to be this way.

  Simon

Michael Lawrence

Sun, Apr 6, 2014 4:21 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140406/a4992570/attachment.pl>

Martin Morgan

Sun, Apr 6, 2014 6:22 PM #

On 04/06/2014 04:21 PM, Michael Lawrence wrote:

Empirically, the row names can be duplicated, but the column names cannot.

The lack of constraint on row names is enabled by the rowData GenomicRanges, 
while the constraint on column names is introduced by the (rownames of the) 
colData DataFrame. So the lack of symmetry in the class leads to lack of 
symmetry for dimnames. The use of GenomicRanges for rows has been the subject of 
previous discussion.

It wouldn't be inconceivable to impose constraints on duplicate row names in 
SummarizedExperiment and set use.names=TRUE by default, or to redefine mcols(se) 
to use.names=!any(duplicated(rownames(se))). There would be performance 
consequences (how much?) and an mcols inconsistency. I think this is part of the 
same discussion as

   https://stat.ethz.ch/pipermail/bioc-devel/2014-March/005409.html

which I have not yet followed through on.

Syntax-wise, there is also

   mcols(se)[rownames(se) == "gene_D", "yellowness"]

This is more efficient (and more error prone) than either use.names or Michael's 
suggestion.

Martin

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
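
The trade-off Martin mentions for the `==` idiom can be illustrated in base R alone. This is only a sketch: `ids` and `meta` are hypothetical stand-ins for rownames(se) and mcols(se), not the SummarizedExperiment API, and the values are made up.

```r
# Hypothetical stand-ins (assumptions, not Bioconductor objects):
# 'ids' plays the role of rownames(se), 'meta' the role of mcols(se).
ids  <- c("gene_A", "gene_D", "gene_D", "gene_K")
meta <- data.frame(yellowness = c(0.1, 0.2, 0.3, 0.4))

# Logical indexing never needs row names on 'meta', and it returns
# every row whose identifier matches, including duplicates.
hits <- meta[ids == "gene_D", "yellowness"]

# The error-prone side: a misspelled identifier does not raise an
# error, it silently yields a zero-length result.
miss <- meta[ids == "gene_d", "yellowness"]

print(hits)
print(length(miss))
```

With a lookup by row name, a misspelled identifier would not pass silently in this way, which is presumably part of what "more error prone" refers to.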

Michael Lawrence

Sun, Apr 6, 2014 8:26 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140406/1e6104b9/attachment.pl>

Simon Anders

Sun, Apr 6, 2014 11:37 PM #

Hi

Thanks, Michael. That's the point I was missing. I never realized that
array dimnames don't have to be unique. Strange, actually.

  Simon
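
The point about array dimnames can be checked in base R. The sketch below uses a made-up matrix `m` (an assumption for illustration only) and contrasts it with a base data.frame, which rejects duplicated row names.

```r
# A matrix accepts duplicated row names without complaint.
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("a", "a", "b"), c("x", "y")))
has_dups <- anyDuplicated(rownames(m)) > 0

# Character indexing on duplicated names picks up only the first match.
first_a <- m["a", ]

# A base data.frame, by contrast, refuses duplicated row names;
# the error is captured here instead of stopping the script.
res <- tryCatch(
  data.frame(x = 1:3, row.names = c("a", "a", "b")),
  error = function(e) e
)

print(has_dups)
print(first_a)
print(inherits(res, "error"))
```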

Hervé Pagès

Sun, Apr 6, 2014 11:47 PM #

Hi Michael, Simon,

On 04/06/2014 02:32 PM, Michael Lawrence wrote:

Close but not exactly how it was. 2 (or maybe 3) years ago the names of
a GRanges object were forced to be unique. So elementMetadata() (mcols()
didn't exist at that time) was always propagating the names as the row
names with no need to mangle them.

Enforcing uniqueness of the names of a GRanges object was indirectly
causing many issues that I'm not going to re-discuss here. I'll only
mention that, because of this, the names of a GRangesList object had
to be mangled when unlisting the object. We're in a better place today
without that kind of feature :-)

H.

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319