[Bioc-devel] Changes in AnnotationDbi
OK Jim, I will put in very simple one-line messages that simply state whether the relationship between keys and the requested columns was 1:1, 1:many, many:1, or many:many. Hopefully this will represent an acceptable compromise. Marc
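To make the four cases concrete, here is a minimal base R sketch of how a key/value table could be classified. The helper name classify_mapping() is hypothetical and is not part of AnnotationDbi; it only illustrates the idea of reporting the relationship in one line.

```r
# Hypothetical helper (not an AnnotationDbi function): classify how keys relate
# to values in a two-column mapping.
classify_mapping <- function(keys, values) {
  # Work on distinct key/value pairs so repeated rows do not inflate counts.
  pairs <- unique(data.frame(key = keys, value = values,
                             stringsAsFactors = FALSE))
  key_has_many   <- any(table(pairs$key) > 1)    # a key maps to several values
  value_has_many <- any(table(pairs$value) > 1)  # a value is shared by several keys
  if (key_has_many && value_has_many) {
    "many:many"
  } else if (key_has_many) {
    "1:many"
  } else if (value_has_many) {
    "many:1"
  } else {
    "1:1"
  }
}

# A one-line note of the kind described above (wording is illustrative only).
message("mapping between keys and columns: ",
        classify_mapping(c("p1", "p2", "p2"), c("10", "20", "21")))
```

The made-up IDs in the last call give a 1:many result, because the key "p2" appears with two different values.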
On 06/05/2015 08:37 AM, James W. MacDonald wrote:
I agree that a warning is probably not the way to go, as it does imply
that there might have been something wrong with either the input or
output. Plus, not everybody understands the distinction between error
and warning.
And having additional documentation can't possibly hurt. But that
assumes that most/some/all of the end users both peruse and understand
the documentation, which we all know is not the case. The main issue,
for me at least, is that a significant proportion of people seem to
think there is some sort of uniqueness imposed on things like Entrez
Gene IDs and HUGO symbols, etc. While that is the ultimate goal, we do
not have and maybe never will achieve unique IDs for each annotatable
object.
I used to work for a PI who was a very smart, well informed
statistical geneticist who was absolutely shocked when I informed her
that a) there are SNPs in dbSNP that have more than one RS ID, and
that b) there are RS IDs in dbSNP that have been assigned to multiple
SNPs. She just assumed that there was a one-to-one RS ID -> SNP mapping.
So this is to me the crux of the problem. It is perfectly valid to
return one-to-many mappings, and that is what should be expected by
those of us who already understand such things. But for those of us
who are ignorant of the details, and those who assume uniqueness of
IDs, it would be really nice if they got a message telling them
something like
Please note that there are one-to-many mappings between the input and
output IDs, so the output is longer than your input vector. Please see
?select for more detail.
And if the message is objectionable to some, you could give the option
for people to set a global flag to shut it off. Something like
if (!pleaseMakeItStop)
    message(<message goes here>)
and they could set
pleaseMakeItStop = TRUE in their .Rprofile
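A minimal sketch of that idea, assuming the flag is stored with options() rather than as a bare global variable. The option name pleaseMakeItStop and the helper note_one_to_many() are hypothetical and not an actual AnnotationDbi setting or function.

```r
# Hypothetical helper: emit the note only when the output is longer than the
# input and the user has not switched the note off.
note_one_to_many <- function(n_in, n_out) {
  # Hypothetical option name; it defaults to FALSE when unset.
  quiet <- isTRUE(getOption("pleaseMakeItStop", default = FALSE))
  if (!quiet && n_out > n_in) {
    message("Please note that there are one-to-many mappings between the ",
            "input and output IDs, so the output is longer than your input ",
            "vector. Please see ?select for more detail.")
  }
  invisible(n_out > n_in)
}

note_one_to_many(3, 5)              # prints the note
options(pleaseMakeItStop = TRUE)    # e.g. placed in .Rprofile
note_one_to_many(3, 5)              # silent
options(pleaseMakeItStop = FALSE)   # restore the default for this session
```

Because message() raises a condition rather than writing output directly, a single call can also be silenced with suppressMessages() without any flag at all.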
Is that a reasonable compromise?
Jim
On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarlson at fredhutch.org> wrote:
Hi Jim,
I do agree that the warning was protective for that (this is why I
put it there).
But it was also annoying for many and a source of some confusion
because when people see a warning() they think that something has
gone wrong with the code that was just run. And in this case the
select method was actually doing exactly what it was supposed to
be doing. What it was actually warning you about was what you did
separately in that assignment to fit2, which is the step right
after the select method had already done its work. And I can
understand why that seems a little bit confusing since you are
basically telling someone to be careful with the data you just
gave them.
Now I could replace it with a message() I guess, but in cases like
this where the warning is about something that happens outside of
the function you are calling, shouldn't that probably be handled
by documentation? Or at least, that is the argument that finally
persuaded me to remove it. That and the fact that almost every
call to select() ended up accompanied by the warning you
mentioned, because it turns out that perfect 1:1 relationships are
pretty rare for annotation data. Very often, you are going to get
back multiple results.
But I didn't just remove the warning, I also supplied an
alternative for people who have a real need for consistent 1:1
mapping.
The mapIds() method takes most of the same arguments as select,
except that unlike select(), it only looks up one column and it
always returns a vector that is the same size as the vector that
came in.
So for your example, you could do something like this pseudocode:
mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
       keytype="PROBEID")
And mapIds will follow a rule specified by the default value for
the multiVals argument so that you can get back your results in a
1:1 manner. And if you don't like any of the options available
for the multiVals argument, you can make your own function and
pass it in.
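The "same size as the vector that came in" behaviour can be illustrated without any annotation package. The sketch below is a base R stand-in for a "take the first match" rule, not the mapIds() implementation; the table, the IDs, and the helper name map_first() are all made up for illustration.

```r
# Toy probe-to-gene table standing in for a chip annotation package.
# All IDs are invented for this example.
tbl <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  ENTREZID = c("10", "20", "21", "30"),
  stringsAsFactors = FALSE
)
keys <- c("p1", "p2", "p3", "p4")

# Hypothetical helper: match() gives the position of the first hit for each
# key, or NA when the key is absent, so the result has one entry per key.
map_first <- function(keys, tbl, column, keytype) {
  out <- tbl[[column]][match(keys, tbl[[keytype]])]
  names(out) <- keys
  out
}

ids <- map_first(keys, tbl, column = "ENTREZID", keytype = "PROBEID")
length(ids) == length(keys)   # one result per input key
```

Here "p2" has two candidate genes and only the first is kept, while the unknown key "p4" comes back as NA. Whether keeping the first match is the right rule depends on the analysis, which is why being able to pass a different rule matters.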
Anyhow, please continue to let us know what you think.
Marc
On 06/04/2015 10:50 AM, James W. MacDonald wrote:
In the last release, the warning message from select() telling people that
their results include one-to-many mappings was removed. While some may find
this warning annoying, I think silently returning something unexpected to
our users is dangerous.
In other words, for me it is a common practice to do something like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset),
              c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
                     c("ENTREZID","SYMBOL"))
Which will work just fine, but then all the annotation (except for the
first few lines) will now be completely incorrect, and there wasn't a
warning to let the end user know that they may have made a mistake.
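The misalignment can be reproduced on a toy table in base R. This is only a sketch: the probe IDs and annotation rows are made up, and the data frame gns stands in for what a one-to-many lookup would return.

```r
# Made-up probe IDs, in the order they would appear in the expression data.
probes <- c("p1", "p2", "p3")

# Stand-in for a one-to-many lookup result: probe "p2" maps to two genes.
gns <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  ENTREZID = c("10", "20", "21", "30"),
  SYMBOL   = c("A", "B", "B2", "C"),
  stringsAsFactors = FALSE
)

nrow(gns) == length(probes)   # FALSE: more annotation rows than probes

# Compare row i of the annotation with probe i: rows drift after the first
# duplicated probe.
aligned_naive <- gns$PROBEID[seq_along(probes)] == probes

# Keep the first row per probe, as in the de-duplication step.
gns1 <- gns[!duplicated(gns[, 1]), ]
aligned_dedup <- all(gns1$PROBEID == probes)
```

In this toy case the third annotation row belongs to "p2" while the third probe is "p3", so every row after the first duplicate is paired with the wrong probe until the duplicates are dropped.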
lmFit() will parse the featureData slot of an ExpressionSet and use those
data for annotation, so that gives some hypothetical protections, for those
who first put their annotation data into their ExpressionSet. However,
?eSet says:
'featureData': Contains variables describing features (i.e., rows
in 'assayData') unique to this experiment. Use the
'annotation' slot to efficiently reference feature data
common to the annotation package used in the experiment.
Class: 'AnnotatedDataFrame-class'
Which to me indicates that the featureData slot isn't really intended to
contain annotation data, but instead some unique information that pertains
to a given experiment. But maybe I misunderstand.
Is the featureData slot actually intended for annotation data? If not, what
is the intended pipeline for annotating data in an ExpressionSet? Am I
alone in being concerned about this?
Best,
Jim
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099