In the last release, the warning message from select() telling people that
their results include one-to-many mappings was removed. While some may find
this warning annoying, I think silently returning something unexpected to
our users is dangerous.
In other words, for me it is a common practice to do something like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
Which will work just fine, but then all the annotation (except for the
first few lines) will now be completely incorrect, and there wasn't a
warning to let the end user know that they may have made a mistake.
lmFit() will parse the featureData slot of an ExpressionSet and use those
data for annotation, so that gives some hypothetical protections, for those
who first put their annotation data into their ExpressionSet. However,
?eSet says:
?featureData?: Contains variables describing features (i.e., rows
in ?assayData?) unique to this experiment. Use the
?annotation? slot to efficiently reference feature data
common to the annotation package used in the experiment.
Class: ?AnnotatedDataFrame-class?
Which to me indicates that the featureData slot isn't really intended to
contain annotation data, but instead some unique information that pertains
to a given experiment. But maybe I misunderstand.
Is the featureData slot actually intended for annotation data? If not, what
is the intended pipeline for annotating data in an ExpressionSet? Am I
alone in being concerned about this?
Best,
Jim
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
[[alternative HTML version deleted]]
On Thu, Jun 4, 2015 at 1:50 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
In the last release, the warning message from select() telling people that
their results include one-to-many mappings was removed. While some may find
this warning annoying, I think silently returning something unexpected to
our users is dangerous.
I would agree. I have no problem with the warning (maybe a message would
be better).
In other words, for me it is a common practice to do something like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
Which will work just fine, but then all the annotation (except for the
first few lines) will now be completely incorrect, and there wasn't a
warning to let the end user know that they may have made a mistake.
lmFit() will parse the featureData slot of an ExpressionSet and use those
data for annotation, so that gives some hypothetical protections, for those
who first put their annotation data into their ExpressionSet. However,
?eSet says:
?featureData?: Contains variables describing features (i.e., rows
in ?assayData?) unique to this experiment. Use the
?annotation? slot to efficiently reference feature data
common to the annotation package used in the experiment.
Class: ?AnnotatedDataFrame-class?
Which to me indicates that the featureData slot isn't really intended to
contain annotation data, but instead some unique information that pertains
to a given experiment. But maybe I misunderstand.
I think we did not want to bloat the expression data with annotation if all
could be resolved via the annotation tag. I'd speculate that one of the
reasons limma plays nicely with the featureData is to cover the two-color
case custom array case, where no annotation package will resolve the
identifiers.
Now if eBayes returned an S4 class instance ....
Is the featureData slot actually intended for annotation data? If not, what
is the intended pipeline for annotating data in an ExpressionSet? Am I
alone in being concerned about this?
Best,
Jim
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
[[alternative HTML version deleted]]
Hi Jim,
I do agree that the warning was protective for that (this is why I put
it there).
But it was also annoying for many and a source of some confusion because
when people see a warning() they think that something has gone wrong
with the code that was just run. And in this case the select method was
actually doing exactly what it was supposed to be doing. What it was
actually warning you about was what you did separately in that
assignment to fit2... Which is the step right after the select method
already did it's work. And I can understand why that seems a little bit
confusing since you are basically telling someone to be careful with the
data you just gave them.
Now I could replace it with a message() I guess, but in cases like this
where the warning is about something that happens outside of the
function you are calling, shouldn't that probably be handled by
documentation? Or at least, that is the argument that finally persuaded
me to remove it. That and that fact that almost every call to select()
ended up accompanied by the warning you mentioned, because it turns out
that perfect 1:1 relationships are pretty rare for annotation data.
Very often, you are going to get back multiple results.
But I didn't just remove the warning, I also supplied an alternative for
people who have a real need for consistent 1:1 mapping.
The mapIds() method takes most of the same arguments as select, except
that unlike select(), it only looks up one column and it always returns
a vector that is the same size as the vector that came in.
So for your example, you could do something like this psuedocode here:
mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
keytype="PROBEID")
And mapIds will follow a rule specified by the default value for the
multiVals argument so that you can get back your results in a 1:1
manner. And if you don't like any of the options available for the
multiVals argument, you can make your own function and pass it in.
Anyhow please continue to let us know what you think?
Marc
On 06/04/2015 10:50 AM, James W. MacDonald wrote:
In the last release, the warning message from select() telling people that
their results include one-to-many mappings was removed. While some may find
this warning annoying, I think silently returning something unexpected to
our users is dangerous.
In other words, for me it is a common practice to do something like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
Which will work just fine, but then all the annotation (except for the
first few lines) will now be completely incorrect, and there wasn't a
warning to let the end user know that they may have made a mistake.
lmFit() will parse the featureData slot of an ExpressionSet and use those
data for annotation, so that gives some hypothetical protections, for those
who first put their annotation data into their ExpressionSet. However,
?eSet says:
?featureData?: Contains variables describing features (i.e., rows
in ?assayData?) unique to this experiment. Use the
?annotation? slot to efficiently reference feature data
common to the annotation package used in the experiment.
Class: ?AnnotatedDataFrame-class?
Which to me indicates that the featureData slot isn't really intended to
contain annotation data, but instead some unique information that pertains
to a given experiment. But maybe I misunderstand.
Is the featureData slot actually intended for annotation data? If not, what
is the intended pipeline for annotating data in an ExpressionSet? Am I
alone in being concerned about this?
Best,
Jim
I agree that a warning is probably not the way to go, as it does imply that
there might have been something wrong with either the input or output.
Plus, not everybody understands the distinction between error and warning.
And having additional documentation can't possibly hurt. But that assumes
that most/some/all of the end users both peruse and understand the
documentation, which we all know is not the case. The main issue, for me at
least, is that a significant proportion of people seem to think there is
some sort of uniqueness imposed on things like Entrez Gene IDs and Hugo
symbols, etc. While that is the ultimate goal, we do not have and maybe
never will achieve unique IDs for each annotatable object.
I used to work for a PI who was a very smart, well informed statistical
geneticist who was absolutely shocked when I informed her that a) there are
SNPs in dbSNP that have more than one RS ID, and that b.) there are RS IDs
in dbSNP that have been assigned to multiple SNPs. She just assumed that
there was a one-to-one RS ID -> SNP mapping.
So this is to me the crux of the problem. It is perfectly valid to return
one-to-many mappings, and that is what should be expected *by those of us
who already understand such things. *But for those of us who are ignorant
of the details, and those who assume uniqueness of IDs, it would be really
nice if they got a message telling them something like
*Please note that there are one-to-many mappings between the input and
output IDs, so the output is longer than your input vector. Please see
?select for more detail.*
And if the message is objectionable to some, you could give the option for
people to set a global flag to shut it off. Something like
if(!pleaseMakeItStop)
message(<message goes here>)
and they could set
pleaseMakeItStop = TRUE in their .Rprofile
Is that a reasonable compromise?
Jim
On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarlson at fredhutch.org> wrote:
Hi Jim,
I do agree that the warning was protective for that (this is why I put it
there).
But it was also annoying for many and a source of some confusion because
when people see a warning() they think that something has gone wrong with
the code that was just run. And in this case the select method was
actually doing exactly what it was supposed to be doing. What it was
actually warning you about was what you did separately in that assignment
to fit2... Which is the step right after the select method already did
it's work. And I can understand why that seems a little bit confusing
since you are basically telling someone to be careful with the data you
just gave them.
Now I could replace it with a message() I guess, but in cases like this
where the warning is about something that happens outside of the function
you are calling, shouldn't that probably be handled by documentation? Or
at least, that is the argument that finally persuaded me to remove it.
That and that fact that almost every call to select() ended up accompanied
by the warning you mentioned, because it turns out that perfect 1:1
relationships are pretty rare for annotation data. Very often, you are
going to get back multiple results.
But I didn't just remove the warning, I also supplied an alternative for
people who have a real need for consistent 1:1 mapping.
The mapIds() method takes most of the same arguments as select, except
that unlike select(), it only looks up one column and it always returns a
vector that is the same size as the vector that came in.
So for your example, you could do something like this psuedocode here:
mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
keytype="PROBEID")
And mapIds will follow a rule specified by the default value for the
multiVals argument so that you can get back your results in a 1:1 manner.
And if you don't like any of the options available for the multiVals
argument, you can make your own function and pass it in.
Anyhow please continue to let us know what you think?
Marc
On 06/04/2015 10:50 AM, James W. MacDonald wrote:
In the last release, the warning message from select() telling people that
their results include one-to-many mappings was removed. While some may
find
this warning annoying, I think silently returning something unexpected to
our users is dangerous.
In other words, for me it is a common practice to do something like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
Which will work just fine, but then all the annotation (except for the
first few lines) will now be completely incorrect, and there wasn't a
warning to let the end user know that they may have made a mistake.
lmFit() will parse the featureData slot of an ExpressionSet and use those
data for annotation, so that gives some hypothetical protections, for
those
who first put their annotation data into their ExpressionSet. However,
?eSet says:
?featureData?: Contains variables describing features (i.e., rows
in ?assayData?) unique to this experiment. Use the
?annotation? slot to efficiently reference feature data
common to the annotation package used in the experiment.
Class: ?AnnotatedDataFrame-class?
Which to me indicates that the featureData slot isn't really intended to
contain annotation data, but instead some unique information that pertains
to a given experiment. But maybe I misunderstand.
Is the featureData slot actually intended for annotation data? If not,
what
is the intended pipeline for annotating data in an ExpressionSet? Am I
alone in being concerned about this?
Best,
Jim
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
[[alternative HTML version deleted]]
OK Jim,
I will put very simple messages in (one liners) that will simply state
whether the relationship between keys and the requested columns was 1:1,
1:many, many:1, or many:many. Hopefully this will represent an
acceptable compromise.
Marc
On 06/05/2015 08:37 AM, James W. MacDonald wrote:
I agree that a warning is probably not the way to go, as it does imply
that there might have been something wrong with either the input or
output. Plus, not everybody understands the distinction between error
and warning.
And having additional documentation can't possibly hurt. But that
assumes that most/some/all of the end users both peruse and understand
the documentation, which we all know is not the case. The main issue,
for me at least, is that a significant proportion of people seem to
think there is some sort of uniqueness imposed on things like Entrez
Gene IDs and Hugo symbols, etc. While that is the ultimate goal, we do
not have and maybe never will achieve unique IDs for each annotatable
object.
I used to work for a PI who was a very smart, well informed
statistical geneticist who was absolutely shocked when I informed her
that a) there are SNPs in dbSNP that have more than one RS ID, and
that b.) there are RS IDs in dbSNP that have been assigned to multiple
SNPs. She just assumed that there was a one-to-one RS ID -> SNP mapping.
So this is to me the crux of the problem. It is perfectly valid to
return one-to-many mappings, and that is what should be expected /by
those of us who already understand such things. /But for those of us
who are ignorant of the details, and those who assume uniqueness of
IDs, it would be really nice if they got a message telling them
something like
/Please note that there are one-to-many mappings between the input and
output IDs, so the output is longer than your input vector. Please see
?select for more detail./
/
/
And if the message is objectionable to some, you could give the option
for people to set a global flag to shut it off. Something like
if(!pleaseMakeItStop)
message(<message goes here>)
and they could set
pleaseMakeItStop = TRUE in their .Rprofile
Is that a reasonable compromise?
Jim
On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarlson at fredhutch.org
<mailto:mcarlson at fredhutch.org>> wrote:
Hi Jim,
I do agree that the warning was protective for that (this is why I
put it there).
But it was also annoying for many and a source of some confusion
because when people see a warning() they think that something has
gone wrong with the code that was just run. And in this case the
select method was actually doing exactly what it was supposed to
be doing. What it was actually warning you about was what you did
separately in that assignment to fit2... Which is the step right
after the select method already did it's work. And I can
understand why that seems a little bit confusing since you are
basically telling someone to be careful with the data you just
gave them.
Now I could replace it with a message() I guess, but in cases like
this where the warning is about something that happens outside of
the function you are calling, shouldn't that probably be handled
by documentation? Or at least, that is the argument that finally
persuaded me to remove it. That and that fact that almost every
call to select() ended up accompanied by the warning you
mentioned, because it turns out that perfect 1:1 relationships are
pretty rare for annotation data. Very often, you are going to get
back multiple results.
But I didn't just remove the warning, I also supplied an
alternative for people who have a real need for consistent 1:1
mapping.
The mapIds() method takes most of the same arguments as select,
except that unlike select(), it only looks up one column and it
always returns a vector that is the same size as the vector that
came in.
So for your example, you could do something like this psuedocode here:
mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
keytype="PROBEID")
And mapIds will follow a rule specified by the default value for
the multiVals argument so that you can get back your results in a
1:1 manner. And if you don't like any of the options available
for the multiVals argument, you can make your own function and
pass it in.
Anyhow please continue to let us know what you think?
Marc
On 06/04/2015 10:50 AM, James W. MacDonald wrote:
In the last release, the warning message from select() telling
people that
their results include one-to-many mappings was removed. While
some may find
this warning annoying, I think silently returning something
unexpected to
our users is dangerous.
In other words, for me it is a common practice to do something
like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already
know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
Which will work just fine, but then all the annotation (except
for the
first few lines) will now be completely incorrect, and there
wasn't a
warning to let the end user know that they may have made a
mistake.
lmFit() will parse the featureData slot of an ExpressionSet
and use those
data for annotation, so that gives some hypothetical
protections, for those
who first put their annotation data into their ExpressionSet.
However,
?eSet says:
?featureData?: Contains variables describing features (i.e.,
rows
in ?assayData?) unique to this experiment. Use the
?annotation? slot to efficiently reference feature data
common to the annotation package used in the
experiment.
Class: ?AnnotatedDataFrame-class?
Which to me indicates that the featureData slot isn't really
intended to
contain annotation data, but instead some unique information
that pertains
to a given experiment. But maybe I misunderstand.
Is the featureData slot actually intended for annotation data?
If not, what
is the intended pipeline for annotating data in an
ExpressionSet? Am I
alone in being concerned about this?
Best,
Jim
_______________________________________________
Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
On Mon, Jun 8, 2015 at 3:12 PM, Marc Carlson <mcarlson at fredhutch.org> wrote:
OK Jim,
I will put very simple messages in (one liners) that will simply state
whether the relationship between keys and the requested columns was 1:1,
1:many, many:1, or many:many. Hopefully this will represent an acceptable
compromise.
Marc
On 06/05/2015 08:37 AM, James W. MacDonald wrote:
I agree that a warning is probably not the way to go, as it does imply
that there might have been something wrong with either the input or output.
Plus, not everybody understands the distinction between error and warning.
And having additional documentation can't possibly hurt. But that
assumes that most/some/all of the end users both peruse and understand the
documentation, which we all know is not the case. The main issue, for me at
least, is that a significant proportion of people seem to think there is
some sort of uniqueness imposed on things like Entrez Gene IDs and Hugo
symbols, etc. While that is the ultimate goal, we do not have and maybe
never will achieve unique IDs for each annotatable object.
I used to work for a PI who was a very smart, well informed statistical
geneticist who was absolutely shocked when I informed her that a) there are
SNPs in dbSNP that have more than one RS ID, and that b.) there are RS IDs
in dbSNP that have been assigned to multiple SNPs. She just assumed that
there was a one-to-one RS ID -> SNP mapping.
So this is to me the crux of the problem. It is perfectly valid to
return one-to-many mappings, and that is what should be expected *by
those of us who already understand such things. *But for those of us who
are ignorant of the details, and those who assume uniqueness of IDs, it
would be really nice if they got a message telling them something like
*Please note that there are one-to-many mappings between the input and
output IDs, so the output is longer than your input vector. Please see
?select for more detail.*
And if the message is objectionable to some, you could give the option
for people to set a global flag to shut it off. Something like
if(!pleaseMakeItStop)
message(<message goes here>)
and they could set
pleaseMakeItStop = TRUE in their .Rprofile
Is that a reasonable compromise?
Jim
On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarlson at fredhutch.org>
wrote:
Hi Jim,
I do agree that the warning was protective for that (this is why I put it
there).
But it was also annoying for many and a source of some confusion because
when people see a warning() they think that something has gone wrong with
the code that was just run. And in this case the select method was
actually doing exactly what it was supposed to be doing. What it was
actually warning you about was what you did separately in that assignment
to fit2... Which is the step right after the select method already did
it's work. And I can understand why that seems a little bit confusing
since you are basically telling someone to be careful with the data you
just gave them.
Now I could replace it with a message() I guess, but in cases like this
where the warning is about something that happens outside of the function
you are calling, shouldn't that probably be handled by documentation? Or
at least, that is the argument that finally persuaded me to remove it.
That and that fact that almost every call to select() ended up accompanied
by the warning you mentioned, because it turns out that perfect 1:1
relationships are pretty rare for annotation data. Very often, you are
going to get back multiple results.
But I didn't just remove the warning, I also supplied an alternative for
people who have a real need for consistent 1:1 mapping.
The mapIds() method takes most of the same arguments as select, except
that unlike select(), it only looks up one column and it always returns a
vector that is the same size as the vector that came in.
So for your example, you could do something like this psuedocode here:
mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
keytype="PROBEID")
And mapIds will follow a rule specified by the default value for the
multiVals argument so that you can get back your results in a 1:1 manner.
And if you don't like any of the options available for the multiVals
argument, you can make your own function and pass it in.
Anyhow please continue to let us know what you think?
Marc
On 06/04/2015 10:50 AM, James W. MacDonald wrote:
In the last release, the warning message from select() telling people
that
their results include one-to-many mappings was removed. While some may
find
this warning annoying, I think silently returning something unexpected to
our users is dangerous.
In other words, for me it is a common practice to do something like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
Which will work just fine, but then all the annotation (except for the
first few lines) will now be completely incorrect, and there wasn't a
warning to let the end user know that they may have made a mistake.
lmFit() will parse the featureData slot of an ExpressionSet and use those
data for annotation, so that gives some hypothetical protections, for
those
who first put their annotation data into their ExpressionSet. However,
?eSet says:
?featureData?: Contains variables describing features (i.e., rows
in ?assayData?) unique to this experiment. Use the
?annotation? slot to efficiently reference feature data
common to the annotation package used in the experiment.
Class: ?AnnotatedDataFrame-class?
Which to me indicates that the featureData slot isn't really intended to
contain annotation data, but instead some unique information that
pertains
to a given experiment. But maybe I misunderstand.
Is the featureData slot actually intended for annotation data? If not,
what
is the intended pipeline for annotating data in an ExpressionSet? Am I
alone in being concerned about this?
Best,
Jim
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
[[alternative HTML version deleted]]
In other words, for me it is a common practice to do something like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
I'm not even that happy with James' first solution, as it relies on the
order being correct after removing the duplicates. I'd feel safer to use
'match' to ensure that. (What if an EntrezId is not found in the
Annotation DB? Will we have a line with NA, or is the line simply
missing? The latter would break James' code.)
What users really want here is a way to get the "preferred" symbol for
an entrezId, and for lack of this, they accept simply a random one or
the first one (in some unspecified collation). So, we should have a
function, maybe 'select1', to select one and only one hit for each query
value.
select1(x, keys, columns, keytype, requireUnique=FALSE, ... )
This would query the AnnotationDbi object 'x' as does 'select', but
return a data frame with the columns specified in 'columns', and the
vector that was passed as 'keys' as row names, thus guaranteeing that
each line in the data frame corresponds to one query key. If there were
multiple records for a key, the first one is used, unless
'requireUnique' is set, in which case an error is issued. And if no
record is present for a key, the data frame contains a row of NAs for
this key.
This would be quite convenient for any kind of ID conversion issues.
Simon
Hi
My two cents:
On 04/06/15 19:50, James W. MacDonald wrote:
In other words, for me it is a common practice to do something like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
I'm not even that happy with James' first solution, as it relies on the order
being correct after removing the duplicates. I'd feel safer to use 'match' to
ensure that. (What if an EntrezId is not found in the Annotation DB? Will we
have a line with NA, or is the line simply missing? The latter would break
James' code.)
What users really want here is a way to get the "preferred" symbol for an
entrezId, and for lack of this, they accept simply a random one or the first one
(in some unspecified collation). So, we should have a function, maybe 'select1',
to select one and only one hit for each query value.
select1(x, keys, columns, keytype, requireUnique=FALSE, ... )
This would query the AnnotationDbi object 'x' as does 'select', but return a
data frame with the columns specified in 'columns', and the vector that was
passed as 'keys' as row names, thus guaranteeing that each line in the data
frame corresponds to one query key. If there were multiple records for a key,
the first one is used, unless 'requireUnique' is set, in which case an error is
issued. And if no record is present for a key, the data frame contains a row of
NAs for this key.
This would be quite convenient for any kind of ID conversion issues.
In case you missed it in Marc's reply, and acknowledging that this is different
from your suggestion, there is mapIds() for doing this on a single column basis,
which is the common use case where one doesn't care too much about multiple
mapping ids
> org = org.Hs.eg.db
> head(select(org, keys(org), "ALIAS"))
ENTREZID ALIAS
1 1 A1B
2 1 ABG
3 1 GAB
4 1 HYST2477
5 1 A1BG
6 2 A2MD
> head(mapIds(org, keys(org), "ALIAS", "ENTREZID"))
1 2 3 9 10 11
"A1B" "A2MD" "A2MP" "AAC1" "AAC2" "AACP"
> head(mapIds(org, keys(org), "ALIAS", "ENTREZID", multiVals="CharacterList"))
CharacterList of length 6
[["1"]] A1B ABG GAB HYST2477 A1BG
[["2"]] A2MD CPAMD5 FWP007 S863-7 A2M
[["3"]] A2MP A2MP1
[["9"]] AAC1 MNAT NAT-1 NATI NAT1
[["10"]] AAC2 NAT-2 PNAT NAT2
[["11"]] AACP NATP1 NATP
> str(head(mapIds(org, keys(org), "ALIAS", "ENTREZID", multiVals="list")))
List of 6
$ 1 : chr [1:5] "A1B" "ABG" "GAB" "HYST2477" ...
$ 2 : chr [1:5] "A2MD" "CPAMD5" "FWP007" "S863-7" ...
$ 3 : chr [1:2] "A2MP" "A2MP1"
$ 9 : chr [1:5] "AAC1" "MNAT" "NAT-1" "NATI" ...
$ 10: chr [1:4] "AAC2" "NAT-2" "PNAT" "NAT2"
$ 11: chr [1:3] "AACP" "NATP1" "NATP"
Also since this is the devel list, there is
> library(dplyr)
> d = src_sqlite(org.Hs.eg_dbfile())
> d
src: sqlite 3.8.6
[/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.2-BiocDevel/org.Hs.eg.db/extdata/org.Hs.eg.sqlite]
tbls: accessions, alias, chrlengths, chromosome_locations, chromosomes,
cytogenetic_locations, ec, ensembl, ensembl_prot, ensembl_trans,
ensembl2ncbi, gene_info, genes, go, go_all, go_bp, go_bp_all, go_cc,
go_cc_all, go_mf, go_mf_all, kegg, map_counts, map_metadata, metadata,
ncbi2ensembl, omim, pfam, prosite, pubmed, refseq, sqlite_stat1, ucsc,
unigene, uniprot
> d %>% tbl("alias") %>% group_by(`_id`) %>% summarize(alias_symbol)
Source: sqlite 3.8.6
[/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.2-BiocDevel/org.Hs.eg.db/extdata/org.Hs.eg.sqlite]
From: <derived table> [?? x 2]
_id alias_symbol
1 1 A1BG
2 2 A2M
3 3 A2MP1
4 4 NAT1
5 5 NAT2
6 6 NATP
7 7 SERPINA3
8 8 AADAC
9 9 AAMP
10 10 AANAT
.. ... ...
(with lots of nice confusion there, including extensive masking of symbols
between dplyr / AnnotationDbi, need for knowledge of the schema (basically a
central id, ENTREZID for org packages, and tables of mappings from the central
id to other ids), and the more-or-less arbitrary choice of alias_symbol).
Martin
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
As select() works currently, the returned keys are in identical order as
the input, with extra rows inserted as needed. And any one-to-nothing
mapping results in a NA returned. So by definition my (admittedly naive)
method is guaranteed to work. But your point is well taken - match() is
safer. But my point isn't to say how people should deal with duplicates,
and I didn't intend for my example to be an example of what anybody should
do. Instead, I object to the idea that we would silently return something
that the average user might not expect, without some indication to alert
the user.
Anyway, the issue is much more complicated than you seem to realize. If you
ask for more than one thing to be returned (e.g., ENTREZID and SYMBOL), if
either has duplicates, you get multiple rows returned. So if there is a
single Entrez Gene ID, but multiple symbols, which symbol do you choose?
Which is the 'preferred' symbol? For that matter, which is the 'preferred'
Entrez Gene ID? Are there duplicates because NCBI hasn't yet discontinued
some duplicates that point to the same thing, or are there duplicates
because a given reporter is measuring multiple highly similar genes?
We have long insisted that the annotation packages we provide are 'as is',
and that we are not the arbiters of correctness. We simply give end users
the ability to easily get available data from within R. I would argue that
we should maintain that stance. I am not particularly enthused with any
filtering we might do, as I don't think we have the time or personnel
required to do it well. We should leave it to the manufacturer of the array
and the people at NCBI and Ensembl to do that part. And it should be up to
the end user to figure which ID is the 'preferred' one. As a simple example:
CharacterList of length 6
[["16657436"]] 100287102 100288486 100287029 84771 100287596
[["16657440"]] 100422919 100422834 100422831 100302278
[["16657450"]] 729737 101929819 441124 100132062 101059936
[["16657473"]] 81399 729759 26683 441308
[["16657910"]] 8511 8510
[["16658119"]] 284630 100287898
If we now go to NCBI and look at the five IDs for the first gene, they are
all current, and map to DDX11L1, DDX11L9, DDX11L10, DDX11L2, and DDX11L5.
We can certainly choose the first one (and that is what I do for my
collaborators, in general), but is that the right thing to do? If so, why?
Best,
Jim
On Tue, Jun 9, 2015 at 5:52 AM, Simon Anders <anders at embl.de> wrote:
Hi
My two cents:
On 04/06/15 19:50, James W. MacDonald wrote:
In other words, for me it is a common practice to do something like this:
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
I add in the step where dups are removed because I already know they are
there. But a naive user might instead do
fit2$genes <- select(<chippackage>, featureNames(eset),
c("ENTREZID","SYMBOL"))
I'm not even that happy with James' first solution, as it relies on the
order being correct after removing the duplicates. I'd feel safer to use
'match' to ensure that. (What if an EntrezId is not found in the Annotation
DB? Will we have a line with NA, or is the line simply missing? The latter
would break James' code.)
What users really want here is a way to get the "preferred" symbol for an
entrezId, and for lack of this, they accept simply a random one or the
first one (in some unspecified collation). So, we should have a function,
maybe 'select1', to select one and only one hit for each query value.
select1(x, keys, columns, keytype, requireUnique=FALSE, ... )
This would query the AnnotationDbi object 'x' as does 'select', but return
a data frame with the columns specified in 'columns', and the vector that
was passed as 'keys' as row names, thus guaranteeing that each line in the
data frame corresponds to one query key. If there were multiple records for
a key, the first one is used, unless 'requireUnique' is set, in which case
an error is issued. And if no record is present for a key, the data frame
contains a row of NAs for this key.
This would be quite convenient for any kind of ID conversion issues.
Simon
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
[[alternative HTML version deleted]]
In case you missed it in Marc's reply, and acknowledging that this is
different from your suggestion, there is mapIds() for doing this on a
single column basis, which is the common use case where one doesn't care
too much about multiple mapping ids
I have indeed missed this point in Marc's reply -- and you are right,
the single column case is the only one where it is common that one does
not care for multiple mapping. So, sorry for the noise.
How comes I never knew 'mapIds' even though it is clearly mentioned in
the AnnotationDb help page? Maybe the page is too long, or --more
likely-- I'm to impatient when browsing through help pages.
Simon