How to ignore data
4 messages · Steve Sidney, Ben Bolker, Bert Gunter
Steve Sidney <sbsidney <at> mweb.co.za> writes:
Dear list,

I have quite a small data set in which I need to have certain values ignored: not used when performing an analysis, but included later in the report that I write. Can anyone help with a suggestion as to how this can be accomplished?

Values to be ignored: 0 (zero) and 1, in addition to NA (null). The reason is that I need to use the log10 of the values when performing the calculation. Currently I hand-massage the data set, about 100 values, of which 5 to 10 are in this category. The NA values are NOT the problem.

What I was hoping was that I would not have to use a series of if and ifelse statements. Perhaps there is a more elegant solution. Any ideas would be welcomed.

Regards
Steve
It would help to have a more precise/reproducible example, but if your data set (a data frame) is d, and you want to ignore cases where the response variable x is either 0 or 1, you could say

ds <- subset(d, !x %in% c(0, 1))

Some modeling functions (such as lm()), but not all of them, have a 'subset' argument, so you can provide this criterion on the fly:

lm(..., subset = (!x %in% c(0, 1)))
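A fuller sketch of this suggestion, with made-up numbers since the poster's actual data are not shown: the original vector is left untouched for the report, while the analysis uses only the filtered values. One detail worth noting is that NA %in% c(0, 1) evaluates to FALSE, so NAs must be excluded separately if you want them out of the analysis too.

```r
# Hypothetical counts standing in for the poster's ~100 values
x <- c(23, 0, 150, 1, NA, 47, 310)

# TRUE for values usable in a log10 analysis: not 0, not 1, not NA.
# Note that NA %in% c(0, 1) is FALSE, so NA is tested separately.
keep <- !(x %in% c(0, 1)) & !is.na(x)

logx <- log10(x[keep])   # analysis uses only the valid values

x     # the original vector is untouched and can still go in the report
logx
```

No ifelse() chains are needed; the logical index does all the work, and the unfiltered x remains available for reporting.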
Values to be ignored: 0 (zero) and 1, in addition to NA (null). The reason is that I need to use the log10 of the values when performing the calculation. Currently I hand-massage the data set, about 100 values, of which 5 to 10 are in this category.
This is probably a bad idea, perhaps even a VERY bad idea, though without knowing the details of what you are doing, one cannot be sure. The reason is that by removing these values you may be biasing the analysis.

For example, if they are values that are below some threshold LOD (limit of detection), they are censored, and this censoring needs to be explicitly accounted for (e.g. with the survival package). If they represent in some sense "unusual" values (some call these "outliers", a pejorative label that I believe should be avoided given all the scientific and statistical BS associated with the term), one is then bound to ask: "How unusual? Why unusual? What do they tell us about the scientific questions of concern?" If they are just "errors" of some sort (like negative incomes or volumes), well then, you're probably OK removing them.

The reason I mention this is that I have too often seen scientists use poor strategies for analyzing censored data, and this can end up producing baloney conclusions that don't replicate. It's a somewhat subtle, but surprisingly common, issue (due to measurement limitations) that most scientists are trained neither to recognize nor to deal with. So their problematical approaches are understandable, but unfortunate.

Therefore take care ... and, if necessary, consult your local statistician for help.

-- Bert
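To make the censoring point concrete: if the small values really were readings below a detection limit (Steve later says his are not), one hedged sketch of handling them with the survival package is a left-censored lognormal fit. The counts and the detection limit below are entirely hypothetical.

```r
library(survival)  # ships with R as a recommended package

y   <- c(23, 150, 2, 47, 310, 2, 88, 2)  # hypothetical counts; 2 = detection limit
lod <- 2
observed <- y > lod                       # FALSE marks a left-censored value

# Lognormal fit that treats values at the LOD as left-censored rather
# than dropping them, avoiding the bias described above.
fit <- survreg(Surv(y, observed, type = "left") ~ 1, dist = "lognormal")
summary(fit)  # intercept estimates the mean of log(y)
```

Unlike simply deleting the sub-LOD values, this keeps the information that a measurement occurred and fell below the limit, which is what prevents the bias Bert warns about.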
The NA values are NOT the problem. What I was hoping was that I would not have to use a series of if and ifelse statements. Perhaps there is a more elegant solution.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Bert Gunter Genentech Nonclinical Biostatistics 467-7374 http://devo.gene.com/groups/devo/depts/ncb/home.shtml
Thanks for the comments. Please see my reply to Stavros - the counts represent organisms, and, by the way, both the mean and the median are virtually unaffected by the removal of these values. Furthermore, experience rather than statistics indicates that these values are in fact gross errors, and, as you of course mention, I think one can quite safely remove them. I totally agree about the question of what is an outlier, but since these results are obtained from a Proficiency Testing programme, we are pretty sure what the anticipated results are, at least the range, and in this case these values are considered errors.

Steve
On 2010/12/13 07:09 PM, Bert Gunter wrote: