
R how to find outliers and zero mean columns?

8 messages · Norman Pat, Jordan Meyer, Jim Lemon +1 more

#
Hi team

I am new to R so please help me to do this task.

Please find the attached data sample. In the original data frame I
have 350 features and 400000 observations.

I need to carry out these tasks.

1. How to identify features (names) that have all zeros?

2. How to remove features that have all zeros from the dataset?

3. How to identify features (names) that have outliers such as 99999 or -1 in
the data frame?

4. How to remove outliers?


Many thanks
#
No. Nothing attached. Please read the Rhelp Info page and the Posting Guide.
Who is assigning you this task? Homework? (Read the Posting Guide.)
That's generally pretty simple if "names" refers to columns in a dataframe.
But maybe you mean to process by rows?
You could start by defining "outliers" in something other than vague examples. If this is data from a real-life data gathering effort, then defining outliers would start with an explanation of the context.
Please at least do the following "homework".
David Winsemius
Alameda, CA, USA
#
I strongly suggest checking out some R tutorials. Most of these tasks are
basic data management that are likely covered in just about any tutorial.
I'm afraid that this isn't the appropriate forum for such basics.
On Mar 30, 2016 9:14 PM, "Norman Pat" <normanmath1 at gmail.com> wrote:

#
Hi David,
No. Nothing attached. Please read the Rhelp Info page and the Posting Guide.
*I attached it. Anyway I have attached it again (sample train.xlsx).*

Who is assigning you this task? Homework? (Read the Posting Guide.)
*This is my new job role so I have to do that. I know some basic R *
That's generally pretty simple if "names" refers to columns in a data frame.
*You mean something like names(data.nrow(means==0))?*
But maybe you mean to process by rows?
*in a column(feature) *
*Please refer to the attached excel file*
You could start by defining "outliers" in something other than vague
examples. If this is data from a real-life data gathering effort, then
defining outliers would start with an explanation of the context.
*By looking at data I need to find the outliers*

*Thanks *


On Thu, Mar 31, 2016 at 12:20 PM, David Winsemius <dwinsemius at comcast.net>
wrote:
#
Hi Norman,
To check whether all values of an object (say "x") fulfill a certain
condition (==0):

all(x==0)
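One caveat, sketched here on made-up vectors (the names x_clean, x_missing and x_mixed are only for illustration): if x contains missing values, the comparison returns NA for them, and all() may then return NA rather than TRUE or FALSE. Passing na.rm = TRUE is one way around that, assuming you are happy to ignore the missing entries.

```r
# Made-up vectors, for illustration only
x_clean   <- c(0, 0, 0)
x_missing <- c(0, NA, 0)
x_mixed   <- c(0, NA, 5)

all(x_clean == 0)                  # TRUE
all(x_missing == 0)                # NA: the missing value leaves the answer unknown
all(x_missing == 0, na.rm = TRUE)  # TRUE: missing values are ignored
all(x_mixed == 0, na.rm = TRUE)    # FALSE: 5 is not zero
```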

If your object (X) is indeed a data frame, you have to do this column
by column, so to get the results:

X <- data.frame(A = c(0, 1:10), B = c(0, 2:10, 99999),
                C = c(0, -1, 3:11), D = rep(0, 11))
all_zeros <- function(x) return(all(x == 0))
which_cols <- unlist(lapply(X, all_zeros))

If your data frame (or a subset) contains all numeric values, you can
finesse the same check for rows by converting it to a matrix:

which_rows<-apply(as.matrix(X),1,all_zeros)

What you get is a list of logical (TRUE/FALSE) values from lapply, so
it has to be unlisted to get a vector of logical values like you get
with "apply".

You can then use that vector to index (subset) the original data frame
by logically inverting it with ! (NOT):

X[,!which_cols]
X[!which_rows,]
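Putting those steps together, here is a minimal sketch on the same toy data frame. It is a variation rather than the only way to do it: it assumes missing values should be ignored (so a column containing nothing but NA would also be flagged), uses vapply() so that no unlist() is needed, and adds drop = FALSE so the result stays a data frame even if only one column survives. The names zero_cols, zero_rows and X_clean are made up for this example.

```r
# Same toy data frame as in the example above
X <- data.frame(A = c(0, 1:10), B = c(0, 2:10, 99999),
                C = c(0, -1, 3:11), D = rep(0, 11))

# TRUE only when every non-missing value is zero
all_zeros <- function(x) all(x == 0, na.rm = TRUE)

# vapply returns a named logical vector directly
zero_cols <- vapply(X, all_zeros, logical(1))
zero_rows <- apply(as.matrix(X), 1, all_zeros)

names(X)[zero_cols]   # "D"
which(zero_rows)      # row 1 is the only all-zero row

# Drop the all-zero rows and columns in one step
X_clean <- X[!zero_rows, !zero_cols, drop = FALSE]
dim(X_clean)          # 10 rows, 3 columns
```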

Your "outliers" look suspiciously like missing values from certain
statistical packages. If you know the values you are looking for, you
can do something like:

NA99999<-X==99999

and then "remove" them by replacing those values with NA:

X[NA99999]<-NA
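If there are several such codes, the same idea can be sketched for a whole set of values at once. The set below (99999 and -1) is only an assumption taken from the original question; which values really are missing-value codes depends on where the data came from. The names sentinels and hits are made up for this example.

```r
# Same toy data frame as in the example above
X <- data.frame(A = c(0, 1:10), B = c(0, 2:10, 99999),
                C = c(0, -1, 3:11), D = rep(0, 11))

# Assumed sentinel codes; adjust these to the real data
sentinels <- c(99999, -1)

# Count sentinel hits per column, then list the columns with at least one
hits <- vapply(X, function(x) sum(x %in% sentinels), integer(1))
names(hits)[hits > 0]   # "B" "C"

# Replace the sentinel codes with NA, column by column
X[] <- lapply(X, function(x) replace(x, x %in% sentinels, NA))
colSums(is.na(X))       # one NA each in B and C, none in A and D
```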

Be aware that all these hackles (diminutive of hacks) are pretty
specific to this example. Also remember that if this is homework, your
karma has just gone down the cosmic sinkhole.

Jim
On Thu, Mar 31, 2016 at 9:56 AM, Norman Pat <normanmath1 at gmail.com> wrote:
#
Hi Jim,
    Thanks for your reply. I know these basic things in R.

But what I want to know is this: let's say you have a data frame X with
300 features, some of which have zero values for all the observations in
that sample.

Here I am looking for a package or a function to do that.

And how do I know whether there are abnormal values for each feature? Let's
say
I have 300 features and 100000 observations. It is hard to look at everything
in the Excel file. Instead of that I am looking for a package that does the
work.

I hope you understood.

Thanks a lot

Cheers
On Thu, Mar 31, 2016 at 1:13 PM, Jim Lemon <drjimlemon at gmail.com> wrote:

#
How about:

# if a data frame
names(X)[which_cols]

# and if you have rownames:
rownames(X)[which_rows]

My note about hackles was that packages generally don't know what
values are "abnormal" unless you specify them. Just like us. So you
have to specify what the range of "normal" values is, or what
specific values are "abnormal". There is a package named "outliers",
and while it would identify the 99999 value in the example I used, it
wouldn't do so for the -1.
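To make that concrete, here is a rough sketch in base R on the same toy data frame. The limits lower and upper are purely hypothetical; real limits have to come from knowing what the variables measure. For comparison, the last two lines use boxplot.stats() from grDevices (not the "outliers" package), which applies the usual 1.5 * IQR boxplot rule and shows the same pattern on this example: 99999 is flagged, -1 is not.

```r
# Same toy data frame as in the earlier example
X <- data.frame(A = c(0, 1:10), B = c(0, 2:10, 99999),
                C = c(0, -1, 3:11), D = rep(0, 11))

# Hypothetical plausible range, applied to every column
lower <- 0
upper <- 1000

# Logical matrix with the same shape as X
out_of_range <- X < lower | X > upper

names(X)[colSums(out_of_range) > 0]    # "B" "C"
which(out_of_range, arr.ind = TRUE)    # row and column of each flagged cell

# A generic rule, by contrast, only sees what is far from the bulk of the data
boxplot.stats(X$B)$out                 # 99999
boxplot.stats(X$C)$out                 # empty: -1 is within the whiskers
```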

Jim
On Thu, Mar 31, 2016 at 1:30 PM, Norman Pat <normanmath1 at gmail.com> wrote:
#
I didn't say you didn't attach it. I only said there was nothing attached. There's a difference. The mail-server strips most attachments. I _told_ you to read certain documents. You are not demonstrating that you are capable of following basic instructions.