remove
Sorry Jeff, I did not finish my email. I accidentally touched the send button.
My question was this: when I used
length(unique(result2$first))
vs
dim(result2[!duplicated(result2[,c('first')]),])[1]
I did get different results but now I found out the problem.
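The thread does not say what the problem turned out to be. As a minimal sketch, the two expressions above can be compared on the rows that survive the filter in the example data (result2 is rebuilt by hand here so the block runs on its own). On this data they agree. The last line shows one way such counts can diverge, which is only a guess at the kind of slip involved: length() applied to a data frame counts columns, not rows.

```r
# result2 rebuilt by hand from the example data (Bob and Cory rows only)
result2 <- data.frame(
  first = c("Bob", "Cory", "Cory", "Bob", "Bob"),
  week  = c(1, 1, 2, 2, 3),
  last  = c("John", "Jack", "Jack", "John", "John"),
  stringsAsFactors = FALSE
)

# Three equivalent ways to count distinct first names
n_unique <- length(unique(result2$first))
n_nodup  <- sum(!duplicated(result2$first))
n_rows   <- dim(result2[!duplicated(result2[, c("first")]), ])[1]

c(n_unique, n_nodup, n_rows)

# Possible pitfall (an assumption, not taken from the thread):
# single-bracket indexing keeps a data frame, so length() counts columns
n_cols <- length(unique(result2["first"]))
n_cols
```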
Thank you!
On Sun, Feb 12, 2017 at 6:31 PM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
Your question mystifies me, since it looks to me like you already know the answer.
--
Sent from my phone. Please excuse my brevity.
On February 12, 2017 3:30:49 PM PST, Val <valkremk at gmail.com> wrote:
Hi Jeff and all,
How do I get the number of unique first names in the two data sets?
For the first one,
result2 <- DF[ 1 == err2, ]
length(unique(result2$first))
On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
The "by" function aggregates and returns a result with generally fewer
rows than the original data. Since you are looking to index the rows in
the original data set, the "ave" function is better suited because it
always returns a vector that is just as long as the input vector:
# I usually work with character data rather than factors if I plan
# to modify the data (e.g. removing rows)
DF <- read.table( text=
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )
err <- ave( DF$last
          , DF[ , "first", drop = FALSE]
          , FUN = function( lst ) {
              length( unique( lst ) )
            }
          )
result <- DF[ "1" == err, ]
result
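To make the difference in output length concrete, here is a small sketch that runs both approaches on the same example data. The helper name n_last is made up for this illustration; the data frame is rebuilt with data.frame() so the block stands alone.

```r
DF <- data.frame(
  first = c("Alex", "Bob", "Cory", "Cory", "Bob", "Bob", "Alex", "Alex", "Alex"),
  week  = c(1, 1, 1, 2, 2, 3, 2, 3, 4),
  last  = c("West", "John", "Jack", "Jack", "John", "John", "Joseph", "West", "West"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct values in a vector
n_last <- function(x) length(unique(x))

# by() gives one value per group (one per first name)
per_group <- by(DF, DF["first"], function(d) n_last(d$last))

# ave() gives one value per input row, aligned with DF
per_row <- ave(seq_along(DF$last), DF$first,
               FUN = function(i) n_last(DF$last[i]))

length(per_group)  # number of groups
length(per_row)    # number of rows in DF
```

Because per_row lines up with the rows of DF, it can be used directly as a logical index such as DF[per_row == 1, ].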
Notice that the ave function returns a vector of the same type as was
given to it, so even though the function returns a numeric, the err
vector is character.
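A short sketch of that type behaviour, using a three-row toy input made up for this illustration. Converting with as.integer() is one optional way to make the later comparison explicitly numeric; comparing against "1" as above works too.

```r
# Toy vectors (hypothetical, not the full data set)
first <- c("Alex", "Bob", "Alex")
last  <- c("West", "John", "Joseph")

# The input is character, so the counts come back as character strings
err <- ave(last, first, FUN = function(lst) length(unique(lst)))
class(err)

# Optional: convert to integer before comparing
err_n <- as.integer(err)
keep  <- err_n == 1L
keep
```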
If you wanted to be able to examine more than one other column in
determining the keep/reject decision, you could do:
err2 <- ave( seq_along( DF$first )
           , DF[ , "first", drop = FALSE]
           , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
             }
           )
result2 <- DF[ 1 == err2, ]
result2
and then you would have the option to re-use the "n" index to look at
other columns as well.
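As a sketch of what re-using the "n" index might look like, suppose the data had a second column that should also be constant per person. The city column, the DF2 data frame and the names err_multi and result_multi are all invented for this illustration.

```r
# Hypothetical data with an extra column (city) that should also be constant
DF2 <- data.frame(
  first = c("Alex", "Alex", "Bob", "Bob", "Cory", "Cory"),
  last  = c("West", "Joseph", "John", "John", "Jack", "Jack"),
  city  = c("Ames", "Ames", "Reno", "Reno", "Troy", "Utica"),
  stringsAsFactors = FALSE
)

# For each first name, count distinct (last, city) combinations
err_multi <- ave(seq_along(DF2$first), DF2$first,
                 FUN = function(n) {
                   nrow(unique(DF2[n, c("last", "city")]))
                 })

# Keep only people with exactly one combination
result_multi <- DF2[err_multi == 1, ]
result_multi
```

Here Alex is dropped for having two last names and Cory for having two cities, so only Bob's rows remain.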
Finally, here is a dplyr solution:
library(dplyr)
result3 <- ( DF
  %>% group_by( first ) # like a prep for ave or by
  %>% mutate( err = length( unique( last ) ) ) # similar to ave
  %>% filter( 1 == err ) # drop the rows with too many last names
  %>% select( -err ) # drop the temporary column
  %>% as.data.frame # convert back to a plain-jane data frame
)
result3
which uses a small set of verbs in a pipeline of functions to go from
input to result in one pass.
If your data set is really big (running out of memory big) then you
might want to investigate the data.table or sqlite packages, either of
which can be combined with dplyr to get a standardized syntax for
managing larger amounts of data. However, most people actually aren't
running out of memory so in most cases the extra horsepower isn't
actually needed.
On Sun, 12 Feb 2017, P Tennant wrote:
Hi Val,
The by() function could be used here. With the dataframe dfr:
# split the data by first name and check for more than one last name
# for each first name
res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
# make the result more easily manipulated
res <- as.table(res)
res
# first
#  Alex   Bob  Cory
#  TRUE FALSE FALSE
# then use this result to subset the data
nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
# sort if needed
nw.dfr[order(nw.dfr$first) , ]
  first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John
3  Cory    1 Jack
4  Cory    2 Jack
Philip
On 12/02/2017 4:02 PM, Val wrote:
Hi all,
I have a big data set and want to remove rows conditionally. In my data
file each person was recorded for several weeks. Somehow during the
recording periods, their last name was misreported. For each person,
the last name should be the same; otherwise the person should be
removed from the data. For example, in the following data set, Alex was
found to have two last names.
Alex West
Alex Joseph
Alex should be removed from the data. If this happens then I want to
remove all rows with Alex. Here is my data set
df<- read.table(header=TRUE, text='first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West ')
Desired output
first week last
1 Bob 1 John
2 Bob 2 John
3 Bob 3 John
4 Cory 1 Jack
5 Cory 2 Jack
Thank you in advance
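For reference, a minimal base R sketch that reproduces the desired output above, including the sort by first name and the renumbered rows. It follows the ave() approach from the reply; the names n_last and out are made up for this illustration.

```r
df <- read.table(header = TRUE, as.is = TRUE, text = 'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West')

# Number of distinct last names per first name, one value per row
n_last <- ave(seq_along(df$first), df$first,
              FUN = function(i) length(unique(df$last[i])))

# Keep people with a single last name, sort, and renumber the rows
out <- df[n_last == 1, ]
out <- out[order(out$first, out$week), ]
rownames(out) <- NULL
out
```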
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------