The "by" function aggregates and returns a result with generally fewer 
rows than the original data. Since you are looking to index the rows in 
the original data set, the "ave" function is better suited because it 
always returns a vector that is just as long as the input vector:
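To see the difference in shape, here is a toy example (the values and groups are made up purely for illustration):

```r
x <- c( 10, 20, 30, 40 )
g <- c( "a", "a", "b", "b" )
by( x, g, FUN = mean )  # one summary value per group (fewer "rows")
ave( x, g, FUN = mean ) # c( 15, 15, 35, 35 ): same length as x
```

Because ave hands back one value per input element, the result lines up with the original rows and can be used directly for indexing.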

# I usually work with character data rather than factors if I plan
# to modify the data (e.g. removing rows)
DF <- read.table( text=
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

err <- ave( DF$last
           , DF[ , "first", drop = FALSE]
           , FUN = function( lst ) {
               length( unique( lst ) )
             }
           )
result <- DF[ "1" == err, ]
result

Notice that the ave function returns a vector of the same type as was 
given to it, so even though the anonymous function returns a numeric 
count, the err vector is character (hence the comparison with "1" 
rather than 1 above).
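A minimal illustration of that coercion, again with toy data rather than your example:

```r
cnt <- ave( c( "x", "y", "x" )
          , c( 1, 1, 2 )
          , FUN = function( v ) length( unique( v ) )
          )
cnt                     # c( "2", "2", "1" ) -- character, not numeric
1 == as.numeric( cnt )  # convert first if you prefer a numeric comparison
```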

If you wanted to be able to examine more than one other column in 
determining the keep/reject decision, you could do:

err2 <- ave( seq_along( DF$first )
            , DF[ , "first", drop = FALSE]
            , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
              }
            )
result2 <- DF[ 1 == err2, ]
result2

and then you would have the option to re-use the "n" index to look at 
other columns as well.
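For example, here is a hypothetical variation that keeps only people who have one last name and no week past 3 (the week cutoff is invented just to show a second column in use):

```r
err2b <- ave( seq_along( DF$first )
             , DF[ , "first", drop = FALSE]
             , FUN = function( n ) {
                 1 == length( unique( DF[ n, "last" ] ) ) &&
                   max( DF[ n, "week" ] ) <= 3
               }
             )
result2b <- DF[ 1 == err2b, ]  # keeps the Bob and Cory rows
```

Since the logical result of the function gets stored into a numeric vector, TRUE becomes 1 and the same `1 == ...` indexing idiom still applies.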

Finally, here is a dplyr solution:

library(dplyr)
result3 <- (   DF
            %>% group_by( first ) # like a prep for ave or by
            %>% mutate( err = length( unique( last ) ) ) # similar to ave
            %>% filter( 1 == err ) # drop the rows with too many last names
            %>% select( -err ) # drop the temporary column
            %>% as.data.frame # convert back to a plain-jane data frame
            )
result3

which uses a small set of verbs in a pipeline of functions to go from 
input to result in one pass.
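As an aside, dplyr also has n_distinct, which would let you collapse the mutate/filter pair into one step if you don't need the temporary column (same result, just more compact):

```r
library(dplyr)
result4 <- (   DF
            %>% group_by( first )
            %>% filter( 1 == n_distinct( last ) ) # n_distinct == length(unique(...))
            %>% as.data.frame
            )
result4
```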

If your data set is really big (running-out-of-memory big) then you might 
want to investigate the data.table package or an SQLite database (e.g. via 
the RSQLite package), either of which can be combined with dplyr to get a 
standardized syntax for managing larger amounts of data. However, most 
people actually aren't running out of memory, so in most cases the extra 
horsepower isn't needed.
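If you did go the data.table route, the equivalent might look like this (a sketch; uniqueN is data.table's optimized version of length(unique(...)), and .SD is the subset of data for each group):

```r
library(data.table)
DT <- as.data.table( DF )
result5 <- DT[ , if ( 1 == uniqueN( last ) ) .SD, by = first ]
result5
```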
On Sun, 12 Feb 2017, P Tennant wrote:

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k