
Deduping in R by multiple variables

3 messages · ramoss, William Dunlap

#
I have a dataset with 184K observations and 16 variables.  In SAS, PROC SORT
NODUPKEY dedupes it in seconds by 11 variables.
I tried to do the same thing in R using both the unique and then the
!duplicated functions, but it just hangs there and I get no output.  Does
anyone know how to solve this?

This is how I tried to do it in R:


detail3 <-
[!duplicated(c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
                             detail2$BEGTIME,
detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
                             detail2$ACCTYP
,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
                             detail2$STKFUL)),]

detail3 <-
unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
          detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
          detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
          detail2$STKFUL)])




Thanks in advance



--
View this message in context: http://r.789695.n4.nabble.com/Deduping-in-R-by-multiple-variables-tp4641778.html
Sent from the R help mailing list archive at Nabble.com.
#
You can find out which rows of a data.frame called dataFrame
are duplicates of previous rows with
   dups <- duplicated(dataFrame)
To make a new data.frame without them do
   duplessDataFrame <- dataFrame[!dups, ]
You could use unique(dataFrame), but, as in your examples, I
think one often wants to remove duplicates based on only
some of the columns.  E.g., with the following data.frame
   dataFrame <- data.frame(Name=LETTERS[1:9],
                           One=rep(1:3,3),
                           Two=c(11,12,13,11,11,12,12,13,13),
                           Three=c(101,102,103,101,101,103,101,102,103))
we get
  > dataFrame
    Name One Two Three
  1    A   1  11   101
  2    B   2  12   102
  3    C   3  13   103
  4    D   1  11   101
  5    E   2  11   101
  6    F   3  12   103
  7    G   1  12   101
  8    H   2  13   102
  9    I   3  13   103
  > duplicated(dataFrame)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  > dups123 <- duplicated(dataFrame[,c("One","Two","Three")])
  > dups123
  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
  > dataFrame[!dups123, ]
    Name One Two Three
  1    A   1  11   101
  2    B   2  12   102
  3    C   3  13   103
  5    E   2  11   101
  6    F   3  12   103
  7    G   1  12   101
  8    H   2  13   102
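
As a small extension of the example above (not part of the original reply):
unique() applied to the column subset returns only those columns, while
duplicated() lets you keep every column of the surviving rows. duplicated()
also takes fromLast = TRUE if you would rather keep the last occurrence of
each key than the first. A sketch using the same example data; the names
keys, uniqKeys, keepFirst and keepLast are only illustrative:

```r
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))
keys <- c("One", "Two", "Three")

# unique() on the subset drops the Name column entirely
uniqKeys <- unique(dataFrame[, keys])

# duplicated() on the subset keeps all columns of the surviving rows
keepFirst <- dataFrame[!duplicated(dataFrame[, keys]), ]

# fromLast = TRUE flags the earlier copies, so the last occurrence survives
keepLast <- dataFrame[!duplicated(dataFrame[, keys], fromLast = TRUE), ]

print(keepFirst$Name)
print(keepLast$Name)
```

Here keepFirst retains rows A and C, while keepLast retains their later
copies D and I instead.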

Your first expression
   detail3 <- [!duplicated(...)]
must have caused a syntax error, as "[" is the subscript operator
and requires something before it, as in detail2[...].
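
Even with a data.frame in front of the bracket, that first attempt would
probably not behave as intended, because c() joins the columns end to end
into one long vector instead of treating them as one key per row. A minimal
sketch with the small example data (the variable names flat, flatDups and
rowDups are made up for illustration):

```r
dataFrame <- data.frame(One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

# c() concatenates: 3 columns of 9 values become one vector of 27 values
flat <- c(dataFrame$One, dataFrame$Two, dataFrame$Three)
print(length(flat))

# duplicated() then compares individual values, not rows,
# and returns one flag per value rather than one per row
flatDups <- duplicated(flat)
print(length(flatDups))

# Passing the columns as a data.frame gives one flag per row
rowDups <- duplicated(dataFrame[, c("One", "Two", "Three")])
print(length(rowDups))
```

The logical vector from the c() version has the wrong length for indexing
rows, so it cannot be used to pick out the non-duplicated rows.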

To see why your second attempt
   detail3 <-
   unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
           detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
           detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
           detail2$STKFUL)])
will not do what you want (even if it did finish in a reasonable amount of time),
break it into pieces and use the example dataset above.  You asked it to extract
the columns specified by 'tmp' where 'tmp' was constructed by:
  > print(tmp <- c(dataFrame$One, dataFrame$Two, dataFrame$Three))
   [1]   1   2   3   1   2   3   1   2   3  11  12
  [12]  13  11  11  12  12  13  13 101 102 103 101
  [23] 101 103 101 102 103
Then dataFrame[, tmp] is asking it to make a 27-column data.frame from the
columns at those positions, most of which (11, 12, 13, 101, ...) don't exist
in the original 4-column data.frame.
You should have gotten an 'undefined columns selected' error.  Perhaps
it ran out of memory while checking all 184K * 13 columns, though that would
be odd.
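
To check this on the small example, one can wrap the subscript in tryCatch()
and look at the message. This is only a sketch on the toy data; it says
nothing about why the 184K-row case hung. The names msg and byName are
illustrative:

```r
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

# The column *values* end up being used as column *positions*
tmp <- c(dataFrame$One, dataFrame$Two, dataFrame$Three)

# Capture the error message instead of stopping the script
msg <- tryCatch({
    dataFrame[, tmp]
    "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Selecting by column name is what was intended
byName <- dataFrame[, c("One", "Two", "Three")]
print(dim(byName))
```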

Now if you used the calls I mentioned at first (in the working example)
and R hung, there might be ways to speed up the process.
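
One possible approach, offered as a sketch rather than a recommendation: paste
the key columns into a single string per row and call duplicated() on that
character vector. Whether this is actually faster than the data.frame method
depends on the R version and the data, so it is worth timing both. The
data below is a hypothetical stand-in for detail2 with made-up values, and
the approach assumes the separator never occurs in the data and that numeric
keys survive conversion to text without loss.

```r
# Hypothetical stand-in for the poster's data (values are invented)
set.seed(1)
n <- 20000
detail2 <- data.frame(FIRM = sample(letters[1:5], n, replace = TRUE),
                      CM = sample(1:20, n, replace = TRUE),
                      SHARES = sample(c(100, 200, 500), n, replace = TRUE),
                      OTHER = seq_len(n))
keys <- c("FIRM", "CM", "SHARES")

# Collapse each row's key columns into one string, then flag repeats
keyString <- do.call(paste, c(unname(as.list(detail2[keys])), sep = "\r"))
dupsFast <- duplicated(keyString)

# Cross-check against the data.frame method before trusting it
dupsRef <- duplicated(detail2[, keys])
stopifnot(identical(dupsFast, dupsRef))

detail3 <- detail2[!dupsFast, ]

# Compare timings on your own data
print(system.time(duplicated(detail2[, keys])))
print(system.time(duplicated(do.call(paste, c(unname(as.list(detail2[keys])), sep = "\r")))))
```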
      
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Thanks for your help guys. I was referring to the variables the wrong way.
This worked for me:

idx <- !duplicated(detail2[,c("TDATE","FIRM","CM","BRANCH", 
                     "BEGTIME", "ENDTIME","OTYPE","OCOND", 
                     "ACCTYP","OSIDE","SHARES","STOCKS", 
                     "STKFUL")])
detail3 <- detail2[idx,]
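
For anyone wanting something closer to the SAS step, which as far as I
understand it also sorts by the key variables before dropping duplicate keys,
here is a rough sketch. The function name sortNoDupKey and the toy data are
hypothetical, and this is not claimed to match SAS behaviour in every detail:

```r
# Hypothetical helper: sort by the key columns, then keep the first row per key
sortNoDupKey <- function(df, keys) {
    absent <- setdiff(keys, names(df))
    if (length(absent) > 0) {
        stop("unknown key columns: ", paste(absent, collapse = ", "))
    }
    # order() leaves ties in their original order, so the first row
    # seen for each key is the one that survives
    ord <- do.call(order, unname(as.list(df[keys])))
    sorted <- df[ord, , drop = FALSE]
    sorted[!duplicated(sorted[keys]), , drop = FALSE]
}

# Toy data with two distinct keys, each appearing twice
toy <- data.frame(FIRM = c("b", "a", "b", "a"),
                  CM = c(2, 1, 2, 1),
                  ID = 1:4)
res <- sortNoDupKey(toy, c("FIRM", "CM"))
print(res)
```

If the original row order matters more than the sort, the two-line version
above is enough on its own.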




