I have a dataset w/ 184K obs & 16 variables. In SAS I proc sort nodupkey it
in seconds by 11 variables.
I tried to do the same thing in R using both the unique & then the
!duplicated functions but it just hangs there & I get no output. Does
anyone know how to solve this?
This is how I tried to do it in R:
detail3 <-
[!duplicated(c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
detail2$BEGTIME,detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
detail2$ACCTYP,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
detail2$STKFUL)),]
detail3 <-
unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
detail2$STKFUL)])
Thanks in advance
--
View this message in context: http://r.789695.n4.nabble.com/Deduping-in-R-by-multiple-variables-tp4641778.html
Sent from the R help mailing list archive at Nabble.com.
Deduping in R by multiple variables
3 messages · ramoss, William Dunlap
You can find out which rows of a data.frame called dataFrame
are duplicates of previous rows with
dups <- duplicated(dataFrame)
To make a new data.frame without them do
duplessDataFrame <- dataFrame[!dups, ]
You could use unique(dataFrame), but, as in your examples, I
think one often wants to remove duplicates based on only
some of the columns. E.g., with the following data.frame
dataFrame <- data.frame(Name=LETTERS[1:9],
                        One=rep(1:3,3),
                        Two=c(11,12,13,11,11,12,12,13,13),
                        Three=c(101,102,103,101,101,103,101,102,103))
we get
> dataFrame
  Name One Two Three
1    A   1  11   101
2    B   2  12   102
3    C   3  13   103
4    D   1  11   101
5    E   2  11   101
6    F   3  12   103
7    G   1  12   101
8    H   2  13   102
9    I   3  13   103
> duplicated(dataFrame)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> dups123 <- duplicated(dataFrame[,c("One","Two","Three")])
> dups123
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
> dataFrame[!dups123, ]
  Name One Two Three
1    A   1  11   101
2    B   2  12   102
3    C   3  13   103
5    E   2  11   101
6    F   3  12   103
7    G   1  12   101
8    H   2  13   102
Your first expression
detail3 <- [!duplicated(...)]
must have caused a syntax error, as "[" is the subscript operator
and requires something before it, as in detail2[...].
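A small check on the example dataFrame from above (rebuilt here so the snippet runs on its own) suggests why the first attempt would not give the right answer even with detail2 in front of the bracket: c() joins the columns end to end, so duplicated() returns one value per element instead of one per row. The names flat and byRow are made up for this illustration.

```r
# Example data from earlier in the thread
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

# c() concatenates the three columns into one long vector (3 * 9 values),
# so duplicated() compares single values, not rows
flat <- duplicated(c(dataFrame$One, dataFrame$Two, dataFrame$Three))
length(flat)

# Passing the columns as a data.frame compares whole rows instead,
# giving one logical value per row
byRow <- duplicated(dataFrame[, c("One", "Two", "Three")])
length(byRow)
```

A logical vector that is longer than the number of rows cannot serve as a sensible row index for the data.frame.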
To see why your second attempt
detail3 <-
unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
detail2$STKFUL)])
will not do what you want (even if it did finish in a reasonable amount of time),
break it into pieces and use the example dataset above. You asked it to extract
the columns specified by 'tmp', where 'tmp' was constructed by:
> print(tmp <- c(dataFrame$One, dataFrame$Two, dataFrame$Three))
 [1]   1   2   3   1   2   3   1   2   3  11  12
[12]  13  11  11  12  12  13  13 101 102 103 101
[23] 101 103 101 102 103
Then dataFrame[, tmp] is asking it to make a 27-column data.frame from the
columns at those positions (most of which don't exist in the original
4-column data.frame). You should have gotten an 'undefined columns selected'
error. Perhaps it ran out of memory while checking all 184K * 13 columns.
That would be odd.
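The failure can be reproduced on the small example without stopping the session by wrapping the subscript in tryCatch(). This is only a sketch on the toy data; the variable msg and the "no error" marker are made up for the illustration.

```r
# Example data from earlier in the thread
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

# The column *values* end up being used as column *positions*
tmp <- c(dataFrame$One, dataFrame$Two, dataFrame$Three)

# Capture the error message instead of halting; "no error" is returned
# only if the subscript unexpectedly succeeds
msg <- tryCatch({
  dataFrame[, tmp]
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)
```

On the toy data this should report the error described above, since positions such as 11 or 101 are far beyond the 4 columns that exist.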
Now if you used the calls I mentioned at first (in the working example)
and R hung, there might be ways to speed up the process.
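One possible approach, offered only as a sketch: collapse the key columns of each row into a single string and run duplicated() on that character vector. Whether this is actually faster than duplicated() on the data.frame depends on the R version and the data, so it is worth timing both. The names keys and rowKey are made up here, and the "\r" separator is an assumption that the character never occurs in the data.

```r
# Example data from earlier in the thread
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

keys <- c("One", "Two", "Three")

# Paste the key columns together row by row; unname() keeps a column that
# happens to be called "sep" or "collapse" from being taken as an argument
rowKey <- do.call(paste, c(unname(as.list(dataFrame[keys])), sep = "\r"))

# duplicated() on one character vector flags repeated keys
dupsByKey <- duplicated(rowKey)
duplessDataFrame <- dataFrame[!dupsByKey, ]
print(duplessDataFrame)
```

Note that paste() converts numbers to text, so numeric keys that differ only beyond the printed precision would be treated as equal.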
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
Of ramoss
Sent: Wednesday, August 29, 2012 1:58 PM
To: r-help at r-project.org
Subject: [R] Deduping in R by multiple variables
Thanks for your help guys. I was referring to the variables the wrong way.
This worked for me:
idx <- !duplicated(detail2[,c("TDATE","FIRM","CM","BRANCH",
"BEGTIME", "ENDTIME","OTYPE","OCOND",
"ACCTYP","OSIDE","SHARES","STOCKS",
"STKFUL")])
detail3 <- detail2[idx,]
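For a result that is also sorted by the key variables, as PROC SORT would leave it, one option is to order the rows first and then drop the duplicates. This is a sketch on the small example data.frame from earlier in the thread, not on detail2; the names keys, sorted and deduped are made up for the illustration.

```r
# Example data from earlier in the thread
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

keys <- c("One", "Two", "Three")

# Sort the rows by the key columns; rows with equal keys keep their
# original relative order
sorted <- dataFrame[do.call(order, unname(as.list(dataFrame[keys]))), ]

# Keep the first row for each distinct combination of keys
deduped <- sorted[!duplicated(sorted[, keys]), ]
print(deduped)
```

The same two steps should carry over to detail2 by putting the 13 column names into keys.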