Message-ID: <378129.41105.qm@web56003.mail.re3.yahoo.com>
Date: 2009-05-29T21:58:50Z
From: Jason Rupert
Subject: Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified
Jay,
Thanks much for the reply. I think you are right that I am using the prob package. Unfortunately, I was not able to find the old emails discussing this more powerful setdiff, which builds on the base R setdiff functionality but extends it to work with data.frames instead of just simple vectors of values. I love this functionality.
However, for the following example,
> Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"))
> Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"), Price = c("Low"))
> Test2_DF<-rbind(Test1_DF, Test1_DF)
> setdiff(Test1_DF, Test2_DF)
[1] HouseSize LandLocation Price
<0 rows> (or 0-length row.names)
> setdiff(Test2_DF, Test1_DF)
[1] HouseSize LandLocation Price
<0 rows> (or 0-length row.names)
I was hoping that, for this example, one of the setdiff calls would have returned essentially Test1_DF, since it is duplicated and that is the difference between the two data.frames.
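For what it is worth, the repeated rows themselves can be pulled out with base R alone. The following is only a minimal sketch, assuming the data.frame setdiff follows the same set semantics as the vector version (repeats are ignored), which is why both calls come back empty. It uses duplicated() and unique() from base R on the same example data:

```r
# Same example data as above
Test1_DF <- data.frame(HouseSize = 1:100, LandLocation = "Here", Price = "Low")
Test2_DF <- rbind(Test1_DF, Test1_DF)

# Viewed as a set of rows, Test2_DF contains nothing that Test1_DF lacks
nrow(unique(Test2_DF))   # 100

# duplicated() flags every row that repeats an earlier row
extra <- Test2_DF[duplicated(Test2_DF), ]
nrow(extra)              # 100
```

Here `extra` holds the second copy of each row, which is essentially Test1_DF again.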
So, I guess I am trying to figure out a way to truly diff the data.frames, i.e. determine when two data.frames differ from one another and then receive the differing rows as output.
Does this capability exist as a function in a current R package, or is there a commonly used pattern for building it?
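To make the request concrete, here is a rough sketch of the kind of "multiset" row difference I have in mind, written in base R. The function name multiset_rowdiff is hypothetical (it is not in base R or, as far as I know, in the prob package), and it assumes both data.frames have the same columns and that the separator character never occurs in the data:

```r
# Hypothetical helper: rows of `a` that are not matched one-for-one by rows of `b`.
# Unlike setdiff(), repeated rows are counted, so extra copies are reported.
multiset_rowdiff <- function(a, b) {
  stopifnot(identical(names(a), names(b)))
  # One string key per row (assumes "\r" does not appear in the data)
  key_a <- do.call(paste, c(unname(as.list(a)), sep = "\r"))
  key_b <- do.call(paste, c(unname(as.list(b)), sep = "\r"))
  # Number each copy of a row within its own data.frame: 1st, 2nd, ...
  occ_a <- ave(seq_along(key_a), key_a, FUN = seq_along)
  occ_b <- ave(seq_along(key_b), key_b, FUN = seq_along)
  # The k-th copy of a row in `a` is matched only if `b` also has a k-th copy
  keep <- !(paste(key_a, occ_a, sep = "\r") %in% paste(key_b, occ_b, sep = "\r"))
  a[keep, , drop = FALSE]
}

Test1_DF <- data.frame(HouseSize = 1:100, LandLocation = "Here", Price = "Low")
Test2_DF <- rbind(Test1_DF, Test1_DF)

surplus <- multiset_rowdiff(Test2_DF, Test1_DF)  # the extra copies in Test2_DF
missing <- multiset_rowdiff(Test1_DF, Test2_DF)  # nothing is missing from Test2_DF
```

With this, `surplus` would contain the 100 extra rows and `missing` would be empty, which is the asymmetric answer I was hoping to get from setdiff.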
Thanks again for any feedback you can provide.
Also, I tried to determine my Session Info and the packages I have loaded, but I received the following:
> sessionInfo()
Error in x$Priority : $ operator is invalid for atomic vectors
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'prob' is missing or broken
2: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'ggplot2' is missing or broken
3: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'reshape' is missing or broken
4: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'RColorBrewer' is missing or broken
5: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'proto' is missing or broken
6: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'plyr' is missing or broken
7: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'nortest' is missing or broken
8: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'fBasics' is missing or broken
9: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'timeSeries' is missing or broken
10: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'timeDate' is missing or broken
11: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'vcd' is missing or broken
12: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'colorspace' is missing or broken
However, I typically load the following ones:
library(colorspace, lib.loc=RLibraryPathLocation)
library(vcd, lib.loc=RLibraryPathLocation)
library(timeDate, lib.loc=RLibraryPathLocation)
library(timeSeries, lib.loc=RLibraryPathLocation)
library(fBasics, lib.loc=RLibraryPathLocation)
library(nortest, lib.loc=RLibraryPathLocation)
library(plyr, lib.loc=RLibraryPathLocation)
library(proto, lib.loc=RLibraryPathLocation)
library(RColorBrewer, lib.loc=RLibraryPathLocation)
library(reshape, lib.loc=RLibraryPathLocation)
library(ggplot2, lib.loc=RLibraryPathLocation)
library(prob, lib.loc=RLibraryPathLocation)
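A possible way to narrow down the sessionInfo() warnings is to check whether R can actually locate a DESCRIPTION file for each package. This is only a sketch: has_description is a hypothetical helper, and the guess that the library passed through lib.loc is simply not on the default search path is an assumption, not something confirmed here.

```r
# Hypothetical helper: for each package name, report whether a DESCRIPTION
# file can be found in the library the package resolves to.
has_description <- function(pkgs, lib.loc = NULL) {
  vapply(pkgs, function(p) {
    path <- tryCatch(find.package(p, lib.loc = lib.loc),
                     error = function(e) NA_character_)
    !is.na(path[1]) && file.exists(file.path(path[1], "DESCRIPTION"))
  }, logical(1))
}

# Packages shipped with R should all report TRUE
has_description(c("stats", "utils"))

# Library trees that R searches by default; a library that is only passed
# through lib.loc= in library() calls will not be listed here
.libPaths()
```

Calling has_description() with the package names above, once with lib.loc left at its default and once with lib.loc set to the custom library location, should show whether the two search paths disagree.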
--- On Fri, 5/29/09, G. Jay Kerns <gkerns at ysu.edu> wrote:
> From: G. Jay Kerns <gkerns at ysu.edu>
> Subject: Re: [R] Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified
> To: "Jason Rupert" <jasonkrupert at yahoo.com>
> Cc: R-help at r-project.org
> Date: Friday, May 29, 2009, 3:21 PM
> Dear Jason,
>
> On Fri, May 29, 2009 at 2:48 PM, Jason Rupert <jasonkrupert at yahoo.com>
> wrote:
> >
> > I think I am using the improved version of setdiff(...) that handles
> > data.frames, so I think some odd behavior was expected but this one is
> > escaping me.
> >
> > It appears that the addition of duplicate entries is not caught by
> > setdiff(...). Is this expected behavior?
>
> [snip]
>
> > Thanks in advance for any feedback.
> >
> > Test1_DF<-data.frame(HouseSize=c(1:100))
> > Test2_DF<-rbind(Test1_DF, Test1_DF)
> > setdiff(Test1_DF, Test2_DF)
> > integer(0)
> > setdiff(Test2_DF, Test1_DF)
> > integer(0)
> >
> > However,
> > Test3_DF<-data.frame(HouseSize=c(1:25))
> > setdiff(Test1_DF, Test3_DF)
> >  [1]  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41
> > [17]  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
> > [33]  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
> > [49]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
> > [65]  90  91  92  93  94  95  96  97  98  99 100
> >
> > setdiff(Test3_DF, Test1_DF)
> > integer(0)
>
>
> You didn't explicitly say which "improved version" of setdiff() that
> you are using, so I can only presume that you are using the
> setdiff.data.frame in the prob package.
>
> The behaviour you are observing is expected and matches the
> base:::setdiff behaviour in the case of vectors; cf.
>
> x1 <- c(1:100)
> x2 <- c(x1,x1)
>
> setdiff(x1, x2)  # integer(0)
> setdiff(x2, x1)  # integer(0)
>
> x3 <- c(1:25)
> setdiff(x1, x3)  # 26:100
> setdiff(x3, x1)  # integer(0)
>
>
> >
> > If so, is there another method or approach that should be used to
> > identify duplicate row entries between two different data frames?
> >
>
> The R-help archives are chock full of every possible variant of
> questions (and answers) about this, and you haven't said _exactly_
> what you are looking for. In the absence of an already posted
> solution, please specify exactly what you want and I'll wager an R
> Ninja could dispatch it in moments.
>
> Regards,
> Jay
>
> ***************************************************
> G. Jay Kerns, Ph.D.
> Associate Professor
> Department of Mathematics & Statistics
> Youngstown State University
> Youngstown, OH 44555-0002 USA
> Office: 1035 Cushwa Hall
> Phone: (330) 941-3310 Office (voice mail)
> -3302 Department
> -3170 FAX
> E-mail: gkerns at ysu.edu
> http://www.cc.ysu.edu/~gjkerns/
>