
Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified

3 messages · Jason Rupert, David Winsemius, G. Jay Kerns

#
Jay, 


Thanks much for the reply. I think you are right about the prob. Unfortunately, I was not able to find the old emails I had discussing the more powerful setdiff, which builds on the base R setdiff but extends it to work with data.frames instead of just a simple vector of values. Love this functionality.

However, for the following example, 
Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"))
Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"), Price = c("Low"))
Test2_DF<-rbind(Test1_DF, Test1_DF)
setdiff(Test1_DF, Test2_DF)
[1] HouseSize    LandLocation Price       
<0 rows> (or 0-length row.names)
[1] HouseSize    LandLocation Price       
<0 rows> (or 0-length row.names)

I was hoping that, for this example, one of the setdiff calls would have returned essentially Test1_DF, since it is duplicated and that is what is different between the two data.frames.

So, I guess I am trying to figure out a way to truly diff the data.frames, i.e. determine when two data.frames are different from one another and then receive the differences as output.

Does this capability exist in a function within a current R package or does it exist within a typically used pattern to create this functionality?  
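
For readers with the same question, one possible pattern is a duplicate-aware ("multiset") row difference, in which each row of the first data.frame is cancelled by at most one matching row of the second, so extra copies survive. The following is only a minimal sketch in base R; the helper names row_key and multiset_diff_rows are hypothetical (they come from no package), and the sketch assumes both data.frames have the same columns in the same order.

```r
# Hypothetical helper: one string key per row, built by pasting the columns.
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

# Hypothetical helper: rows of x left over after each row of y has
# cancelled at most one identical row of x.
multiset_diff_rows <- function(x, y) {
  kx <- row_key(x)
  ky <- row_key(y)
  # Number the copies of each distinct row: 1st copy, 2nd copy, ...
  occ_x <- ave(seq_along(kx), kx, FUN = seq_along)
  occ_y <- ave(seq_along(ky), ky, FUN = seq_along)
  # The n-th copy in x is dropped only if y also has an n-th copy.
  keep <- !(paste(kx, occ_x, sep = "\r") %in% paste(ky, occ_y, sep = "\r"))
  x[keep, , drop = FALSE]
}

Test1_DF <- data.frame(HouseSize = 1:100, LandLocation = "Here", Price = "Low")
Test2_DF <- rbind(Test1_DF, Test1_DF)

extra <- multiset_diff_rows(Test2_DF, Test1_DF)  # the 100 duplicated rows
none  <- multiset_diff_rows(Test1_DF, Test2_DF)  # zero rows
```

Because rows are compared through pasted strings, values that differ only in ways lost by conversion to character would be treated as equal; that is a limitation of this sketch.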

Thanks again for any feedback you can provide. 
 

Also, I tried to determine my Session Info and the packages I have loaded, but I received the following:
Error in x$Priority : $ operator is invalid for atomic vectors
In addition: There were 12 warnings (use warnings() to see them)
Warning messages:
1: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'prob' is missing or broken
2: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'ggplot2' is missing or broken
3: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'reshape' is missing or broken
4: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'RColorBrewer' is missing or broken
5: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'proto' is missing or broken
6: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'plyr' is missing or broken
7: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'nortest' is missing or broken
8: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'fBasics' is missing or broken
9: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'timeSeries' is missing or broken
10: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'timeDate' is missing or broken
11: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'vcd' is missing or broken
12: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
  DESCRIPTION file of package 'colorspace' is missing or broken


However, I typically load the following ones:
library(colorspace, lib.loc=RLibraryPathLocation)
library(vcd, lib.loc=RLibraryPathLocation)
library(timeDate, lib.loc=RLibraryPathLocation)
library(timeSeries, lib.loc=RLibraryPathLocation)
library(fBasics, lib.loc=RLibraryPathLocation)
library(nortest, lib.loc=RLibraryPathLocation)
library(plyr, lib.loc=RLibraryPathLocation)
library(proto, lib.loc=RLibraryPathLocation)
library(RColorBrewer, lib.loc=RLibraryPathLocation)
library(reshape, lib.loc=RLibraryPathLocation)
library(ggplot2, lib.loc=RLibraryPathLocation)
library(prob, lib.loc=RLibraryPathLocation)
--- On Fri, 5/29/09, G. Jay Kerns <gkerns at ysu.edu> wrote:

            
#
But I get:

# omitted the initial line, which would have created an object only to be overwritten.

 > Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"), Price = c("Low"))
 > Test2_DF<-rbind(Test1_DF, Test1_DF)
 > setdiff(Test1_DF, Test2_DF)
     HouseSize LandLocation Price
1           1         Here   Low
2           2         Here   Low
3           3         Here   Low
4           4         Here   Low
5           5         Here   Low
.... snipped additional 95 rows.

Furthermore, I did not load any library (nor did you indicate what
packages you have loaded), and there does not seem to be a
setdiff.data.frame in my workspace:
 > setdiff.data.frame
Error: object "setdiff.data.frame" not found
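
When two sessions disagree like this, a quick check is to ask R which setdiff a bare call would use. A minimal sketch using only base/utils functions; what it prints depends on the packages attached in the session, so no package-specific output is shown here.

```r
# Attached packages that define an object named "setdiff", in search-path order.
where_defined <- find("setdiff")
print(where_defined)

# Name of the environment of the function that a bare setdiff() call resolves to.
print(environmentName(environment(setdiff)))

# Calling the base version explicitly sidesteps any masking by attached packages.
base_result <- base::setdiff(c(1, 2, 3), c(2, 3))
print(base_result)  # [1] 1
```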

 > sessionInfo()
R version 2.8.1 Patched (2009-01-19 r47650)
i386-apple-darwin9.6.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] MASS_7.2-46       reshape_0.8.2     plyr_0.1.5        modeltools_0.2-16 mvtnorm_0.9-4
[6] survival_2.35-4

loaded via a namespace (and not attached):
[1] coin_1.0-1
On May 29, 2009, at 5:58 PM, Jason Rupert wrote:

            
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
#
Jason,
On Fri, May 29, 2009 at 5:58 PM, Jason Rupert <jasonkrupert at yahoo.com> wrote:
Your previous post is here

[1]  http://tolstoy.newcastle.edu.au/R/e6/help/09/03/7781.html

and my earlier post is here:

[2]  https://stat.ethz.ch/pipermail/r-devel/2007-December/047706.html

(please note that the link in [1] referring to [2] is now broken).
As mentioned in [2], the notions of "set" and "element" are ambiguous
in the data frame case... what is an element...? a row, a column, or a
single entry?
Your question speaks to the ambiguity above.  For instance, your 2nd
example would be solved by a setdiff for data frames that operates
column-wise.  If that is all you want, then IIRC there are at least 3
independent solutions in [2] to the row-wise problem.  It should be
easy enough to tweak one of them to operate on columns instead.
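
To make the row/column ambiguity concrete, here is a minimal sketch of both readings in base R. The function names setdiff_rows and setdiff_cols are hypothetical and are not the solutions posted in [2]; the row-wise version assumes both data frames have the same columns in the same order.

```r
# Hypothetical helper: one string key per row, built by pasting the columns.
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

# Row-wise reading: distinct rows of x that do not occur anywhere in y.
setdiff_rows <- function(x, y) {
  keep <- !(row_key(x) %in% row_key(y))
  unique(x[keep, , drop = FALSE])
}

# Column-wise reading: columns of x that have no identical column in y.
setdiff_cols <- function(x, y) {
  has_match <- vapply(x, function(col) {
    any(vapply(y, function(other) identical(col, other), logical(1)))
  }, logical(1))
  x[, !has_match, drop = FALSE]
}

Test1_DF <- data.frame(HouseSize = 1:100, LandLocation = "Here", Price = "Low")
Test2_DF <- rbind(Test1_DF, Test1_DF)

by_rows <- setdiff_rows(Test1_DF, Test2_DF)  # 0 rows: every row also occurs in Test2_DF
by_cols <- setdiff_cols(Test1_DF, Test2_DF)  # all 3 columns: lengths 100 vs 200 never match
```

Under the column-wise reading every column of Test1_DF differs from every column of Test2_DF (100 values versus 200), so the whole of Test1_DF comes back; under the row-wise reading nothing does.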

For an efficient setdiff() for data frames that can decipher
on-the-fly which of row/column/entry is desired, I am going to have to
defer to the aforementioned Ninjas.  :-)


Hope this helps,
Jay