Jay,

I really appreciate all your help.

I posted to Nabble an R file and input CSV files that more accurately demonstrate what I am seeing and the output I want when I difference two data frames:

http://n2.nabble.com/Support-SetDiff-Discussion-Items...-td2999739.html

It may be that setdiff(), in both base R and "prob", was never intended to provide the type of result I want. If that is the case, then I will need to ask the "Ninjas" for help to produce the outcome I seek. That is, when I difference the data in RSetDiffEntry.csv and RSetDuplicatesRemoved.csv, I want to get the result shown in RDesired.csv.

Note that it would not be enough to just remove duplicate "CostPerSquareFoot" values, since that variable is tied to "EntryDate" and "HouseNumber".

Any further help and insights are much appreciated.

Thanks again,
Jason

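One possible reading of the goal above is "find the rows that were entered more than once, comparing whole rows rather than a single column". The sketch below assumes that reading. The CSV contents are not shown in this thread, so the data frame is invented for illustration; only the three column names come from the message.

```r
# Invented example data; the real CSV contents are not shown in this thread.
entry <- data.frame(
  EntryDate         = c("2009-05-01", "2009-05-02", "2009-05-02", "2009-05-03"),
  HouseNumber       = c(101L, 102L, 102L, 103L),
  CostPerSquareFoot = c(95.5, 95.5, 95.5, 120.0),
  stringsAsFactors  = FALSE
)

# duplicated() on a data frame compares whole rows, so two different houses
# that happen to share a CostPerSquareFoot value are not flagged.
is_dup <- duplicated(entry)

dups_removed <- entry[!is_dup, , drop = FALSE]  # first occurrence of each row
extra_rows   <- entry[is_dup, , drop = FALSE]   # the repeated entries only

print(extra_rows)
```

Here houses 101 and 102 share the same cost value but differ in date and house number, so only the repeated entry for house 102 ends up in extra_rows.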
--- On Fri, 5/29/09, G. Jay Kerns <gkerns at ysu.edu> wrote:
From: G. Jay Kerns <gkerns at ysu.edu>
Subject: setdiff bizarre (was: odd behavior out of setdiff)
To: r-devel at r-project.org
Cc: dwinsemius at comcast.net, jasonkrupert at yahoo.com
Date: Friday, May 29, 2009, 11:35 PM

Dear R-devel,

Please see the recent thread on R-help, "Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified" posted by Jason Rupert. I gave an answer, then read David Winsemius' answer, and then did some follow-up investigation.

I would like to change my answer. My current version of setdiff() is acting in a way that I do not understand, and a way that I suspect has changed. Consider the following, derived from Jason's OP:

The base package setdiff(), atomic vectors:

    x <- 1:100
    y <- c(x, x)

    setdiff(x, y)  # integer(0)
    setdiff(y, x)  # integer(0)

    z <- 1:25

    setdiff(x, z)  # 26:100
    setdiff(z, x)  # integer(0)

Everything is fine.

Now look at base package setdiff(), data frames:

    ################################
    A <- data.frame(x = 1:100)
    B <- rbind(A, A)

    setdiff(A, B)  # df 1:100
    setdiff(B, A)  # df 1:100

    C <- data.frame(x = 1:25)

    setdiff(A, C)  # df 1:100
    setdiff(C, A)  # df 1:25
    ############################

I have read ?setdiff 37 times now, and I cannot divine any interpretation that matches the above output. From the source, it appears that

    match(x, y, 0L) == 0L

is evaluating to TRUE, of length equal to the columns of x, and then

    x[match(x, y, 0L) == 0L]

is returning the entire data frame.

Compare with the output from package "prob", which uses a setdiff that operates row-wise:

    ###########################
    library(prob)
    A <- data.frame(x = 1:100)
    B <- rbind(A, A)

    setdiff(A, B)  # integer(0)
    setdiff(B, A)  # integer(0)

    C <- data.frame(x = 1:25)

    setdiff(A, C)  # 26:100
    setdiff(C, A)  # integer(0)

IMHO, the entire notion of "set" and "element" is problematic in the df case, so I am not advocating the adoption of the prob:::setdiff approach; rather, setdiff is behaving in a way that I cannot believe with my own eyes, and I would like to alert those who can speak as to why this may be happening.

Thanks to Jason for bringing this up, and to David for catching the discrepancy.

Session info is below. I use the binaries prepared by the Debian group so I do not have the latest patched-revision-4440986745343b. This must have been related to something which has been fixed since April 17, and in that case, please disregard my message.

Yours truly,
Jay

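For readers who want to see what a row-wise set difference looks like without loading "prob", here is a minimal base R sketch. row_setdiff() is a hypothetical helper written for this illustration; it is not the code used by prob:::setdiff or by base setdiff(), and it assumes both data frames have the same column names and that the character "\r" never occurs in the data.

```r
# Hypothetical helper (not from base R or "prob"):
# return the distinct rows of x that do not occur anywhere in y.
row_setdiff <- function(x, y) {
  stopifnot(identical(names(x), names(y)))
  # Collapse each row into a single key string, one key per row.
  # Assumption: "\r" does not appear in any cell value.
  key_x <- do.call(paste, c(unname(as.list(x)), sep = "\r"))
  key_y <- do.call(paste, c(unname(as.list(y)), sep = "\r"))
  # Keep rows whose key is absent from y, dropping repeats within x.
  keep <- !(key_x %in% key_y) & !duplicated(key_x)
  x[keep, , drop = FALSE]
}

A <- data.frame(x = 1:100)
B <- rbind(A, A)
C <- data.frame(x = 1:25)

nrow(row_setdiff(A, B))   # 0 rows
nrow(row_setdiff(B, A))   # 0 rows
row_setdiff(A, C)$x       # 26:100
nrow(row_setdiff(C, A))   # 0 rows
```

Because rows are compared as pasted strings, this treats values that print identically as equal, which may or may not be what is wanted for floating-point columns.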
sessionInfo()
R version 2.9.0 (2009-04-17)
x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] prob_0.9-1

--
***************************************************
G. Jay Kerns, Ph.D.
Associate Professor
Department of Mathematics & Statistics
Youngstown State University
Youngstown, OH 44555-0002 USA
Office: 1035 Cushwa Hall
Phone: (330) 941-3310 Office (voice mail)
-3302 Department
-3170 FAX
E-mail: gkerns at ysu.edu
http://www.cc.ysu.edu/~gjkerns/