identify duplicate from more than one column
Hi Carlos,
Here is one option:
## read in your data
dat <- read.table(textConnection("
obs unit home z sex age
1 015029 18 1 1 053
2 015029 18 1 2 049
3 015029 01 1 1 038
4 015029 01 1 2 033
5 015029 02 1 1 036
6 015029 02 1 2 033
7 015029 03 1 1 023
8 015029 03 1 2 019
9 015029 04 1 2 045
10 015029 05 1 2 047"),
header = TRUE, stringsAsFactors = FALSE)
closeAllConnections()
## create a unique ID for matching unit and home
dat$mID <- with(dat, paste(unit, home, sep = ''))
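## somewhat messy way of creating a couple number:
## for each mID, if there is more than 1 row and more than 1 sex,
## it assigns a couple id, otherwise 0
## (note: unlist() returns values in split()'s sorted group order,
## so this relies on the rows lining up with that order)
i <- 0L
dat$couple <- with(dat, unlist(lapply(split(sex, mID), function(x) {
  i <<- i + 1L  # the counter lives outside the function, hence `<<-`
  if (length(x) > 1 && length(unique(x)) > 1) {
    rep(i, length(x))
  } else rep(0L, length(x))  # one 0 per row so the lengths still match
})))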
## view results
dat
   obs  unit home z sex age     mID couple
1    1 15029   18 1   1  53 1502918      1
2    2 15029   18 1   2  49 1502918      1
3    3 15029    1 1   1  38  150291      2
4    4 15029    1 1   2  33  150291      2
5    5 15029    2 1   1  36  150292      3
6    6 15029    2 1   2  33  150292      3
7    7 15029    3 1   1  23  150293      4
8    8 15029    3 1   2  19  150293      4
9    9 15029    4 1   2  45  150294      0
10  10 15029    5 1   2  47  150295      0
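One caveat about the key: read.table() converts unit and home to integers, which is why the leading zeros are gone in the printed data, and pasting with sep = '' can let two different unit/home pairs collapse into the same string. Here is a minimal sketch of one way around both, assuming it is acceptable to read those two columns as character and to use "_" as a separator. The name dat2 is hypothetical, and only three of the rows are used to keep it short.

```r
## read unit and home as character so "015029" and "01" keep their zeros
dat2 <- read.table(text = "
obs unit home z sex age
1 015029 18 1 1 053
2 015029 18 1 2 049
3 015029 01 1 1 038",
  header = TRUE,
  colClasses = c("integer", "character", "character",
                 "integer", "integer", "integer"))

## a separator keeps distinct unit/home pairs distinct
dat2$mID <- paste(dat2$unit, dat2$home, sep = "_")
dat2$mID

## without a separator, different pairs can produce the same key
paste("1502", "918", sep = "") == paste("15029", "18", sep = "")
```

Whether the zeros matter depends on how unit and home are coded in the full data, so treat this as an option rather than a requirement.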
See these functions for more details:
?ave # where I got my idea
?split
?lapply
?`<<-`
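
For reference, here is a minimal sketch of the ave() route. It assumes the couple ids should be numbered in the order the households first appear in the data, rather than in split()'s sorted group order. The toy data frame dat3 and the names key and is_couple are hypothetical, not part of your data. This version does not depend on the rows of a household sitting next to each other, and it gives 0 to every row of a household with only one sex.

```r
## toy data: households A_01 and B_01 are interleaved, C_02 has two women
dat3 <- data.frame(
  unit = c("A", "B", "A", "B", "C", "C"),
  home = c("01", "01", "01", "01", "02", "02"),
  sex  = c(1L, 1L, 2L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)

## one key per unit/home pair
key <- paste(dat3$unit, dat3$home, sep = "_")

## ave() returns one value per row, in the original row order:
## 1 if the row's household contains more than one sex, 0 otherwise
is_couple <- ave(dat3$sex, key, FUN = function(x) {
  rep(as.integer(length(unique(x)) > 1), length(x))
}) == 1

## number the couple households in order of first appearance, 0 elsewhere
couple <- integer(nrow(dat3))
couple[is_couple] <- match(key[is_couple], unique(key[is_couple]))
dat3$couple <- couple
dat3
```

Because match() works on the keys themselves, the ids run 1, 2, 3, ... with no gaps, even when households without a couple fall between the others.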
Cheers,
Josh
On Sat, Nov 12, 2011 at 8:16 PM, jour4life <jour4life at gmail.com> wrote:
Hi all,

I've searched everywhere to try to find out how to do this and have had no luck. I am trying to construct identifiers for couples in a dataset. Essentially, I want to identify couples using more than one column as identifiers. Take for instance:

obs  unit    home  z  sex  age
1    015029  18    1  1    053
2    015029  18    1  2    049
3    015029  01    1  1    038
4    015029  01    1  2    033
5    015029  02    1  1    036
6    015029  02    1  2    033
7    015029  03    1  1    023
8    015029  03    1  2    019
9    015029  04    1  2    045
10   015029  05    1  2    047

Where unit is the housing unit, home is household. Of course, there are more values for unit, although these first ten observations consist of the same unit (which could possibly be an apartment complex). Nonetheless, I want to construct an identifier for couples if unit, home match, but only if both male and female are within the same household.

Taking the example data above, I want to see this:

     unit    home  z  sex  age  couple
1    015029  18    1  1    053  1
2    015029  18    1  2    049  1
3    015029  01    1  1    038  2
4    015029  01    1  2    033  2
5    015029  02    1  1    036  3
6    015029  02    1  2    033  3
7    015029  03    1  1    023  4
8    015029  03    1  2    019  4
9    015029  04    1  2    045  0
10   015029  05    1  2    047  0

As you can see in the last two observations, there were no males identified within the same household, thus the last two observations would not contain couple identifiers, rather some other identifier (but the same one) so I can detect them and remove them later. I've tried using the duplicated function but was not very useful.

Any help would be greatly appreciated!!!

Thanks,
Carlos

View this message in context: http://r.789695.n4.nabble.com/identify-duplicate-from-more-than-one-column-tp4035888p4035888.html
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/