Dear R experts,
Sorry for this basic question, but I can't seem to find a solution.
I have this data frame:
df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A = c(11905, 11907, 11907, 11829, 11829, 11829),
                 v1 = c(NA, 3, NA, 1, 2, NA),
                 v2 = c(NA, 2, NA, 2, NA, NA),
                 v3 = c(NA, 1, NA, 1, NA, NA),
                 v4 = c("N", "Y", "N", "Y", "N", "N"),
                 v5 = c(0, 0, 0, 1, 0, 0),
                 numMiss = c(3, 0, 3, 0, 2, 3))
df
id A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA N 0 3
2 id1 11907 3 2 1 Y 0 0
3 id1 11907 NA NA NA N 0 3
4 id2 11829 1 2 1 Y 1 0
5 id2 11829 2 NA NA N 0 2
6 id2 11829 NA NA NA N 0 3
And I need to keep, among the rows that share the same value of "A" within an
id, only the one with the fewest missing values across all the variables
(i.e. with min(numMiss)), to get this:
id A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA N 0 3
2 id1 11907 3 2 1 Y 0 0
4 id2 11829 1 2 1 Y 1 0
Then, among the rows that share the same id, I have to keep the record with
the smallest value of "A", like this:
id A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA N 0 3
4 id2 11829 1 2 1 Y 1 0
For groupings I have used the package "plyr" before, but this would involve
a sort of double grouping, by id and by duplicated values of A. Could you
please help me understand how this can be done?
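[For reference, here is one possible base-R sketch of the two steps described above. It is an illustration, not a definitive answer: it uses ave() rather than plyr, and it keeps all tied rows if several share the minimum.]

```r
# Sample data as given above
df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A = c(11905, 11907, 11907, 11829, 11829, 11829),
                 v1 = c(NA, 3, NA, 1, 2, NA),
                 v2 = c(NA, 2, NA, 2, NA, NA),
                 v3 = c(NA, 1, NA, 1, NA, NA),
                 v4 = c("N", "Y", "N", "Y", "N", "N"),
                 v5 = c(0, 0, 0, 1, 0, 0),
                 numMiss = c(3, 0, 3, 0, 2, 3))

# Step 1: within each (id, A) pair, keep the row(s) with min(numMiss)
step1 <- df[df$numMiss == ave(df$numMiss, df$id, df$A, FUN = min), ]

# Step 2: within each id, keep the row(s) with the smallest A
step2 <- step1[step1$A == ave(step1$A, step1$id, FUN = min), ]
step2
```

On this sample, step2 contains the two requested rows (id1/11905 and id2/11829).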
Thank you very much.
-f
--
View this message in context: http://r.789695.n4.nabble.com/Choose-between-duplicated-rows-tp4557833p4557833.html
Sent from the R help mailing list archive at Nabble.com.
Date: Sat, 14 Apr 2012 12:03:36 -0700
From: francy.casalino at gmail.com
To: r-help at r-project.org
Subject: [R] Choose between duplicated rows
Thank you very much for both your replies.
Trinker's solution works great for a small dataset, but the 'split' function
just hangs when I try to apply it to all my data (around 9,000 rows). Has
anyone encountered this problem before, and do you know what I could try?
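[In case it helps anyone hitting the same wall, one possible cause, assumed here rather than confirmed against the real data: split() by default builds one element for every *possible* level combination, so splitting on an interaction of factors can create far more groups than rows. drop = TRUE keeps only the combinations that actually occur.]

```r
# Toy data: 2 ids crossed with 3 distinct A values = 6 possible combinations,
# but only 3 of them actually occur in the rows
df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A = c(11905, 11907, 11907, 11829, 11829, 11829))

# Without drop, split() returns one element per possible id/A combination
groups_all <- split(df, interaction(df$id, df$A))
length(groups_all)   # 6 groups, 3 of them empty

# With drop = TRUE, only occurring combinations are kept
groups_used <- split(df, interaction(df$id, df$A), drop = TRUE)
length(groups_used)  # 3 groups
```

On large data with many distinct values, the empty groups can dominate the result and make downstream lapply/rbind steps crawl.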
Thanks again.
--
View this message in context: http://r.789695.n4.nabble.com/Choose-between-duplicated-rows-tp4557833p4559319.html
I also tried using Jim's code, but it doesn't work as expected with my real
dataset. This is what I did:
Best.na <- do.call(rbind, lapply(split(x, x$A), function(.grp) {
  best <- which.min(apply(.grp, 1, function(a) sum(is.na(a))))
  .grp[best, ]
}))

df.split <- split(Best.na, Best.na$id)
Best.date <- lapply(df.split, function(x) {
  # Select by given criterion
  y <- x[which(max(x$A) == x$A), ]
  y
})
Best.date <- do.call(rbind, Best.date)
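[A guess at what may be going wrong above, based on the problem statement rather than the real data: which(max(x$A) == x$A) keeps the *largest* A per id, while the goal stated earlier was the smallest, and split(x, x$A) groups by A alone, so rows from different ids that happen to share an A value would be merged. A corrected sketch under those assumptions:]

```r
# Sample data as in the original question
df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A = c(11905, 11907, 11907, 11829, 11829, 11829),
                 v1 = c(NA, 3, NA, 1, 2, NA),
                 v2 = c(NA, 2, NA, 2, NA, NA),
                 v3 = c(NA, 1, NA, 1, NA, NA),
                 v4 = c("N", "Y", "N", "Y", "N", "N"),
                 v5 = c(0, 0, 0, 1, 0, 0),
                 numMiss = c(3, 0, 3, 0, 2, 3))

# Step 1: within each (id, A) group, keep the row with the fewest NAs
Best.na <- do.call(rbind,
  lapply(split(df, interaction(df$id, df$A, drop = TRUE)), function(.grp) {
    best <- which.min(apply(.grp, 1, function(a) sum(is.na(a))))
    .grp[best, ]
  }))

# Step 2: within each id, keep the row with the *smallest* A
Best.date <- do.call(rbind,
  lapply(split(Best.na, Best.na$id), function(x) x[which.min(x$A), ]))
Best.date
```

Note that which.min(), like the original code, keeps only the first row on ties.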
Thank you again for your help.
--
View this message in context: http://r.789695.n4.nabble.com/Choose-between-duplicated-rows-tp4557833p4559792.html