Dear R people
Could you please help.
Basically, there are two variables in my data set. Each patient ('id')
may have one or more diseases ('diagnosis'). It looks like
id diagnosis
1 ah
2 ah
2 ihd
2 im
3 ah
3 stroke
4 ah
4 ihd
4 angina
5 ihd
..............
Q: How to make three data sets:
1. Patients with ah and ihd
2. Patients with ah but no ihd
3. Patients with ihd but no ah?
If you have any ideas, could you just point me to what I should look for? Is it
subset, or aggregate, or loops, or something else? I am a bit lost. (F1
F1 F1 !!! :)
Thank you
subsets
10 messages · Den, Taras Zakharko, Keith Jewell +5 more
Hi!

I think you should read the intro to R, as well as ?"[" and ?subset. It
should help you to understand.

Let's say your data is in a data.frame called df:

# 1. ah and ihd
df_ah_ihd <- df[df$diagnosis=="ah" | df$diagnosis=="ihd", ]
## the "|" is the boolean OR (you want one OR the other). Note the last comma

# 2. ah
df_ah <- df[df$diagnosis=="ah", ]

# 3. ihd
df_ihd <- df[df$diagnosis=="ihd", ]

You could do the same using subset() if you feel better with this function.

HTH,
Ivan
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Ivan CALANDRA
PhD Student
University of Hamburg
Biozentrum Grindel und Zoologisches Museum
Abt. Säugetiere
Martin-Luther-King-Platz 3
D-20146 Hamburg, GERMANY
+49(0)40 42838 6231
ivan.calandra at uni-hamburg.de
**********
http://www.for771.uni-bonn.de
http://webapp5.rrz.uni-hamburg.de/mammals/eng/1525_8_1.php
Hello Den,

your problem is not as simple as it may seem, so Ivan's suggestion is only a partial answer. I see that each patient can have more than one diagnosis, and I take it that you want to isolate patients based on particular conditions. Thus, simply looking for "ah" or "ihd" as Ivan suggests will yield patients who have either of those, but not necessarily patients who have both. Instead, one must apply the condition to the whole set of diagnoses associated with each patient.

I think that it's done best with the aggregate function. This function splits the data according to some factor (in our case it will be the patient id) and performs a routine on each subset (in our case it will be a condition test). Each line below overwrites ids, so run the one that matches the data set you need:

# 1. ah and ihd
ids <- aggregate(diagnosis ~ id, df, function(x) "ah" %in% x && "ihd" %in% x)
# 2. ah but no ihd
ids <- aggregate(diagnosis ~ id, df, function(x) "ah" %in% x && !"ihd" %in% x)
# 3. ihd but no ah
ids <- aggregate(diagnosis ~ id, df, function(x) !"ah" %in% x && "ihd" %in% x)

Now, ids will contain a data frame like:

id diagnosis
1 TRUE
2 FALSE
3 FALSE
...

which shows which patients have the set of diagnoses you asked for. You can then apply these patients to the original data with something like:

subset(df, id %in% subset(ids, diagnosis == TRUE)$id)

This extracts from the 'ids' data frame only the patients for which the condition holds, and then extracts the associated diagnosis sets from the original 'df' data frame.

Hope it helps,
Taras
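A minimal end-to-end sketch of the aggregate approach for the first data set, assuming the sample rows from the question are stored in a data frame called df with a character diagnosis column. The names flags and both are illustrative and do not come from the thread.

```r
# Sample data from the original question
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 3, 4, 4, 4, 5),
  diagnosis = c("ah", "ah", "ihd", "im", "ah", "stroke",
                "ah", "ihd", "angina", "ihd"),
  stringsAsFactors = FALSE
)

# One logical flag per patient: TRUE if the patient has both diagnoses
flags <- aggregate(diagnosis ~ id, data = df,
                   FUN = function(x) "ah" %in% x && "ihd" %in% x)

# Keep every row belonging to a patient whose flag is TRUE
both <- df[df$id %in% flags$id[flags$diagnosis], ]
both
```

Swapping the test inside FUN gives the other two data sets in the same way.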
I don't think Ivan's solution meets the OP's needs.
I think you could do it using %in% and the appropriate logical operations,
e.g.
aDF <- data.frame(id=c(1,2,2,2,3,3,4,4,4,5),
                  diagnosis=c("ah", "ah", "ihd", "im", "ah", "stroke", "ah", "ihd",
                              "angina", "ihd"))
# 1. ah and ihd
aDF[with(aDF, (id %in% id[diagnosis=="ah"]) & (id %in% id[diagnosis=="ihd"])), ]
# 2. ah but no ihd
aDF[with(aDF, (id %in% id[diagnosis=="ah"]) & !(id %in% id[diagnosis=="ihd"])), ]
# 3. ihd but no ah
aDF[with(aDF, !(id %in% id[diagnosis=="ah"]) & (id %in% id[diagnosis=="ihd"])), ]
That starts to feel a bit fiddly for me. You might want to look at package
sqldf.
HTH
Keith J
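If the three %in% expressions feel fiddly, one option is to wrap the pattern in a small function. The sketch below is only an illustration: select_patients and its arguments has and lacks are hypothetical names, not part of base R or of any reply in this thread.

```r
# Hypothetical helper (not from the thread): returns all rows of patients who
# have every diagnosis in `has` and none of the diagnoses in `lacks`.
select_patients <- function(data, has = character(), lacks = character()) {
  ids <- unique(data$id)
  for (d in has)   ids <- intersect(ids, data$id[data$diagnosis == d])
  for (d in lacks) ids <- setdiff(ids, data$id[data$diagnosis == d])
  data[data$id %in% ids, ]
}

aDF <- data.frame(
  id = c(1, 2, 2, 2, 3, 3, 4, 4, 4, 5),
  diagnosis = c("ah", "ah", "ihd", "im", "ah", "stroke",
                "ah", "ihd", "angina", "ihd"),
  stringsAsFactors = FALSE
)

ah_and_ihd <- select_patients(aDF, has = c("ah", "ihd"))
ah_no_ihd  <- select_patients(aDF, has = "ah",  lacks = "ihd")
ihd_no_ah  <- select_patients(aDF, has = "ihd", lacks = "ah")
```

The helper works on sets of patient ids first and only then goes back to the rows, which is the same idea as the one-liners above.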
On Thu, Jan 20, 2011 at 10:53:01AM +0200, Den wrote:
Q: How to make three data sets:
1. Patients with ah and ihd
2. Patients with ah but no ihd
3. Patients with ihd but no ah?
This may be understood as a two-step procedure:
1. Split the ids into disjoint groups according to the above criteria.
2. Split the data cases into the groups from step 1.
If this is what you want, then the function table() may be used to
collect information on each id.
df <- structure(list(id = c(1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L),
diagnosis = structure(c(1L, 1L, 3L, 4L, 1L, 5L, 1L, 3L, 2L, 3L),
.Label = c("ah", "angina", "ihd", "im", "stroke"), class = "factor")),
.Names = c("id", "diagnosis"), class = "data.frame", row.names = c(NA, -10L))
tab <- table(df$id, df$diagnosis)
Then, for example, the data cases for "2. Patients with ah but no ihd"
may be obtained as follows:
sel <- tab[, "ah"] != 0 & tab[, "ihd"] == 0
ah.noihd <- dimnames(tab)[[1]][sel] # [1] "1" "3"
df[df$id %in% ah.noihd, ]
# id diagnosis
# 1 1 ah
# 5 3 ah
# 6 3 stroke
I hope this helps.
Petr Savicky.
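The two-step procedure can be carried through for all three groups at once. This is a sketch under the assumption that the data frame is called df as above (built here with a character column rather than a factor); the group labels are made up for illustration.

```r
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 3, 4, 4, 4, 5),
  diagnosis = c("ah", "ah", "ihd", "im", "ah", "stroke",
                "ah", "ihd", "angina", "ihd"),
  stringsAsFactors = FALSE
)

# Step 1: classify each id using the id-by-diagnosis contingency table
tab <- table(df$id, df$diagnosis)
has_ah  <- tab[, "ah"]  > 0
has_ihd <- tab[, "ihd"] > 0

# Illustrative labels; ids with neither diagnosis would get NA
group <- ifelse(has_ah & has_ihd, "ah_and_ihd",
         ifelse(has_ah, "ah_no_ihd",
         ifelse(has_ihd, "ihd_no_ah", NA)))
names(group) <- rownames(tab)

# Step 2: look up each row's group by id and split the data cases
groups <- split(df, group[as.character(df$id)])
groups
```

split() drops rows whose group is NA, so patients with neither diagnosis do not appear in any of the resulting data frames.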
I did try it. It gave me

[[1]]
   id diagnosis
1   1        ah
5   3        ah
7   4        ah
8   4       ihd
10  5       ihd

[[2]]
  id diagnosis
1  1        ah
2  2        ah
5  3        ah
7  4        ah

[[3]]
   id diagnosis
3   2       ihd
8   4       ihd
10  5       ihd

which isn't what the OP asked for:

Q: How to make three data sets:
1. Patients with ah and ihd
  id diagnosis
2  2        ah
3  2       ihd
4  2        im
7  4        ah
8  4       ihd
9  4    angina

2. Patients with ah but no ihd
  id diagnosis
1  1        ah
5  3        ah
6  3    stroke

3. Patients with ihd but no ah?
   id diagnosis
10  5       ihd
Regards,
KJ
---------------------------------
"Henrique Dallazuanna" <wwwhsd at gmail.com> wrote in message
news:AANLkTikQnw_hNtDyXdrJ+yTyqf6tGHLmH0qsLEoufTdJ at mail.gmail.com...
Try this:
lapply(list(c('ah', 'ihd'), 'ah', 'ihd'), function(x) subset(aDF, diagnosis == x))
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
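The first element of the lapply() result looks odd because `==` with a length-two right-hand side is recycled along the column instead of testing set membership. A short sketch on the diagnosis column alone (the variable names by_equals and by_in are illustrative):

```r
diagnosis <- c("ah", "ah", "ihd", "im", "ah", "stroke",
               "ah", "ihd", "angina", "ihd")

# `==` recycles c("ah", "ihd"): element i is compared with "ah" when i is odd
# and with "ihd" when i is even.
by_equals <- diagnosis == c("ah", "ihd")
which(by_equals)   # 1 5 7 8 10, the rows shown in [[1]] above

# `%in%` asks, for every element, whether it occurs anywhere in the set.
by_in <- diagnosis %in% c("ah", "ihd")
which(by_in)       # 1 2 3 5 7 8 10
```

Even the %in% version only selects rows, not patients, so it still does not answer the original question by itself; the condition has to be evaluated per id.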
Hi Taras,

Indeed, I've overlooked the problem. Anyway, I'm not sure I would have been able to give a complete answer like you did!

Ivan
Here's a tidy version using the plyr package:
library(plyr)
# one row per id, with a logical flag for each of the three groups
df1 <- ddply(df, .(id), summarize,
             has.both     = ("ah" %in% diagnosis) & ("ihd" %in% diagnosis),
             has.only.ah  = ("ah" %in% diagnosis) & !("ihd" %in% diagnosis),
             has.only.ihd = !("ah" %in% diagnosis) & ("ihd" %in% diagnosis)
)
Further processing on the columns of df1 is straightforward.
Peter Ehlers
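One way the further processing might look is sketched below. To keep the example free of package dependencies, the per-patient flag table is rebuilt here with base R's tapply(), which is an assumption of this sketch and not what the reply above uses; the column names mirror has.both, has.only.ah and has.only.ihd.

```r
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 3, 4, 4, 4, 5),
  diagnosis = c("ah", "ah", "ihd", "im", "ah", "stroke",
                "ah", "ihd", "angina", "ihd"),
  stringsAsFactors = FALSE
)

# Per-patient flags (base R stand-in for the ddply() summary)
has_ah  <- tapply(df$diagnosis == "ah",  df$id, any)
has_ihd <- tapply(df$diagnosis == "ihd", df$id, any)
df1 <- data.frame(
  id           = as.numeric(names(has_ah)),
  has.both     = as.vector(has_ah & has_ihd),
  has.only.ah  = as.vector(has_ah & !has_ihd),
  has.only.ihd = as.vector(!has_ah & has_ihd)
)

# Turn each flag column into a data set holding the full patient records
datasets <- lapply(df1[-1], function(flag) df[df$id %in% df1$id[flag], ])
names(datasets)
```

The result is a named list of three data frames, one per flag column.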
3 days later
require(data.table)
DT = as.data.table(df)
# 1. Patients with ah and ihd
DT[,.SD["ah"%in%diagnosis && "ihd"%in%diagnosis],by=id]
id diagnosis
[1,] 2 ah
[2,] 2 ihd
[3,] 2 im
[4,] 4 ah
[5,] 4 ihd
[6,] 4 angina
# 2. Patients with ah but no ihd
DT[,.SD["ah"%in%diagnosis && !"ihd"%in%diagnosis],by=id]
id diagnosis
[1,] 1 ah
[2,] 3 ah
[3,] 3 stroke
# 3. Patients with ihd but no ah?
DT[,.SD[!"ah"%in%diagnosis && "ihd"%in%diagnosis],by=id]
id diagnosis
[1,] 5 ihd
View this message in context: http://r.789695.n4.nabble.com/subsets-tp3227143p3233177.html Sent from the R help mailing list archive at Nabble.com.
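For readers without data.table, the same keep-or-drop-by-group idea can be sketched in base R with ave(), which evaluates a function within each id and returns one value per row. The names pt_has_ah and pt_has_ihd are illustrative, and df is assumed to hold the sample data from the question.

```r
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 3, 4, 4, 4, 5),
  diagnosis = c("ah", "ah", "ihd", "im", "ah", "stroke",
                "ah", "ihd", "angina", "ihd"),
  stringsAsFactors = FALSE
)

# For every row: does this row's patient have the diagnosis in any record?
pt_has_ah  <- ave(df$diagnosis == "ah",  df$id, FUN = any)
pt_has_ihd <- ave(df$diagnosis == "ihd", df$id, FUN = any)

ah_and_ihd <- df[pt_has_ah & pt_has_ihd, ]
ah_no_ihd  <- df[pt_has_ah & !pt_has_ihd, ]
ihd_no_ah  <- df[!pt_has_ah & pt_has_ihd, ]
```

Because the flags are already row-aligned with df, no lookup back from ids to rows is needed.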