Given a dataframe:
dat<-data.frame(S=factor(c(rep('a',2),rep('b',1),rep('c',3)),levels=c('b','a','c')),
D=c(5,1,3,2,3,4))
where S is a subject identifier and D a visit (actually a date in my
real dataset). I would like to generate another column giving the visit
number
R=c(2,1,1,1,2,3)
My current solution uses nested loops and is slow and ugly. I've looked
at by() but can't see how to keep the order of R correct.
Thanks,
Tom
Still trying to avoid loops
11 messages · Rui Barradas, Bert Gunter, Tom +3 more
Hello,
Aren't the levels of your example wrong? If the levels are
levels=c('a','b','c'), not c('b', 'a', 'c'), then the following will do
the job.
unname(unlist(tapply(dat$D, dat$S, order)))
Hope this helps,
Rui Barradas
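A small sketch (not from the thread) of why the factor levels matter for this approach: tapply() returns one result per group in the order of the factor levels, so unlist() concatenates group by group rather than row by row. The names `per_group`, `by_levels`, `by_alpha` and `S2` are chosen here for illustration only.

```r
# Data from the original post; the levels are deliberately c('b','a','c').
dat <- data.frame(
  S = factor(c(rep('a', 2), rep('b', 1), rep('c', 3)), levels = c('b', 'a', 'c')),
  D = c(5, 1, 3, 2, 3, 4)
)

# One element per level, in level order: b first, then a, then c.
per_group <- tapply(dat$D, dat$S, order)
names(per_group)

# Concatenating puts b's result ahead of a's, so results and rows no longer line up.
by_levels <- unname(unlist(per_group))
by_levels

# With alphabetical levels the groups come out in the same order as the
# (already grouped) rows, which is the case the one-liner relies on.
dat$S2 <- factor(dat$S, levels = c('a', 'b', 'c'))
by_alpha <- unname(unlist(tapply(dat$D, dat$S2, order)))
by_alpha
```

Under these assumptions only `by_alpha` matches the requested R=c(2,1,1,1,2,3); the one-liner depends on the rows already being grouped in level order.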
(In reply to Tom Wright's message of 04-02-2015 19:34.)
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
tapply() (of which by() is essentially a wrapper) **is** a (disguised) loop (at the R level, of course).
Cheers,
Bert
Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374
"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
Clifford Stoll
(In reply to Rui Barradas, Wed, Feb 4, 2015 at 11:49 AM.)
Thanks, I was not aware of order().
I did deliberately mess up the order of S. The following example breaks
your solution:
dat_2<-data.frame(S=factor(c('a','c','a','b','c','c')),
D=c(5,3,1,3,2,4))
which should give the answer c(2,2,1,1,2,3)
Your solution does indicate that sorting the data correctly before
starting might solve the problem.
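One way to keep the results in row order without sorting first is to split, compute per subject, and let unsplit() return each value to the row it came from. This is a sketch rather than anything proposed in the thread: `ranks_by_subject` and `visit` are illustrative names, and rank() is used here in place of order(), which is a substitution made for this sketch.

```r
dat_2 <- data.frame(S = factor(c('a', 'c', 'a', 'b', 'c', 'c')),
                    D = c(5, 3, 1, 3, 2, 4))

# Rank the visits within each subject. ties.method = "first" gives integer
# ranks and breaks ties by row order (an assumption about the desired behaviour).
ranks_by_subject <- lapply(split(dat_2$D, dat_2$S), rank, ties.method = "first")

# unsplit() reverses split(), putting each rank back in its original row.
visit <- unsplit(ranks_by_subject, dat_2$S)
visit
```

For this data the result is 2 2 1 1 1 3, which is the answer the thread later settles on for dat_2.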
(In reply to Rui Barradas, Wed, 2015-02-04 at 19:49 +0000.)
No problem with disguise, I'm looking for pretty.
(In reply to Bert Gunter, Wed, 2015-02-04 at 12:06 -0800.)
> dat<-data.frame(S=factor(c(rep('a',2),rep('b',1),rep('c',3)),levels=c('b','a','c')),
+ D=c(5,1,3,2,3,4))
> dat
  S D
1 a 5
2 a 1
3 b 3
4 c 2
5 c 3
6 c 4
> dat$visit <- ave(seq(nrow(dat)), dat$S, FUN = seq_along)
> dat
  S D visit
1 a 5     1
2 a 1     2
3 b 3     1
4 c 2     1
5 c 3     2
6 c 4     3
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
(In reply to Tom Wright, Wed, Feb 4, 2015 at 3:08 PM.)
Sorry Jim,
That messes up on S=='a'. It should be 2,1, not 1,2.
Neat answer though, and it looks like it should be pretty quick after I apply some sorting.
(In reply to jim holtman, Wed, 2015-02-04 at 15:37 -0500.)
A useful technique when it is easy to compute a vector from an ordered
data.frame but you need to do it for an unordered one is to compute the
order vector 'ord', compute the vector from df[ord,], and use
df[ord,...] <- vector to reorder the vector. In your case you could do:
> dat_2<-data.frame(S=factor(c('a','c','a','b','c','c')),
+ D=c(5,3,1,3,2,4))
> ord <- with(dat_2, order(S, D)) # order by subject, break ties by date
> dat_2$visitNo <- integer(nrow(dat_2)) # will fill this in next
> dat_2$visitNo[ord] <- with(dat_2[ord,], ave(visitNo, S, FUN=seq_along))
> dat_2
  S D visitNo
1 a 5       2
2 c 3       2
3 a 1       1
4 b 3       1
5 c 2       1
6 c 4       3
Now this is different from your answer, c(2,2,1,1,2,3). Which is correct?
You can also do the reordering of the result from the ordered dataset by
subscripting the right hand side with [order(ord)], but I find using [ord]
on left side easier to remember.
with(dat_2[ord,], ave(visitNo, S, FUN=seq_along))[order(ord)]
[1] 2 2 1 1 1 3
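The sort, compute, unsort pattern above can be wrapped in a small helper. This is a sketch under stated assumptions: `visit_number` and its argument names are hypothetical and do not appear in the thread. Since only order() touches the date column, the same function should work for numeric and Date values alike.

```r
# Hypothetical helper: visit number per subject, returned in the original row order.
visit_number <- function(subject, date) {
  ord <- order(subject, date)        # sort by subject, break ties by date
  out <- integer(length(subject))    # placeholder, filled in below
  # In the sorted data each subject's rows are in date order, so a running
  # count per subject is the visit number; assigning through [ord] unsorts it.
  out[ord] <- ave(seq_along(ord), subject[ord], FUN = seq_along)
  out
}

dat_2 <- data.frame(S = factor(c('a', 'c', 'a', 'b', 'c', 'c')),
                    D = c(5, 3, 1, 3, 2, 4))
visit_number(dat_2$S, dat_2$D)

# The same call with the visits stored as Dates.
dates <- as.Date(sprintf("2015-02-%02d", dat_2$D))
visit_number(dat_2$S, dates)
```

Both calls return 2 2 1 1 1 3 for this data.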
Bill Dunlap
TIBCO Software
wdunlap tibco.com
(In reply to Tom Wright, Wed, Feb 4, 2015 at 12:07 PM.)
How about?
ave(dat$D, dat$S, FUN=order)
[1] 2 1 1 1 2 3
ave(dat_2$D, dat_2$S, FUN=order)
[1] 2 2 1 1 1 3
Note, your answer for the second example is incorrect since row 2 (c, 3) and row 5 (c, 2) are both assigned 2.
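A caveat worth checking before relying on FUN=order (an observation added here, not raised in the thread): order() returns the indices that would sort a vector, which is not always each element's position in sorted order. The two coincide for both examples above, but they can differ once a subject has three or more visits, and rank() gives the position directly. `dat_4` below is a hypothetical extra example.

```r
x <- c(3, 1, 2)   # one subject's visits, in row order
order(x)          # 2 3 1: the indices that would sort x
rank(x)           # 3 1 2: the position of each element once sorted

# The same difference shows up through ave().
dat_4 <- data.frame(S = factor(rep('a', 3)), D = x)
by_order <- ave(dat_4$D, dat_4$S, FUN = order)
by_rank  <- ave(dat_4$D, dat_4$S, FUN = rank)
by_order
by_rank
```

Here the visit numbers wanted are 3 1 2, which is what the rank() version returns. If ties are possible, rank(x, ties.method = "first") avoids fractional ranks.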
-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352
(In reply to Tom Wright, Wednesday, February 4, 2015 2:08 PM.)
A potential problem with
ave(dat_2$D, dat_2$S, FUN=order)
is that it will silently give the wrong answer
or give an error if dat_2$D is not numeric.
E.g., if D is a Date vector we get
> dat_3 <- dat_2[,1:2]
> dat_3$D <- as.Date(paste0("2015-02-", dat_2$D))
> with(dat_3, ave(D, S, FUN=order))
Error in as.Date.numeric(value) : 'origin' must be supplied
Another problem is that it may take a lot more time than
is required if you have a lot of small groups in your data.
Both of those are avoided if you sort the entire dataset first
and 'unsort' the results when putting them into dataset.
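If the per-group ave() form is still preferred for a Date column, one possible workaround (an assumption made for this sketch, not something suggested in the thread) is to hand ave() plain numbers, so that it never has to rebuild a Date from the result. It does nothing for the speed concern with many small groups.

```r
dat_3 <- data.frame(S = factor(c('a', 'c', 'a', 'b', 'c', 'c')),
                    D = as.Date(sprintf("2015-02-%02d", c(5, 3, 1, 3, 2, 4))))

# as.numeric() on a Date gives days since 1970-01-01, which preserves ordering.
# rank() with ties.method = "first" returns each visit's position within its subject.
visit <- ave(as.numeric(dat_3$D), dat_3$S,
             FUN = function(d) rank(d, ties.method = "first"))
visit
```

For this data the result is 2 2 1 1 1 3, matching the numeric example.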
Bill Dunlap
TIBCO Software
wdunlap tibco.com
(In reply to David L Carlson, Wed, Feb 4, 2015 at 12:53 PM.)
Of course you are correct; the second answer should be c(2,2,1,1,1,3).
Thanks everyone.
(In reply to David L Carlson, Wed, 2015-02-04 at 20:53 +0000.)