Hi all, I think this should be an easy question for the guru's out here. I have this large data frame (2.500.000 rows, 15 columns) and I want to add a column named "SEGMENT" to it. The first 5% rows (first 125.000 rows) should have the value "Top 5%" in the SEGMENT column Then the rows from 5% to 20% should have the value "5 to 20" Then 20-50% should have the value "20 to 50" And the last 50% of the rows should have the value "Bottom 50" What is the easiest way of doing this? I was thinking of using quantile but then I should have some rownumber column. Regards Derk -- View this message in context: http://r.789695.n4.nabble.com/Add-a-column-to-a-data-frame-with-value-based-on-the-percentile-of-the-row-tp4672711.html Sent from the R help mailing list archive at Nabble.com.
Add a column to a data frame with value based on the percentile of the row
7 messages · Rui Barradas, arun, Dark
Hello,
Combine quantile() with findInterval(). Something like the following.
# sample data
x <- rnorm(100)
val <- c("Bottom 50", "20 to 50", "5 to 20", "Top 5%")
qq <- quantile(x, probs = c(0, 0.50, 0.70, 0.95, 1))
idx <- findInterval(x, qq)
val[idx]
Hope this helps,
Rui Barradas
Em 31-07-2013 10:37, Dark escreveu:
Hi all, I think this should be an easy question for the guru's out here. I have this large data frame (2.500.000 rows, 15 columns) and I want to add a column named "SEGMENT" to it. The first 5% rows (first 125.000 rows) should have the value "Top 5%" in the SEGMENT column Then the rows from 5% to 20% should have the value "5 to 20" Then 20-50% should have the value "20 to 50" And the last 50% of the rows should have the value "Bottom 50" What is the easiest way of doing this? I was thinking of using quantile but then I should have some rownumber column. Regards Derk -- View this message in context: http://r.789695.n4.nabble.com/Add-a-column-to-a-data-frame-with-value-based-on-the-percentile-of-the-row-tp4672711.html Sent from the R help mailing list archive at Nabble.com.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi,
May be this helps:
set.seed(24)
dat1<- data.frame(ID=1:500,value=rnorm(500))
indx<-round(quantile(as.numeric(row.names(dat1)),probs=c(0.05,0.20,0.50,1)))
indx1<-findInterval(row.names(dat1),indx,rightmost.closed=TRUE)
dat1$SEGMENT<- as.character(factor(indx1,labels=c("Top 5%","5 to 20","20 to 50", "Bottom 50")))
head(dat1)
#? ID????? value SEGMENT
#1? 3 -0.7859574? Top 5%
#2? 3? 1.0117428? Top 5%
#3? 8 -2.1558035? Top 5%
#4? 6? 1.7803880? Top 5%
#5? 7? 0.4192816? Top 5%
#6 10 -1.0142512? Top 5%
?tail(dat1)
#??? ID????? value?? SEGMENT
#495? 1? 0.3571848 Bottom 50
#496? 9 -1.1971854 Bottom 50
#497? 5? 0.3544896 Bottom 50
#498? 8 -0.1562356 Bottom 50
#499? 8 -0.2994321 Bottom 50
#500? 8 -0.4170319 Bottom 50
A.K.
----- Original Message -----
From: Dark <info at software-solutions.nl>
To: r-help at r-project.org
Cc:
Sent: Wednesday, July 31, 2013 5:37 AM
Subject: [R] Add a column to a data frame with value based on the percentile of the row
Hi all,
I think this should be an easy question for the guru's out here.
I have this large data frame (2.500.000 rows, 15 columns) and I want to add
a column named "SEGMENT" to it.
The first 5% rows (first 125.000 rows) should have the value "Top 5%" in the
SEGMENT column
Then the rows from 5% to 20% should have the value "5 to 20"
Then 20-50% should have the value "20 to 50"
And the last 50% of the rows should have the value "Bottom 50"
What is the easiest way of doing this? I was thinking of using quantile but
then I should have some rownumber column.
Regards Derk
--
View this message in context: http://r.789695.n4.nabble.com/Add-a-column-to-a-data-frame-with-value-based-on-the-percentile-of-the-row-tp4672711.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi Arun Kirshna,
I have tested your method and it will work for me.
I only run into one problem. Before I want to do this operation I have
sorted my data frame so my rownumbers ar not subsequent.
You can see if you first order your example data frame like:
dat1 <- dat1[order(-dat1$value),]
head(dat1)
ID value SEGMENT
237 237 3.538552 20 to 50
21 21 3.376149 Top 5%
421 421 3.015634 Bottom 50
339 339 2.855991 Bottom 50
119 119 2.589574 20 to 50
12 12 2.512276 Top 5%
Do you have a solution for this?
--
View this message in context: http://r.789695.n4.nabble.com/Add-a-column-to-a-data-frame-with-value-based-on-the-percentile-of-the-row-tp4672711p4672726.html
Sent from the R help mailing list archive at Nabble.com.
Hi,
?set.seed(24)
dat1<- data.frame(ID=1:500,value=rnorm(500))
?dat1 <- dat1[order(-dat1$value),]
row.names(dat1)<-1:nrow(dat1) ########
indx<-round(quantile(as.numeric(row.names(dat1)),probs=c(0.05,0.20,0.50,1)))
indx1<-findInterval(row.names(dat1),indx,rightmost.closed=TRUE)
dat1$SEGMENT<- as.character(factor(indx1,labels=c("Top 5%","5 to 20","20 to 50", "Bottom 50")))
A.K.
Hi Arun Kirshna,
I have tested your method and it will work for me.
I only run into one problem. Before I want to do this operation I have sorted my data frame so my rownumbers ar not subsequent.
You can see if you first order your example data frame like:
dat1 <- dat1[order(-dat1$value),]
head(dat1)
? ? ?ID ? ?value ? SEGMENT
237 237 3.538552 ?20 to 50
21 ? 21 3.376149 ? ?Top 5%
421 421 3.015634 Bottom 50
339 339 2.855991 Bottom 50
119 119 2.589574 ?20 to 50
12 ? 12 2.512276 ? ?Top 5%
Do you have a solution for this?
----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Dark <info at software-solutions.nl>
Cc: R help <r-help at r-project.org>
Sent: Wednesday, July 31, 2013 7:48 AM
Subject: Re: [R] Add a column to a data frame with value based on the percentile of the row
Hi,
May be this helps:
set.seed(24)
dat1<- data.frame(ID=1:500,value=rnorm(500))
indx<-round(quantile(as.numeric(row.names(dat1)),probs=c(0.05,0.20,0.50,1)))
indx1<-findInterval(row.names(dat1),indx,rightmost.closed=TRUE)
dat1$SEGMENT<- as.character(factor(indx1,labels=c("Top 5%","5 to 20","20 to 50", "Bottom 50")))
head(dat1)
#? ID????? value SEGMENT
#1? 3 -0.7859574? Top 5%
#2? 3? 1.0117428? Top 5%
#3? 8 -2.1558035? Top 5%
#4? 6? 1.7803880? Top 5%
#5? 7? 0.4192816? Top 5%
#6 10 -1.0142512? Top 5%
?tail(dat1)
#??? ID????? value?? SEGMENT
#495? 1? 0.3571848 Bottom 50
#496? 9 -1.1971854 Bottom 50
#497? 5? 0.3544896 Bottom 50
#498? 8 -0.1562356 Bottom 50
#499? 8 -0.2994321 Bottom 50
#500? 8 -0.4170319 Bottom 50
A.K.
----- Original Message -----
From: Dark <info at software-solutions.nl>
To: r-help at r-project.org
Cc:
Sent: Wednesday, July 31, 2013 5:37 AM
Subject: [R] Add a column to a data frame with value based on the percentile of the row
Hi all,
I think this should be an easy question for the guru's out here.
I have this large data frame (2.500.000 rows, 15 columns) and I want to add
a column named "SEGMENT" to it.
The first 5% rows (first 125.000 rows) should have the value "Top 5%" in the
SEGMENT column
Then the rows from 5% to 20% should have the value "5 to 20"
Then 20-50% should have the value "20 to 50"
And the last 50% of the rows should have the value "Bottom 50"
What is the easiest way of doing this? I was thinking of using quantile but
then I should have some rownumber column.
Regards Derk
--
View this message in context: http://r.789695.n4.nabble.com/Add-a-column-to-a-data-frame-with-value-based-on-the-percentile-of-the-row-tp4672711.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Works like a charm, thanks a lot! -- View this message in context: http://r.789695.n4.nabble.com/Add-a-column-to-a-data-frame-with-value-based-on-the-percentile-of-the-row-tp4672711p4672728.html Sent from the R help mailing list archive at Nabble.com.
Hello, Sorry, that should be 0.80, not 0.70. qq <- quantile(x, probs = c(0, 0.50, 0.80, 0.95, 1)) Rui Barradas Em 31-07-2013 12:22, Rui Barradas escreveu:
Hello,
Combine quantile() with findInterval(). Something like the following.
# sample data
x <- rnorm(100)
val <- c("Bottom 50", "20 to 50", "5 to 20", "Top 5%")
qq <- quantile(x, probs = c(0, 0.50, 0.70, 0.95, 1))
idx <- findInterval(x, qq)
val[idx]
Hope this helps,
Rui Barradas
Em 31-07-2013 10:37, Dark escreveu:
Hi all, I think this should be an easy question for the guru's out here. I have this large data frame (2.500.000 rows, 15 columns) and I want to add a column named "SEGMENT" to it. The first 5% rows (first 125.000 rows) should have the value "Top 5%" in the SEGMENT column Then the rows from 5% to 20% should have the value "5 to 20" Then 20-50% should have the value "20 to 50" And the last 50% of the rows should have the value "Bottom 50" What is the easiest way of doing this? I was thinking of using quantile but then I should have some rownumber column. Regards Derk -- View this message in context: http://r.789695.n4.nabble.com/Add-a-column-to-a-data-frame-with-value-based-on-the-percentile-of-the-row-tp4672711.html Sent from the R help mailing list archive at Nabble.com.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.