help with mysql and R: partitioning by quintile

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110508/65322ce7/attachment.pl>
try this:
# create some data
x <- data.frame(userid = paste('u', rep(1:20, each = 20), sep = '')
+               , track = rep(1:20, 20)
+               , freq = floor(runif(400, 10, 200))
+               , stringsAsFactors = FALSE
+               )
# get the quantiles for each track
tq <- tapply(x$freq, x$track, quantile, prob = c(.2, .4, .6, .8, 1))
# create a matrix with the rownames as the tracks to use in the findInterval
tqm <- do.call(rbind, tq)
# now put the ratings
require(data.table)
x.dt <- data.table(x)
x.new <- x.dt[,
+               list(userid = userid
+                  , freq = freq
+                  , rating = findInterval(freq
+                                        # use track as index into
quantile matrix
+                                        , tqm[as.character(track[1L]),]
+                                        , rightmost.closed = TRUE
+                                        ) + 1L
+                  )
+              , by = track]
head(x.new)
track userid freq rating
[1,]     1     u1   10      1
[2,]     1     u2   15      1
[3,]     1     u3  126      4
[4,]     1     u4  117      3
[5,]     1     u5   76      2
[6,]     1     u6  103      3

Hi,

I have a mysql table with fields userid,track,frequency e.g
u1,1,10
u1,2,100
u1,3,110
u1,4,200
u1,5,120
u1,6,130
.
u2,1,23
.
.
where "frequency" is the number of times a music track is played by a
"userid"

I need to turn my 'frequency' table into a rating table (it's for a
recommender system). So, for each user, I need to categorise the frequency
of tracks played by quintile so that each particular track can have 5
ratings (1-5), with the ratings allocated as follows: inter-quintile range
100-80% = rating 5, ? inter-quintile range 80-60% = rating 4,
..., inter-quintile range 20-0% = rating 1)

Hence, I need to create a table with fields userid,track,rating:
u1,1,1
u1,2, 3
...

Can anybody help me to do this with R?

Regards
Gawesh

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
One way to get the ratings would be to use the ave() function:

rating = ave(x$freq,x$track,
              FUN=function(x)cut(x,quantile(x,(0:5)/5),include.lowest=TRUE))

 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu

Hi,

I have a mysql table with fields userid,track,frequency e.g
u1,1,10
u1,2,100
u1,3,110
u1,4,200
u1,5,120
u1,6,130
.
u2,1,23
.
.
where "frequency" is the number of times a music track is played by a
"userid"

I need to turn my 'frequency' table into a rating table (it's for a
recommender system). So, for each user, I need to categorise the frequency
of tracks played by quintile so that each particular track can have 5
ratings (1-5), with the ratings allocated as follows: inter-quintile range
100-80% = rating 5,   inter-quintile range 80-60% = rating 4,
..., inter-quintile range 20-0% = rating 1)

Hence, I need to create a table with fields userid,track,rating:
u1,1,1
u1,2, 3
...

Can anybody help me to do this with R?

Regards
Gawesh

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110514/9445a703/attachment.pl>
Hi:

Is this what you're after?

tq <- with(ds, quantile(freq, seq(0.2, 1, by = 0.2)))
ds$int <- with(ds, cut(freq, c(0, tq)))
with(ds, table(int))

int
 (0,1]  (1,2]  (2,4]  (4,7] (7,16]
    10      6      7      6      6

HTH,
Dennis
Hi Jim,
Thanks very much for the code. I modified it a bit because I needed to
allocate the track ratings by userid (eg if user 1 plays track x once, he
gets rating 1, user 1 plays track y 100 times, he gets a rating 5) and not
by track (sorry if this wasn't clear in my original post).

This is almost working! What I can't get right at the moment is the cutoff
interval for the ratings.
Any help please?

Sample data:

userid,track,freq
1,1,1
1,2,10
1,3,1
1,4,1

1,5,15
1,6,4
1,7,16
1,8,6
1,9,1
1,10,1
1,11,2
1,12,2
1,13,1
1,14,6
1,15,7
1,16,13
1,17,3
1,18,2
1,19,5
1,20,2
1,21,2
1,22,6
1,23,4
1,24,1
1,25,1
1,26,16
1,27,4
1,28,1
1,29,4
1,30,4
1,31,4
1,32,1
1,33,14
1,34,2
1,35,7

tq

$'1'

20% ?40% ?60% ?80% 100%
? 1 ? ?2 ? ?4 ? ?7 ? 16

From this distribution I would expect to distribute ratings as follows:
freq: <=1 ?<=2 ?<=4 ?<=7 ?<=16
rating: 1 ? ? ?2 ? ? ?3 ? ? ?4 ? ? ? 5

But my output is (the ratings are correct except at the cut off points):
"userid" "freq" "track" "rating"
1 1 1 2
1 10 2 5
1 1 3 2
1 1 4 2
1 15 5 5
1 4 6 4
1 16 7 5
1 6 8 4
1 1 9 2
1 1 10 2
1 2 11 3
1 2 12 3
1 1 13 2
1 6 14 4
1 7 15 5
1 13 16 5
1 3 17 3
1 2 18 3
1 5 19 4
1 2 20 3
1 2 21 3
1 6 22 4
1 4 23 4
1 1 24 2
1 1 25 2
1 16 26 5
1 4 27 4
1 1 28 2
1 4 29 4
1 4 30 4
1 4 31 4
1 1 32 2
1 14 33 5
1 2 34 3
1 7 35 5

This is the code:
# cheers jim holtman
x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T,
sep=',')
# get the quantiles for each track
tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
# create a matrix with the rownames as the tracks to use in the findInterval
tqm <- do.call(rbind, tq)
#now put the ratings
require(data.table)
x.dt <- data.table(x)
x.new <- x.dt[,list(freq = freq,track=track,rating =
findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed = TRUE) +
1L),by=userid]

Regards
Gawesh

On Sun, May 8, 2011 at 10:42 PM, jim holtman <jholtman at gmail.com> wrote:

try this:

# create some data
x <- data.frame(userid = paste('u', rep(1:20, each = 20), sep = '')
+ ? ? ? ? ? ? ? , track = rep(1:20, 20)
+ ? ? ? ? ? ? ? , freq = floor(runif(400, 10, 200))
+ ? ? ? ? ? ? ? , stringsAsFactors = FALSE
+ ? ? ? ? ? ? ? )
# get the quantiles for each track
tq <- tapply(x$freq, x$track, quantile, prob = c(.2, .4, .6, .8, 1))
# create a matrix with the rownames as the tracks to use in the
findInterval
tqm <- do.call(rbind, tq)
# now put the ratings
require(data.table)
x.dt <- data.table(x)
x.new <- x.dt[,
+ ? ? ? ? ? ? ? list(userid = userid
+ ? ? ? ? ? ? ? ? ?, freq = freq
+ ? ? ? ? ? ? ? ? ?, rating = findInterval(freq
+ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# use track as index into
quantile matrix
+ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?, tqm[as.character(track[1L]),]
+ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?, rightmost.closed = TRUE
+ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?) + 1L
+ ? ? ? ? ? ? ? ? ?)
+ ? ? ? ? ? ? ?, by = track]
head(x.new)
? ? track userid freq rating
[1,] ? ? 1 ? ? u1 ? 10 ? ? ?1
[2,] ? ? 1 ? ? u2 ? 15 ? ? ?1
[3,] ? ? 1 ? ? u3 ?126 ? ? ?4
[4,] ? ? 1 ? ? u4 ?117 ? ? ?3
[5,] ? ? 1 ? ? u5 ? 76 ? ? ?2
[6,] ? ? 1 ? ? u6 ?103 ? ? ?3

On Sun, May 8, 2011 at 2:48 PM, gj <gawesh at gmail.com> wrote:
Hi,

I have a mysql table with fields userid,track,frequency e.g
u1,1,10
u1,2,100
u1,3,110
u1,4,200
u1,5,120
u1,6,130
.
u2,1,23
.
.
where "frequency" is the number of times a music track is played by a
"userid"

I need to turn my 'frequency' table into a rating table (it's for a
recommender system). So, for each user, I need to categorise the
frequency
of tracks played by quintile so that each particular track can have 5
ratings (1-5), with the ratings allocated as follows: inter-quintile
range
100-80% = rating 5, ? inter-quintile range 80-60% = rating 4,
..., inter-quintile range 20-0% = rating 1)

Hence, I need to create a table with fields userid,track,rating:
u1,1,1
u1,2, 3
...

Can anybody help me to do this with R?

Regards
Gawesh

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110514/e54913ca/attachment.pl>
An easy way is to just offset the quantiles by a small increment so
that boundary condition is less likely.  If you change the line

tqm <- do.call(rbind, tq) + 0.001

in my example, that should do the trick.
Hi,
I think I haven't been able to explain correctly what I want. Here another
try:
Given that I have the following input:
userid,track,freq

1,1,1
1,2,10
1,3,1
1,4,1
1,5,15
1,6,4
1,7,16
1,8,6
1,9,1
1,10,1
1,11,2
1,12,2
1,13,1
1,14,6
1,15,7
1,16,13
1,17,3
1,18,2
1,19,5
1,20,2
1,21,2
1,22,6
1,23,4
1,24,1
1,25,1
1,26,16
1,27,4
1,28,1
1,29,4
1,30,4
1,31,4
1,32,1
1,33,14
1,34,2
1,35,7

It is a sample of the history of tracks played: userid,track and frequency.
What I want is to convert the frequency into a rating scale (1-5) based on
the frequency at which a user has played a track, using the following
interquintile ranges for the cfd:
?0%-20% = rating 1, 20%-40% = rating 2, .... ,80%-100%=rating 5

Jim kindly provided the following code:
# cheers jim holtman
x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T,
sep=',')
# get the quantiles for each user(we want the frequency distribution to be
based on user)
tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
# create a matrix with the rownames as the tracks to use in the
findInterval
tqm <- do.call(rbind, tq)
#now put the ratings
require(data.table)
x.dt <- data.table(x)
x.new <- x.dt[,list(freq = freq,track=track,rating =
findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed = TRUE) +
1L),by=userid]
head(x.new)

? ? userid freq track rating
[1,] ? ? ?1 ? ?1 ? ? 1 ? ? ?2
[2,] ? ? ?1 ? 10 ? ? 2 ? ? ?5
[3,] ? ? ?1 ? ?1 ? ? 3 ? ? ?2
[4,] ? ? ?1 ? ?1 ? ? 4 ? ? ?2
[5,] ? ? ?1 ? 15 ? ? 5 ? ? ?5
[6,] ? ? ?1 ? ?4 ? ? 6 ? ? ?4

which is almost what I wanted except that the ratings are 1 point higher for
tracks where the frequency is at the cut-off points in the interquintile
range.
To illustrate the quintiles are:

tq$`1`
?20% ?40% ?60% ?80% 100%
? 1 ? ?2 ? ?4 ? ?7 ? 16

So, ideally I want (note the different ratings):

? ? userid freq track rating
[1,] ? ? ?1 ? ?1 ? ? 1 ? ? ?1
[2,] ? ? ?1 ? 10 ? ? 2 ? ? ?5
[3,] ? ? ?1 ? ?1 ? ? 3 ? ? ?1
[4,] ? ? ?1 ? ?1 ? ? 4 ? ? ?1
[5,] ? ? ?1 ? 15 ? ? 5 ? ? ?5
[6,] ? ? ?1 ? ?4 ? ? 6 ? ? ?3

Can anybody help me? I'm new to R (as you have probably guessed). Sorry for
the long explanation.

Regards
Gawesh

On Sat, May 14, 2011 at 7:37 PM, Dennis Murphy <djmuser at gmail.com> wrote:

Hi:

Is this what you're after?

tq <- with(ds, quantile(freq, seq(0.2, 1, by = 0.2)))
ds$int <- with(ds, cut(freq, c(0, tq)))
with(ds, table(int))

int
?(0,1] ?(1,2] ?(2,4] ?(4,7] (7,16]
? ?10 ? ? ?6 ? ? ?7 ? ? ?6 ? ? ?6

HTH,
Dennis

On Sat, May 14, 2011 at 9:42 AM, gj <gawesh at gmail.com> wrote:
Hi Jim,
Thanks very much for the code. I modified it a bit because I needed to
allocate the track ratings by userid (eg if user 1 plays track x once, he
gets rating 1, user 1 plays track y 100 times, he gets a rating 5) and
not
by track (sorry if this wasn't clear in my original post).

This is almost working! What I can't get right at the moment is the
cutoff
interval for the ratings.
Any help please?

Sample data:

userid,track,freq
1,1,1
1,2,10
1,3,1
1,4,1

1,5,15
1,6,4
1,7,16
1,8,6
1,9,1
1,10,1
1,11,2
1,12,2
1,13,1
1,14,6
1,15,7
1,16,13
1,17,3
1,18,2
1,19,5
1,20,2
1,21,2
1,22,6
1,23,4
1,24,1
1,25,1
1,26,16
1,27,4
1,28,1
1,29,4
1,30,4
1,31,4
1,32,1
1,33,14
1,34,2
1,35,7

tq

$'1'

20% ?40% ?60% ?80% 100%
? 1 ? ?2 ? ?4 ? ?7 ? 16

From this distribution I would expect to distribute ratings as follows:
freq: <=1 ?<=2 ?<=4 ?<=7 ?<=16
rating: 1 ? ? ?2 ? ? ?3 ? ? ?4 ? ? ? 5

But my output is (the ratings are correct except at the cut off points):
"userid" "freq" "track" "rating"
1 1 1 2
1 10 2 5
1 1 3 2
1 1 4 2
1 15 5 5
1 4 6 4
1 16 7 5
1 6 8 4
1 1 9 2
1 1 10 2
1 2 11 3
1 2 12 3
1 1 13 2
1 6 14 4
1 7 15 5
1 13 16 5
1 3 17 3
1 2 18 3
1 5 19 4
1 2 20 3
1 2 21 3
1 6 22 4
1 4 23 4
1 1 24 2
1 1 25 2
1 16 26 5
1 4 27 4
1 1 28 2
1 4 29 4
1 4 30 4
1 4 31 4
1 1 32 2
1 14 33 5
1 2 34 3
1 7 35 5

This is the code:
# cheers jim holtman
x=read.csv(file="C:\\Data\\lastfm\\ratings\\play_history_3.csv",header=T,
sep=',')
# get the quantiles for each track
tq <- tapply(x$freq,x$userid,quantile,prob=c(0.2,0.4,0.6,0.8,1))
# create a matrix with the rownames as the tracks to use in the
findInterval
tqm <- do.call(rbind, tq)
#now put the ratings
require(data.table)
x.dt <- data.table(x)
x.new <- x.dt[,list(freq = freq,track=track,rating =
findInterval(freq,tqm[as.character(userid[1L]),], rightmost.closed =
TRUE) +
1L),by=userid]

Regards
Gawesh

On Sun, May 8, 2011 at 10:42 PM, jim holtman <jholtman at gmail.com> wrote:

try this:

# create some data
x <- data.frame(userid = paste('u', rep(1:20, each = 20), sep = '')
+ ? ? ? ? ? ? ? , track = rep(1:20, 20)
+ ? ? ? ? ? ? ? , freq = floor(runif(400, 10, 200))
+ ? ? ? ? ? ? ? , stringsAsFactors = FALSE
+ ? ? ? ? ? ? ? )
# get the quantiles for each track
tq <- tapply(x$freq, x$track, quantile, prob = c(.2, .4, .6, .8, 1))
# create a matrix with the rownames as the tracks to use in the
findInterval
tqm <- do.call(rbind, tq)
# now put the ratings
require(data.table)
x.dt <- data.table(x)
x.new <- x.dt[,
+ ? ? ? ? ? ? ? list(userid = userid
+ ? ? ? ? ? ? ? ? ?, freq = freq
+ ? ? ? ? ? ? ? ? ?, rating = findInterval(freq
+ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# use track as index into
quantile matrix
+ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?, tqm[as.character(track[1L]),]
+ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?, rightmost.closed = TRUE
+ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?) + 1L
+ ? ? ? ? ? ? ? ? ?)
+ ? ? ? ? ? ? ?, by = track]
head(x.new)
? ? track userid freq rating
[1,] ? ? 1 ? ? u1 ? 10 ? ? ?1
[2,] ? ? 1 ? ? u2 ? 15 ? ? ?1
[3,] ? ? 1 ? ? u3 ?126 ? ? ?4
[4,] ? ? 1 ? ? u4 ?117 ? ? ?3
[5,] ? ? 1 ? ? u5 ? 76 ? ? ?2
[6,] ? ? 1 ? ? u6 ?103 ? ? ?3

On Sun, May 8, 2011 at 2:48 PM, gj <gawesh at gmail.com> wrote:
Hi,

I have a mysql table with fields userid,track,frequency e.g
u1,1,10
u1,2,100
u1,3,110
u1,4,200
u1,5,120
u1,6,130
.
u2,1,23
.
.
where "frequency" is the number of times a music track is played by a
"userid"

I need to turn my 'frequency' table into a rating table (it's for a
recommender system). So, for each user, I need to categorise the
frequency
of tracks played by quintile so that each particular track can have 5
ratings (1-5), with the ratings allocated as follows: inter-quintile
range
100-80% = rating 5, ? inter-quintile range 80-60% = rating 4,
..., inter-quintile range 20-0% = rating 1)

Hence, I need to create a table with fields userid,track,rating:
u1,1,1
u1,2, 3
...

Can anybody help me to do this with R?

Regards
Gawesh

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110515/3ea1b053/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110515/953514c8/attachment.pl>