I have a pedigree file as so:

X0001 BYX859 0 0 2 1 BYX859
X0001 BYX894 0 0 1 1 BYX894
X0001 BYX862 BYX894 BYX859 2 2 BYX862
X0001 BYX863 BYX894 BYX859 2 2 BYX863
X0001 BYX864 BYX894 BYX859 2 2 BYX864
X0001 BYX865 BYX894 BYX859 2 2 BYX865

And I was hoping to change all unique string values to numbers.

That is:

BYX859 = 1
BYX894 = 2
BYX862 = 3
BYX863 = 4
BYX864 = 5
BYX865 = 6

But only in columns 2 - 4. Essentially I would like the data to look like this:

X0001 1 0 0 2 1 BYX859
X0001 2 0 0 1 1 BYX894
X0001 3 2 1 2 2 BYX862
X0001 4 2 1 2 2 BYX863
X0001 5 2 1 2 2 BYX864
X0001 6 2 1 2 2 BYX865

Is this possible with factors?

Thanks!

K.
Converting unique strings to unique numbers
10 messages · Jeff Newmiller, MacQueen, Don, William Dunlap +3 more
Of course, but I would not recommend it. A factor is a vector of integers with an attribute containing the labels that those integers correspond to. You seem to be asking for a factor that has lost the definitions part. But hey,

newvector <- as.integer(factor(oldvector))

should get you what you asked for one column at a time.
Jeff Newmiller
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
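To make the point about factors concrete, here is a small sketch using the IDs from the question (the variable names are made up for illustration). By default factor() sorts its levels, so the integer codes follow sorted order rather than order of appearance. Passing levels = unique(ids) is one way, under that assumption, to get the numbering listed in the question.

```r
# IDs from column 2 of the example pedigree, in file order
ids <- c("BYX859", "BYX894", "BYX862", "BYX863", "BYX864", "BYX865")

# Default: levels are sorted, so the codes follow sorted order
sorted_codes <- as.integer(factor(ids))
sorted_codes
# [1] 1 6 2 3 4 5

# Levels given in order of first appearance reproduce the requested numbering
seen_codes <- as.integer(factor(ids, levels = unique(ids)))
seen_codes
# [1] 1 2 3 4 5 6

# The labels that the codes stand for live in the levels attribute
levels(factor(ids, levels = unique(ids)))
```

Dropping the factor with as.integer() discards those labels, which is the loss of "the definitions part" mentioned above.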
Here is an example to get you started:
mycol <- c('b','a','d','d','b','c')
as.numeric(factor(mycol))
-Don
Don MacQueen
Lawrence Livermore National Laboratory
match() will do what you want. E.g., run your data through
the following function.
f <- function (data)
{
    uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
    uniqStrings <- setdiff(uniqStrings, "0")
    for (j in 2:4) {
        data[[j]] <- match(data[[j]], uniqStrings, nomatch = 0L)
    }
    data
}
Bill Dunlap
TIBCO Software
wdunlap tibco.com
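As a rough illustration of the same match() idea on the data from the question, here is a compact variant. It is a sketch, not the function posted above: the data frame ped, its V1..V7 column names, and the helper variable pool are assumptions made for the example, and the columns are assumed to be character vectors rather than factors.

```r
# The example pedigree, with columns 2-4 stored as character strings
ped <- data.frame(
  V1 = rep("X0001", 6),
  V2 = c("BYX859", "BYX894", "BYX862", "BYX863", "BYX864", "BYX865"),
  V3 = c("0", "0", "BYX894", "BYX894", "BYX894", "BYX894"),
  V4 = c("0", "0", "BYX859", "BYX859", "BYX859", "BYX859"),
  V5 = c(2, 1, 2, 2, 2, 2),
  V6 = c(1, 1, 2, 2, 2, 2),
  V7 = c("BYX859", "BYX894", "BYX862", "BYX863", "BYX864", "BYX865"),
  stringsAsFactors = FALSE
)

# Pool the strings of columns 2-4 in order of first appearance, without "0"
pool <- setdiff(unlist(ped[2:4], use.names = FALSE), "0")

# Replace each column by the position of its strings in the pool; "0" maps to 0
ped[2:4] <- lapply(ped[2:4], match, table = pool, nomatch = 0L)

ped[, 2:4]
#   V2 V3 V4
# 1  1  0  0
# 2  2  0  0
# 3  3  2  1
# 4  4  2  1
# 5  5  2  1
# 6  6  2  1
```

Columns 1 and 5 to 7 are left untouched, so the original ID strings remain available in the last column.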
Hi Kate,
I found that matching the character vector to itself is a very
effective way to do this:
x <- c("a", "bunch", "of", "strings", "whose", "exact", "content",
"is", "of", "little", "interest")
ids <- match(x, x)
ids
# [1] 1 2 3 4 5 6 7 8 3 10 11
By using this trick, many manipulations on character vectors can
be replaced by manipulations on integer vectors, which are sometimes
way more efficient.
Cheers,
H.
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
I found this helpful. However, the second to fourth columns come out all zero. Was this the intention? That is:

X0001 0 0 0 2 1 BYX859
X0001 0 0 0 1 1 BYX894
X0001 0 0 0 2 2 BYX862
X0001 0 0 0 2 2 BYX863
X0001 0 0 0 2 2 BYX864
X0001 0 0 0 2 2 BYX865
On Fri, May 29, 2015 at 1:31 PM, William Dunlap <wdunlap at tibco.com> wrote:
> match() will do what you want. E.g., run your data through
> the following function.
> [...]
On Fri, May 29, 2015 at 2:16 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> I found that matching the character vector to itself is a very
> effective way to do this:
> [...]
> ids <- match(x, x)
Hm. I hadn't thought of that approach - I use the
as.numeric(factor(...)) approach.
So I was curious, and compared the two:
set.seed(43)
x <- sample(letters, 10000, replace=TRUE)

system.time({
  for(i in seq_len(20000)) {
    ids1 <- match(x, x)
  }})
#   user  system elapsed
#  9.657   0.000   9.657

system.time({
  for(i in seq_len(20000)) {
    ids2 <- as.numeric(factor(x, levels=letters))
  }})
#   user  system elapsed
#   6.16    0.00    6.16
Using factor() is faster. More importantly, using factor() lets you
set the order of the indices in an expected fashion, where match()
assigns them in the order of occurrence.
head(data.frame(x, ids1, ids2))
  x ids1 ids2
1 m    1   13
2 x    2   24
3 b    3    2
4 s    4   19
5 i    5    9
6 o    6   15
In a problem like Kate's where there are several columns for which the
same ordering of indices is desired, that becomes really important.
If you take Bill Dunlap's modification of the match() approach, it
resolves both problems: matching against the pooled unique values is
both faster than the factor() version and gives the same result:
On Fri, May 29, 2015 at 1:31 PM, William Dunlap <wdunlap at tibco.com> wrote:
> match() will do what you want. E.g., run your data through the following function.
> f <- function (data)
> [...]
##
y <- data.frame(id = 1:5000, v1 = sample(letters, 5000, replace=TRUE),
                v2 = sample(letters, 5000, replace=TRUE),
                v3 = sample(letters, 5000, replace=TRUE),
                stringsAsFactors=FALSE)

system.time({
  for(i in seq_len(20000)) {
    ids3 <- f(data.frame(y))
  }})
#   user  system elapsed
# 22.515   0.000  22.518

ff <- function(data)
{
    uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
    uniqStrings <- setdiff(uniqStrings, "0")
    for (j in 2:4) {
        data[[j]] <- as.numeric(factor(data[[j]], levels=uniqStrings))
    }
    data
}

system.time({
  for(i in seq_len(20000)) {
    ids4 <- ff(data.frame(y))
  }})
#   user  system elapsed
# 26.083   0.002  26.090
head(ids3)
  id v1 v2 v3
1  1  1  2  8
2  2  2 19 22
3  3  3 21 16
4  4  4 10 17
5  5  1  8 18
6  6  1 12 26

head(ids4)
  id v1 v2 v3
1  1  1  2  8
2  2  2 19 22
3  3  3 21 16
4  4  4 10 17
5  5  1  8 18
6  6  1 12 26
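One difference between the two functions does not show up in this comparison, because the simulated data contain no "0" entries. The following minimal sketch (parents, known, via_match and via_factor are illustrative names) shows how the two calls treat a value that is not among the known strings:

```r
# A parent column containing the "0" marker for a missing parent
parents <- c("0", "BYX894", "BYX859")
known   <- c("BYX859", "BYX894")

# match() with nomatch = 0L keeps the marker as 0
via_match <- match(parents, known, nomatch = 0L)
via_match
# [1] 0 2 1

# factor() turns any value outside the supplied levels into NA
via_factor <- as.numeric(factor(parents, levels = known))
via_factor
# [1] NA  2  1
```

If the factor() route is used on pedigree data, the NA entries would need to be set back to 0 afterwards, for example with via_factor[is.na(via_factor)] <- 0.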
Kate, if you're getting all zeros, check str(yourdataframe) - it's
likely that when you imported your data into R the strings were
already converted to factors, which is not what you want (ask me how I
know this!).
Sarah
Sarah Goslee http://www.functionaldiversity.org
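A short sketch of the check suggested above, assuming a data frame called ped with default V1..V4 column names (both are hypothetical). In R versions before 4.0.0, data.frame() and read.table() converted strings to factors by default, which stringsAsFactors = TRUE reproduces here:

```r
# Simulate an import in which the strings were converted to factors
ped <- data.frame(
  V1 = rep("X0001", 3),
  V2 = c("BYX859", "BYX894", "BYX862"),
  V3 = c("0", "0", "BYX894"),
  V4 = c("0", "0", "BYX859"),
  stringsAsFactors = TRUE
)

# Inspect the column classes; str(ped) shows the same information
classes_before <- sapply(ped, class)
classes_before

# Convert columns 2-4 back to plain character vectors before matching
ped[2:4] <- lapply(ped[2:4], as.character)
classes_after <- sapply(ped, class)
classes_after
```

Reading the file with stringsAsFactors = FALSE in the first place avoids the conversion step.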
Hi Sarah,
On 05/29/2015 12:04 PM, Sarah Goslee wrote:
> So I was curious, and compared the two:
> [...]
> Using factor() is faster.
That's an unfair comparison, because you already know what the levels
are so you can supply them to your call to factor(). Most of the time
you don't know what the levels are so either you just do factor(x) and
let the factor() constructor compute the levels for you, or you compute
them yourself upfront with something like factor(x, levels=unique(x)).
library(microbenchmark)
microbenchmark(
  {ids1 <- match(x, x)},
  {ids2 <- as.integer(factor(x, levels=letters))},
  {ids3 <- as.integer(factor(x))},
  {ids4 <- as.integer(factor(x, levels=unique(x)))}
)
Unit: microseconds
expr                                                       min      lq        mean      median    uq       max      neval
{ ids1 <- match(x, x) }                                    245.979  262.2390  267.3210  264.4845  268.348  293.894  100
{ ids2 <- as.integer(factor(x, levels = letters)) }        214.115  219.2320  226.9913  220.9870  226.147  314.875  100
{ ids3 <- as.integer(factor(x)) }                          380.782  388.7295  402.2242  394.7165  412.075  481.410  100
{ ids4 <- as.integer(factor(x, levels = unique(x))) }      332.250  342.6630  349.7405  345.3090  353.162  383.002  100
> More importantly, using factor() lets you set the order of the indices
> in an expected fashion, where match() assigns them in the order of
> occurrence.
> [...]
> In a problem like Kate's where there are several columns for which the
> same ordering of indices is desired, that becomes really important.
I'm not sure why which particular ID gets assigned to each string would
matter but maybe I'm missing something. What really matters is that each
string receives a unique ID. match(x, x) does that.
In Kate's problem, where the strings are in more than one column,
and you want the ID to be unique across the columns, you need to do
match(x, x) where 'x' contains the strings from all the columns
that you want to replace:
m <- matrix(c(
  "X0001", "BYX859", 0, 0, 2, 1, "BYX859",
  "X0001", "BYX894", 0, 0, 1, 1, "BYX894",
  "X0001", "BYX862", "BYX894", "BYX859", 2, 2, "BYX862",
  "X0001", "BYX863", "BYX894", "BYX859", 2, 2, "BYX863",
  "X0001", "BYX864", "BYX894", "BYX859", 2, 2, "BYX864",
  "X0001", "BYX865", "BYX894", "BYX859", 2, 2, "BYX865"
), ncol=7, byrow=TRUE)

x <- m[ , 2:4]
id <- match(x, x, nomatch=0, incomparables="0")
m[ , 2:4] <- id
No factor needed. No loop needed. ;-)
Cheers,
H.
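Because m is a character matrix, the assignment m[ , 2:4] <- id stores the IDs as strings such as "1". If integer IDs are wanted, one option is to reshape the match() result instead of writing it back into m. A sketch on a cut-down version of columns 2 to 4 (x3 and id_mat are illustrative names, not part of the code above):

```r
# Columns 2-4 of the first three pedigree rows, as a character matrix
x3 <- matrix(c(
  "BYX859", "0",      "0",
  "BYX894", "0",      "0",
  "BYX862", "BYX894", "BYX859"
), ncol = 3, byrow = TRUE)

# Self-match over all cells at once; "0" is never matched and becomes 0
id <- match(x3, x3, nomatch = 0, incomparables = "0")

# match() returns a plain integer vector in column-major order,
# so it can be reshaped to the dimensions of x3
id_mat <- matrix(id, nrow = nrow(x3))
id_mat
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    2    0    0
# [3,]    3    2    1
```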
On Fri, May 29, 2015 at 1:29 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> I'm not sure why which particular ID gets assigned to each string would
> matter but maybe I'm missing something. What really matters is that each
> string receives a unique ID. match(x, x) does that.
I think each row of the OP's dataset represented an individual (column 2)
followed by its mother and father (columns 3 and 4). I assume that the
marker "0" means that a parent is not in the dataset. If you match against
the strings in column 2 only, in their original order, then the resulting
numbers give the row number of an individual, making it straightforward to
look up information regarding the ancestors of an individual. Hence the
choice of numeric IDs may be important.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
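The row-number idea can be sketched as follows. The data frame ped and its V2..V5 column names are assumptions for the example, and V5 simply stands for any per-individual value stored in the same row. The default nomatch of NA is used here instead of 0, since indexing with 0 would silently drop elements:

```r
# Columns 2-5 of the example pedigree
ped <- data.frame(
  V2 = c("BYX859", "BYX894", "BYX862", "BYX863", "BYX864", "BYX865"),
  V3 = c("0", "0", "BYX894", "BYX894", "BYX894", "BYX894"),
  V4 = c("0", "0", "BYX859", "BYX859", "BYX859", "BYX859"),
  V5 = c(2, 1, 2, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Row in which each column-3 parent appears as an individual (NA if absent)
parent_row <- match(ped$V3, ped$V2)
parent_row
# [1] NA NA  2  2  2  2

# Because the ID is a row number, the parent's own record is one index away
parent_v5 <- ped$V5[parent_row]
parent_v5
# [1] NA NA  1  1  1  1
```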
Hi Bill,
On 05/29/2015 01:48 PM, William Dunlap wrote:
> > I'm not sure why which particular ID gets assigned to each string
> > would matter but maybe I'm missing something. What really matters is
> > that each string receives a unique ID. match(x, x) does that.
>
> I think each row of the OP's dataset represented an individual (column 2)
> followed by its mother and father (columns 3 and 4). I assume that the
> marker "0" means that a parent is not in the dataset. If you match
> against the strings in column 2 only, in their original order, then the
> resulting numbers give the row number of an individual,
Note that the code I gave happens to do exactly that (assuming that
column 2 contains no duplicates, but your code is also relying on that
assumption in order to have the ids match the row numbers).

We're discussing the merit of match(x, x) versus match(x, unique(x)).
All I'm trying to say is that the unique(x) step (which doubles the cost
of the whole operation, because it also uses hashing, like match() does)
is generally not needed. It doesn't seem to be needed in Kate's use case.

H.
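For readers comparing the two calls, a toy sketch (the vector x and the names self and dense are made up for illustration) of how match(x, x) and match(x, unique(x)) differ in the numbers they assign while grouping the strings identically:

```r
x <- c("b", "a", "b", "c", "a")

# Self-match: each string gets the position of its first occurrence
self <- match(x, x)
self
# [1] 1 2 1 4 2

# Match against the unique values: IDs are consecutive, 1 to 3 here
dense <- match(x, unique(x))
dense
# [1] 1 2 1 3 2

# Either way, equal strings share an ID and different strings do not
same_grouping <- identical(match(self, self), match(dense, dense))
same_grouping
# [1] TRUE
```

When the first occurrences of all strings come before any repeat, as with the column-2 IDs of the example pedigree, the two results coincide.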