Hi,
I am new to R programming and am using subset to extract part of a data set as follows:
names(dataset) = c("genome1","genome2","dist","score")
prunedrelatives <- subset(dataset, score < -5)
However, when I use unique to find the number of unique genomes now present in prunedrelatives, I get results identical to calling unique(dataset$genome1), although subset has eliminated many genomes and records.
I would greatly appreciate your input about using "unique" correctly in this regard.
Thanks,
Lalitha
unique/subset problem
8 messages · lalitha viswanath, Sarah Goslee, Weiwei Shi
Hi,
Even if you removed "many" genome1 values by setting score < -5, that does not necessarily change the set of unique genomes. To check this, you can do something like:
p0 <- unique(dataset[dataset$score < -5, "genome1"])  # same as subset
p1 <- unique(dataset[dataset$score >= -5, "genome1"])
setdiff(p1, p0)
If the output above is empty, it means that although you removed many rows, every genome in the removed rows also appears in the rows you kept, so the set of unique genomes did not change.
HTH,
weiwei
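A minimal sketch of this check on a made-up data frame may help. The genome names and scores in `dataset` below are invented for illustration and are not taken from the original post.

```r
# Toy data; genome names and scores are invented for illustration.
dataset <- data.frame(
  genome1 = c("paer", "paer", "hinf", "hinf", "aero"),
  score   = c(-10, -2, -8, -1, -3),
  stringsAsFactors = FALSE
)

p0 <- unique(dataset[dataset$score <  -5, "genome1"])  # genomes in the kept rows
p1 <- unique(dataset[dataset$score >= -5, "genome1"])  # genomes in the removed rows

# Genomes that occur only in the removed rows.
only_removed <- setdiff(p1, p0)
print(only_removed)
```

In this toy case "aero" occurs only in removed rows, so the filter reduces the number of distinct genomes from 3 to 2. An empty result would mean that rows were removed but no genome disappeared entirely.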
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.
"Did you always know?" "No, I did not. But I believed..." ---Matrix III
Hi,
The pruned dataset has 8 unique genomes in it, while the dataset before pruning has 65 unique genomes in it. However, calling unique on the pruned dataset seems to return 65 no matter what.
Any assistance in this matter would be appreciated.
Thanks,
Lalitha
Without knowing more about your data, it is hard to say for certain, but might you be confusing unique _values_ with _factor levels_?

mydata <- as.factor(sort(rep(1:5, 2)))
# mydata has 10 values, 5 unique values, and 5 factor levels
mydata
[1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5
unique(mydata)
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
mydata.subset <- mydata[1:4]
# the subset now has only 2 unique values, but the output
# still lists all five factor levels
unique(mydata.subset)
[1] 1 2
Levels: 1 2 3 4 5
# try drop=TRUE when subsetting the factor
mydata.subset <- mydata[1:4, drop=TRUE]
unique(mydata.subset)
[1] 1 2
Levels: 1 2

Alternatively, if this is the problem and you don't need those data to be factors, you could always convert them to a more appropriate form.

Sarah
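The same distinction can be sketched on a data frame passed through subset(). The data below are invented, and droplevels() is a base R function that is not mentioned in the thread; treat this as one possible approach rather than the posters' own solution. stringsAsFactors = TRUE is set explicitly because data.frame() and read.table() only create factors by default in R versions before 4.0.0.

```r
# Toy data frame; stringsAsFactors = TRUE mimics the factor columns that
# read.table() produced by default before R 4.0.0.
dt <- data.frame(
  genome1 = c("aero", "buch", "paer", "paer", "xfas"),
  Y       = c(-1, -9, -20, -7, -2),
  stringsAsFactors = TRUE
)

pruned <- subset(dt, Y < -5)

# subset() keeps every original level, even those with no rows left.
n_levels_before <- nlevels(pruned$genome1)         # 4
n_values        <- length(unique(pruned$genome1))  # 2

# Drop the unused levels so that levels() matches the data.
pruned <- droplevels(pruned)
levels_after <- levels(pruned$genome1)             # "buch" "paer"
```

Counting with length(unique(...)) gives the number of distinct genomes actually present, whereas levels() reports the factor's bookkeeping until the unused levels are dropped.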
Sarah Goslee http://www.functionaldiversity.org
Then you need to provide more details about the calls you made and about your dataset. For example, you can show us the output of str(prunedrelatives, 1) and tell us how you called unique on prunedrelatives. I made a test dataset, and it gave me what you wanted (omitted here).
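As a rough illustration of what str() reveals, here is a sketch on a made-up stand-in for prunedrelatives. The values and the three-level factor are assumptions chosen only to show the output shape.

```r
# Toy stand-in for prunedrelatives; the values are invented.
prunedrelatives <- data.frame(
  genome1 = factor(c("paer", "hinf"), levels = c("aero", "hinf", "paer")),
  score   = c(-10, -8)
)

# The structure summary shows genome1 as a factor with 3 levels,
# although only 2 of them occur in the rows.
str(prunedrelatives, max.level = 1)

is_factor  <- is.factor(prunedrelatives$genome1)
n_levels   <- nlevels(prunedrelatives$genome1)
n_distinct <- length(unique(prunedrelatives$genome1))
```

If the reported number of levels is larger than the number of distinct values, the column is carrying unused factor levels.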
Hi
I read in my dataset using
dt <- read.table("filename")
Calling unique(levels(dt$genome1)) yields the following:
 [1] "aero" "aful" "aquae" "atum_D" "bbur" "bhal" "bmel" "bsub"
 [9] "buch" "cace" "ccre" "cglu" "cjej" "cper" "cpneuA" "cpneuC"
[17] "cpneuJ" "ctraM" "ecoliO157" "hbsp" "hinf" "hpyl" "linn" "llact"
[25] "lmon" "mgen" "mjan" "mlep" "mlot" "mpneu" "mpul" "mthe"
[33] "mtub" "mtub_cdc" "nost" "pabyssi" "paer" "paero" "pmul" "pyro"
[41] "rcon" "rpxx" "saur_mu50" "saur_n315" "sent" "smel" "spneu" "spyo"
[49] "ssol" "stok" "styp" "synecho" "tacid" "tmar" "tpal" "tvol"
[57] "uure" "vcho" "xfas" "ypes"
It shows 60 genomes, which is correct.
I extracted a subset as follows
possible_relatives_subset <- subset(dt, Y < -5)
I am pasting the results below
genome1 genome2 parameterX Y
21 sent ecoliO157 0.00590 -200.633493
22 sent paer 0.18603 -100.200570
27 styp ecoliO157 0.00484 -240.708645
28 styp paer 0.18497 -30.250127
41 paer sent 0.18603 -60.200570
44 paer styp 0.18497 -80.250127
49 paer hinf 0.18913 -90.056333
53 paer vcho 0.18703 -10.153929
55 paer pmul 0.18587 -100.208042
67 paer buch 0.21485 -80.898667
70 paer ypes 0.18460 -107.267454
82 paer xfas 0.26268 -61.920552
95 hinf ecoliO157 0.07654 -163.018417
96 hinf paer 0.18913 -10.056333
103 vcho ecoliO157 0.09518 -140.921153
104 vcho paer 0.18703 -10.153929
107 pmul ecoliO157 0.07328 -165.215225
108 pmul paer 0.18587 -10.208042
131 buch ecoliO157 0.15412 -11.746939
132 buch paer 0.21485 -8.898667
137 ypes ecoliO157 0.02705 -19.171851
138 ypes paer 0.18460 -10.267454
171 ecoliO157 sent 0.00590 -20.633493
174 ecoliO157 styp 0.00484 -20.708645
179 ecoliO157 hinf 0.07654 -6.018417
183 ecoliO157 vcho 0.09518 -14.921153
185 ecoliO157 pmul 0.07328 -6.215225
197 ecoliO157 buch 0.15412 -11.746939
200 ecoliO157 ypes 0.02705 -9.171851
211 ecoliO157 xfas 0.25833 -71.091552
217 xfas ecoliO157 0.25833 -75.091552
218 xfas paer 0.26268 -64.920552
I think even a cursory look will tell us that there
are not as many unique genomes in the subset results.
(around 8/10).
However when I do
unique(levels(possible_relatives_subset$genome1)), I
get
 [1] "aero" "aful" "aquae" "atum_D" "bbur" "bhal" "bmel" "bsub"
 [9] "buch" "cace" "ccre" "cglu" "cjej" "cper" "cpneuA" "cpneuC"
[17] "cpneuJ" "ctraM" "ecoliO157" "hbsp" "hinf" "hpyl" "linn" "llact"
[25] "lmon" "mgen" "mjan" "mlep" "mlot" "mpneu" "mpul" "mthe"
[33] "mtub" "mtub_cdc" "nost" "pabyssi" "paer" "paero" "pmul" "pyro"
[41] "rcon" "rpxx" "saur_mu50" "saur_n315" "sent" "smel" "spneu" "spyo"
[49] "ssol" "stok" "styp" "synecho" "tacid" "tmar" "tpal" "tvol"
[57] "uure" "vcho" "xfas" "ypes"
Where am I going wrong?
I tried calling unique without the levels too, which gives me the following response
[1] sent styp paer hinf vcho pmul buch ypes ecoliO157 xfas
60 Levels: aero aful aquae atum_D bbur bhal bmel bsub buch cace ccre cglu cjej cper cpneuA ... ypes
Check ?read.table and add "as.is=T" to the options. That way you read strings as character and avoid the factor issue. Then repeat your work. For example:
x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10)
str(x0,1)
`data.frame': 10 obs. of 7 variables:
 $ V1: Factor w/ 10 levels "-4086733916",..: 10 9 8 7 6 5 4 3 2 1
 $ V2: Factor w/ 10 levels "-1963744741",..: 10 8 7 4 5 6 3 9 1 2
 $ V3: Factor w/ 7 levels "-1687428658",..: 7 4 4 2 5 1 6 6 3 4
 $ V4: Factor w/ 2 levels "5","MECHANISM": 2 1 1 1 1 1 1 1 1 1
 $ V5: Factor w/ 2 levels "0","TYPE": 2 1 1 1 1 1 1 1 1 1
 $ V6: Factor w/ 2 levels "USER_","alexey": 1 2 2 2 2 2 2 2 2 2
 $ V7: Factor w/ 2 levels "3","TRUST": 2 1 1 1 1 1 1 1 1 1
x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10, as.is=T)
str(x0,1)
`data.frame': 10 obs. of 7 variables:
 $ V1: chr "LINK_ID" "-4293537751" "-4247422653" "-4223137153" ...
 $ V2: chr "ID1" "65259" "1020286" "-518245428" ...
 $ V3: chr "ID2" "6436" "6436" "-2099509019" ...
 $ V4: chr "MECHANISM" "5" "5" "5" ...
 $ V5: chr "TYPE" "0" "0" "0" ...
 $ V6: chr "USER_" "alexey" "alexey" "alexey" ...
 $ V7: chr "TRUST" "3" "3" "3" ...
HTH,
weiwei
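A self-contained sketch of the as.is approach follows. The inline text stands in for the data file and its header line is an assumption; the three data rows are copied from the table posted earlier in the thread. The text argument of read.table() is used only so that the example runs without a file.

```r
# Inline text stands in for the data file; the header line is assumed.
txt <- "genome1 genome2 parameterX Y
sent ecoliO157 0.00590 -200.633493
paer sent 0.18603 -60.200570
paer hinf 0.18913 -90.056333"

# header = TRUE keeps the column names out of the data;
# as.is = TRUE leaves character columns as character instead of factor.
dt <- read.table(text = txt, header = TRUE, as.is = TRUE)

str(dt, max.level = 1)

# With character columns there are no leftover levels to worry about.
genomes <- unique(dt$genome1)
```

Note that in the str() output quoted above the numeric columns come out as chr because the header line ("LINK_ID", "ID1", ...) was read as a data row; with header = TRUE, numeric columns such as Y here stay numeric.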
Oh, I forgot: you can also convert the factor to character after reading the data, like
dataset$genome1 <- as.character(dataset$genome1)
That changes only the one column, so you don't have to use as.numeric(dataset$score), as you might if you use "as.is=T" when you call read.table.
HTH,
weiwei
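A short sketch of this conversion on invented data (the values in `dataset` are not from the original post):

```r
# Toy data; stringsAsFactors = TRUE reproduces the factor column.
dataset <- data.frame(
  genome1 = c("aero", "paer", "paer", "xfas"),
  score   = c(-1, -20, -7, -2),
  stringsAsFactors = TRUE
)

# Convert only the factor column; score stays numeric.
dataset$genome1 <- as.character(dataset$genome1)

prunedrelatives <- subset(dataset, score < -5)

# Number of distinct genomes left after pruning.
n_genomes <- length(unique(prunedrelatives$genome1))
```

Because genome1 is now a plain character vector, unique() reports only the values present in the pruned rows.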