
unique/subset problem

8 messages · lalitha viswanath, Sarah Goslee, Weiwei Shi

#
Hi
I am new to R programming and am using subset to
extract part of a data frame as follows:

names(dataset) <- c("genome1", "genome2", "dist", "score")
prunedrelatives <- subset(dataset, score < -5)

However when I use unique to find the number of unique
genomes now present in prunedrelatives I get results
identical to calling unique(dataset$genome1) although
subset has eliminated many genomes and records.

I would greatly appreciate your input about using
"unique" correctly  in this regard.

Thanks
Lalitha


 
#
Hi,

Even though you removed many genome1 rows by requiring score < -5, that
does not necessarily change the set of unique values: a genome only
disappears if every one of its rows is removed.

To check this, you can do something like
p0 <- unique(dataset[dataset$score< -5, "genome1"]) # same as subset
p1 <- unique(dataset[dataset$score>= -5, "genome1"])

setdiff(p1, p0)

If the output above is empty, then every genome in the removed rows
also appears in the rows you kept, so removing those rows does not
change the set of unique genome1 values.
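The check above can be tried on a small made-up data frame. This is only a sketch: the rows are hypothetical, and only the column names genome1 and score come from the original post.

```r
# Hypothetical toy data; only the column names come from the thread.
dataset <- data.frame(
  genome1 = c("paer", "paer", "sent", "sent", "hinf"),
  score   = c(-10, -1, -20, -2, -3),
  stringsAsFactors = FALSE
)

# Genomes that survive the filter (same rows as subset(dataset, score < -5))
p0 <- unique(dataset[dataset$score <  -5, "genome1"])
# Genomes that occur in the rows the filter removes
p1 <- unique(dataset[dataset$score >= -5, "genome1"])

# Genomes found only in removed rows, i.e. the ones the filter eliminates
lost <- setdiff(p1, p0)
print(lost)
```

Here "paer" and "sent" each keep at least one row, so only "hinf" shows up in lost. An empty result would mean the filter removed rows but no genome.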

HTH,

weiwei
On 1/25/07, lalitha viswanath <lalithaviswanath at yahoo.com> wrote:

  
    
#
Hi
The pruned dataset has 8 unique genomes in it while
the dataset before pruning has 65 unique genomes in
it.
However calling unique on the pruned dataset seems to
return 65 no matter what.

Any assistance in this matter would be appreciated.

Thanks
Lalitha
--- Weiwei Shi <helprhelp at gmail.com> wrote:

            
#
Without knowing more about your data, it is hard to say for certain,
but might you be confusing unique _values_ with _factor levels_?
# mydata has 10 values, 5 unique values, and 5 factor levels
[1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
# the subset now has only 2 unique values, but the output
# still lists all five factor levels
[1] 1 2
Levels: 1 2 3 4 5

# try drop=TRUE as an option to subset
[1] 1 2
Levels: 1 2

Alternatively, if this is the problem and you don't need those
data to be factors, you could always convert them to a more
appropriate form.
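The distinction between unique values and factor levels can be sketched as follows. This is a hypothetical reconstruction, not the exact commands used above: mydata is the only name taken from this message, and the indexing with `[` and drop = TRUE is used here in place of subset() because that argument is documented for factor indexing.

```r
# Hypothetical reconstruction: 10 values, 5 unique values, 5 factor levels
mydata <- factor(c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5))

# Keep only the first four elements (values 1 and 2)
kept <- mydata[1:4]
n_unique_kept <- length(unique(kept))  # counts values actually present
n_levels_kept <- nlevels(kept)         # counts levels, including unused ones

# drop = TRUE in factor indexing discards the unused levels
dropped <- mydata[1:4, drop = TRUE]
n_levels_dropped <- nlevels(dropped)

print(c(unique = n_unique_kept,
        levels = n_levels_kept,
        levels_after_drop = n_levels_dropped))
```

So levels() on a subset reports what the factor could contain, while unique() reports what it does contain.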

Sarah

  
    
#
Then you need to provide more details about the calls you made and about your dataset.
For example, you can show us the output of
str(prunedrelatives, 1)

How exactly did you call unique on prunedrelatives? I made a test
dataset and it gave me the result you wanted (omitted here).
On 1/26/07, lalitha viswanath <lalithaviswanath at yahoo.com> wrote:

  
    
#
Hi
I read in my dataset using
dt <- read.table("filename")
Calling unique(levels(dt$genome1)) yields the
following

 [1] "aero"      "aful"      "aquae"     "atum_D"    "bbur"      "bhal"      "bmel"      "bsub"
 [9] "buch"      "cace"      "ccre"      "cglu"      "cjej"      "cper"      "cpneuA"    "cpneuC"
[17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"      "hinf"      "hpyl"      "linn"      "llact"
[25] "lmon"      "mgen"      "mjan"      "mlep"      "mlot"      "mpneu"     "mpul"      "mthe"
[33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"   "paer"      "paero"     "pmul"      "pyro"
[41] "rcon"      "rpxx"      "saur_mu50" "saur_n315" "sent"      "smel"      "spneu"     "spyo"
[49] "ssol"      "stok"      "styp"      "synecho"   "tacid"     "tmar"      "tpal"      "tvol"
[57] "uure"      "vcho"      "xfas"      "ypes"

It shows 60 genomes, which is correct.

I extracted a subset as follows
possible_relatives_subset <- subset(dt, Y < -5)
I am pasting the results below
     genome1   genome2 parameterX          Y
21       sent ecoliO157  0.00590 -200.633493
22       sent      paer  0.18603 -100.200570
27       styp ecoliO157  0.00484 -240.708645
28       styp      paer  0.18497 -30.250127
41       paer      sent  0.18603 -60.200570
44       paer      styp  0.18497 -80.250127
49       paer      hinf  0.18913 -90.056333
53       paer      vcho  0.18703 -10.153929
55       paer      pmul  0.18587 -100.208042
67       paer      buch  0.21485  -80.898667
70       paer      ypes  0.18460 -107.267454
82       paer      xfas  0.26268  -61.920552
95       hinf ecoliO157  0.07654 -163.018417
96       hinf      paer  0.18913 -10.056333
103      vcho ecoliO157  0.09518 -140.921153
104      vcho      paer  0.18703 -10.153929
107      pmul ecoliO157  0.07328 -165.215225
108      pmul      paer  0.18587 -10.208042
131      buch ecoliO157  0.15412 -11.746939
132      buch      paer  0.21485  -8.898667
137      ypes ecoliO157  0.02705 -19.171851
138      ypes      paer  0.18460 -10.267454
171 ecoliO157      sent  0.00590 -20.633493
174 ecoliO157      styp  0.00484 -20.708645
179 ecoliO157      hinf  0.07654 -6.018417
183 ecoliO157      vcho  0.09518 -14.921153
185 ecoliO157      pmul  0.07328 -6.215225
197 ecoliO157      buch  0.15412 -11.746939
200 ecoliO157      ypes  0.02705 -9.171851
211 ecoliO157      xfas  0.25833  -71.091552
217      xfas ecoliO157  0.25833  -75.091552
218      xfas      paer  0.26268  -64.920552

I think  even a cursory look will tell us that there
are not as many unique genomes in the subset results.
(around 8/10).
However when I do
unique(levels(possible_relatives_subset$genome1)), I
get

 [1] "aero"      "aful"      "aquae"     "atum_D"    "bbur"      "bhal"      "bmel"      "bsub"
 [9] "buch"      "cace"      "ccre"      "cglu"      "cjej"      "cper"      "cpneuA"    "cpneuC"
[17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"      "hinf"      "hpyl"      "linn"      "llact"
[25] "lmon"      "mgen"      "mjan"      "mlep"      "mlot"      "mpneu"     "mpul"      "mthe"
[33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"   "paer"      "paero"     "pmul"      "pyro"
[41] "rcon"      "rpxx"      "saur_mu50" "saur_n315" "sent"      "smel"      "spneu"     "spyo"
[49] "ssol"      "stok"      "styp"      "synecho"   "tacid"     "tmar"      "tpal"      "tvol"
[57] "uure"      "vcho"      "xfas"      "ypes"

Where am I going wrong?
I tried calling unique without the levels too, which
gives me the following response

 [1] sent      styp      paer      hinf      vcho      pmul      buch      ypes      ecoliO157 xfas
60 Levels: aero aful aquae atum_D bbur bhal bmel bsub buch cace ccre cglu cjej cper cpneuA ... ypes
--- Weiwei Shi <helprhelp at gmail.com> wrote:

            
#
Check
?read.table

and add as.is=TRUE to the call, so that strings are read as character
vectors and you avoid factors altogether.

Then repeat your work.

For example
`data.frame':	10 obs. of  7 variables:
 $ V1: Factor w/ 10 levels "-4086733916",..: 10 9 8 7 6 5 4 3 2 1
 $ V2: Factor w/ 10 levels "-1963744741",..: 10 8 7 4 5 6 3 9 1 2
 $ V3: Factor w/ 7 levels "-1687428658",..: 7 4 4 2 5 1 6 6 3 4
 $ V4: Factor w/ 2 levels "5","MECHANISM": 2 1 1 1 1 1 1 1 1 1
 $ V5: Factor w/ 2 levels "0","TYPE": 2 1 1 1 1 1 1 1 1 1
 $ V6: Factor w/ 2 levels "USER_","alexey": 1 2 2 2 2 2 2 2 2 2
 $ V7: Factor w/ 2 levels "3","TRUST": 2 1 1 1 1 1 1 1 1 1
`data.frame':	10 obs. of  7 variables:
 $ V1: chr  "LINK_ID" "-4293537751" "-4247422653" "-4223137153" ...
 $ V2: chr  "ID1" "65259" "1020286" "-518245428" ...
 $ V3: chr  "ID2" "6436" "6436" "-2099509019" ...
 $ V4: chr  "MECHANISM" "5" "5" "5" ...
 $ V5: chr  "TYPE" "0" "0" "0" ...
 $ V6: chr  "USER_" "alexey" "alexey" "alexey" ...
 $ V7: chr  "TRUST" "3" "3" "3" ...
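The effect on the original question can be sketched with a small inline table. The rows below are made up for illustration; the column names follow the earlier post. stringsAsFactors = TRUE is set explicitly in the first call because recent R versions no longer create factors by default.

```r
# Hypothetical inline data standing in for the real file
raw <- "genome1 genome2 Y
sent ecoliO157 -200.6
paer sent -60.2
aero aful -1.0
bbur bhal -2.5"

# Strings read as factors (the behaviour seen in this thread)
dt_factor <- read.table(text = raw, header = TRUE, stringsAsFactors = TRUE)
# Strings read as plain character vectors
dt_char   <- read.table(text = raw, header = TRUE, as.is = TRUE)

sub_factor <- subset(dt_factor, Y < -5)
sub_char   <- subset(dt_char,   Y < -5)

# The factor column still carries all original levels after subsetting
n_levels <- nlevels(sub_factor$genome1)
# The character column only knows the values that are actually present
n_unique <- length(unique(sub_char$genome1))

print(c(levels_in_factor_subset = n_levels, unique_in_char_subset = n_unique))
```

With character columns there are no levels to carry over, so unique() on the subset reflects only the rows that remain.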

HTH,

weiwei
On 1/26/07, lalitha viswanath <lalithaviswanath at yahoo.com> wrote:

  
    
#
Oh, I forgot: you can also convert a factor into a character vector
after reading the data, like
dataset$genome1 <- as.character(dataset$genome1)

If you use as.is=TRUE when you call read.table, you don't need
conversions like this one (or as.numeric(dataset$score)).
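A minimal sketch of the conversion on an already loaded data frame, using made-up rows. The droplevels() call at the end is an alternative not mentioned in this thread; it is available in newer R versions and keeps the column as a factor while discarding unused levels.

```r
# Hypothetical data frame whose genome1 column is a factor
dataset <- data.frame(
  genome1 = factor(c("paer", "sent", "hinf", "aero")),
  score   = c(-10, -20, -1, -2)
)

pruned <- subset(dataset, score < -5)

# Option 1: convert to character, then unique() lists only what is present
pruned$genome1 <- as.character(pruned$genome1)
present <- unique(pruned$genome1)

# Option 2 (assumption: R >= 2.12): keep the factor but drop unused levels
pruned2 <- droplevels(subset(dataset, score < -5))
remaining_levels <- levels(pruned2$genome1)

print(present)
print(remaining_levels)
```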

HTH,

weiwei
On 1/26/07, Weiwei Shi <helprhelp at gmail.com> wrote: