Hi
I have a data.frame with 371,718 obs. of 12 variables (see below for
the str() output). My problem is with V1, a Factor w/ 93144 levels; there
should actually be 93994 levels. Each entry looks like:
comp[number]_c[number]_seq[number]
for example
comp215489_c0_seq40
R is grouping as though the last number is a decimal for some reason; in
other words, comp215489_c0_seq40 and comp215489_c0_seq4 are considered to be
the same. They are not the same, so when I group by this factor I am
losing 800 levels.
Here is the str() output:
'data.frame': 371718 obs. of 12 variables:
$ V1 : Factor w/ 93144 levels "comp100000_c0_seq1",..: 92271 91685 29 30 1564 1564 1623 91700 91701 91848 ...
$ V2 : Factor w/ 17162 levels "gi|345842331|ref|NM_001244016.1|",..: 10119 10779 13210 13210 11522 8115 13079 14493 14493 15858 ...
$ V3 : num 95.5 90.2 98.7 99.2 81.4 ...
$ V4 : int 335 153 237 122 258 127 306 258 120 177 ...
$ V5 : int 15 15 3 1 38 19 20 23 5 9 ...
$ V6 : int 0 0 0 0 4 2 0 0 0 0 ...
$ V7 : int 1 45 1 43 1 129 1 54 1 70 ...
$ V8 : int 335 197 237 164 254 254 306 311 120 246 ...
$ V9 : int 6866 18 3172 3438 67 122 3927 42 346 195 ...
$ V10: int 7200 170 3408 3559 318 247 4232 299 465 19 ...
$ V11: num 7e-155 2e-46 4e-125 2e-61 3e-24 ...
$ V12: num 545 184 446 234 111 69.9 448 329 198 280 ..
--
View this message in context: http://r.789695.n4.nabble.com/problem-with-factor-levels-tp4652006.html
Sent from the R help mailing list archive at Nabble.com.
problem with factor levels
5 messages · Jeremy.Shearman, Milan Bouchet-Valat, PIKAL Petr
On Tuesday, 04 December 2012 at 00:34 -0800, Jeremy.Shearman wrote:
Hi
I have a data.frame with 371,718 obs. of 12 variables (see below for
an str). My problem is with V1, a Factor w/ 93144 levels, there should
actually be 93994 levels. Each entry looks like:
comp[number]_c[number]_seq[number]
for example
comp215489_c0_seq40
R is grouping as though the last number is a decimal for some reason, in
other words comp215489_c0_seq40 and comp215489_c0_seq4 are considered to be
the same. My problem is that they are not the same so when I group by this
factor I am losing 800 levels.
What format is your original data in? How do you import it? Please provide us with an excerpt of your original file showing at least two different values of V1 that are considered the same once imported into R (which sounds very unlikely to me...).
Regards
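One way to produce the kind of excerpt asked for here is sketched below. The real file was never posted, so the rows in `txt` are hypothetical stand-ins and the tab separator is an assumption; only the two identifiers come from the thread.

```r
# Hypothetical stand-in for the original file (the real file was not posted).
# Tab-separated, no header; only the first two columns are mimicked.
txt <- "comp215489_c0_seq4\t95.5
comp215489_c0_seq40\t90.2
comp215489_c0_seq4\t98.7"

dat <- read.table(text = txt, sep = "\t", header = FALSE,
                  stringsAsFactors = TRUE)

# List every distinct V1 value that starts with the prefix in question.
hits <- grep("^comp215489_c0_seq4", levels(dat$V1), value = TRUE)
print(hits)

# Number of levels R actually created for V1.
print(nlevels(dat$V1))
```

If both identifiers appear in `hits`, R has kept them apart on import, and the discrepancy has to come from somewhere else.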
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Jeremy.Shearman
Sent: Tuesday, December 04, 2012 9:35 AM
To: r-help at r-project.org
Subject: [R] problem with factor levels
Hi
I have a data.frame with 371,718 obs. of 12 variables (see below
for an str). My problem is with V1, a Factor w/ 93144 levels, there
should actually be 93994 levels. Each entry looks like:
comp[number]_c[number]_seq[number]
for example
comp215489_c0_seq40
R is grouping as though the last number is a decimal for some reason,
in other words comp215489_c0_seq40 and comp215489_c0_seq4 are
considered to be the same. My problem is that they are not the same so
when I group by this factor I am losing 800 levels.
Hm. How did you construct those factors?
factor(c("comp215489_c0_seq40", "comp215489_c0_seq4"))
[1] comp215489_c0_seq40 comp215489_c0_seq4
Levels: comp215489_c0_seq4 comp215489_c0_seq40
gives me 2 levels as expected. I also doubt that R would do such stripping while reading the data from a file.
Regards
Petr
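The same check can be run on a whole column without leaving R. The sketch below uses a small hypothetical vector `v1` in place of the poster's V1 column and compares the number of levels with the number of distinct strings.

```r
# Hypothetical stand-in for the V1 column (three observations, two distinct IDs).
v1 <- factor(c("comp215489_c0_seq40", "comp215489_c0_seq4", "comp215489_c0_seq4"))

# Levels defined on the factor.
n_levels <- nlevels(v1)

# Distinct values when the column is treated as plain character strings.
n_unique <- length(unique(as.character(v1)))

# Levels that actually occur in the data (unused levels dropped).
n_used <- nlevels(droplevels(v1))

print(c(levels = n_levels, unique = n_unique, used = n_used))

# The two identifiers are different strings as far as R is concerned.
print(identical("comp215489_c0_seq40", "comp215489_c0_seq4"))  # FALSE
```

If the three counts agree on the real data, the factor holds exactly as many groups as there are distinct identifiers.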
Oh, your skepticism was spot on! I was using Excel to check the output (silly, but I am still in the process of moving from Excel to R) and there was a discrepancy between the number of outputs from R and from Excel. It turns out the problem was with Excel and not with R at all. That's a relief.
SOLVED
--
View this message in context: http://r.789695.n4.nabble.com/problem-with-factor-levels-tp4652006p4652019.html
Sent from the R help mailing list archive at Nabble.com.
Hi
That is quite usual. Excel is so widespread that almost everybody assumes it contains no mistakes and behaves correctly. The contrary is true. Spreadsheets often guess what the user has in mind and "correct" values to fit that assumption, let alone the mistakes in their coded functions. R expects to be used by clever and able people and does just what you tell it to do, no more, no less. So whenever a result does not fit your expectations, first check your expectations.
Regards
Petr
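One way to check such an expectation inside R, rather than in a spreadsheet, is to compare the identifiers you expect against the levels R reports. This is only a sketch: `expected_ids` is a hypothetical list (for instance, read from the tool that generated the identifiers) and `v1` stands in for the V1 column.

```r
# Hypothetical list of identifiers that are expected to be present.
expected_ids <- c("comp215489_c0_seq4", "comp215489_c0_seq40", "comp215489_c0_seq41")

# Hypothetical stand-in for the V1 column of the data frame.
v1 <- factor(c("comp215489_c0_seq40", "comp215489_c0_seq4"))

# Identifiers expected but absent from the factor.
missing_ids <- setdiff(expected_ids, levels(v1))

# Identifiers present in the factor but not expected.
unexpected_ids <- setdiff(levels(v1), expected_ids)

print(missing_ids)
print(unexpected_ids)
```

Listing the actual missing identifiers, instead of only comparing two totals, shows whether the gap is in the data or in the expected count.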
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Jeremy.Shearman
Sent: Tuesday, December 04, 2012 10:38 AM
To: r-help at r-project.org
Subject: Re: [R] problem with factor levels
Oh, your skepticism was spot on! I was using excel to check the output (silly, but I am still in the process of moving from excel to R) and there was a discrepancy in the number of output from R and excel. Turns out the problem was with excel and not with R at all. That's a relief.
SOLVED