problem with factor levels

5 messages · Jeremy.Shearman, Milan Bouchet-Valat, PIKAL Petr

#
Hi
I have a data.frame with 371,718 obs. of 12 variables (see the str output
below). My problem is with V1, a Factor w/ 93144 levels; there should
actually be 93994 levels. Each entry looks like:
comp[number]_c[number]_seq[number]
for example
comp215489_c0_seq40
R seems to be grouping as though the last number were a decimal; in other
words, comp215489_c0_seq40 and comp215489_c0_seq4 are treated as the same.
They are not the same, so when I group by this factor I am losing about
850 levels (93994 - 93144).

Here is the str output:

'data.frame':	371718 obs. of  12 variables:
 $ V1 : Factor w/ 93144 levels "comp100000_c0_seq1",..: 92271 91685 29 30 1564 1564 1623 91700 91701 91848 ...
 $ V2 : Factor w/ 17162 levels "gi|345842331|ref|NM_001244016.1|",..: 10119 10779 13210 13210 11522 8115 13079 14493 14493 15858 ...
 $ V3 : num  95.5 90.2 98.7 99.2 81.4 ...
 $ V4 : int  335 153 237 122 258 127 306 258 120 177 ...
 $ V5 : int  15 15 3 1 38 19 20 23 5 9 ...
 $ V6 : int  0 0 0 0 4 2 0 0 0 0 ...
 $ V7 : int  1 45 1 43 1 129 1 54 1 70 ...
 $ V8 : int  335 197 237 164 254 254 306 311 120 246 ...
 $ V9 : int  6866 18 3172 3438 67 122 3927 42 346 195 ...
 $ V10: int  7200 170 3408 3559 318 247 4232 299 465 19 ...
 $ V11: num  7e-155 2e-46 4e-125 2e-61 3e-24 ...
 $ V12: num  545 184 446 234 111 69.9 448 329 198 280 ...



--
View this message in context: http://r.789695.n4.nabble.com/problem-with-factor-levels-tp4652006.html
Sent from the R help mailing list archive at Nabble.com.
#
On Tuesday 04 December 2012 at 00:34 -0800, Jeremy.Shearman wrote:
What format is your original data using? How do you import it?

Please provide us with an excerpt of your original file showing at least
two different values of V1 that are considered the same once imported in
R (which sounds very unlikely to me...).


Regards
#
Hi
Hm. How did you construct those factors?
[1] comp215489_c0_seq40 comp215489_c0_seq4 
Levels: comp215489_c0_seq4 comp215489_c0_seq40

gives me 2 levels, as expected. I also doubt that R would strip anything like that when reading from a file.
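
The check quoted above can be reproduced along these lines. This is only a minimal sketch: the two IDs come from the original post, while the temporary file and the names `ids`, `f`, `tmp` and `d` are assumptions made for illustration, not the code that was actually run.

```r
# Two IDs that differ only in the trailing sequence number (from the post)
ids <- c("comp215489_c0_seq40", "comp215489_c0_seq4")

# Build a factor directly and inspect its levels
f <- factor(ids)
levels(f)
nlevels(f)

# Same check after a round trip through a text file and read.table()
# (hypothetical temporary file, one ID per line, no header)
tmp <- tempfile(fileext = ".txt")
writeLines(ids, tmp)
d <- read.table(tmp, stringsAsFactors = TRUE)
nlevels(d$V1)
```

Both `nlevels()` calls should report 2, which is consistent with R keeping the two IDs distinct both in memory and when reading them from a file.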

Regards
Petr
#
Oh, your skepticism was spot on!
I was using Excel to check the output (silly, but I am still in the process
of moving from Excel to R) and there was a discrepancy between the counts
reported by R and by Excel. It turns out the problem was with Excel and not
with R at all. That's a relief.

SOLVED
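
For readers who hit a similar discrepancy, the number of distinct IDs can be cross-checked inside R itself rather than in a spreadsheet. The following is a sketch under stated assumptions: `dat` is a small made-up stand-in for the poster's data frame, and its values are invented for illustration only.

```r
# Hypothetical stand-in for the real data frame (which has 371,718 rows)
dat <- data.frame(
  V1 = factor(c("comp1_c0_seq4", "comp1_c0_seq40",
                "comp1_c0_seq4", "comp2_c0_seq1"))
)

# Number of factor levels
nlevels(dat$V1)

# Independent count of distinct IDs, compared as plain strings
length(unique(as.character(dat$V1)))

# Rows per ID, to confirm that similar-looking IDs are counted separately
table(dat$V1)
```

If the first two numbers agree, the factor has not merged any IDs, and a mismatch with another tool most likely lies in that tool.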

#
Hi

That is quite usual. Excel is so widespread that almost everybody assumes it contains no mistakes and behaves correctly. The contrary is true. Spreadsheets often guess what the user has in mind and "correct" values to fit that assumption, to say nothing of mistakes in their built-in functions.

R expects to be used by clever and able people and does just what you tell it to do, no more, no less.

So whenever a result does not fit your expectations, first check your expectations.

Regards
Petr