Say I have a tab-delimited table that I want to read into R. What should I
expect to happen if some of the entries contain a single quote (')? I
thought the file would be read without trouble, but that is not what happens.
Instead, everything between two single quotes is read into one field,
and the rest of the table is badly mangled. Is this a bug, and, besides
removing the offending characters, is there a fix?
Example input file (testFile.txt; fields are tab-separated in the actual file):
3499  9031    424823   COP'B2           118094989  XP_422637.2
3499  7955    114454   copb2            50080158   NP_001001940.1
3499  7227    45757    betaCop          24584107   NP_524836.2
3499  7165    1278426  AgaP_AGAP004798  158297839  XP_318012.4
3499  6239    177779   F38E11.5         17540286   NP_501671.1
3499  4896    2540050  sec'27           19113604   NP_596811.1
3499  4932    852740   SEC27            6321301    NP_011378.1
3499  28985   2897447  KLLA0B01958g     50303353   XP_451618.1
3499  33169   4621659  AGOS_AFL118W     45198403   NP_985432.1
3499  148305  2682116  MGG_10504        145615762  XP_366285.2
3499  5141    2709504  NCU07319.1       32414251   XP_327605.1
3499  3702    820842   AT3G15980        30683862   NP_850592.1
3499  3702    841666   AT1G52360        15218215   NP_175645.1
3499  3702    844339   AT1G79990        30699476   NP_178116.2
3499  4530    4340097  Os06g0143900     115466360  NP_001056779.1
testDat <- read.table('testFile.txt',sep='\t')
testDat
     V1     V2      V3
1  3499   9031  424823
2  3499   4932  852740
3  3499  28985 2897447
4  3499  33169 4621659
5  3499 148305 2682116
6  3499   5141 2709504
7  3499   3702  820842
8  3499   3702  841666
9  3499   3702  844339
10 3499   4530 4340097
   V4
1  COPB2\t118094989\tXP_422637.2\n3499\t7955\t114454\tcopb2\t50080158\tNP_001001940.1\n3499\t7227\t45757\tbetaCop\t24584107\tNP_524836.2\n3499\t7165\t1278426\tAgaP_AGAP004798\t158297839\tXP_318012.4\n3499\t6239\t177779\tF38E11.5\t17540286\tNP_501671.1\n3499\t4896\t2540050\tsec27
2  SEC27
3  KLLA0B01958g
4  AGOS_AFL118W
5  MGG_10504
6  NCU07319.1
7  AT3G15980
8  AT1G52360
9  AT1G79990
10 Os06g0143900
          V5             V6
1   19113604    NP_596811.1
2    6321301    NP_011378.1
3   50303353    XP_451618.1
4   45198403    NP_985432.1
5  145615762    XP_366285.2
6   32414251    XP_327605.1
7   30683862    NP_850592.1
8   15218215    NP_175645.1
9   30699476    NP_178116.2
10 115466360 NP_001056779.1
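A minimal sketch of what may be going on, offered as a likely explanation rather than a confirmed diagnosis: read.table() defaults to quote = "\"'", so an apostrophe is treated as a quoting character, and with sep = '\t' everything up to the next apostrophe ends up in a single field. Passing quote = "" turns quote handling off. The three-row temporary file below is a hypothetical subset of the example above, used only for illustration.

```r
# Hypothetical three-row subset of the example file, written to a temp file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("3499\t9031\t424823\tCOP'B2\t118094989\tXP_422637.2",
             "3499\t7955\t114454\tcopb2\t50080158\tNP_001001940.1",
             "3499\t4896\t2540050\tsec'27\t19113604\tNP_596811.1"),
           tmp)

# Default settings: quote = "\"'", so ' is treated as a quoting character.
bad <- read.table(tmp, sep = "\t")

# Quote handling disabled: ' is read as an ordinary character.
good <- read.table(tmp, sep = "\t", quote = "", stringsAsFactors = FALSE)

print(dim(bad))    # compare with dim(good) to see the rows that were merged
print(dim(good))
print(good$V4)
```

If some fields really are wrapped in double quotes, quote = "\"" would keep that handling while still treating the apostrophe as plain text.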
I would appreciate any feedback.
Thanks,
-Robert
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
loaded via a namespace (and not attached):
[1] tools_2.12.1
Robert M. Flight, Ph.D.
University of Louisville Bioinformatics Laboratory
University of Louisville
Louisville, KY
PH 502-852-1809 (HSC)
PH 502-852-0467 (Belknap)
EM robert.flight at louisville.edu
EM rflight79 at gmail.com
Williams and Holland's Law:
    If enough data is collected, anything may be proven by
    statistical methods.