write.table: strange output has been produced

7 messages · jim holtman, Igor Chernukhin, David Winsemius

#
Good afternoon all -

While making steady progress in learning R after Matlab, I encountered
a problem that I need some extra help to get past.
Basically, I want to merge data from a biological statistical dataset
with annotation data extracted from another dataset using an 'id'
cross-reference, and write the result to a report file. The first part
goes absolutely fine: I have merged both datasets into a data.frame. But
when I try to write it to a csv file using 'write.table', it does write
the 'data.frame' object but also inserts some parts of the
annotation data which are not supposed to be there...
A little snapshot of the file output is below to illustrate. The
upper half is fine; that's how it should be. The lower half, which
actually appears to be space-separated rather than comma-separated, was
obviously grabbed from the annotation dataset and is not supposed to be here.

--------------------------------8<--------------------------------------------
"344","166128",126.44286392082,179.904700814932,72.9810270267088,0.40566492535281,-1.3016395254146,2.47449355237252e-07,4.2901159299567e-06,"Chitinas
"18816","238247",92.5282508325735,135.981255262454,49.0752464026927,0.36089714209487,-1.47034037615176,2.5330054329543e-07,4.38862252337004e-06,"Prot
"22072","222365",30.8191942806426,52.4262903365628,9.21209822472236,0.17571524068522,-2.50868876576414,2.54433836512085e-07,4.40531098485028e-06,NA,N
"25062","226605",30.808007579908,50.3976662241578,11.2183489356581,0.22259659575825,-2.16749656564076,2.54934711860645e-07,4.41103467375713e-06,NA,NA
"7539","247009",75.4175439970731,34.4643221134552,116.370765880691,3.37655751642533,1.75555313265164,2.60010673210741e-07,4.49585878338091e-06,NA,NA,
"407","267139",425.559675915702,279.393013150954,571.72633868045,2.04631580522577,1.03302881149302,2.61074218843609e-07,4.51123710239304e-06,NA,NA,NA
"26530","171300",146.80096060985,80.0063286553601,213.595592564339,2.66973370924738,1.4166958484644,2.68061220749976e-07,4.62888115991058e-06,NA,NA,N
"3078","159013",34.3260176515511,52.4580790080106,16.1939562950917,0.308702808057816,-1.69570948866688,2.69104298652827e-07,4.64379716436078e-06,"40S
"4657","159998",133.10761487064,185.450704462326,80.7645252789532,0.435504009074069,-1.19924209513405,2.75544399955331e-07,4.75176501174632e-06,"IMP-

171597  171597  KOG1347 Uncharacterized membrane protein, predicted
efflux pump General function prediction only    POORLY CHARACTERIZED
171658  171658  KOG4290 Predicted membrane protein  Function unknown
POORLY CHARACTERIZED
171660  171660  KOG0903 Phosphatidylinositol 4-kinase, involved in
intracellular trafficking and secretion  Signal transduction mechanisms
CELLULAR 
171660  171660  KOG0903 Phosphatidylinositol 4-kinase, involved in
intracellular trafficking and secretion  Intracellular trafficking,
secretion, and
171703  171703  KOG2674 Cysteine protease required for autophagy -
Apg4p/Aut2p  Cytoskeleton    CELLULAR PROCESSES AND SIGNALING
171703  171703  KOG2674 Cysteine protease required for autophagy -
Apg4p/Aut2p  Intracellular trafficking, secretion, and vesicular
transport   CELLU
and metabolism     METABOLISM
--------------------------------8<--------------------------------------------
And this is a piece of code that produced this:

--------------------------------8<--------------------------------------------
= rep(NA,n))
# strangely, if I do    
# extra[ML,] = as.character(annot[MR,2:4])
# it produces digits (???) instead of a string value
--------------------------------8<--------------------------------------------
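A minimal sketch of the "digits instead of a string" effect mentioned in the commented-out line above. It assumes the text columns of 'annot' are factors; the tiny data frame and its values are hypothetical stand-ins, not the real data.

```r
# Hypothetical stand-in for 'annot': text columns stored as factors.
annot <- data.frame(
  id       = c(101L, 102L),
  kogClass = factor(c("Signal transduction", "Cytoskeleton")),
  kogGroup = factor(c("CELLULAR", "METABOLISM"))
)

# A one-row data.frame is a list of length-one factors.
row <- annot[1, c("kogClass", "kogGroup")]

# as.character() applied to the whole list typically works on the
# underlying integer codes, so the result holds digits, not labels.
codes <- as.character(row)

# Converting column by column returns the factor labels instead.
labels <- vapply(row, as.character, character(1))

print(codes)
print(labels)
```

Converting each column separately (or reading the data with stringsAsFactors = FALSE) sidesteps the integer codes.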

Any ideas why this is happening?

Many thanks
-Igor
#
On Sep 19, 2012, at 9:12 AM, Igor wrote:

            
This looks like the sort of thing that occurs when there is a mismatched or missing double or single quote, or perhaps a comment character (a "#" that terminated a line read), somewhere. The logical place to look is in the line of data just above the pathological stretch of data. You have clearly only offered a truncated version of the data, since there are many instances of lines ending without matching quotes, even one in the first line.
#
It would also be helpful if you could provide the output of 'str' for
all the objects that you are using.

e.g.,  str(statdata)    str(extra)


Also in creating your data.frame, use "stringsAsFactors = FALSE":

extra = data.frame(kogdefline=rep(NA,n)
    , kogClass = rep(NA,n)
    , kogGroup = rep(NA,n)
    , stringsAsFactors = FALSE
)
On Wed, Sep 19, 2012 at 12:12 PM, Igor <igorc at essex.ac.uk> wrote:

  
    
#
Hi Jim - 
Thank you for your reply.

-----------------------------8<------------------------------------
'data.frame':   6895 obs. of  4 variables:
 $ id          : int  231803 231804 231805 231810 231811 231816 231818
177697 223131 231823 ...
 $ kogdefline  : Factor w/ 1898 levels "17 beta-hydroxysteroid
dehydrogenase type 3, HSD17B3",..: 1633 693 704 1627 1042 507 1870 1448
730 185 ...
 $ kogClass    : Factor w/ 26 levels "","Amino acid transport and
metabolism ",..: 26 4 24 20 18 24 10 22 25 6 ...
 $ kogGroup    : Factor w/ 5 levels "","CELLULAR PROCESSES AND
SIGNALING",..: 3 4 2 2 2 2 2 3 3 2 ...
'data.frame':   3887 obs. of  8 variables:
 $ id            : chr  "267533" "246792" "271961" "237478" ...
 $ baseMean      : num  288 519 309 189 341 ...
 $ baseMeanA     : num  574 1025 617 375 661 ...
 $ baseMeanB     : num  1.392 13.592 0.535 2.23 21.621 ...
 $ foldChange    : num  0.002426 0.013258 0.000866 0.00594 0.032733 ...
 $ log2FoldChange: num  -8.69 -6.24 -10.17 -7.4 -4.93 ...
 $ pval          : num  2.82e-104 1.70e-94 4.82e-81 1.63e-79
6.62e-78 ...
 $ padj          : num  7.31e-100 2.20e-90 4.16e-77 1.06e-75
3.43e-74 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:1235] 17 18 20 22 31
33 39 43 44 45 ...
  .. ..- attr(*, "names")= chr [1:1235] "NA" "NA.1" "NA.2" "NA.3" ...
'data.frame':   3887 obs. of  3 variables:
 $ kogdefline: chr  NA NA NA NA ...
 $ kogClass  : chr  NA NA NA NA ...
 $ kogGroup  : chr  NA NA NA NA ...
-------------------------8<----------------------------

Also I tried "stringsAsFactors = FALSE", it doesn't seem to make any
difference.

-Igor
On Wed, 2012-09-19 at 13:36 -0400, jim holtman wrote:
#
Hi David - 
Thank you for your reply. You are probably right. The last 'normal' line
doesn't have its double quote closed. Here is the complete line:

-------------------------8<------------------------------------
"4657","159998",133.10761487064,185.450704462326,80.7645252789532,0.435504009074069,-1.19924209513405,2.75544399955331e-07,4.75176501174632e-06,"IMP-GMP specific 5-nucleotidase	Nucleotide transport and metabolism 	METABOLISM
--------------------------8<------------------------------------

So it might be that the annotation dataset is actually the culprit. But
it gets more complicated when I try to find this string in the
'annot' object using the id value.
The id 159998 is present in the output from the 'intersect' function:
[1] 539

It is also present in statdata:
[1] 1502

But I cannot find it in the 'annot' object???
integer(0)
[1] "integer"

Could it be that the annot dataset contains some illegal symbols that
screw everything up? Should I just edit it first with 'sed' to remove
everything except alphanumerics before importing it into R...


-Igor
On Wed, 2012-09-19 at 10:26 -0700, David Winsemius wrote:

  
    
#
On Sep 19, 2012, at 12:20 PM, Igor Chernukhin wrote:

            
I find it very productive to use the count.fields function. It lets you play around with removing the comment character which you do not yet seem to have done. I find this code particularly useful:

table(count.fields(file = "fil.ext", sep=",", quote="'", comment.char=""))

This would get tripped up by commas inside double-quoted strings, but I do not see any of those in the fragments you offered.
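A small sketch of this diagnostic on a made-up file. The tab-separated layout, the file contents and the column names are assumptions for illustration only; the point is that a stray apostrophe changes the per-line field counts when the single quote is treated as a quoting character.

```r
# Hypothetical tab-separated annotation file; two descriptions contain
# an apostrophe, so the single quotes are balanced across lines.
tmp <- tempfile(fileext = ".tab")
writeLines(c(
  "id\tkogdefline\tkogGroup",
  "1\t5'-nucleotidase\tMETABOLISM",
  "2\tmembrane protein\tPOORLY CHARACTERIZED",
  "3\t3'-phosphatase\tMETABOLISM",
  "4\tprotein kinase\tCELLULAR PROCESSES"
), tmp)

# Field counts with both quote characters active (the read.table default).
with_quotes <- count.fields(tmp, sep = "\t", quote = "\"'", comment.char = "")

# Field counts with quoting switched off entirely.
no_quotes <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

print(table(with_quotes, useNA = "ifany"))
print(table(no_quotes))
```

If the two tables differ, or the first one contains NA entries, some line holds a quote character that is swallowing the lines after it.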
#
Thank you David - you pointed me in the right direction.
Back to normal, problem sorted.
I had missed a single quote in the 'annot' data when I imported it from file
using the read.table function with the default 'quote' argument. quote="\""
did the trick.
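
A hedged sketch of that fix on a made-up file (the contents, separator and column names are assumptions, not the real 'annot' data). With the default 'quote' argument the apostrophes are treated as quoting characters; restricting it to the double quote keeps every row.

```r
# Hypothetical tab-separated annotation file with apostrophes in the text.
tmp <- tempfile(fileext = ".tab")
writeLines(c(
  "id\tkogdefline\tkogGroup",
  "1\t5'-nucleotidase\tMETABOLISM",
  "2\tmembrane protein\tPOORLY CHARACTERIZED",
  "3\t3'-phosphatase\tMETABOLISM",
  "4\tprotein kinase\tCELLULAR PROCESSES"
), tmp)

# Default quote = "\"'": the apostrophes open and close a quoted string,
# so rows get merged. tryCatch() guards against a read error here.
bad <- tryCatch(
  read.table(tmp, sep = "\t", header = TRUE, stringsAsFactors = FALSE),
  error = function(e) NULL
)

# Only the double quote is a quoting character: apostrophes stay literal.
good <- read.table(tmp, sep = "\t", header = TRUE, quote = "\"",
                   comment.char = "", stringsAsFactors = FALSE)

print(if (is.null(bad)) "default read failed" else nrow(bad))
print(nrow(good))
```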

Many thanks
-Igor
On Wed, 2012-09-19 at 14:55 -0700, David Winsemius wrote: