write.table: strange output has been produced
Hi David -

Thank you for your reply. You are probably right. The last 'normal' line doesn't have a closing double quote. Here is the complete line:

-------------------------8<------------------------------------
"4657","159998",133.10761487064,185.450704462326,80.7645252789532,0.435504009074069,-1.19924209513405,2.75544399955331e-07,4.75176501174632e-06,"IMP-GMP specific 5-nucleotidase Nucleotide transport and metabolism METABOLISM
--------------------------8<------------------------------------

So it might be that the annotation dataset is actually the culprit. But it gets more complicated when I try to find this string in the 'annot' object using the id value. The id 159998 is present in the output from the 'intersect' function:
which(subset == 159998)
[1] 539

It is also present in statdata:
which(statdata$id == 159998)
[1] 1502

But I cannot find it in the 'annot' object???
which(annot$id == 159998)
integer(0)
class(annot$id)
[1] "integer"

Could it be that the annot dataset contains some illegal symbols that screw everything up? Should I just edit it first with 'sed' to remove everything except alphanumeric characters before importing it into R...

-Igor
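One way to test the "illegal symbols" idea from within R, rather than with 'sed', is to look for annotation fields that contain a line break, which is what a quote left open during import would tend to leave behind. The following is only a sketch on made-up data: the toy 'annot' values and the column name 'kogdefline' (standing in for the real description column) are assumptions, not the actual dataset.

```r
# Hypothetical toy annotation table. The second description has swallowed
# the following line of the file, as can happen when a quote is left open.
annot <- data.frame(
  id = c(1L, 2L, 3L),
  kogdefline = c(
    "Chitinase",
    "nucleotidase\n4\tKOG0000\tsome other protein",
    "Predicted membrane protein"
  ),
  stringsAsFactors = FALSE
)

# Flag descriptions that contain a carriage return or a newline.
has_break <- grepl("[\r\n]", annot$kogdefline)

# ids of the suspicious rows.
suspect_ids <- annot$id[has_break]

# Membership test for a single id; %in% never returns NA.
is_present <- 159998L %in% annot$id
```

If 'suspect_ids' is non-empty on the real data, the problem most likely lies in how the annotation file was read rather than in 'write.table'.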
On Wed, 2012-09-19 at 10:26 -0700, David Winsemius wrote:
On Sep 19, 2012, at 9:12 AM, Igor wrote:
Good afternoon all -

While making steady progress in learning R after Matlab, I encountered a problem which seems to require some extra help to get past. Basically I want to merge data from a biological statistical dataset with annotation data extracted from another dataset, using an 'id' cross-reference, and write it to a report file. The first part goes absolutely fine: I have merged both datasets into a data.frame. But when I try to write it to a csv file using 'write.table', it does write the 'data.frame' object but it also inserts some parts of the annotation data which are not supposed to be there...

There is a little snapshot of the file output below to illustrate. The upper half is fine, that's how it should be. The lower half, which actually appears to be space-separated rather than comma-separated, was obviously grabbed from the annotation dataset and is not supposed to be here.

--------------------------------8<--------------------------------------------
"344","166128",126.44286392082,179.904700814932,72.9810270267088,0.40566492535281,-1.3016395254146,2.47449355237252e-07,4.2901159299567e-06,"Chitinas
"18816","238247",92.5282508325735,135.981255262454,49.0752464026927,0.36089714209487,-1.47034037615176,2.5330054329543e-07,4.38862252337004e-06,"Prot
"22072","222365",30.8191942806426,52.4262903365628,9.21209822472236,0.17571524068522,-2.50868876576414,2.54433836512085e-07,4.40531098485028e-06,NA,N
"25062","226605",30.808007579908,50.3976662241578,11.2183489356581,0.22259659575825,-2.16749656564076,2.54934711860645e-07,4.41103467375713e-06,NA,NA
"7539","247009",75.4175439970731,34.4643221134552,116.370765880691,3.37655751642533,1.75555313265164,2.60010673210741e-07,4.49585878338091e-06,NA,NA,
"407","267139",425.559675915702,279.393013150954,571.72633868045,2.04631580522577,1.03302881149302,2.61074218843609e-07,4.51123710239304e-06,NA,NA,NA
"26530","171300",146.80096060985,80.0063286553601,213.595592564339,2.66973370924738,1.4166958484644,2.68061220749976e-07,4.62888115991058e-06,NA,NA,N
"3078","159013",34.3260176515511,52.4580790080106,16.1939562950917,0.308702808057816,-1.69570948866688,2.69104298652827e-07,4.64379716436078e-06,"40S
"4657","159998",133.10761487064,185.450704462326,80.7645252789532,0.435504009074069,-1.19924209513405,2.75544399955331e-07,4.75176501174632e-06,"IMP-
171597 171597 KOG1347 Uncharacterized membrane protein, predicted efflux pump General function prediction only POORLY CHARACTERIZED
171658 171658 KOG4290 Predicted membrane protein Function unknown POORLY CHARACTERIZED
171660 171660 KOG0903 Phosphatidylinositol 4-kinase, involved in intracellular trafficking and secretion Signal transduction mechanisms CELLULAR
171660 171660 KOG0903 Phosphatidylinositol 4-kinase, involved in intracellular trafficking and secretion Intracellular trafficking, secretion, and
171703 171703 KOG2674 Cysteine protease required for autophagy - Apg4p/Aut2p Cytoskeleton CELLULAR PROCESSES AND SIGNALING
171703 171703 KOG2674 Cysteine protease required for autophagy - Apg4p/Aut2p Intracellular trafficking, secretion, and vesicular transport CELLU
and metabolism METABOLISM
This looks like the sort of thing that occurs when there is a mismatched or missing double or single quote, or perhaps a comment character ("#", which terminates a line being read), somewhere. The logical place to look is in the line of data just above the pathological stretch of data. You have clearly only offered a truncated version of the data, since there are many instances of lines ending without matching quotes, even in the first line.

-- David.
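To illustrate the diagnosis above, here is a minimal sketch of reading a tab-separated annotation file with quote and comment handling switched off, so that a stray quote or "#" is taken literally. The file contents, the column names and the apostrophe in "5'-nucleotidase" are assumptions made for the example; the thread does not show the real file or how it was read.

```r
# Hypothetical tab-separated annotation file written to a temporary path.
# The apostrophe in 5'-nucleotidase is the kind of character that can open
# a quoted string when single quotes are in the reader's quote set.
path <- tempfile(fileext = ".tab")
writeLines(c(
  "id\tkog\tdefline",
  "1\tKOG0001\tIMP-GMP specific 5'-nucleotidase",
  "2\tKOG0002\tPredicted membrane protein",
  "3\tKOG0003\tPhosphatidylinositol 4-kinase"
), path)

# Number of fields on each line, with quoting and comments disabled.
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# Import with the same settings, keeping text columns as character.
annot <- read.delim(path, quote = "", comment.char = "",
                    stringsAsFactors = FALSE)

unlink(path)
```

Running 'count.fields' again with the quote set actually used for the import, and comparing the two results, may help to locate the first line where the counts diverge; that line is the one to inspect.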
--------------------------------8<--------------------------------------------

And this is a piece of code that produced this:

--------------------------------8<--------------------------------------------
n = nrow(statdata)
extra = data.frame(kogdefline = rep(NA, n), kogClass = rep(NA, n), kogGroup = rep(NA, n))

subset = intersect(statdata$id, annot$id)
MR = match(subset, annot$id)
ML = match(subset, statdata$id)

extra[ML,1] = as.character(annot[MR,2])
extra[ML,2] = as.character(annot[MR,3])
extra[ML,3] = as.character(annot[MR,4])

# strangely, if I do
# extra[ML,] = as.character(annot[MR,2:4])
# it produces digits (???) instead of a string value

mergedData = data.frame(statdata, extra)
write.table(mergedData, 'filename.csv', sep=',')
--------------------------------8<--------------------------------------------

Any ideas why this is happening?

Many thanks
-Igor
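On the "digits" puzzle in the commented-out lines: a likely explanation, assuming the annotation columns were imported as factors, is that 'as.character' applied to several columns at once treats the data frame as a list and returns the factors' integer codes rather than their labels. A sketch of a column-by-column conversion on made-up data (all values and the two column names are hypothetical):

```r
# Toy stand-ins for the real objects.
statdata <- data.frame(id = c(10L, 20L, 30L), pval = c(0.01, 0.02, 0.03))
annot <- data.frame(
  id         = c(30L, 10L),
  kogdefline = factor(c("Predicted membrane protein", "Chitinase")),
  kogClass   = factor(c("Function unknown", "Carbohydrate transport"))
)

# Row of annot for each statdata id; NA where no annotation exists.
idx <- match(statdata$id, annot$id)

# Convert each factor column to character separately, then rebuild a data frame.
extra <- as.data.frame(
  lapply(annot[idx, c("kogdefline", "kogClass")], as.character),
  stringsAsFactors = FALSE
)

mergedData <- data.frame(statdata, extra)
```

Matching on 'statdata$id' directly also removes the need for the separate 'intersect' step and the pre-filled 'extra' frame, since ids without annotation simply come out as NA.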
David Winsemius, MD Alameda, CA, USA
Dr I Chernukhin School of Biological Sciences University of Essex Wivenhoe Park Colchester Essex CO4 3SQ