
[Bioc-devel] Non-ASCII in dataset from Biomart EMBL via Gviz package

4 messages · Martin, Tiphaine, Vincent Carey, Hahne, Florian

#
Hi,


I need to create a BiomartGeneRegionTrack dataset with the Gviz package to run the examples in my package. But when I run "R CMD check coMET", I get the following warning during the data check:


 checking data for non-ASCII characters ... WARNING
  Warning: found non-ASCII strings
  '[alpha cell,acidophil cell,acinar cell,adipoblast,adipocyte,amacrine cell,beta cell,capsular cell,cementocyte,chief cell,chondroblast,chondrocyte,chromaffin cell,chromophobic cell,corticotroph,delta cell,dendritic cell,enterochromaffin cell,ependymocyte,epithelium,erythroblast,erythrocyte,fibroblast,fibrocyte,follicular cell,germ cell,germinal epithelium,giant cell,glial cell,glioblast,goblet cell,gonadotroph,granulosa cell,haemocytoblast,hair cell,hepatoblast,hepatocyte,hyalocyte,interstitial cell,juxtaglomerular cell,keratinocyte,keratocyte,lemmal cell,leukocyte,luteal cell,lymphocytic stem cell,lymphoid cell,lymphoid stem cell,macroglial cell,mammotroph,mast cell,medulloblast,megakaryoblast,megakaryocyte,melanoblast,melanocyte,mesangial cell,mesothelium,metamyelocyte,monoblast,monocyte,mucous neck cell,muscle cell,myelocyte,myeloid cell,myeloid stem cell,myoblast,myoepithelial cell,myofibrobast,neuroblast,neuroepithelium,neuron,odontoblast,osteoblast,osteoclast,osteocyte,oxyntic cell,parafollicular cell,paraluteal cell,peptic cell,pericyte,phaeochromocyte,phalangeal cell,pinealocyte,pituicyte,plasma cell,platelet,podocyte,proerythroblast,promonocyte,promyeloblast,promyelocyte,pronormoblast,reticulocyte,retinal pigment epithelium,retinoblast,somatotroph,stem cell,sustentacular cell,teloglial cell,zymogenic cell,small cell,Th1,Cell Type,M<c3><bc>ller cell,primary oocyte,Claudius' cell,Th2,follicular dendritic cell,astrocyte,white,T-lymphoblast,basal cell,T-lymphocyte,helper induced T-lymphocyte:Th2,B-lymphocyte,neutrophil,oocyte,unclassifiable (Cell Type),natural killer cell,helper induced T-lymphocyte,brown,CD4+,Hensen cell,lymphocyte,cardiac muscle cell,lymphoblast,Paneth cell,alveolar macrophage,macrophage,squamous cell,oligodendrocyte,smooth muscle cell,gamete,spermatid,Schwann cell,CD34+,spermatocyte,helper induced T-lymphocyte:Th1,astroblast,eosinophil,oligodendroblast,basophil,peripheral blood mononuclear cell,histiocyte,Sertoli 
cell,endothelium,granulocyte,spermatozoon,Merkel cell,skeletal muscle cell,thymocyte,foam cell,ovum,secondary spermatocyte,Langerhans cell,primary spermatocyte,transitional,Purkinje cell,Kupffer cell,secondary oocyte,B-lymphoblast]' in object 'biomTrack'


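library(Gviz)

chrom <- "chr2"
start <- 38290160
end <- 38303219
gen <- "hg19"
showId <- TRUE  # assumed value; not defined in the original post

# Build the gene region track from Ensembl via Biomart
biomTrack <- BiomartGeneRegionTrack(genome = gen,
                                    chromosome = chrom, start = start,
                                    end = end, name = "ENSEMBL",
                                    fontcolor = "black", groupAnnotation = "group",
                                    just.group = "above", showId = showId)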


Do you have any idea how to fix this warning? I think we may need to ask EMBL to correct it on their side, don't we?


Tiphaine


----------------------------
Tiphaine Martin
PhD Research Student | King's College
The Department of Twin Research & Genetic Epidemiology | Genetics & Molecular Medicine Division
St Thomas' Hospital
4th Floor, Block D, South Wing
SE1 7EH, London
United Kingdom

email : tiphaine.martin at kcl.ac.uk
Fax: +44 (0) 207 188 6761
#
I don't know exactly how you are triggering this warning.  If you are
able to prefilter your content before serializing, that may be best.
You have very little chance, I believe, of getting an institutional
guarantee that only ASCII will go into their emissions.
The following is from the gwascat package.

fixNonASCII = function(df) {
 # TRUE if any element of x changes, or fails, when converted to ASCII
 hasNonASCII = function(x) {
   asc = iconv(x, "latin1", "ASCII")
   any(asc != x | is.na(asc))
   }
 # flag the columns that contain non-ASCII content
 havebad = sapply(df, function(x) hasNonASCII(x))
 if (!(any(havebad))) return(df)
 message("NOTE: input data had non-ASCII characters replaced by '*'.")
 badinds = which(havebad)
 # substitute '*' for the offending bytes in each flagged column
 for (i in 1:length(badinds))
   df[,badinds[i]] = iconv(df[,badinds[i]], to="ASCII", sub="*")
 df
}
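
As a rough illustration of the same idea on a toy data frame, one
might check each column byte by byte and then substitute with iconv.
The data frame `df` and the helper `isAllASCII` below are made up for
this sketch; they are not part of gwascat or Gviz, and the input is
assumed to be UTF-8 encoded.

```r
# Toy stand-in for a Biomart-derived table (hypothetical data).
df <- data.frame(
  id   = c("a", "b"),
  cell = c("Schwann cell", "M\u00fcller cell"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if every byte of every string is 7-bit ASCII.
isAllASCII <- function(x) {
  all(vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
             logical(1)))
}

# Which columns are clean before the fix?
before <- vapply(df, isAllASCII, logical(1))

# Replace non-convertible bytes with "*" (assumes the input is UTF-8).
df$cell <- iconv(df$cell, from = "UTF-8", to = "ASCII", sub = "*")

# After substitution every column should pass the byte-level check.
after <- vapply(df, isAllASCII, logical(1))
```

Since `sub` is applied to non-convertible bytes, a multi-byte UTF-8
character may be replaced by more than one `*`.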



On Sun, Oct 12, 2014 at 2:14 PM, Martin, Tiphaine <tiphaine.martin at kcl.ac.uk

  
  
#
Hi Tiphaine,
You can follow Vince's advice and transform all the data into proper ASCII
characters. Or you can just get rid of the culprit (being the @biomart slot
of the object) before serialising. The easiest way to do that is:
foo@biomart <- NULL
The slot is only present to cache the BiomaRt connection, which is lost
anyways when serialising. The object is smart enough to realise that and
just reconnects the next time it is plotted. That is how I handled things
for the serialised BiomartGeneRegionTracks in Gviz.
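
A minimal sketch of that workflow, using a toy S4 class rather than the
real Gviz classes so that it runs without a Biomart connection. The
class names `ListOrNULL` and `ToyTrack` are hypothetical stand-ins; the
sketch assumes, as with the real track, that the slot is allowed to
hold NULL.

```r
library(methods)

# Hypothetical stand-ins for the Gviz classes: the 'biomart' slot may
# hold either a cached connection-like list or NULL.
setClassUnion("ListOrNULL", c("list", "NULL"))
setClass("ToyTrack",
         representation(name = "character", biomart = "ListOrNULL"))

track <- new("ToyTrack", name = "ENSEMBL",
             biomart = list(description = "M\u00fcller cell"))

# Drop the cached slot before serialising.
track@biomart <- NULL

# Save to a temporary file, as one would for a package data set.
path <- tempfile(fileext = ".rda")
save(track, file = path)

# Reload into a fresh environment and confirm the slot is empty.
env <- new.env()
load(path, envir = env)
is.null(env$track@biomart)
```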
Florian
On 12/10/14 20:35, "Vincent Carey" <stvjc at channing.harvard.edu> wrote:

            
#
Both methods work well.
Thanks,
Tiphaine