Dear all,
The 5th column of my data frame looks like this:
.$.$.$.$.$,$,$...,,,,,.,,.,,...,,,,.,,....,,,T...,,,,,,,,,,,.,,,,,....,,...,,
,..,,....,,,,,...,,,..,,......,,,,,,,....,,,.,,,,....,,...G.,,,,,,,,...,,,,,,.,,
,t.,,c,,.a.,,,.A,,,,....,,,.....,,,,..........,,,,,..,,,.,,,....,,,,,...,,,$....
.,,,,..,,,...,,,,,..,,,,,,.............$..,,,,,,...,,..,,$,...,,,,,,,....,,,,,,.
,,,,......,,,,.,,.......,.....,,,,,,.,,..,,...,,,,,.,......,.......,,....,,,,..,,
,,,,.........,,,,,.....,,,,...,,,.....,,.....,,......,....,,......,.,,..,,,,...,,
H.,,,..,,.....,,,,..,,,,,,,,,^~.^~.^\".^~.^~.^~.^~,^~,^~,^~,"
I want to keep only A, a, C, c, G, g, T, t, dots and commas in this column.
For example, the first row should become:
.....,,...,,,,,.,,.,,...,,,,.,,....,,,T...,,,,,,,,,,,.,,,,,....,,...,,
Currently I am using this code:
df$V5 <- apply(df, 1, function(x) gsub("\\:|\\$|\\^|!|\\-|1|2|3|4|5|6|7|8|10|~|H", "",x[5]))
This use of gsub looks odd to me. The result is correct, but I want something faster because the data is large. I want something like this:
delete everything except A, a, C, c, G, g, T, t, dots and commas.
Any suggestions, please?
Thanking you,
Warm Regards
Vikas Bansal
MSc Bioinformatics
King's College London
Removing funny characters from a column of a data frame
2 messages · Bansal, Vikas, Joshua Wiley
Hi Vikas,
You're overworking yourself here: gsub is vectorized!
df$V5 <- gsub("[^AaCcGgTt\\.,]", "", df$V5)
This will be *substantially* faster than looping (using apply) over
every row of your data frame, since you only care about the 5th column
anyway. Also, I switched your regexp for one that removes every
character that is not one of AaCcGgTt.,
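To make the comparison concrete, here is a minimal sketch on a made-up data frame. The values in V5 and the names `old` and `new` are illustrative assumptions, not your real data. It runs your row-by-row version and the vectorized call side by side and checks that they agree. Inside a character class the dot is already literal, so the sketch writes it without the backslashes.

```r
# Toy data frame with five columns; the V5 values are invented for illustration
df <- data.frame(
  V1 = 1:2,
  V2 = c("a", "b"),
  V3 = c("x", "y"),
  V4 = c(10, 20),
  V5 = c(".$.$,$..T^~.", "H.,,a-1c^~,G$"),
  stringsAsFactors = FALSE
)

# Row-by-row approach from the original post: one gsub() call per row
old <- apply(df, 1, function(x)
  gsub("\\:|\\$|\\^|!|\\-|1|2|3|4|5|6|7|8|10|~|H", "", x[5]))

# Vectorized approach: a single gsub() call over the whole column,
# dropping every character that is not in the allowed set
new <- gsub("[^AaCcGgTt.,]", "", df$V5)

# Both approaches should give the same cleaned strings on this toy input
identical(unname(old), new)

# Write the cleaned column back
df$V5 <- new
```

On real data the two patterns can differ: the original one only removes the characters it lists, while the negated class removes anything outside the allowed set, which is what you described wanting.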
Cheers,
Josh
On Sun, Aug 7, 2011 at 12:57 PM, Bansal, Vikas <vikas.bansal at kcl.ac.uk> wrote:
[snip]
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/