dataframe: string operations on columns

9 messages · boris pezzatti, Hadley Wickham, Peter Ehlers +4 more

#
Dear all,
how can I apply a string operation like strsplit(x, " ") to a column
of a data frame, and put the first or the second item of each split into
a new data frame column, so that every result stays on the same row as
the string it came from?

Thanks
Boris
#
Hi,

I guess it's not the nicest way to do it, but it should work for you:

#create some sample data
df <- data.frame(a=c("A B", "C D", "A C", "A D", "B D"), 
stringsAsFactors=FALSE)
#split the column by space
df_split <- strsplit(df$a, split=" ")

#place the first element into column a1 and the second into a2
for (i in 1:length(df_split[[1]])){
  df[i+1] <- unlist(lapply(df_split, FUN=function(x) x[i]))
  names(df)[i+1] <- paste("a",i,sep="")
}

I hope people will give you more compact solutions.
HTH,
Ivan



On 1/18/2011 16:30, boris pezzatti wrote:
#
Have a look at str_split_fixed in the stringr package.
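
A minimal sketch of how that suggestion might look, reusing the sample data from the earlier reply. It assumes the stringr package is installed; the column names a1 and a2 are just the ones used elsewhere in this thread. str_split_fixed() returns a character matrix with exactly n columns, so the pieces can be assigned column by column.

```r
# Sketch only: needs the stringr package (install.packages("stringr")).
df <- data.frame(a = c("A B", "C D", "A C", "A D", "B D"),
                 stringsAsFactors = FALSE)

if (requireNamespace("stringr", quietly = TRUE)) {
  # One row per input string, exactly n = 2 columns.
  # Rows with fewer than n pieces are padded with "".
  parts <- stringr::str_split_fixed(df$a, " ", n = 2)
  df$a1 <- parts[, 1]
  df$a2 <- parts[, 2]
  print(df)
}
```

Because the result always has n columns, this does not fail on rows with a different number of pieces, but it also does not report them.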

Hadley
#
On 2011-01-18 08:14, Ivan Calandra wrote:
You can replace the loop with

  df <- transform(df, a1 = sapply(df_split, "[[", 1),
                      a2 = sapply(df_split, "[[", 2))

Peter Ehlers
#
df <- cbind(df, do.call(rbind, df_split))

seems to do the same (up to column names) but faster. However,
all the solutions rely on there being exactly two strings when
you split. The different solutions behave differently if this
assumption is violated and none of them really checks this. You
can, for instance, check this with all(sapply(df_split, length) == 2)
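
As a rough illustration of that check, here is a sketch on a made-up ragged input (the vector x is hypothetical, not from the original question). It also shows one way to extract pieces that pads short rows with NA, since sapply(df_split, "[[", 2) would stop with an error on a row that has only one piece.

```r
x <- c("A B", "C D E", "F")           # hypothetical input with 2, 3 and 1 pieces
pieces <- strsplit(x, split = " ")

n_pieces <- lengths(pieces)           # number of pieces in each row
all_two  <- all(n_pieces == 2)        # the check suggested above
which(n_pieces != 2)                  # rows that violate the assumption

# Single-bracket indexing past the end of a vector yields NA,
# so short rows are padded instead of raising an error.
first  <- vapply(pieces, function(p) p[1], character(1))
second <- vapply(pieces, function(p) p[2], character(1))
```

Whether padding with NA or stopping with an error is the right behaviour depends on the data, so it may be worth running the check before choosing an extraction method.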

Best, Niels R. Hansen
#
Assuming every row is split into exactly two values by whatever string 
you choose as split, one fancy exercise in R data structures is

     dfsplit = function(df, split)
         as.data.frame(
             t(
                 structure(dim=c(2, nrow(df)),
                     unlist(
                         strsplit(split=split,
                             as.matrix(df))))))

so that if your data frame is

     df = data.frame(c('1 2', '3 4', '5 6'))

then

     dfsplit(df, ' ')
     #   V1 V2
     # 1  1  2
     # 2  3  4
     # 3  5  6

renaming the columns left as an exercise.
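
For what it is worth, one possible take on that exercise is sketched below. It builds the same V1/V2 result with do.call(rbind, ...) rather than dfsplit(), and the names "first" and "second" are arbitrary placeholders.

```r
x <- c("1 2", "3 4", "5 6")
res <- as.data.frame(do.call(rbind, strsplit(x, split = " ")),
                     stringsAsFactors = FALSE)
names(res)                            # default names: "V1" "V2"

# Replace the default names in one assignment.
names(res) <- c("first", "second")
```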

vQ
On 01/18/2011 05:22 PM, Peter Ehlers wrote:
#
Well, my solution with the loop might be slower (even though I don't see 
any difference with my system, at least with up to 100 lines and 3 
strings to separate), but it works whatever the number of strings.
But I should have renamed the columns outside of the loop:
names(df)[2:3] <- paste("a", 1:2, sep="")  ## or a more general solution for the indexes
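
A possible sketch of such a general version, under the assumption that new columns should be named a1, a2, ... and that short rows should be padded with NA. The three-piece sample data is made up for illustration.

```r
df <- data.frame(a = c("A B C", "D E F", "G H I"), stringsAsFactors = FALSE)
pieces <- strsplit(df$a, split = " ")

k <- max(lengths(pieces))             # the widest row decides the column count

# One matrix row per data frame row; rows with fewer than k pieces get NA.
mat <- do.call(rbind, lapply(pieces, function(p) p[seq_len(k)]))

# Name all new columns at once, outside any loop.
colnames(mat) <- paste("a", seq_len(k), sep = "")
df[colnames(mat)] <- as.data.frame(mat, stringsAsFactors = FALSE)
```

With this input the result has columns a, a1, a2 and a3. The sketch does not handle an empty data frame, where max() has nothing to work on.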

Ivan


On 1/19/2011 01:42, Niels Richard Hansen wrote: