
Split a data.frame

6 messages · Christofer Bogaso, Rui Barradas, jim holtman +2 more

#
Hi,

I am struggling to split a data.frame according to the scheme below:

DF = data.frame(name = c('a', 'v', 'c'), val = 0); DF

split_str = c('a', 'c')

Now, for each element in split_str, R should find which row of DF contains
that element, and return the part of DF that starts at the row after that
element and ends at the row before the next element.

So in my case, I should see 2 data.frames

1st data.frame with name = 'v' (i.e. 2nd row of DF)

2nd data.frame with number_of_rows as 0 (as there is no row left after 'c')

Similarly, if split_str = c('v'), then my 2 data.frames will be

1st data.frame with name = 'a'
2nd data.frame with name = 'c'

Any idea how to efficiently implement the above scheme would be highly
appreciated. I tried the split() function, but it does not give the
right answer.

Thanks,
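
For reference, here is a minimal sketch of one reading of the scheme that covers both examples in the question: cut DF at every row whose name is in split_str, drop those rows, and keep empty pieces. The helper name split_between() is hypothetical and not from the thread. Note that for split_str = c('a', 'c') this reading also returns an empty leading piece (the rows before 'a'), which the question does not list.

```r
# Sketch: cut a data.frame at the rows whose `name` is in split_str,
# dropping those rows and keeping pieces that end up empty.
DF <- data.frame(name = c("a", "v", "c"), val = 0)

# split_between() is a hypothetical helper name, not from the thread
split_between <- function(data, col, s) {
  hit <- data[[col]] %in% s            # TRUE on the rows to cut at
  piece <- cumsum(hit)                 # piece number for every row
  # fixing the factor levels keeps pieces that contain no rows
  grp <- factor(piece[!hit], levels = 0:sum(hit))
  split(data[!hit, , drop = FALSE], grp)
}

res1 <- split_between(DF, "name", c("a", "c"))  # pieces "0", "1", "2"
res2 <- split_between(DF, "name", "v")          # pieces "0", "1"
```

Piece "0" holds the rows before the first split element, piece "1" the rows after it, and so on. Dropping piece "0" when it is empty would reproduce the two data.frames described for c('a', 'c').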
#
Hello,

Maybe something like the following.

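splitDF <- function(data, col, s){
     n <- nrow(data)
     inx <- which(data[[col]] %in% s)
     # each piece ends just before the next match, or at the last row
     end <- c(inx[-1] - 1, n)
     lapply(seq_along(inx), function(i){
         # k stays NULL (zero rows) when nothing follows the match
         k <- if(inx[i] < end[i]) (inx[i] + 1):end[i]
         data[k, ]
     })
}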

splitDF(DF, "name", split_str)


Hope this helps,

Rui Barradas
On 5/19/2018 12:07 PM, Christofer Bogaso wrote:
#
DF = data.frame(name = c('a', 'v', 'c'), val = 0); DF
##   name val
## 1    a   0
## 2    v   0
## 3    c   0
split_str = c('a', 'c')
# If we assume that the values in split_str are ordered in the same order
# as in the dataframe, then this might work.

offsets <- match(split_str, DF$name)
# Since you only want the rows in between

DF[diff(offsets), ]
##   name val
## 2    v   0


Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
On Sat, May 19, 2018 at 7:58 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
#
...
yes, but note that:

which(data[[col]] %in% s)

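can be replaced directly by match, with the arguments in this order:

match(s, data[[col]])

Note that this returns only the first matching row for each element of s,
in the order of s. Corner cases (nothing matches, etc.) would also have to
be checked, and the matched row numbers should probably be sorted for safety.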
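
The difference between the %in% form and match() shows up with duplicated names and with the order of the split elements. A small sketch, using a made-up data frame and a made-up absent element "q" (both are assumptions for illustration only):

```r
DF <- data.frame(name = c("b", "a", "v", "a", "c"), val = 0)
s <- c("c", "a")

# every row whose name is in s, already in increasing row order
rows_in <- which(DF$name %in% s)       # rows 2, 4, 5

# first matching row for each element of s, in the order of s
rows_match <- match(s, DF$name)        # rows 5, 2

# with the arguments swapped, the result has one entry per row of DF:
# the position in s, or NA where the row does not match
pos_in_s <- match(DF$name, s)

# sort() drops NA by default, so elements of s that never occur
# ("q" here) disappear and the cut points come out in row order
cuts <- sort(match(c(s, "q"), DF$name))  # rows 2, 5
```

Which form is appropriate depends on whether repeated names should each act as a cut point (the %in% form) or only the first occurrence should (the match form).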

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
On Sat, May 19, 2018 at 7:58 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
#
Forgot to take care of the boundary conditions:

# revised data.frame to take care of boundary conditions
DF = data.frame(name = c('b', 'a','v','z', 'c','d'), val = 0); DF
##   name val
## 1    b   0
## 2    a   0
## 3    v   0
## 4    z   0
## 5    c   0
## 6    d   0
split_str = c('a', 'c')

# If we assume that the values in split_str are ordered in
# the same order as in the dataframe, then this might work.
offsets <- match(split_str, DF$name)

# now find the values in between the offsets
ret_indx <- NULL
for (i in seq_len(length(offsets) - 1)){
  if (offsets[i + 1] - offsets[i] > 1){  # something in between
    ret_indx <- c(ret_indx, (offsets[i] + 1):(offsets[i+1] - 1))
  }
}
DF[ret_indx, ]
##   name val
## 3    v   0
## 4    z   0



Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Sat, May 19, 2018 at 4:07 AM, Christofer Bogaso <bogaso.christofer at gmail.com> wrote:
#
Hi!

How about this:

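--- snip ---

for (i in seq_len(length(split_str) - 1)) {
    # rows strictly between consecutive split elements (first occurrence of each)
    from <- match(split_str[i], DF$name) + 1
    to <- match(split_str[i + 1], DF$name) - 1
    assign(paste("DF", i, sep = ""), DF[seq_len(to - from + 1) + from - 1, ])
}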

--- snip ---

'assign' creates a new data.frame DFn for each subset, where n is a
counter (1, 2, ...).

But note: if your DF has duplicates in 'name' (e.g. two rows with 'a'
in 'DF$name'), my solution will use the first occurrence only (for both
the start and the end).
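
As an alternative to assign(), the subsets can be collected in a named list, which avoids creating a varying number of variables in the workspace. This is a different technique from the loop above, sketched under the same assumptions: every element of split_str occurs in DF$name, in the same order as in DF. The example data and the extra split element 'd' are made up for illustration.

```r
DF <- data.frame(name = c("b", "a", "v", "z", "c", "d"), val = 0)
split_str <- c("a", "c", "d")

pos <- match(split_str, DF$name)       # first occurrence of each element
n_pieces <- length(pos) - 1

pieces <- lapply(seq_len(n_pieces), function(i) {
  len <- pos[i + 1] - pos[i] - 1       # rows strictly between the two matches
  DF[pos[i] + seq_len(len), ]
})
# name the pieces DF1, DF2, ... to mirror the assign() version
names(pieces) <- sprintf("DF%d", seq_len(n_pieces))
```

The pieces are then available as pieces$DF1, pieces$DF2, and so on, and can be processed together with lapply().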

HTH,
Kimmo
On 2018-05-19 at 16:37 +0530, Christofer Bogaso wrote: