Merge sort - R-help | R Mailing Lists

Gaston

Tue, Apr 19, 2016 12:39 PM #

Hello everyone,

I am learning R since recently, and as a small exercise I wanted to 
write a recursive mergesort. I was extremely surprised to discover that 
my sorting, although operational, is deeply inefficient in time. Here is 
my code :

I tried my best to write it "the R-way", but apparently I failed. I 
suppose some of the functions I used are quite heavy. I would be 
grateful if you could give a hint on how to change that!

I hope I made myself clear and wish you a nice day,

Cheers,

Gaston

Duncan Murdoch

Tue, Apr 19, 2016 2:51 PM #

On 19/04/2016 3:39 PM, Gaston wrote:

Your use of is.na() looks strange.  I don't understand why you are 
testing element 2 in mergesort(), and element 1 in merge(), and element 
3 in division.  Are you using it to test the length?  It's better to use 
the length() function for that.

The division() function returns a matrix.  It would make more R-sense to 
return a list containing the two parts, because they might not be the 
same length.

Generally speaking, function calls are expensive in R, so the recursive 
merge you're using looks like it would be the bottleneck.  You'd almost 
certainly be better off to allocate something of length(x) + length(y), 
and do the assignments in a loop.

Here's a merge sort I wrote as an illustration in a class.  It's 
designed for clarity rather than speed, but I'd guess it would be faster 
than yours:

mergesort <- function(x) {

   n <- length(x)
   if (n < 2) return(x)

   # split x into two pieces of approximately equal size, x1 and x2

   x1 <- x[1:(n %/% 2)]
   x2 <- x[(n %/% 2 + 1):n]

   # sort each of the pieces
   x1 <- mergesort(x1)
   x2 <- mergesort(x2)

   # merge them back together
   result <- c()
   i <- 0
   while (length(x1) > 0 && length(x2) > 0) {
     # compare the first values
     if (x1[1] < x2[1]) {
       result[i + 1] <- x1[1]
       x1 <- x1[-1]
     } else {
       result[i + 1] <- x2[1]
       x2 <- x2[-1]
     }
     i <- i + 1
   }

   # put the smaller one into the result
   # delete it from whichever vector it came from
   # repeat until one of x1 or x2 is empty
   # copy both vectors (one is empty!) onto the end of the results
   result <- c(result, x1, x2)
   result
}

If I were going for speed, I wouldn't modify the x1 and x2 vectors, and 
I'd pre-allocate result to the appropriate length, rather than growing 
it in the while loop.  But that was a different class!

Duncan Murdoch

Gaston

Wed, Apr 20, 2016 4:38 AM #

I indeed used is.na() to check length, as I was not sure weather 
lenght() was a simple query or would go through the whole vector to 
count the elements.

So to sum up, function calls are expensive, therefore recursion should 
be avoided, and growing the size of a vector (which is probably 
reassigning and copying?) is also expensive.

Thank you for your help!

On 04/19/2016 11:51 PM, Duncan Murdoch wrote:

On 19/04/2016 3:39 PM, Gaston wrote:

Hello everyone,

I am learning R since recently, and as a small exercise I wanted to
write a recursive mergesort. I was extremely surprised to discover that
my sorting, although operational, is deeply inefficient in time. Here is
my code :

merge <- function(x,y){
   if (is.na(x[1])) return(y)
   else if (is.na(y[1])) return(x)
   else if (x[1]<y[1]) return(c(x[1],merge(x[-1],y)))
   else return(c(y[1],merge(x,y[-1])))
}

division <- function(x){
   if (is.na(x[3])) return(cbind(x[1],x[2]))
   else
return(cbind(c(x[1],division(x[-c(1,2)])[,1]),c(x[2],division(x[-c(1,2)])[,2]))) 

}

mergesort <- function(x){
   if (is.na(x[2])) return(x)
   else{
     print(x)
     t=division(x)
     return(merge(mergesort(t[,1]),mergesort(t[,2])))
   }
}

I tried my best to write it "the R-way", but apparently I failed. I
suppose some of the functions I used are quite heavy. I would be
grateful if you could give a hint on how to change that!

I hope I made myself clear and wish you a nice day,

Your use of is.na() looks strange.  I don't understand why you are 
testing element 2 in mergesort(), and element 1 in merge(), and 
element 3 in division.  Are you using it to test the length?  It's 
better to use the length() function for that.

The division() function returns a matrix.  It would make more R-sense 
to return a list containing the two parts, because they might not be 
the same length.

Generally speaking, function calls are expensive in R, so the 
recursive merge you're using looks like it would be the bottleneck.  
You'd almost certainly be better off to allocate something of 
length(x) + length(y), and do the assignments in a loop.

Here's a merge sort I wrote as an illustration in a class.  It's 
designed for clarity rather than speed, but I'd guess it would be 
faster than yours:

mergesort <- function(x) {

  n <- length(x)
  if (n < 2) return(x)

  # split x into two pieces of approximately equal size, x1 and x2

  x1 <- x[1:(n %/% 2)]
  x2 <- x[(n %/% 2 + 1):n]

  # sort each of the pieces
  x1 <- mergesort(x1)
  x2 <- mergesort(x2)

  # merge them back together
  result <- c()
  i <- 0
  while (length(x1) > 0 && length(x2) > 0) {
    # compare the first values
    if (x1[1] < x2[1]) {
      result[i + 1] <- x1[1]
      x1 <- x1[-1]
    } else {
      result[i + 1] <- x2[1]
      x2 <- x2[-1]
    }
    i <- i + 1
  }

  # put the smaller one into the result
  # delete it from whichever vector it came from
  # repeat until one of x1 or x2 is empty
  # copy both vectors (one is empty!) onto the end of the results
  result <- c(result, x1, x2)
  result
}

If I were going for speed, I wouldn't modify the x1 and x2 vectors, 
and I'd pre-allocate result to the appropriate length, rather than 
growing it in the while loop.  But that was a different class!

Duncan Murdoch

Duncan Murdoch

Wed, Apr 20, 2016 5:02 AM #

On 20/04/2016 7:38 AM, Gaston wrote:

length() is a simple query, and is very fast.  The other problem in your 
approach (which may not be a problem with your current data) is that NA 
is commonly used as an element of a vector to represent a missing value.

"Avoided" may be too strong:  speed isn't always a concern, sometimes 
clarity is more important.  Growing vectors is definitely expensive.

Duncan Murdoch

Thank you for your help!



On 04/19/2016 11:51 PM, Duncan Murdoch wrote:

On 19/04/2016 3:39 PM, Gaston wrote:

Hello everyone,

I am learning R since recently, and as a small exercise I wanted to
write a recursive mergesort. I was extremely surprised to discover that
my sorting, although operational, is deeply inefficient in time. Here is
my code :

merge <- function(x,y){
    if (is.na(x[1])) return(y)
    else if (is.na(y[1])) return(x)
    else if (x[1]<y[1]) return(c(x[1],merge(x[-1],y)))
    else return(c(y[1],merge(x,y[-1])))
}

division <- function(x){
    if (is.na(x[3])) return(cbind(x[1],x[2]))
    else
return(cbind(c(x[1],division(x[-c(1,2)])[,1]),c(x[2],division(x[-c(1,2)])[,2])))

}

mergesort <- function(x){
    if (is.na(x[2])) return(x)
    else{
      print(x)
      t=division(x)
      return(merge(mergesort(t[,1]),mergesort(t[,2])))
    }
}

I tried my best to write it "the R-way", but apparently I failed. I
suppose some of the functions I used are quite heavy. I would be
grateful if you could give a hint on how to change that!

I hope I made myself clear and wish you a nice day,

Your use of is.na() looks strange.  I don't understand why you are
testing element 2 in mergesort(), and element 1 in merge(), and
element 3 in division.  Are you using it to test the length?  It's
better to use the length() function for that.

The division() function returns a matrix.  It would make more R-sense
to return a list containing the two parts, because they might not be
the same length.

Generally speaking, function calls are expensive in R, so the
recursive merge you're using looks like it would be the bottleneck.
You'd almost certainly be better off to allocate something of
length(x) + length(y), and do the assignments in a loop.

Here's a merge sort I wrote as an illustration in a class.  It's
designed for clarity rather than speed, but I'd guess it would be
faster than yours:

mergesort <- function(x) {

   n <- length(x)
   if (n < 2) return(x)

   # split x into two pieces of approximately equal size, x1 and x2

   x1 <- x[1:(n %/% 2)]
   x2 <- x[(n %/% 2 + 1):n]

   # sort each of the pieces
   x1 <- mergesort(x1)
   x2 <- mergesort(x2)

   # merge them back together
   result <- c()
   i <- 0
   while (length(x1) > 0 && length(x2) > 0) {
     # compare the first values
     if (x1[1] < x2[1]) {
       result[i + 1] <- x1[1]
       x1 <- x1[-1]
     } else {
       result[i + 1] <- x2[1]
       x2 <- x2[-1]
     }
     i <- i + 1
   }

   # put the smaller one into the result
   # delete it from whichever vector it came from
   # repeat until one of x1 or x2 is empty
   # copy both vectors (one is empty!) onto the end of the results
   result <- c(result, x1, x2)
   result
}

If I were going for speed, I wouldn't modify the x1 and x2 vectors,
and I'd pre-allocate result to the appropriate length, rather than
growing it in the while loop.  But that was a different class!

Duncan Murdoch