Efficient way to use data frame of indices to initialize matrix

5 messages · gene, Whit Armstrong, Greg Snow +1 more

#
I have a data frame with three columns, x, y, and a.  I want to create a matrix from these values such that for matrix m:
m[x,y] == a

Obviously, I can go row by row through the data frame and insert the value a at the correct x,y location in the matrix.  I can make that slightly more efficient (perhaps), by doing something like this:
But I feel that there must be a more efficient, or at least more elegant way to do this.

--
Gene
#
index m as a vector and do the assignment in one step

i <- df$row + (df$col-1)*nrow(m)
m[i] <- df$a

or something along those lines.
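
As a small check of the idea above, here is a sketch on a made-up 3 x 4 matrix. The data frame `df`, its column names `row`, `col`, `a`, and the values are illustrative assumptions. R stores a matrix column by column, so element [r, c] sits at linear position r + (c - 1) * nrow(m):

```r
# Hypothetical example data: row index, column index, value
df <- data.frame(row = c(1, 3, 2), col = c(2, 1, 4), a = c(10, 20, 30))
m <- matrix(0, nrow = 3, ncol = 4)

# Column-major linear index: the stride is the number of rows
i <- df$row + (df$col - 1) * nrow(m)
m[i] <- df$a

# Each value lands at its (row, col) position
stopifnot(m[1, 2] == 10, m[3, 1] == 20, m[2, 4] == 30)
```

The stride has to be the number of rows, not the number of columns; the two only coincide for a square matrix.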

-Whit
On Tue, Dec 7, 2010 at 1:31 PM, Cutler, Gene <gcutler at amgen.com> wrote:
#
tmpdf <- data.frame( x = c(1,2,3), y=c(2,3,1), a=c(10,20,30) )
mymat <- matrix(0, ncol=3, nrow=3)
mymat[ as.matrix(tmpdf[,c('x','y')]) ] <- tmpdf$a
#
On Dec 7, 2010, at 1:49 PM, Greg Snow wrote:

            
cbind is also useful for assembling the index argument to the matrix `[<-` function:

tmpdf <- data.frame( x = c(1,2,3), y=c(2,3,1), a=c(10,20,30) )
  mymat <- matrix(NA, ncol=max(tmpdf$y), nrow=max(tmpdf$x))
  mymat[ cbind(tmpdf$x,tmpdf$y) ] <- tmpdf$a

  mymat
      [,1] [,2] [,3]
[1,]   NA   10   NA
[2,]   NA   NA   20
[3,]   30   NA   NA
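
For what it is worth, the three suggestions in this thread should fill the matrix identically. The following sketch checks that on the same toy data; the names `m1`, `m2`, `m3` are only for illustration:

```r
tmpdf <- data.frame(x = c(1, 2, 3), y = c(2, 3, 1), a = c(10, 20, 30))

# Two-column (x, y) index taken from the data frame with as.matrix
m1 <- matrix(NA, nrow = 3, ncol = 3)
m1[as.matrix(tmpdf[, c("x", "y")])] <- tmpdf$a

# The same two-column index built with cbind
m2 <- matrix(NA, nrow = 3, ncol = 3)
m2[cbind(tmpdf$x, tmpdf$y)] <- tmpdf$a

# Linear (column-major) index for comparison
m3 <- matrix(NA, nrow = 3, ncol = 3)
m3[tmpdf$x + (tmpdf$y - 1) * nrow(m3)] <- tmpdf$a

stopifnot(identical(m1, m2), identical(m2, m3))
```

The index matrix needs exactly two columns, in (row, column) order, for R to treat it as a matrix index.
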
David Winsemius, MD
West Hartford, CT
#
Thanks for the three great answers!  For those who are curious, I timed the three approaches:

nr <- 15812
nc <- 64636
mymat <- matrix(nrow=nr, ncol=nc)
mymat[1,1] <- 1 # see note below

# mydf is created elsewhere
dim(mydf)
# 10910263        3
colnames(mydf)
# "x" "y" "a"

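# approach 1: linear index into the column-major matrix (the stride is nr, the number of rows)
# mymat[ mydf$x + (mydf$y-1) * nr ] <- mydf$a

# approach 2: two-column (x, y) index matrix built with as.matrix
# mymat[ as.matrix(mydf[,c("x","y")]) ] <- mydf$a

# approach 3: two-column (x, y) index matrix built with cbind
# mymat[ cbind(mydf$x, mydf$y) ] <- mydf$a


system.time( for (i in 1:10) mymat[ mydf$x + (mydf$y-1) * nr ] <- mydf$a )
system.time( for (i in 1:10) mymat[ as.matrix(mydf[,c("x","y")]) ] <- mydf$a )
system.time( for (i in 1:10) mymat[ cbind(mydf$x, mydf$y) ] <- mydf$a )


#   user  system elapsed 
# 10.478   3.837  14.317 <- #1
#  9.064   1.711  10.777 <- #2
# 10.747   2.702  13.450 <- #3

So in this run approach #2 came out fastest, though the spread is modest (about 11 to 14 seconds elapsed for ten repetitions).  A caveat on those numbers: they were recorded with nc rather than nr as the stride in #1, and with as.matrix(mydf$x, mydf$y) in #2.  The latter passes mydf$y as an ignored extra argument and so indexes by x alone, which means the #2 figure may understate the cost of a real (x, y) lookup; the calls above show the corrected forms.  Note that I found that initializing the new matrix with its first value takes about 8 elapsed seconds all on its own, which is why I have that initialization line above.  Presumably this is because matrix() with no data creates a logical NA matrix, and the first numeric assignment converts the whole matrix to double.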
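
To repeat the comparison on a smaller scale, a sketch along these lines could be used. The sizes and the randomly generated `mydf` are assumptions standing in for the original data, which is not shown in the thread, and the timings will differ by machine and R version:

```r
set.seed(1)
nr <- 200
nc <- 300
n  <- 10000

# Synthetic stand-in for mydf (hypothetical data)
mydf <- data.frame(x = sample(nr, n, replace = TRUE),
                   y = sample(nc, n, replace = TRUE),
                   a = runif(n))

# Allocate as double up front so the first assignment does not have to
# convert a logical NA matrix to numeric
mymat <- matrix(NA_real_, nrow = nr, ncol = nc)

t1 <- system.time(for (i in 1:10) mymat[mydf$x + (mydf$y - 1) * nr] <- mydf$a)
t2 <- system.time(for (i in 1:10) mymat[as.matrix(mydf[, c("x", "y")])] <- mydf$a)
t3 <- system.time(for (i in 1:10) mymat[cbind(mydf$x, mydf$y)] <- mydf$a)

# Elapsed seconds for each approach
elapsed <- c(linear = t1[["elapsed"]],
             as_matrix = t2[["elapsed"]],
             cbind = t3[["elapsed"]])
print(elapsed)
```

Passing NA_real_ to matrix() is one way to avoid the separate initialization step, since the matrix is then numeric from the start.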

--
Gene