Skip to content

Replacing for loop with tapply!?

9 messages · Dimitris Rizopoulos, Kjetil Halvorsen, Adaikalavan Ramasamy +2 more

#
Dear all,

We have a large data set with temperature data for weather stations 
across the globe (15000 stations).

For each station, we need to calculate the number of days a certain 
temperature is exceeded.

So far we used the following S code, where mat88 is a matrix containing 
rows of 365 daily temperatures for each of 15000 weather stations:

	m <- 37
	n <- 2
	outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
	for(i in 1:nrow(mat88)) {
		# i <- 3
		row1 <- as.data.frame(df88[i,  ])
		temprow37 <- select.rows(row1, row1 > m)
		temprow39 <- select.rows(row1, row1 > m + n)
		temprow41 <- select.rows(row1, row1 > m + 2 * n)
		outmat88[i, 1] <- max(row1, na.rm = T)
		outmat88[i, 2] <- count.rows(temprow37)
		outmat88[i, 3] <- count.rows(temprow39)
		outmat88[i, 4] <- count.rows(temprow41)
	}
	outmat88

We have transferred the data to a more potent Linux box running R, but 
still hope to speed up the code.

I know a for loop should be avoided when looking for speed. I also know 
the answer is in something like tapply, but my understanding of these 
commands is still to limited to see the solution. Could someone show me 
the way!?

Thanks in advance,

Sander.
#
maybe you are looking for something along these lines:

mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
temps <- c(37, 39, 41)
#################
ind <- matrix(0, length(temps), ncol(mat))
for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
ind


I hope it helps.

Best,
Dimitris

----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/336899
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat/
     http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm


----- Original Message ----- 
From: "Sander Oom" <slist at oomvanlieshout.net>
To: <r-help at stat.math.ethz.ch>
Sent: Friday, June 10, 2005 10:50 AM
Subject: [R] Replacing for loop with tapply!?
#
Thanks Dimitris,

Very impressive! Much faster than before.

Thanks to new found R.basic, I can simply rotate the result with 
rotate270{R.basic}:

 > mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
 > temps <- c(37, 39, 41)
 > #################
 > #ind <- matrix(0, length(temps), ncol(mat))
 > ind <- matrix(0, 4, ncol(mat))
 > (startDate <- date())
[1] "Fri Jun 10 12:08:01 2005"
 > for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
 > ind[4, ] <- colMeans(max(mat))
Error in colMeans(max(mat)) : 'x' must be an array of at least two 
dimensions
 > (endDate <- date())
[1] "Fri Jun 10 12:08:02 2005"
 > ind <- rotate270(ind)
 > ind[1:10,]
    V4 V3 V2 V1
1   0 56 75 80
2   0 46 53 60
3   0 50 58 67
4   0 60 72 80
5   0 59 68 76
6   0 55 67 74
7   0 62 77 93
8   0 45 57 67
9   0 57 68 75
10  0 61 66 76

However, I have not managed to get the row maximum using your method? It 
should be 50 for most rows, but my first guess code gives an error!

Any suggestions?

Sander
Dimitris Rizopoulos wrote:

  
    
#
Sander Oom wrote:

            
What you need is not tapply but apply. Something like
   apply(mat88, 1, function(x) sum(x > 30))

where your treshold should replace 30 and the `1' refers to rows. For 
multiple tresholds:

apply(mat88, 1, function(x) c( sum(x>20), sum(x>25), sum(x>30)))

Kjetil
-- 

Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
               --  Mahdi Elmandjra
#
for the maximum you could use something like:

ind[, 1] <- apply(mat, 2, max)

I hope it helps.

Best,
Dimitris

----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/336899
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat/
     http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm



----- Original Message ----- 
From: "Sander Oom" <slist at oomvanlieshout.net>
To: "Dimitris Rizopoulos" <dimitris.rizopoulos at med.kuleuven.be>
Cc: <r-help at stat.math.ethz.ch>
Sent: Friday, June 10, 2005 12:10 PM
Subject: Re: [R] Replacing for loop with tapply!?
#
Dear all,

Dimitris and Andy, thanks for your great help. I have progressed to the 
following code which runs very fast and effective:

mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
mat[mat>45] <- NA
mat<-NA
mat
temps <- c(35, 37, 39)
ind <- rbind(
     t(sapply(temps, function(temp)
       rowSums(mat > temp, na.rm=TRUE) )),
     rowSums(!is.na(mat), na.rm=FALSE),
     apply(mat, 1, max, na.rm=TRUE))
ind <- t(ind)
ind

However, some weather stations have missing values for the whole year. 
Unfortunately, the code breaks down (when uncommenting mat<-NA).

I have tried 'ifelse' statements in the functions, but it becomes even 
more of a mess. I could subset the matrix before hand, but this would 
mean merging with a complete matrix afterwards to make it compatible 
with other years. That would slow things down.

How can I make the code robust for rows containing all missing values?

Thanks for your help,

Sander.
Dimitris Rizopoulos wrote:
#
OK, so you want to find some summary statistics for each column, where
some columns could be completely missing. 

Writing a small wrapper should help. When you use apply(), you are
actually applying a function to every column (or row). First, let us
simulate a dataset with 15 days/rows and 10 stations/columns 

### simulate data
set.seed(1)            # for reproducibility 
mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)  
mat[ mat > 45 ] <- NA  # create some missing values
mat[ ,9 ]       <- NA  # station 9's data is completely missing


Here are two example of such wrappers :

find.stats1 <- function( data, threshold=c(37,39,41) ){
  
  n   <- length(threshold)
  out <- matrix(  nrow=(n + 1), ncol=ncol(data) ) # initialise

  out[1, ] <- apply(data, 2, function(x) 
                         ifelse( all(is.na(x)), NA, max(x, na.rm=T) ))

  for(i in 1:n) out[ i+1, ] <- colSums( data > threshold[i], na.rm=T )
  
  rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
  colnames(out) <- rownames(data)  # name of the stations
  return( out )
}
  
find.stats2 <- function( data, threshold=c(37,39,41) ){
  
  n      <- length(threshold)
  excess <- numeric( n )
  out    <- matrix(  nrow=(n + 1), ncol=ncol(data) ) # initialise
  good   <- which( apply( data, 2, function(x) !all(is.na(x)) ) )
  # colums that are not completely missing
 
  out[ , good] <- apply( data[ , good], 2, function(x){
    m <- max( x, na.rm=T )
    for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE ) }
    return( c(m, excess) )
  } ) 
  
  rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
  colnames(out) <- rownames(data)  # name of the stations
  return( out )
}

find.stats1( mat )
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
daily_max   44   42   39   41   45   43   42   45   NA    42
above_37     2    1    2    1    3    2    2    1    0     1
above_39     2    1    0    1    3    2    1    1    0     1
above_41     2    1    0    0    2    2    1    1    0     1

find.stats2( mat )
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
daily_max   44   42   39   41   45   43   42   45   NA    42
above_37     2    1    2    1    3    2    2    1   NA     1
above_39     2    1    0    1    3    2    1    1   NA     1
above_41     2    1    0    0    2    2    1    1   NA     1


On my laptop 'find.stats1' and 'find.stats2' (which is more flexible)
takes 7 and 6 seconds respectively to execute on a dataset with 10000
stations and 365 days.

Regards, Adai
On Fri, 2005-06-10 at 20:05 +0200, Sander Oom wrote:
#
Hi
On 10 Jun 2005 at 20:05, Sander Oom wrote:

            
By this you redefine mat as
logi NA
and your code gives an error that it has to have some dimensions

+      apply(mat, 1, max, na.rm=TRUE))
Error in rowSums(mat > temp, na.rm = TRUE) : 
        'x' must be an array of at least two dimensions
If your matrix has one row full of NA's it only complains but 
computes a value.
+      t(sapply(temps, function(temp)
+        rowSums(mat > temp, na.rm=TRUE) )),
+      rowSums(!is.na(mat), na.rm=FALSE),
+      apply(mat, 1, max, na.rm=TRUE))
Warning message:
no finite arguments to max; returning -Inf
[,1] [,2] [,3] [,4] [,5]
 [1,]    5    5    3    9   48
 [2,]    1    1    1    9   42
 [3,]    0    0    0    0 -Inf
which(rowSums(!is.na(mat))==0) 
This gives you indices which lines of your matrix has all values NA 
and you can use it for fine tuning of your code. What you need to 
do depends on what results do you want, how ind matrix should 
look like after processing mat with one or more rows full of NA's.

HTH
Petr
Petr Pikal
petr.pikal at precheza.cz
1 day later
#
Dear Adaikalavan,

Your solution (the second function) is definitely the most elegant and 
generic solution of all replies in this discussion. Robust for missing 
values and flexible to allow as many calculations as desired! It is so 
clear, I even managed to hack it (of course also thanks to the new 
insight from all the other posts)!

As the data consists of weather stations in rows and days in columns, I 
have adapted the function to work on rows instead of columns. Did not 
manage to get the results directly into the right rows/cols layout, so a 
transpose (t) is still required. However this seems instant, so does not 
mean a reduction in speed! Calculating proportions is now a snip!!

Thanks for you help,

Sander.

### simulate data
set.seed(1)            # for reproducibility
mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
mat[ mat > 45 ] <- NA  # create some missing values
mat[ 9, ]       <- NA  # station 9's data is completely missing
mat

find.stats <- function( data, threshold ){

   n      <- length(threshold)
   excess <- numeric( n )
   out    <- matrix( ncol=nrow(data), nrow=(n + 2) ) # initialise
   good   <- which( apply( data, 1, function(x) !all(is.na(x)) ) )
   # rows that are not completely missing

   out[ ,good ] <- apply( data[ good, ], 1, function(x){
     m <- max( x, na.rm=T )
     # determine maximum value per row
     c <- length(x[!is.na(x)])
     # determine number of non-missing values
     for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE 
)/length(x[!is.na(x)]) }
     # calc proportion of non-missing values over multiple thresholds
     return( c(m, c, excess) )
   } )

   rownames(out) <- c( "TmpMax", "Count", paste("Over", threshold, sep="") )
   colnames(out) <- rownames(data)  # name of the stations
   return( t(out) )
}

lstTemps=c(37,39,41,43)
tmp <- find.stats( mat, lstTemps )
tmp
Adaikalavan Ramasamy wrote: