OK, so you want to find some summary statistics for each column, where
some columns could be completely missing.
Writing a small wrapper should help. When you use apply(), you are
actually applying a function to every column (or row). First, let us
simulate a dataset with 15 days/rows and 10 stations/columns
### simulate data
set.seed(1) # for reproducibility
mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
mat[ mat > 45 ] <- NA # create some missing values
mat[ ,9 ] <- NA # station 9's data is completely missing
Here are two example of such wrappers :
find.stats1 <- function( data, threshold=c(37,39,41) ){
n <- length(threshold)
out <- matrix( nrow=(n + 1), ncol=ncol(data) ) # initialise
out[1, ] <- apply(data, 2, function(x)
ifelse( all(is.na(x)), NA, max(x, na.rm=T) ))
for(i in 1:n) out[ i+1, ] <- colSums( data > threshold[i], na.rm=T )
rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
colnames(out) <- rownames(data) # name of the stations
return( out )
}
find.stats2 <- function( data, threshold=c(37,39,41) ){
n <- length(threshold)
excess <- numeric( n )
out <- matrix( nrow=(n + 1), ncol=ncol(data) ) # initialise
good <- which( apply( data, 2, function(x) !all(is.na(x)) ) )
# colums that are not completely missing
out[ , good] <- apply( data[ , good], 2, function(x){
m <- max( x, na.rm=T )
for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE ) }
return( c(m, excess) )
} )
rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
colnames(out) <- rownames(data) # name of the stations
return( out )
}
find.stats1( mat )
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
daily_max 44 42 39 41 45 43 42 45 NA 42
above_37 2 1 2 1 3 2 2 1 0 1
above_39 2 1 0 1 3 2 1 1 0 1
above_41 2 1 0 0 2 2 1 1 0 1
find.stats2( mat )
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
daily_max 44 42 39 41 45 43 42 45 NA 42
above_37 2 1 2 1 3 2 2 1 NA 1
above_39 2 1 0 1 3 2 1 1 NA 1
above_41 2 1 0 0 2 2 1 1 NA 1
On my laptop 'find.stats1' and 'find.stats2' (which is more flexible)
takes 7 and 6 seconds respectively to execute on a dataset with 10000
stations and 365 days.
Regards, Adai
On Fri, 2005-06-10 at 20:05 +0200, Sander Oom wrote:
Dear all,
Dimitris and Andy, thanks for your great help. I have progressed to the
following code which runs very fast and effective:
mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
mat[mat>45] <- NA
mat<-NA
mat
temps <- c(35, 37, 39)
ind <- rbind(
t(sapply(temps, function(temp)
rowSums(mat > temp, na.rm=TRUE) )),
rowSums(!is.na(mat), na.rm=FALSE),
apply(mat, 1, max, na.rm=TRUE))
ind <- t(ind)
ind
However, some weather stations have missing values for the whole year.
Unfortunately, the code breaks down (when uncommenting mat<-NA).
I have tried 'ifelse' statements in the functions, but it becomes even
more of a mess. I could subset the matrix before hand, but this would
mean merging with a complete matrix afterwards to make it compatible
with other years. That would slow things down.
How can I make the code robust for rows containing all missing values?
Thanks for your help,
Sander.
Dimitris Rizopoulos wrote:
for the maximum you could use something like:
ind[, 1] <- apply(mat, 2, max)
I hope it helps.
Best,
Dimitris
----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/336899
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat/
http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
----- Original Message -----
From: "Sander Oom" <slist at oomvanlieshout.net>
To: "Dimitris Rizopoulos" <dimitris.rizopoulos at med.kuleuven.be>
Cc: <r-help at stat.math.ethz.ch>
Sent: Friday, June 10, 2005 12:10 PM
Subject: Re: [R] Replacing for loop with tapply!?
Thanks Dimitris,
Very impressive! Much faster than before.
Thanks to new found R.basic, I can simply rotate the result with
rotate270{R.basic}:
mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
temps <- c(37, 39, 41)
#################
#ind <- matrix(0, length(temps), ncol(mat))
ind <- matrix(0, 4, ncol(mat))
(startDate <- date())
[1] "Fri Jun 10 12:08:01 2005"
for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
ind[4, ] <- colMeans(max(mat))
Error in colMeans(max(mat)) : 'x' must be an array of at least two
dimensions
[1] "Fri Jun 10 12:08:02 2005"
ind <- rotate270(ind)
ind[1:10,]
V4 V3 V2 V1
1 0 56 75 80
2 0 46 53 60
3 0 50 58 67
4 0 60 72 80
5 0 59 68 76
6 0 55 67 74
7 0 62 77 93
8 0 45 57 67
9 0 57 68 75
10 0 61 66 76
However, I have not managed to get the row maximum using your
method? It
should be 50 for most rows, but my first guess code gives an error!
Any suggestions?
Sander
Dimitris Rizopoulos wrote:
maybe you are looking for something along these lines:
mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
temps <- c(37, 39, 41)
#################
ind <- matrix(0, length(temps), ncol(mat))
for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
ind
I hope it helps.
Best,
Dimitris
----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/336899
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat/
http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
----- Original Message -----
From: "Sander Oom" <slist at oomvanlieshout.net>
To: <r-help at stat.math.ethz.ch>
Sent: Friday, June 10, 2005 10:50 AM
Subject: [R] Replacing for loop with tapply!?
Dear all,
We have a large data set with temperature data for weather stations
across the globe (15000 stations).
For each station, we need to calculate the number of days a certain
temperature is exceeded.
So far we used the following S code, where mat88 is a matrix
containing
rows of 365 daily temperatures for each of 15000 weather stations:
m <- 37
n <- 2
outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
for(i in 1:nrow(mat88)) {
# i <- 3
row1 <- as.data.frame(df88[i, ])
temprow37 <- select.rows(row1, row1 > m)
temprow39 <- select.rows(row1, row1 > m + n)
temprow41 <- select.rows(row1, row1 > m + 2 * n)
outmat88[i, 1] <- max(row1, na.rm = T)
outmat88[i, 2] <- count.rows(temprow37)
outmat88[i, 3] <- count.rows(temprow39)
outmat88[i, 4] <- count.rows(temprow41)
}
outmat88
We have transferred the data to a more potent Linux box running R,
but
still hope to speed up the code.
I know a for loop should be avoided when looking for speed. I also
know
the answer is in something like tapply, but my understanding of
these
commands is still to limited to see the solution. Could someone
show
me
the way!?
Thanks in advance,
Sander.