You are right, but unfortunately this is not the bottleneck in the code below. The filter.const() function is only used to find the non-constant columns in the training data set, which is small (49 rows and 525 columns). It is applied only to the training set and takes about 2 seconds on my PowerBook. After filtering the training set, just the list of column names is used to filter the huge "prediction.set".

I think the really time- and memory-consuming part is the for-loop below, but I don't know how to improve it.

Anyway, thanks for the hint!

Best,
Torsten
On Aug 9, 2005, at 12:37 PM, Patrick Burns wrote:
Building up an object like you do with 'realdata' is very
wasteful (S Poetry says why). I think you want something
along the lines of:
if(vectors[1] == 'column') {
    realdata <- apply(X, 2, function(x) diff(range(x))) > tol
    filteredX <- X[, realdata]
} else {
    realdata <- apply(X, 1, function(x) diff(range(x))) > tol
    filteredX <- X[realdata, ]
}
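As a quick check of the idea above, here is a minimal sketch on a toy matrix. The data and the name `keep` are made up for illustration, and `drop = FALSE` is an addition so that the result stays a matrix even if only one column survives. Note that `tol` here is a threshold on the range of each column, whereas in the original filter.const() it is a count of values that differ from the median, so the two are only equivalent for tol = 0.

```r
# Toy matrix (hypothetical data): columns 'b' and 'd' are constant
X <- cbind(a = c(1, 2, 3), b = c(5, 5, 5), c = c(0, 1, 0), d = c(7, 7, 7))
tol <- 0

# TRUE for every column whose range (max - min) exceeds the tolerance
keep <- apply(X, 2, function(x) diff(range(x))) > tol

# drop = FALSE keeps a matrix even when a single column is selected
filteredX <- X[, keep, drop = FALSE]

print(keep)
print(colnames(filteredX))
```

If X is a data frame rather than a matrix, apply() converts it to a matrix first, which is fine for purely numeric columns.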
Patrick Burns
patrick at burns-stat.com
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and "A Guide for the Unwilling S User")
Torsten Schindler wrote:
Hi,
I'm an R newbie and want to speed up the following pre-filtering
step of a data set with more than 115,000 rows:
#-----------------
# Function to filter out constant data columns
filter.const<-function(X, vectors=c('column', 'row'), tol=0){
  realdata=c()
  filteredX<-matrix()
  if( vectors[1] == 'row' ){
    for( row in (1:nrow(X)) ){
      if( length(which(X[row,]!=median(X[row,])))>tol ){
        realdata[length(realdata)+1]=row
      }
    }
    filteredX=X[realdata,]
  } else if( vectors[1] == 'column' ){
    for( col in (1:ncol(X)) ){
      if( length(which(X[,col]!=median(X[,col])))>tol ){
        realdata[length(realdata)+1]=col
      }
    }
    filteredX=X[,realdata]
  }
  return(list(x=filteredX, ix=realdata))
}
#-----------------
# Filter out all all-constant columns in my training data set
#
# Read training data set with class information in the first column
training <- read.csv('training_data.txt')
dim(training) # => 49 rows and 525 columns
# Prepare column names by stripping the underscore and the number at the end
colnames(training) <- sub('_\\d+$', '', colnames(training), perl=TRUE)
# Filter out the all-constant columns, excluding column 1, the class column called myclass
training.filter <- filter.const(training[,-1])
# The filtered data frame is
training.filtered <- cbind(myclass=training[,1], training.filter$x)
dim(training.filtered) # => 49 rows and 250 columns
# Save the filtered training set for later use in classification
filtered.data <- 'training_set_filtered.Rdata'
save(training.filtered, file=filtered.data)
#-----------------
# THE FOLLOWING FILTERING STEP TAKES 3 HOURS ON MY PowerBook
# AND CONSUMES ABOUT 600 MB OF MEMORY.
#
# I WOULD BE GRATEFUL FOR ANY HINT ON HOW TO IMPROVE THIS.
# Pre-filter the big data set (more than 115,000 rows and 524 columns) for later class predictions.
# The big data set contains the same column names as the training set, but in a different order.
input.file <- 'big_data_set.txt'
filtered.file <- 'big_data_set_filtered.txt'
# Read header with first row
prediction.set <- read.csv(input.file, header=TRUE, skip=0, nrow=1)
# Prepare column names by stripping the underscore and the number at the end
colnames(prediction.set) <- sub('_\\d+$', '', colnames(prediction.set), perl=TRUE)
prediction.set.header <- colnames(prediction.set)
# Get descriptor columns of the training data set without the Activity_Class column
training.filtered.property.colnames <- colnames(training.filtered)[-1]
# Keep only the columns that survived the training-set filter
prediction.set.filtered <- prediction.set[training.filtered.property.colnames]
dim(prediction.set.filtered) # => 1 row and 249 columns
# Write header and the first filtered row
# (write.csv always starts a new file and writes the column names itself,
# so append and col.names are not passed)
write.csv(prediction.set.filtered, file=filtered.file)
blocksize <- 1000
for (lineid in (0:120)*blocksize) {
  cat('lineid: ', lineid, '\n')
  # Read block of data
  # We have to add a dummy colname "x" to col.names when the header is not read!
  prediction.set <- try(read.csv(input.file, header=FALSE,
                                 col.names=c('x',prediction.set.header),
                                 row.names=1,
                                 skip=lineid+2, nrow=blocksize))
  if (inherits(prediction.set, "try-error")) break
  # Keep only the columns that survived the training-set filter
  prediction.set.filtered <- prediction.set[training.filtered.property.colnames]
  # Append the data
  # (I know this function is slow, but I couldn't figure out how to do it faster, so far.)
  write.table(prediction.set.filtered, file=filtered.file,
              append=TRUE, col.names=FALSE, sep=",")
}
#-------------
# Now read in the filtered data set and save it for later use in classification
prediction.set.filtered <- read.csv(filtered.file, header=TRUE,
row.names=1)
filtered.data <- 'prediction_set_filtered.Rdata'
save(prediction.set.filtered, file=filtered.data)
I would be very grateful for any hints on how to improve the code
above!
Best regards,
Torsten
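One likely reason the loop above is slow is that every read.csv() call reopens the file and has to skip all lines read so far, so the total work grows roughly with the square of the file length. A possible alternative, sketched here under assumptions, is to open a file connection once and let each read.csv() call continue from the current position. The toy input file, `keep.cols` (standing in for training.filtered.property.colnames) and the small block size are made up so that the sketch runs on its own; it has not been timed on the real data.

```r
# Hypothetical stand-in for the big file: 25 rows, row id in the first column
input.file    <- tempfile(fileext = ".txt")
filtered.file <- tempfile(fileext = ".txt")
write.csv(data.frame(p1 = 1:25, p2 = 0, p3 = 25:1), file = input.file)

keep.cols <- c("p3", "p1")   # stands in for training.filtered.property.colnames
blocksize <- 10              # would be e.g. 1000 for the real data

# Open the file once; every read below continues where the previous one stopped
con <- file(input.file, open = "r")
header <- scan(text = readLines(con, n = 1), what = "", sep = ",", quiet = TRUE)

first <- TRUE
repeat {
  block <- tryCatch(
    read.csv(con, header = FALSE, col.names = c("x", header[-1]),
             row.names = 1, nrows = blocksize),
    error = function(e) NULL)
  # At end of file read.csv either fails or returns zero rows; stop in both cases
  if (is.null(block) || nrow(block) == 0) break
  # The first block creates the file with a header line, later blocks are appended
  write.table(block[keep.cols], file = filtered.file, sep = ",",
              append = !first, col.names = if (first) NA else FALSE)
  first <- FALSE
}
close(con)
```

The output file has the same layout as before (header line, row names in the first column), so it can still be read back with read.csv(filtered.file, header=TRUE, row.names=1).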
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html