How to pre-filter large amounts of data effectively
You are right, but unfortunately that is not the limiting step or bottleneck in the code below. The filter.const() function is only used to get the non-constant columns of the training data set, which is small to begin with (49 rows and 525 columns); filtering it takes about 2 seconds on my PowerBook. After that, just the list of retained column names is used to filter the huge "prediction.set". I think the really time- and memory-consuming part is the for-loop below, but I don't know how to improve it. Anyway, thanks for the hint!

Best,
Torsten
On Aug 9, 2005, at 12:37 PM, Patrick Burns wrote:
Building up an object like you do with 'realdata' is very
wasteful (S Poetry says why). I think you want something
along the lines of:
if (vectors[1] == 'column') {
    # keep columns whose values span more than tol
    realdata <- apply(X, 2, function(x) diff(range(x))) > tol
    filteredX <- X[, realdata]
} else {
    # keep rows whose values span more than tol
    realdata <- apply(X, 1, function(x) diff(range(x))) > tol
    filteredX <- X[realdata, ]
}
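If the criterion from the original function (more than tol values differing from the median) is the one that matters, a sketch along the same lines for the column case would be:
realdata <- apply(X, 2, function(x) sum(x != median(x)) > tol)
filteredX <- X[, realdata]
For tol = 0 both versions keep exactly the non-constant columns.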
Patrick Burns
patrick at burns-stat.com
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and "A Guide for the Unwilling S User")
Torsten Schindler wrote:
Hi,
I'm an R newbie and want to accelerate the following pre-filtering step of a data set with more than 115,000 rows:
#-----------------
# Function to filter out constant data columns
filter.const <- function(X, vectors=c('column', 'row'), tol=0) {
  realdata <- c()
  filteredX <- matrix()
  if (vectors[1] == 'row') {
    for (row in 1:nrow(X)) {
      # keep rows where more than tol entries differ from the row median
      if (length(which(X[row, ] != median(X[row, ]))) > tol) {
        realdata[length(realdata) + 1] <- row
      }
    }
    filteredX <- X[realdata, ]
  } else if (vectors[1] == 'column') {
    for (col in 1:ncol(X)) {
      # keep columns where more than tol entries differ from the column median
      if (length(which(X[, col] != median(X[, col]))) > tol) {
        realdata[length(realdata) + 1] <- col
      }
    }
    filteredX <- X[, realdata]
  }
  return(list(x=filteredX, ix=realdata))
}
#-----------------
# Filter out all of the all-constant columns in my training data set
#
# Read training data set with class information in the first column
training <- read.csv('training_data.txt')
dim(training) # => 49 rows and 525 columns
# Prepare column names by stripping the underscore and the number at the end
colnames(training) <- sub('_\\d+$', '', colnames(training), perl=TRUE)
# Filter out the all-constant columns; exclude column 1, the class column called myclass
training.filter <- filter.const(training[,-1])
# Rebuild the filtered data frame with the class column in front
training.filtered <- cbind(myclass=training[,1], training.filter$x)
dim(training.filtered) # => 49 rows and 250 columns
# Save the filtered training set for later use in classification
filtered.data <- 'training_set_filtered.Rdata'
save(training.filtered, file=filtered.data)
#-----------------
# THE FOLLOWING FILTERING STEP TAKES 3 HOURS ON MY PowerBook
# AND CONSUMES ABOUT 600 MB OF MEMORY.
#
# I WOULD BE HAPPY ABOUT ANY HINT ON HOW TO IMPROVE THIS.
# Pre-filter the big data set (more than 115,000 rows and 524 columns)
# for later class predictions.
# The big data set contains the same column names as the training set,
# but in a different order.
input.file <- 'big_data_set.txt'
filtered.file <- 'big_data_set_filtered.txt'
# Read the header together with the first data row
prediction.set <- read.csv(input.file, header=TRUE, skip=0, nrows=1)
# Prepare column names by stripping the underscore and the number at the end
colnames(prediction.set) <- sub('_\\d+$', '', colnames(prediction.set), perl=TRUE)
prediction.set.header <- colnames(prediction.set)
# Get the descriptor columns of the training data set without the class column (myclass)
training.filtered.property.colnames <- colnames(training.filtered)[-1]
# Keep only the non-constant training-set columns in the prediction set
prediction.set.filtered <- prediction.set[training.filtered.property.colnames]
dim(prediction.set.filtered) # => 1 row and 249 columns
# Write the header and the first filtered row
# (write.csv ignores append= and col.names=, so they are omitted here)
write.csv(prediction.set.filtered, file=filtered.file)
blocksize <- 1000
for (lineid in (0:120)*blocksize) {
  cat('lineid: ', lineid, '\n')
  # Read a block of data.
  # We have to add a dummy column name "x" in col.names when the header
  # is not read!
  prediction.set <- try(read.csv(input.file, header=FALSE,
                                 col.names=c('x', prediction.set.header),
                                 row.names=1,
                                 skip=lineid+2, nrows=blocksize))
  if (inherits(prediction.set, "try-error")) break
  # Keep only the retained training-set columns in this block
  prediction.set.filtered <- prediction.set[training.filtered.property.colnames]
  # Append the data
  # (I know this function is slow, but I couldn't figure out how to do it
  # faster so far.)
  write.table(prediction.set.filtered, file=filtered.file,
              append=TRUE, col.names=FALSE, sep=",")
}
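# A much faster alternative to the loop above: open the input and output
# files once as connections. read.csv() then continues from the current
# file position on every call, instead of re-parsing everything above
# skip= (which makes the skip-based loop effectively quadratic in the
# file size), and the output file is not reopened for every block.
# A minimal sketch, assuming the layout described above, a character
# row-name column, and all-numeric descriptor columns (supplying
# colClasses also skips read.csv()'s type-guessing pass):
con.in  <- file(input.file, open='r')
con.out <- file(filtered.file, open='w')
header.line <- readLines(con.in, n=1)   # consume the header line
# Write the filtered header, with an empty first field for the row-name
# column, as write.csv would do
writeLines(paste(c('', training.filtered.property.colnames), collapse=','), con.out)
repeat {
  block <- try(read.csv(con.in, header=FALSE, nrows=blocksize,
                        col.names=c('x', prediction.set.header),
                        row.names=1,
                        colClasses=c('character',
                                     rep('numeric', length(prediction.set.header)))),
               silent=TRUE)
  if (inherits(block, 'try-error')) break          # no lines left
  write.table(block[training.filtered.property.colnames], con.out,
              sep=',', col.names=FALSE)
  if (nrow(block) < blocksize) break               # last, short block
}
close(con.in)
close(con.out)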
#-------------
# Now read in the filtered data set and save it for later use in classification
prediction.set.filtered <- read.csv(filtered.file, header=TRUE, row.names=1)
filtered.data <- 'prediction_set_filtered.Rdata'
save(prediction.set.filtered, file=filtered.data)
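# Side note: the read.csv() call above is also much faster and lighter on
# memory when the column types and an upper bound on the row count are
# supplied, so it can skip type guessing and pre-allocate. A sketch,
# assuming one character row-name column plus 249 numeric descriptor
# columns (120000 is a hypothetical mild over-estimate of the row count):
prediction.set.filtered <- read.csv(filtered.file, header=TRUE, row.names=1,
                                    colClasses=c('character', rep('numeric', 249)),
                                    nrows=120000)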
I would be very happy about any hints on how to improve the code above!
Best regards,
Torsten