algorithm that iteratively drops columns of a data-frame

4 messages · Martin Batholdy, R. Michael Weylandt, Jeff Newmiller

#
Dear R-Users,


I have a problem with an algorithm that iteratively goes over a data.frame and excludes n columns at each step based on a statistical criterion, so that the 'column space' gets smaller and smaller with each iteration (as in stepwise regression).

The problem is that in every round I use a new subset of my data.frame.

However, as soon as I "generate" this subset by indexing the data.frame, I of course get different column numbers (compared to my original data.frame).

How can I solve this?



I prepared a small example to make my problem easier to understand:


Here I generate a data.frame containing 6 vectors with different means.

The loop now should exclude the vector with the smallest mean in each round.

At the end I want to have a vector ('drop') containing the column numbers that I can apply to the original data.frame to get the subset with the highest means.

But this does not work: every time I generate a subset ('data[,-drop]'), the column numbers of the subset differ from the column numbers of the original data.frame.

So, in the end, I can't use my drop vector on my original data.frame, since the dimension of the data.frame being tested changes in every round of the loop.


How can I deal with this kind of problem?

Any suggestions are highly appreciated!
(Of course, for the example code there are much easier methods for finding the columns with the smallest means. It is just a generic example.)


here is the sample code:


x1 <- rnorm(200, 5, 2)
x2 <- rnorm(200, 6, 2)
x3 <- rnorm(200, 1, 2)
x4 <- rnorm(200, 12, 2)
x5 <- rnorm(200, 8, 2)
x6 <- rnorm(200, 9, 2)


data <- data.frame(x1, x2, x3, x4, x5, x6)

col_means <- colMeans(data)
drop <- match(min(col_means), col_means)


for(i in 1:4) {

	col_means <- colMeans(data[,-drop])
	drop <- c(drop, match(min(col_means), col_means))

}
#
Perhaps attach placeholder names to your columns and use those rather
than indices?
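
One way this suggestion might look in practice is sketched below. It is only a minimal illustration under assumptions that are not in the original post: the seed, the inline column construction, and the helper vectors `keep` and `dropped` are made up for the example. Because columns are tracked by name, shrinking the subset never changes what an entry of `dropped` refers to.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

data <- data.frame(
  x1 = rnorm(200, 5, 2),  x2 = rnorm(200, 6, 2), x3 = rnorm(200, 1, 2),
  x4 = rnorm(200, 12, 2), x5 = rnorm(200, 8, 2), x6 = rnorm(200, 9, 2)
)

keep    <- names(data)    # hypothetical: names of the columns still in play
dropped <- character(0)   # hypothetical: names of the columns removed so far

for (i in 1:4) {
  # Means of the remaining columns only; drop = FALSE keeps a data.frame
  col_means <- colMeans(data[, keep, drop = FALSE])
  # which.min() returns a named index, so names() gives the column name
  worst   <- names(which.min(col_means))
  dropped <- c(dropped, worst)
  keep    <- setdiff(keep, worst)
}

dropped  # names of the four columns with the smallest means, in removal order
keep     # names of the two columns that remain
```

Since `dropped` and `keep` hold names, both can be used directly on the original data.frame, for example `data[, keep]`.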

Michael

On Wed, Nov 9, 2011 at 10:36 AM, Martin Batholdy
<batholdy at googlemail.com> wrote:
#
Try

data[,!names(data) %in% names(col_means)]
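
The same `%in%` idiom can also be applied to a character vector of already dropped names, and `match()` can translate those names back into column numbers of the original data.frame. The following is a hedged sketch with a tiny made-up data.frame; `dropped`, `remaining`, and `drop_idx` are hypothetical names, not part of the original code.

```r
# Small deterministic data.frame, made up for this illustration
data <- data.frame(x1 = 1:3, x2 = 4:6, x3 = 7:9, x4 = 10:12)

# Hypothetical result of two rounds of dropping, stored as column names
dropped <- c("x3", "x1")

# `!` binds more loosely than `%in%`, so this reads as !(names(data) %in% dropped)
remaining <- data[, !names(data) %in% dropped, drop = FALSE]
names(remaining)   # "x2" "x4"

# Positions of the dropped columns in the ORIGINAL data.frame
drop_idx <- match(dropped, names(data))
drop_idx           # 3 1

# Negative indexing with those positions selects the same columns
identical(names(data[, -drop_idx, drop = FALSE]), names(remaining))  # TRUE
```

Keeping names during the loop and converting to numbers only at the end avoids the shifting-index problem described in the question.
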
On Wed, 9 Nov 2011, Martin Batholdy wrote:

            
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
#
great, thank you both!
On 09.11.2011, at 17:27, Jeff Newmiller wrote: