Message-ID: <4A0D6003.5070401@vanderbilt.edu>
Date: 2009-05-15T12:28:51Z
From: Frank E Harrell Jr
Subject: Using sample to create Training and Test sets
In-Reply-To: <4A0D1CFE.20106@bris.ac.uk>

Note that single split-sample validation is not competitive with 
other approaches, such as the bootstrap or cross-validation, unless 
the sample size exceeds roughly 20,000.

Frank
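
For readers wondering what one of the "other approaches" might look like, here is a minimal sketch of 10-fold cross-validation in base R. It assumes that cross-validation is among the alternatives meant above; the data frame `dat`, the linear model, and the choice of k = 10 are hypothetical and only there to make the example runnable.

```r
set.seed(2)

# Hypothetical data: y depends linearly on x plus noise
dat <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200)

k <- 10

# Randomly assign every row to one of k folds of near-equal size
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

# For each fold: fit on the other k - 1 folds, predict the held-out rows
cv_mse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = dat[fold != i, ])
  pred <- predict(fit, newdata = dat[fold == i, ])
  mean((dat$y[fold == i] - pred)^2)
})

# Cross-validated estimate of the mean squared prediction error
mean(cv_mse)
```

Unlike a single split, every row is used for both fitting and testing, which is one reason resampling tends to give more stable estimates at modest sample sizes.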


Chris Arthur wrote:
> Forgive the newbie question. I want to select random rows from my 
> data.frame to create a test set (which I can do), but then I want to 
> create a training set using what's left over.
> 
> Example code:
> acc <- read.table("accOUT.txt", header=T, sep = ",", row.names=1)
> #select 400 random rows in data
> training <- acc[sample(1:nrow(acc), 400, replace=TRUE),]
> 
> #try to get whats left of acc not in training
> testset <- acc[-training, ]
> Fails with the following error....
> Error: invalid subscript type
> In addition: Warning message:
> - not meaningful for factors in: Ops.factor(left)
> 
> I then try:
> testset <- acc[!training, ]
> Which gives me the warning message
> ! not meaningful for factors in: Ops.factor(left)
> And if I look at testset, it is 400 rows of NAs, which clearly isn't 
> right.
> 
> Can anyone tell me what I'm doing wrong?
> 
> Thanks in advance
> 
> Chris
> 
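
The errors in the quoted code most likely arise because `training` is itself a data frame, so `-training` and `!training` apply arithmetic and logical operators to its columns (including factors) rather than producing row indices. One common way around this, sketched below, is to keep the sampled row numbers in their own vector and use them for both subsets. The data frame here is a hypothetical stand-in for the contents of "accOUT.txt", and `train_idx` is a name introduced for illustration.

```r
set.seed(1)

# Hypothetical stand-in for the data read from "accOUT.txt"
acc <- data.frame(
  x = rnorm(1000),
  g = factor(sample(c("a", "b"), 1000, replace = TRUE))
)

# Draw 400 distinct row numbers (sample() defaults to replace = FALSE)
train_idx <- sample(nrow(acc), 400)

# Rows at the sampled positions form the training set
training <- acc[train_idx, ]

# Negative integer indices drop those rows, leaving the remainder
testset <- acc[-train_idx, ]

nrow(training)
nrow(testset)
```

Note that the quoted code samples with `replace=TRUE`, which can pick the same row more than once. In that case the 400-row sample would contain duplicates and the remainder would have more than `nrow(acc) - 400` rows, so sampling without replacement is usually what is wanted for a split.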

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University