Selecting Variables

I think you need to be a little more explicit in describing your
data.  I am not clear as to what this means:
"There are lots of variables between each exposure and the values are nominal
with upto 6 values.."
Can you provide a more complete description?  How many exposure
columns are there in your data?  How many unique IDs?  Depending on
these answers, you can probably read in a portion of your 5GB data
set at a time, summarize each portion, and then aggregate the summaries
at the end, since I would expect the length of the aggregated data to be
just the number of unique IDs.
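
The read-a-portion, summarize, aggregate-at-the-end idea above could be
sketched roughly as follows. This is only a sketch under assumptions: the
helper name count_exposures_chunked, the chunk size, and the
whitespace-delimited file layout with '-' for missing values are all made up
for illustration, since the real layout of the 5GB file is not known.

```r
# Hypothetical helper (not from the thread): count the non-missing Exposure_*
# entries per ID while only ever holding `chunk_rows` rows in memory.
count_exposures_chunked <- function(path, chunk_rows = 100000, na = "-") {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once so every chunk gets the same column names.
  header <- strsplit(trimws(readLines(con, n = 1)), "[[:space:]]+")[[1]]
  exposure_cols <- grep("^Exposure_", header, value = TRUE)
  pieces <- list()
  repeat {
    chunk <- tryCatch(
      read.table(con, nrows = chunk_rows, col.names = header,
                 na.strings = na, stringsAsFactors = FALSE),
      error = function(e) NULL)  # read.table errors once no lines are left
    if (is.null(chunk) || nrow(chunk) == 0) break
    # Summarize this portion: one count per row of the chunk.
    pieces[[length(pieces) + 1]] <- data.frame(
      ID = chunk$ID,
      n  = rowSums(!is.na(chunk[, exposure_cols, drop = FALSE])))
    if (nrow(chunk) < chunk_rows) break  # last, partial chunk
  }
  # Aggregate at the end: one row per unique ID, even if an ID spans chunks.
  aggregate(n ~ ID, data = do.call(rbind, pieces), FUN = sum)
}

# Small demonstration on the example data, forcing two chunks.
tmp <- tempfile()
writeLines(c("ID Exposure_1 Exposure_2 Exposure_3",
             "1 y y y",
             "2 y y -",
             "3 y - -"), tmp)
result <- count_exposures_chunked(tmp, chunk_rows = 2)
result
```

The per-chunk summaries are small (two columns), so keeping them in a list
and combining them once at the end should be cheap compared with holding the
whole file in memory.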
Thanks for the help, guys.

I think I needed to be a bit more explicit, however (sorry).

There are lots of variables between each exposure, and the values are nominal
with up to 6 values.
To add to the problem, the datasets I deal with can be anything up to
5GB.

My guess is that the melt function would be inefficient in this situation.

I was looking at the agrep function to count the number of Exposure columns
in names(). I wasn't sure how to count whether there was a value in each one,
but y[complete.cases(y),] looks like a nice approach.

Is this a good path to follow?
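
One possible way to follow that path without melt is sketched below. It
swaps in grep with an anchored pattern instead of agrep (agrep does
approximate matching, so it could pick up near-miss names), and it assumes
that "Max_Exposure" simply means the number of Exposure_ columns, which the
thread does not confirm. The object names exposure_cols and result are
illustrative only.

```r
# Example data from the thread; '-' is read as NA.
x <- read.table(text = "ID Exposure_1 Exposure_2 Exposure_3
1 y y y
2 y y -
3 y - -", header = TRUE, na.strings = "-", stringsAsFactors = FALSE)

# Names of the columns that start with "Exposure_" (exact prefix match).
exposure_cols <- grep("^Exposure_", names(x), value = TRUE)

result <- data.frame(
  ID = x$ID,
  # Assumption: "Max" is the total number of Exposure_ columns.
  Max_Exposure = length(exposure_cols),
  # Number of non-missing exposure entries in each row.
  Unique_Exposure = rowSums(!is.na(x[, exposure_cols, drop = FALSE]))
)
result
```

Because rowSums works on the wide data directly, no long-format copy of the
data is created, which may matter for files approaching 5GB.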

On Tue, Aug 5, 2008 at 3:09 PM, jim holtman <jholtman at gmail.com> wrote:
I am not sure where the "Max" comes from, but this might be a start for
you:

x <- read.table(textConnection("ID Exposure_1 Exposure_2 Exposure_3
1        y                 y                y
2        y                 y                -
3        y                 -                 -"), header=TRUE,
    na.strings='-')   # read '-' as a missing value
closeAllConnections()
require(reshape)      # provides melt()
# reshape to long format: one row per ID/variable pair
y <- melt(x, id.var='ID')
# get rid of NAs
y <- y[complete.cases(y),]
y
  ID   variable value
1  1 Exposure_1     y
2  2 Exposure_1     y
3  3 Exposure_1     y
4  1 Exposure_2     y
5  2 Exposure_2     y
7  1 Exposure_3     y
# count the remaining (non-missing) exposures for each ID
cbind(Unique=tapply(y$ID, y$ID, length))
  Unique
1      3
2      2
3      1

On Tue, Aug 5, 2008 at 9:21 AM, Michael Pearmain <mpearmain at google.com> wrote:
Hi All,

I have a dataset that I want to inspect dynamically for the number of
variables whose names start with "Exposure_", and then, for these variables,
count the entries across each case, i.e.

ID Exposure_1 Exposure_2 Exposure_3
1  y          y          y
2  y          y          -
3  y          -          -

So the corresponding new variables that would be created are

ID Max_Exposure Unique_Exposure
1  3            3
2  3            2
3  3            1

I know this may seem fairly basic, but it will give me a starting point
to develop more advanced things with loops and natural language processing.

Thanks in advance

Mike


--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

--
Michael Pearmain
Senior Statistical Analyst

1st Floor, 180 Great Portland St. London W1W 5QZ
t +44 (0) 2032191684
mpearmain at google.com
mpearmain at doubleclick.com

