Selecting Variables
I think you need to be a little more explicit in describing your data. I am not clear as to what this means:
There are lots of variables between each exposure and the values are nominal with up to 6 values.
Can you provide a more complete description? How many Exposure columns are there in your data? How many unique IDs? Depending on the answers, you can probably read in a portion of your 5GB data set at a time, summarize each portion, and then aggregate the summaries at the end, since I would expect the aggregated result to have just one row per unique ID.
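A rough sketch of that read-a-portion-then-aggregate idea, in case it helps. It assumes a whitespace-delimited file with a header row, an ID column, and '-' marking a missing entry, as in the toy example further down. The function name count_exposures_chunked and the n_rows argument are made up for illustration, and I have not tried this on anything close to 5GB:

```r
# Hypothetical helper: count non-missing Exposure_ entries per ID while
# reading the file n_rows lines at a time instead of loading it whole.
count_exposures_chunked <- function(path, n_rows = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header once so every chunk can reuse the column names.
  header <- readLines(con, n = 1)
  col_names <- strsplit(trimws(header), "[[:space:]]+")[[1]]
  exp_cols <- grep("^Exposure_", col_names, value = TRUE)

  totals <- numeric(0)  # running counts, named by ID
  repeat {
    # Once the input is exhausted read.table may either error or return
    # zero rows; both cases end the loop. Note that this also hides any
    # other read error, which is acceptable only for a sketch.
    chunk <- tryCatch(
      read.table(con, nrows = n_rows, col.names = col_names,
                 na.strings = "-", colClasses = "character"),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break

    # Per-row count of non-missing exposures, summed per ID in this chunk.
    counts <- rowSums(!is.na(chunk[, exp_cols, drop = FALSE]))
    part <- tapply(counts, chunk$ID, sum)

    # Fold the chunk's counts into the running totals.
    ids <- names(part)
    prev <- totals[ids]
    prev[is.na(prev)] <- 0
    totals[ids] <- prev + as.vector(part)

    if (nrow(chunk) < n_rows) break  # short chunk means end of file
  }
  totals
}

# Toy usage: write the three-row example to a temp file and read it
# two rows at a time to force more than one chunk.
tmp <- tempfile()
writeLines(c("ID Exposure_1 Exposure_2 Exposure_3",
             "1 y y y", "2 y y -", "3 y - -"), tmp)
totals <- count_exposures_chunked(tmp, n_rows = 2L)
totals
```

Because the running totals are keyed by ID, an ID that turns up in more than one chunk is simply summed, so the final vector has one entry per unique ID.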
On Tue, Aug 5, 2008 at 11:54 AM, Michael Pearmain <mpearmain at google.com> wrote:
Thanks for the help guys. I think I needed to be a bit more explicit, however (sorry). There are lots of variables between each exposure and the values are nominal with up to 6 values. And to add to the problem, the datasets I deal with range up to 5GB. My guess is that the melt function would be inefficient in this situation. I was looking at the agrep function to count the number of Exposure columns in names(). I wasn't sure how to count whether there was a value in each one, but y[complete.cases(y),] looks like a nice approach. Is this a good path to follow?
On Tue, Aug 5, 2008 at 3:09 PM, jim holtman <jholtman at gmail.com> wrote:
I am not sure where the "Max" comes from, but this might be a start for you:
> x <- read.table(textConnection("ID Exposure_1 Exposure_2 Exposure_3
+ 1 y y y
+ 2 y y -
+ 3 y - -"), header=TRUE, na.strings='-')
> closeAllConnections()
> require(reshape)
> y <- melt(x, id.var='ID')
> # get rid of NAs
> y <- y[complete.cases(y),]
> y
  ID   variable value
1  1 Exposure_1     y
2  2 Exposure_1     y
3  3 Exposure_1     y
4  1 Exposure_2     y
5  2 Exposure_2     y
7  1 Exposure_3     y
> cbind(Unique=tapply(y$ID, y$ID, length))
  Unique
1      3
2      2
3      1
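If melt turns out to be too heavy for the larger files, the same counts can probably be had directly from the wide layout. This is only a sketch on the toy data: it uses grep with an anchored pattern rather than agrep (agrep does approximate matching, which could pick up unrelated column names), and it assumes that "Max_Exposure" means the number of Exposure_ columns, which is my guess at the intent rather than something stated in the question:

```r
# Same toy data as above; '-' is read as NA.
x <- read.table(text = "ID Exposure_1 Exposure_2 Exposure_3
1 y y y
2 y y -
3 y - -", header = TRUE, na.strings = "-", stringsAsFactors = FALSE)

# Columns whose names start with "Exposure_" (exact prefix, not fuzzy).
exp_cols <- grep("^Exposure_", names(x), value = TRUE)

result <- data.frame(
  ID = x$ID,
  # Assumption: Max_Exposure is the number of Exposure_ columns.
  Max_Exposure = length(exp_cols),
  # Number of non-missing Exposure_ entries in each row.
  Unique_Exposure = rowSums(!is.na(x[, exp_cols, drop = FALSE]))
)
result
```

On the three-row example this gives Unique_Exposure values of 3, 2 and 1, matching the tapply result, without creating the long melted copy of the data.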
On Tue, Aug 5, 2008 at 9:21 AM, Michael Pearmain <mpearmain at google.com> wrote:
Hi All,
I have a dataset that I want to dynamically inspect for the number of
variables that start with "Exposure_" and then, for these, count the
entries across each case, i.e.
ID Exposure_1 Exposure_2 Exposure_3
1 y y y
2 y y -
3 y - -
So the corresponding new variables that would be created are
ID Max_Exposure Unique_Exposure
1 3 3
2 3 2
3 3 1
I know this may seem fairly basic, but it will give me the starting point
to develop more advanced things with loops and natural language.
Thanks in advance
Mike
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?