Skip to content

Help with subset

6 messages · Jerry Floren, Peter Alspach

#
I am so happy about learning how to read in multiple Excel files, that I have
to try and make another improvement. I know what I have been doing is
clumsy, but it works. Hopefully, someone can suggest a more elegant
solution. As a novice, I have been using MS-Word and mail merge to write my
code. I start with about 2 pages of code, and end up with 2,220 merged pages
that I copy and paste into R. You can probably guess that I am not a
programmer.

## here is the start of my merge document. The "x116" line has the merge
field, in this case "Bases-K Ammonium Acetate-2008-116". This changes for
each soil sample, and for each type of analysis. ##

napt <- read.table(file = "C:/Documents and Settings/jfloren/My
Documents/R_Statistics/NAPT/NAPT_09/CertIn2010/Data_for_R/readin_all_for_2010_cert.csv
", header = TRUE, sep = ",")
attach(napt)

x116 <- subset(napt, Analysis_Soil %in% c("Bases-K Ammonium
Acetate-2008-116"))

detach(napt)
attach(x116)

#### End of merge document section for selecting the subset  ###

Once I get the subset isolated, I have no problems calculating the necessary
statistics and can generate some wonderful graphs. 

I have two questions. 
1. How do I select different subsets from a large table without resorting to
using Word's Mail Merge?
2. I prefer to only analyze the results if at least nine labs submitted
results for a particular test. How would I tell R to skip the analysis if
the number of labs running a particular test is less than nine?

Thanks,

Jerry Floren
Minnesota Department of Agriculture
#
Tena koe Jerry

I'm not sure exactly what you wish to do, but it would seem you could
create a character vector (myVars) containing all your merge field names
and then use a for() loop.  Something like:

for (myV in myVars)
{
  x116 <- subset(napt, Analysis_Soil %in% myV)
  if (nrow(x116) > 8)
  {
    # steps for your analysis
  }
}

HTH ......

Peter Alspach
#
Thank you Peter. I am really new to this. The spreadsheet I am working with
has 12,379 rows with the first row consisting of the variable names and
12,378 rows of data. There are seven columns, and the 7th column is the only
one with numerical data ("Results"). 

I need to match up the variable Results with the variable "Anlysis_Soil",
which is the type of test performed by the labs on one of 20 different soil
samples. Here are some examples of the Anlysis_Soil variable:

Anlysis_Soil
Bases-Aluminum KCL Extr-2008-116
Bases-Aluminum KCL Extr-2008-116
Bases-Aluminum KCL Extr-2008-117
Bases-Aluminum KCL Extr-2008-118
Bases-Aluminum KCL Extr-2008-118
Bases-Aluminum KCL Extr-2008-119
Bases-Aluminum KCL Extr-2008-120
Bases-Aluminum KCL Extr-2008-120
Bases-Aluminum KCL Extr-2009-101

Actually, I am not interested in any of the above, because there are too few
(less than 9). 

I think I need to first identify the unique Anlysis_Soil from the entire
list, and I thought using "list" might work:
List of 1
 $ : Factor w/ 1695 levels "Bases-Aluminum KCL Extr-2008-116",..: 1 1 2 3 3
4 5 5 6 6 ...
It does correctly identify there are 1695 unique "Anlysis_Soil" variables.
However, "anlyses" contains all 12,378 "Anlysis_Soil" variables. For
example:

print(anlyses)
...
...
...
[12374] Soil pH & EC-Soil EC (1to2)-2009-115                            
[12375] Soil pH & EC-Soil EC (1to2)-2009-115                            
[12376] Soil pH & EC-Soil EC (1to2)-2009-115                            
[12377] Soil pH & EC-Soil EC (1to2)-2009-115                            
[12378] Soil pH & EC-Soil EC (1to2)-2009-115                            
1695 Levels: Bases-Aluminum KCL Extr-2008-116 ...

And once again shows correctly that there are 1,695 unique "Anlysis_Soil"
variables.

Once the unique Anlysis_Soil variables are identified, I need to determine
the ones greater than 8, and I see how that could be done with your code.

I am not clear what you mean by, "for (myV in myVars)" ? Is myV the name of
one of the unique variables that has at least 9 Results? Is myVars the
entire column of "Anlysis_Soil" ?

I am not sure if this is any clearer.

Thanks,

Jerry Floren
Minnesota Department of Agriculture
#
Tena koe Jerry 

myVars would be the unique values of Anlysis_Soil.  I guess Anlysis_Soil
is a factor, in which case

myVars <- levels(Anlysis_Soil)

The for() loop then steps through each of these in turn.

HTH ....

Peter Alspach

PS  If you'd like more details it might be better to contact me off
list.

P
#
Thank you very much Peter. I'm off until Monday, but now I have an
interesting project to think about over the weekend. It looks pretty simple,
and it sure will beat over 2,000 pages of code to cut and paste.

Thanks again,

Jerry Floren
Minnesota Department of Agriculture
3 days later
#
Hi Peter,

Thanks again for showing me how to do this. This will save me hours of
tedious work, and I can't believe how quickly R reads in and processes my
data. It was taking about 40 minutes when I used Word's mail merge for this.
When I first switched from Excel to R, R was so much faster than Excel that
I could live with the 40 minute processing time. 

This is another major improvement. What will really save me the most time is
the" if (nrow(rsubset) > 8)"  statement. 

Thanks,

Jerry Floren
Minnesota Department of Agriculture