How to subset() from data frame using specific rows

I have a data frame called chemdata with this structure:
str(chemdata)
'data.frame':	14886 obs. of  4 variables:
  $ site    : Factor w/ 148 levels "BC-0.5","BC-1",..: 104 145 126 115 114 128 124 2 3 3 ...
  $ sampdate: Date, format: "1996-12-27" "1996-08-22" ...
  $ param   : Factor w/ 8 levels "As","Ca","Cl",..: 1 1 1 1 1 1 1 1 1 1 ...
  $ quant   : num  0.06 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...

   I've looked in the R Cookbook and Dalgaard's intro book without finding a
way to use wildcards (e.g., "BC-*") or explicitly writing each site ID
when subsetting a data frame.

   I need to create subsets (as data frames) based on sites, but including
all sites on each stream. For example, using the initial site factor shown
above, I want a subset containing all data for sites "BC-0.5", "BC-1",
"BC-2", "BC-3", "BC-4", "BC-5", and "BC-6".

Pointers appreciated,

Rich
Hi Rich,

You can use something like this:
testdata <- c("A1", "A2", "A3", "B1", "B2", "B3")
grep("^A", testdata)
[1] 1 2 3
grepl("^A", testdata)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
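
For the shell-style wildcard in the question, base R's glob2rx() (in the utils package) converts a glob such as "BC-*" into a regular expression that grep() and grepl() accept. A minimal sketch; the site IDs below are made up for illustration:

```r
# Hypothetical site IDs, standing in for levels(chemdata$site)
sites <- c("BC-0.5", "BC-1", "BC-2", "JCM-1", "ABC-3")

# glob2rx() turns a shell-style wildcard into an anchored regular expression
pattern <- glob2rx("BC-*")

# Logical vector: TRUE where the site ID matches the wildcard
is_bc <- grepl(pattern, sites)

# Only the IDs that start with "BC-"
sites[is_bc]
```

Because the resulting pattern is anchored at the start of the string, an ID like "ABC-3" is not picked up.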

Sarah

Sarah Goslee
http://www.functionaldiversity.org
This isn't going to be the most elegant, but it should work:

## Get the factors as characters

ff <- as.character(chemdata$site)

## Identify those that match what you want
ff <- grepl(ff, "BC-")

## Now use this logical vector to subset

chemdata[ff, ]

Can't test, but should be good to go assuming that "BC-" entirely
identifies those sites you want. If you have other "BC-" things, read
through the ?regex documentation; I think it describes how to do
selective wildcards.

Michael
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

You can use something like this:

testdata <- c("A1", "A2", "A3", "B1", "B2", "B3")
grep("^A", testdata)
[1] 1 2 3
grepl("^A", testdata)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
Sarah,

   I don't see how this gives me a data frame containing only those sites I
specify. I want to plot by sites-within-streams specifying which param
factor to use.

Thanks,

Rich
Hi Rich,
You asked for pointers, and didn't provide a reproducible example, so I
offered a pointer.

If you have a logical vector that specifies whether to include or omit a
row, you can use that to subset your data frame.

sitesToUse <- grepl("firstsite", mydata$mysitenames)
dataframeForThatSite <- mydata[sitesToUse, ]

If you want real worked results, you'll need to provide a reproducible
example of your own.
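
On made-up data shaped like the str() output in the original post, the logical-vector idea could look like the sketch below. All values are invented; only the column names come from the thread.

```r
# Toy stand-in for chemdata; all values are invented
chemdata <- data.frame(
  site     = factor(c("BC-0.5", "BC-1", "JCM-1", "BC-2", "JCM-2")),
  sampdate = as.Date(c("1996-12-27", "1996-08-22", "1997-01-15",
                       "1997-03-02", "1997-03-02")),
  param    = factor(c("As", "As", "As", "Ca", "Ca")),
  quant    = c(0.06, 0.01, 0.01, 12.3, 15.8)
)

# Logical vector marking the rows whose site ID starts with "BC-"
sitesToUse <- grepl("^BC-", chemdata$site)

# Keep only those rows
bc <- chemdata[sitesToUse, ]

# subset() accepts the same condition and can add one on param
bc_as <- subset(chemdata, grepl("^BC-", site) & param == "As")
```

The second form may be handy for plotting one parameter at a time across the sites of a single stream.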

Sarah
Sarah Goslee
http://www.functionaldiversity.org

This isn't going to be the most elegant, but it should work:
## Get the factors as characters
ff <- as.character(chemdata$site)
## Identify those that match what you want
ff <- grepl(ff, "BC-")
Michael,

   Apparently grep works differently in R than it does on the command line:

bf <- grep(ff, "BC-")
Warning message:
In grep(ff, "BC-") :
   argument 'pattern' has length > 1 and only the first element will be used

   I understand what you suggest but it does not appear to work for me.

Thanks,

Rich
No, that was just a typo on my end:

the correct order of arguments should have been

ff <- grepl("BC-", ff)
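
With the arguments in that order (pattern first, then the vector to search), a sketch of the whole sequence on a made-up stand-in for chemdata might be:

```r
# Toy stand-in for chemdata; the values are invented
chemdata <- data.frame(
  site  = factor(c("BC-0.5", "BC-1", "JCM-1", "BC-2")),
  quant = c(0.06, 0.01, 0.02, 0.01)
)

# Get the factor as character, then flag the matching rows
ff <- as.character(chemdata$site)
ff <- grepl("BC-", ff)

# Subset, then drop factor levels that no longer occur in the result
bc <- droplevels(chemdata[ff, ])

levels(bc$site)
```

The droplevels() step is an addition here, not part of the suggestion above. Without it the subset still carries every original site level, which tends to show up as empty groups in tables and plots.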

You asked for pointers, and didn't provide a reproducible example, so I
offered a pointer.
Sarah,

   I did not realize that your pointer was to the factor component of the
subset() command.

   I think the most parsimonious thing for me to do is to modify the database
table with a new column of the full stream name, then re-export and re-read
into R.

Thanks,

Rich

No, that was just a typo on my end:
the correct order of arguments should have been
ff <- grepl("BC-", ff)
Michael,

   Thank you.

Rich
On Tue, 4 Oct 2011, Sarah Goslee wrote:

You asked for pointers, and didn't provide a reproducible example, so I
offered a pointer.
Sarah,

   I did not realize that your pointer was to the factor component of the
subset() command.

   I think the most parsimonious thing for me to do is to modify the database
table with a new column of the full stream name, then re-export and re-read
into R.

Hm. I seldom use such an approach. In your original request you said you
want to split your data into smaller data frames based on sites:

-----
   I need to create subsets (as data frames) based on sites, but including
all sites on each stream. For example, using the initial site factor shown
------
From what we know it is difficult to say whether there is some common
feature in the site variable. If it is organised like

XY-N

you can simply make a new variable from the first two letters

sites <- substr(chemdata$site,1,2)

then you can split your data frame according to sites

chem.spl <- split(chemdata, sites)

and do anything with your split data frames, organised in a list
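
On a toy data frame (invented values, two-letter stream codes assumed), the two steps give a named list with one data frame per stream:

```r
# Toy stand-in for chemdata; site IDs assumed to follow the XY-N pattern
chemdata <- data.frame(
  site  = factor(c("BC-0.5", "BC-1", "JC-1", "BC-2", "JC-2")),
  quant = c(0.06, 0.01, 0.02, 0.01, 0.03)
)

# Grouping variable built from the first two letters of each site ID
sites <- substr(chemdata$site, 1, 2)

# One data frame per stream, collected in a list
chem.spl <- split(chemdata, sites)

names(chem.spl)     # stream codes
chem.spl[["BC"]]    # all rows for the BC stream
```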

Regards
Petr

Hm. I seldom use such approach. In your original request you said you want
split your data to smaller data frames based on sites

Petr,

   I need the additional information in the database, too.

From what we know it is difficult to say if there is some common feature
in site variable. If it is organised like
XY-N
you can simply make new variable from first two letters

   Unfortunately, the site designations are not so uniform. As I went through
the process of re-doing the data I discovered this lack of consistency
resulting in duplicate records because one site had been designated XX-n and
XXn. Had to clean those up, too.

sites <- substr(chemdata$site,1,2)

then you can split your data frame according to sites

chem.spl <- split(chemdata, sites)

and do anything with your splitted data frames organised in list

   First thing this morning I'm upgrading to 2.13.2 and hoping that this
fixes an issue that just showed up yesterday afternoon: not being able to
access function help pages. For example, I tried ?subset and ?split because
I thought the latter is really what I want, yet R told me no help was found.
Strange; it was there a week ago.

Thanks,

Rich

Yep. The upgrade brought back the help system.

Rich
Hi
On Wed, 5 Oct 2011, Petr PIKAL wrote:

Hm. I seldom use such approach. In your original request you said you want
split your data to smaller data frames based on sites
Petr,

   I need the additional information in the database, too.
But you do not lose them; your data frame is cut according to the sites
variable and put into a list

see
iris.spl<- split(iris, iris$Species)
str(iris.spl)
List of 3
 $ setosa    :'data.frame':     50 obs. of  5 variables:
  ..$ Sepal.Length: num [1:50] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
  ..$ Sepal.Width : num [1:50] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
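
Each element of such a list is an ordinary data frame with all of its columns, so it can be pulled out by name or processed in one go with lapply() or sapply(). A small sketch using the built-in iris data:

```r
# Split the built-in iris data by species
iris.spl <- split(iris, iris$Species)

# Pull out one piece; it still has all five columns
setosa <- iris.spl[["setosa"]]
ncol(setosa)

# Apply the same summary to every piece
rows_per_species  <- sapply(iris.spl, nrow)
mean_sepal_length <- sapply(iris.spl, function(d) mean(d$Sepal.Length))
```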

From what we know it is difficult to say if there is some common feature
in site variable. If it is organised like
XY-N
you can simply make new variable from first two letters

   Unfortunately, the site designations are not so uniform. As I went through
the process of re-doing the data I discovered this lack of consistency
resulting in duplicate records because one site had been designated XX-n and
XXn. Had to clean those up, too.

sites <- substr(chemdata$site,1,2)
Which would not matter if the first two letters designate the required
grouping variable, which I called sites
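
If the stream codes vary in length, or the dash is sometimes missing, one option is to strip the trailing number with a regular expression instead of taking a fixed number of letters. This is only a sketch: the site IDs are invented, and the pattern assumes every ID ends in a number.

```r
# Invented site IDs, some written XX-n and some XXn
site <- c("BC-0.5", "BC-1", "BC2", "JCM-3", "JCM10")

# Remove an optional dash plus the trailing digits and decimal point
stream <- sub("-?[0-9.]+$", "", site)

# stream can now serve as the grouping variable for split()
table(stream)
```

Both spellings of a site end up under the same stream code, so they land in the same element of the split list.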

Regards
Petr

But you do not lose them, your data frame is cut according to sites
variable and put into a list
I know this, Petr. But adding them to the database table ensures that the
information is there, too.

   This brings up another question, but I should put that on a different
thread. It's about process and work flow; that is, when can I use multiple
factors in the original data frame and when I need to split and subset the
data frame. I think it depends on how many factors can be specified by a
particular model or graph. Regardless, I'll hold off on this as I work
through these initial exploratory steps.

Thanks,

Rich