
How to subset() from data frame using specific rows

16 messages · R. Michael Weylandt, Sarah Goslee, Jeff Newmiller +3 more

#
I have a data frame called chemdata with this structure:
'data.frame':	14886 obs. of  4 variables:
  $ site    : Factor w/ 148 levels "BC-0.5","BC-1",..: 104 145 126 115 114 128 124 2 3 3 ...
  $ sampdate: Date, format: "1996-12-27" "1996-08-22" ...
  $ param   : Factor w/ 8 levels "As","Ca","Cl",..: 1 1 1 1 1 1 1 1 1 1 ...
  $ quant   : num  0.06 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...

   I've looked in the R Cookbook and Dalgaard's intro book without finding a
way to use wildcards (e.g., "BC-*"), instead of explicitly writing each site
ID, when subsetting a data frame.

   I need to create subsets (as data frames) based on sites, but including
all sites on each stream. For example, using the initial site factor shown
above, I want a subset containing all data for sites "BC-0.5", "BC-1",
"BC-2", "BC-3", "BC-4", "BC-5", and "BC-6".

Pointers appreciated,

Rich
#
Hi Rich,

You can use something like this:
[1] 1 2 3
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
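
The commands behind these two lines of output did not survive in the archive. A minimal sketch that produces the same output, assuming a hypothetical six-element character vector `sites` (not taken from the original message): `grep()` returns the indices of the matching elements, while `grepl()` returns a logical vector of the same length as its input.

```r
# Hypothetical vector; the original input was not preserved
sites <- c("BC-0.5", "BC-1", "BC-2", "JC-1", "JC-2", "MC-1")

grep("BC-", sites)    # indices of the matching elements
grepl("BC-", sites)   # one TRUE/FALSE per element
```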

Sarah
On Tue, Oct 4, 2011 at 2:39 PM, Rich Shepard <rshepard at appl-ecosys.com> wrote:

  
    
#
This isn't going to be the most elegant, but it should work:

## Get the factors as characters

ff <- as.character(chemdata$site)

## Identify those that match what you want
ff <- grepl(ff, "BC-")

## Now use this logical vector to subset

chemdata[ff, ]

Can't test, but it should be good to go, assuming that "BC-" entirely
identifies the sites you want. If you have other "BC-" things, read
through the ?regex documentation; I think it describes how to do
selective wildcards.

Michael
On Tue, Oct 4, 2011 at 2:39 PM, Rich Shepard <rshepard at appl-ecosys.com> wrote:
#
On Tue, 4 Oct 2011, Sarah Goslee wrote:

            
Sarah,

   I don't see how this gives me a data frame containing only those sites I
specify. I want to plot by sites-within-streams specifying which param
factor to use.

Thanks,

Rich
#
Hi Rich,
On Tue, Oct 4, 2011 at 2:58 PM, Rich Shepard <rshepard at appl-ecosys.com> wrote:
You asked for pointers, and didn't provide a reproducible example, so
I offered a pointer.

If you have a logical vector that specifies whether to include or omit
a row, you can use that to subset your data frame.

sitesToUse <- grepl("firstsite", mydata$mysitenames)
dataframeForThatSite <- mydata[sitesToUse, ]
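
As an illustration of the two lines above, here is a sketch on a toy data frame. The data values, the name `toy`, and the anchored pattern `^BC-` are invented for this example; the `subset()` call is shown only as an equivalent spelling of the same row selection.

```r
# Toy stand-in for chemdata; values are invented for illustration
toy <- data.frame(
  site  = factor(c("BC-0.5", "BC-1", "JC-1", "BC-2", "MC-3")),
  param = factor(c("As", "Ca", "As", "Cl", "As")),
  quant = c(0.06, 0.01, 0.02, 0.01, 0.03)
)

# Logical vector: TRUE where the site name starts with "BC-"
sitesToUse <- grepl("^BC-", toy$site)

# Two equivalent ways to keep only those rows
bc1 <- toy[sitesToUse, ]
bc2 <- subset(toy, grepl("^BC-", site))
```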

If you want real worked results, you'll need to provide a reproducible example
of your own.

Sarah
#
On Tue, 4 Oct 2011, R. Michael Weylandt wrote:

            
Michael,

   Apparently grep works differently in R than it does on the command line:

bf <- grep(ff, "BC-")
Warning message:
In grep(ff, "BC-") :
   argument 'pattern' has length > 1 and only the first element will be used

   I understand what you suggest but it does not appear to work for me.

Thanks,

Rich
#
No, that was just a typo on my end:

the correct order of arguments should have been

ff <- grepl("BC-", ff)
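
In `grepl()` the pattern comes first and the vector to search comes second. A small sketch on made-up site names shows that naming the arguments sidesteps the ordering mistake; the `^` anchor and the `droplevels()` step are additions for illustration, not part of the original suggestion.

```r
# Toy stand-in for chemdata$site; invented for illustration
site <- factor(c("BC-0.5", "BC-1", "ABC-1", "JC-2", "BC-6"))

# With named arguments the order no longer matters
keep_any    <- grepl(x = site, pattern = "BC-")    # also matches "ABC-1"
keep_prefix <- grepl(x = site, pattern = "^BC-")   # only names starting with "BC-"

# Drop the factor levels that no longer occur after subsetting
bc_sites <- droplevels(site[keep_prefix])
levels(bc_sites)
```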
On Tue, Oct 4, 2011 at 3:07 PM, Rich Shepard <rshepard at appl-ecosys.com> wrote:
#
On Tue, 4 Oct 2011, Sarah Goslee wrote:

            
Sarah,

   I did not realize that your pointer was to the factor component of the
subset() command.

   I think the most parsimonious thing for me to do is to modify the database
table with a new column of the full stream name, then re-export and re-read
into R.

Thanks,

Rich
#
On Tue, 4 Oct 2011, R. Michael Weylandt wrote:

            
Michael,

   Thank you.

Rich
#
Hm. I seldom use such an approach. In your original request you said you
want to split your data into smaller data frames based on sites

-----
   I need to create subsets (as data frames) based on sites, but including
all sites on each stream. For example, using the initial site factor shown
------
in the site variable. If it is organised like

XY-N

you can simply make a new variable from the first two letters

sites <- substr(chemdata$site,1,2)

then you can split your data frame according to sites

chem.spl <- split(chemdata, sites)

and do anything with your split data frames, organised in a list
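
A small sketch of this approach on made-up data. Because `substr(..., 1, 2)` assumes every stream code is exactly two letters, the sketch also shows a variant that takes everything before the first hyphen with `sub()`; that variant and the names `toy`, `sites_alt`, and `toy.spl` are additions for illustration.

```r
# Toy stand-in for chemdata; values are invented for illustration
toy <- data.frame(
  site  = factor(c("BC-0.5", "BC-1", "JC-1", "JC-2", "MC-3")),
  quant = c(0.06, 0.01, 0.02, 0.01, 0.03)
)

# Stream code from the first two letters
sites <- substr(toy$site, 1, 2)

# Alternative: everything before the first hyphen
sites_alt <- sub("-.*$", "", as.character(toy$site))

# One data frame per stream, collected in a named list
toy.spl <- split(toy, sites)
names(toy.spl)

# Apply something to each piece, e.g. the mean of quant per stream
sapply(toy.spl, function(d) mean(d$quant))
```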

Regards
Petr
http://www.R-project.org/posting-guide.html
#
On Wed, 5 Oct 2011, Petr PIKAL wrote:

            
Petr,

   I need the additional information in the database, too.
Unfortunately, the site designations are not so uniform. As I went through
the process of re-doing the data I discovered this lack of consistency
resulting in duplicate records because one site had been designated XX-n and
XXn. Had to clean those up, too.
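
   Mixed designations such as XX-n and XXn can also be brought to one form inside R before grouping. A rough sketch, assuming (hypothetically) that every ID is a run of capital letters followed by an optional hyphen and a number; real site names may well need a different pattern:

```r
# Invented IDs mixing the two styles
ids <- c("BC-1", "BC1", "JC-0.5", "JC0.5", "MC-12")

# Capture the letters and the numeric part, then re-insert a single hyphen
clean <- sub("^([A-Z]+)-?([0-9.]+)$", "\\1-\\2", ids)
clean

# Duplicates created by the inconsistent naming now collapse
unique(clean)
```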

   First thing this morning I'm upgrading to 2.13.2 and hoping that this
fixes an issue that just showed up yesterday afternoon: not being able to
access function help pages. For example, I tried ?subset and ?split because
I thought the latter is really what I want, yet R told me no help was found.
Strange; it was there a week ago.

Thanks,

Rich
#
On Wed, 5 Oct 2011, Rich Shepard wrote:

            
Yep. The upgrade brought back the help system.

Rich
#
Hi

But you do not lose them; your data frame is cut according to the sites
variable and put into a list

see
List of 3
 $ setosa    :'data.frame':     50 obs. of  5 variables:
  ..$ Sepal.Length: num [1:50] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
  ..$ Sepal.Width : num [1:50] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
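
Output of this shape comes from calling `str()` on a split data frame. The command itself is missing from the archived message; a plausible reconstruction, assuming the built-in `iris` data set split by `Species` (the name `iris.spl` is made up here):

```r
# iris ships with R; splitting by Species gives a list of three data frames
iris.spl <- split(iris, iris$Species)
str(iris.spl)

# Every column is still present in each piece
sapply(iris.spl, ncol)
```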
Which would not matter if the first two letters designate the required
grouping variable, which I called sites.

Regards
Petr
#
On Wed, 5 Oct 2011, Petr PIKAL wrote:

            
I know this, Petr. But adding them to the database table ensures that the
information is there, too.

   This brings up another question, but I should put that on a different
thread. It's about process and work flow; that is, when I can use multiple
factors in the original data frame and when I need to split and subset the
data frame. I think it depends on how many factors can be specified by a
particular model or graph. Regardless, I'll hold off on this as I work
through these initial exploratory steps.

Thanks,

Rich