Descriptive Stats from Data Frame

10 messages · Rich Shepard, Tal Galili, David Winsemius +1 more

Rich Shepard

Tue, Aug 30, 2011 2:00 PM #

I don't find how to do what I need to do in Dalgaard or 'R Cookbook', so
I'm asking here.

   I have a data frame with water chemistry data and I want to start
exploring these data. There are three factors (site, date, chemical)
associated with each measurement. The data frame looks like this:

> summary(chemdata)
site_id.sample_date.param.quant
  BC-0.5|1996-04-19|Arsenic|0.01              :    1
  BC-0.5|1996-04-19|Calcium|76.56             :    1
  BC-0.5|1996-04-19|Chloride|12               :    1
  BC-0.5|1996-04-19|Magnesium|43.23           :    1
  BC-0.5|1996-04-19|Sulfate|175               :    1
  BC-0.5|1996-04-19|Total Dissolved Solids|460:    1
  (Other)                                     :14880

   I want first to calculate (and plot) descriptive stats by chemical,
ignoring site and date and telling R to ignore missing data. (Incorporating
those factors will occur later.) What I have not been able to figure out is
how to specify the command to, for example, calculate mean and sd for
Arsenic. My floundering and thrashing includes attempts like these:

> mean(chemdata.param="Arsenic")
Error in is.numeric(x) : 'x' is missing
> mean(chemdata.quant, param="Arsenic")
Error in mean(chemdata.quant, param = "Arsenic") :
   object 'chemdata.quant' not found
> mean(chemdata$quant, param="Arsenic")
[1] NA
Warning message:
In mean.default(chemdata$quant, param = "Arsenic") :
   argument is not numeric or logical: returning NA

   As a newcomer to R I've done a lot of reading, yet all the examples use
nicely structured data to illustrate the point being made. I need to work
with my data and learn how to specify columns and write correct commands for
the analyses I need. Please point me in the right direction.

Rich

Tal Galili

Tue, Aug 30, 2011 2:12 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110831/28b19a2c/attachment.pl>

David Winsemius

Tue, Aug 30, 2011 2:13 PM #

On Aug 30, 2011, at 5:00 PM, Rich Shepard wrote:

It appears that your original file was delimited by "|" and you used
something else, perhaps the default white-space setting? I think you
need to go back and do your input operations again with sep="|".

(Or you could provide str() on the data.frame rather than making us  
guess.)

David

> BC-0.5|1996-04-19|Arsenic|0.01              :    1
> BC-0.5|1996-04-19|Calcium|76.56             :    1
> BC-0.5|1996-04-19|Chloride|12               :    1
> BC-0.5|1996-04-19|Magnesium|43.23           :    1
> BC-0.5|1996-04-19|Sulfate|175               :    1
> BC-0.5|1996-04-19|Total Dissolved Solids|460:    1
> (Other)                                     :14880
>
>  I want first to calculate (and plot) descriptive stats by chemical,
> ignoring site and date and telling R to ignore missing data.  
> (Incorporating
> those factors will occur later.) What I have not been able to figure  
> out is
> how to specify the command to, for example, calculate mean and sd for
> Arsenic. My floundering and thrashing includes attempts like these:
>
>> mean(chemdata.param="Arsenic")
> Error in is.numeric(x) : 'x' is missing
>> mean(chemdata.quant, param="Arsenic")
> Error in mean(chemdata.quant, param = "Arsenic") :
>  object 'chemdata.quant' not found
>> mean(chemdata$quant, param="Arsenic")
> [1] NA
> Warning message:
> In mean.default(chemdata$quant, param = "Arsenic") :
>  argument is not numeric or logical: returning NA
>
>  As a newcomer to R I've done a lot of reading, yet all the examples  
> use
> nicely structured data to illustrate the point being made. I need to  
> work
> with my data and learn how to specify columns and write correct  
> commands for
> the analyses I need. Please point me in the right direction.
>
> Rich
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT

Tue, Aug 30, 2011 2:22 PM #

Hi Rich,

I am not sure what you really want; it seems to me that you want to calculate the mean of all rows where the chemical is Arsenic?

Try this to get a little more insight:

mean(chemdata$quant[chemdata$param=="Arsenic"])

The expression chemdata$param=="Arsenic" is a logical vector that is TRUE for every row in which the variable param takes the value "Arsenic"; chemdata[chemdata$param=="Arsenic",] uses it to select those rows of the data frame. Try it in your R console to see it and understand the R concept!

If you now want all values of a certain column for the rows that have "Arsenic" as param, you just write:

chemdata$COLUMNNAME[chemdata$param=="Arsenic"]

I do not know if this helps, as it seems to me that Arsenic occurs only once in your data frame.

Good luck Simon
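A minimal sketch of the logical-indexing idea above, assuming chemdata already has separate param and quant columns (which requires reading the file with the right separator first). The small data frame and the helper names is_arsenic and arsenic are made up for illustration; na.rm = TRUE is the standard argument for telling mean() and sd() to ignore missing values.

```r
# Hypothetical stand-in for chemdata; the values are invented for illustration.
chemdata <- data.frame(
  param = c("Arsenic", "Arsenic", "Calcium", "Arsenic"),
  quant = c(0.01, 0.03, 76.56, NA)
)

# Logical vector with one entry per row: TRUE where param is "Arsenic".
is_arsenic <- chemdata$param == "Arsenic"
is_arsenic

# Use the logical vector to pick the matching quant values.
arsenic <- chemdata$quant[is_arsenic]

# na.rm = TRUE drops missing values before computing the statistic.
mean(arsenic, na.rm = TRUE)
sd(arsenic, na.rm = TRUE)
```

Without na.rm = TRUE, a single NA among the selected values makes mean() and sd() return NA.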


Rich Shepard

Tue, Aug 30, 2011 2:28 PM #

On Wed, 31 Aug 2011, Tal Galili wrote:

Tal,

   Yes, summary() is inappropriate. I do want str() instead. And what that
shows is:

> str(chemdata)
'data.frame':   14886 obs. of  1 variable:
  $ site_id.sample_date.param.quant: Factor w/ 14886 levels
"BC-0.5|1996-04-19|Arsenic|0.01",..: 11579 14219 13298 11982 11909 13371
13082 111 12 23 ...

Error: unexpected '==' in "mean(chemdata$quant[chemdata$param > =="

> mean(chemdata$quant[chemdata$param == "Arsenic"])
[1] NA
Warning message:
In mean.default(chemdata$quant[chemdata$param == "Arsenic"]) :
   argument is not numeric or logical: returning NA

   I find it easy following the syntax in the half-dozen or more books I've
read, but quite difficult to apply what I read to my own real-world data.
:-)

Thanks,

Rich

Rich Shepard

Tue, Aug 30, 2011 2:30 PM #

On Tue, 30 Aug 2011, David Winsemius wrote:

David,

   Yes, the csv file separator is the pipe.

> str(chemdata)
'data.frame':   14886 obs. of  1 variable:
  $ site_id.sample_date.param.quant: Factor w/ 14886 levels
"BC-0.5|1996-04-19|Arsenic|0.01",..: 11579 14219 13298 11982 11909 13371
13082 111 12 23 ...

Thanks,

Rich

David Winsemius

Tue, Aug 30, 2011 2:38 PM #

On Aug 30, 2011, at 5:30 PM, Rich Shepard wrote:

It is _not_ a csv file. You need to use read.table with sep = "|".
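A minimal sketch of why the separator matters, using a made-up three-row sample in the same layout as the posted data. It assumes the file was first read with a comma-based reader such as read.csv(), which the thread does not confirm; the names raw, wrong and right are hypothetical.

```r
# Hypothetical sample in the pipe-delimited layout shown in the post.
raw <- "site_id|sample_date|param|quant
BC-0.5|1996-04-19|Arsenic|0.01
BC-0.5|1996-04-19|Calcium|76.56
BC-0.5|1996-04-19|Total Dissolved Solids|460"

# With a comma separator there is nothing to split on,
# so each whole line lands in a single column.
wrong <- read.csv(text = raw)
str(wrong)

# Naming the pipe as the separator yields the four intended columns,
# with quant read as a number.
right <- read.table(text = raw, sep = "|", header = TRUE)
str(right)
```

For a file on disk, the same call would take the file name in place of text = raw. Note that the single mangled column name, site_id.sample_date.param.quant, is the pipe-delimited header with each "|" replaced by ".".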

David Winsemius, MD
West Hartford, CT

Rich Shepard

Tue, Aug 30, 2011 2:38 PM #

On Tue, 30 Aug 2011, David Winsemius wrote:

David,

   Yes, that's better. I did not know of the sep option. The new results:

> str(chemdata)
'data.frame':   14886 obs. of  4 variables:
  $ site_id    : Factor w/ 148 levels "BC-0.5","BC-1",..: 104 145 126 115 114
128 124 2 3 3 ...
  $ sample_date: Factor w/ 1012 levels "1980-03-01","1980-05-01",..: 432 410
423 405 398 408 401 360 366 407 ...
  $ param      : Factor w/ 8 levels "Arsenic","Calcium",..: 1 1 1 1 1 1 1 1 1
1 ...
  $ quant      : num  0.06 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...

   Now I can work on the commands I need.

Many thanks,

Rich
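With param and quant now in separate columns, the original goal (mean and sd by chemical, ignoring missing values) can be sketched as follows. This is one possible approach using base R's tapply(), not something proposed in the thread; the data values and the names means, sds and stats are invented for illustration.

```r
# Hypothetical stand-in for the correctly read chemdata.
chemdata <- data.frame(
  param = c("Arsenic", "Arsenic", "Calcium", "Calcium", "Calcium"),
  quant = c(0.01, 0.03, 70, 80, NA)
)

# One value per chemical; na.rm = TRUE is passed through to mean() and sd().
means <- tapply(chemdata$quant, chemdata$param, mean, na.rm = TRUE)
sds   <- tapply(chemdata$quant, chemdata$param, sd, na.rm = TRUE)

# Collect the results in a small summary table.
stats <- data.frame(
  param = names(means),
  mean  = as.vector(means),
  sd    = as.vector(sds)
)
stats
```

For the plotting half of the question, boxplot(quant ~ param, data = chemdata) is a common base R starting point.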

David Winsemius

Tue, Aug 30, 2011 3:08 PM #

On Aug 30, 2011, at 5:38 PM, Rich Shepard wrote:

That does look more workable. You might consider changing the dates  
with:

chemdata$samp_date <- as.Date(as.character(chemdata$sample_date))
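A small sketch of that conversion on made-up data, assuming the dates are stored as a factor of "YYYY-MM-DD" strings as in the str() output. Only the column names come from the thread; the values are invented.

```r
# Hypothetical data frame with dates held as a factor.
chemdata <- data.frame(
  sample_date = factor(c("1996-04-19", "1980-03-01", "1996-04-19"))
)

# as.character() turns the factor into its labels explicitly,
# and as.Date() parses the ISO-style strings into Date values.
chemdata$samp_date <- as.Date(as.character(chemdata$sample_date))
str(chemdata)

# Date values sort and compare chronologically.
range(chemdata$samp_date)
```

Once the column is a Date, ordering and date arithmetic follow the calendar rather than the order of the factor levels.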

David Winsemius, MD
West Hartford, CT

Rich Shepard

Tue, Aug 30, 2011 4:10 PM #

On Tue, 30 Aug 2011, David Winsemius wrote:

David,

   I was thinking that I needed to do this. Thank you. It's now done.

   Good progress for the first afternoon applying R to my own data.

Rich