
Descriptive Stats from Data Frame

10 messages · Rich Shepard, Tal Galili, David Winsemius +1 more

#
I don't find how to do what I need to do in Dalgaard or 'R Cookbook', so
I'm asking here.

   I have a data frame with water chemistry data and I want to start
exploring these data. There are three factors (site, date, chemical)
associated with each measurement. The data frame looks like this:
site_id.sample_date.param.quant
  BC-0.5|1996-04-19|Arsenic|0.01              :    1
  BC-0.5|1996-04-19|Calcium|76.56             :    1
  BC-0.5|1996-04-19|Chloride|12               :    1
  BC-0.5|1996-04-19|Magnesium|43.23           :    1
  BC-0.5|1996-04-19|Sulfate|175               :    1
  BC-0.5|1996-04-19|Total Dissolved Solids|460:    1
  (Other)                                     :14880

   I want first to calculate (and plot) descriptive stats by chemical,
ignoring site and date and telling R to ignore missing data. (Incorporating
those factors will occur later.) What I have not been able to figure out is
how to specify the command to, for example, calculate mean and sd for
Arsenic. My floundering and thrashing includes attempts like these:
Error in is.numeric(x) : 'x' is missing
Error in mean(chemdata.quant, param = "Arsenic") :
   object 'chemdata.quant' not found
[1] NA
Warning message:
In mean.default(chemdata$quant, param = "Arsenic") :
   argument is not numeric or logical: returning NA

   As a newcomer to R I've done a lot of reading, yet all the examples use
nicely structured data to illustrate the point being made. I need to work
with my data and learn how to specify columns and write correct commands for
the analyses I need. Please point me in the right direction.

Rich
#
On Aug 30, 2011, at 5:00 PM, Rich Shepard wrote:

            
It appears that your original file was delimited by "|" and you used
something else, perhaps the default white-space setting? I think you
need to go back and do your input operations again with sep="|".

(Or you could provide str() on the data.frame rather than making us  
guess.)
#
Hi Rich,

I am not sure what you really want, but it seems to me you want to calculate the mean over all rows where the chemical is Arsenic?

Try this to get a little more insight:

mean(chemdata$quant[chemdata$param=="Arsenic"])

The expression chemdata$param=="Arsenic" is a logical vector that is TRUE for every row in which the variable param takes the value "Arsenic". Try it in your R console to see it and understand the R concept!

If you now want all values of a certain column for the rows that have "Arsenic" as param, you just write:

chemdata$COLUMNNAME[chemdata$param=="Arsenic"]

I do not know if this helps, as it seems to me that Arsenic only occurs once in your data frame?

Good luck Simon
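
A minimal sketch of the logical-indexing idea above, using a small made-up data frame in place of the real chemdata. It assumes the file has already been read into separate param and quant columns, and it adds na.rm = TRUE, which is not in the message above, because the original question asks to ignore missing data:

```r
# Toy stand-in for chemdata; the values are invented for illustration
chemdata <- data.frame(
  param = c("Arsenic", "Calcium", "Arsenic", "Arsenic"),
  quant = c(0.01, 76.56, NA, 0.03)
)

# Logical vector: TRUE for every row whose param is "Arsenic"
is_arsenic <- chemdata$param == "Arsenic"

# Use it to pick the matching quant values; na.rm = TRUE skips the NA
arsenic_mean <- mean(chemdata$quant[is_arsenic], na.rm = TRUE)
arsenic_sd   <- sd(chemdata$quant[is_arsenic], na.rm = TRUE)

print(is_arsenic)
print(arsenic_mean)
print(arsenic_sd)
```

Without na.rm = TRUE, a single NA among the selected values makes both mean() and sd() return NA.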
On Aug 30, 2011, at 11:00 PM, Rich Shepard wrote:

            
#
On Wed, 31 Aug 2011, Tal Galili wrote:

            
Tal,

   Yes, summary() is inappropriate. I do want str() instead. And what that
shows is:
'data.frame':   14886 obs. of  1 variable:
  $ site_id.sample_date.param.quant: Factor w/ 14886 levels
"BC-0.5|1996-04-19|Arsenic|0.01",..: 11579 14219 13298 11982 11909 13371
13082 111 12 23 ...
Error: unexpected '==' in "mean(chemdata$quant[chemdata$param > =="
[1] NA
Warning message:
In mean.default(chemdata$quant[chemdata$param == "Arsenic"]) :
   argument is not numeric or logical: returning NA

   I find it easy following the syntax in the half-dozen or more books I've
read, but quite difficult to apply what I read to my own real-world data.
:-)

Thanks,

Rich
#
On Tue, 30 Aug 2011, David Winsemius wrote:

            
David,

   Yes, the csv file separator is the pipe.
'data.frame':   14886 obs. of  1 variable:
  $ site_id.sample_date.param.quant: Factor w/ 14886 levels
"BC-0.5|1996-04-19|Arsenic|0.01",..: 11579 14219 13298 11982 11909 13371
13082 111 12 23 ...

Thanks,

Rich
#
On Aug 30, 2011, at 5:30 PM, Rich Shepard wrote:

            
It is _not_ a csv file. You need to use read.table with sep = "|".
David Winsemius, MD
West Hartford, CT
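
A minimal sketch of reading pipe-delimited data with read.table and sep = "|". The two data rows are copied from the summary output earlier in the thread. The header line, and the use of text = instead of a file name, are assumptions made so the example is self-contained:

```r
# Inline sample in the pipe-delimited layout shown in the thread.
# The header line is an assumption about the real file.
raw <- "site_id|sample_date|param|quant
BC-0.5|1996-04-19|Arsenic|0.01
BC-0.5|1996-04-19|Total Dissolved Solids|460"

# sep = "|" splits each line on the pipe instead of on white space.
# stringsAsFactors = TRUE gives factor columns as in the str() output
# in this thread (since R 4.0 the default is FALSE).
chemdata <- read.table(text = raw, sep = "|", header = TRUE,
                       stringsAsFactors = TRUE)

str(chemdata)
```

With a real file, the text = raw argument would be replaced by the file name as the first argument.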
#
On Tue, 30 Aug 2011, David Winsemius wrote:

            
David,

   Yes, that's better. I did not know of the sep option. The new results:
'data.frame':   14886 obs. of  4 variables:
  $ site_id    : Factor w/ 148 levels "BC-0.5","BC-1",..: 104 145 126 115 114
128 124 2 3 3 ...
  $ sample_date: Factor w/ 1012 levels "1980-03-01","1980-05-01",..: 432 410
423 405 398 408 401 360 366 407 ...
  $ param      : Factor w/ 8 levels "Arsenic","Calcium",..: 1 1 1 1 1 1 1 1 1
1 ...
  $ quant      : num  0.06 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...

   Now I can work on the commands I need.

Many thanks,

Rich
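
With param as a factor and quant numeric, the per-chemical statistics asked for in the first message could be sketched as below. This is one possible approach rather than anything proposed in the thread: it uses tapply() on a small made-up data frame, with na.rm = TRUE to skip missing values:

```r
# Toy stand-in for chemdata; the values are invented for illustration
chemdata <- data.frame(
  param = factor(c("Arsenic", "Arsenic", "Calcium", "Calcium", "Calcium")),
  quant = c(0.01, 0.03, 70, NA, 80)
)

# Apply mean and sd to quant within each level of param, ignoring NAs
means <- tapply(chemdata$quant, chemdata$param, mean, na.rm = TRUE)
sds   <- tapply(chemdata$quant, chemdata$param, sd, na.rm = TRUE)

print(means)
print(sds)

# In an interactive session, one box per chemical:
# boxplot(quant ~ param, data = chemdata)
```

tapply() returns a named vector with one entry per chemical, so means["Arsenic"] picks out a single result.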
#
On Aug 30, 2011, at 5:38 PM, Rich Shepard wrote:

            
That does look more workable. You might consider changing the dates  
with:

chemdata$samp_date <- as.Date(as.character(chemdata$sample_date))
David Winsemius, MD
West Hartford, CT
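
A small sketch of the date conversion suggested above, on a two-element factor built from dates that appear in the thread. The variable names are illustrative only:

```r
# A factor of ISO-format date strings, as read.table produced in the thread
sample_date <- factor(c("1996-04-19", "1980-03-01"))

# Go through character explicitly, then parse as Date
samp_date <- as.Date(as.character(sample_date))

print(class(samp_date))
print(range(samp_date))

# Date arithmetic now works, which it would not on a factor
print(diff(range(samp_date)))
```

Once the column is a Date, sorting, range() and plotting against time use calendar order rather than factor level order.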
#
On Tue, 30 Aug 2011, David Winsemius wrote:

            
David,

   I was thinking that I needed to do this. Thank you. It's now done.

   Good progress for the first afternoon applying R to my own data.

Rich