Skip to content

Syntax Check: rshape2 melt()

23 messages · Nordlund, Dan (DSHS/RDA), Dennis Murphy, Hadley Wickham +1 more

#
This is my first excursion into using reshape2 and I want to ensure that
the melt() function call is syntactically correct.

   The unmodifed data frame is organized this way:

head(tds.anal)
       site   sampdate param quant
1    UDS-O 2006-12-06   TDS 10800
4   STC-FS 1996-06-14  Cond   280
7    UDS-O 2007-10-04    Mg  1620
9    UDS-O 2007-10-04   SO4  7580
19 JCM-10B 2007-06-21    Ca    79
20 JCM-10B 2007-06-21    Cl   114

   What I want looks like this:

     site    sampdate  TDS   Cond   Mg  Ca   Cl  Na  SO4
    UDS-O  2006-12-06  10800  NA   1620 NA   NA  NA  7580

with the actual data for each param, of course.

   I've read the reshape.pdf, reshape2.pdf, the ?melt help page, and the
?melt.data.frame help page. I'm still unclear on the differences among
measure.vars, variable.name, and value.name. After several attempts I have
what may be what the melted tds.anal should look like:

m.tds.anal <- melt(tds.anal, id.vars = c('site', 'sampdate', 'param'), \
measure.vars = 'quant', value.name = 'quant', na.rm = F)
site   sampdate param variable quant
1   UDS-O 2006-12-06   TDS    quant 10800
2  STC-FS 1996-06-14  Cond    quant   280
3   UDS-O 2007-10-04    Mg    quant  1620
4   UDS-O 2007-10-04   SO4    quant  7580
5 JCM-10B 2007-06-21    Ca    quant    79
6 JCM-10B 2007-06-21    Cl    quant   114

   Is the melt() function call correct? Should the melted result look like
the unmelted ("long" form in Paul Dalgaard's book) data with the additional
'variable' column containing 'quant' for each row?

Rich
#
Rich,

What I think you want is as simple as 


test.melted <- melt(test)
wanted <- cast(test.melted, site + sampdate ~ param)


Hope this is helpful,

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
#
On Thu, 27 Oct 2011, Nordlund, Dan (DSHS/RDA) wrote:

            
Dan,

   I see the difference this syntax makes; sampdate is the variable
associated with the value (quant) for a specific ID pair of site and param.
This, unfortunately, doesn't work:

c.tds.anal <- cast(m.tds.anal, site + sampdate ~ param)
Error: could not find function "cast"

   ?cast has an example using acast() so I tried that:

c.tds.anal <- acast(m.tds.anal, site + sampdate ~ param)
Error in eval(expr, envir, enclos) : object 'sampdate' not found

   Changing 'sampdate' to 'variable' produces results, but not what's needed:
Aggregation function missing: defaulting to length
Ca Cl Cond Mg Na SO4 TDS
BC-0.5_sampdate  1  1    3  1  0   3   3
BC-0.5_quant     1  1    3  1  0   3   3
BC-1_sampdate    8  8   11  8  7  10  10
BC-1_quant       8  8   11  8  7  10  10
BC-1.5_sampdate  3  4    6  3  0   6   6
BC-1.5_quant     3  4    6  3  0   6   6

   Now that I better understand melt(), I'll read ?cast.

Thanks very much,

Rich
#
Rich,

I should apologize.  I saw rshape2 and just "glossed" over that and read rshape.  So I was using the rshape package rather than rshape2.  I don't know the relationship between those two packages and/or how they differ.  I am sure that there are others that can help you out here.

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
#
I need more coffee.  That should have been reshape and reshape2, not rshape and rshape2.

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
#
On Thu, 27 Oct 2011, Nordlund, Dan (DSHS/RDA) wrote:

            
Dan,

   Apparently the 'e' in the subject line fell off. It's 'reshape2'.
I, too, don't know how the two packages 'reshape, The Orignal' and
'reshape2, Rebooted' differ. The two articles by Hadley that I read
("Statistical Computing and Graphics", 2005, and "Journal of Statistical
Software", 2007) do not explicitly specify which version of reshape is
addressed.

Rich
#
Rich,

You should be able to get what you want using the dcast function (I think).

test.melted <- melt(tds.anal)
dcast(test.melted, site + sampdate ~ param)


Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
#
On Thu, 27 Oct 2011, Nordlund, Dan (DSHS/RDA) wrote:

            
Dan,

   I don't think the data frame melted properly. I tried the dcast() but
sampdate is not found:

c.tds.anal <- dcast(m.tds.anal, site + sampdate ~ param)
Error in eval(expr, envir, enclos) : object 'sampdate' not found

   Perhaps I still need to better understand the parameters in the melt()
function.

Rich
#
Rich,

I read in the sample data that you put in an earlier post.
site   sampdate param quant
1    UDS-O 2006-12-06   TDS 10800
4   STC-FS 1996-06-14  Cond   280
7    UDS-O 2007-10-04    Mg  1620
9    UDS-O 2007-10-04   SO4  7580
19 JCM-10B 2007-06-21    Ca    79
20 JCM-10B 2007-06-21    Cl   114

Then ran the following code
Using site, sampdate, param as id variables
     site   sampdate param variable value
1   UDS-O 2006-12-06   TDS    quant 10800
2  STC-FS 1996-06-14  Cond    quant   280
3   UDS-O 2007-10-04    Mg    quant  1620
4   UDS-O 2007-10-04   SO4    quant  7580
5 JCM-10B 2007-06-21    Ca    quant    79
6 JCM-10B 2007-06-21    Cl    quant   114
Using site, sampdate, param as id variables
     site   sampdate Ca  Cl Cond   Mg  SO4   TDS
1 JCM-10B 2007-06-21 79 114   NA   NA   NA    NA
2  STC-FS 1996-06-14 NA  NA  280   NA   NA    NA
3   UDS-O 2006-12-06 NA  NA   NA   NA   NA 10800
4   UDS-O 2007-10-04 NA  NA   NA 1620 7580    NA
So you can see that at least it worked for me.  Can you show us the actual output you get when running the melt and dcast functions on a sample of your data?  I am not sure why sampdate is not found since it appears to be in your original sample data.

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
#
On Thu, 27 Oct 2011, Nordlund, Dan (DSHS/RDA) wrote:

            
Dan, et al.:

   No, it's not that simple. Unfortunately. I've tried various combinations
of id.vars, measure.vars, and variable, but just cannot get the dates to
display in the melted data frame. For example,

m.tds.anal <-  melt(tds.anal, id.vars = c('site', 'param'), measure.vars =
'sampdate', variable.name = 'variable', na.rm = T, value.name = 'quant')
site param variable quant
1   UDS-O   TDS sampdate 13488
2  STC-FS  Cond sampdate  9661
3   UDS-O    Mg sampdate 13790
4   UDS-O   SO4 sampdate 13790
5 JCM-10B    Ca sampdate 13685
6 JCM-10B    Cl sampdate 13685

   No matter where I put 'sampdate' it displays as that string rather than as
the date itself.

Rich
#
On Thu, 27 Oct 2011, Nordlund, Dan (DSHS/RDA) wrote:

            
Dan,

   In your 'tds.anal' is sampdate a factor or a date? Here it's a date:

str(tds.anal)
'data.frame':	11434 obs. of  4 variables:
  $ site    : Factor w/ 143 levels "BC-0.5","BC-1",..: 134 127 134 134 73 73
91 91 91 91 ...
  $ sampdate: Date, format: "2006-12-06" "1996-06-14" ...
  $ param   : Factor w/ 7 levels "Ca","Cl","Cond",..: 7 3 4 6 1 2 4 5 1 4 ...
  $ quant   : num  10800 280 1620 7580 79 114 301 27.5 381 274 ...

   I wonder if that's what's not working here.
When I do this I see:

melt(tds.anal)
Using site, param as id variables
             site param variable     value
1          UDS-O   TDS sampdate 13488.000
2         STC-FS  Cond sampdate  9661.000
3          UDS-O    Mg sampdate 13790.000
4          UDS-O   SO4 sampdate 13790.000
5        JCM-10B    Ca sampdate 13685.000
6        JCM-10B    Cl sampdate 13685.000

   Notice there's no sampdate here.

Rich
#
Rich,

You were right about sampdate as I read it in.  How about trying this on your data
variable.name = 'variable', na.rm = T, value.name = 'quant')
Hope this is helpful,

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
#
Try this, based on your small example:
+      site   sampdate param quant
+ 1    UDS-O 2006-12-06   TDS 10800
+ 4   STC-FS 1996-06-14  Cond   280
+ 7    UDS-O 2007-10-04    Mg  1620
+ 9    UDS-O 2007-10-04   SO4  7580
+ 19 JCM-10B 2007-06-21    Ca    79
+ 20 JCM-10B 2007-06-21    Cl   114"), header = TRUE, stringsAsFactors = FALSE)
# Define param so that all of its levels are represented:
tds.a <- within(tds.a, {
           param = factor(param, levels = c('TDS', 'Cond', 'Mg', 'Ca',
'Cl', 'Na', 'SO4'))
           sampdate = as.Date(sampdate)  } )
library('reshape2')
dcast(tds.a, site + sampdate ~ param, value_var = 'quant')

# Result:
     site   sampdate   TDS Cond   Mg Ca  Cl  SO4
1 JCM-10B 2007-06-21    NA   NA   NA 79 114   NA
2  STC-FS 1996-06-14    NA  280   NA NA  NA   NA
3   UDS-O 2006-12-06 10800   NA   NA NA  NA   NA
4   UDS-O 2007-10-04    NA   NA 1620 NA  NA 7580

HTH,
Dennis
On Thu, Oct 27, 2011 at 8:26 AM, Rich Shepard <rshepard at appl-ecosys.com> wrote:
#
See here: https://github.com/hadley/reshape
#
On Thu, 27 Oct 2011, Dennis Murphy wrote:

            
Dennis,

   This produces an error:
tds.a <- within(tds.anal, {param = factor(param, levels = c('TDS', 'Cond',
'Mg', 'Ca', 'Cl', 'Na', 'SO4')) sampdate = as.Date(sampdate)})
Error: unexpected symbol in "tds.a <- within(tds.anal, {param =
factor(param, levels = c('TDS', 'Cond', 'Mg', 'Ca', 'Cl', 'Na', 'SO4'))
sampdate"

   So I did the conversion and melting the long way:
Now sampdate is a factor, and quant has the proper values:
str(tds.a)
'data.frame':	11434 obs. of  4 variables:
  $ site    : Factor w/ 143 levels "BC-0.5","BC-1",..: 134 127 134 134 73 73
  $ sampdate: Factor w/ 890 levels "1981-11-30","1982-04-28",..: 686 268 727
  $ param   : Factor w/ 7 levels "Ca","Cl","Cond",..: 7 3 4 6 1 2 4 5 1 4 ...
  $ quant   : num  10800 280 1620 7580 79 114 301 27.5 381 274 ...

  However, ...
c.tds.anal <- dcast(tds.a, site + sampdate ~ param, value_var = 'quant') 
Aggregation function missing: defaulting to length.

   That's because c.tds.anal has lost the quant values:

str(c.tds.anal)
'data.frame':	2309 obs. of  9 variables:
  $ site    : Factor w/ 143 levels "BC-0.5","BC-1",..: 1 1 1 2 2 2 2 2 2 2
...
  $ sampdate: Factor w/ 890 levels "1981-11-30","1982-04-28",..: 239 252 260
99 101 102 155 163 170 179 ...
  $ Ca      : int  1 0 0 1 1 1 1 1 0 1 ...
  $ Cl      : int  1 0 0 1 1 1 1 1 0 1 ...
  $ Cond    : int  1 1 1 1 1 1 1 1 1 1 ...
  $ Mg      : int  1 0 0 1 1 1 1 1 0 1 ...
  $ Na      : int  0 0 0 1 1 1 1 1 0 1 ...
  $ SO4     : int  1 1 1 1 1 1 1 1 0 1 ...
  $ TDS     : int  1 1 1 1 1 1 1 1 0 1 ...

   I don't see what I did differently from your sample code.

Rich
#
On Thu, 27 Oct 2011, Rich Shepard wrote:

            
I didn't copy the melt command:

   tds.anal.m <- melt(tds.a)
And the above was actually

   c.tds.anal <- dcast(m.tds.anal ...) but produces the same error.

Rich
#
Rich,

It appears that what you did differently was to dcast the un-melted data.  In addition, you shouldn't need to turn your dates into factors.  I assume that your unaltered data is in tds.anal.

Melt the data this way:
Then, cast the data this way:
Does this do what you wanted?


Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
#
On Thu, 27 Oct 2011, Nordlund, Dan (DSHS/RDA) wrote:

            
Dan,

   It's been that sort of a day when that might well have been what I did. So
I checked and re-melted the data with similar lack of results.
OK. This yields:

head(m.tds.anal)
      site   sampdate param variable value
1   UDS-O 2006-12-06   TDS    quant 10800
2  STC-FS 1996-06-14  Cond    quant   280
3   UDS-O 2007-10-04    Mg    quant  1620
4   UDS-O 2007-10-04   SO4    quant  7580
5 JCM-10B 2007-06-21    Ca    quant    79
6 JCM-10B 2007-06-21    Cl    quant   114
I still get this error:

wanted <- dcast(m.tds.anal, site + sampdate ~ param)
Aggregation function missing: defaulting to length

and the values are replaced by lengths in the output:

head(wanted)
     site   sampdate Ca Cl Cond Mg Na SO4 TDS
1 BC-0.5 1996-04-19  1  1    1  1  0   1   1
2 BC-0.5 1996-05-21  0  0    1  0  0   1   1
3 BC-0.5 1996-06-02  0  0    1  0  0   1   1
4   BC-1 1994-05-24  1  1    1  1  1   1   1
5   BC-1 1994-06-02  1  1    1  1  1   1   1
6   BC-1 1994-06-10  1  1    1  1  1   1   1

   This is what happens each time I try. That's why I thought that sampdate
needed to be a factor rather than a date. I certainly have no idea why I'm
not getting valid results, but appreciate the help that you and others
provide so I can learn to do this correctly.

Thanks again,

Rich
#
Rich,

I will be leaving work shortly, and won't be able to look at this again until tomorrow some time.  However, I went back and reread the help page on dcast.  Here is part of that page.

fun.aggregate 

aggregation function needed if variables do not identify a single observation for each output cell. Defaults to length (with a message) if needed but not specified.

It looks like your data may be slightly different than I expected.  The fact that dcast is looking for an aggregation  function suggests that you have more than one record for some combination of site|sampdate|param.  No param value ('TDS', 'SO4', ...) should occur more than once for any combination of site and sampdate (i.e. no multiple measures of the same parameter on the same day in the same site).  So, you might want to check your data to make sure you have no more than 1 measurement per param per sampdate per site.

If that isn't the problem, then I would start thinking about restarting R and then re-installing reshape2.  Good luck.

Dan 

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
#
On Thu, 27 Oct 2011, Nordlund, Dan (DSHS/RDA) wrote:

            
I read that, too, Dan. But did not see how that applies to these data.
Nope. Can't. In the database table site, date, param are the compound
primary key. Besides, it makes no sense to have two concentration
measurements of the same chemical on the same day from the same location. I
spent a lot of time getting rid of duplicate records because the same site
was listed with two spellings (one with a hyphen, one without) or some
records were listed as "TDS" while others from the same day and location
were entered in their spreadsheet as "Total Dissolved Solids". That table is
clean, and the .csv copy didn't change any values on its way into R.
I'll do that tomorrow. It's sorta' a Microsoft solution: shut down
computer, spin around it your chair three times, restart computer.

   Too bad there is not more detailed diagnostic information about the source
of the error, or a way to stick a stub of code in the function call to
illucidate just what dcast() thinks it sees.

Thanks,

Rich
#
On Thu, 27 Oct 2011, Rich Shepard wrote:

            
Dan et al.:

   I apologize for the certainty; anything can happen to data as un-intended
consequences of processing. I will write that there _should_ be no
duplicates.

   What I'm going to do is re-read the data frame from the source text file
that was written out from the postgres table, then subset only a single
stream's worth of data as a test set. I'll obfuscate the identification
(although the data have been submitted to the regulator over the decades)
and make the set available to use if this new approach still has issues with
dcast().

   May not be until Monday that I return with results on this effort as I'm
going to be carfully checking and documenting each step and its results.

Rich
#
On Fri, 28 Oct 2011, Rich Shepard wrote:

            
Dan, et al.:

   Got it working properly now. Took a different approach after re-reading
the source data into an R data frame.

   I applied the subset() function to extract all rows for a single stream
(all commands are on a single line in emacs but wrap when copied into
alpine):
param, quant), drop = T)

   Then I applied the reshape2 melt() function:
which produced this data frame:
site   sampdate param variable  value
1 BC-0.5 1996-06-02    OH    quant     NA
2 BC-0.5 1996-06-02   SO4    quant 194.00
3 BC-0.5 1996-06-02   TDS    quant 530.00
4 BC-0.5 1996-06-02   TSS    quant     NA
5 BC-0.5 1996-04-19    As    quant   0.01
6 BC-1.5 1996-09-19    As    quant     NA

   Then the reshape2 dcast() function produced the desired results:
site   sampdate Acid Ag   Al Alk-HO Alk-Tot   As   Ba Be Bo CO3    Ca Cd
1 BC-0.5 1996-04-19   NA NA 0.07     NA     162 0.01 0.18 NA NA  NA 76.56 NA
2 BC-0.5 1996-05-21   NA NA   NA     NA     152   NA   NA NA NA  NA    NA NA
3 BC-0.5 1996-06-02   NA NA   NA     NA     212   NA   NA NA NA  NA    NA NA
4   BC-1 1988-06-26   NA NA   NA     NA      NA 0.00   NA NA NA  NA    NA NA
5   BC-1 1988-07-30   NA NA   NA     NA      NA 0.00   NA NA NA  NA    NA NA
6   BC-1 1989-05-15   NA NA   NA     NA      NA 0.00   NA NA NA  NA    NA NA

  et cetera.

   Thank you all for your patient help. Now I should be able to produce
scatter plots and run linear regressions on selected parameter pairs by site
and date.

Carpe weekend, all,

Rich