reshape2: Lost Values Between melt() and dcast()

5 messages · Justin Haynes, Rich Shepard

#
Working with 5 subset streams from my source data frame, three of them
successfully call dcast(), but two fail:

jerritt.cast <- dcast(jerritt.melt, site + sampdate ~ param)
Aggregation function missing: defaulting to length

and

winters.cast <- dcast(winters.melt, site + sampdate ~ param)
Aggregation function missing: defaulting to length

   Yet both data frames have the values in their .melt data frames:

summary(jerritt.melt)
       site         sampdate              param       variable
  JCM-1  :2178   Min.   :1978-03-28   pH     : 292   quant:7519
  JCM-20A:2149   1st Qu.:1996-05-24   As     : 286
  JC-E   : 476   Median :2000-05-31   SO4    : 271
  JC     : 400   Mean   :2001-02-04   TDS    : 271
  GD-1   : 395   3rd Qu.:2006-05-31   Cl     : 253
  JC-2   : 349   Max.   :2009-12-30   Zn     : 250
  (Other):1572                        (Other):5896
      value
  Min.   :    0.000
  1st Qu.:    0.005
  Median :    0.650
  Mean   :  317.588
  3rd Qu.:   27.000
  Max.   :20450.000
  NA's   : 2134.000

and

summary(winters.melt)
       site        sampdate              param      variable
  WC     :601   Min.   :1987-07-23   As     : 96   quant:1189
  WC-2   :327   1st Qu.:1994-06-15   TDS    : 79
  WC-1   :261   Median :1995-07-27   NO3-N  : 74
  BC-0.5 :  0   Mean   :1997-05-15   pH     : 72
  BC-1   :  0   3rd Qu.:1996-07-29   SO4    : 69
  BC-1.5 :  0   Max.   :2011-06-06   Cl     : 64
  (Other):  0                        (Other):735
      value
  Min.   :   0.00
  1st Qu.:   0.05
  Median :   7.59
  Mean   :  79.20
  3rd Qu.:  75.00
  Max.   :2587.00
  NA's   : 252.00

   What might be causing dcast() to fail with these two data frames while it
succeeds with three others processed using the same syntax? If additional
information would help, let me know and I'll provide it.

Puzzled,

Rich
#
dcast() gives that message (it is not a failure) when the formula you
gave does not identify a unique value for each cell of the result.  In
that case dcast needs an aggregating function, which defaults to length.

However, the output of the dcast calls that "failed" can help you find
the source of the problem.  I'd look at the outputs of those two dcast
calls and find cells where the length is > 1.  Those correspond to
duplicated entries in your initial data.frames (when I've run into this,
it was usually due to NA values somewhere unexpected).
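
The check described above can be sketched as follows. This is only an illustration on made-up data: toy.melt, toy.counts, param.cols and dup.cells are hypothetical names, and the toy values are not taken from the data in this thread. It assumes a reshape2 version whose dcast() accepts the value.var and fun.aggregate arguments.

```r
library(reshape2)

# Hypothetical melted data: the (JCM-1, 2000-05-31, pH) combination
# appears twice, once with a value and once with NA.
toy.melt <- data.frame(
  site     = c("JCM-1", "JCM-1", "JCM-1", "JC-2"),
  sampdate = as.Date(rep("2000-05-31", 4)),
  param    = c("pH", "pH", "As", "pH"),
  variable = "quant",
  value    = c(7.1, NA, 0.005, 6.9)
)

# Make the implicit default explicit: each cell now holds the number of
# melted rows that fall into it (NA values are counted as rows too).
toy.counts <- dcast(toy.melt, site + sampdate ~ param,
                    value.var = "value", fun.aggregate = length)

# Every column other than the id columns holds counts; keep the rows
# in which at least one count exceeds 1.
param.cols <- setdiff(names(toy.counts), c("site", "sampdate"))
has.dup    <- rowSums(toy.counts[, param.cols, drop = FALSE] > 1) > 0
dup.cells  <- toy.counts[has.dup, ]
print(dup.cells)
```

Any row printed here names a site and sampdate for which at least one param has more than one entry in the melted data.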

Hope that clarifies things.

Justin
On Mon, Oct 31, 2011 at 9:32 AM, Rich Shepard <rshepard at appl-ecosys.com> wrote:
#
On Mon, 31 Oct 2011, Justin Haynes wrote:

            
Justin,

   I'll have to dig in the docs to see how to examine specific rows in the
original data frames because I cannot find where duplicate entries were
generated.

   In the dcast() results for the two problem data frames I found 1 row with
a value of 2 in one and 8 rows each with a value of 2 in the other. When I
look at the original database table, only one row is present for each of the
9. There are about 47.5K rows in the original R data frame so going through
them one at a time is a problem.

   Have you any suggestion on how to examine the data frame and the melted
data frame to see where the problems might be?

Thanks,

Rich
#
On Mon, 31 Oct 2011, Justin Haynes wrote:

            
The data frame returned by dcast() has one row with a '2' in one column.
However, the melt() data frame has only one row with that combination of
site, sampdate, and param. The problem is that both the melt() and the
chemdata data frames show the quant value as 'NA', while the original
database table has the value of 1.0 for that site, sampdate, and param. I'll
re-read the table and see if that fixes the issue with this one subset data
frame.

   Curious how the database table has a value of 1.00 mg/L and the read data
frame contains NA. More curious is why the dcast() data frame has a '2' for
that row.

Rich
#
On Mon, 31 Oct 2011, Rich Shepard wrote:

            
Searching further in emacs through the text file generated by write.table(),
I found two rows with the same values in the columns site, sampdate, and
param. Since a select query on the database table returns only one row, I
cannot explain how the R data frame has 2 rows. Regardless, thanks to
Justin's suggestions, I've fixed one subset data frame and will now fix the
other.
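
Rows like these can also be located from within R rather than by searching an exported text file. The following is a minimal sketch using base R's duplicated() on the three key columns. The data frame toy.melt and its values are hypothetical stand-ins for a melted data frame, and key.cols, is.dup and dup.rows are names chosen for the example.

```r
# Hypothetical stand-in for a melted data frame with one duplicated key.
toy.melt <- data.frame(
  site     = c("WC", "WC", "WC-1", "WC-2"),
  sampdate = as.Date(c("1995-07-27", "1995-07-27",
                       "1995-07-27", "1996-07-29")),
  param    = c("pH", "pH", "pH", "As"),
  variable = "quant",
  value    = c(1.0, NA, 7.6, 0.05)
)

key.cols <- c("site", "sampdate", "param")
keys     <- toy.melt[, key.cols]

# duplicated() flags only the second and later copies, so combine it with
# a pass from the other end to flag every row that shares its key.
is.dup   <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
dup.rows <- toy.melt[is.dup, ]

# Sort so that rows sharing a key sit next to each other.
dup.rows <- dup.rows[order(dup.rows$site, dup.rows$sampdate,
                           dup.rows$param), ]
print(dup.rows)
```

With the duplicated rows side by side, it is easier to see whether they differ only in value (for example, one NA and one measurement) and to decide which one to keep.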

Rich