
merge a list of data frames

8 messages · David Winsemius, Sam Steingold, Jeff Newmiller +1 more

#
I have a list of data frames:
List of 4
 $ :'data.frame':	700773 obs. of  3 variables:
  ..$ V1: chr [1:700773] "200130446465779" "200070050127778" "200030633708779" "200010587002779" ...
  ..$ V2: int [1:700773] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ V3: num [1:700773] 1 1 1 1 1 ...
 $ :'data.frame':	700773 obs. of  3 variables:
  ..$ V1: chr [1:700773] "200130446465779" "200070050127778" "200030633708779" "200010587002779" ...
  ..$ V2: int [1:700773] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ V3: num [1:700773] 1 1 1 1 1 ...
 $ :'data.frame':	700773 obs. of  3 variables:
  ..$ V1: chr [1:700773] "200130446465779" "200070050127778" "200030633708779" "200010587002779" ...
  ..$ V2: int [1:700773] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ V3: num [1:700773] 1 1 1 1 1 ...
 $ :'data.frame':	700773 obs. of  3 variables:
  ..$ V1: chr [1:700773] "200160325893778" "200130647544079" "200130446465779" "200120186959078" ...
  ..$ V2: int [1:700773] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ V3: num [1:700773] 1 1 1 1 1 1 1 1 1 1 ...

I want to merge them.
I tried to follow
http://rwiki.sciviews.org/doku.php?id=tips%3adata-frames%3amerge
and did:
Warning message:
In merge.data.frame(f1, f2, by = c("V1"), all = TRUE) :
  column names 'V2.x', 'V3.x', 'V2.y', 'V3.y' are duplicated in the result
'data.frame':	700773 obs. of  9 variables:
 $ V1  : chr  "100010000099079" "100010000254078" "100010000499078" "100010000541779" ...
 $ V2.x: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V3.x: num  0.476 0.748 0.442 0.483 0.577 ...
 $ V2.y: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V3.y: num  0.476 0.748 0.442 0.483 0.577 ...
 $ V2.x: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V3.x: num  0.476 0.752 0.443 0.485 0.578 ...
 $ V2.y: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V3.y: num  0.47 0.733 0.57 0.416 0.616 ...

I don't like the warning, and I don't like that I now have to use [n] to
access identically named columns, but I guess this is better than
this:

library('reshape')
Error in merge.data.frame(dfs[[1]], Recall(dfs[-1]), all = TRUE, sort = FALSE,  : 
  formal argument "all" matched by multiple actual arguments
Error in merge.data.frame(dfs[[1]], Recall(dfs[-1]), all = TRUE, sort = FALSE,  : 
  formal argument "all" matched by multiple actual arguments
Error in merge.data.frame(dfs[[1]], Recall(dfs[-1]), all = TRUE, sort = FALSE,  : 
  formal argument "sort" matched by multiple actual arguments
Error in `[.data.frame`(df, , match(names(dfs[[1]]), names(df))) : 
  undefined columns selected
Error in `[.data.frame`(df, , match(names(dfs[[1]]), names(df))) : 
  undefined columns selected

what does 'formal argument "sort" matched by multiple actual arguments' mean?
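It appears to mean that merge() received the same formal argument twice: judging from the error text, merge_recurse() in the old reshape package hard-codes all = TRUE and sort = FALSE while also forwarding its '...', so supplying either of them yourself duplicates it. A minimal sketch reproducing the message (not the reshape code itself; merge2 is an illustrative name):

```r
## A wrapper that hard-codes all = TRUE and also forwards '...' sends
## the formal argument 'all' to merge() twice when the caller supplies
## it as well, triggering the matching error at call time.
merge2 <- function(x, y, ...) merge(x, y, all = TRUE, ...)
d <- data.frame(a = 1:2)
msg <- tryCatch(merge2(d, d, all = TRUE), error = conditionMessage)
## msg: 'formal argument "all" matched by multiple actual arguments'
```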

thanks.
#
On Sep 5, 2012, at 8:51 PM, Sam Steingold wrote:

Why? What are you expecting?
Perhaps it would make more sense to explain what your goal was, rather than showing us two divergent efforts, neither of which is satisfactory? Perhaps?
#
I don't really know what you want, but if you have many columns with the same names I am wondering why this is so.  Do you really want to merge, which puts all of the non-key columns side-by-side in one data frame? If so, why don't you start by renaming the columns so they will make sense in the combined data frame?

If you really want the column names to stay the same, perhaps you want to stack the data frames "vertically" with rbind?
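A minimal sketch of the rename-then-merge route, with toy data standing in for the real score frames (all names illustrative):

```r
## Toy stand-ins with the same shape as the real frames: id, actual, score.
dfs <- list(
  data.frame(V1 = c("a", "b", "c"), V2 = c(0L, 1L, 0L), V3 = c(0.1, 0.2, 0.3)),
  data.frame(V1 = c("a", "b", "c"), V2 = c(0L, 1L, 0L), V3 = c(0.4, 0.5, 0.6))
)
## Make the score column name unique per frame before merging.
for (i in seq_along(dfs)) names(dfs[[i]])[3] <- paste0("score", i)
## Fold merge() over the list; by = c("V1", "V2") keeps one copy of the keys.
combined <- Reduce(function(x, y) merge(x, y, by = c("V1", "V2"), all = TRUE), dfs)
names(combined)  # "V1" "V2" "score1" "score2"
```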
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.
Sam Steingold <sds at gnu.org> wrote:

#
these are the results of applying a model to the test data.
the first column is the ID
the second column is the actual value
the third column is the model score

After I merge the frames, I will:
1. check that all the V2 columns are identical and drop all but one
(I guess I could just merge on c("V1","V2") instead, right?)

2. compute the sum (or the mean, whatever is easier) of all the V3
columns

3. sort by the sum/mean of the V3 columns and evaluate the combined
model using the lift quality metric
(http://dl.acm.org/citation.cfm?id=380995.381018)

I have many more score files (not just 4), so it is not practical for me
to rename the column to something unique.
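Assuming the frames have been merged on c("V1", "V2") into one wide frame with uniquely named score columns, steps 2 and 3 reduce to something like this sketch (names illustrative):

```r
## 'combined' stands in for the merged result: keys plus one score per model.
combined <- data.frame(V1 = c("a", "b"), V2 = c(0L, 1L),
                       score1 = c(0.2, 0.8), score2 = c(0.4, 0.6))
score.cols <- grep("^score", names(combined))
## Step 2: mean of the per-model scores (use rowSums for the sum instead).
combined$mean.score <- rowMeans(combined[score.cols], na.rm = TRUE)
## Step 3: sort best-scoring rows first before evaluating lift.
combined <- combined[order(combined$mean.score, decreasing = TRUE), ]
```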
#
"not practical [...] to rename the column to something unique."

On the contrary, since R is a scripting language it is quite practical. Depending on the format of your data, it is probably necessary as well.

If all of the files have exactly the same number of rows corresponding to the same key values, you may be able to load each file and cbind the V3 columns into a matrix. That is the only scenario I see where you don't need to rename the columns.

If you do need the key-alignment capability that merge offers, then it may be most effective to: load the data into a list of data frames; add a column to each data frame containing the desired column name (perhaps derived from the file name the data were loaded from, or just a letter with sequential numbers at the end); stack the data frames into a single frame (rbind, or ldply from the plyr package); and then use the reshape2 package's dcast function to form the wide, combined data frame, with the necessary unique column names, against your key c(V1, V2).
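The stack-then-cast route might look like this sketch (assumes the reshape2 package is installed; toy data, model names illustrative):

```r
library(reshape2)

## Toy stand-ins for the score frames.
dfs <- list(
  data.frame(V1 = c("a", "b"), V2 = c(0L, 1L), V3 = c(0.1, 0.2)),
  data.frame(V1 = c("a", "b"), V2 = c(0L, 1L), V3 = c(0.3, 0.4))
)
## Tag each frame with a model id (could be derived from the file name).
for (i in seq_along(dfs)) dfs[[i]]$model <- paste0("m", i)
## Stack into long format, then cast wide against the key c(V1, V2).
long <- do.call(rbind, dfs)
wide <- dcast(long, V1 + V2 ~ model, value.var = "V3")
## wide has columns V1, V2, m1, m2
```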

Depending on the algorithm you plan to use, you may not need or want to do the dcast step at all. The plyr package or the sqldf package or the ave base function can let you combine computations on groups of rows instead of on columns.
Sam Steingold <sds at gnu.org> wrote:

#
Hi
colSums(apply(do.call(cbind, lapply(data, "[", "V2")), 1, diff) != 0)

should give all zeros if there is no difference.

sapply(lapply(data, "[", "V3"), sum)
sapply(lapply(data, "[", "V3"), mean)

should give you a table of sums or means; sorting them is straightforward.

The most tedious part of my response was preparing toy data, so please be kind enough to provide yours in an appropriate way:

dput(head(data))

Regards
Petr
#
On Sep 6, 2012, at 6:42 AM, Sam Steingold wrote:

In which case you should be using the 'by' argument to `merge` and specifying that it is only the first column that is to be used. In the current situation merge will attempt to use all of the columns, because the three columns have the same names.

Notice the difference in these results:
[1] a b
<0 rows> (or 0-length row.names)
a b.x b.y
1 1   5  10
2 2   6  11
3 3   7  12

(`merge` "by" arguments can be column numbers.)
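A pair of calls that reproduces those two results (toy data):

```r
## Two frames sharing the key 'a' but holding different values in 'b'.
df1 <- data.frame(a = 1:3, b = 5:7)
df2 <- data.frame(a = 1:3, b = 10:12)
## Default merge uses ALL shared column names, c("a", "b"): no row agrees
## in both columns, so the result is empty.
merge(df1, df2)
## Merging by the key alone keeps both b columns, suffixed .x and .y.
merge(df1, df2, by = "a")
```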
Depends what you want. I already suggested you only want to merge on the id column.
`aggregate` should do that without difficulty.
That's going to require more background (or more money, since they want $15.00 for a pdf).
Which column?
David Winsemius, MD
Alameda, CA, USA
#
I already do! see my initial message!
:-)
That I have already implemented; it works just fine:

proficiency <- function (actual, prediction) {
  proficiency1(ea = entropy(table(actual)),
               ep = entropy(table(prediction)),
               ej = entropy(table(actual,prediction)))
}

proficiency1 <- function (ea, ep, ej) {
  mi <- ea + ep - ej
  list(joint = ej, actual = ea, prediction = ep, mutual = mi,
       proficiency = mi / ea, dependency = mi / ej)
}

detector.statistics <- function (tp,fn,fp,tn) {
  observationCount <- tp + fn + fp + tn
  predictedPositive <- tp + fp
  predictedNegative <- fn + tn
  actualPositive <- tp + fn
  actualNegative <- fp + tn
  correct <- tp + tn
  list(baseRate = actualPositive / observationCount,
       precision = if (tp == 0) 0 else tp / predictedPositive,
       specificity = if (tn == 0) 0 else tn / actualNegative,
       recall = if (tp == 0) 0 else tp / actualPositive,
       accuracy = correct / observationCount,
       lift = (tp * observationCount) / (predictedPositive * actualPositive),
       f1score = if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn),
       proficiency = proficiency1(ej = entropy(c(tp,fn,fp,tn)),
         ea = entropy(c(actualPositive,actualNegative)),
         ep = entropy(c(predictedPositive,predictedNegative))))
}

## v should be vector of 0&1 sorted according to some model
## Gregory Piatetsky-Shapiro, Samuel Steingold
## "Measuring Lift Quality in Database Marketing"
## http://sds.podval.org/data/l-quality.pdf
## http://www.sigkdd.org/explorations/issues/2-2-2000-12/piatetsky-shapiro.pdf
## SIGKDD Explorations, Vol. 2:2, (2000), 81-86
## tests: lift.quality(rbinom(10000,size=1,prob=0.1)) ==> ~0
##        lift.quality(rev(round((1:10000)/12000))) ==> 1
lift.quality <- function (v, plot = TRUE, file = NULL, main = "lift curve", thresholds = NULL) {
  target.count <- sum(v)
  total.count <- length(v)
  base.rate <- target.count / total.count
  target.level <- cumsum(v)/target.count
  lq <- ((2*sum(target.level) - 1)/total.count - 1) / (1 - base.rate)
  if (plot) {
    if (!is.null(file)) {
      pdf(file = file)
      on.exit(dev.off())
    }
    plot(x=(1:total.count)/total.count,y=target.level,type="l",
         main=paste(main,"( lift quality ",lq,")"),
         xlab="% cutoff",ylab="cumulative % hit")
  }
  if (is.null(thresholds)) thresholds = c(base.rate)
  list(lift.quality = lq,
       detector.statistics = sapply(thresholds, function (l) {
         cutoff <- round(l * total.count)
         tp <- round(target.level[cutoff] * target.count) # = sum(v[1:cutoff])
         fn <- target.count - tp
         fp <- cutoff - tp
         tn <- total.count - target.count - cutoff + tp
         detector.statistics(tp, fn, fp, tn)
       }))
}
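The functions above call an entropy() helper that isn't shown in the message; a minimal definition consistent with how it is used on count vectors and tables would be (an assumption, not the author's code):

```r
## Assumed helper: Shannon entropy (in nats) of a vector of counts,
## e.g. the output of table(); zero counts contribute nothing.
entropy <- function (counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}
```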
the 3rd ("score") column.

Meanwhile I realised that the fastest way is actually the shell:
sort+cut+paste produced a csv file which can be loaded into R much
faster than the individual score files, so this issue is now purely
academic. However, I appreciate the replies I got so far; it was quite
educational, thanks!
(I also appreciate comments on the code above)