Skip to content

Problem merging data frames and duplicates

8 messages · William Michels, Frank Burbrink, John Kane

#
Hello All,

I am attempting to merge two data frames that naturally contain duplicate
entries, however when using either merge or dMerge I get even more
duplicates.

For example:

data.frame(state=c("IL", "IL", "LA","LA", "MS","MS", "AR", "AR"),
locus=c(1,1,2,2,3,4,5,6))->x

data.frame(state=c("IL", "IL", "AR", "AR", "TN","TN"),
locus=c(1,1,2,3,3,4,4))->y

These yield:

 x
  state locus
1    IL     1
2    IL     1
3    LA     2
4    LA     2
5    MS     3
6    MS     4
7    AR     5
8    AR     6

y
  state locus
1    IL     1
2    IL     1
3    AR     2
4    AR     3
5    TN     3
6    TN     4

However, when merged by "state: they produce another second set of AR and
IL:
merge(x,y,by=1,all=T)->z

   state locus.x locus.y
1     AR       5       2
2     AR       5       3
3     AR       6       2
4     AR       6       3
5     IL       1       1
6     IL       1       1
7     IL       1       1
8     IL       1       1
9     LA       2      NA
10    LA       2      NA
11    MS       3      NA
12    MS       4      NA
13    TN      NA       3
14    TN      NA       4

While, the NAs are desired when states are missing, I don't want the extra
duplicated states and values but rather:

   state locus.x locus.y
1     AR       5       2
4     AR       6       3
7     IL       1       1
8     IL       1       1
9     LA       2      NA
10    LA       2      NA
11    MS       3      NA
12    MS       4      NA
13    TN      NA       3
14    TN      NA       4

Any help would be much appreciated.

Thanks!

Frank
#
Hi Frank,

It looks like you're very close. I think you want:

unique(merge(x, y, by = 1, all=T))

Gabor Grothendieck's sqldf package is very useful if you're more
comfortable with SQL-type syntax, see:

https://github.com/ggrothendieck/sqldf

Best Regards,

William (Bill) Michels, Ph.D.



On Tue, May 26, 2015 at 5:12 PM, Frank Burbrink
<burbrink666 at gmail.com> wrote: <SNIP>
#
Thanks Bill,

However, unique(merge(x, y, by = 1, all=T)) is giving me:

   state locus.x locus.y
1     AR       5       2
2     AR       5       3
3     AR       6       2
4     AR       6       3
5     IL       1       1
9     LA       2      NA
11    MS       3      NA
12    MS       4      NA
13    TN      NA       3
14    TN      NA       4

This has AR repeated twice and the normal double IL and LA now only listed
singly.

What I am hoping for is something like this:

   state locus.x locus.y
1     AR       5       2
4     AR       6       3
7     IL       1       1
8     IL       1       1
9     LA       2      NA
10    LA       2      NA
11    MS       3      NA
12    MS       4      NA
13    TN      NA       3
14    TN      NA       4

On Wed, May 27, 2015 at 3:53 AM, William Michels <wjm1 at caa.columbia.edu>
wrote:

  
    
#
I have figured out a cheesy work around since these problems have to do
with not having unique identifiers for the States:

1) Append a unique identifier to each state such that both AR becomes ARa
and ARb
2) run the normal merge(x,y,by=1, all=T)
3) Use subst to cut the appended identifiers.

While this is clunky I can just write a function to do it all at once.

On Wed, May 27, 2015 at 6:20 AM, Frank Burbrink <burbrink666 at gmail.com>
wrote:

  
    
#
y has unequal n's . There are 6 states and 7 loci.  

It is safer to submit data in the dput() form. See ?dput for information.



John Kane
Kingston ON Canada
____________________________________________________________
FREE 3D EARTH SCREENSAVER - Watch the Earth right on your desktop!
#
Hi Frank!

Ok, bind columns together in a state-wise fashion, allowing for state
duplicates. Below (maybe cheesy) uses the state abbreviations
"state.abb" in the datasets package. Also uses two functions
"rbind.na" and "cbind.na", available from Andrej-Nikolai Spiess'
website at:  http://www.dr-spiess.de . These functions are pretty
helpful in that they bind without recycling, inserting NAs instead.

by.state1 <- function(df1, df2) {
  df_out <- data.frame( state1=character(), locus1=numeric(),
state2=character(), locus2=numeric() )
  for(i in 1:50) {
    df_out_per <- cbind.na(df1[df1$state == state.abb[i], ],
df2[df2$state == state.abb[i], ])
    df_out <- rbind.na(df_out, df_out_per)
  }
  df_out
}

data.frame(state=c("IL", "IL", "LA","LA", "MS","MS", "AR", "AR"),
           locus=c(1,1,2,2,3,4,5,6)) -> x

##edit y to correct number of loci (6 not 7)
data.frame(state=c("IL", "IL", "AR", "AR", "TN","TN"),
           locus=c(1,1,2,3,3,4)) -> y
state locus state locus
7     AR     5    AR     2
8     AR     6    AR     3
3     IL     1    IL     1
4     IL     1    IL     1
31    LA     2  <NA>    NA
41    LA     2  <NA>    NA
5     MS     3  <NA>    NA
6     MS     4  <NA>    NA
9   <NA>    NA    TN     3
10  <NA>    NA    TN     4

The rows will be in state alphabetical order. If you need the row
numbers cleaned up as well (numeric order) you can pre-merge your data
to states.abb:
state locus state locus
1     AR     5    AR     2
2     AR     6    AR     3
3     IL     1    IL     1
4     IL     1    IL     1
5     LA     2  <NA>    NA
6     LA     2  <NA>    NA
7     MS     3  <NA>    NA
8     MS     4  <NA>    NA
9   <NA>    NA    TN     3
10  <NA>    NA    TN     4
Hope this helps,

Bill

William Michels, Ph.D.
On Wed, May 27, 2015 at 4:01 AM, Frank Burbrink <burbrink666 at gmail.com> wrote:
#
Interesting solutions. Thanks guys!

On Wed, May 27, 2015 at 9:27 AM, William Michels <wjm1 at caa.columbia.edu>
wrote:

  
    
#
Notes:

1. You can get output for all states (alphabetically) by pre-merging
data with states.abb:

a1 <- merge(state.abb, x, by= 1, all.x=T)
b1 <- merge(state.abb, y, by= 1, all.x=T)
colnames(a1) <- c("state", "locus")
colnames(b1) <- c("state", "locus")
by.states1(a1,b1)

2. In my previous post (based on the very small test data given), it
seemed output could be arranged sequentially by row numbers, while
keeping states alphabetically arranged. This doesn't appear to be
possible in general. The larger example above is in state-alphabetical
order, but row numbers are not sequential.

Best, Bill

William Michels, Ph.D.
On Wed, May 27, 2015 at 6:42 AM, Frank Burbrink <burbrink666 at gmail.com> wrote: