Skip to content

merging two dataframes

12 messages · dividend, Jeff Newmiller, Timothy Bates +1 more

#
Hello

I am a new user of R and I need help to merge two large datasets about
stocks with different number of rows and columns.
Both have 2 variable(column) that are have same values ("name" and "date1),
but they are not in same order and "data3" contains much more observations.
In the "data" have dividend information about each stock, and in the second
there are information about the stock prices.

I tried to use this function:
data4<-merge(data,data3, by="name","date1")

But it does not work. 

"data 3" has more than 800.000 observations of daily stock prices for each
stock, and "data" has only 14.000, so i want to make a function that merge
the two datasets to a new dataset with the same length as "data", and also
include all the variables in "data3" where "name" (example: Statoil) and
"date1"(example:31.12.2000)  is the same in both sets.

Can someone please help me?

Regards



--
View this message in context: http://r.789695.n4.nabble.com/merging-two-dataframes-tp3932869p3932869.html
Sent from the R help mailing list archive at Nabble.com.
#
here is some more info:
http://r.789695.n4.nabble.com/file/n3932898/help.jpg 

this is what`s happend when I run the function in R

I appreciate if someone can give me some inputs here

Thanks in advance



--
View this message in context: http://r.789695.n4.nabble.com/merging-two-dataframes-tp3932869p3932898.html
Sent from the R help mailing list archive at Nabble.com.
#
Hi
"date1),
observations.
second
Does

data4<-merge(data,data3, by=c("name","date1"), all=T)

do what you want?

I think merge help page is quite explanatory and you shall look at it.

Regards
Petr
each
merge
also
http://www.R-project.org/posting-guide.html
#
Thank you Petr!

I have read on the merge help page, but I cant figure out how to write this
function.
When I use your function it includes all data from "data3", but all columns
in "data" has "NA"(without "name" and "date". I hoped to keep these values
to.
 
I try to explain it more precise:
In "data" with 14000 observations about company name, date, size of dividend
etc.
In "data3" there are 800000 daily observations of the stockprices for each
company listed:

name            date               div
statoil          17.05.2000        5
statoil          18.05.2001        6
........           ..............       ...
Yara             17.05.2000       10
etc

I want to get the stockprice for statoil, yara etc from "data3", and merge
it into a new data like this:

name            date               div     price
statoil          17.05.2000        5       120
statoil          18.05.2001        6       130
........           ..............       ...
Yara             17.05.2000       10      200
etc

And also keep the rest of the columns from both datasets. 

name            date               div     price        industry     secid  
etc
statoil          17.05.2000        5       120         .......         .....   
...
statoil          18.05.2001        6       130         .......        
......   ....   
........           ..............       ...
Yara             17.05.2000       10      200        .....            ......  
...
etc
 
Using the first function i get this result:

name            date               div     price        industry     secid  
etc
statoil          17.05.2000        NA      120         .......         .....   
...
statoil          18.05.2001        NA      130         .......        
......   ....   
........           ..............       ...
Yara             17.05.2000       NA     200        .....            ......  
...

Hope you understand what i want to do.

Thanks again, I really appreciate it!
etc

--
View this message in context: http://r.789695.n4.nabble.com/merging-two-dataframes-tp3932869p3933101.html
Sent from the R help mailing list archive at Nabble.com.
#
this
columns
values

Because name and date common columns and you do not need them twice. Merge 
shall work as you expected provided input data are OK. The result of 
command I suggested shall be all values from both data frames aligned by 
name and date.

It is impossible to say what is wrong without reproducible data. Did you 
check size of your data frames after merge? I suspect that date in data 
and date in data3 is somewhat different. You shall check it with ?str 
function.

As Posting guide says you shall provide some suitable data so that we can 
look at it and work with it. Something like 

dput(data[1:20,])

could be enough.

Regards
Petr
dividend
each
merge
http://www.R-project.org/posting-guide.html
1 day later
#
Hello.

Now i tried to do what you told me.
I used the str(fuction), and data$date1 and data3$date1 where both listed
"character". I changed "name" to character but it did not work either.
I also changed all variables to character, with no positive result.

str(data)
'data.frame':   14446 obs. of  15 variables:
 $ id     : chr  "1" "1" "1" "1" ...
 $ compid : chr  "2514" "2514" "2514" "2514" ...
 $ secid  : chr  "15856" "15856" "15856" "15856" ...
 $ name   : chr  "A-pressen" "A-pressen" "A-pressen" "A-pressen" ...
 $ period : chr  "1" "2" "3" "4" ...
 $ date   : chr  "17.05.1980" "17.05.1981" "17.05.1982" "17.05.1983" ...
 $ enddate: chr  "17.05.1981" "17.05.1982" "17.05.1983" "17.05.1984" ...
 $ div    : chr  NA NA NA NA ...
 $ ndivs  : chr  NA NA NA NA ...
 $ posdiv : chr  NA NA NA NA ...
 $ ddiv2  : chr  NA NA NA NA ...
 $ ddiv3  : chr  NA NA NA NA ...
 $ ddiv4  : chr  NA NA NA NA ...
 $ ddiv5  : chr  NA NA NA NA ...
 $ ddiv6  : chr  NA NA NA NA ...

str(data3)
'data.frame':   812354 obs. of  9 variables:
 $ date                  : chr  "02.01.1996" "03.01.1996" "04.01.1996"
"05.01.1996" ...
 $ Securityid            : chr  "6001" "6001" "6001" "6001" ...
 $ Symbol                : chr  "AAV" "AAV" "AAV" "AAV" ...
 $ name                  : chr  "Adresseavisen" "Adresseavisen"
"Adresseavisen" "Adresseavisen" ...
 $ Securitytype          : chr  "Ordinary Shares" "Ordinary Shares"
"Ordinary Shares" "Ordinary Shares" ...
 $ Unadjusted            : chr  "200" "200" "200" "200" ...
 $ Event.adjusted        : chr  "200" "200" "200" "200" ...
 $ Div.and.Event.adjusted: chr  "109,7595375" "109,7595375" "109,7595375"
"109,7595375" ...
 $ Sharesissued          : chr  "1901646" "1901646" "1901646" "1901646" ...

Here is some suitable data for "data"
structure(list(id = c("1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"), 
    compid = c("2514", "2514", "2514", "2514", "2514", "2514", 
    "2514", "2514", "2514", "2514", "2514", "2514", "2514", "2514", 
    "2514", "2514", "2514", "2514", "2514", "2514"), secid = c("15856", 
    "15856", "15856", "15856", "15856", "15856", "15856", "15856", 
    "15856", "15856", "15856", "15856", "15856", "15856", "15856", 
    "15856", "15856", "15856", "15856", "15856"), name = c("A-pressen", 
    "A-pressen", "A-pressen", "A-pressen", "A-pressen", "A-pressen", 
    "A-pressen", "A-pressen", "A-pressen", "A-pressen", "A-pressen", 
    "A-pressen", "A-pressen", "A-pressen", "A-pressen", "A-pressen", 
    "A-pressen", "A-pressen", "A-pressen", "A-pressen"), period = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", 
    "13", "14", "15", "16", "17", "18", "19", "20"), date = c("17.05.1980", 
    "17.05.1981", "17.05.1982", "17.05.1983", "17.05.1984", "17.05.1985", 
    "17.05.1986", "17.05.1987", "17.05.1988", "17.05.1989", "17.05.1990", 
    "17.05.1991", "17.05.1992", "17.05.1993", "17.05.1994", "17.05.1995", 
    "17.05.1996", "17.05.1997", "17.05.1998", "17.05.1999"), 
    enddate = c("17.05.1981", "17.05.1982", "17.05.1983", "17.05.1984", 
    "17.05.1985", "17.05.1986", "17.05.1987", "17.05.1988", "17.05.1989", 
    "17.05.1990", "17.05.1991", "17.05.1992", "17.05.1993", "17.05.1994", 
    "17.05.1995", "17.05.1996", "17.05.1997", "17.05.1998", "17.05.1999", 
    "17.05.2000"), div = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    "0", "0", "0", "0", "0", "5", "0", "1.1", "1.2", "1", "0"
    ), ndivs = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, "0", "0", 
    "0", "0", "0", "1", "0", "1", "1", "1", "0"), posdiv = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "1", 
    NA, "1", "1", "1", NA), ddiv2 = c(NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, "0", "0", "0", "0", "0", NA, "0", "1", NA, 
    NA), ddiv3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    "0", "0", "0", "0", "0", "0", "0", "0", "-1"), ddiv4 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "0", "0", "0", 
    "0", "0", "0", "0", "0"), ddiv5 = c(NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, "0", "0", "0", "0", "0", "0", 
    "0"), ddiv6 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, "0", "0", "0", "0", "0", "0")), .Names = c("id", 
"compid", "secid", "name", "period", "date", "enddate", "div", 
"ndivs", "posdiv", "ddiv2", "ddiv3", "ddiv4", "ddiv5", "ddiv6"
), row.names = c(NA, 20L), class = "data.frame")




Here is some suitable data for "data3":
structure(list(date = c("02.01.1996", "03.01.1996", "04.01.1996", 
"05.01.1996", "08.01.1996", "09.01.1996", "10.01.1996", "11.01.1996", 
"12.01.1996", "15.01.1996", "16.01.1996", "17.01.1996", "18.01.1996", 
"19.01.1996", "22.01.1996", "23.01.1996", "24.01.1996", "25.01.1996", 
"26.01.1996", "29.01.1996"), Securityid = c("6001", "6001", "6001", 
"6001", "6001", "6001", "6001", "6001", "6001", "6001", "6001", 
"6001", "6001", "6001", "6001", "6001", "6001", "6001", "6001", 
"6001"), Symbol = c("AAV", "AAV", "AAV", "AAV", "AAV", "AAV", 
"AAV", "AAV", "AAV", "AAV", "AAV", "AAV", "AAV", "AAV", "AAV", 
"AAV", "AAV", "AAV", "AAV", "AAV"), name = c("Adresseavisen", 
"Adresseavisen", "Adresseavisen", "Adresseavisen", "Adresseavisen", 
"Adresseavisen", "Adresseavisen", "Adresseavisen", "Adresseavisen", 
"Adresseavisen", "Adresseavisen", "Adresseavisen", "Adresseavisen", 
"Adresseavisen", "Adresseavisen", "Adresseavisen", "Adresseavisen", 
"Adresseavisen", "Adresseavisen", "Adresseavisen"), Securitytype =
c("Ordinary Shares", 
"Ordinary Shares", "Ordinary Shares", "Ordinary Shares", "Ordinary Shares", 
"Ordinary Shares", "Ordinary Shares", "Ordinary Shares", "Ordinary Shares", 
"Ordinary Shares", "Ordinary Shares", "Ordinary Shares", "Ordinary Shares", 
"Ordinary Shares", "Ordinary Shares", "Ordinary Shares", "Ordinary Shares", 
"Ordinary Shares", "Ordinary Shares", "Ordinary Shares"), Unadjusted =
c("200", 
"200", "200", "200", "200", "200", "200", "200", "200", "200", 
"200", "200", "200", "200", "200", "200", "200", "200", "200", 
"200"), Event.adjusted = c("200", "200", "200", "200", "200", 
"200", "200", "200", "200", "200", "200", "200", "200", "200", 
"200", "200", "200", "200", "200", "200"), Div.and.Event.adjusted =
c("109,7595375", 
"109,7595375", "109,7595375", "109,7595375", "109,7595375", "109,7595375", 
"109,7595375", "109,7595375", "109,7595375", "109,7595375", "109,7595375", 
"109,7595375", "109,7595375", "109,7595375", "109,7595375", "109,7595375", 
"109,7595375", "109,7595375", "109,7595375", "109,7595375"), 
    Sharesissued = c("1901646", "1901646", "1901646", "1901646", 
    "1901646", "1901646", "1901646", "1901646", "1901646", "1901646", 
    "1901646", "1901646", "1901646", "1901646", "1901646", "1901646", 
    "1901646", "1901646", "1901646", "1901646")), .Names = c("date", 
"Securityid", "Symbol", "name", "Securitytype", "Unadjusted", 
"Event.adjusted", "Div.and.Event.adjusted", "Sharesissued"), row.names =
c(NA, 
20L), class = "data.frame")



When I run the function:
 
data4<-merge(data,data3, by=c("name","date1"), all=T)
structure(list(name = c("A-pressen", "A-pressen", "A-pressen", 
"A-pressen", "A-pressen", "A-pressen", "A-pressen", "A-pressen", 
"A-pressen", "A-pressen", "A-pressen", "A-pressen", "A-pressen", 
"A-pressen", "A-pressen", "A-pressen", "A-pressen", "A-pressen", 
"A-pressen", "A-pressen"), date = c("01.02.1999", "01.02.2000", 
"01.02.2001", "01.02.2002", "01.03.1999", "01.03.2000", "01.03.2001", 
"01.03.2002", "01.04.2003", "01.06.1999", "01.06.2001", "01.07.1999", 
"01.07.2002", "01.07.2003", "01.08.2000", "01.08.2001", "01.08.2002", 
"01.08.2003", "01.09.1999", "01.09.2000"), id = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    compid = c(NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), secid = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), period = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_), enddate = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), div = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), ndivs = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), posdiv = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), ddiv2 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), ddiv3 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), ddiv4 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), ddiv5 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), ddiv6 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), Securityid = c("15856", 
    "15856", "15856", "15856", "15856", "15856", "15856", "15856", 
    "15856", "15856", "15856", "15856", "15856", "15856", "15856", 
    "15856", "15856", "15856", "15856", "15856"), Symbol = c("APR", 
    "APR", "APR", "APR", "APR", "APR", "APR", "APR", "APR", "APR", 
    "APR", "APR", "APR", "APR", "APR", "APR", "APR", "APR", "APR", 
    "APR"), Securitytype = c("Ordinary Shares", "Ordinary Shares", 
    "Ordinary Shares", "Ordinary Shares", "Ordinary Shares", 
    "Ordinary Shares", "Ordinary Shares", "Ordinary Shares", 
    "Ordinary Shares", "Ordinary Shares", "Ordinary Shares", 
    "Ordinary Shares", "Ordinary Shares", "Ordinary Shares", 
    "Ordinary Shares", "Ordinary Shares", "Ordinary Shares", 
    "Ordinary Shares", "Ordinary Shares", "Ordinary Shares"), 
    Unadjusted = c("120", "140", "160", "105", "110", "190", 
    "160", "112", "115", "120", "150", "127", "106,5", "154", 
    "155", "160", "111", "155", "127", "205"), Event.adjusted = c("120", 
    "140", "160", "105", "110", "190", "160", "112", "115", "120", 
    "150", "127", "106,5", "154", "155", "160", "111", "155", 
    "127", "205"), Div.and.Event.adjusted = c("111,4092308", 
    "129,9774359", "148,545641", "100,1538462", "102,1251282", 
    "176,3979487", "148,545641", "106,8307692", "109,6923077", 
    "111,4092308", "143,0769231", "117,9081026", "101,5846154", 
    "154", "143,9035897", "152,6153846", "105,8769231", "155", 
    "117,9081026", "190,3241026"), Sharesissued = c("8839643", 
    "8839643", "8854307", "8866191", "8839643", "8839643", "8854307", 
    "8867791", "8885537", "8839643", "8866191", "8839643", "8885537", 
    "8903842", "8854307", "8866191", "8885537", "8903842", "8839643", 
    "8854307")), .Names = c("name", "date", "id", "compid", "secid", 
"period", "enddate", "div", "ndivs", "posdiv", "ddiv2", "ddiv3", 
"ddiv4", "ddiv5", "ddiv6", "Securityid", "Symbol", "Securitytype", 
"Unadjusted", "Event.adjusted", "Div.and.Event.adjusted", "Sharesissued"
), row.names = c(NA, 20L), class = "data.frame")


If I want to just keep the observations from "data" (14446 observations)
merged with "data3". Is it correct to use:
data4<-merge(data,data3, by=c("name","date1"), all=T)
or should I use:
data4<-merge(data,data3, by=c("name","date1"), all.x=F)? 
(Then I get ca 14000 obs, but "NA" in all variables from "data3".

In advance, thank you.



--
View this message in context: http://r.789695.n4.nabble.com/merging-two-dataframes-tp3932869p3940157.html
Sent from the R help mailing list archive at Nabble.com.
#
I think you want something like this (I like to be explicit about what you are merging)

df3 = merge(df1, df2, by = "date", all=T)

You can be explicit about what you are merging on in each file:

df3 = merge(df1,df2, by.x = "date?, by.y="date", all=T)

You were trying to merge on ?date1? but it looks to me like your data frames actually contains columns called ?date? not ?date1"

As Petr says, in the vanilla situation where there is no overlap of data and the ID column has the same name in both frames, then 
merge(frame1, frame2) works by itself.

tip: don?t use words like ?data? as variable names, as that is also a function
On 26 Oct 2011, at 11:59 AM, dividend wrote:

            
#
I pasted wrong function, I have changed from "date1" to "date" (ignore that).
I think it have to be something wrong with my data format. I can`t
understand why it don't work. I know I can use by.x=.... and by.y=...., but
since both datasets have the same variable name it should be unnecessary to
do that. 



--
View this message in context: http://r.789695.n4.nabble.com/merging-two-dataframes-tp3932869p3940396.html
Sent from the R help mailing list archive at Nabble.com.
#
listed

You have no date1 only date. Therefore

result<- merge(data, data3, by=c("date", "name"), all=T)

takes all values from both data frames
[1] 20 15
[1] 20  9

alltogether 24 columns from which 4 are date and name columns therefore 20 
columns contain data.
[1] 40 22

So the result has all 20 columns from both data frames plus one name and 
one date column and all rows from both data frames = 40. Those two sets 
are disjoint. If you had some common date and name in both data frames 
these rows would be merged on the same row in result.

Let us try this.
result<- merge(data, data3, by=c("date", "name"), all=T)
dim(result)
[1] 37 22

Regards
Petr
"109,7595375"
...
c("17.05.1980",
"17.05.1985",
"17.05.1990",
"17.05.1995",
"17.05.1989",
"17.05.1994",
"17.05.1999",
Shares",
Shares",
Shares",
Shares",
"109,7595375",
"109,7595375",
"109,7595375",
NA_character_,
NA_character_,
NA_character_,
NA_character_,
c(NA_character_,
c(NA_character_,
c(NA_character_,
c(NA_character_,
c(NA_character_,
c(NA_character_,
c(NA_character_,
c("15856",
http://www.R-project.org/posting-guide.html
#
that).
but
to
Again

check dimensions of your the three data frames. Number of rows in final 
data frame shall be at least same as the number of rows in bigger data 
frame and lower than sum of rows of both merged data frames.

Regards
Petr
http://www.R-project.org/posting-guide.html
#
So when I do the merge on your example frames, I get the expected result.

But the example component dataframes you sent are already full of NAs, and there are no rows which are present in both data sets. So I think perhaps, that merge is just highlighting a problem that has its roots in your component data.

t
On 26 Oct 2011, at 1:16 PM, dividend wrote: