Hello dear R-helpers,
I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two
data frames by characters.
In each data frame I got two different list of names, that is my main-key to
be merged.
To figure out what I'm saying, I build up a modified "?merge" example, with
errors by purpose:
# Data for authors:
authors <- data.frame(
surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 4)))
"Venables" is without the final 's', and "Tierney, without "i".
# Data for books:
books <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Rippley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
other.author = c(NA, "Ripley", NA, NA, NA, NA,
"Venables & Smith"))
With "surname" column instead of "name" (differs from original example for
more easy going merge). And the second "Ripley" with double "p".
So, if I ask for:
merge(authors, books, all=TRUE)
I got:
But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney"
and "Venable" to "Venables". I was wondering if there was any way to work
around this problem. My orginal data have around 27,000 name entries, and if
I take "all=FALSE", this database drops out to around 17,000, most because
mispelling (or truncated expressions). If I take "all=TRUE", I got many of
this <NA> cases like the example above.
Has anyone experienced this? Any idea how I can get out? I'm thinking to
take the longest match possible to each entry. For example, in
"Venable"/"Venables" there is a 87.5% match. As I have name and surname, and
also auxiliary keys to this match, I think this could work.
Thank you in advance.
-----
Victor Delgado
cedeplar.ufmg.br P.H.D. student
www.fjp.mg.gov.br reseacher
--
View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
Sent from the R help mailing list archive at Nabble.com.
Merge data frame with mispelling characters
4 messages · jim holtman, David Winsemius, VictorDelgado
You might try the 'soundex' function in the RecordLinkage package:
soundex('ripley')
[1] "R140"
soundex('rippley')
[1] "R140"
soundex('venable')
[1] "V514"
soundex('venables')
[1] "V514"
soundex('terney')
[1] "T650"
soundex('tierney')
[1] "T650"
On Fri, Nov 2, 2012 at 2:20 PM, VictorDelgado <victor.maia at fjp.mg.gov.br> wrote:
Hello dear R-helpers,
I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two
data frames by characters.
In each data frame I got two different list of names, that is my main-key to
be merged.
To figure out what I'm saying, I build up a modified "?merge" example, with
errors by purpose:
# Data for authors:
authors <- data.frame(
surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 4)))
"Venables" is without the final 's', and "Tierney, without "i".
# Data for books:
books <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Rippley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
other.author = c(NA, "Ripley", NA, NA, NA, NA,
"Venables & Smith"))
With "surname" column instead of "name" (differs from original example for
more easy going merge). And the second "Ripley" with double "p".
So, if I ask for:
merge(authors, books, all=TRUE)
I got:
But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney"
and "Venable" to "Venables". I was wondering if there was any way to work
around this problem. My orginal data have around 27,000 name entries, and if
I take "all=FALSE", this database drops out to around 17,000, most because
mispelling (or truncated expressions). If I take "all=TRUE", I got many of
this <NA> cases like the example above.
Has anyone experienced this? Any idea how I can get out? I'm thinking to
take the longest match possible to each entry. For example, in
"Venable"/"Venables" there is a 87.5% match. As I have name and surname, and
also auxiliary keys to this match, I think this could work.
Thank you in advance.
-----
Victor Delgado
cedeplar.ufmg.br P.H.D. student
www.fjp.mg.gov.br reseacher
--
View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.
On Nov 2, 2012, at 11:20 AM, VictorDelgado wrote:
Hello dear R-helpers,
I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two
data frames by characters.
In each data frame I got two different list of names, that is my main-key to
be merged.
To figure out what I'm saying, I build up a modified "?merge" example, with
errors by purpose:
# Data for authors:
authors <- data.frame(
surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 4)))
"Venables" is without the final 's', and "Tierney, without "i".
# Data for books:
books <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Rippley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
other.author = c(NA, "Ripley", NA, NA, NA, NA,
"Venables & Smith"))
In your example the authors list has better spelling. The 'agrep' functions by default will return matches that are 90% ( or more precisely Levenshtein distance of less than or equalt to 0.1) : books$altname <- NA altidx <- unlist( sapply(books$surname, agrep, authors$surname) ) books$altname[seq(altidx)] <- authors$surname[altidx] books #----------------------- surname title other.author altname 1 Tukey Exploratory Data Analysis <NA> Tukey 2 Venables Modern Applied Statistics ... Ripley Venable 3 Tierney LISP-STAT <NA> Terney 4 Ripley Spatial Statistics <NA> Ripley 5 Rippley Stochastic Simulation <NA> Ripley 6 McNeil Interactive Data Analysis <NA> McNeil 7 R Core An Introduction to R Venables & Smith <NA> If you then match 'books' to 'authors' with a merge on authors$surname and books$altname, you should get closer to your goals
David. > > With "surname" column instead of "name" (differs from original example for > more easy going merge). And the second "Ripley" with double "p". > > So, if I ask for: > > merge(authors, books, all=TRUE) > > I got: > > > But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney" > and "Venable" to "Venables". I was wondering if there was any way to work > around this problem. My orginal data have around 27,000 name entries, and if > I take "all=FALSE", this database drops out to around 17,000, most because > mispelling (or truncated expressions). If I take "all=TRUE", I got many of > this <NA> cases like the example above. > > Has anyone experienced this? Any idea how I can get out? I'm thinking to > take the longest match possible to each entry. For example, in > "Venable"/"Venables" there is a 87.5% match. As I have name and surname, and > also auxiliary keys to this match, I think this could work. > > Thank you in advance. > > > > ----- > Victor Delgado > cedeplar.ufmg.br P.H.D. student > www.fjp.mg.gov.br reseacher > -- > View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. David Winsemius, MD Alameda, CA, USA
David Winsemius wrote
On Nov 2, 2012, at 11:20 AM, VictorDelgado wrote:
Hello dear R-helpers,
I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of
two
data frames by characters.
In each data frame I got two different list of names, that is my main-key
to
be merged.
To figure out what I'm saying, I build up a modified "?merge" example,
with
errors by purpose:
# Data for authors:
authors <- data.frame(
surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 4)))
"Venables" is without the final 's', and "Tierney, without "i".
# Data for books:
books <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Rippley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
other.author = c(NA, "Ripley", NA, NA, NA, NA,
"Venables & Smith"))
In your example the authors list has better spelling. The 'agrep' functions by default will return matches that are 90% ( or more precisely Levenshtein distance of less than or equalt to 0.1) : books$altname <- NA altidx <- unlist( sapply(books$surname, agrep, authors$surname) ) books$altname[seq(altidx)] <- authors$surname[altidx] books #----------------------- surname title other.author altname 1 Tukey Exploratory Data Analysis <NA> Tukey 2 Venables Modern Applied Statistics ... Ripley Venable 3 Tierney LISP-STAT <NA> Terney 4 Ripley Spatial Statistics <NA> Ripley 5 Rippley Stochastic Simulation <NA> Ripley 6 McNeil Interactive Data Analysis <NA> McNeil 7 R Core An Introduction to R Venables & Smith <NA> If you then match 'books' to 'authors' with a merge on authors$surname and books$altname, you should get closer to your goals -- David.
With "surname" column instead of "name" (differs from original example for more easy going merge). And the second "Ripley" with double "p". So, if I ask for: merge(authors, books, all=TRUE) I got: But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney" and "Venable" to "Venables". I was wondering if there was any way to work around this problem. My orginal data have around 27,000 name entries, and if I take "all=FALSE", this database drops out to around 17,000, most because mispelling (or truncated expressions). If I take "all=TRUE", I got many of this
<NA> cases like the example above.
Has anyone experienced this? Any idea how I can get out? I'm thinking to take the longest match possible to each entry. For example, in "Venable"/"Venables" there is a 87.5% match. As I have name and surname, and also auxiliary keys to this match, I think this could work. Thank you in advance. ----- Victor Delgado cedeplar.ufmg.br P.H.D. student www.fjp.mg.gov.br reseacher -- View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-help@
mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD Alameda, CA, USA
______________________________________________
R-help@
mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
It's amazing to get such useful answers so fast. I did not know the RecordLinkage package, it looks very sophisticated and useful for this kind of demand. I just made some tests and I think it could be very useful. I'm working with portuguese spelling names, so I will also test agrep and see what function returns better results, giving less data loss. Thank you a lot, Jim Holtman and also David Winsemius. ----- Victor Delgado cedeplar.ufmg.br P.H.D. student UFOP assistant professor -- View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255p4648266.html Sent from the R help mailing list archive at Nabble.com.