Merge data frame with mispelling characters
You might try the 'soundex' function in the RecordLinkage package:
soundex('ripley')
[1] "R140"
soundex('rippley')
[1] "R140"
soundex('venable')
[1] "V514"
soundex('venables')
[1] "V514"
soundex('terney')
[1] "T650"
soundex('tierney')
[1] "T650"
On Fri, Nov 2, 2012 at 2:20 PM, VictorDelgado <victor.maia at fjp.mg.gov.br> wrote:
Hello dear R-helpers,
I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two
data frames by characters.
In each data frame I got two different list of names, that is my main-key to
be merged.
To figure out what I'm saying, I build up a modified "?merge" example, with
errors by purpose:
# Data for authors:
authors <- data.frame(
surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 4)))
"Venables" is without the final 's', and "Tierney, without "i".
# Data for books:
books <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Rippley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
other.author = c(NA, "Ripley", NA, NA, NA, NA,
"Venables & Smith"))
With "surname" column instead of "name" (differs from original example for
more easy going merge). And the second "Ripley" with double "p".
So, if I ask for:
merge(authors, books, all=TRUE)
I got:
But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney"
and "Venable" to "Venables". I was wondering if there was any way to work
around this problem. My orginal data have around 27,000 name entries, and if
I take "all=FALSE", this database drops out to around 17,000, most because
mispelling (or truncated expressions). If I take "all=TRUE", I got many of
this <NA> cases like the example above.
Has anyone experienced this? Any idea how I can get out? I'm thinking to
take the longest match possible to each entry. For example, in
"Venable"/"Venables" there is a 87.5% match. As I have name and surname, and
also auxiliary keys to this match, I think this could work.
Thank you in advance.
-----
Victor Delgado
cedeplar.ufmg.br P.H.D. student
www.fjp.mg.gov.br reseacher
--
View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.