On Nov 2, 2012, at 11:20 AM, VictorDelgado wrote:
Hello dear R-helpers,
I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of
two
data frames by characters.
In each data frame I got two different list of names, that is my main-key
to
be merged.
To figure out what I'm saying, I build up a modified "?merge" example,
with
errors by purpose:
# Data for authors:
authors <- data.frame(
surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 4)))
"Venables" is without the final 's', and "Tierney, without "i".
# Data for books:
books <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Rippley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
other.author = c(NA, "Ripley", NA, NA, NA, NA,
"Venables & Smith"))
In your example the authors list has better spelling. The 'agrep'
functions by default will return matches that are 90% ( or more precisely
Levenshtein distance of less than or equalt to 0.1) :
books$altname <- NA
altidx <- unlist( sapply(books$surname, agrep, authors$surname) )
books$altname[seq(altidx)] <- authors$surname[altidx]
books
#-----------------------
surname title other.author altname
1 Tukey Exploratory Data Analysis
<NA>
Tukey
2 Venables Modern Applied Statistics ... Ripley Venable
3 Tierney LISP-STAT
<NA>
Terney
4 Ripley Spatial Statistics
<NA>
Ripley
5 Rippley Stochastic Simulation
<NA>
Ripley
6 McNeil Interactive Data Analysis
<NA>
McNeil
7 R Core An Introduction to R Venables & Smith
<NA>
If you then match 'books' to 'authors' with a merge on authors$surname and
books$altname, you should get closer to your goals
--
David.
With "surname" column instead of "name" (differs from original example
for
more easy going merge). And the second "Ripley" with double "p".
So, if I ask for:
merge(authors, books, all=TRUE)
I got:
But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney"
and "Venable" to "Venables". I was wondering if there was any way to work
around this problem. My orginal data have around 27,000 name entries, and
if
I take "all=FALSE", this database drops out to around 17,000, most
because
mispelling (or truncated expressions). If I take "all=TRUE", I got many
of
this
<NA>
cases like the example above.
Has anyone experienced this? Any idea how I can get out? I'm thinking to
take the longest match possible to each entry. For example, in
"Venable"/"Venables" there is a 87.5% match. As I have name and surname,
and
also auxiliary keys to this match, I think this could work.
Thank you in advance.
-----
Victor Delgado
cedeplar.ufmg.br P.H.D. student
www.fjp.mg.gov.br reseacher
--
View this message in context:
http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
Sent from the R help mailing list archive at Nabble.com.