Skip to content

Merge data frame with mispelling characters

4 messages · jim holtman, David Winsemius, VictorDelgado

#
Hello dear R-helpers,

I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two
data frames by characters. 
In each data frame I got two different list of names, that is my main-key to
be merged.

To figure out what I'm saying, I build up a modified "?merge" example, with
errors by purpose:

# Data for authors:

authors <- data.frame(
    surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
    nationality = c("US", "Australia", "US", "UK", "Australia"),
    deceased = c("yes", rep("no", 4)))

"Venables" is without  the final 's', and "Tierney, without "i".

# Data for books:

books <- data.frame(
    surname = I(c("Tukey", "Venables", "Tierney",
             "Ripley", "Rippley", "McNeil", "R Core")),
    title = c("Exploratory Data Analysis",
              "Modern Applied Statistics ...",
              "LISP-STAT",
              "Spatial Statistics", "Stochastic Simulation",
              "Interactive Data Analysis",
              "An Introduction to R"),
    other.author = c(NA, "Ripley", NA, NA, NA, NA,
                     "Venables & Smith"))
 
With "surname" column instead of "name" (differs from original example for
more easy going merge). And the second "Ripley" with double "p".

So, if I ask for:

merge(authors, books, all=TRUE)

I got:


But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney"
and "Venable" to "Venables". I was wondering if there was any way to work
around this problem. My orginal data have around 27,000 name entries, and if
I take "all=FALSE", this database drops out to around 17,000, most because
mispelling (or truncated expressions). If I take "all=TRUE", I got many of
this <NA> cases like the example above.

Has anyone experienced this? Any idea how I can get out? I'm thinking to
take the longest match possible to each entry. For example, in
"Venable"/"Venables" there is a 87.5% match. As I have name and surname, and
also auxiliary keys to this match, I think this could work.

Thank you in advance.



-----
Victor Delgado
cedeplar.ufmg.br P.H.D. student
www.fjp.mg.gov.br reseacher
--
View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
Sent from the R help mailing list archive at Nabble.com.
#
You might try the 'soundex' function in the RecordLinkage package:
[1] "R140"
[1] "R140"
[1] "V514"
[1] "V514"
[1] "T650"
[1] "T650"
On Fri, Nov 2, 2012 at 2:20 PM, VictorDelgado <victor.maia at fjp.mg.gov.br> wrote:

  
    
#
On Nov 2, 2012, at 11:20 AM, VictorDelgado wrote:

            
In your example the authors list has better spelling. The 'agrep' functions by default will return matches that are 90% ( or more precisely Levenshtein distance of less than or equalt to 0.1) :


books$altname <- NA
altidx <- unlist( sapply(books$surname, agrep, authors$surname) )
books$altname[seq(altidx)] <- authors$surname[altidx]
books
#-----------------------
   surname                         title     other.author altname
1    Tukey     Exploratory Data Analysis             <NA>   Tukey
2 Venables Modern Applied Statistics ...           Ripley Venable
3  Tierney                     LISP-STAT             <NA>  Terney
4   Ripley            Spatial Statistics             <NA>  Ripley
5  Rippley         Stochastic Simulation             <NA>  Ripley
6   McNeil     Interactive Data Analysis             <NA>  McNeil
7   R Core          An Introduction to R Venables & Smith    <NA>

If you then match 'books' to 'authors' with a merge on authors$surname and books$altname, you should get closer to your goals
#
David Winsemius wrote

            

            
It's amazing to get such useful answers so fast.  I did not know the
RecordLinkage package, it looks very sophisticated and useful for this kind
of demand. I just made some tests and I think it could be very useful. 
I'm working with portuguese spelling names, so I will also test agrep and
see what function returns better results, giving less data loss. 
Thank you a lot, Jim Holtman and also David Winsemius.



-----
Victor Delgado
cedeplar.ufmg.br P.H.D. student
UFOP assistant professor
--
View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255p4648266.html
Sent from the R help mailing list archive at Nabble.com.