alphanumerical string address matching

Dieter--

You may be able to simply paste your separate address components into
a single character vector for each data frame, change to consistent
case with tolower(), and then use agrep() for approximate matching
based on generalized Levenshtein edit distance (the minimum number of
insertions, deletions, and substitutions needed to turn one string
into the other).  You may or may not want to preprocess (replacing 2
or more consecutive spaces with a single space, etc.).
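
A minimal sketch of that approach follows. The data frames, their
column names (street, city), the helper normalize(), and the
max.distance value of 0.2 are all assumptions made up for
illustration; you would need to tune the threshold on your own data.

```r
# Hypothetical example data; the column names are assumptions.
a <- data.frame(street = c("12  Main Street", "7 Oak Ave"),
                city   = c("Springfield", "Shelbyville"),
                stringsAsFactors = FALSE)
b <- data.frame(street = c("12 main stret", "7 Oak Avenue"),
                city   = c("Springfield", "Shelbyville"),
                stringsAsFactors = FALSE)

# Paste the components into one string per row, lower-case it, and
# collapse runs of whitespace into a single space.
normalize <- function(df) {
  s <- tolower(paste(df$street, df$city))
  s <- gsub("[[:space:]]+", " ", s)
  trimws(s)
}
key_a <- normalize(a)
key_b <- normalize(b)

# For each address in a, the indices of approximately matching
# addresses in b. max.distance is a fraction of the pattern length.
matches <- lapply(key_a, function(p) agrep(p, key_b, max.distance = 0.2))
names(matches) <- key_a
print(matches)
```

Note that agrep() looks for the pattern as an approximate substring
of each target, so a short pattern can match inside a much longer
address; checking the candidates it returns is probably worthwhile.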

If that is not sufficient, look in the CRAN task view on Natural
Language Processing
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html for
more tools for approximate matching.

Based on my experience with US property assessors' geocoding, I do not
recommend matching approximately component by component and then
combining the per-component distances numerically: one of the most
common variants is the same information put at the end of one
component (line) versus the beginning of the next (line).
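
To illustrate that point, here is a small sketch with two made-up
records that carry the same information, differing only in which
line holds the apartment number. It uses adist() from the utils
package to compute edit distances; that function and the example
strings are my choices for illustration, not something from your
data.

```r
# Two hypothetical records: "apt 4" sits at the end of line 1 in the
# first record and at the start of line 2 in the second.
rec1 <- c(line1 = "123 main st apt 4", line2 = "springfield")
rec2 <- c(line1 = "123 main st",       line2 = "apt 4 springfield")

# Component-by-component edit distances, then summed.
d_line1 <- adist(rec1[["line1"]], rec2[["line1"]])[1, 1]
d_line2 <- adist(rec1[["line2"]], rec2[["line2"]])[1, 1]
d_components <- d_line1 + d_line2

# Edit distance after pasting the components into a single string.
whole1 <- paste(rec1, collapse = " ")
whole2 <- paste(rec2, collapse = " ")
d_whole <- adist(whole1, whole2)[1, 1]

print(c(components = d_components, whole = d_whole))
```

The summed component distance is large even though the pasted
strings are identical, which is why I would match on the whole
string.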

Good luck.

Tom
On Tue, Jul 24, 2012 at 6:51 AM, Dieter Mayr <dieter.mayr at boku.ac.at> wrote: