Matching names with non-English characters
Build a lookup table for your data.
I think it is a fool's errand to expect that you can automatically "normalize" arbitrary Unicode characters to an ASCII form that everyone will agree on.
BTW: To avoid propagating open joins, your data should probably have some kind of ID for the term those Representatives are serving.
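A minimal sketch of the lookup-table idea in R, not a definitive implementation. The data frames `lookup`, `file_a`, and `file_b`, their column names, and the `row_a` column are all hypothetical; the name spellings are taken from the question quoted below, and the "??" stands for whatever mis-encoded bytes actually appear in the file.

```r
# Hypothetical lookup table: one row per spelling variant seen in any file,
# each mapped by hand to a single agreed-upon canonical spelling.
lookup <- data.frame(
  variant   = c("Ra??l Grijalva", "Raul Grijalva",
                "Jos?? Serrano",  "Jose Serrano"),
  canonical = c("Raul Grijalva",  "Raul Grijalva",
                "Jose Serrano",   "Jose Serrano"),
  stringsAsFactors = FALSE
)

# Two made-up files that spell the same people differently.
file_a <- data.frame(Recipient = c("Ra??l Grijalva", "Jos?? Serrano"),
                     row_a     = 1:2,
                     stringsAsFactors = FALSE)
file_b <- data.frame(name     = c("Raul Grijalva", "Jose Serrano"),
                     district = c("AZ-3", "NY-15"),
                     stringsAsFactors = FALSE)

# Translate each file's spelling to the canonical one.
# match() gives NA for any spelling missing from the lookup table.
file_a$canonical <- lookup$canonical[match(file_a$Recipient, lookup$variant)]
file_b$canonical <- lookup$canonical[match(file_b$name, lookup$variant)]

# Spellings still missing from the table; add these by hand and rerun.
unmatched <- c(file_a$Recipient[is.na(file_a$canonical)],
               file_b$name[is.na(file_b$canonical)])

# Join on the canonical spelling rather than on the raw names.
merged <- merge(file_a, file_b, by = "canonical")
```

In real data the join key would presumably also include the term ID mentioned above, so that two people who held the same seat are never confused.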
---------------------------------------------------------------------------
Jeff Newmiller
DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Spencer Graves <spencer.graves at structuremonitoring.com> wrote:
Hello:
How can one match names containing non-English characters that
appear differently in different but related data files? For example, I
have data on Raúl Grijalva, who represents the third district of
Arizona in the US House of Representatives. This first name appears as
"Ra??l" in data read from one file and "Raul" from another.
The ideal would convert both "Ra??l" and "Raúl" to "Raul". A
reasonable alternative would identify the non-English characters and
match on everything else ("^Ra" and "l$" in this case). The files all
contain state and district, so "AZ-3" could be part of the solution.
However, the file also contains data on Grijalva's predecessor in that
office, Ben Quayle, so "AZ-3" is not enough.
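[Editorial sketch of the "match on everything else" alternative described above, in R. The helper names `name_to_pattern` and `match_name` are hypothetical, and the candidate list is assumed to have been narrowed to one state and district already. Any run of characters other than ASCII letters and spaces is treated as a wildcard, so this is only a heuristic and can fail on names containing regex metacharacters or punctuation.]

```r
# Turn a possibly garbled name into an anchored regular expression:
# every run of characters that are not ASCII letters or spaces becomes ".+".
name_to_pattern <- function(x) {
  core <- gsub("[^A-Za-z ]+", ".+", x, perl = TRUE, useBytes = TRUE)
  paste0("^", core, "$")
}

# Return the single candidate that fits the pattern, or NA when there is
# no match or more than one (ambiguous cases are left for manual review).
match_name <- function(garbled, candidates) {
  hits <- grep(name_to_pattern(garbled), candidates,
               perl = TRUE, useBytes = TRUE)
  if (length(hits) == 1L) candidates[hits] else NA_character_
}

# Candidates for AZ-3 as they might appear in the cleaner file.
az3 <- c("Raul Grijalva", "Ben Quayle")

match_name("Ra??l Grijalva", az3)   # pattern used: ^Ra.+l Grijalva$
```

Restricting the candidates by state and district first keeps the wildcard from matching an unrelated person elsewhere, and the surname then separates Grijalva from Quayle within AZ-3.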
Thanks,
Spencer
p.s. My current data contains other similar cases, e.g.:
Recipient            District
Ra??l Grijalva       AZ House 3
Tony C??rdenas       CA House 29
Linda S??nchez       CA House 38
Ra??l Labrador       ID House 1
Andr?? Carson        IN House 7
Bob Men??ndez        NJ Senate
Ben Ray Luj??n       NM House 3
Jos?? Serrano        NY House 15
Nydia Vel??zquez     NY House 7
Rub??n Hinojosa      TX House 15
These names all appear differently in another file I have. I've
written an ugly function that can identify "nonstandard characters".
I'm confident I can solve this problem. However, I'm adding things
like this to the Ecdat package, and it would be more useful for others
if I made better use of other capabilities in R.
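[Editorial sketch: one way to flag strings containing "nonstandard characters" with base R, offered as an assumption about what such a function might do rather than as the function mentioned above. The name `has_non_ascii` is hypothetical. It relies on `iconv()` returning NA for elements it cannot convert to ASCII; `tools::showNonASCII()` is a related base-R utility that prints the offending elements.]

```r
# TRUE for each element that contains at least one non-ASCII byte.
# Every byte is a valid latin1 character, so a failed conversion to ASCII
# (NA result) means a byte above 0x7F was present.
has_non_ascii <- function(x) {
  ascii <- iconv(x, from = "latin1", to = "ASCII")
  is.na(ascii) | ascii != x
}

# "\u00fa" is u with an acute accent, written as an escape so that this
# snippet itself stays pure ASCII.
flags <- has_non_ascii(c("Raul", "Ra\u00fal"))
```

Names flagged this way are the ones that would need an entry in a lookup table or a wildcard match.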