Matching names with non-English characters

On 13/05/2013 12:05 PM, Spencer Graves wrote:
You shouldn't have both "RaÃºl" and "Raúl" in the same file.  They are 
different encodings for the same characters.  (The first looks like 
UTF-8, the second is your native encoding, presumably the Windows 
Latin-1 variant, CP-1252.)  So your first problem is to identify the 
encodings of your input files, and read them all in to a common 
encoding.  Converting them to UTF-8 in R makes the most sense, because 
it includes the characters from all other encodings you're ever likely 
to see.
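
As a rough sketch of reading everything in to a common encoding: the two 
temporary files below are made up for illustration and stand in for your 
real inputs, and I am assuming one is UTF-8 and the other CP-1252.  You 
would need to check what your own files actually are.

```r
# Two hypothetical input files holding the same name in different encodings.
utf8_file   <- tempfile(fileext = ".txt")
cp1252_file <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x52, 0x61, 0xc3, 0xba, 0x6c, 0x0a)), utf8_file)   # UTF-8 bytes
writeBin(as.raw(c(0x52, 0x61, 0xfa, 0x6c, 0x0a)), cp1252_file)       # CP-1252 bytes

# Declare the UTF-8 file as UTF-8; readLines() only marks the strings,
# it does not convert them.
a <- readLines(utf8_file, encoding = "UTF-8")

# Read the CP-1252 file as-is, then convert explicitly to UTF-8.
b <- iconv(readLines(cp1252_file), from = "CP1252", to = "UTF-8")

# Once both are in UTF-8 the two spellings compare equal.
same <- identical(a, b)
```

If the encoding of a file is declared wrongly, the conversion will 
silently produce the wrong characters, so it is worth inspecting a few 
known names after reading.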

Having both "Raúl" and "Raul" in the same file is a different issue.  
The second one is an error or a variant spelling.  In this case, you can 
use

iconv("Raúl", to="ASCII//TRANSLIT")

on most platforms to find an ASCII approximation.  (This works on my 
Windows system; your mileage may vary.)    As Jeff said, this is an 
impossible problem in general, so you may well need some manual fixups 
at the end.
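
One possible way to use the transliteration for matching is sketched 
below.  The helper ascii_key() and the sample names are hypothetical, not 
something from your data, and what ASCII//TRANSLIT returns depends on the 
platform's iconv, so treat the keys as an aid to matching rather than a 
guaranteed result.

```r
# Build the names from code points so the example does not depend on the
# encoding of this message.  "\u00fa" is u with an acute accent.
names_in <- c("Ra\u00fal", "Raul", "Jos\u00e9")

# Hypothetical helper: an ASCII approximation used only as a matching key.
ascii_key <- function(x) {
  tolower(iconv(enc2utf8(x), from = "UTF-8", to = "ASCII//TRANSLIT"))
}

keys <- ascii_key(names_in)

# Keys that are NA or contain "?" could not be transliterated cleanly
# and are candidates for manual fixups.
needs_review <- is.na(keys) | grepl("?", keys, fixed = TRUE)
```

Names that share a key can then be treated as candidate matches, with 
the needs_review entries checked by hand.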

Duncan Murdoch