Comparing Latin characters with and without accents?

On 15/12/2014 05:33, Spencer Graves wrote:
I think the devil is in the detail here: what is Latin?  Latin-1 has 
characters for which this is unclear, let alone Latin-2 or Latin-7.

What I would do is

1) convert to UTF-8 with iconv()
2) convert to Unicode code points with utf8ToInt().
3) remap the Unicode characters with an integer lookup table tab[].
4) convert back to UTF-8, then to the desired encoding (or mark as UTF-8 
with Encoding()).
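
A minimal sketch of the four steps above, under stated assumptions: the 
names tab and strip_accents() are hypothetical, the table covers only a 
handful of accented vowels for illustration, and the input is assumed to 
be Latin-1 unless 'from' says otherwise.

```r
# Illustrative lookup table: tab[cp] is the replacement for code point cp.
# Start from the identity mapping and override a few entries.
tab <- seq_len(511L)
tab[c(224L, 225L, 226L, 228L)] <- utf8ToInt("a")  # a with grave/acute/circumflex/diaeresis
tab[c(232L, 233L, 234L, 235L)] <- utf8ToInt("e")  # e with the same four accents
tab[c(249L, 250L, 251L, 252L)] <- utf8ToInt("u")  # u with the same four accents

strip_accents <- function(x, from = "latin1") {
  x <- iconv(x, from = from, to = "UTF-8")        # step 1: convert to UTF-8
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)                            # step 2: code points
    hit <- cp >= 1L & cp <= length(tab)           # leave anything outside the table alone
    cp[hit] <- tab[cp[hit]]                       # step 3: remap via the table
    intToUtf8(cp)                                 # step 4: back to UTF-8
  }, character(1), USE.NAMES = FALSE)
}

strip_accents("caf\xe9")   # Latin-1 bytes for "cafe" with an acute accent
```

intToUtf8() already returns a string marked as UTF-8 where that matters, 
so a final iconv(..., from = "UTF-8", to = <target>) is only needed if 
some other encoding is wanted.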

As I suspect all the characters you do want to convert are in the first 
few blocks of Unicode, the lookup table can be small, maybe fewer than 
512 elements.  So for example ú is Unicode 250 and the value of tab[250] 
should be 117.  iconv() with transliteration might give you a good start 
for preparing that table.
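
One way the table might be seeded with iconv() transliteration is 
sketched below. This is an assumption-laden starting point, not a 
finished table: the result of "ASCII//TRANSLIT" depends on the platform's 
iconv implementation, so the output should be inspected and corrected by 
hand. The names cps, ascii, ok and tab are illustrative.

```r
# Candidate code points: everything above ASCII up to 511.
cps   <- 160:511
chars <- intToUtf8(cps, multiple = TRUE)

# Ask iconv() for an ASCII transliteration of each character.
# Results are platform-dependent and may be NA, "?" or multi-character.
ascii <- iconv(chars, from = "UTF-8", to = "ASCII//TRANSLIT")

# Keep only results that are a single ASCII letter.
ok <- !is.na(ascii) & grepl("^[A-Za-z]$", ascii)

# Identity mapping by default, overridden where a clean result was found.
tab <- seq_len(511L)
tab[cps[ok]] <- vapply(ascii[ok], utf8ToInt, integer(1), USE.NAMES = FALSE)
```

Characters whose transliteration is not a single letter are left mapped 
to themselves here and need a manual decision.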

(Note that transliteration to two chars is often more acceptable/widely 
applicable. E.g. å to aa and ß to ss.)
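
An integer table cannot express a one-to-two mapping, so a 
transliteration of that kind needs a table of strings instead. A rough 
sketch, assuming a hypothetical named vector multi keyed by code point 
and a hypothetical helper translit_multi() that works on a single UTF-8 
string:

```r
# Replacement strings keyed by code point (as character):
# 229 is a with ring above, 223 is sharp s.
multi <- c("229" = "aa", "223" = "ss")

translit_multi <- function(s) {
  cp     <- utf8ToInt(s)                      # code points of the input
  pieces <- intToUtf8(cp, multiple = TRUE)    # one string per character
  key    <- as.character(cp)
  hit    <- key %in% names(multi)
  pieces[hit] <- unname(multi[key[hit]])      # swap in the longer replacement
  paste(pieces, collapse = "")
}

translit_multi("Stra\u00dfe")
```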