
text vector clustering

Simply tabulating the values and isolating those that occur only once
might have worked if the count discrepancy weren't so high. You appear
to have more corruption than "typos" alone would explain.
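
For what it's worth, the tabulate-and-isolate idea can be sketched as follows. This is only an illustration in Python, not anything from the original data: the `names` vector is made up, and the approach is only useful when singletons are rare.

```python
from collections import Counter

# Hypothetical name vector; the real data is not shown in this thread.
names = ["Smith", "Smith", "Smyth", "Jones", "Jones", "Jonse", "Brown"]

# Tabulate: how many times does each distinct string occur?
counts = Counter(names)

# Strings that occur exactly once are candidate typos. They may also be
# genuinely rare names, so they still need to be reviewed by hand.
singletons = sorted(name for name, n in counts.items() if n == 1)

print(singletons)  # ['Brown', 'Jonse', 'Smyth']
```

Note that "Brown" is flagged even though it is presumably correct, which is why this stops being practical once a large share of the entries are singletons.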

Have you looked at the packages referenced at:

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

The Soundex algorithm is an old programming chestnut that I have seen
implemented in R, but I understand there are improved versions. How
well they perform on people's names may depend strongly on the
cultural origins of your population.
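
To make the Soundex idea concrete, here is a minimal sketch of the classic American Soundex rules, written in Python for illustration only. It is not the code of any particular R package, and the function name `soundex` is just a label chosen here. It assumes plain ASCII letters and silently drops anything else, which is one place where the cultural-origin caveat shows up.

```python
# Letter-to-digit table for classic American Soundex.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex code, or "" if there are no ASCII letters."""
    # Assumption: non-ASCII letters (accents etc.) are simply discarded.
    letters = [c for c in name.lower() if c.isalpha() and c.isascii()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")  # vowels (and y) have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (result + "000")[:4]  # pad with zeros, truncate to 4 characters


print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names that share a code can then be grouped as candidate variants of one another, though the grouping is coarse and is tuned to English pronunciation.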