Help retrieving only Portuguese words from a file
On Tue, May 28, 2013 at 5:02 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,

And some words exist in Portuguese, Spanish and English, the three languages of the problem. For instance, "animal". I don't think this problem can be solved, but a dictionary search would tell if it is a Portuguese word, which it is.
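To illustrate the dictionary-search idea, here is a minimal Python sketch. The word lists and the names `DICTIONARIES`, `languages_of` and `is_portuguese` are made up for illustration; a real check would load complete dictionaries for each language. It shows why a lookup can confirm that a word is Portuguese without telling you that it is only Portuguese.

```python
# Tiny hypothetical word lists, invented for illustration only.
# A real check would load full dictionaries for each language.
DICTIONARIES = {
    "pt": {"animal", "cachorro", "casa", "menina"},
    "es": {"animal", "perro", "casa", "niña"},
    "en": {"animal", "dog", "house", "girl"},
}


def languages_of(word, dictionaries=DICTIONARIES):
    """Return the sorted codes of every language whose list contains the word."""
    w = word.lower()
    return sorted(lang for lang, words in dictionaries.items() if w in words)


def is_portuguese(word, dictionaries=DICTIONARIES):
    """True if the word is in the Portuguese list, whatever else it also matches."""
    return word.lower() in dictionaries["pt"]


if __name__ == "__main__":
    for word in ("animal", "cachorro", "dog"):
        print(word, languages_of(word), is_portuguese(word))
```

With these toy lists, "animal" is found in all three languages, so the lookup answers "is it a Portuguese word?" but not "which language was this word written in?".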
Is there any structure to the text? If it has complete paragraphs in one of the three languages, then you can probably make a better guess at the language of each paragraph from the presence of key words.

I wonder if some of the code for detecting spam can help you here... Train it on some known Portuguese, Spanish, and English text...

If it's just a stream of words from the three languages in a random order, then it is difficult or impossible.

Barry
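The "train it on known text" idea could be sketched as a small word-level naive Bayes classifier, the same kind of technique many spam filters use. This is only a sketch under stated assumptions, written in Python: the training sentences are invented toy data, the names `TRAINING`, `train` and `classify` are hypothetical, and all languages are given equal prior probability. Real use would need far more training text per language.

```python
import math
from collections import Counter

# Toy training text, invented for illustration; far too small for real use.
TRAINING = {
    "pt": "o cachorro não está em casa e a menina também não está aqui",
    "es": "el perro no está en casa y la niña tampoco está aquí",
    "en": "the dog is not at home and the girl is not here either",
}


def train(samples):
    """Count word frequencies per language and collect the shared vocabulary."""
    counts = {lang: Counter(text.lower().split()) for lang, text in samples.items()}
    vocab = set().union(*counts.values())
    return counts, vocab


def classify(text, counts, vocab):
    """Return the language with the highest smoothed log-likelihood for the text."""
    words = text.lower().split()
    scores = {}
    for lang, c in counts.items():
        total = sum(c.values())
        # Laplace (add-one) smoothing so unseen words do not zero out a language.
        scores[lang] = sum(
            math.log((c[w] + 1) / (total + len(vocab))) for w in words
        )
    return max(scores, key=scores.get)


if __name__ == "__main__":
    counts, vocab = train(TRAINING)
    for paragraph in ("a menina não está em casa", "the girl is at home"):
        print(paragraph, "->", classify(paragraph, counts, vocab))
```

This works on whole paragraphs because several key words add up as evidence. A lone word such as "animal", which is absent from the toy training data, gets only the smoothing probability in every language, so the answer for it carries no real information.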