Determine if text is in English?

情深已故 2020-12-15 23:34

I am using both NLTK and scikit-learn to do some text processing. However, my list of documents includes some that are not in English. For example, the follow

6 Answers
  •  误落风尘
    2020-12-16 00:06

    If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can search for an existing implementation, or code your own. Here's a sample implementation I came across, which uses cosine similarity to measure how close the sample text is to the reference data:

    http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/

    If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
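
    For illustration, here is a minimal sketch of the trigram-plus-cosine idea described above. It is not the code from the linked recipe. The names trigram_profile, cosine_similarity and looks_english, the tiny ENGLISH_REFERENCE sample, and the 0.2 threshold are all assumptions made up for this example; a usable profile would be built from a much larger English corpus, and the threshold should come from testing on your own documents.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping letter trigrams in lowercased text."""
    # Keep letters only and collapse everything else to single spaces,
    # so word boundaries appear as one space inside a trigram.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    cleaned = " ".join(cleaned.split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine_similarity(a, b):
    """Cosine of the angle between two trigram count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # empty or too-short text has no profile to compare
    return dot / (norm_a * norm_b)


# Hypothetical, deliberately tiny reference sample (assumption for the demo).
ENGLISH_REFERENCE = (
    "the quick brown fox jumps over the lazy dog and then the dog chases "
    "the fox through the field while the people in the house are watching "
    "them and thinking that this is what they have seen there before"
)
ENGLISH_PROFILE = trigram_profile(ENGLISH_REFERENCE)


def looks_english(text, threshold=0.2):
    """Yes/no test: is the English cosine score at or above the threshold?

    The default threshold is a placeholder, not a tuned value.
    """
    score = cosine_similarity(trigram_profile(text), ENGLISH_PROFILE)
    return score >= threshold


if __name__ == "__main__":
    for sample in ["the people are watching the dog",
                   "el perro persigue al zorro por el campo"]:
        score = cosine_similarity(trigram_profile(sample), ENGLISH_PROFILE)
        print(f"{score:.3f}  {looks_english(sample)}  {sample}")
```

    Printing the raw scores for a batch of your own single-sentence documents, as the last few lines do, is one way to see the normal range before settling on a threshold.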
