Determine if text is in English?

情深已故 2020-12-15 23:34

I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the follow

6 Answers
  • 2020-12-16 00:03

    There is a library called langdetect. It is a port of Google's language-detection library, available here:

    https://pypi.python.org/pypi/langdetect

    It supports 55 languages out of the box.
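
    For example, a minimal usage sketch (my addition, not from the original answer), assuming langdetect has been installed with pip:

    from langdetect import detect, DetectorFactory
    
    # Seed the detector so results are deterministic for short or ambiguous inputs.
    DetectorFactory.seed = 0
    
    print(detect("This is clearly an English sentence."))   # 'en'
    print(detect("Ceci n'est pas une phrase en anglais."))  # 'fr'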

  • 2020-12-16 00:06

    If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:

    http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/

    If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
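
    As a rough sketch of the idea (my own illustration, not the linked recipe): build character-trigram frequency profiles for a known-English reference text and for each sample, compare them with cosine similarity, and threshold the score.

    from collections import Counter
    from math import sqrt
    
    def trigram_profile(text):
        text = ' ' + text.lower() + ' '
        return Counter(text[i:i + 3] for i in range(len(text) - 2))
    
    def cosine_similarity(p, q):
        dot = sum(p[t] * q[t] for t in set(p) & set(q))
        norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0
    
    # In practice the reference profile would be built from a large English corpus.
    english_reference = trigram_profile("the quick brown fox jumps over the lazy dog "
                                        "and this is ordinary English text")
    
    def looks_english(sample, threshold=0.3):  # threshold chosen empirically
        return cosine_similarity(trigram_profile(sample), english_reference) >= threshold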

  • 2020-12-16 00:11

    This is what I've used some time ago. It works for texts of at least 3 words with no more than 4 unrecognized words (the thresholds below). Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.

    from enchant.checker import SpellChecker
    
    max_error_count = 4
    min_text_length = 3
    
    def is_in_english(quote):
        d = SpellChecker("en_US")
        d.set_text(quote)
        # Words the en_US dictionary does not recognize.
        errors = [err.word for err in d]
        return len(errors) <= max_error_count and len(quote.split()) >= min_text_length
    
    print(is_in_english('“中文”'))
    print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
    
    > False
    > True
    
  • 2020-12-16 00:12

    Pretrained fastText Model Worked Best for My Similar Needs

    I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help in part 7 of Rabash's answer HERE.

    After experimenting to find what worked best for my needs, which were verifying that 60,000+ text files were in English, I found that fastText was an excellent tool.

    With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.

    import fasttext
    
    class English_Check:
        def __init__(self):
            # No need to train a model to detect languages. A very good
            #    pretrained model already exists. Let's use it.
            pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
            self.model = fasttext.load_model(pretrained_model_path)
    
        def predict_languages(self, text_file):
            this_D = {}
            with open(text_file, 'r') as f:
                fla = f.readlines()  # fla = file line array.
                # fasttext doesn't like newline characters, but it can take
                #    an array of lines from a file. The two list comprehensions
                #    below just clean up the lines in fla.
                fla = [line.rstrip('\n').strip(' ') for line in fla]
                fla = [line for line in fla if len(line) > 0]
    
                for line in fla:  # Language-predict each line of the file.
                    language_tuple = self.model.predict(line)
                    # The next two lines simply get at the top language prediction
                    #    string AND the confidence value for that prediction.
                    prediction = language_tuple[0][0].replace('__label__', '')
                    value = language_tuple[1][0]
    
                    # Each top language prediction for the lines in the file
                    #    becomes a unique key for the this_D dictionary.
                    #    Every time that language is found, add the confidence
                    #    score to the running tally for that language.
                    if prediction not in this_D:
                        this_D[prediction] = 0
                    this_D[prediction] += value
    
            self.this_D = this_D
    
        def determine_if_file_is_english(self, text_file):
            self.predict_languages(text_file)
    
            # Find the max tallied confidence and the sum of all confidences.
            max_value = max(self.this_D.values())
            sum_of_values = sum(self.this_D.values())
            # Calculate the confidence of the top language relative to all
            #    confidence scores, then find the key with the max confidence.
            confidence = max_value / sum_of_values
            max_key = [key for key in self.this_D.keys()
                       if self.this_D[key] == max_value][0]
    
            # Only want to know whether this is English or not.
            return max_key == 'en'

    Below is the application / instantiation and use of the above class for my needs.

    file_list = ...  # some tool to get my specific list of files to check for English
    
    en_checker = English_Check()
    for file in file_list:
        check = en_checker.determine_if_file_is_english(file)
        if not check:
            print(file)
    
  • 2020-12-16 00:14

    Use the enchant library

    import enchant
    
    dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
    
    dictionary.check("Hello")  # returns True
    dictionary.check("Helo")   # returns False
    

    This example is taken directly from their website.
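
    To turn this per-word check into a document-level yes/no test (my own extension, not from the enchant docs), you could score the fraction of tokens the dictionary recognizes and pick a threshold:

    import enchant
    
    dictionary = enchant.Dict("en_US")
    
    def mostly_english(text, threshold=0.7):  # threshold is an arbitrary choice
        words = [w for w in text.split() if w.isalpha()]
        if not words:
            return False
        recognized = sum(dictionary.check(w) for w in words)
        return recognized / len(words) >= threshold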

  • 2020-12-16 00:20

    You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.

    TL;DR:

    • CLD-2 is pretty good and extremely fast (see the sketch after this list)
    • lang-detect is a tiny bit better, but much slower
    • langid is good, but CLD-2 and lang-detect are much better
    • NLTK's TextCat is neither efficient nor effective.
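
    Since CLD-2 comes out on top, here is a minimal sketch using the pycld2 binding (my own illustration; assumes pycld2 is installed):

    import pycld2 as cld2
    
    is_reliable, bytes_found, details = cld2.detect("This is some text written in English")
    # details is a tuple of (languageName, languageCode, percent, score) tuples,
    # ordered from most to least likely.
    print(details[0][1] == 'en')  # True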

    You can install lidtk and classify languages:

    $ lidtk cld2 predict --text "this is some text written in English"
    eng
    $ lidtk cld2 predict --text "this is some more text written in English"
    eng
    $ lidtk cld2 predict --text "Ce n'est pas en anglais"                  
    fra
    