Determine if text is in English?

情深已故 2020-12-15 23:34

I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the follow

6 Answers
  • 2020-12-16 00:03

    There is a library called langdetect. It is a port of Google's language-detection library, available here:

    https://pypi.python.org/pypi/langdetect

    It supports 55 languages out of the box.
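
    For example, a minimal usage sketch (my addition, not from the original answer), assuming langdetect has been installed with pip:

    from langdetect import detect, DetectorFactory
    
    # Seed the detector so results are deterministic for short or ambiguous inputs.
    DetectorFactory.seed = 0
    
    print(detect("This is clearly an English sentence."))   # 'en'
    print(detect("Ceci n'est pas une phrase en anglais."))  # 'fr'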

  • 2020-12-16 00:06

    If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:

    http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/

    If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
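
    As a rough sketch of the idea (my own illustration, not the linked recipe): build character-trigram frequency profiles for a known-English reference text and for each sample, compare them with cosine similarity, and threshold the score.

    from collections import Counter
    from math import sqrt
    
    def trigram_profile(text):
        text = ' ' + text.lower() + ' '
        return Counter(text[i:i + 3] for i in range(len(text) - 2))
    
    def cosine_similarity(p, q):
        dot = sum(p[t] * q[t] for t in set(p) & set(q))
        norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0
    
    # In practice the reference profile would be built from a large English corpus.
    english_reference = trigram_profile("the quick brown fox jumps over the lazy dog "
                                        "and this is ordinary English text")
    
    def looks_english(sample, threshold=0.3):  # threshold chosen empirically
        return cosine_similarity(trigram_profile(sample), english_reference) >= threshold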

  • 2020-12-16 00:11

    This is what I've used some time ago. It works for texts of at least 3 words with no more than 4 unrecognized words (the thresholds below). Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.

    from enchant.checker import SpellChecker
    
    max_error_count = 4
    min_text_length = 3
    
    def is_in_english(quote):
        d = SpellChecker("en_US")
        d.set_text(quote)
        # Words the en_US dictionary does not recognize.
        errors = [err.word for err in d]
        return len(errors) <= max_error_count and len(quote.split()) >= min_text_length
    
    print(is_in_english('“中文”'))
    print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
    
    > False
    > True
    
  • 2020-12-16 00:12

    Pretrained fastText Model Worked Best for My Similar Needs

    I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help in part 7 of Rabash's answer HERE.

    After experimenting to find what worked best for my needs, which were verifying that 60,000+ text files were in English, I found that fastText was an excellent tool.

    With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.

    import fasttext
    
    class English_Check:
        def __init__(self):
            # No need to train a model to detect languages. A very good
            #    pretrained model already exists. Let's use it.
            pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
            self.model = fasttext.load_model(pretrained_model_path)
    
        def predict_languages(self, text_file):
            this_D = {}
            with open(text_file, 'r') as f:
                fla = f.readlines()  # fla = file line array.
                # fasttext doesn't like newline characters, but it can take
                #    an array of lines from a file. The two list comprehensions
                #    below just clean up the lines in fla.
                fla = [line.rstrip('\n').strip(' ') for line in fla]
                fla = [line for line in fla if len(line) > 0]
    
                for line in fla:  # Language-predict each line of the file.
                    language_tuple = self.model.predict(line)
                    # The next two lines simply get at the top language prediction
                    #    string AND the confidence value for that prediction.
                    prediction = language_tuple[0][0].replace('__label__', '')
                    value = language_tuple[1][0]
    
                    # Each top language prediction for the lines in the file
                    #    becomes a unique key for the this_D dictionary.
                    #    Every time that language is found, add the confidence
                    #    score to the running tally for that language.
                    if prediction not in this_D:
                        this_D[prediction] = 0
                    this_D[prediction] += value
    
            self.this_D = this_D
    
        def determine_if_file_is_english(self, text_file):
            self.predict_languages(text_file)
    
            # Find the max tallied confidence and the sum of all confidences.
            max_value = max(self.this_D.values())
            sum_of_values = sum(self.this_D.values())
            # Calculate the confidence of the top language relative to all
            #    confidence scores, then find the key with the max confidence.
            confidence = max_value / sum_of_values
            max_key = [key for key in self.this_D.keys()
                       if self.this_D[key] == max_value][0]
    
            # Only want to know whether this is English or not.
            return max_key == 'en'

    Below is the application / instantiation and use of the above class for my needs.

    file_list = ...  # some tool to get my specific list of files to check for English
    
    en_checker = English_Check()
    for file in file_list:
        check = en_checker.determine_if_file_is_english(file)
        if not check:
            print(file)
    
  • 2020-12-16 00:14

    Use the enchant library

    import enchant
    
    dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
    
    dictionary.check("Hello")  # returns True
    dictionary.check("Helo")   # returns False
    

    This example is taken directly from their website.
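
    To turn this per-word check into a document-level yes/no test (my own extension, not from the enchant docs), you could score the fraction of tokens the dictionary recognizes and pick a threshold:

    import enchant
    
    dictionary = enchant.Dict("en_US")
    
    def mostly_english(text, threshold=0.7):  # threshold is an arbitrary choice
        words = [w for w in text.split() if w.isalpha()]
        if not words:
            return False
        recognized = sum(dictionary.check(w) for w in words)
        return recognized / len(words) >= threshold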

  • 2020-12-16 00:20

    You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.

    TL;DR:

    • CLD-2 is pretty good and extremely fast (see the sketch after this list)
    • lang-detect is a tiny bit better, but much slower
    • langid is good, but CLD-2 and lang-detect are much better
    • NLTK's TextCat is neither efficient nor effective.
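
    Since CLD-2 comes out on top, here is a minimal sketch using the pycld2 binding (my own illustration; assumes pycld2 is installed):

    import pycld2 as cld2
    
    is_reliable, bytes_found, details = cld2.detect("This is some text written in English")
    # details is a tuple of (languageName, languageCode, percent, score) tuples,
    # ordered from most to least likely.
    print(details[0][1] == 'en')  # True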

    You can install lidtk and classify languages:

    $ lidtk cld2 predict --text "this is some text written in English"
    eng
    $ lidtk cld2 predict --text "this is some more text written in English"
    eng
    $ lidtk cld2 predict --text "Ce n'est pas en anglais"                  
    fra
    