Determine if text is in English?

情深已故 2020-12-15 23:34

I am using both NLTK and scikit-learn to do some text processing. However, some of the documents in my list are not in English. For example, the follow

6 answers
  •  眼角桃花
    2020-12-16 00:20

    You might be interested in my paper "The WiLI benchmark dataset for written language identification", in which I also benchmarked a couple of tools.

    TL;DR:

    • CLD-2 is pretty good and extremely fast
    • lang-detect is a tiny bit better, but much slower
    • langid is good, but CLD-2 and lang-detect are much better
    • NLTK's Textcat is neither efficient nor effective.

    You can install lidtk and classify languages:

    $ lidtk cld2 predict --text "this is some text written in English"
    eng
    $ lidtk cld2 predict --text "this is some more text written in English"
    eng
    $ lidtk cld2 predict --text "Ce n'est pas en anglais"
    fra
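
To apply this kind of prediction to the list of documents from the question, a minimal Python sketch could look like the following. It is only an illustration: `keep_english`, `ENGLISH_CODES`, and the `detect` parameter are hypothetical names, and `detect` stands for any callable that maps a text to a language code (the `eng`/`fra` style codes printed above, or two-letter codes such as `en`).

```python
from typing import Callable, Iterable, List

# Codes treated as "English" in this sketch (an assumption):
# three-letter codes such as "eng" (as printed by the commands above)
# and two-letter codes such as "en" (returned by some other tools).
ENGLISH_CODES = frozenset({"en", "eng"})


def keep_english(docs: Iterable[str], detect: Callable[[str], str]) -> List[str]:
    """Return only the documents that `detect` labels as English.

    `detect` is a hypothetical plug-in point: any function that takes a
    text and returns a language code.
    """
    english = []
    for doc in docs:
        try:
            code = detect(doc)
        except Exception:
            # Some detectors raise on empty or featureless input;
            # this sketch simply skips such documents.
            continue
        if code.lower() in ENGLISH_CODES:
            english.append(doc)
    return english
```

If the langdetect package is installed, passing its `detect` function (`from langdetect import detect`) as the second argument should work, since it returns codes such as `en`. Swapping in a faster tool then only requires a different callable.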
    
