How to detect language

你的背包 2020-12-28 21:18

Are there any good open-source engines for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and doesn't query

7 answers
  •  盖世英雄少女心
    2020-12-28 21:44

    Try CLD2:

    Installation

    export CPPFLAGS="-std=c++98"  # https://github.com/CLD2Owners/cld2/issues/47
    pip install cld2-cffi --user
    

    Run

    import cld2
    
    res = cld2.detect("This is a sample text.")
    print(res)
    res = cld2.detect("Dies ist ein Beispieltext.")
    print(res)
    res = cld2.detect("Je ne peut pas parler cette language.")
    print(res)
    res = cld2.detect(" هذه هي بعض النصوص العربية")
    print(res)
    res = cld2.detect("这是一些阿拉伯文字")  # Chinese text ("This is some Arabic text")
    print(res)
    res = cld2.detect("これは、いくつかのアラビア語のテキストです")
    print(res)
    print("Supports {} languages.".format(len(cld2.LANGUAGES)))
    

    Gives

    Detections(is_reliable=True, bytes_found=23, details=(Detection(language_name=u'ENGLISH', language_code=u'en', percent=95, score=1675.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
    Detections(is_reliable=True, bytes_found=27, details=(Detection(language_name=u'GERMAN', language_code=u'de', percent=96, score=1496.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
    Detections(is_reliable=True, bytes_found=38, details=(Detection(language_name=u'FRENCH', language_code=u'fr', percent=97, score=1134.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
    Detections(is_reliable=True, bytes_found=48, details=(Detection(language_name=u'ARABIC', language_code=u'ar', percent=97, score=1263.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
    Detections(is_reliable=False, bytes_found=29, details=(Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
    Detections(is_reliable=True, bytes_found=63, details=(Detection(language_name=u'Japanese', language_code=u'ja', percent=98, score=3848.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
    Supports 282 languages.
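
    Since the question asks for a probability-like metric, here is a minimal sketch of pulling the best guess out of a result shaped like the ones printed above. The helper name top_language and the min_percent threshold are hypothetical, not part of cld2. The namedtuples are stand-ins that mirror the printed fields so the sketch runs without cld2 installed; with cld2 available you would pass cld2.detect(text) instead. Treating percent as a rough confidence proxy is an assumption.

```python
from collections import namedtuple

# Stand-ins mirroring the fields shown in the printed output above
# (assumption: the real cld2 result exposes the same attribute names).
Detection = namedtuple("Detection", "language_name language_code percent score")
Detections = namedtuple("Detections", "is_reliable bytes_found details")


def top_language(result, min_percent=50):
    """Return (language_code, percent) for the best guess, or None.

    Hypothetical helper: `min_percent` is an illustrative threshold,
    not a cld2 parameter.
    """
    if not result.is_reliable:
        return None
    # Skip the 'un' (Unknown) placeholder entries.
    candidates = [d for d in result.details if d.language_code != "un"]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d.percent)
    if best.percent < min_percent:
        return None
    return best.language_code, best.percent


unknown = Detection("Unknown", "un", 0, 0.0)
english = Detections(True, 23, (Detection("ENGLISH", "en", 95, 1675.0), unknown, unknown))
unreliable = Detections(False, 29, (unknown, unknown, unknown))

print(top_language(english))     # ('en', 95)
print(top_language(unreliable))  # None
```

    Returning None for unreliable results lets the caller decide how to handle short or ambiguous input, such as the Chinese sample above.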
    

    Others

    • https://detectlanguage.com/ - a service around CLD2
