How to determine the (natural) language of a document?

情话喂你 2020-12-24 07:16

I have a set of documents in two languages: English and German. There is no usable metadata about these documents; a program can look only at the content. Based on t

11 answers

执笔经年 (original poster)

2020-12-24 08:19

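You can use the language detection service of the Google AJAX Language API. Google has since deprecated this API, so the endpoint may no longer respond.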

Here is a little program that uses it:

import json
import sys
import urllib.parse
import urllib.request

# Endpoint of the Google AJAX Language API detection service.
BASE_URL = "http://ajax.googleapis.com/ajax/services/language/detect"


def detect(text):
    """Return the language code the API reports for the given text."""
    # Only send the first 3000 characters to keep the request URL short.
    params = urllib.parse.urlencode({"v": "1.0", "q": text[:3000]})
    with urllib.request.urlopen(BASE_URL + "?" + params) as resp:
        data = json.load(resp)
    # Raises KeyError or TypeError if the response carries no detection result.
    return data["responseData"]["language"]


def test():
    print("Type some text to detect its language:")
    while True:
        text = input("#>  ")
        print(detect(text))


if __name__ == "__main__":
    try:
        test()
    except KeyboardInterrupt:
        # Exit quietly on Ctrl+C.
        print()
        sys.exit(0)
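
Since the question only involves English and German, a purely local check can also be sketched without any web service. This is not the method used above, just a minimal illustration: it counts a few common function words of each language and picks the language with more hits. The word lists and the name `guess_language` are assumptions made for this example, and short or mixed texts may well be misclassified.

```python
import re

# Small, illustrative stopword lists (assumption: far from exhaustive).
ENGLISH_WORDS = {"the", "and", "of", "to", "is", "that", "it", "with", "for", "this"}
GERMAN_WORDS = {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "ein", "eine"}


def guess_language(text):
    """Return 'en', 'de', or None when neither language scores higher."""
    # Extract runs of letters (including umlauts), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    en_hits = sum(1 for w in words if w in ENGLISH_WORDS)
    de_hits = sum(1 for w in words if w in GERMAN_WORDS)
    if en_hits == de_hits:
        return None  # tie or no evidence either way
    return "en" if en_hits > de_hits else "de"


if __name__ == "__main__":
    print(guess_language("This is the first sentence of the document."))
    print(guess_language("Das ist der erste Satz und er ist nicht lang."))
```

Returning `None` on a tie keeps the caller in control: such documents can be passed to a stronger detector or flagged for manual review.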

Other useful references:

Google Announces APIs (and demo): http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html

Python wrapper: http://code.activestate.com/recipes/576890-python-wrapper-for-google-ajax-language-api/

Another python script: http://www.halotis.com/2009/09/15/google-translate-api-python-script/

RFC 1766 defines the language tags (it is an IETF document, since superseded by RFC 5646 / BCP 47)

Get the current language codes from: http://www.iana.org/assignments/language-subtag-registry
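
The registry is a plain-text file of records separated by `%%` lines, each with fields such as `Type`, `Subtag` and `Description`. As a rough sketch of how a returned code could be mapped to a language name, the snippet below parses a short excerpt written in that format. The function name `parse_language_subtags` and the embedded sample are assumptions for illustration; the real file is much larger, and this simplified parser ignores folded continuation lines.

```python
# Short excerpt in the registry's record format (records separated by "%%").
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: de
Description: German
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: language
Subtag: en
Description: English
Added: 2005-10-16
Suppress-Script: Latn
"""


def parse_language_subtags(registry_text):
    """Map each language subtag to its first description."""
    names = {}
    for record in registry_text.split("%%"):
        fields = {}
        for line in record.splitlines():
            # Skip blank lines and folded continuation lines (simplification).
            if not line or line[0].isspace() or ":" not in line:
                continue
            key, value = line.split(":", 1)
            # Keep only the first value of repeated fields such as Description.
            fields.setdefault(key.strip(), value.strip())
        if fields.get("Type") == "language" and "Subtag" in fields:
            names[fields["Subtag"]] = fields.get("Description", "")
    return names


if __name__ == "__main__":
    names = parse_language_subtags(SAMPLE_REGISTRY)
    print(names["de"], names["en"])
```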
