How to determine the (natural) language of a document?

情话喂你 2020-12-24 07:16

I have a set of documents in two languages: English and German. There is no usable meta information about these documents, so a program can only look at the content. Based on t

11 answers
  •  执笔经年
    2020-12-24 08:19

    You can use the Google Language Detection API.

    Here is a little program that uses it:

    import json
    import sys
    import urllib.parse
    import urllib.request

    # Endpoint of the Google AJAX Language API detection service.
    baseUrl = "http://ajax.googleapis.com/ajax/services/language/detect"


    def detect(text):
        """Return the W3C language code detected for the given text."""
        # Only send the first 3000 characters to keep the request URL short.
        params = urllib.parse.urlencode({'v': '1.0', 'q': text[:3000]})
        with urllib.request.urlopen(baseUrl + "?" + params) as resp:
            data = json.load(resp)
        # A KeyError or TypeError here means the response held no detection result.
        return data['responseData']['language']


    def test():
        print("Type some text to detect its language:")
        while True:
            text = input('#>  ')
            print(detect(text))


    if __name__ == '__main__':
        try:
            test()
        except KeyboardInterrupt:
            # Leave the prompt line cleanly on Ctrl-C.
            print("\n")
            sys.exit(0)
    
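
    To apply this to the original problem (a pile of English and German documents), a thin wrapper around the detector may be enough. The following is only a sketch: `group_by_language`, `primary_subtag` and `fake_detect` are hypothetical names that are not part of any Google API, and `fake_detect` is a toy offline stand-in so the example runs without network access. In real use you would pass the `detect` function from the program above, or any other callable with the same signature.

```python
from collections import defaultdict


def primary_subtag(code):
    """Reduce a tag such as 'de-AT' or 'en_US' to its primary subtag ('de', 'en')."""
    return code.replace("_", "-").split("-")[0].lower()


def group_by_language(documents, detect, expected=("en", "de")):
    """Group document names by detected language.

    documents: mapping of document name -> text
    detect:    callable taking a text and returning a language code
    Codes outside `expected`, and failed lookups, are filed under 'unknown'.
    """
    groups = defaultdict(list)
    for name, text in documents.items():
        try:
            lang = primary_subtag(detect(text))
        except Exception:
            # Network errors or malformed responses should not stop the batch.
            lang = "unknown"
        if lang not in expected:
            lang = "unknown"
        groups[lang].append(name)
    return dict(groups)


def fake_detect(text):
    """Hypothetical stand-in for detect(): a toy rule, for this demo only."""
    return "de-DE" if " und " in f" {text.lower()} " else "en"


if __name__ == "__main__":
    docs = {
        "a.txt": "The quick brown fox jumps over the lazy dog.",
        "b.txt": "Der Hund und die Katze schlafen.",
    }
    print(group_by_language(docs, fake_detect))
```

    Passing the detector in as an argument keeps the grouping logic independent of any particular service, so it can be tried out offline and the detector swapped later.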

    Other useful references:

    Google Announces APIs (and demo): http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html

    Python wrapper: http://code.activestate.com/recipes/576890-python-wrapper-for-google-ajax-language-api/

    Another python script: http://www.halotis.com/2009/09/15/google-translate-api-python-script/

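    RFC 1766 defines the language tags referred to above as W3C language codes; it has since been superseded, most recently by RFC 5646 (BCP 47).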

    Get the current language codes from: http://www.iana.org/assignments/language-subtag-registry
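
    The registry linked above is a plain-text file of records separated by `%%` lines, each record a list of `Field: value` pairs. As a rough sketch of how the codes could be looked up, the snippet below parses a small hand-written excerpt in that format. `SAMPLE`, `parse_registry` and `language_names` are hypothetical names, and the sample values are for illustration only and are not guaranteed to match the live file.

```python
# Hand-written excerpt in the registry's record format (illustrative values).
SAMPLE = """\
File-Date: 2020-12-18
%%
Type: language
Subtag: de
Description: German
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: language
Subtag: en
Description: English
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: region
Subtag: DE
Description: Germany
Added: 2005-10-16
"""


def parse_registry(text):
    """Return a list of records; each maps a field name to a list of values."""
    records, current, last = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store what has been collected so far.
            if current:
                records.append(current)
            current, last = {}, None
        elif line[:1].isspace() and last is not None:
            # A line starting with whitespace continues the previous value.
            current[last][-1] += " " + line.strip()
        elif ":" in line:
            field, value = line.split(":", 1)
            last = field.strip()
            # Fields such as Description may repeat, so keep every value.
            current.setdefault(last, []).append(value.strip())
    if current:
        records.append(current)
    return records


def language_names(text):
    """Map each language subtag to its first description."""
    return {
        rec["Subtag"][0]: rec["Description"][0]
        for rec in parse_registry(text)
        if rec.get("Type") == ["language"] and "Subtag" in rec
    }


if __name__ == "__main__":
    print(language_names(SAMPLE))
```

    Only records of type `language` are kept, so region or script subtags that happen to share letters with a language code (such as `DE`) do not end up in the lookup table.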
