I have a set of documents in two languages: English and German. There is no usable meta information about these documents, a program can look at the content only. Based on t
You can use the Google Language Detection API.
Here is a little program that uses it:
baseUrl = "http://ajax.googleapis.com/ajax/services/language/detect"
def detect(text):
import json,urllib
"""Returns the W3C language code of a natural language"""
params = urllib.urlencode({'v': '1.0' , "q":text[0:3000]}) # only use first 3000 characters
resp = json.load(urllib.urlopen(baseUrl + "?" + params))
try:
retText = resp['responseData']['language']
except:
raise
return retText
def test():
print "Type some text to detect its language:"
while True:
text = raw_input('#> ')
retText = detect(text)
print retText
if __name__=='__main__':
import sys
try:
test()
except KeyboardInterrupt:
print "\n"
sys.exit(0)
Other useful references:
Google Announces APIs (and demo): http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html
Python wrapper: http://code.activestate.com/recipes/576890-python-wrapper-for-google-ajax-language-api/
Another python script: http://www.halotis.com/2009/09/15/google-translate-api-python-script/
RFC 1766 defines W3C languages
Get the current language codes from: http://www.iana.org/assignments/language-subtag-registry