How does language detection work?


Question


I have been wondering for some time how Google Translate (or a hypothetical translator) detects the language of the string entered in the "from" field. The only approach I can come up with is looking for words in the input string that are unique to a language. Another way could be to check sentence formation or other semantics in addition to keywords, but that seems very difficult given the number of languages and their different semantics. I did some research and found that there are methods that use n-gram sequences and statistical models to detect the language. I would appreciate a high-level answer too.


Answer 1:


You don't have to do a deep analysis of the text to get an idea of what language it's in. Statistics tells us that every language has specific character patterns and frequencies. That's a pretty good first-order approximation. It gets worse when the text is in multiple languages, but it's still not extremely complex. Of course, if the text is too short (e.g. a single word, or worse, a single short word), statistics doesn't work and you need a dictionary.
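As a rough illustration of that first-order approximation, here is a small Python sketch (not from the answer; the "training" sentences are made-up stand-ins for real corpora). It builds a letter-frequency profile per language and picks the reference profile closest to the input:

from collections import Counter

def letter_profile(text):
    # Relative frequency of each alphabetic character in the text.
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def profile_distance(a, b):
    # Sum of absolute frequency differences over all characters seen.
    chars = set(a) | set(b)
    return sum(abs(a.get(c, 0.0) - b.get(c, 0.0)) for c in chars)

# Tiny made-up "corpora"; real profiles would come from large amounts of text.
reference = {
    "english": letter_profile("the quick brown fox jumps over the lazy dog again and again"),
    "spanish": letter_profile("el veloz zorro marron salta sobre el perro perezoso una y otra vez"),
}

def guess_language(text):
    profile = letter_profile(text)
    return min(reference, key=lambda lang: profile_distance(profile, reference[lang]))

print(guess_language("where is the nearest train station"))          # likely english
print(guess_language("donde esta la estacion de tren mas cercana"))  # likely spanish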




Answer 2:


Take the English Wikipedia. Check the probability that the letter 'a' is followed by a 'b' (for example), and do that for every combination of letters; you will end up with a matrix of transition probabilities.

If you do the same with Wikipedia in other languages, you will get a different matrix for each language.

To detect the language, just use all those matrices and take the probabilities as a score. Let's say that in English you'd get these probabilities:

t->h = 0.3, h->e = 0.2

and in the Spanish matrix you'd get that

t->h = 0.01, h->e = 0.3

The word 'the', using the English matrix, would give you a score of 0.3 + 0.2 = 0.5, and using the Spanish one: 0.01 + 0.3 = 0.31.

The English matrix wins, so the text is most likely English.
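Here is a minimal Python sketch of that idea (not from the answer; the training snippets below stand in for Wikipedia dumps, and scoring simply sums transition probabilities as in the example above, whereas a more principled version would multiply probabilities or sum their logs):

from collections import Counter, defaultdict

def transition_matrix(corpus):
    # Estimate P(next_char | char) from adjacent character pairs.
    text = corpus.lower()
    pair_counts = Counter(zip(text, text[1:]))
    char_counts = Counter(text[:-1])
    matrix = defaultdict(float)
    for (a, b), n in pair_counts.items():
        matrix[(a, b)] = n / char_counts[a]
    return matrix

def score(text, matrix):
    # Sum of transition probabilities over the text's character pairs.
    text = text.lower()
    return sum(matrix[(a, b)] for a, b in zip(text, text[1:]))

# Stand-in corpora; in practice you would train on something like Wikipedia.
matrices = {
    "english": transition_matrix("the people there thought the weather was nice"),
    "spanish": transition_matrix("la gente pensaba que el tiempo era agradable"),
}

text = "the"
best = max(matrices, key=lambda lang: score(text, matrices[lang]))
print(best)  # the matrix with the highest score "wins"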




Answer 3:


If you want to implement a lightweight language guesser in the programming language of your choice, you can use the method from Cavnar and Trenkle '94, 'N-Gram-Based Text Categorization'. You can find the paper on Google Scholar, and it is pretty straightforward.

Their method builds an n-gram frequency profile for every language it should later be able to guess, from some training text in that language. The same profile is then built for the unknown text as well and compared to the previously trained profiles with a simple "out-of-place" measure. If you use unigrams + bigrams (possibly + trigrams) and compare the 100-200 most frequent n-grams, your hit rate should be over 95% if the text to guess is not too short. (There was an online demo, but it doesn't seem to work at the moment.)
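Below is a minimal Python sketch of the out-of-place comparison described above (not the authors' code; the training sentences are placeholders, and real profiles would be built from far more text per language):

from collections import Counter

def ngram_profile(text, max_n=3, top=200):
    # Ranked list of the most frequent character n-grams (1..max_n).
    text = " ".join(text.lower().split())
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; n-grams missing from the language profile
    # get a fixed maximum penalty, as in Cavnar & Trenkle.
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[gram]) if gram in rank else max_penalty
               for i, gram in enumerate(doc_profile))

# Placeholder training texts; real profiles need much more data per language.
profiles = {
    "english": ngram_profile("this is some english training text about many things"),
    "german": ngram_profile("dies ist ein deutscher trainingstext ueber viele dinge"),
}

unknown = ngram_profile("this text is probably english")
print(min(profiles, key=lambda lang: out_of_place(unknown, profiles[lang])))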

There are other approaches to language guessing, including computing n-gram probabilities and using more advanced classifiers, but in most cases the Cavnar and Trenkle approach performs well enough.




Answer 4:


An implementation example.

Mathematica is a good fit for implementing this. It recognizes words (i.e. it has built-in dictionaries) in the following languages:

dicts = DictionaryLookup[All]
{"Arabic", "BrazilianPortuguese", "Breton", "BritishEnglish",
 "Catalan", "Croatian", "Danish", "Dutch", "English", "Esperanto",
 "Faroese", "Finnish", "French", "Galician", "German", "Hebrew",
 "Hindi", "Hungarian", "IrishGaelic", "Italian", "Latin", "Polish",
 "Portuguese", "Russian", "ScottishGaelic", "Spanish", "Swedish"}

I built a small, naive function to calculate the probability that a sentence belongs to each of those languages:

(* Split the sentence into words, look each word up in all dictionaries,
   tally the matching languages, and normalize by the number of words *)
f[text_] := 
 SortBy[{#[[1]], #[[2]] / Length@k} & /@ (Tally@(First /@ 
       Flatten[DictionaryLookup[{All, #}] & /@ (k = 
           StringSplit[text]), 1])), -#[[2]] &]

So, just by looking words up in dictionaries, you can get a good approximation, even for short sentences:

f["we the people"]
{{BritishEnglish,1},{English,1},{Polish,2/3},{Dutch,1/3},{Latin,1/3}}

f["sino yo triste y cuitado que vivo en esta prisión"]
{{Spanish,1},{Portuguese,7/10},{Galician,3/5},... }

f["wszyscy ludzie rodzą się wolni"]
{{"Polish", 3/5}}

f["deutsch lernen mit jetzt"]
{{"German", 1}, {"Croatian", 1/4}, {"Danish", 1/4}, ...}



Answer 5:


You might be interested in the WiLI benchmark dataset for written language identification. The high-level answer, which you can also find in the paper, is the following:

  • Clean the text: remove things you don't want or need; make Unicode unambiguous by applying a normal form.
  • Feature extraction: count n-grams, create tf-idf features, or something similar.
  • Train a classifier on the features: neural networks, SVMs, Naive Bayes, ... whatever you think could work (a minimal sketch of this pipeline follows below).
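As an illustration, here is a minimal sketch of that pipeline using scikit-learn (not from the paper; the toy training data is made up, and a real system would train on a corpus such as WiLI):

# pip install scikit-learn
import unicodedata
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 1: clean the text (here: Unicode normalization and lowercasing).
def clean(text):
    return unicodedata.normalize("NFC", text).lower()

# Made-up toy training data; a real corpus (e.g. WiLI) has thousands of samples.
texts = ["this is english text", "another english sentence",
         "esto es texto en espanol", "otra frase en espanol",
         "das ist deutscher text", "noch ein deutscher satz"]
labels = ["eng", "eng", "spa", "spa", "deu", "deu"]

# Steps 2 and 3: character n-gram tf-idf features + a Naive Bayes classifier.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit([clean(t) for t in texts], labels)

print(model.predict([clean("noch ein satz auf deutsch")]))  # e.g. ['deu']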


Source: https://stackoverflow.com/questions/7670427/how-does-language-detection-work
