How to compute letter frequency similarity?

Submitted by 喜你入骨 on 2019-12-14 00:21:52

Question


Given this data (relative letter frequency from both languages):

spanish => 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83,
english => 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80,

And then computing the letter frequency for the string "this is a test" gives me:

"t"=>21.43, "s"=>14.29, "i"=>7.14, "r"=>7.14, "y"=>7.14, "'"=>7.14, "h"=>7.14, "e"=>7.14, "l"=>7.14

So, what would be a good approach for matching the letter frequencies of a given string against each language's profile (to detect the language)? I've seen (and tested) some examples using Levenshtein distance, and the approach seems to work fine until you add more languages.

"this is a test" gives (shortest distance:) [:english, 13] ...
"esto es una prueba" gives (shortest distance:) [:spanish, 13] ...

Answer 1:


Have you considered using cosine similarity to determine the amount of similarity between two vectors?

The first vector would be the letter frequencies extracted from the test string (to be classified), and the second vector would be for a specific language.
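A minimal cosine-similarity sketch under that framing (profiles as hashes of letter => relative frequency; missing letters count as zero; the sample vector below is my own recomputation for "this is a test"):

```ruby
# Cosine similarity between two sparse frequency vectors stored as hashes.
def cosine_similarity(a, b)
  keys = a.keys | b.keys
  dot = keys.sum { |k| a.fetch(k, 0) * b.fetch(k, 0) }
  mag = ->(v) { Math.sqrt(v.values.sum { |x| x * x }) }
  dot / (mag.(a) * mag.(b))
end

english = { "e" => 12.60, "t" => 9.37, "a" => 8.34, "o" => 7.70, "n" => 6.80 }
spanish = { "e" => 13.72, "a" => 11.72, "o" => 8.44, "s" => 7.20, "n" => 6.83 }
sample  = { "t" => 27.27, "s" => 27.27, "i" => 18.18, "h" => 9.09, "a" => 9.09, "e" => 9.09 }

scores = { "english" => english, "spanish" => spanish }
           .transform_values { |prof| cosine_similarity(sample, prof) }
```

The language with the highest similarity score is the classifier's guess.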

You're currently extracting single-letter frequencies (unigrams). I would suggest extracting higher-order n-grams, such as bigrams or trigrams (or even larger if you have enough training data). For example, for bigrams you would compute the frequencies of "aa", "ab", "ac", ..., "zz", which lets you extract more information than single-character frequencies alone.

Be careful, though: higher-order n-grams require more training data; otherwise you will have many zero values for character combinations you haven't seen before.
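A bigram extractor in the same style as the unigram one could look like this (a sketch; bigrams are taken within words only, so no spurious cross-word pairs, and smoothing to handle unseen bigrams is left out for brevity):

```ruby
# Relative bigram frequencies of a string, computed within each word.
def bigram_frequencies(text)
  grams = text.downcase.scan(/[a-z]+/)
              .flat_map { |word| word.chars.each_cons(2).map(&:join) }
  total = grams.size.to_f
  grams.tally.transform_values { |n| n / total }
end
```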

A second possibility is to use tf-idf (term frequency–inverse document frequency) weightings instead of raw letter (term) frequencies.
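A toy sketch of that idea (my formulation, using a smoothed idf): treat each language profile as a "document", so letters that appear in every language's profile get down-weighted relative to letters that are distinctive of one language.

```ruby
# Language profiles acting as the "document" collection (truncated tables
# from the question).
DOCS = {
  english: { "e" => 12.60, "t" => 9.37, "a" => 8.34, "o" => 7.70, "n" => 6.80 },
  spanish: { "e" => 13.72, "a" => 11.72, "o" => 8.44, "s" => 7.20, "n" => 6.83 }
}

# Smoothed inverse document frequency of a letter across the profiles.
def idf(letter)
  df = DOCS.count { |_, prof| prof.key?(letter) }
  Math.log((1.0 + DOCS.size) / (1 + df)) + 1
end

# Reweight a raw frequency profile by idf.
def tf_idf(profile)
  profile.to_h { |letter, freq| [letter, freq * idf(letter)] }
end
```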

Research

Here is a good slideshow on language identification for (very) short texts, which uses machine learning classifiers (but also has some other good info).

Here is a short paper A Comparison of Language Identification Approaches on Short, Query-Style Texts that you might also find useful.




Answer 2:


The examples you gave consist of one short sentence each. Statistically, if your input were longer (e.g. a paragraph), the characteristic frequencies should be easier to identify.

If you can't rely on the user giving a longer input, perhaps also look for common words of each language (e.g. is, as, and, but ...) for cases where the letter frequencies alone are inconclusive?
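That heuristic could be sketched like this (the stop-word lists below are illustrative samples I chose, not authoritative):

```ruby
# A few common words per language, used as a tie-breaking signal.
COMMON_WORDS = {
  english: %w[the is and a of to in it],
  spanish: %w[el la es y de un una en]
}

# Count how many distinct common words of each language occur in the text.
def common_word_score(text)
  words = text.downcase.scan(/[[:alpha:]]+/)
  COMMON_WORDS.transform_values { |list| (words & list).size }
end
```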




Answer 3:


n-grams certainly help with short texts, and they help a great deal. With any reasonably long text (a paragraph?), simple letter frequencies work well. As an example, I wrote a short demo of this; you can download the source at http://georgeflanagin.com/free.code.php

It's the last example on the page.
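To illustrate how far simple letter frequencies alone can go, here is a minimal nearest-profile classifier (my sketch, not the demo linked above): compare profiles by total absolute difference and pick the closest language.

```ruby
# Truncated letter-frequency tables from the question.
PROFILES = {
  english: { "e" => 12.60, "t" => 9.37, "a" => 8.34, "o" => 7.70, "n" => 6.80 },
  spanish: { "e" => 13.72, "a" => 11.72, "o" => 8.44, "s" => 7.20, "n" => 6.83 }
}

# Sum of absolute per-letter differences; missing letters count as zero.
def profile_distance(a, b)
  (a.keys | b.keys).sum { |k| (a.fetch(k, 0) - b.fetch(k, 0)).abs }
end

# The language whose profile is closest to the input frequencies.
def detect(freqs)
  PROFILES.min_by { |_, prof| profile_distance(freqs, prof) }.first
end
```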



Source: https://stackoverflow.com/questions/15710292/how-to-compute-letter-frequency-similarity
