What algorithm is used for finding ngrams?
Supposing my input data is an array of words and the size of the ngrams I want to find, what algorithm should I use?
EDIT: Sorry, this is PHP; I wasn't quite sure what you wanted. I don't know how to do it in Java, but the following should be easy enough to convert.
Well, it depends on the size of the ngrams you want.
I've had quite a lot of success with single letters (1-grams), which are especially accurate for language detection and are easy to count:
    $letters = str_split(preg_replace('/[^a-z]/', '', strtolower($text)));
    $letters = array_count_values($letters);
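If you want to compare texts of different lengths, raw counts are awkward, so here is a rough sketch that turns the counts into relative frequencies. The function name `letterProfile` is my own invention for illustration, and this is only one way to build such a profile, not a complete language detector.

```php
<?php
// Hypothetical helper (not from the original source): relative letter
// frequencies for a piece of text, using the same a-z cleanup as above.
function letterProfile(string $text): array {
    // Lowercase the text and drop everything that is not a-z.
    $clean = preg_replace('/[^a-z]/', '', strtolower($text));
    if ($clean === null || $clean === '') {
        return []; // nothing left to count
    }
    $counts = array_count_values(str_split($clean));
    $total = strlen($clean);
    $profile = [];
    foreach ($counts as $letter => $count) {
        $profile[$letter] = $count / $total; // share of all letters
    }
    arsort($profile); // most frequent letters first
    return $profile;
}

print_r(letterProfile('Hello World'));
```

Two such profiles can then be compared letter by letter; how you score the difference is up to you.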
For longer ngrams, the following function returns every run of $n consecutive characters in a word:
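    function getNgrams($word, $n = 3) {
        $ngrams = array();
        $len = strlen($word); // byte length, so this assumes single-byte characters
        for ($i = 0; $i < $len; $i++) {
            // Only start emitting once at least $n characters have been seen.
            if ($i > ($n - 2)) {
                $ng = '';
                // Collect the $n characters that end at position $i.
                for ($j = $n - 1; $j >= 0; $j--) {
                    $ng .= $word[$i - $j];
                }
                $ngrams[] = $ng;
            }
        }
        return $ngrams;
    }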
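Since the question mentions an array of words rather than a single word, the same sliding-window idea can be applied one level up. This is a minimal sketch, assuming the words are already split into a plain array; `getWordNgrams` is a name I made up and is not from the source mentioned below.

```php
<?php
// Hypothetical helper (not from the original source): ngrams over an
// array of words, built with a sliding window of $n words.
function getWordNgrams(array $words, int $n = 2): array {
    $ngrams = [];
    if ($n < 1) {
        return $ngrams; // a window must hold at least one word
    }
    $count = count($words);
    // There are $count - $n + 1 windows, and none if $n exceeds $count.
    for ($i = 0; $i + $n <= $count; $i++) {
        $ngrams[] = array_slice($words, $i, $n);
    }
    return $ngrams;
}

$words = explode(' ', 'the quick brown fox');
foreach (getWordNgrams($words, 2) as $ngram) {
    echo implode(' ', $ngram), "\n";
}
// Prints:
// the quick
// quick brown
// brown fox
```

Each ngram comes back as an array of words, so you can join them with `implode` or use them as they are.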
The source of the above is here; I recommend reading it, as it has lots of functions that do exactly what you want.