What algorithm do I need to find n-grams?

無奈伤痛 2020-12-04 16:56

What algorithm is used for finding ngrams?

Supposing my input data is an array of words plus the size of the n-grams I want to find, what algorithm should I use?

7 answers
  •  慢半拍i (original poster)
    2020-12-04 17:44

    EDIT: Sorry, this is PHP. I wasn't quite sure what you wanted. I don't know how to do it in Java, but perhaps the following could be converted easily enough.

    Well, it depends on the size of the n-grams you want.

    I've had quite a lot of success with single letters (especially accurate for language detection), and their counts are easy to get with:

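    // Lowercase the text, strip everything except a-z, and split it into single characters.
    $letters = str_split(preg_replace('/[^a-z]/', '', strtolower($text)));
    // Map each letter to the number of times it occurs.
    $letters = array_count_values($letters);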
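
    To make the result of those two lines concrete, here is a small sketch that wraps them in a function and prints the counts for a short string. The name letterCounts and the sample text are made up for this example, and the empty-string guard is an addition of mine, not part of the original snippet.

    ```php
    <?php
    // Hypothetical helper: count how often each letter a-z occurs in $text.
    function letterCounts(string $text): array {
        // Lowercase, then drop everything that is not a-z.
        $clean = preg_replace('/[^a-z]/', '', strtolower($text));
        if ($clean === '') {
            // Avoid calling str_split() on an empty string.
            return array();
        }
        return array_count_values(str_split($clean));
    }

    $counts = letterCounts('Hello, World!');
    arsort($counts); // most frequent letters first, keys preserved
    print_r($counts);
    ```

    For the sample text, 'l' is counted three times and 'o' twice; every other letter occurs once.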
    

    Then there is the following function for calculating ngrams from a word:

    function getNgrams($word, $n = 3) {
        $ngrams = array();
        $len = strlen($word);
        // Slide a window of $n characters over the word, one position at a time.
        // strlen() and substr() count bytes, so this assumes single-byte text.
        for ($i = 0; $i + $n <= $len; $i++) {
            $ngrams[] = substr($word, $i, $n);
        }
        return $ngrams;
    }
    

    The source of the above is here. I recommend reading it, since it has lots of functions that do exactly what you want.
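
    For the input described in the question (an array of words plus a size n), the same sliding-window idea can be applied to the array itself instead of to the characters of one word. This is only a sketch: wordNgrams is a name invented here, not something taken from the linked source, and joining the words of each n-gram with a single space is an assumption about the output format you want.

    ```php
    <?php
    // Hypothetical helper: slide a window of $n words across $words.
    function wordNgrams(array $words, int $n): array {
        $ngrams = array();
        if ($n < 1) {
            return $ngrams; // nothing sensible to return for n < 1
        }
        $last = count($words) - $n; // last valid window start
        for ($i = 0; $i <= $last; $i++) {
            // array_slice() copies $n words starting at position $i.
            $ngrams[] = implode(' ', array_slice($words, $i, $n));
        }
        return $ngrams;
    }

    $words = array('the', 'quick', 'brown', 'fox');
    print_r(wordNgrams($words, 2));
    // The bigrams are: "the quick", "quick brown", "brown fox"
    ```

    An input shorter than n yields an empty array. The work is proportional to the number of words times n, which should be fine for typical n-gram sizes.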
