What techniques/tools are there for discovering common phrases in chunks of text?


Have a look at N-grams. The most common phrases will necessarily contribute the most common N-grams. I'd start out with word trigrams and see where that leads. (Space required is N times the length of the text, so you can't let N get too big.) If you save the positions and not just a count, you can then see if the trigrams can be extended to form common phrases.
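As a rough Python sketch of that idea (the function name `find_trigrams` and the `min_count` threshold are my own additions, not part of the answer), counting word trigrams together with their positions might look like this:

    # Minimal sketch: collect word trigrams and the positions where they occur,
    # so frequent trigrams can later be extended/merged into longer phrases.
    from collections import defaultdict

    def find_trigrams(text, min_count=2):
        words = text.lower().split()
        positions = defaultdict(list)          # trigram -> list of starting word indices
        for i in range(len(words) - 2):
            trigram = tuple(words[i:i + 3])
            positions[trigram].append(i)
        # Keep only trigrams that occur often enough; the saved positions let you
        # check whether neighbouring trigrams overlap and form a longer phrase.
        return {t: p for t, p in positions.items() if len(p) >= min_count}

    common = find_trigrams("the quick brown fox jumps ... the quick brown dog runs ...")
    for trigram, pos in common.items():
        print(" ".join(trigram), pos)

Because the positions are kept (not just counts), you can check whether two frequent trigrams overlap by two words at adjacent positions and stitch them into a four-word phrase, and so on.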

I'm not sure if this is what you want, but check out the longest common substring problem and diff utility algorithms.
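For instance, Python's difflib (which implements a diff-style matching algorithm) can find the longest common block of two strings; a quick sketch with made-up sample strings:

    # Longest common substring of two strings via difflib's SequenceMatcher.
    from difflib import SequenceMatcher

    a = "Please find the attached invoice for your records."
    b = "Hi, please find the attached invoice and let me know."

    m = SequenceMatcher(None, a.lower(), b.lower()).find_longest_match(0, len(a), 0, len(b))
    print(a[m.a : m.a + m.size])   # -> "Please find the attached invoice "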

Something like this might work, depending on whether you care about word boundaries. In pseudo-code (where LCS is a function for computing the Longest Common Subsequence):

someMinimumLengthParameter = 20;
foundPhrases = [];

do {
    lcs = LCS(mailbodies);

    // Record the phrase unless it is one we deliberately ignore.
    if (lcs not in ignoredPhrases) {
        foundPhrases += lcs;
    }

    // Always remove the phrase from every body, even when it is ignored;
    // otherwise LCS keeps returning the same string and the loop never ends.
    for body in mailbodies {
        body.remove(lcs);
    }
} while (lcs.length > someMinimumLengthParameter);
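A rough, runnable Python version of that loop might look like the sketch below. It approximates the multi-body LCS by repeatedly taking the longest common substring pairwise with difflib; `mail_bodies`, `ignored_phrases`, and the length threshold are assumed inputs, not part of the original answer.

    from difflib import SequenceMatcher
    from functools import reduce

    def longest_common_substring(a, b):
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return a[m.a : m.a + m.size]

    def common_phrases(mail_bodies, ignored_phrases=(), min_length=20):
        bodies = list(mail_bodies)
        found = []
        while True:
            # Approximate the common substring of all bodies pairwise.
            lcs = reduce(longest_common_substring, bodies)
            if len(lcs) <= min_length:
                break
            if lcs not in ignored_phrases:
                found.append(lcs)
            # Remove the phrase everywhere so the next pass finds the next one.
            bodies = [body.replace(lcs, "") for body in bodies]
        return found

    print(common_phrases([
        "Hello Bob, please find the report attached. Sent from my phone.",
        "Hi team, please find the report attached. Thanks, Alice.",
        "FYI - please find the report attached. Sent from my phone.",
    ]))

Note that removing the found phrase before looping again is what lets the next iteration surface the next-longest shared phrase instead of returning the same one forever.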