What techniques/tools are there for discovering common phrases in chunks of text?

我寻月下人不归 2021-01-02 17:28

Let's say I have 100000 email bodies and 2000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet".

3 answers
  •  旧巷少年郎
    2021-01-02 18:06

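    Something like this might work, depending on whether you care about word boundaries. In pseudo-code (where LCS is a function that computes the longest common substring of the mail bodies; a longest common subsequence need not be contiguous, so it could not be removed as a phrase):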

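    someMinimumLengthParameter = 20;
    ignoredPhrases = [];   // pre-known phrases to skip (assumed to be supplied by the caller)
    foundPhrases = [];

    while (true) {
        lcs = LCS(mailbodies);
        // Stop once the longest remaining common string is too short.
        if (lcs.length < someMinimumLengthParameter) break;

        if (lcs not in ignoredPhrases) foundPhrases += lcs;

        // Remove the phrase even when it is ignored; otherwise the next
        // LCS call returns the same string and the loop never ends.
        for body in mailbodies {
            body.remove(lcs);
        }
    }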
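
The pseudo-code above can be sketched in Python roughly as follows. This is a minimal sketch, not a tuned implementation: the names `longest_common_substring`, `find_phrases`, `min_support` and `ignored` are made up for illustration. `min_support` is an added assumption that lets a phrase count as "common" when it appears in only some of the bodies (for example 2000 out of 100000); leaving it unset reproduces the pseudo-code, where the string must occur in every body. The search is brute force over character substrings, so it is only practical for small inputs.

```python
from collections import Counter


def longest_common_substring(bodies, min_support):
    """Return the longest substring found in at least `min_support` bodies ('' if none)."""

    def shared(length):
        # Count, for every substring of this length, how many bodies contain it.
        counts = Counter()
        for body in bodies:
            counts.update({body[i:i + length] for i in range(len(body) - length + 1)})
        hits = sorted(s for s, n in counts.items() if n >= min_support)
        return hits[0] if hits else None

    best, lo, hi = "", 1, max((len(b) for b in bodies), default=0)
    # Binary search on length: if a substring is shared, so is every shorter piece of it.
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = shared(mid)
        if hit is not None:
            best, lo = hit, mid + 1
        else:
            hi = mid - 1
    return best


def find_phrases(bodies, min_length=20, min_support=None, ignored=()):
    """Repeatedly extract and remove the longest shared substring."""
    bodies = list(bodies)
    if min_support is None:
        min_support = len(bodies)  # as in the pseudo-code: common to every body
    found = []
    while True:
        lcs = longest_common_substring(bodies, min_support)
        if not lcs or len(lcs) < min_length:
            break
        if lcs not in ignored:
            found.append(lcs)
        # Remove the phrase even when it is ignored, otherwise the loop never advances.
        bodies = [b.replace(lcs, "") for b in bodies]
    return found


if __name__ == "__main__":
    mails = [
        "Hi Bob, the quick brown fox jumps over the lazy dog. Bye",
        "Dear Al: the quick brown fox jumps over the lazy dog! Thanks",
        "the quick brown fox jumps over the lazy dog",
        "something else entirely with no pangram inside",
    ]
    print(find_phrases(mails, min_length=20, min_support=3))
```

For real mail volumes a suffix array or generalized suffix tree would be the usual replacement for the brute-force `shared` helper; the outer extract-and-remove loop stays the same.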
    
