Let's say I have 100000 email bodies and 2000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet".
Something like this might work, depending on whether you care about word boundaries. In pseudo-code (where LCS is a function for computing the longest common substring, i.e. the longest contiguous run of characters shared by the mail bodies):
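someMinimumLengthParameter = 20;
ignoredPhrases = [];  // phrases you already know about and do not want reported
foundPhrases = [];
while (true) {
    lcs = LCS(mailbodies);
    // Stop once the longest remaining common phrase is too short to matter.
    if (lcs.length <= someMinimumLengthParameter) break;
    if (lcs not in ignoredPhrases) foundPhrases += lcs;
    // Remove the phrase even when it is ignored, so the next pass finds a different one.
    for body in mailbodies {
        body.remove(lcs);
    }
}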
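
The loop above could look roughly like this in Python. This is only a minimal sketch: the function names `longest_common_substring` and `find_common_phrases` are made up for illustration, the substring search is a naive brute-force one that would be far too slow for 100000 real mail bodies, and it assumes every body passed in actually contains the phrase (so you would first have to narrow the input down to the mails that share it).

```python
def longest_common_substring(texts):
    """Return the longest string that occurs contiguously in every text (naive search)."""
    if not texts:
        return ""
    shortest = min(texts, key=len)
    # Try candidate lengths from longest to shortest; the first hit is the answer.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in text for text in texts):
                return candidate
    return ""


def find_common_phrases(bodies, min_length=20, ignored=()):
    """Repeatedly extract the longest shared substring until it gets too short."""
    bodies = list(bodies)  # work on a copy, the caller's list is left alone
    found = []
    while True:
        phrase = longest_common_substring(bodies)
        # Stop when nothing is shared or the shared phrase is too short to matter.
        if not phrase or len(phrase) <= min_length:
            break
        if phrase not in ignored:
            found.append(phrase)
        # Remove the phrase even when it is ignored, otherwise the loop never advances.
        bodies = [body.replace(phrase, "") for body in bodies]
    return found
```

Note that the match is character-based, so a found phrase may include leading or trailing whitespace or cut a word in half. If you care about word boundaries you would probably want to split the bodies into words first and compare word lists instead of characters.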