Duplicate text-finding

Posted by 时间秒杀一切 on 2019-12-04 07:14:00

Not sure if this is what you are looking for.

I took the string "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest" and converted it to "[test]4 4notadupe[d+c+d+f+]4 [test]3 "

I'm sure someone will come up with a better, more efficient solution, as this one is a bit slow when processing your full file. I look forward to other answers.

        string stringValue = "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest";

        for(int i = 0; i < stringValue.Length; i++)
        {
            for (int k = 1; (k*2) + i <= stringValue.Length; k++)
            {
                int count = 1;

                string compare1 = stringValue.Substring(i,k);
                string compare2 = stringValue.Substring(i + k, k);

                // Count how many times compare1 repeats back to back.
                // A separate offset keeps the loop variable k untouched.
                int next = i + k;
                while (compare1 == compare2)
                {
                    count++;
                    next += k;
                    if (next + k > stringValue.Length)
                        break;

                    compare2 = stringValue.Substring(next, k);
                }

                if (count > 1)
                {
                    // Append a trailing space so the repeat count cannot run into
                    // a following digit, e.g. "[test]4" + "4..." reading as "[test]44".
                    string addString = "[" + compare1 + "]" + count + " ";

                    //Only add code if we are saving space
                    if (addString.Length < compare1.Length * count)
                    {
                        stringValue = stringValue.Remove(i, count * compare1.Length);
                        stringValue = stringValue.Insert(i, addString);
                        i = i + addString.Length - 1;
                    }
                    break;
                }
            }
        }
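To check that the encoded form still carries the original text, it can be expanded back. The following is only a sketch: `RepeatFormat.Expand` is a hypothetical helper, not part of the answer above, and it assumes the original text never contains something that already looks like `[x]n ` (bracketed text, a number, then a space).

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class RepeatFormat
{
    // Hypothetical helper: expands every "[text]count " token back into
    // "text" repeated count times. Assumes the bracketed text itself
    // contains no '[' or ']' characters.
    public static string Expand(string encoded)
    {
        return Regex.Replace(
            encoded,
            @"\[([^\[\]]+)\](\d+) ",
            m => string.Concat(
                Enumerable.Repeat(m.Groups[1].Value, int.Parse(m.Groups[2].Value))));
    }
}
```

Calling `RepeatFormat.Expand("[test]4 4notadupe[d+c+d+f+]4 [test]3 ")` should give back the input string used in the example above.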

You can use the Smith-Waterman algorithm to do local alignment, comparing the string against itself.

http://en.wikipedia.org/wiki/Smith-Waterman_algorithm

EDIT: To adapt the algorithm for self-alignment, you need to force the values on the diagonal to zero - that is, penalize the trivial solution of aligning the whole string exactly with itself. Then the "second best" alignment will pop out instead. This will be the longest pair of matching substrings. Repeat the same sort of thing to find progressively shorter matching substrings.
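A minimal sketch of that self-alignment idea, under some assumptions: the class and method names are made up for illustration, the scores (match +2, mismatch -1, gap -1) are arbitrary choices, and only the single best alignment is returned. Only cells above the diagonal are filled, so the diagonal stays at zero and the trivial whole-string alignment cannot win. Note that the two regions it reports may overlap when the text is periodic.

```csharp
using System;

public static class SelfAlignment
{
    // Hypothetical helper: Smith-Waterman local alignment of s against itself,
    // with the diagonal (and everything below it) held at zero.
    public static (string First, string Second, int Score) BestRepeat(
        string s, int match = 2, int mismatch = -1, int gap = -1)
    {
        int n = s.Length;
        var h = new int[n + 1, n + 1];
        int best = 0, bestI = 0, bestJ = 0;

        // Fill only cells with j > i; all other cells keep their default 0.
        for (int i = 1; i <= n; i++)
        {
            for (int j = i + 1; j <= n; j++)
            {
                int diag = h[i - 1, j - 1] + (s[i - 1] == s[j - 1] ? match : mismatch);
                int up = h[i - 1, j] + gap;
                int left = h[i, j - 1] + gap;
                int score = Math.Max(0, Math.Max(diag, Math.Max(up, left)));
                h[i, j] = score;
                if (score > best) { best = score; bestI = i; bestJ = j; }
            }
        }

        // Trace back from the best cell until the score drops to zero.
        int ci = bestI, cj = bestJ;
        while (ci > 0 && cj > 0 && h[ci, cj] > 0)
        {
            int cur = h[ci, cj];
            int step = s[ci - 1] == s[cj - 1] ? match : mismatch;
            if (cur == h[ci - 1, cj - 1] + step) { ci--; cj--; }
            else if (cur == h[ci - 1, cj] + gap) { ci--; }
            else { cj--; }
        }

        return (s.Substring(ci, bestI - ci), s.Substring(cj, bestJ - cj), best);
    }
}
```

To look for progressively shorter repeats, one option is to zero out the cells used by the reported alignment and search again; that step is left out of the sketch.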

LZW can help: it uses a dictionary of prefixes to find repeated patterns and replaces them with references to earlier entries. I think it should not be hard to adapt it to your needs.
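For reference, a small sketch of the textbook LZW compression step, to show how the dictionary picks up repeated patterns as it goes. The `Lzw` class name is an assumption for illustration, and the sketch assumes every input character has a code below 256; anything else would need a larger initial dictionary.

```csharp
using System.Collections.Generic;

public static class Lzw
{
    // Textbook LZW: emits one integer code per longest known prefix.
    // Assumes all input characters have codes 0..255.
    public static List<int> Compress(string input)
    {
        // Start with one dictionary entry per single character.
        var dict = new Dictionary<string, int>();
        for (int c = 0; c < 256; c++)
            dict[((char)c).ToString()] = c;

        var output = new List<int>();
        string w = "";
        foreach (char ch in input)
        {
            string wc = w + ch;
            if (dict.ContainsKey(wc))
            {
                // Keep extending the current match.
                w = wc;
            }
            else
            {
                // Emit the code for the longest known prefix and
                // remember the new, longer string for later.
                output.Add(dict[w]);
                dict[wc] = dict.Count;
                w = ch.ToString();
            }
        }

        if (w.Length > 0)
            output.Add(dict[w]);
        return output;
    }
}
```

On a repetitive input such as `"testtesttesttest"`, later codes (256 and above) stand for whole repeated chunks, which is the part that could be adapted to report duplicates rather than just compress them.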

Why not just use System.IO.Compression?
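If plain size reduction is all that is needed, a sketch along these lines would do, using `GZipStream` from `System.IO.Compression`. The `GzipText` wrapper class is a made-up name, and UTF-8 is an assumed encoding. Unlike the `[test]4 ` form above, the result is binary and not human-readable.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipText
{
    // Compresses text to gzip bytes (UTF-8 is an assumption here).
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            // The GZipStream must be disposed before reading the buffer,
            // otherwise the trailing gzip data is not flushed.
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return buffer.ToArray();
        }
    }

    // Reverses Compress.
    public static string Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

Very short inputs can come out larger than they went in because of the gzip header, so this mainly pays off on longer, repetitive text.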
