Duplicate text-finding | 易学教程

My main problem is trying to find a suitable solution to automatically turning this, for example:

d+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+

into this:

[d+c+d+f+]4

i.e. Finding duplicates next to each other, then making a shorter "loop" out of these duplicates. So far I have found no suitable solution to this, and I look forward to a response. P.S. To avoid confusion, the aforementioned sample is not the only thing that needs "looping", it differs from file to file. Oh, and this is intended for a C++ or C# program, either is fine, though I'm open to any other suggestions as well. Also, the main idea is that all the work would be done by the program itself, no user input except for the file itself. Here is the full file, for reference, my apologies for the stretched page: #0 @16 v225 y10 w250 t76

l16 $ED $EF $A9 p20,20 >ecegb>d<bgbgecgec<g >d+<b>d+f+a+>c+<a+f+a+f+d+<b>f+d+<bf+ >c<a>cegbgegec<a>ec<ae > d+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+ r1^1

/ l8 r1r1r1r1 f+<a+>f+g+cg+r4 a+c+a+g+cg+r4f+<a+>f+g+cg+r4 a+c+a+g+cg+r4f+<a+>f+g+cg+r4 a+c+a+g+cg+r4 f+<a+>f+g+cg+r4 a+c+a+g+r4g+16f16c+ a+2^g+f+g+4 f+ff+4fd+f4 d+c+d+4c+c<a+2^4 >c4d+ <g+2^4r4^ a+>c+d+4g+4a+4 r1^2^4^a+2^g+f+g+4 f+ff+4fd+f4 d+c+d+4c+c<a+2^4 >c4d+ <g+2^4r4^ a+>c+d+4g+4a+4 r1^2^4^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#4 @22 v250 y10

l8 o3 rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+ / r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#2 @4 v155 y10

l8 $ED $F8 $8F o4 r1r1r1 d+4f4f+4g+4 a+4r1^4^2 / d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2 > d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2 < f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2 > a+4^g+r2 f+1a+4^g+r2 f+1 f+4^fr2 d+1 f+4^fr2 d+2^d+4^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#3 @10 v210 y10

r1^1 o3 c8r8d8r8 c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8 c8 @10d16d16@21 c8 @10d16d16@21 c8 @10d16d16@21 / c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8 c4@10d8@21c8<b8> @10d16d16d16d16d16r16 c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8 c4@10d8@21c8 @10b16b16>c16c16<b16b16a16a16

#7 @16 v230 y10

l16 $ED $EF $A9 cceeggbbggeeccee <bb>d+d+f+f+a+a+f+f+d+d+<bb>d+d+ <aa>cceeggeecc<aa>cc <g+g+bb>d+d+ffd+d+<bbg+g+bb / r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#5 @4 v155 y10

l8 $ED $F8 $8F o4 r1r1r1r1 d+4r1^2^4 / <a+4^>cr2 c+4^cr2<a+4^>cr2 c+4^cr2<a+4^>cr2 c+4^cr2<a+4^>cr2 c+4^cr2 a+4^>cr2 c+4^cr2 <a+4^>cr2 c+4^c r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1 r2 f+4^fr2 d+1f+4^fr2 d+1 c+4^cr2 <a+1 >c+4^cr2 <a+2^a+4^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

Not sure if this is what you are looking for.

I took the string "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest" and converted it to "[test]4 4notadupe[d+c+d+f+]4 [test]3 "

I'm sure someone will come up with a better more efficient solution as it's a bit slow when processing your full file. I look forward to other answers.

        string stringValue = "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest";

        for(int i = 0; i < stringValue.Length; i++)
        {
            for (int k = 1; (k*2) + i <= stringValue.Length; k++)
            {
                int count = 1;

                string compare1 = stringValue.Substring(i,k);
                string compare2 = stringValue.Substring(i + k, k);

                //Count if and how many duplicates
                while (compare1 == compare2) 
                {
                    count++;
                    k += compare1.Length;
                    if (i + k + compare1.Length > stringValue.Length)
                        break;

                    compare2 = stringValue.Substring(i + k, compare1.Length);
                } 

                if (count > 1)
                {
                    //New code.  Added a space to the end to avoid [test]4 
                    //turning using an invalid number ie: [test]44.
                    string addString = "[" + compare1 + "]" + count + " ";

                    //Only add code if we are saving space
                    if (addString.Length < compare1.Length * count)
                    {
                        stringValue = stringValue.Remove(i, count * compare1.Length);
                        stringValue = stringValue.Insert(i, addString);
                        i = i + addString.Length - 1;
                    }
                    break;
                }
            }
        }

You can use the Smith-Waterman algorithm to do local alignment, comparing the string against itself.

http://en.wikipedia.org/wiki/Smith-Waterman_algorithm

EDIT: To adapt the algorithm for self alignment, you need to force values in the diagonal to zero - that is, penalize the trivial solution of aligning the whole string exactly with itself. Then the "second best" alignment will pop out instead. This will be the longest two matching substrings. Repeat the same sort of thing to find progressively shorter matching substrings.

LZW can help: it uses prefixes dictionary to search for repetitive patterns and replaces such data with references to previous entries. I think it should not be hard to adapt it for your needs.

Why not just use System.IO.Compression?

来源：https://stackoverflow.com/questions/859739/duplicate-text-finding

标签

c++

text

compression

duplicates