Finding a repeating sequence at the end of a sequence of numbers

一个人想着一个人 提交于 2019-11-29 14:27:34

You can use a rolling hash to achieve linear time complexity and O(1) space complexity (I think this is the case, since I don't believe you can have an infinite repeating sequence with two frequencies which are not multiples of each other).

Algorithm: You just keep two rolling hashes which expand like this:

                       _______  _______  _______
                      /       \/       \/       \
...2038975623895769874883301010883301010883301010
                      .        .        .      ||
                      .        .        .    [][]
                      .        .        .  [ ][ ]
                      .        .        .[  ][  ]
                      .        .       [.  ][   ]
                      .        .     [  . ][    ]
                      .        .   [    .][     ]
                      .        . [      ][      ]
                      .        [       ][       ]

Keep on doing this for the entire sequence. The first pass will only detect repetitions repeated 2*n times for some value of n. However that's not our goal: our goal in the first pass is to detect all possible periods, which this does. As we go along the sequence performing this process, we also keep track of all relatively prime periods we will need to later check:

periods = Set(int)
periodsToFurthestReach = Map(int -> int)

for hash1,hash2 in expandedPairOfRollingHashes(sequence):
    L = hash.length
    if hash1==hash2:
        if L is not a multiple of any period:
            periods.add(L)
            periodsToFurthestReach[L] = 2*L
        else L is a multiple of some periods:
            for all periods P for which L is a multiple:
                periodsToFurthestReach[P] = 2*L

After this process, we have a list of all periods and how far they've reached. Our answer is probably the one with the furthest reach, but we check all other periods for repetition (fast because we know the periods we're checking for). If this is computationally difficult, we can optimize by pruning away periods (which stop repeating) as we're going through the list, very much like the sieve of Eratosthenes, by keeping a priority queue of when we next expect a period to repeat.

At the end, we double-check the result to make sure there was no hash collision (in unlikely even there is, blacklist and repeat).

Here I assumed your goal was to minimize non-repeating-length, and not give a repeating element which can be further factored; you can modify this algorithm to find all other compressions, if they exist.

So, ninjagecko provided a good working answer to the question I posed. Thanks very much! However, I ended up finding a more efficient, mathematically based way to do the specific case that I am looking at - that is, writing out a closed form expression for the continued fraction expansion of a quadratic irrational. Obviously this solution will only work for this specific case, rather than the general case that I asked about, but I thought it might be useful to put it here in case others have a similar question.

Basically, I remembered that a quadratic irrational is reduced if and only if its continued fraction expansion is purely periodic - as in, it repeats right from the beginning, without any leading terms.

When you work out the continued fraction expansion of a number x, you basically set x_0 to be x, and then you form your sequence [a_0; a_1, a_2, a_3, ... ] by defining a_n = floor(x_n) and x_(n+1) = 1/(x_n - a_n). Normally you would just continue this until you reach a desired precision. However, for our purposes, we just run this method until x_k is a reduced quadratic irrational (which occurs if it is bigger than 1 and its conjugate is between -1 and 0). Once this happens, we know that a_k is the first term of our repeating block. Then, when we find x_(k+m+1) equal to x_k, we know that a_(k+m) is the last term in our repeating block.

Search from the right:

  • does a_n == a_n-1
  • does (a_n,a_n-1) == (a_n-2,a_n-3)
  • ...

This is clearly O(m^2). The only available bound appears to be that m<n/2, so it's O(n^2)

Is this acceptable for your application? (Are we doing your homework for you, or is there an actual real-world problem here?)

This page lists several good cycle-detection algorithms and gives an implementation of an algorithm in C.

Consider the sequence once it has repeated a number of times. It will end e.g. ...12341234123412341234. If you take the repeating part of the string up to just before the last cycle of repeats, and then slide it along by the length of that cycle, you will find that you have a long match between a substring at the end of the sequence and the same substring slid to the left a distance which is small compared with its length.

Conversely, if you have a string where a[x] = a[x + k] for a large number of x, then you also have a[x] = a[x + k] = a[x + 2k] = a[x + 3k]... so a string that matches itself when slid a short distance compared to its length must contain repeats.

If you look at http://en.wikipedia.org/wiki/Suffix_array, you will see that you can build the list of all suffixes of a string, in sorted order, in linear time, and also an array which tells you how many characters each suffix has in common with the previous suffix in sorted order. If you look for the entry with the largest value of this, this would be my candidate for a string going ..1234123412341234, and the distance between the starting points of the two suffixes would tell you the length at which the sequence repeats. (but in practice some sort of rolling hash search like http://en.wikipedia.org/wiki/Rabin-Karp might be quicker and easier, although there are quite codeable linear time Suffix Array algorithms, like "Simple Linear Work Suffix Array Construction" by Karkkainen and Sanders).

Suppose that you apply this algorithm when the number of characters available is 8, 16, 32, 64, .... 2^n, and you finally find a repeat at 2^p. How much time have you wasted in earlier stages? 2^(p-1) + 2^(p-2) + ..., which sums to about 2^p, so the repeated searches are only a constant overhead.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!