Detecting repetition with infinite input

问题

What is the most optimal way to find repetition in a infinite sequence of integers?

i.e. if in the infinite sequence the number '5' appears twice then we will return 'false' the first time and 'true' the second time.

In the end what we need is a function that returns 'true' if the integer appeared before and 'false' if the function received the integer the first time.

If there are two solutions, one is space-wise and the second is time-wise, then mention both. I will write my solution in the answers, but I don't think it is the optimal one.

edit: Please don't assume the trivial cases (i.e. no repetitions, a constantly rising sequence). What interests me is how to reduce the space complexity of the non-trivial case (random numbers with repetitions).

回答1:

I'd use the following approach:

Use a hash table as your datastructure. For every number read, store it in your datastructure. If it's already stored before you found a repetition.

If n is the number of elements in the sequence from start to the repetition, then this only requires O(n) time and space. Time complexity is optimal, as you need to at least read the input sequence's elements up to the repetition point.

How long of a sequence are we talking (before the repetition occurs)? Is a repetition even guaranteed at all? For extreme cases the space complexity might become problematic. But to improve it you will probably need to know more structural information on your sequence.

Update: If the sequence is as you say very long with seldom repetitions and you have to cut down on the space requirement, then you might (given sufficient structural information on the sequence) be able to cut down the space cost.

As an example: let's say you know that your infinite sequence has a general tendency to return numbers that fit within the current range of witnessed min-max numbers. Then you will eventually have whole intervals that have already been contained in the sequence. In that case you can save space by storing such intervals instead of all the elements contained within it.

回答2:

A BitSet for int values (2^32 numbers) would consume 512Mb. This may be ok if the BitSets are allocated not to often, fast enough and the mem is available.

An alternative are compressed BitSets that work best for sparse BitSets.

回答3:

Actually, if the max number of values is infinite, you can use any lossless compression algorithm for a monochrome bitmap. IF you imagine a square with at least as many pixels as the number of possible values, you can map each value to a pixel (with a few to spare). Then you can represent white as the pixels that appeared and black for the others and use any compression algorithm if space is at a premium (that is certainly a problem that has been studied)

You can also store blocks. The worst case is the same in space O(n) but for that worst case you need that the number appeared have exactly 1 in between them. Once more numbers appear, then the storage will decrease: I will write pseudocode and I will use a List, but you can always use a different structure

List changes // global

boolean addNumber(int number):
  boolean appeared = false
  it = changes.begin()
  while it.hasNext():
    if it.get() < number:
      appeared != appeared
      it = it.next()
    else if it.get() == number:
      if !appeared: return true
      if it.next().get() == number + 1
        it.next().remove() // Join 2 blocks 
      else 
        it.insertAfter(number + 1)  // Insert split and create 2 blocks
      it.remove()
        return false
    else: // it.get() > number
      if appeared: return true
      it.insertBefore(number)
      if it.get() == number + 1:
        it.remove() // Extend next block
      else:
        it.insertBefore(number + 1)  
  }
  return false
}

What this code is the following: it stores a list of blocks. For each number that you add, it iterates over the list storing blocks of numbers that appeared and numbers that didn't. Let me illustrate with an example; I will add [) to illustrate which numbers in the block, the first number is included, the last is not.In the pseudocode it is replaced by the boolean appeared. For instance, if you get the 5, 9, 6, 8, 7 (in this order) you will have the following sequences after each function:

[5,6)

[5,6),[9,10)

[5,7),[9,10)

[5,7),[8,10)

[5,10)

In the last value you keep a block of 5 numbers with only 2.

回答4:

Return TRUE

If the sequence is infinite then there will be repetition of every conceivable pattern.

If what you want to know is the first place in the sequence when there is a repeated digit that's another matter, but there's some difference between your question and your example.

回答5:

Well, it seems obvious that in any solution we'll need to save the numbers that already appeared, so space wise we will always have a worst-case of O(N) where N<=possible numbers with the word size of our number type (i.e. 2^32 for C# int) - this is problematic over a long time if the sequence is really infinite/rarely repeats itself.

For saving the numbers that already appeared I would use an hash table and then check it each time I receive a new number.

来源：https://stackoverflow.com/questions/2279568/detecting-repetition-with-infinite-input

标签

language-agnostic

data-structures

repeat

infinite-sequence