Soft heaps: what is corruption and why is it useful?

问题

I recently read Bernard Chazelle's paper "The Soft Heap, An Approximate Priority Queue with Optimal Error Rate by Bernard Chazelle" (http://www.link.cs.cmu.edu/15859-f07/papers/chazelle-soft-heap.pdf)

The paper talks a lot about "corruption." What is corruption, how do elements get corrupted, and how does it help you?

I have spent a lot of time reading through the paper and Googling and this still doesn't make sense.

回答1:

In most research papers on priority queues, each element in the queue has an associated number called a priority that's set when the element is inserted. The elements are then dequeued in order of increasing priority. Most programming languages these days that support priority queues don't actually use explicit priorities and instead rely on a comparison function to rank elements, but the soft heap uses the "associated numerical priority" model.

Because priority queues dequeue elements in increasing order of priority, they can be used to sort a sequence of values - start by inserting every element into the priority queue with priority equal to its rank in the sequence, then dequeuing all the elements from the priority queue. This pulls the elements out in sorted order.

This connection between priority queues and sorting comes at a cost, though. There are known lower bounds on comparison sorting algorithms (no comparison sort algorithm can have a runtime that is better than O(n log n)). Consequently, there's a lower-bound on the runtime of any comparison-based priority queue. Specifically, n enqueues and n dequeues must have a total cost no better than O(n log n). Most of the time, that's fine, but in some cases this isn't fast enough.

As long as the priority queue can be used to sort the input sequence, the runtime of n enqueues and n dequeues will never beat O(n log n). But what if the priority queue doesn't sort the input? Take it to the extreme - if the priority queue hands back elements in a totally arbitrary order, then it's possible to implement n enqueues and n dequeues in time O(n) - just use a stack or a queue, for example.

Intuitively, you can think of a soft heap as a bridge between the two extremes of "always sorted" and "no guarantees whatsoever about the order." Each sort heap is parameterized over some quantity ε called a "corruption parameter" that determines how close to sorted the values that come out of the soft heap can be. Specifically, as ε gets closer to 0, the output will be progressively more sorted, and as ε gets closer to 1, the output will get progressively more arbitrary. Appropriately, the runtime of the soft heap operations is determined as a function of O(log ε^-1), so the runtime of the operations gets cheaper and cheaper as ε goes up (and, therefore, the output gets less sorted) and the operations get more expensive as ε goes down (in which case the output gets more and more sorted).

The soft heap precisely quantifies how unsorted the output will be using the new concept of "corruption." In a normal priority queue, once you insert an element/priority pair, the element's priority never changes. In a soft heap, the elements associated with a priority can become corrupted when the element is inside the soft heap. When an element's priority is corrupted, its priority goes up by some amount. (Since the soft heap dequeues elements in increasing order of priority, the priority of an element increasing means that it will come out of the queue later than it normally should). In other words, corruption will cause elements not to come out in sorted order, since the priorities of the elements when they're dequeued isn't necessarily the same as when they're enqueued.

The choice of ε tunes how many different elements can have their priorities corrupted. With ε small, fewer elements have corrupted priorities, and with ε large, more elements will have corrupted priorities.

Now, to your specific questions - how do elements' priorities get corrupted, and how does that help you? Your first question is a good one - how does the data structure decide when to corrupt priorities? There are two ways of viewing this. First, you can think of a soft heap as a data structure where you specify in advance how much corruption is acceptable (that's the ε parameter), and the data structure then internally decides when and how to corrupt priorities so long as it doesn't exceed some total corruption level. If it seems weird to have a data structure make decisions like this, think about something like a Bloom filter or skiplist, where there really are random choices going on internally that can impact the observable behavior of the data structure. It turns out that the soft heap typically is not implemented using randomness (an impressive feature to have!), but that's not particularly relevant here.

Internally, the two known implementations of soft heaps (the one from Chazelle's original paper, and a later cleanup using binary trees) implement corruption using a technique called carpooling where elements are grouped together and all share a common priority. The corruption occurs because the original priorities of all the elements in each group is forgotten and a new priority is used instead. The actual details of how the elements are grouped is frighteningly complex and isn't really worth looking into, so it's probably best to leave it as "the data structure chooses to corrupt however it wants, as long as it doesn't corrupt more elements than you specified when you picked ε."

Next, why is this useful? In practice, it isn't. The soft heap is almost exclusively of theoretical interest. The reason it's nice in theory is that the runtime of n insertions and n deletions from a soft heap can be O(n) - faster than O(n log n) - if ε is chosen correctly. Originally, soft heaps were used as a building block in a fast algorithm for building minimum spanning trees. They're also used in a new algorithm for linear-time selection, the first such deterministic algorithm to run in linear time since the famous median-of-medians algorithm. In both of these cases, the soft heap is used to "approximately" sort the input elements in a way that lets the algorithms get a rough approximation of a sorted sequence, at which point the algorithm does some extra logic to correct for the lack of perfection. You almost certainly will never see a soft heap used in practice, but if you did end up finding a case where you do, please leave a comment and let me know!

To summarize:

Corrupting priorities is a way of making a tradeoff between perfect sorting (exact, but slow) and arbitrary ordering (inexact, but very fast). The parameter ε determines where on the spectrum the amount of corruption lies.
Corruption works by changing the priorities of existing elements in the soft heap, in particular by raising the priorities of some elements. Low corruption corresponds to approximately sorted sequences, while high corruption corresponds to more arbitrary sequences.
The way corruption is performed is data-structure specific and very hard to understand. It's best to think of soft heaps as performing corruption when they need to, but never in a way that exceeds the limit imposed by the choice of ε.
Corruption is helpful in theoretical settings where sorting is too slow, but an approximately correctly sorted sequence is good enough for practical purposes. It's unlikely to be useful in practice.

Hope this helps!

回答2:

The answer is in the second page:

"The soft heap may, at any time, increase the value of certain keys. Such keys, and by extension, the corresponding items, are called corrupted. Corruption is entirely at the discretion of the data structure and the user has no control over it. Naturally, findmin returns the minimum current key, which might or might not be corrupted. The benefit is speed: during heap updates, items travel together in packets in a form of “car pooling,” in order to save time. From an information-theoretic point of view, corruption is a way to decrease the entropy of the data stored in the data structure, and thus facilitate its treatment. The entropy is defined as the logarithm, in base two, of the number of distinct key assignments (i.e., entropy of the uniform distribution over key assignments). To see the soundness of this idea, push it to its limit, and observe that if every key was corrupted by raising its value to `, then the set of keys would have zero entropy and we could trivially perform all operations in constant time. Interestingly, soft heaps show that the entropy need not drop to zero for the complexity to become constant."

Is this a self-defeating data structure?

来源：https://stackoverflow.com/questions/26126170/soft-heaps-what-is-corruption-and-why-is-it-useful

标签

data-structures

heap

corruption

soft-heap