Question
I am looking for a data structure which keeps the top n elements, similar to this question, but with the added requirement of maintaining sort order. Obviously I could just sort at the end but there might be a more efficient way to do it on the fly. There will only be insertions, never removals, and then an iteration through the top n elements at the end.
This question is language agnostic but it will be in C# so an answer that uses native .NET collections is preferable.
EDIT: I should clarify that the sort order only matters at the very end when the top n elements are iterated over. Along the way as there are insertions the sort order is irrelevant as long as the top n elements are retained.
Answer 1:
If you really need to keep them sorted all the time, you have to use a self-balanced binary search tree. But take into account that this (keeping the elements sorted) is not an optimization, but a luxury that has a cost.
A self-balanced binary search tree is slower than an implicit heap by a constant factor.
How do you want to access the sorted elements? If you just want to iterate through the sorted sequence, a normal self-balanced binary search tree will be enough.
If you want to access any element by position in the sorted sequence at any time (another luxury...) then you need to augment the tree. Basically, every node would have an additional field counting the number of nodes in its subtree, including itself.
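For reference, a minimal C# sketch of the tree-based approach (my own illustration, not part of the answer): .NET's SortedSet&lt;T&gt; is backed by a self-balancing red-black tree, so it keeps its contents ordered on every insert. The class name, the (value, sequence) pairing used to tolerate duplicate values, and the convention that "top n" means the n smallest are all assumptions here.
using System.Collections.Generic;

// Sketch only: keeps the n smallest values in sorted order at all times,
// using the red-black tree behind SortedSet<T>.
public class TopNTree
{
    private readonly int _maxSize;
    // (value, sequence) pairs keep equal values distinct, because SortedSet<T>
    // silently drops duplicates.
    private readonly SortedSet<(double Value, long Seq)> _set = new SortedSet<(double Value, long Seq)>();
    private long _seq;

    public TopNTree(int maxSize) => _maxSize = maxSize;

    public void Add(double value)
    {
        // When full, reject anything not better than the current worst retained value.
        if (_set.Count == _maxSize && value >= _set.Max.Value) return;

        _set.Add((value, _seq++));       // O(log n) tree insert
        if (_set.Count > _maxSize)
        {
            _set.Remove(_set.Max);       // evict the largest retained value, O(log n)
        }
    }

    public IEnumerable<double> InOrder()
    {
        foreach (var (value, _) in _set)
        {
            yield return value;          // ascending in-order walk
        }
    }
}
SortedSet&lt;T&gt;.Max is itself a tree walk, so if that lookup ever shows up in profiling you could cache the current worst value separately.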
Answer 2:
You'll have to give us a clue on the order of n and the number of items to be inserted.
I would think the sort order is relevant; how else will you know which elements are part of the top n? Wanting only the top n at the end of all insertions may still bias you for or against particular structures or algorithms.
Why keep any of the items that aren't in the top n? You could use a sorted set (thinking of Python's deque) of size n+1. When adding, iterate through it and insert the new item at the correct position. Once the set reaches size n+1, each insert is followed by a delete of the last item. This way you always have the top n items without growing the data structure with items that will never be retrieved.
In addition, if you keep the value of the bottom element (the last of the n, call it b), you can use it to skip the insert altogether: when a new item x is larger than b, it is rejected after a single comparison, i.e. O(1), as sketched below.
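A rough C# sketch of that idea, assuming "top n" means the n smallest under IComparable&lt;T&gt;; the class name and the List&lt;T&gt;.BinarySearch insertion are my own details, not part of the answer:
using System;
using System.Collections.Generic;

// Sketch: bounded sorted list of the n smallest items, with an O(1) rejection
// test against the cached worst (last) element.
public class BoundedSortedList<T> where T : IComparable<T>
{
    private readonly int _maxSize;
    private readonly List<T> _items = new List<T>();

    public BoundedSortedList(int maxSize) => _maxSize = maxSize;

    public void Add(T item)
    {
        // One comparison rejects most items once the list is full.
        if (_items.Count == _maxSize && item.CompareTo(_items[_items.Count - 1]) >= 0)
            return;

        int index = _items.BinarySearch(item);   // O(log n) search for the insertion point
        if (index < 0) index = ~index;           // complement encodes the insertion point
        _items.Insert(index, item);              // O(n) shift, cheap when n is small

        if (_items.Count > _maxSize)
            _items.RemoveAt(_items.Count - 1);   // drop the displaced worst item
    }

    public IReadOnlyList<T> Items => _items;
}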
Answer 3:
This answer is similar to Kelly's, but includes a tested code example. Since N is small (< 100), I used a simple insertion sort, switching to a binary-search lookup for the insertion point once the number of items exceeds some (non-optimized) threshold (e.g. 20 items). I have included a sample console app (C#) to show its use. I tested it briefly to make sure it works, but I have not done a full analysis of it yet. This structure is optimized to reduce memory usage.
using System;
using System.Collections.Generic;

public class TopNStructure<T> : IEnumerable<T> where T : IComparable<T>
{
private const int SizeForLinearOrBinaryInsert = 20;
private int _maxSize;
private int _currentSize;
private T[] _items;
private IComparer<T> _comparer;
/// <summary>
/// The number of items
/// </summary>
public int Count { get { return _currentSize; } }
public TopNStructure(int maxSize, IComparer<T> comparer)
{
if (maxSize <= 0)
{
throw new ArgumentOutOfRangeException(nameof(maxSize), "Max size must be a positive, non-zero value");
}
_maxSize = maxSize;
_currentSize = 0;
_items = new T[maxSize];
_comparer = comparer;
}
public TopNStructure(int maxSize)
: this(maxSize, Comparer<T>.Default) { }
/// <summary>
/// Adds an item to this structure
/// </summary>
/// <param name="item">The item to add</param>
/// <returns>True if the item was added, false otherwise</returns>
public bool Add(T item)
{
if (_currentSize == 0)
{
_items[0] = item;
}
else if (_currentSize == _maxSize)
{
if (_comparer.Compare(_items[_currentSize - 1], item) <= 0)
{
return false;
}
else
{
Insert(item);
return true;
}
}
else if (_currentSize == 1)
{
if (_comparer.Compare(_items[0], item) <= 0)
{
_items[1] = item;
}
else
{
_items[1] = _items[0];
_items[0] = item;
}
}
else
{
if (_comparer.Compare(_items[_currentSize - 1], item) <= 0)
{
_items[_currentSize] = item;
}
else
{
Insert(item);
}
}
_currentSize++;
return true;
}
/// <summary>
/// Insert the item into the data structure
/// </summary>
/// <param name="item">The item to insert</param>
private void Insert(T item)
{
int start = 0;
if (_currentSize >= SizeForLinearOrBinaryInsert)
{
start = Array.BinarySearch<T>(_items, 0, _currentSize, item, _comparer);
if (start < 0)
{
start = ~start;
}
ShiftAndInsert(item, start, _currentSize);
return;
}
else
{
for (int i = start; i < _currentSize; i++)
{
if (_comparer.Compare(_items[i], item) > 0)
{
ShiftAndInsert(item, i, _currentSize);
return;
}
}
_items[_currentSize] = item;
}
}
/// <summary>
/// Shifts existing items down by one position and inserts the new item at the given index.
/// </summary>
/// <param name="item">The item to insert</param>
/// <param name="index">The index at which to insert the item</param>
/// <param name="maxIndex">The highest index that may be written (clamped to the last slot)</param>
private void ShiftAndInsert(T item, int index, int maxIndex)
{
if (maxIndex >= _maxSize)
{
maxIndex = _maxSize - 1;
}
for (int i = maxIndex; i > index; i--)
{
_items[i] = _items[i - 1];
}
_items[index] = item;
}
public IEnumerator<T> GetEnumerator()
{
// Only the filled portion of the backing array is yielded.
for (int i = 0; i < _currentSize; i++)
{
yield return _items[i];
}
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
static void Main(string[] args)
{
TopNStructure<double> data = new TopNStructure<double>(25);
Random rand = new Random(132151);
for (int i = 0; i < 50; i++)
{
double value = rand.NextDouble();
data.Add(value);
}
int j = 0;
foreach (double value in data)
{
Console.WriteLine("{0} {1}", j, value);
j++;
}
Console.ReadKey();
}
Answer 4:
It may be worth it to simply sort your top k elements at the end before you iterate over them. If k is sufficiently small (say less than half the total number of elements), then this will be cheaper than maintaining a sorted list that you never query.
Look into pre-built partial sorting methods, like the STL's std::partial_sort.
Creating a sorted list of all elements costs O(n log n) with comparison-based sorting, whether it happens on the fly or not. A partial sort will be slightly better.
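There is no direct counterpart to std::partial_sort in the .NET base library, but a hedged sketch of the sort-only-at-the-end idea might look like this; the OrderBy/Take route and all names below are my own illustration:
using System;
using System.Collections.Generic;
using System.Linq;

int k = 25;
var values = new List<double>();

// Insertion phase: collect everything, maintain no order at all.
var rand = new Random(1);
for (int i = 0; i < 1000; i++)
{
    values.Add(rand.NextDouble());
}

// Final phase: produce only the k smallest, in sorted order.
// OrderBy(...).Take(k) keeps the sketch short; whether it avoids a full sort
// internally is an implementation detail, so measure if it matters.
List<double> topK = values.OrderBy(v => v).Take(k).ToList();

foreach (double value in topK)
{
    Console.WriteLine(value);
}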
Answer 5:
Here is a data structure similar to my first one, this time using an internal linked list to improve insertion speed. You lose some speed because you cannot use a binary search to find the insertion point, but insertions and removals (of items beyond the top n) are O(1), which should offset the lack of binary searching. This structure uses more memory, since each linked-list node carries two additional pointers.
using System;
using System.Collections.Generic;

public class TopNStructureLinkedList<T> : IEnumerable<T> where T : IComparable<T>
{
private const int SizeForLinearOrBinaryInsert = 20;
private int _maxSize;
private int _currentSize;
private LinkedList<T> _items;
private IComparer<T> _comparer;
private LinkedListNode<T> _largestItemNode;
/// <summary>
/// The number of items
/// </summary>
public int Count { get { return _currentSize; } }
public TopNStructureLinkedList(int maxSize, IComparer<T> comparer)
{
_maxSize = maxSize;
_currentSize = 0;
_items = new LinkedList<T>();
_comparer = comparer;
_largestItemNode = null;
}
public TopNStructureLinkedList(int maxSize)
: this(maxSize, Comparer<T>.Default) { }
/// <summary>
/// Adds an item to this structure
/// </summary>
/// <param name="item">The item to add</param>
/// <returns>True if the item was added, false otherwise</returns>
public bool Add(T item)
{
if (_currentSize == 0)
{
_largestItemNode = _items.AddFirst(item);
}
else if (_currentSize == 1)
{
if (_comparer.Compare(_largestItemNode.Value, item) <= 0)
{
_largestItemNode = _items.AddAfter(_largestItemNode, item);
}
else
{
_items.AddBefore(_largestItemNode, item);
}
}
else if (_currentSize == _maxSize)
{
if (_comparer.Compare(_largestItemNode.Value, item) <= 0)
{
return false;
}
else
{
Insert(item);
_largestItemNode = _items.Last.Previous;
_items.RemoveLast();
return true;
}
}
else
{
if (_comparer.Compare(_largestItemNode.Value, item) <= 0)
{
_largestItemNode = _items.AddAfter(_largestItemNode, item);
}
else
{
Insert(item);
}
}
_currentSize++;
return true;
}
/// <summary>
/// Insert the item into the data structure
/// </summary>
/// <param name="item">The item to insert</param>
private void Insert(T item)
{
LinkedListNode<T> node = _largestItemNode.Previous;
while (node != null)
{
if (_comparer.Compare(node.Value, item) <= 0)
{
_items.AddAfter(node, item);
return;
}
node = node.Previous;
}
_items.AddFirst(item);
}
public IEnumerator<T> GetEnumerator()
{
return ((IEnumerable<T>)_items).GetEnumerator();
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return _items.GetEnumerator();
}
}
Answer 6:
If memory isn't a problem, it is still better to partially sort the entire array at the end. Even if you use a data structure with logarithmic insertion, the best complexity you can achieve on the fly is O(n log k): n insertions, each costing O(log k).
Using a selection algorithm to find the top k elements of an array gives you O(n + k log n): an O(n) heapify followed by k pops. Since k is less than n, that is better.
There is an implementation in the Wikipedia article on Quickselect. Also, using a plain priority queue (in a different way than most people here mentioned) is easier. At a glance:
create_heap(array) // O(n)
for(int i=0; i<k; i++)
sorted[i] = heap_pop(array) //O(log n)
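A rough C# equivalent of that pseudocode, assuming .NET 6 or later where System.Collections.Generic.PriorityQueue&lt;TElement, TPriority&gt; is available; the sample data and the use of the heapifying constructor are my own illustration:
using System;
using System.Collections.Generic;
using System.Linq;

double[] array = { 0.7, 0.1, 0.9, 0.3, 0.5, 0.2, 0.8 };
int k = 3;

// create_heap(array): this constructor heapifies all items in O(n).
var heap = new PriorityQueue<double, double>(array.Select(v => (v, v)));

// k pops, each O(log n), yielding the k smallest in ascending order.
var sorted = new double[k];
for (int i = 0; i < k; i++)
{
    sorted[i] = heap.Dequeue();
}

Console.WriteLine(string.Join(", ", sorted));   // 0.1, 0.2, 0.3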
Source: https://stackoverflow.com/questions/14969909/what-is-the-best-data-structure-for-keeping-the-top-n-elements-in-sort-order