I wrote a compiler cache for MSVC (much like ccache for gcc). One of the things I have to do is to remove the oldest object files in my cache directory to trim the cache to
Partial sorting (see the Wikipedia page) is more efficient than a full sort when you only need the first k of the elements in sorted order. The algorithms are analogous to ordinary sorting algorithms. I'll outline the heap-based partial sort (though it's not the most efficient one on that page).
You want the oldest ones. You stick the elements in a heap, one by one, and pop off the newest element in the heap when it gets too big. Since the heap is kept small, you don't pay as much to insert and remove elements.
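As a rough sketch of that idea for the plain fixed-size case (keep the k oldest), something like the following should work. The name `k_oldest` and the `(atime, fsize)` tuple layout are just assumptions for illustration.

```python
import heapq

def k_oldest(items, k):
    """Return the k oldest (atime, fsize) pairs, oldest first."""
    # heapq is a min-heap, so store -atime: the top is then the newest kept item.
    heap = []
    for atime, fsize in items:
        heapq.heappush(heap, (-atime, fsize))
        if len(heap) > k:
            # Too big: drop the newest item currently in the heap.
            heapq.heappop(heap)
    # Un-negate atime and sort the survivors, oldest first.
    return sorted((-neg_atime, fsize) for neg_atime, fsize in heap)
```

The heap never holds more than k + 1 items, which is where the savings over a full sort come from. For this fixed-k case the standard library's `heapq.nsmallest(k, items)` does the same job.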
In the standard case, you want the smallest/biggest k elements. Here you instead want the oldest elements whose combined size meets a threshold, so keep track of that running total in a total_size variable.
Code:
import heapq

def partial_bounded_sort(lst, n):
    """
    Returns minimal collection of oldest elements
    s.t. total size >= n.
    """
    # `pqueue` holds (-atime, fsize) pairs.
    # We negate atime, because heapq implements a min-heap,
    # and we want to throw out newer things.
    pqueue = []
    total_size = 0
    for atime, fsize in lst:
        # Add it to the queue.
        heapq.heappush(pqueue, (-atime, fsize))
        total_size += fsize
        # Pop off newest items which aren't needed for maintaining size.
        # The `pqueue and` guard avoids an IndexError on an empty heap
        # (which can only happen when n <= 0).
        while pqueue and total_size - pqueue[0][1] >= n:
            total_size -= heapq.heappop(pqueue)[1]
    # Un-negate atime and do a final sort.
    oldest = sorted((-priority, fsize) for priority, fsize in pqueue)
    return oldest
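For comparison, here is what the same result looks like with a plain full sort. This is only a baseline sketch for checking the heap version on small inputs; `oldest_by_full_sort` is a made-up name, and it assumes all atime values are distinct (with ties, the two versions may pick different items).

```python
def oldest_by_full_sort(lst, n):
    """Baseline: sort everything by atime, then take items until total size >= n."""
    result = []
    total_size = 0
    for atime, fsize in sorted(lst, key=lambda item: item[0]):
        if total_size >= n:
            # Threshold already reached; every remaining item is newer.
            break
        result.append((atime, fsize))
        total_size += fsize
    return result
```

This always pays for sorting the whole list, which is exactly the cost the heap version tries to avoid when only a small part of the list is needed.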
There are a few things you can do to microoptimize this code. For example, you can fill in the list with the first few items and heapify it all at once.
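One way the heapify idea might look: nothing gets popped until the running total first reaches n, so up to that point you can just append to a plain list and call `heapq.heapify` once. This is a sketch, not a benchmarked improvement, and `partial_bounded_sort_heapify` is a hypothetical name.

```python
import heapq

def partial_bounded_sort_heapify(lst, n):
    """Same result as the push-one-at-a-time version, with a bulk-loaded start."""
    it = iter(lst)
    pqueue = []
    total_size = 0
    # Phase 1: append without any heap operations until the threshold is reached.
    for atime, fsize in it:
        pqueue.append((-atime, fsize))
        total_size += fsize
        if total_size >= n:
            break
    heapq.heapify(pqueue)  # one O(len) pass instead of repeated pushes
    # Trim newest items that are not needed to stay at or above n.
    while pqueue and total_size - pqueue[0][1] >= n:
        total_size -= heapq.heappop(pqueue)[1]
    # Phase 2: continue with the remaining items exactly as before.
    for atime, fsize in it:
        heapq.heappush(pqueue, (-atime, fsize))
        total_size += fsize
        while pqueue and total_size - pqueue[0][1] >= n:
            total_size -= heapq.heappop(pqueue)[1]
    # Un-negate atime and do a final sort.
    return sorted((-priority, fsize) for priority, fsize in pqueue)
```

Both loops share the iterator `it`, so the second loop picks up where the first one stopped.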
The complexity can be better than that of a full sort: if the heap never holds more than k items, processing N items costs O(N log k) rather than O(N log N). In your particular problem, though, you don't know the number of elements you'll return, or even how many elements could be in the queue at once. In the worst case, you sort almost all of the list. You might be able to prevent this by preprocessing the list to see whether it's easier to find the set of new things or the set of old things.
If you want to keep track of which items are and aren't removed, you can keep two "pointers" into the original list: one to keep track of what you've processed, and one marking the "free" space. When processing an item, erase it from the list, and when throwing away an item from the heap, put it back into the list. The list will end up with the items that are not in the heap, plus some None entries at the end.
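A sketch of that two-pointer bookkeeping, under the assumption that `lst` is a mutable list of `(atime, fsize)` pairs; the names `split_oldest_in_place`, `read`, and `free` are made up for this example.

```python
import heapq

def split_oldest_in_place(lst, n):
    """
    Return the oldest items (total size >= n), and rewrite `lst` in place
    so it holds the remaining items followed by None entries.
    """
    pqueue = []
    total_size = 0
    free = 0  # next slot available for an item thrown out of the heap
    for read in range(len(lst)):
        atime, fsize = lst[read]
        lst[read] = None  # erase the processed item from the list
        heapq.heappush(pqueue, (-atime, fsize))
        total_size += fsize
        while pqueue and total_size - pqueue[0][1] >= n:
            neg_atime, size = heapq.heappop(pqueue)
            total_size -= size
            # free <= read always holds here, so this slot is already erased.
            lst[free] = (-neg_atime, size)
            free += 1
    return sorted((-priority, fsize) for priority, fsize in pqueue)
```

The items put back into the list come out in the order they were popped, not in their original order.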