Efficient linked list in C++?

北城余情 提交于 2020-05-09 19:07:11

问题


This document says std::list is inefficient:

std::list is an extremely inefficient class that is rarely useful. It performs a heap allocation for every element inserted into it, thus having an extremely high constant factor, particularly for small data types.

Comment: that is to my surprise. std::list is a doubly linked list, so despite its inefficiency in element construction, it supports insert/delete in O(1) time complexity, but this feature is completely ignored in this quoted paragraph.

My question: Say I need a sequential container for small-sized homogeneous elements, and this container should support element insert/delete in O(1) complexity and does not need random access (though supporting random access is nice, it is not a must here). I also don't want the high constant factor introduced by heap allocation for each element's construction, at least when the number of element is small. Lastly, iterators should be invalidated only when the corresponding element is deleted. Apparently I need a custom container class, which might (or might not) be a variant of doubly linked list. How should I design this container?

If the aforementioned specification cannot be achieved, then perhaps I should have a custom memory allocator, say, bump pointer allocator? I know std::list takes an allocator as its second template argument.

Edit: I know I shouldn't be too concerned with this issue, from an engineering standpoint - fast enough is good enough. It is just a hypothetical question so I don't have a more detailed use case. Feel free to relax some of the requirements!

Edit2: I understand two algorithms of O(1) complexity can have entirely different performance due to the difference in their constant factors.


回答1:


The simplest way I see to fulfill all your requirements:

  1. Constant-time insertion/removal (hope amortized constant-time is okay for insertion).
  2. No heap allocation/deallocation per element.
  3. No iterator invalidation on removal.

... would be something like this, just making use of std::vector:

template <class T>
struct Node
{
    // Stores the memory for an instance of 'T'.
    // Use placement new to construct the object and
    // manually invoke its dtor as necessary.
    typename std::aligned_storage<sizeof(T), alignof(T)>::type element;

    // Points to the next element or the next free
    // element if this node has been removed.
    int next;

    // Points to the previous element.
    int prev;
};

template <class T>
class NodeIterator
{
public:
    ...
private:
    std::vector<Node<T>>* nodes;
    int index;
};

template <class T>
class Nodes
{
public:
    ...
private:
    // Stores all the nodes.
    std::vector<Node> nodes;

    // Points to the first free node or -1 if the free list
    // is empty. Initially this starts out as -1.
    int free_head;
};

... and hopefully with a better name than Nodes (I'm slightly tipsy and not so good at coming up with names at the moment). I'll leave the implementation up to you but that's the general idea. When you remove an element, just do a doubly-linked list removal using the indices and push it to the free head. The iterator doesn't invalidate since it stores an index to a vector. When you insert, check if the free head is -1. If not, overwrite the node at that position and pop. Otherwise push_back to the vector.

Illustration

Diagram (nodes are stored contiguously inside std::vector, we simply use index links to allow skipping over elements in a branchless way along with constant-time removals and insertions anywhere):

Let's say we want to remove a node. This is your standard doubly-linked list removal, except we use indices instead of pointers and you also push the node to the free list (which just involves manipulating integers):

Removal adjustment of links:

Pushing removed node to free list:

Now let's say you insert to this list. In that case, you pop off the free head and overwrite the node at that position.

After insertion:

Insertion to the middle in constant-time should likewise be easy to figure out. Basically you just insert to the free head or push_back to the vector if the free stack is empty. Then you do your standard double-linked list insertion. Logic for the free list (though I made this diagram for someone else and it involves an SLL, but you should get the idea):

Make sure you properly construct and destroy the elements using placement new and manual calls to the dtor on insertion/removal. If you really want to generalize it, you'll also need to think about exception-safety and we also need a read-only const iterator.

Pros and Cons

The benefit of such a structure is that it does allow very rapid insertions/removals from anywhere in the list (even for a gigantic list), insertion order is preserved for traversal, and it never invalidates the iterators to element which aren't directly removed (though it will invalidate pointers to them; use deque if you don't want pointers to be invalidated). Personally I'd find more use for it than std::list (which I practically never use).

For large enough lists (say, larger than your entire L3 cache as a case where you should definitely expect a huge edge), this should vastly outperform std::vector for removals and insertions to/from the middle and front. Removing elements from vector can be quite fast for small ones, but try removing a million elements from a vector starting from the front and working towards the back. There things will start to crawl while this one will finish in the blink of an eye. std::vector is ever-so-slightly overhyped IMO when people start using its erase method to remove elements from the middle of a vector spanning 10k elements or more, though I suppose this is still preferable over people naively using linked lists everywhere in a way where each node is individually allocated against a general-purpose allocator while causing cache misses galore.

The downside is that it only supports sequential access, requires the overhead of two integers per element, and as you can see in the above diagram, its spatial locality degrades if you constantly remove things sporadically.

Spatial Locality Degradation

The loss of spatial locality as you start removing and inserting a lot from/to the middle will lead to zig-zagging memory access patterns, potentially evicting data from a cache line only to go back and reload it during a single sequential loop. This is generally inevitable with any data structure that allows removals from the middle in constant-time while likewise allowing that space to be reclaimed while preserving the order of insertion. However, you can restore spatial locality by offering some method or you can copy/swap the list. The copy constructor can copy the list in a way that iterates through the source list and inserts all the elements which gives you back a perfectly contiguous, cache-friendly vector with no holes (though doing this will invalidate iterators).

Alternative: Free List Allocator

An alternative that meets your requirements is implement a free list conforming to std::allocator and use it with std::list. I never liked reaching around data structures and messing around with custom allocators though and that one would double the memory use of the links on 64-bit by using pointers instead of 32-bit indices, so I'd prefer the above solution personally using std::vector as basically your analogical memory allocator and indices instead of pointers (which both reduce size and become a requirement if we use std::vector since pointers would be invalidated when vector reserves a new capacity).

Indexed Linked Lists

I call this kind of thing an "indexed linked list" as the linked list isn't really a container so much as a way of linking together things already stored in an array. And I find these indexed linked lists exponentially more useful since you don't have to get knee-deep in memory pools to avoid heap allocations/deallocations per node and can still maintain reasonable locality of reference (great LOR if you can afford to post-process things here and there to restore spatial locality).

You can also make this singly-linked if you add one more integer to the node iterator to store the previous node index (comes free of memory charge on 64-bit assuming 32-bit alignment requirements for int and 64-bit for pointers). However, you then lose the ability to add a reverse iterator and make all iterators bidirectional.

Benchmark

I whipped up a quick version of the above since you seem interested in 'em: release build, MSVC 2012, no checked iterators or anything like that:

--------------------------------------------
- test_vector_linked
--------------------------------------------
Inserting 200000 elements...
time passed for 'inserting': {0.000015 secs}

Erasing half the list...
time passed for 'erasing': {0.000021 secs}
time passed for 'iterating': {0.000002 secs}
time passed for 'copying': {0.000003 secs}

Results (up to 10 elements displayed):
[ 11 13 15 17 19 21 23 25 27 29 ]

finished test_vector_linked: {0.062000 secs}
--------------------------------------------
- test_vector
--------------------------------------------
Inserting 200000 elements...
time passed for 'inserting': {0.000012 secs}

Erasing half the vector...
time passed for 'erasing': {5.320000 secs}
time passed for 'iterating': {0.000000 secs}   
time passed for 'copying': {0.000000 secs}

Results (up to 10 elements displayed):
[ 11 13 15 17 19 21 23 25 27 29 ]

finished test_vector: {5.320000 secs}

Was too lazy to use a high-precision timer but hopefully that gives an idea of why one shouldn't use vector's linear-time erase method in critical paths for non-trivial input sizes with vector above there taking ~86 times longer (and exponentially worse the larger the input size -- I tried with 2 million elements originally but gave up after waiting almost 10 minutes) and why I think vector is ever-so-slightly-overhyped for this kind of use. That said, we can turn removal from the middle into a very fast constant-time operation without shuffling the order of the elements, without invalidating indices and iterators storing them, and while still using vector... All we have to do is simply make it store a linked node with prev/next indices to allow skipping over removed elements.

For removal I used a randomly shuffled source vector of even-numbered indices to determine what elements to remove and in what order. That somewhat mimics a real world use case where you're removing from the middle of these containers through indices/iterators you formerly obtained, like removing the elements the user formerly selected with a marquee tool after he his the delete button (and again, you really shouldn't use scalar vector::erase for this with non-trivial sizes; it'd even be better to build a set of indices to remove and use remove_if -- still better than vector::erase called for one iterator at a time).

Note that iteration does become slightly slower with the linked nodes, and that doesn't have to do with iteration logic so much as the fact that each entry in the vector is larger with the links added (more memory to sequentially process equates to more cache misses and page faults). Nevertheless, if you're doing things like removing elements from very large inputs, the performance skew is so epic for large containers between linear-time and constant-time removal that this tends to be a worthwhile exchange.




回答2:


Your requirements are exactly those of std::list, except that you've decided you don't like the overhead of node-based allocation.

The sane approach is to start at the top and only do as much as you really need:

  1. Just use std::list.

    Benchmark it: is the default allocator really too slow for your purposes?

    • No: you're done.

    • Yes: goto 2

  2. Use std::list with an existing custom allocator such as the Boost pool allocator

    Benchmark it: is the Boost pool allocator really too slow for your purposes?

    • No: you're done.

    • Yes: goto 3

  3. Use std::list with a hand-rolled custom allocator finely tuned to your unique needs, based on all the profiling you did in steps 1 and 2

    Benchmark as before etc. etc.

  4. Consider doing something more exotic as a last resort.

    If you get to this stage, you should have a really well-specified SO question, with lots of detail about exactly what you need (eg. "I need to squeeze n nodes into a cacheline" rather than "this doc said this thing is slow and that sounds bad").


PS. The above makes two assumptions, but both are worth investigation:

  1. as Baum mit Augen points out, it's not sufficient to do simple end-to-end timing, because you need to be sure where your time is going. It could be the allocator itself, or cache misses due to the memory layout, or something else. If something's slow, you still need to be sure why before you know what ought to change.
  2. your requirements are taken as a given, but finding ways to weaken requirements is often the easiest way to make something faster.

    • do you really need constant-time insertion and deletion everywhere, or only at the front, or the back, or both but not in the middle?
    • do you really need those iterator invalidation constraints, or can they be relaxed?
    • are there access patterns you can exploit? If you're frequently removing an element from the front and then replacing it with a new one, could you just update it in-place?



回答3:


As an alternative, you can use a growable array and handle the links explicitly, as indexes into the array.

Unused array elements are put in a linked list using one of the links. When an element is deleted, it is returned to the free list. When the free list is exhausted, grow the array and use the next element.

For the new free elements, you have two options:

  • append them to the free list at once,
  • append them on demand, based on the number of elements in the free list vs. the array size.



回答4:


The requirement of not invalidating iterators except the one on a node being deleted is forbidding to every container that doesn't allocate individual nodes and is much different from e.g. list or map.
However, I've found that in almost every case when I thought that this was necessary, it turned out with a little discipline I could just as well do without. You might want to verify if you can, you would benefit greatly.

While std::list is indeed the "correct" thing if you need something like a list (for CS class, mostly), the statement that it is almost always the wrong choice is, unluckily, exactly right. While the O(1) claim is entirely true, it's nevertheless abysmal in relation to how actual computer hardware works, which gives it a huge constant factor. Note that not only are the objects that you iterate randomly placed, but the nodes that you maintain are, too (yes, you can somehow work around that with an allocator, but that is not the point). On the average, you have two one guaranteed cache misses for anything you do, plus up to two one dynamic allocations for mutating operations (one for the object, and another one for the node).

Edit: As pointed out by @ratchetfreak below, implementations of std::list commonly collapse the object and node allocation into one memory block as an optimization (akin to what e.g. make_shared does), which makes the average case somewhat less catastrophic (one allocation per mutation and one guaranteed cache miss instead of two).
A new, different consideration in this case might be that doing so may not be entirely trouble-free either. Postfixing the object with two pointers means reversing the direction while dereference which may interfere with auto prefetch.
Prefixing the object with the pointers, on the other hand, means you push the object back by two pointers' size, which will mean as much as 16 bytes on a 64-bit system (that might split a mid-sized object over cache line boundaries every time). Also, there's to consider that std::list cannot afford to break e.g. SSE code solely because it adds a clandestine offset as special surprise (so for example the xor-trick would likely not be applicable for reducing the two-pointer footprint). There would likely have to be some amount of "safe" padding to make sure objects added to a list still work the way they should.
I am unable to tell whether these are actual performance problems or merely distrust and fear from my side, but I believe it's fair to say that there may be more snakes hiding in the grass than one expects.

It's not for no reason that high-profile C++ experts (Stroustrup, notably) recommend using std::vector unless you have a really good reason not to.

Like many people before, I've tried to be smart about using (or inventing) something better than std::vector for one or the other particular, specialized problem where it seems you can do better, but it turns out that simply using std::vector is still almost always the best, or second best option (if std::vector happens to be not-the-best, std::deque is usually what you need instead).
You have way fewer allocations than with any other approach, way less memory fragmentation, way fewer indirections, and a much more favorable memory access pattern. And guess what, it's readily available and just works.
The fact that every now and then inserts require a copy of all elements is (usually) a total non-issue. You think it is, but it's not. It happens rarely and it is a copy of a linear block of memory, which is exactly what processors are good at (as opposed to many double-indirections and random jumps over memory).

If the requirement not to invalidate iterators is really an absolute must, you could for example pair a std::vector of objects with a dynamic bitset or, for lack of something better, a std::vector<bool>. Then use reserve() appropriately so reallocations do not happen. When deleting an element, do not remove it but only mark it as deleted in the bitmap (call the destructor by hand). At appropriate times, when you know that it's OK to invalidate iterators, call a "vacuum cleaner" function that compacts both the bit vector and the object vector. There, all unforeseeable iterator invalidations gone.

Yes, that requires maintaining one extra "element was deleted" bit, which is annoying. But a std::list must maintain two pointers as well, in additon to the actual object, and it must do allocations. With the vector (or two vectors), access is still very efficient, as it happens in a cache-friendly way. Iterating, even when checking for deleted nodes, still means you move linearly or almost-linearly over memory.




回答5:


std::list is a doubly linked list, so despite its inefficiency in element construction, it supports insert/delete in O(1) time complexity, but this feature is completely ignored in this quoted paragraph.

It's ignored because it's a lie.

The problem of algorithmic complexity is that it generally measures one thing. For example, when we say that insertion in a std::map is O(log N), we mean that it performs O(log N) comparisons. The costs of iterating, fetching cache lines from memory, etc... are not taken into account.

This greatly simplifies analysis, of course, but unfortunately does not necessarily map cleanly to real-world implementation complexities. In particular, one egregious assumption is that memory allocation is constant-time. And that, is a bold-faced lie.

General purpose memory allocators (malloc and co), do not have any guarantee on the worst-case complexity of memory allocations. The worst-case is generally OS-dependent, and in the case of Linux it may involve the OOM killer (sift through the ongoing processes and kill one to reclaim its memory).

Special purpose memory allocators could potentially be made constant time... within a particular range of number of allocations (or maximum allocation size). Since Big-O notation is about the limit at infinity, it cannot be called O(1).

And thus, where the rubber meets the road, the implementation of std::list does NOT in general feature O(1) insertion/deletion, because the implementation relies on a real memory allocator, not an ideal one.


This is pretty depressing, however you need not lose all hopes.

Most notably, if you can figure out an upper-bound to the number of elements and can allocate that much memory up-front, then you can craft a memory allocator which will perform constant-time memory allocation, giving you the illusion of O(1).




回答6:


Use two std::lists: One "free-list" that's preallocated with a large stash of nodes at startup, and the other "active" list into which you splice nodes from the free-list. This is constant time and doesn't require allocating a node.




回答7:


The new slot_map proposal claim O(1) for insert and delete.

There is also a link to a video with a proposed implementation and some previous work.

If we knew more about the actual structure of the elements there might be some specialized associative containers that are much better.




回答8:


I would suggest doing exactly what @Yves Daoust says, except instead of using a linked list for the free list, use a vector. Push and pop the free indices on the back of the vector. This is amortized O(1) insert, lookup, and delete, and doesn't involve any pointer chasing. It also doesn't require any annoying allocator business.




回答9:


I second @Useless' answer, particularly PS item 2 about revising requirements. If you relax the iterator invalidation constraint, then using std::vector<> is Stroustrup's standard suggestion for a small-number-of-items container (for reasons already mentioned in the comments). Related questions on SO.

Starting from C++11 there is also std::forward_list.

Also, if standard heap allocation for elements added to the container is not good enough, then I would say you need to look very carefully at your exact requirements and fine tune for them.




回答10:


I just wanted to make a small comment about your choice. I'm a huge fan of vector because of it's read speeds, and you can direct access any element, and do sorting if need be. (vector of class/struct for example).

But anyways I digress, there's two nifty tips I wanted to disclose. With vector inserts can be expensive, so a neat trick, don't insert if you can get away with not doing it. do a normal push_back (put at the end) then swap the element with one you want.

Same with deletes. They are expensive. So swap it with the last element, delete it.




回答11:


Thanks for all the answers. This is a simple - though not rigorous - benchmark.

// list.cc
#include <list>
using namespace std;

int main() {
    for (size_t k = 0; k < 1e5; k++) {
        list<size_t> ln;
        for (size_t i = 0; i < 200; i++) {
            ln.insert(ln.begin(), i);
            if (i != 0 && i % 20 == 0) {
                ln.erase(++++++++++ln.begin());
            }
        }
    }
}

and

// vector.cc
#include <vector>
using namespace std;

int main() {
    for (size_t k = 0; k < 1e5; k++) {
        vector<size_t> vn;
        for (size_t i = 0; i < 200; i++) {
            vn.insert(vn.begin(), i);
            if (i != 0 && i % 20 == 0) {
                vn.erase(++++++++++vn.begin());
            }
        }
    }
}

This test aims to test what std::list claims to excel at - O(1) inserting and erasing. And, because of the positions I ask to insert/delete, this race is heavily skewed against std::vector, because it has to shift all the following elements (hence O(n)), while std::list doesn't need to do that.

Now I compile them.

clang++ list.cc -o list
clang++ vector.cc -o vector

And test the runtime. The result is:

  time ./list
  ./list  4.01s user 0.05s system 91% cpu 4.455 total
  time ./vector
  ./vector  1.93s user 0.04s system 78% cpu 2.506 total

std::vector has won.

Compiled with optimization O3, std::vector still wins.

  time ./list
  ./list  2.36s user 0.01s system 91% cpu 2.598 total
  time ./vector
  ./vector  0.58s user 0.00s system 50% cpu 1.168 total

std::list has to call heap allocation for each element, while std::vector can allocate heap memory in batch (though it might be implementation-dependent), hence std::list's insert/delete has a higher constant factor, though it is O(1).

No wonder this document says

std::vector is well loved and respected.

EDIT: std::deque does even better in some cases, at least for this task.

// deque.cc
#include <deque>
using namespace std;

int main() {
    for (size_t k = 0; k < 1e5; k++) {
        deque<size_t> dn;
        for (size_t i = 0; i < 200; i++) {
            dn.insert(dn.begin(), i);
            if (i != 0 && i % 20 == 0) {
                dn.erase(++++++++++dn.begin());
            }
        }
    }
}

Without optimization:

./deque  2.13s user 0.01s system 86% cpu 2.470 total

Optimized with O3:

./deque  0.27s user 0.00s system 50% cpu 0.551 total


来源:https://stackoverflow.com/questions/45717938/efficient-linked-list-in-c

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!