Question
I remember that since the beginning of time the most popular approach to implementing std::list<>::sort()
was the classic Merge Sort algorithm implemented in bottom-up fashion (see also What makes the gcc std::list sort implementation so fast?).
I remember seeing someone aptly refer to this strategy as the "onion chaining" approach.
At least that's the way it is in GCC's implementation of the C++ standard library (see, for example, here). And this is how it was in the old Dinkumware STL in the MSVC version of the standard library, as well as in all versions of MSVC all the way to VS2013.
However, the standard library supplied with VS2015 suddenly no longer follows this sorting strategy. The library shipped with VS2015 uses a rather straightforward recursive implementation of top-down Merge Sort. This strikes me as strange, since top-down approach requires access to the mid-point of the list in order to split it in half. Since std::list<>
does not support random access, the only way to find that mid-point is to literally iterate through half of the list. Also, at the very beginning it is necessary to know the total number of elements in the list (which was not necessarily an O(1) operation before C++11).
Nevertheless, std::list<>::sort() in VS2015 does exactly that. Here's an excerpt from that implementation that locates the mid-point and performs the recursive calls:
...
iterator _Mid = _STD next(_First, _Size / 2);
_First = _Sort(_First, _Mid, _Pred, _Size / 2);
_Mid = _Sort(_Mid, _Last, _Pred, _Size - _Size / 2);
...
As you can see, they just nonchalantly use std::next to walk through the first half of the list and arrive at the _Mid iterator.
What could be the reason behind this switch, I wonder? All I see is a seemingly obvious inefficiency of the repetitive calls to std::next at each level of recursion. Naive logic says that this is slower. If they are willing to pay this kind of price, they probably expect to get something in return. What are they getting then? I don't immediately see this algorithm as having better cache behavior (compared to the original bottom-up approach). I don't immediately see it as behaving better on pre-sorted sequences.
Granted, since C++11 std::list<> is basically required to store its element count, which makes the above slightly more efficient, since we always know the element count in advance. But that still does not seem to be enough to justify the sequential scan on each level of recursion.
(Admittedly, I haven't tried to race the implementations against each other. Maybe there are some surprises there.)
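For reference, the top-down strategy in question can be sketched like this. This is only a minimal illustration (the function name top_down_sort is mine, not the library's), and for brevity it splits into a temporary list, whereas the actual VS2015 code splices within the same list:

```cpp
#include <iterator>
#include <list>

// Minimal sketch of top-down merge sort on std::list:
// walk to the midpoint with std::next, split, recurse, merge.
template <typename T>
void top_down_sort(std::list<T>& lst)
{
    if (lst.size() < 2)
        return;
    std::list<T> right;
    // O(n/2) walk to the midpoint -- the cost discussed above
    right.splice(right.begin(), lst,
                 std::next(lst.begin(), lst.size() / 2), lst.end());
    top_down_sort(lst);   // sort first half
    top_down_sort(right); // sort second half
    lst.merge(right);     // merge sorted halves back together
}
```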
Answer 1:
1st update - VS2015 introduced non-default-constructible and stateful allocators, which presents an issue when using local lists as was done with the prior bottom up approach. I was able to handle this issue by using node pointers instead of lists (see below) for a bottom up approach.
2nd update - While the switch from lists to iterators was one way to solve the issue with allocators and exception handling, it wasn't necessary to switch from top down to bottom up, as bottom up can be implemented using iterators. I created a bottom up merge sort with iterators, and essentially the same merge/splice logic used in the VS2015 top down approach. It's at the end of this answer.
In @sbi's comment, he asked the author of the top down approach, Stephan T. Lavavej, why the change was made. Stephan's response was "to avoid memory allocation and default constructing allocators". The new top down approach is slower than the old bottom up approach, but it only uses iterators (recursively stored on the stack), doesn't use any local lists, and avoids issues related to non-default-constructible or stateful allocators. The merge operation uses splice() with iterators to "move" nodes within a list, which provides exception safety (assuming splice() can't fail). @T.C.'s answer goes into detail about this. 2nd update - however, a bottom up approach can also be based on iterators and essentially the same merge logic (example code at the bottom of this answer). Once the merge logic was determined, I'm not sure why a bottom up approach based on iterators and the splice based merge wasn't investigated.
As for performance, if there's enough memory, it would usually be faster to move the list to an array or vector, sort, then move the sorted array or vector back to the list.
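That alternative can be sketched as follows (the helper name sort_via_vector is illustrative, not part of any library; it copies for simplicity, while moves could be used for expensive-to-copy types):

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Copy the list into a contiguous vector, sort it there, assign back.
// Costs O(n) extra memory, but std::sort on contiguous storage is far
// more cache-friendly than pointer chasing through list nodes.
template <typename T>
void sort_via_vector(std::list<T>& lst)
{
    std::vector<T> v(lst.begin(), lst.end()); // list -> vector
    std::sort(v.begin(), v.end());            // cache-friendly sort
    lst.assign(v.begin(), v.end());           // vector -> list
}
```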
I am able to reproduce the issue (old sort fails to compile, new one works) based on a demo from @IgorTandetnik:
#include <iostream>
#include <list>
#include <memory>
template <typename T>
class MyAlloc : public std::allocator<T> {
public:
    MyAlloc(T) {} // suppress default constructor
    template <typename U>
    MyAlloc(const MyAlloc<U>& other) : std::allocator<T>(other) {}
    template <class U> struct rebind { typedef MyAlloc<U> other; };
};

int main()
{
    std::list<int, MyAlloc<int>> l(MyAlloc<int>(0));
    l.push_back(3);
    l.push_back(0);
    l.push_back(2);
    l.push_back(1);
    l.sort();
    return 0;
}
I noticed this change back in July, 2016 and emailed P.J. Plauger about this change on August 1, 2016. A snippet of his reply:
Interestingly enough, our change log doesn't reflect this change. That probably means it was "suggested" by one of our larger customers and got by me on the code review. All I know now is that the change came in around the autumn of 2015. When I reviewed the code, the first thing that struck me was the line:
iterator _Mid = _STD next(_First, _Size / 2);
which, of course, can take a very long time for a large list.
The code looks a bit more elegant than what I wrote in early 1995(!), but definitely has worse time complexity. That version was modeled after the approach by Stepanov, Lee, and Musser in the original STL. They are seldom found to be wrong in their choice of algorithms.
I'm now reverting to our latest known good version of the original code.
I don't know if P.J. Plauger's reversion to the original code dealt with the new allocator issue, or if or how Microsoft interacts with Dinkumware.
For a comparison of the top down versus bottom up methods, I created a linked list with 4 million elements, each consisting of one 64 bit unsigned integer. The nodes were allocated sequentially, so I would end up with a doubly linked list of nearly sequentially ordered nodes (even though they were dynamically allocated). I filled the list with random numbers and sorted it. The sort doesn't move the nodes, only the linkage, so afterwards traversing the list accesses the nodes in random order. I then filled those randomly ordered nodes with another set of random numbers and sorted them again. I compared the 2015 top down approach with the prior bottom up approach, modified to match the other changes made for 2015 (sort() now calls sort() with a predicate compare function, rather than having two separate functions). These are the results. update - I added a node pointer based version and also noted the time for simply creating a vector from the list, sorting the vector, and copying back.
sequential nodes: 2015 version 1.6 seconds, prior version 1.5 seconds
random nodes: 2015 version 4.0 seconds, prior version 2.8 seconds
random nodes: node pointer based version 2.6 seconds
random nodes: create vector from list, sort, copy back 1.25 seconds
For sequential nodes, the prior version is only a bit faster, but for random nodes, the prior version is 30% faster, the node pointer version 35% faster, and creating a vector from the list, sorting the vector, then copying back is 69% faster.
Below is the first replacement code for std::list::sort() that I used to compare the prior bottom up with small array (_Binlist[]) method versus VS2015's top down approach. I wanted the comparison to be fair, so I modified a copy of <list>.
void sort()
{   // order sequence, using operator<
    sort(less<>());
}

template<class _Pr2>
void sort(_Pr2 _Pred)
{   // order sequence, using _Pred
    if (2 > this->_Mysize())
        return;
    const size_t _MAXBINS = 25;
    _Myt _Templist, _Binlist[_MAXBINS];
    while (!empty())
    {
        // _Templist = next element
        _Templist._Splice_same(_Templist.begin(), *this, begin(),
            ++begin(), 1);
        // merge with array of ever larger bins
        size_t _Bin;
        for (_Bin = 0; _Bin < _MAXBINS && !_Binlist[_Bin].empty();
            ++_Bin)
            _Templist.merge(_Binlist[_Bin], _Pred);
        // don't go past end of array
        if (_Bin == _MAXBINS)
            _Bin--;
        // update bin with merged list, empty _Templist
        _Binlist[_Bin].swap(_Templist);
    }
    // merge bins back into caller's list
    for (size_t _Bin = 0; _Bin < _MAXBINS; _Bin++)
        if (!_Binlist[_Bin].empty())
            this->merge(_Binlist[_Bin], _Pred);
}
I made some minor changes. The original code kept track of the actual maximum bin in a variable named _Maxbin, but the overhead in the final merge is small enough that I removed the code associated with _Maxbin. During the array build, the original code's inner loop merged into a _Binlist[] element, followed by a swap into _Templist, which seemed pointless. I changed the inner loop to just merge into _Templist, only swapping once an empty _Binlist[] element is found.
Below is a node pointer based replacement for std::list::sort() I used for yet another comparison. This eliminates allocation related issues. If a compare exception is possible and occurred, all the nodes in the array and temp list (pNode) would have to be appended back to the original list, or possibly a compare exception could be treated as a less than compare.
void sort()
{   // order sequence, using operator<
    sort(less<>());
}

template<class _Pr2>
void sort(_Pr2 _Pred)
{   // order sequence, using _Pred
    const size_t _NUMBINS = 25;
    _Nodeptr aList[_NUMBINS];               // array of lists
    _Nodeptr pNode;
    _Nodeptr pNext;
    _Nodeptr pPrev;
    if (this->size() < 2)                   // return if nothing to do
        return;
    this->_Myhead()->_Prev->_Next = 0;      // set last node ->_Next = 0
    pNode = this->_Myhead()->_Next;         // set ptr to start of list
    size_t i;
    for (i = 0; i < _NUMBINS; i++)          // zero array
        aList[i] = 0;
    while (pNode != 0)                      // merge nodes into array
    {
        pNext = pNode->_Next;
        pNode->_Next = 0;
        for (i = 0; (i < _NUMBINS) && (aList[i] != 0); i++)
        {
            pNode = _MergeN(_Pred, aList[i], pNode);
            aList[i] = 0;
        }
        if (i == _NUMBINS)
            i--;
        aList[i] = pNode;
        pNode = pNext;
    }
    pNode = 0;                              // merge array into one list
    for (i = 0; i < _NUMBINS; i++)
        pNode = _MergeN(_Pred, aList[i], pNode);
    this->_Myhead()->_Next = pNode;         // update sentinel node links
    pPrev = this->_Myhead();                // and _Prev pointers
    while (pNode)
    {
        pNode->_Prev = pPrev;
        pPrev = pNode;
        pNode = pNode->_Next;
    }
    pPrev->_Next = this->_Myhead();
    this->_Myhead()->_Prev = pPrev;
}
template<class _Pr2>
_Nodeptr _MergeN(_Pr2 &_Pred, _Nodeptr pSrc1, _Nodeptr pSrc2)
{
    _Nodeptr pDst = 0;          // destination head ptr
    _Nodeptr *ppDst = &pDst;    // ptr to head or prev->_Next
    if (pSrc1 == 0)
        return pSrc2;
    if (pSrc2 == 0)
        return pSrc1;
    while (1)
    {
        if (_DEBUG_LT_PRED(_Pred, pSrc2->_Myval, pSrc1->_Myval))
        {
            *ppDst = pSrc2;
            pSrc2 = *(ppDst = &pSrc2->_Next);
            if (pSrc2 == 0)
            {
                *ppDst = pSrc1;
                break;
            }
        }
        else
        {
            *ppDst = pSrc1;
            pSrc1 = *(ppDst = &pSrc1->_Next);
            if (pSrc1 == 0)
            {
                *ppDst = pSrc2;
                break;
            }
        }
    }
    return pDst;
}
As an alternative to the new VS2015 std::list::sort(), you could use this standalone version.
template <typename T>
void listsort(std::list<T> &dll)
{
    const size_t NUMLISTS = 32;
    std::list<T> al[NUMLISTS];  // array of lists
    std::list<T> tl;            // temp list
    while (!dll.empty()) {
        // tl = next element from dll
        tl.splice(tl.begin(), dll, dll.begin(), std::next(dll.begin()));
        // merge element into array
        size_t i;
        for (i = 0; i < NUMLISTS && !al[i].empty(); i++) {
            tl.merge(al[i], std::less<T>());
        }
        if (i == NUMLISTS)      // don't go past end of array
            i -= 1;
        al[i].swap(tl);         // update array list, empty tl
    }
    // merge array back into original list
    for (size_t i = 0; i < NUMLISTS; i++)
        dll.merge(al[i], std::less<T>());
}
or use the similar gcc algorithm.
Update #2: I've since written a bottom up merge sort using a small array of iterators and essentially the same iterator based merge via splice function from the VS2015 std::list::sort, which should eliminate the allocator and exception issues addressed by VS2015's std::list::sort. Example code below. The call to splice() in Merge() is a bit tricky: the last iterator is post incremented before the actual call to splice, due to the way iterator post increment is implemented in std::list, compensating for the splice. The natural order of array operation avoids any corruption of iterators from the merge/splice operations. Each iterator in the array points to the start of a sorted sub-list. The end of each sorted sub-list will be the start of a sorted sub-list in the next prior non-empty entry in the array, or, if at the start of the array, in a variable.
// iterator array size
#define ASZ 32
template <typename T>
void SortList(std::list<T> &ll)
{
    if (ll.size() < 2)                          // return if nothing to do
        return;
    typename std::list<T>::iterator ai[ASZ];    // array of iterators
    typename std::list<T>::iterator li;         // left iterator
    typename std::list<T>::iterator ri;         // right iterator
    typename std::list<T>::iterator ei;         // end iterator
    size_t i;
    for (i = 0; i < ASZ; i++)                   // "clear" array
        ai[i] = ll.end();
    // merge nodes into array
    for (ei = ll.begin(); ei != ll.end();) {
        ri = ei++;
        for (i = 0; (i < ASZ) && ai[i] != ll.end(); i++) {
            ri = Merge(ll, ai[i], ri, ei);
            ai[i] = ll.end();
        }
        if (i == ASZ)
            i--;
        ai[i] = ri;
    }
    // merge array into single list
    ei = ll.end();
    for (i = 0; (i < ASZ) && ai[i] == ei; i++);
    ri = ai[i++];
    while (1) {
        for ( ; (i < ASZ) && ai[i] == ei; i++);
        if (i == ASZ)
            break;
        li = ai[i++];
        ri = Merge(ll, li, ri, ei);
    }
}
template <typename T>
typename std::list<T>::iterator Merge(std::list<T> &ll,
                        typename std::list<T>::iterator li,
                        typename std::list<T>::iterator ri,
                        typename std::list<T>::iterator ei)
{
    typename std::list<T>::iterator ni;
    ni = (*ri < *li) ? ri : li;
    while (1) {
        if (*ri < *li) {
            ll.splice(li, ll, ri++);
            if (ri == ei)
                return ni;
        } else {
            if (++li == ri)
                return ni;
        }
    }
}
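The post-increment-before-splice point above can be demonstrated in isolation. This standalone snippet (the function name splice_demo is mine) shows that ri++ is evaluated before splice() runs, so ri already refers to the next node when the current node is moved:

```cpp
#include <cassert>
#include <iterator>
#include <list>

// The post-increment yields the old iterator value for splice()
// while advancing ri past the node about to be moved.
int splice_demo()
{
    std::list<int> l{2, 1, 3};
    auto li = l.begin();        // -> 2
    auto ri = std::next(li);    // -> 1
    l.splice(li, l, ri++);      // move 1 in front of 2; ri now -> 3
    assert(*ri == 3);           // ri survived the splice
    return l.front();           // list is now {1, 2, 3}
}
```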
Replacement code for VS2015's std::list::sort() (adds an internal function _Merge):
template<class _Pr2>
iterator _Merge(_Pr2& _Pred, iterator li, iterator ri, iterator ei)
{
    iterator ni;
    ni = _DEBUG_LT_PRED(_Pred, *ri, *li) ? ri : li;
    while (1)
    {
        if (_DEBUG_LT_PRED(_Pred, *ri, *li))
        {
            splice(li, *this, ri++);
            if (ri == ei)
                return ni;
        }
        else
        {
            if (++li == ri)
                return ni;
        }
    }
}

void sort()
{   // order sequence, using operator<
    sort(less<>());
}

template<class _Pr2>
void sort(_Pr2 _Pred)
{
    if (size() < 2)             // if size < 2 nothing to do
        return;
    const size_t _ASZ = 32;     // array size
    iterator ai[_ASZ];          // array of iterators
    iterator li;                // left iterator
    iterator ri;                // right iterator
    iterator ei = end();        // end iterator
    size_t i;
    for (i = 0; i < _ASZ; i++)  // "clear array"
        ai[i] = ei;
    // merge nodes into array
    for (ei = begin(); ei != end();)
    {
        ri = ei++;
        for (i = 0; (i < _ASZ) && ai[i] != end(); i++)
        {
            ri = _Merge(_Pred, ai[i], ri, ei);
            ai[i] = end();
        }
        if (i == _ASZ)
            i--;
        ai[i] = ri;
    }
    // merge array into single list
    ei = end();
    for (i = 0; (i < _ASZ) && ai[i] == ei; i++);
    ri = ai[i++];
    while (1)
    {
        for ( ; (i < _ASZ) && ai[i] == ei; i++);
        if (i == _ASZ)
            break;
        li = ai[i++];
        ri = _Merge(_Pred, li, ri, ei);
    }
}
Replacement code for VS2019's std::list::sort() (adds an internal function _Merge, and uses VS template naming convention):
private:
template <class _Pr2>
iterator _Merge(_Pr2 _Pred, iterator _First, iterator _Mid, iterator _Last)
{
    iterator _Newfirst = _First;
    for (bool _Initial_loop = true;;
         _Initial_loop = false) { // [_First, _Mid) and [_Mid, _Last) are sorted and non-empty
        if (_DEBUG_LT_PRED(_Pred, *_Mid, *_First)) { // consume _Mid
            if (_Initial_loop) {
                _Newfirst = _Mid; // update return value
            }
            splice(_First, *this, _Mid++);
            if (_Mid == _Last) {
                return _Newfirst; // exhausted [_Mid, _Last); done
            }
        } else { // consume _First
            ++_First;
            if (_First == _Mid) {
                return _Newfirst; // exhausted [_First, _Mid); done
            }
        }
    }
}

template <class _Pr2>
void _Sort(iterator _First, iterator _Last, _Pr2 _Pred,
           size_type _Size) { // order [_First, _Last), using _Pred
    // _Size must be distance from _First to _Last
    if (_Size < 2) {
        return; // nothing to do
    }
    const size_t _ASZ = 32;         // array size
    iterator _Ai[_ASZ];             // array of iterators to runs
    iterator _Mi;                   // middle iterator
    iterator _Li;                   // last iterator
    size_t _I;                      // index to _Ai
    for (_I = 0; _I < _ASZ; _I++)   // "empty" array
        _Ai[_I] = _Last;            // _Ai[] == _Last => empty entry
    // merge nodes into array
    for (_Li = _First; _Li != _Last;) {
        _Mi = _Li++;
        for (_I = 0; (_I < _ASZ) && _Ai[_I] != _Last; _I++) {
            _Mi = _Merge(_Pass_fn(_Pred), _Ai[_I], _Mi, _Li);
            _Ai[_I] = _Last;
        }
        if (_I == _ASZ)
            _I--;
        _Ai[_I] = _Mi;
    }
    // merge array runs into single run
    for (_I = 0; _I < _ASZ && _Ai[_I] == _Last; _I++);
    _Mi = _Ai[_I++];
    while (1) {
        for (; _I < _ASZ && _Ai[_I] == _Last; _I++);
        if (_I == _ASZ)
            break;
        _Mi = _Merge(_Pass_fn(_Pred), _Ai[_I++], _Mi, _Last);
    }
}
Answer 2:
@sbi asked Stephan T. Lavavej, MSVC's standard library maintainer, who responded:
I did that to avoid memory allocation and default constructing allocators.
To this I'll add "free basic exception safety".
To elaborate: the pre-VS2015 implementation suffers from several defects:
_Myt _Templist, _Binlist[_MAXBINS];
creates a bunch of intermediate lists (_Myt is simply a typedef for the current instantiation of list; a less confusing spelling for that is, well, list) to hold the nodes during sorting, but these lists are default constructed, which leads to a multitude of problems:
1. If the allocator used is not default constructible (and there is no requirement that allocators be default constructible), this simply won't compile, because the default constructor of list will attempt to default construct its allocator.
2. If the allocator used is stateful, then a default-constructed allocator may not compare equal to this->get_allocator(), which means that the later splices and merges are technically undefined behavior and may well break in debug builds. ("Technically", because the nodes are all merged back in the end, so you don't actually deallocate with the wrong allocator if the function successfully completes.)
3. Dinkumware's list uses a dynamically allocated sentinel node, which means that the above will perform _MAXBINS + 1 dynamic allocations. I doubt that many people expect sort to potentially throw bad_alloc. If the allocator is stateful, then these sentinel nodes may not even be allocated from the right place (see #2).
4. The code is not exception safe. In particular, the comparison is allowed to throw, and if it throws while there are elements in the intermediate lists, those elements are simply destroyed with the lists during stack unwinding. Users of sort don't expect the list to be sorted if sort throws an exception, of course, but they probably also don't expect the elements to go missing. This interacts very poorly with #2 above, because now it's not just technical undefined behavior: the destructor of those intermediate lists will be deallocating and destroying the nodes spliced into them with the wrong allocator.
Are those defects fixable? Probably. #1 and #2 can be fixed by passing get_allocator() to the constructor of the lists:
_Myt _Templist(get_allocator());
_Myt _Binlist[_MAXBINS] = { _Myt(get_allocator()), _Myt(get_allocator()),
_Myt(get_allocator()), /* ... repeat _MAXBINS times */ };
The exception safety problem can be fixed by surrounding the loop with a try-catch that splices all the nodes in the intermediate lists back into *this without regard to order if an exception is thrown.
Fixing #3 is harder, because that means not using list at all as the holder of nodes, which probably requires a decent amount of refactoring, but it's doable.
The question is: is it worth jumping through all these hoops to improve the performance of a container that has reduced performance by design? After all, someone who really cares about performance probably won't be using list in the first place.
Source: https://stackoverflow.com/questions/40622430/stdlistsort-why-the-sudden-switch-to-top-down-strategy