boost multi_index_container and slow operator++


Question


This is a follow-up to this MIC question. When adding items to the vector of reference wrappers, I spend about 80% of the time inside operator++, whatever iteration approach I choose.
The query works as follows:

VersionView getVersionData(int subdeliveryGroupId, int retargetingId,
                           const std::wstring &flightName) const {
  VersionView versions;
  // Probe a 3x3 neighborhood of (subdeliveryGroupId, retargetingId) and
  // collect every matching range from the composite-key index.
  for (auto i = 0; i < 3; ++i) {
    for (auto j = 0; j < 3; ++j) {
      versions.insert(m_data.get<mvKey>().equal_range(
          boost::make_tuple(subdeliveryGroupId + i, retargetingId + j,
                            flightName)));
    }
  }
  return versions;
}
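
For reference (the question doesn't show these types), here is a minimal sketch of the container shape this query assumes; VersionData's fields and the mvKey composite key are guesses derived from the arguments passed to equal_range:

// Hypothetical reconstruction of m_data's type: mvKey is assumed to tag an
// ordered_non_unique composite-key index over the three query fields.
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <string>

struct VersionData {
  int subdeliveryGroupId;
  int retargetingId;
  std::wstring flightName;
  // further payload fields omitted
};

namespace bmi = boost::multi_index;
struct mvKey {}; // index tag used by the question's get<mvKey>()

using Storage = bmi::multi_index_container<
    VersionData,
    bmi::indexed_by<bmi::ordered_non_unique<
        bmi::tag<mvKey>,
        bmi::composite_key<
            VersionData,
            bmi::member<VersionData, int, &VersionData::subdeliveryGroupId>,
            bmi::member<VersionData, int, &VersionData::retargetingId>,
            bmi::member<VersionData, std::wstring,
                        &VersionData::flightName>>>>>;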

I've tried the following ways to fill the vector of reference wrappers:

template <typename InputRange> void insert(const InputRange &rng) {
  // 1) base::insert(end(), rng.first, rng.second);                  // 12ms
  // 2) std::copy(rng.first, rng.second, std::back_inserter(*this)); //  6ms
  // 3) resize up front, then assign in place:                       // 12ms
  //    size_t start = size();
  //    auto tmp = std::reference_wrapper<const VersionData>(VersionData(0, 0, L""));
  //    resize(start + boost::size(rng), tmp);
  //    for (auto beg = rng.first; beg != rng.second; ++beg, ++start)
  //      this->operator[](start) = std::reference_wrapper<const VersionData>(*beg);
  std::copy(rng.first, rng.second, std::back_inserter(*this));
}
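
Likewise, a hypothetical reconstruction of VersionView, inferred from the insert method above; the derivation from a std::vector of reference wrappers is an assumption, reusing VersionData from the sketch earlier:

// Hypothetical VersionView (not shown in the question): a vector of
// reference wrappers, so query results reference elements living in the
// multi_index_container instead of copying them.
#include <algorithm>
#include <functional>
#include <iterator>
#include <vector>

struct VersionView
    : std::vector<std::reference_wrapper<const VersionData>> {
  using base = std::vector<std::reference_wrapper<const VersionData>>;

  template <typename InputRange> void insert(const InputRange &rng) {
    // variant 2) above: ~6ms in the question's measurements
    std::copy(rng.first, rng.second, std::back_inserter(*this));
  }
};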

Whatever I do, I pay for operator++ or for the size method, which just increments the iterator, meaning I'm still stuck in ++. So the question is whether there is a way to iterate the result ranges faster. If there is no such way, is it worth going down into the implementation of equal_range and adding a new argument that holds a reference to the container of reference_wrappers, to be filled with results instead of creating a range?

EDIT 1: sample code http://coliru.stacked-crooked.com/a/8b82857d302e4a06/
Due to this bug, it will not compile on Coliru.
EDIT 2: Call tree, with time spent in operator++

EDIT 3: Some concrete stuff. First of all, I didn't start this thread just because operator++ takes too much of the overall execution time and I don't like that just "because"; at this very moment it is the major bottleneck in our performance tests. Each request is usually processed in hundreds of microseconds; requests similar to this one (they are somewhat more complex) are processed in ~1000-1500 microseconds, which is still acceptable. The original problem was that once the number of items in the data structure grows to hundreds of thousands, performance deteriorates to something like 20 milliseconds. Now, after switching to MIC (which drastically improved code readability, maintainability, and overall elegance), I can reach something like 13 milliseconds per request, of which 80%-90% is spent in operator++. So the question is whether this can be improved somehow, or should I look for some tar and feathers for myself? :)

Answer 1:


The fact that 80% of getVersionData's execution time is spent in operator++ is not indicative of any performance problem per se -- at most, it tells you that equal_range and std::reference_wrapper insertion are comparatively faster. Put another way, when you profile a piece of code you will typically find locations where the most time is spent, but whether this is a problem or not depends on the required overall performance.




Answer 2:


@kreuzerkrieg, your sample code does not exercise any kind of insertion into a vector of std::reference_wrappers! Instead, you're projecting the result of equal_range into a boost::any_range, which is expected to be fairly slow at iteration -- basically, increment ops resolve to virtual calls.

So, unless I'm seriously missing something here, the sample code performance or lack thereof does not have anything to do with whatever your problem is in real code (assuming VersionView, of which you don't show the code, is not using boost::any_range).
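
To make the point concrete, here is a minimal sketch of returning the index's native range instead of an any_range, reusing the hypothetical Storage/mvKey types sketched in the question; with a concrete range, operator++ is a plain tree-iterator increment the compiler can inline rather than a virtual call:

// Sketch: return the native iterator pair, not a boost::any_range.
// Storage, mvKey, and lookup are assumed/illustrative names from above.
#include <boost/range/iterator_range.hpp>
#include <boost/tuple/tuple.hpp>

using KeyIndex = Storage::index<mvKey>::type;
using VersionRange = boost::iterator_range<KeyIndex::const_iterator>;

VersionRange lookup(const Storage &data, int sdg, int ret,
                    const std::wstring &flight) {
  auto p = data.get<mvKey>().equal_range(
      boost::make_tuple(sdg, ret, flight));
  return boost::make_iterator_range(p.first, p.second);
}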

That said, if you can afford to replace your ordered indices with equivalent hashed indices, iteration will probably be faster, but this is an utter shot in the dark given that you're not showing the real stuff.
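
Since getVersionData always supplies all three key components, the swap would be mechanical; a sketch reusing the types from the sketch in the question:

// Sketch: same composite key, hashed instead of ordered. Note that hashed
// composite-key lookups require all key components; prefix searches are out.
#include <boost/multi_index/hashed_index.hpp>

using HashedStorage = bmi::multi_index_container<
    VersionData,
    bmi::indexed_by<bmi::hashed_non_unique<
        bmi::tag<mvKey>,
        bmi::composite_key<
            VersionData,
            bmi::member<VersionData, int, &VersionData::subdeliveryGroupId>,
            bmi::member<VersionData, int, &VersionData::retargetingId>,
            bmi::member<VersionData, std::wstring,
                        &VersionData::flightName>>>>>;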




Answer 3:


I think that you're measuring the wrong things entirely. When I scale up from 3x3x11111 to 10x10x111111 (so 111x as many items in the index), it still runs in 290ms.

And populating the container takes orders of magnitude more time than querying it. Even deallocating the container appears to take more time.

What Doesn't Matter?

I've contributed a version with some trade-off switches, which mainly shows that there's no sense in tweaking things: View On Coliru

  • there's a switch to avoid the any_range (it doesn't make sense using it if you care about performance)
  • there's a switch to tweak the flyweight (see the sketch after this list):

    #define USE_FLYWEIGHT 0 // 0: none 1: full 2: no tracking 3: no tracking no locking
    

    again, it merely shows you could easily do without, and should consider doing so unless you need the memory optimization for the string (?). If so, consider using the OPTIMIZE_ATOMS approach:

  • the OPTIMIZE_ATOMS approach basically applies a flyweight to wstring there. Since all the strings are repeated here, it will be mighty storage-efficient (although the implementation is quick and dirty and should be improved). The idea is much better applied here: How to improve performance of boost interval_map lookups
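
For reference, a minimal sketch of what the flyweight switch boils down to; the Atom alias is hypothetical, while no_tracking and no_locking are real Boost.Flyweight policies that trade instance tracking and thread safety for speed:

#include <boost/flyweight.hpp>
#include <boost/flyweight/no_tracking.hpp>
#include <boost/flyweight/no_locking.hpp>
#include <string>

#define USE_FLYWEIGHT 3 // 0: none 1: full 2: no tracking 3: no tracking, no locking

#if USE_FLYWEIGHT == 0
using Atom = std::wstring;                   // plain string, no sharing
#elif USE_FLYWEIGHT == 1
using Atom = boost::flyweight<std::wstring>; // full: tracking + locking
#elif USE_FLYWEIGHT == 2
using Atom = boost::flyweight<std::wstring, boost::flyweights::no_tracking>;
#else
using Atom = boost::flyweight<std::wstring, boost::flyweights::no_tracking,
                              boost::flyweights::no_locking>;
#endif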

Here are some rudimentary timings:

As you can see, basically nothing actually matters for query/iteration performance.

Any Iterators: Do They Matter?

It might be the culprit on your compiler. On my compiler (gcc 4.8.2) it wasn't anything big, but see the disassembly of the accumulate loop without the any iterator:

As you can see from the sections I've highlighted, there doesn't seem to be much fat from the algorithm, the lambda, or the iterator traversal. Now with the any_iterator the situation is much less clear, and if your compiler optimizes less well, I can imagine it failing to inline elementary operations, making iteration slow. (Just guessing a little now.)




Answer 4:


OK, so the solution I applied is as follows: in addition to the ordered_non_unique index (the 'byKey' one) I've added a random_access index. When the data is loaded, I rearrange the random index with m_data.get<byKey>().begin(). Then, when the MIC is queried for data, I just do boost::equal_range on the random index with a custom predicate that emulates the same ordering logic as the 'byKey' index. That's it: it gave me a fast size() (O(1), as I understand it) and fast traversal. Now I'm ready for your rotten tomatoes :)

EDIT 1: of course, I've changed the any_range from the bidirectional traversal tag to the random-access one.
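
For illustration, a minimal sketch of the described approach, reusing the hypothetical VersionData and mvKey from the sketch in the question; the randomIdx tag and the KeyLess predicate are illustrative names, and the predicate must reproduce exactly the ordering of the keyed index:

// Sketch of the accepted approach: keyed index for ordering, random_access
// index for O(1) size() and random-access traversal of query results.
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/random_access_index.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/range/algorithm/equal_range.hpp>
#include <string>
#include <tuple>

namespace bmi = boost::multi_index;
struct randomIdx {}; // tag for the extra random_access index

using FastStorage = bmi::multi_index_container<
    VersionData,
    bmi::indexed_by<
        bmi::ordered_non_unique<
            bmi::tag<mvKey>,
            bmi::composite_key<
                VersionData,
                bmi::member<VersionData, int, &VersionData::subdeliveryGroupId>,
                bmi::member<VersionData, int, &VersionData::retargetingId>,
                bmi::member<VersionData, std::wstring,
                            &VersionData::flightName>>>,
        bmi::random_access<bmi::tag<randomIdx>>>>;

// Comparator emulating the composite-key ordering for binary search.
struct KeyLess {
  using Key = std::tuple<int, int, std::wstring>;
  static Key key(const VersionData &v) {
    return Key(v.subdeliveryGroupId, v.retargetingId, v.flightName);
  }
  bool operator()(const VersionData &a, const Key &b) const {
    return key(a) < b;
  }
  bool operator()(const Key &a, const VersionData &b) const {
    return a < key(b);
  }
};

void afterLoad(FastStorage &data) {
  // Rearrange the random index into the keyed index's order once, after
  // loading; queries then binary-search the random index directly.
  data.get<randomIdx>().rearrange(data.get<mvKey>().begin());
}

auto query(const FastStorage &data, int sdg, int ret,
           const std::wstring &flight) {
  // Pair of random-access iterators: O(log n) lookup, O(1) size().
  return boost::equal_range(data.get<randomIdx>(),
                            KeyLess::Key(sdg, ret, flight), KeyLess());
}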



Source: https://stackoverflow.com/questions/27588018/boost-multi-index-container-and-slow-operator
