Why is `std::copy` 5x (!) slower than `memcpy` in my test program?

故事扮演 提交于 2019-12-03 07:10:08

That is not the results I get:

> g++ -O3 XX.cpp 
> ./a.out
cast:      5 ms
memcpy:    4 ms
std::copy: 3 ms
(counter:  1264720400)

Hardware: 2GHz Intel Core i7
Memory:   8G 1333 MHz DDR3
OS:       Max OS X 10.7.5
Compiler: i686-apple-darwin11-llvm-g++-4.2 (GCC) 4.2.1

On a Linux box I get different results:

> g++ -std=c++0x -O3 XX.cpp 
> ./a.out 
cast:      3 ms
memcpy:    4 ms
std::copy: 21 ms
(counter:  731359744)


Hardware:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Memory:    61363780 kB
OS:        Linux ip-10-58-154-83 3.2.0-29-virtual #46-Ubuntu SMP
Compiler:  g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Void

I agree with @rici's comment about developing a more meaningful benchmark so I rewrote your test to benchmark copying of two vectors using memcpy(), memmove(), std::copy() and the std::vector assignment operator:

#include <algorithm>
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <cstring>
#include <cassert>

typedef std::vector<int> vector_type;

void test_memcpy(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer       const dest = destv.data();
    vector_type::const_pointer const src  = srcv.data();

    std::memcpy(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_memmove(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer       const dest = destv.data();
    vector_type::const_pointer const src  = srcv.data();

    std::memmove(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_std_copy(vector_type & dest, vector_type const & src)
{
    std::copy(src.begin(), src.end(), dest.begin());
}

void test_assignment(vector_type & dest, vector_type const & src)
{
    dest = src;
}

auto
benchmark(std::function<void(vector_type &, vector_type const &)> copy_func)
    ->decltype(std::chrono::milliseconds().count())
{
    std::random_device rd;
    std::mt19937 generator(rd());
    std::uniform_int_distribution<vector_type::value_type> distribution;

    static vector_type::size_type const num_elems = 2000;

    vector_type dest(num_elems);
    vector_type src(num_elems);

    // Fill the source and destination vectors with random data.
    for (vector_type::size_type i = 0; i < num_elems; ++i) {
        src.push_back(distribution(generator));
        dest.push_back(distribution(generator));
    }

    static int const iterations = 50000;

    std::chrono::time_point<std::chrono::system_clock> start, end;

    start = std::chrono::system_clock::now();

    for (int i = 0; i != iterations; ++i)
        copy_func(dest, src);

    end = std::chrono::system_clock::now();

    assert(src == dest);

    return
        std::chrono::duration_cast<std::chrono::milliseconds>(
            end - start).count();
}

int main()
{
    std::cout
        << "memcpy:     " << benchmark(test_memcpy)     << " ms" << std::endl
        << "memmove:    " << benchmark(test_memmove)    << " ms" << std::endl
        << "std::copy:  " << benchmark(test_std_copy)   << " ms" << std::endl
        << "assignment: " << benchmark(test_assignment) << " ms" << std::endl
        << std::endl;
}

I went a little overboard with C++11 just for fun.

Here are the results I get on my 64 bit Ubuntu box with g++ 4.6.3:

$ g++ -O3 -std=c++0x foo.cpp ; ./a.out 
memcpy:     33 ms
memmove:    33 ms
std::copy:  33 ms
assignment: 34 ms

The results are all quite comparable! I get comparable times in all test cases when I change the integer type, e.g. to long long, in the vector as well.

Unless my benchmark rewrite is broken, it looks like your own benchmark isn't performing a valid comparison. HTH!

Looks to me like the answer is that gcc can optimize these particular calls to memmove and memcpy, but not std::copy. gcc is aware of the semantics of memmove and memcpy, and in this case can take advantage of the fact that the size is known (sizeof(int)) to turn the call into a single mov instruction.

std::copy is implemented in terms of memcpy, but apparently the gcc optimizer doesn't manage to figure out that data + sizeof(int) - data is exactly sizeof(int). So the benchmark calls memcpy.

I got all that by invoking gcc with -S and flipping quickly through the output; I could easily have gotten it wrong, but what I saw seems consistent with your measurements.

By the way, I think the test is more or less meaningless. A more plausible real-world test might be creating an actual vector<int> src and an int[N] dst, and then comparing memcpy(dst, src.data(), sizeof(int)*src.size()) with std::copy(src.begin(), src.end(), &dst).

memcpy and std::copy each have their uses, std::copy should(as pointed out by Cheers below) be as slow as memmove because there is no guarantee the memory regions will overlap. This means you can copy non-contiguous regions very easily (as it supports iterators) (think of sparsely allocated structures like linked list etc.... even custom classes/structures that implement iterators). memcpy only work on contiguous reasons and as such can be heavily optimized.

According to assembler output of G++ 4.8.1, test_memcpy:

movl    (%r15), %r15d

test_std_copy:

movl    $4, %edx
movq    %r15, %rsi
leaq    16(%rsp), %rdi
call    memcpy

As you can see, std::copy successfully recognized that it can copy data with memcpy, but for some reason further inlining did not happen - so that is the reason of performance difference.

By the way, Clang 3.4 produces identical code for both cases:

movl    (%r14,%rbx), %ebp
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!