Dynamic array on stack (VLA) vs heap performance

Submitted by 瘦欲@ on 2021-01-29 22:18:37

Question


Most of the time we can assume that the stack is faster and cleaner: no memory fragmentation, better cache behavior, quick allocation. That is also why people assume that a static buffer allocated on the stack is much faster than a dynamic buffer on the heap. Is it? One misconception I see often is the assumption that variable-length arrays (VLAs, a C99 feature that common C++ compilers such as GCC support as a non-standard extension), which allocate a dynamically sized array on the stack, perform as fast as a fixed-size array. I don't think that is the case. Stack operations are fast because the compiler already knows the frame size and all of the offsets, so it can reach anything allocated earlier by a constant offset. Once the frame size becomes dynamic, the compiler can no longer optimize this way. You can simply write code with a static and a dynamic array allocated on the stack and check the assembly: the second version will be much more complicated.
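To make the comparison concrete, here is a minimal sketch of two functions whose assembly can be compared, for example with `g++ -O2 -S`. The names `touch_fixed`, `touch_vla` and `kFixedSize` are made up for illustration, and the second function relies on the GCC/Clang VLA extension, so it is not standard C++ and will not build with MSVC. With GCC on x86-64 I would expect the VLA version to compute the buffer size and adjust the stack pointer at run time, but the exact output depends on compiler and flags.

```cpp
#include <cassert>

// Hypothetical size, chosen only for this illustration.
constexpr int kFixedSize = 64;

// Fixed-size buffer: the frame size is a compile-time constant,
// so the buffer can sit at a constant offset in the frame.
int touch_fixed(int seed) {
    char buf[kFixedSize];
    volatile char* p = buf;  // volatile access keeps the stores from being removed
    for (int i = 0; i < kFixedSize; ++i) {
        p[i] = static_cast<char>(seed + i);
    }
    return p[kFixedSize - 1];
}

// VLA (assumption: GCC or Clang extension in C++): the buffer size is only
// known at run time, so the frame cannot be laid out fully at compile time.
int touch_vla(int seed, int n) {
    assert(n > 0);  // a zero or negative VLA length is not valid
    char buf[n];
    volatile char* p = buf;
    for (int i = 0; i < n; ++i) {
        p[i] = static_cast<char>(seed + i);
    }
    return p[n - 1];
}
```

Both functions do the same work when `n == kFixedSize`, so any difference in the generated code comes from how the buffer is allocated.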

Now the question is: is using a non-standard extension like VLAs really more efficient than a dynamic array on the heap? If so, in which cases is it faster, and in which is it comparable or even worse?

I ran some benchmarks on example code on quick-bench. I wanted to test the performance difference with small and large allocations. To make the test more thorough, I also wanted to simulate accessing something allocated before the dynamically sized stack array, to see whether that access is affected in any way.


#define STATIC_BUF_SIZE 10
#define TEMP_BUF_SIZE 5000

static void Static(benchmark::State& state) {
  char sthAllocatedBefore[STATIC_BUF_SIZE];
  for (int i = 0; i < STATIC_BUF_SIZE; ++i)
  {
    sthAllocatedBefore[i] = 0;
  }
  for (auto _ : state) {
    char arr[TEMP_BUF_SIZE];
    for (int i = 0; i < TEMP_BUF_SIZE; ++i)
    {
      arr[i] = i;
      benchmark::DoNotOptimize(arr[i]);
    }
    for (int i = 0; i < STATIC_BUF_SIZE; ++i)
    {
      sthAllocatedBefore[i]++;
      benchmark::DoNotOptimize(sthAllocatedBefore[i]);
    }
  }
}
BENCHMARK(Static);

static void Heap(benchmark::State& state) {
  char sthAllocatedBefore[STATIC_BUF_SIZE];
  for (int i = 0; i < STATIC_BUF_SIZE; ++i)
  {
    sthAllocatedBefore[i] = 0;
  }
  for (auto _ : state) {
    volatile int size = TEMP_BUF_SIZE;
    benchmark::DoNotOptimize(size);
    char* arr = new char[size];
    for (int i = 0; i < TEMP_BUF_SIZE; ++i)
    {
      arr[i] = i;
      benchmark::DoNotOptimize(arr[i]);
    }
    for (int i = 0; i < STATIC_BUF_SIZE; ++i)
    {
      sthAllocatedBefore[i]++;
      benchmark::DoNotOptimize(sthAllocatedBefore[i]);
    }
    delete[] arr;
  }
}
BENCHMARK(Heap);

static void DynamicStack(benchmark::State& state) {
  char sthAllocatedBefore[STATIC_BUF_SIZE];
  for (int i = 0; i < STATIC_BUF_SIZE; ++i)
  {
    sthAllocatedBefore[i] = 0;
  }
  for (auto _ : state) {
    volatile int size = TEMP_BUF_SIZE;
    benchmark::DoNotOptimize(size);
    char arr[size];
    for (int i = 0; i < TEMP_BUF_SIZE; ++i)
    {
      arr[i] = i;
      benchmark::DoNotOptimize(arr[i]);
    }
    for (int i = 0; i < STATIC_BUF_SIZE; ++i)
    {
      sthAllocatedBefore[i]++;
      benchmark::DoNotOptimize(sthAllocatedBefore[i]);
    }
  }
}
BENCHMARK(DynamicStack);

The test is simple:

  1. Create static buffer on stack and initialize it.
  2. Start measurements in loop.
  3. Allocate temporary buffer.
  4. Fill buffer with values.
  5. Modify values in static buffer.
  6. Remove temporary buffer.

Test #1 - 5k operations on static buffer and 5k size of temp buffer

They look comparable, although static allocation seems to be the worst of them, which is a bit suspicious.

Test #2 - 10 operations on static buffer and 5k size of temp buffer

I'm not sure if this test is valid; quick-bench shows that incrementing the counter took most of the execution time.

Test #3 - 5k operations on static buffer and 10 size of temp buffer

This looks a bit more reasonable, although the performance difference is very small.

Test #4 - 10 operations on static buffer and 10 size of temp buffer

Heap allocation generates huge overhead, but the dynamic stack buffer is still much slower than the constant-size stack buffer.

I ran the same tests in different variants. The results are sometimes quite interesting, although I'm not sure how reliable they are and, above all, how "DoNotOptimize" affects this code.

A similar test, but the temp buffer is not used; it is only created to disturb the stack layout. Big static buffer, small temp.

I'm mostly curious about cases in which you can clearly see the difference and explain what happens. It would also be good to get a better benchmarking setup and a better understanding of it.

To me it seems like using VLAs is just laziness. For bigger allocations the performance is comparable. For lots of small allocations it will still be better to use a constant-size buffer on the stack. For lots of big allocations a VLA might be good just because it reduces memory fragmentation, but then why not use one big static buffer? For loading very big files it would be better to use memory mapping instead. I'm looking for a rationale for using the unsupported VLA extension: what is the real benefit? It might make more sense to create a small constant-size buffer for smaller messages and use dynamic buffers for messages that do not fit there.
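The last idea, a small constant-size buffer with a heap fallback, could look roughly like the sketch below. `ScratchBuffer`, `fill_and_sum` and the 256-byte inline capacity are hypothetical names and values picked for this example, not anything taken from the benchmarks above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper: the storage lives inside the object (so on the stack
// for a local variable) when the request fits in InlineCapacity bytes,
// and on the heap otherwise.
template <std::size_t InlineCapacity>
class ScratchBuffer {
public:
    explicit ScratchBuffer(std::size_t size)
        : size_(size),
          heap_(size > InlineCapacity ? new char[size] : nullptr) {}

    char* data() { return heap_ ? heap_.get() : inline_; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return heap_ != nullptr; }

private:
    std::size_t size_;
    std::unique_ptr<char[]> heap_;  // only set for requests that do not fit inline
    char inline_[InlineCapacity];   // constant-size storage, left uninitialized
};

// Example use: fill a temporary buffer of run-time size and sum its bytes.
inline int fill_and_sum(std::size_t n) {
    ScratchBuffer<256> buf(n);
    assert(buf.size() == n);
    char* p = buf.data();
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        p[i] = static_cast<char>(i % 100);
        sum += p[i];
    }
    return sum;
}
```

Small requests then cost no allocation at all, and only oversized ones pay for `new` and `delete`. The inline capacity is a trade-off between stack usage and how often the heap path is taken.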

Source: https://stackoverflow.com/questions/58848183/dynamic-array-on-stack-vla-vs-heap-performance
