Why is deallocating heap memory much slower than allocating it?

Question

This is an empirical assumption (that allocating is faster than deallocating).

I guess this is also one of the reasons why heap-based storage (STL containers and the like) chooses not to return currently unused memory to the system (which is why the shrink-to-fit idiom was born).

And we shouldn't confuse, of course, 'heap' memory with the 'heap'-like data structures.

So why is deallocation slower?

Is it Windows-specific (I see it on Windows 8.1) or OS-independent?

Is some C++-specific memory manager automatically involved when using 'new' / 'delete', or does memory management rely completely on the OS? (I know C++11 introduced some garbage-collection support, which I have never really used; I prefer relying on the good old stack and static storage duration, or on self-managed containers and RAII.)

Also, in the code of the FOLLY string I saw the old C heap allocation / deallocation functions being used. Are they faster than C++ 'new' / 'delete'?

P.S. Please note that the question is not about virtual memory mechanics; I understand that user-space programs don't use real memory addresses.

Answer 1:

I had much the same idea as @Basile: I wondered whether your base assumption was actually (even close to) correct. Since you tagged the question C++, I wrote a quick benchmark in C++ instead.

#include <vector>
#include <iostream>
#include <numeric>
#include <chrono>
#include <iomanip>
#include <locale>

int main() {
    std::cout.imbue(std::locale(""));

    using namespace std::chrono;
    using factor = microseconds;

    auto const size = 2000;

    std::vector<int *> allocs(size);

    auto start = high_resolution_clock::now();

    for (int i = 0; i < size; i++)
        allocs[i] = new int[size];

    auto stop = high_resolution_clock::now();
    auto alloc_time = duration_cast<factor>(stop - start).count();

    start = high_resolution_clock::now();

    for (int i = 0; i < size; i++)
        delete[] allocs[i];

    stop = high_resolution_clock::now();

    auto del_time = duration_cast<factor>(stop - start).count();

    std::cout << std::left << std::setw(20) << "alloc time: " << alloc_time << " uS\n";
    std::cout << std::left << std::setw(20) << "del time: " << del_time << " uS\n";
}

I also used VC++ on Windows instead of gcc on Linux. The result wasn't much different though: freeing the memory took substantially less time than allocating it did. Here are the results from three successive runs.

alloc time:         2,381 uS
del time:           1,429 uS

alloc time:         2,764 uS
del time:           1,592 uS

alloc time:         2,492 uS
del time:           1,442 uS

I'd warn, however, that allocation and freeing are handled (primarily) by the standard library, so this could differ between one standard library and another (even with the same compiler). I'd also note that it wouldn't surprise me if this changed somewhat in multi-threaded code. There appear to be a few authors under the misapprehension that freeing in a multithreaded environment requires locking a heap for exclusive access. That isn't actually correct: the lock can be avoided, but the means of doing so isn't necessarily immediately obvious.

Answer 2:

The assertion that allocating memory is faster than deallocating it seemed a bit odd to me, so I tested it. I ran a test where I allocated 64MB of memory in 32-byte chunks (so 2M calls to new), and I tried deleting that memory in the same order it was allocated, and in a random order. I found that linear-order deallocation was about 3% faster than allocation, and that random deallocation was about 10% slower than linear allocation.

I then ran a test where I started with 64MB of allocated memory, and then 2M times either allocated new memory or deleted existing memory (at random). Here, I found that deallocation was about 4.3% slower than allocation.

So, it turns out you were correct - deallocation is slower than allocation (though I wouldn't call it "much" slower). I suspect this has simply to do with more random accesses, but I have no evidence for this other than that the linear deallocation was faster.

To answer some of your questions:

Is there some C++ specific memory manager automatically involved on using 'new' / 'delete'?

Yes. The OS has system calls which allocate pages of memory (typically 4KB chunks) to processes. It's the process' job to divide up those pages into objects. Try looking up the "GNU Memory Allocator."

I saw using old C heap allocation / deallocation, is it faster than C++ 'new' / 'delete'?

Most C++ new/delete implementations just call malloc and free under the hood. This is not required by the standard, however, so it's a good idea to always use the same allocation and deallocation function on any particular object.
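To make the relationship between new/delete and malloc/free concrete, here is a minimal sketch that replaces the global allocation functions with versions that forward to malloc and free and count their calls. The counters and the `count_one_cycle` helper are hypothetical names made up for this illustration, the counters are not thread-safe, and this is not a claim about what any particular standard library does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical counters, for illustration only (not thread-safe).
static std::size_t g_new_calls = 0;
static std::size_t g_delete_calls = 0;

// Replacement global allocation function: forwards to malloc.
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (void* p = std::malloc(size ? size : 1))
        return p;
    throw std::bad_alloc{};
}

// Replacement global deallocation functions: forward to free.
void operator delete(void* p) noexcept {
    if (p) ++g_delete_calls;
    std::free(p);
}
void operator delete(void* p, std::size_t) noexcept {
    ::operator delete(p);
}

struct Counts {
    std::size_t news;
    std::size_t deletes;
};

// Allocates and frees one int, and reports how often each replacement ran.
Counts count_one_cycle() {
    const std::size_t n0 = g_new_calls;
    const std::size_t d0 = g_delete_calls;
    // The volatile pointer keeps the compiler from optimizing the pair away.
    int* volatile sink = new int(42);
    delete sink;
    return {g_new_calls - n0, g_delete_calls - d0};
}
```

Calling `count_one_cycle()` from any program that contains these definitions should report one call to each replacement, which shows that a new-expression ends up in whatever `operator new` the program provides.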

I ran my tests with the native testing framework provided in Visual Studio 2015, on a Windows 10 64-bit machine (The tests were also 64-bit). Here's the code:

#include "stdafx.h"
#include "CppUnitTest.h"
#include <stdint.h>  // uint64_t
#include <stdlib.h>  // srand, rand

using namespace Microsoft::VisualStudio::CppUnitTestFramework;

namespace AllocationSpeedTest
{
    class Obj32 {
        uint64_t a;
        uint64_t b;
        uint64_t c;
        uint64_t d;
    };
    constexpr int len = 1024 * 1024 * 2;
    Obj32* ptrs[len];
    TEST_CLASS(UnitTest1)
    {
    public:
        TEST_METHOD(Linear32Alloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
        }
        TEST_METHOD(Linear32AllocDealloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            for (int i = 0; i < len; ++i) {
                delete ptrs[i];
            }
        }
        TEST_METHOD(Random32AllocShuffle)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                int pos = (rand() % (len - i)) + i;
                Obj32* temp = ptrs[i];
                ptrs[i] = ptrs[pos];
                ptrs[pos] = temp;
            }
        }
        TEST_METHOD(Random32AllocShuffleDealloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                int pos = (rand() % (len - i)) + i;
                Obj32* temp = ptrs[i];
                ptrs[i] = ptrs[pos];
                ptrs[pos] = temp;
            }
            for (int i = 0; i < len; ++i) {
                delete ptrs[i];
            }
        }
        TEST_METHOD(Mixed32Both)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    ptrs[i] = new Obj32();
                }
                else {
                    delete ptrs[i];
                }
            }
        }
        TEST_METHOD(Mixed32Alloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    ptrs[i] = new Obj32();
                }
                else {
                    //delete ptrs[i];
                }
            }
        }
        TEST_METHOD(Mixed32Dealloc)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    //ptrs[i] = new Obj32();
                }
                else {
                    delete ptrs[i];
                }
            }
        }
        TEST_METHOD(Mixed32Neither)
        {
            for (int i = 0; i < len; ++i) {
                ptrs[i] = new Obj32();
            }
            srand(0);
            for (int i = 0; i < len; ++i) {
                if (rand() % 2) {
                    //ptrs[i] = new Obj32();
                }
                else {
                    //delete ptrs[i];
                }
            }
        }
    };
}

And here are the raw results over several runs. All numbers are in milliseconds.

Answer 3:

I am not sure about your observation. I wrote the following program (on Linux; hopefully you can port it to your system).

// public domain code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <errno.h>
#include <string.h>
#include <assert.h>


const unsigned possible_word_sizes[] = {
  1, 2, 3, 4, 5,
  8, 12, 16, 24,
  32, 48, 64, 128,
  256, 384, 2048
};

long long totalsize;

// return a calloc-ed array of nbchunks malloced zones of
// somehow random size
void **
malloc_chunks (int nbchunks)
{
  const int nbsizes =
    (int) (sizeof (possible_word_sizes)
       / sizeof (possible_word_sizes[0]));
  void **ad = calloc (nbchunks, sizeof (void *));
  if (!ad)
    {
      perror ("calloc chunks");
      exit (EXIT_FAILURE);
    }
  for (int ix = 0; ix < nbchunks; ix++)
    {
      unsigned sizindex = random () % nbsizes;
      unsigned size = possible_word_sizes[sizindex];
      void *zon = malloc (size * sizeof (void *));
      if (!zon)
        {
          /* size is unsigned, so it is printed with %u */
          fprintf (stderr,
                   "malloc#%d (%u words) failed (total %lld) %s\n",
                   ix, size, totalsize, strerror (errno));
          exit (EXIT_FAILURE);
        }
      ((int *) zon)[0] = ix;
      totalsize += size;
      ad[ix] = zon;
    }
  return ad;
}

void
free_chunks (void **chks, int nbchunks)
{
// first, free the two thirds of chunks in random order
  for (int i = 0; 3 * i < 2 * nbchunks; i++)
    {
      int pix = random () % nbchunks;
      if (chks[pix])
        {
          free (chks[pix]);
          chks[pix] = NULL;
        }
    }
// then, free the rest in reverse order
  for (int i = nbchunks - 1; i >= 0; i--)
    if (chks[i])
      {
        free (chks[i]);
        chks[i] = NULL;
      }
}

int
main (int argc, char **argv)
{
  assert (sizeof (int) <= sizeof (void *));
  int nbchunks = (argc > 1) ? atoi (argv[1]) : 32768;
  if (nbchunks < 128)
    nbchunks = 128;
  srandom (time (NULL));
  printf ("nbchunks=%d\n", nbchunks);
  void **chks = malloc_chunks (nbchunks);
  clock_t clomall = clock ();
  printf ("clomall=%ld totalsize=%lld words\n",
      (long) clomall, totalsize);
  free_chunks (chks, nbchunks);
  clock_t clofree = clock ();
  printf ("clofree=%ld\n", (long) clofree);
  return 0;
}

I compiled it with gcc -O2 -Wall mf.c -o mf on my Debian/Sid/x86-64 (i3770k, 16 GB). I ran time ./mf 100000 and got:

nbchunks=100000
clomall=54162 totalsize=19115681 words
clofree=83895
./mf 100000  0.02s user 0.06s system 95% cpu 0.089 total

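On my system, clock gives CPU microseconds. If the call to random is negligible (and I don't know if it is) w.r.t. malloc & free time, I tend to disagree with your observations: the frees account for clofree - clomall = 29733 microseconds, against 54162 microseconds of CPU time up to the end of the mallocs, so free seems to be about twice as fast as malloc. My gcc is 6.1, my libc is Glibc 2.22.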

Please take time to compile the above benchmark on your system and report the timings.

FWIW, I took Jerry's code and

 g++ -O3 -march=native jerry.cc -o jerry
 time ./jerry;  time ./jerry; time ./jerry

gives

alloc time:         1940516
del time:           602203
./jerry  0.00s user 0.01s system 68% cpu 0.016 total
alloc time:         1893057
del time:           558399
./jerry  0.00s user 0.01s system 68% cpu 0.014 total
alloc time:         1818884
del time:           527618
./jerry  0.00s user 0.01s system 70% cpu 0.014 total

Answer 4:

When you allocate small memory blocks, the block size you specify maps directly to a suballocator for that size, which is commonly represented as a "slab" of memory containing same-size records, to avoid memory fragmentation. This can be very fast, similar to an array access. But freeing such blocks is not so straightforward, because you are passing a pointer to memory of unknown size, which requires additional work to determine what slab it belongs to before the block can be returned to its proper place.
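The asymmetry described above can be sketched with a toy size-class allocator. Everything here (the `ToySlabs` class, its four size classes, the linear search in `deallocate`) is a hypothetical simplification chosen for illustration; real allocators locate the owning slab in other ways, such as block headers or address arithmetic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy size-class allocator; all names are hypothetical.
class ToySlabs {
    static constexpr std::size_t kClasses = 4;  // 16, 32, 64, 128 bytes
    static constexpr std::size_t kBlocksPerSlab = 256;

    struct Slab {
        std::size_t block_size = 0;
        std::vector<unsigned char> storage;  // one contiguous slab
        std::vector<void*> free_list;        // blocks currently available
    };
    std::array<Slab, kClasses> slabs_;

    // Size -> class index. The caller supplies the size, so this is cheap.
    static std::size_t class_for(std::size_t size) {
        std::size_t idx = 0, cap = 16;
        while (cap < size) { cap *= 2; ++idx; }
        return idx;
    }

public:
    ToySlabs() {
        std::size_t bs = 16;
        for (Slab& s : slabs_) {
            s.block_size = bs;
            s.storage.resize(bs * kBlocksPerSlab);
            for (std::size_t i = kBlocksPerSlab; i-- > 0;)
                s.free_list.push_back(s.storage.data() + i * bs);
            bs *= 2;
        }
    }

    // Allocation: pick the class from the size, then pop a free block.
    void* allocate(std::size_t size) {
        if (size > 128) return nullptr;           // toy: small blocks only
        Slab& s = slabs_[class_for(size)];
        if (s.free_list.empty()) return nullptr;  // toy: slabs never grow
        void* p = s.free_list.back();
        s.free_list.pop_back();
        return p;
    }

    // Deallocation: only a pointer is given, so the owning slab must be
    // found first. Returns that slab's block size, or 0 if p is not ours.
    std::size_t deallocate(void* p) {
        const auto addr = reinterpret_cast<std::uintptr_t>(p);
        for (Slab& s : slabs_) {
            const auto lo = reinterpret_cast<std::uintptr_t>(s.storage.data());
            if (addr >= lo && addr < lo + s.storage.size()) {
                s.free_list.push_back(p);
                return s.block_size;
            }
        }
        return 0;
    }
};
```

In this sketch `allocate(24)` goes straight to the 32-byte class, while `deallocate` has to search for the slab that contains the pointer before it can push the block back onto the right free list.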

When you allocate large blocks of virtual memory, a memory page range is set up in your process space without actually mapping any physical memory to it, and that requires very little work to accomplish. But freeing such large blocks can require much more work, because the pointer freed must first be matched to the page tables for that range, followed by walking through all of the page entries for the memory range that it spans, and releasing all of the physical memory pages assigned to that range by the intervening page faults.
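For the large-block case, a rough way to observe the three phases separately (reserving the range, faulting pages in, releasing them) is sketched below. It assumes a POSIX system with `mmap`, `munmap` and `MAP_ANONYMOUS`, so it will not build as-is on Windows; `time_large_mapping` and `MapTimings` are hypothetical names, and the timings it returns will vary from run to run.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <sys/mman.h>  // POSIX only: mmap, munmap
#include <unistd.h>    // sysconf

struct MapTimings {
    bool ok;               // true if both mmap and munmap succeeded
    long long reserve_us;  // time spent in mmap
    long long touch_us;    // time spent faulting every page in
    long long release_us;  // time spent in munmap
};

MapTimings time_large_mapping(std::size_t bytes) {
    using Clock = std::chrono::steady_clock;
    auto us = [](Clock::duration d) {
        return static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(d).count());
    };

    const long ps = sysconf(_SC_PAGESIZE);
    const std::size_t page = ps > 0 ? static_cast<std::size_t>(ps) : 4096;

    // Reserve an address range; no physical pages are assigned yet.
    const auto t0 = Clock::now();
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    const auto t1 = Clock::now();
    if (p == MAP_FAILED) return {false, 0, 0, 0};

    // Touch one byte per page so each page fault assigns physical memory.
    volatile unsigned char* mem = static_cast<unsigned char*>(p);
    for (std::size_t off = 0; off < bytes; off += page) mem[off] = 1;
    const auto t2 = Clock::now();

    // Release the range; the kernel has to give back every touched page.
    const bool ok = munmap(p, bytes) == 0;
    const auto t3 = Clock::now();

    return {ok, us(t1 - t0), us(t2 - t1), us(t3 - t2)};
}
```

Comparing `reserve_us` with `release_us` for something like `time_large_mapping(64u << 20)` gives a feel for the effect on a given machine, though the sketch makes no promise about which number comes out larger.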

Of course, the details of this will vary depending on the implementation being used, but the principles remain much the same: memory allocation of a known block size requires less effort than releasing a pointer to a memory block of unknown size. My knowledge of this comes directly from my experience developing high-performance commercial grade RAII memory allocators.

I should also point out that since every heap allocation has a matching release, this pair of operations represents a single allocation cycle, i.e. the two sides of one coin. Together, their execution time can be measured accurately, but separately such a measurement is difficult to pin down, as it varies widely depending on block size, previous activity across similar sizes, caching and other operational considerations. But in the end, allocate/free differences may not matter much, since you don't do one without the other.

Answer 5:

The problem here is heap fragmentation. Programs written in languages with explicit pointer arithmetic have no realistic way of defragmenting the heap.

If your heap is fragmented, you can't return memory to the OS. The OS, virtual memory aside, depends on a brk(2)-like mechanism, i.e. you set an upper bound for all the memory addresses you'll refer to. But if even one buffer is allocated and still in use near the existing boundary, you can't return memory to the OS explicitly. It doesn't matter if 99% of all the memory in your program has been freed.
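The brk-style limitation can be illustrated with a small model. This is only a sketch under simplifying assumptions: the heap is a single contiguous range from offset 0 up to a break, and `Block` and `releasable_tail` are hypothetical names that do not correspond to any real allocator's data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a brk-style heap: one contiguous range [0, brk).
struct Block {
    std::size_t offset;  // start of the block inside the heap
    std::size_t size;    // length in bytes
    bool live;           // still in use by the program?
};

// Bytes that could be returned by lowering the break. Only the free tail
// above the highest live block counts; free holes below it do not.
std::size_t releasable_tail(const std::vector<Block>& blocks, std::size_t brk) {
    std::size_t highest_live_end = 0;
    for (const Block& b : blocks)
        if (b.live)
            highest_live_end = std::max(highest_live_end, b.offset + b.size);
    return brk - highest_live_end;
}
```

With one hundred 1 KiB blocks of which only the topmost is still live, `releasable_tail` returns 0 even though 99% of the heap is free, which is the situation described above.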

Deallocation doesn't have to be slower than allocation. But the fact that you have manual deallocation with a fragmenting heap makes allocation slower and more complex.

GCs fight this by compacting the heap. This way, allocation is just incrementing a pointer for them, and deallocation is not needed for the bulk of objects.
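The "allocation is just incrementing a pointer" part can be sketched as a bump (arena) allocator. Note that this hypothetical `BumpArena` is not a garbage collector and does no compaction; it only shows the allocation fast path that a compacted heap makes possible, with everything released at once instead of object by object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump ("arena") allocator, for illustration only.
class BumpArena {
    std::vector<unsigned char> buf_;
    std::size_t used_ = 0;

public:
    explicit BumpArena(std::size_t capacity) : buf_(capacity) {}

    // Allocation is an aligned pointer increment plus a bounds check.
    // align must be a power of two.
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        if (size > buf_.size()) return nullptr;
        const auto base = reinterpret_cast<std::uintptr_t>(buf_.data());
        const std::uintptr_t cur = base + used_;
        const std::uintptr_t aligned =
            (cur + align - 1) & ~(static_cast<std::uintptr_t>(align) - 1);
        const std::size_t new_used =
            static_cast<std::size_t>(aligned - base) + size;
        if (new_used > buf_.size()) return nullptr;
        used_ = new_used;
        return buf_.data() + (aligned - base);
    }

    // There is no per-object free: the whole arena is released at once.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }
};
```

Two consecutive `allocate(16, 16)` calls return addresses 16 bytes apart, and after `reset()` the next allocation reuses the first address, without any free list or search.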

Source: https://stackoverflow.com/questions/38030504/why-deallocating-heap-memory-is-much-slower-than-allocating-it

Tags

c++

memory

memory-management

heap-memory