Why are hash table expansions usually done by doubling the size?

小蘑菇 2020-12-02 06:21

I've done a little research on hash tables, and I keep running across the rule of thumb that when there are a certain number of entries (either a maximum, or via a load factor like 75%) the hash table should be expanded, and that the new size should almost always be double (or double plus one, i.e. 2n + 1) the old size. Why is doubling the usual choice?

6 Answers
  • 2020-12-02 07:05

    Doubling the memory when expanding any type of collection is a commonly used strategy to limit memory fragmentation and to avoid reallocating too often. As you point out, there might be reasons to prefer a prime number of buckets. If you know your application and your data, you may also be able to predict how the number of elements will grow, and thus choose a different (larger or smaller) growth factor than doubling.

    The general implementations found in libraries are exactly that: general implementations. They have to be a reasonable choice across a variety of situations. When you know the context, it is almost always possible to write a more specialized and more efficient implementation.
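
    As a rough illustration of that idea, here is a minimal sketch (not taken from any particular library; GrowableBuffer and its growth parameter are invented for this example) of a growable buffer whose growth factor is a constructor argument, so a caller who can predict the data's growth can choose something other than 2.0:

        // Minimal sketch, assuming a caller who knows roughly how the data grows.
        // The class name and its parameters are invented for this illustration.
        #include <algorithm>
        #include <cstddef>
        #include <memory>

        class GrowableBuffer {
        public:
            explicit GrowableBuffer(std::size_t initial_capacity, double growth = 2.0)
                : data_(new int[initial_capacity ? initial_capacity : 1]),
                  capacity_(initial_capacity ? initial_capacity : 1),
                  size_(0),
                  growth_(growth) {}

            void push_back(int value) {
                if (size_ == capacity_) {
                    // Grow by the configured factor (at least one extra slot),
                    // then copy the old contents over; this copy is the cost
                    // that a geometric growth factor amortizes away.
                    std::size_t new_capacity =
                        static_cast<std::size_t>(capacity_ * growth_) + 1;
                    std::unique_ptr<int[]> bigger(new int[new_capacity]);
                    std::copy(data_.get(), data_.get() + size_, bigger.get());
                    data_ = std::move(bigger);
                    capacity_ = new_capacity;
                }
                data_[size_++] = value;
            }

            std::size_t size() const { return size_; }
            std::size_t capacity() const { return capacity_; }

        private:
            std::unique_ptr<int[]> data_;
            std::size_t capacity_;
            std::size_t size_;
            double growth_;
        };

        int main() {
            GrowableBuffer buf(4, 1.5);   // this caller chooses a 1.5x growth factor
            for (int i = 0; i < 1000; ++i) buf.push_back(i);
            return 0;
        }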

  • 2020-12-02 07:06

    The same reasoning for doubling the size applies here as for vector/ArrayList implementations; see this answer.

  • 2020-12-02 07:16

    I once read a very interesting discussion of growth strategies on this very site... I just cannot find it again.

    While 2 is commonly used, it has been shown not to be the best value. One often-cited problem is that it does not cope well with allocator schemes (which often hand out power-of-two-sized blocks), since doubling always forces a reallocation to a new block, whereas a smaller factor may in fact fit in the same block (simulating in-place growth) and thus be faster.

    Thus, for example, the VC++ Standard Library uses a growth factor of 1.5 (ideally it should be the golden ratio if a first-fit memory allocation strategy is being used), adopted after an extensive discussion on the mailing list. The reasoning is explained here:

    I'd be interested if any other vector implementations uses a growth factor other than 2, and I'd also like to know whether VC7 uses 1.5 or 2 (since I don't have that compiler here).

    There is a technical reason to prefer 1.5 to 2 -- more specifically, to prefer values less than (1+sqrt(5))/2.

    Suppose you are using a first-fit memory allocator, and you're progressively appending to a vector. Then each time you reallocate, you allocate new memory, copy the elements, then free the old memory. That leaves a gap, and it would be nice to be able to use that memory eventually. If the vector grows too rapidly, it will always be too big for the available memory.

    It turns out that if the growth factor is >= (1+sqrt(5))/2, the new memory will always be too big for the hole that has been left so far; if it is < (1+sqrt(5))/2, the new memory will eventually fit. So 1.5 is small enough to allow the memory to be recycled.

    Surely, if the growth factor is >= 2 the new memory will always be too big for the hole that has been left so far; if it is < 2, the new memory will eventually fit. Presumably the reason for (1+sqrt(5))/2 is...

    • Initial allocation is s.
    • The first resize is k*s.
    • The second resize is k*k*s, which will fit the hole iff k*k*s <= k*s + s, i.e. iff k*k <= k + 1, i.e. iff k <= (1+sqrt(5))/2

    ...the hole can be recycled asap.

    It could, by storing its previous size, grow in Fibonacci fashion.

    Of course, it should be tailored to the memory allocation strategy.
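
    To make the quoted argument concrete, here is a toy simulation (my own sketch, under the same simplifying assumptions as the quote: blocks are laid out back to back, all blocks freed so far form one contiguous hole, and the live block is only freed after its contents are copied). It reports the first reallocation whose request fits into the hole:

        #include <cstdio>

        // Toy model of the first-fit argument above; not a real allocator.
        // Sizes are measured in units of the initial allocation.
        static void simulate(double k, int max_reallocs = 64) {
            double hole = 0.0;   // total size of the blocks already freed
            double live = 1.0;   // the block currently holding the data
            for (int n = 1; n <= max_reallocs; ++n) {
                double request = live * k;   // size of the next allocation
                if (request <= hole) {
                    std::printf("k = %.3f: reallocation %d fits into the hole\n", k, n);
                    return;
                }
                hole += live;                // the old block joins the hole after the copy
                live = request;
            }
            std::printf("k = %.3f: never fits within %d reallocations\n", k, max_reallocs);
        }

        int main() {
            simulate(1.5);    // below (1+sqrt(5))/2: fits after a handful of reallocations
            simulate(1.618);  // just below the golden ratio: fits, but only much later
            simulate(2.0);    // above the golden ratio: the request always exceeds the hole
            return 0;
        }

    Under this simple model, factors below (1+sqrt(5))/2 eventually recycle the hole while 2.0 never does, and smaller factors recycle it sooner, which is the sense in which 1.5 lets memory be reused.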

  • 2020-12-02 07:22

    Hash tables could not claim "amortized constant time insertion" if, for instance, the resizing was by a constant increment. In that case the cost of resizing (which grows with the size of the hash table) would make the amortized cost of one insertion linear in the total number of elements inserted. Because resizing becomes more and more expensive as the table grows, it has to happen "less and less often" to keep the amortized cost of insertion constant.

    Most implementations allow the average bucket occupation to grow until it reaches a bound fixed in advance before resizing (anywhere between 0.5 and 3, which are all acceptable values). With this convention, just after resizing the average bucket occupation becomes half that bound. Resizing by doubling keeps the average bucket occupation within a factor-of-two band.

    Side note: because of statistical clustering, you have to take an average bucket occupation as low as 0.5 if you want many buckets to have at most one element (maximum lookup speed, ignoring the complex effects of cache size), or as high as 3 if you want a minimum number of empty buckets (which correspond to wasted space).
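
    For illustration, here is a minimal separate-chaining table (a sketch, not any real library's implementation; ToyHashSet and its max_load parameter are invented here) that doubles its bucket array whenever the load factor would exceed the bound, and counts how many element moves the resizes cost in total:

        #include <cstddef>
        #include <cstdio>
        #include <functional>
        #include <list>
        #include <vector>

        // Minimal separate-chaining hash set; an illustrative sketch only
        // (no duplicate check, to keep it short).
        class ToyHashSet {
        public:
            explicit ToyHashSet(double max_load = 1.0)
                : buckets_(8), count_(0), moves_(0), max_load_(max_load) {}

            void insert(int key) {
                // Double the bucket array when the load factor would exceed the bound.
                if (static_cast<double>(count_ + 1) / buckets_.size() > max_load_)
                    rehash(buckets_.size() * 2);
                buckets_[index(key, buckets_.size())].push_back(key);
                ++count_;
            }

            long long total_moves() const { return moves_; }
            std::size_t bucket_count() const { return buckets_.size(); }

        private:
            static std::size_t index(int key, std::size_t n) {
                return std::hash<int>{}(key) % n;
            }

            void rehash(std::size_t new_size) {
                std::vector<std::list<int>> next(new_size);
                for (const auto& bucket : buckets_)
                    for (int key : bucket) {
                        next[index(key, new_size)].push_back(key);
                        ++moves_;   // every resize touches each element once
                    }
                buckets_.swap(next);
            }

            std::vector<std::list<int>> buckets_;
            std::size_t count_;
            long long moves_;
            double max_load_;
        };

        int main() {
            ToyHashSet set;
            const int n = 1000000;
            for (int i = 0; i < n; ++i) set.insert(i);
            std::printf("inserts: %d, element moves during resizes: %lld, buckets: %zu\n",
                        n, set.total_moves(), set.bucket_count());
            return 0;
        }

    Because the bucket array doubles on every resize, the total number of moves stays below twice the number of inserts (a geometric series), which is exactly the "amortized constant time insertion" argument above, while the load factor stays within a factor of two of the chosen bound.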

  • 2020-12-02 07:22

    One reason for doubling the size that is specific to hash containers is that when the container capacity is always a power of two, the general-purpose modulo used to convert a hash into a bucket index can be replaced by bit masking, i.e. a bitwise AND with capacity - 1. Modulo is a slow operation for the same reasons that integer division is slow. (Whether integer division is "slow" in the context of whatever else is going on in a program is of course case-dependent, but it is certainly slower than other basic integer arithmetic.)
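
    A small sketch of that trick (the bucket count and the number of test values here are arbitrary): when the bucket count is a power of two, hash % bucket_count and hash & (bucket_count - 1) give the same index, and the AND avoids the division:

        #include <cassert>
        #include <cstdint>
        #include <cstdio>

        int main() {
            const std::uint64_t bucket_count = 1u << 16;   // 65536 buckets, a power of two
            const std::uint64_t mask = bucket_count - 1;   // 0xFFFF

            for (std::uint64_t h = 0; h < 1000000; ++h) {
                std::uint64_t by_modulo = h % bucket_count; // general purpose, uses a division
                std::uint64_t by_mask   = h & mask;         // keeps the low 16 bits, no division
                assert(by_modulo == by_mask);
            }
            std::printf("hash %% %llu == hash & 0x%llX for all tested hash values\n",
                        (unsigned long long)bucket_count, (unsigned long long)mask);
            return 0;
        }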

  • 2020-12-02 07:22

    If you don't know how many objects you will end up storing (let's say N),
    then by doubling the capacity you'll do at most about log2(N) reallocations.

    I assume that if you choose a proper initial n, you also increase the odds
    that 2*n + 1 will yield prime numbers on subsequent reallocations.
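
    As a quick sketch of the first point (the initial capacity of 8 and the workload of ten million inserts are arbitrary assumptions), counting the doublings needed to accommodate N inserts shows the roughly log2(N) reallocations mentioned above:

        #include <cstdio>

        int main() {
            const long long n = 10000000;   // number of inserts (hypothetical workload)
            long long capacity = 8;         // assumed initial capacity
            int reallocations = 0;
            for (long long inserted = 0; inserted < n; ++inserted) {
                if (inserted == capacity) {  // table is full: double it
                    capacity *= 2;
                    ++reallocations;
                }
            }
            // Prints 21 reallocations for ten million inserts: roughly log2(n / 8), rounded up.
            std::printf("%lld inserts, %d reallocations, final capacity %lld\n",
                        n, reallocations, capacity);
            return 0;
        }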
