How to deal with old references to a resized hash table?

喜你入骨 submitted on 2019-12-04 19:55:22

Do not use the hash table as the container for the data; only use it to refer to the data, and you won't have that problem.

For example, let's say you have key-value pairs, using a structure with the actual data in the C99 flexible array member:

#include <stdlib.h> /* size_t, malloc(), free() */
#include <string.h> /* strcmp() */

struct pair {
    struct pair  *next; /* For hash chaining */
    size_t        hash; /* For the raw key hash */

    /* Payload: */
    size_t        offset; /* value starts at (data + offset) */
    char          data[]; /* key starts at (data) */
};

static inline const char *pair_key(struct pair *ref)
{
    return (const char *)(ref->data);
}

static inline const char *pair_value(struct pair *ref)
{
    return (const char *)(ref->data + ref->offset);
}
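
As a rough sketch of how such a pair might be created, the helper below allocates the structure and both strings in a single malloc() call. The name pair_new and its signature are assumptions made for illustration, not part of the original answer; the caller is assumed to supply the raw key hash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct pair {
    struct pair  *next;   /* For hash chaining */
    size_t        hash;   /* For the raw key hash */
    size_t        offset; /* value starts at (data + offset) */
    char          data[]; /* key starts at (data) */
};

/* Hypothetical helper: allocate one pair holding copies of key and value.
   Returns NULL if out of memory. */
static struct pair *pair_new(const char *key, const char *value, size_t hash)
{
    assert(key && value);

    const size_t  keylen = strlen(key) + 1;   /* includes the '\0' */
    const size_t  vallen = strlen(value) + 1; /* includes the '\0' */
    struct pair  *p = malloc(sizeof *p + keylen + vallen);
    if (!p)
        return NULL;

    p->next   = NULL;
    p->hash   = hash;
    p->offset = keylen;                       /* value follows the key */
    memcpy(p->data, key, keylen);
    memcpy(p->data + keylen, value, vallen);
    return p;
}
```

Because the key and value live inside the same allocation as the pair, a single free() releases everything, and the address of the pair never depends on the hash table.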

Your hash table can then be simply

struct pair_hash_table {
    size_t        size;
    struct pair **entry;
};

If you have struct pair_hash_table *ht, and struct pair *foo with foo->hash containing the hash of the key, then foo should be in the singly-linked list hanging off ht->entry[foo->hash % ht->size].
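
To illustrate that rule, here is a minimal sketch of computing a raw hash and linking a pair into its slot. Both hash_key and pair_hash_table_insert are hypothetical names, and the hash shown is the well-known DJB2 (xor variant) chosen only as a stand-in; any reasonable string hash would do.

```c
#include <assert.h>
#include <stddef.h>

struct pair {
    struct pair  *next;
    size_t        hash;
    size_t        offset;
    char          data[];
};

struct pair_hash_table {
    size_t        size;
    struct pair **entry;
};

/* Assumed hash function: DJB2, xor variant. */
static size_t hash_key(const char *key)
{
    size_t h = 5381;
    while (*key)
        h = (h * 33) ^ (unsigned char)(*key++);
    return h;
}

/* Hypothetical helper: prepend 'p' to the chain of the slot
   selected by its raw hash. Does not check for duplicate keys. */
static void pair_hash_table_insert(struct pair_hash_table *ht, struct pair *p)
{
    assert(ht && ht->size > 0 && p);

    const size_t i = p->hash % ht->size;
    p->next = ht->entry[i];
    ht->entry[i] = p;
}
```

Only the pointer to the pair is stored in the table; the pair itself stays wherever it was allocated.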

Let's say you wish to resize the hash table ht. You choose a new size, and allocate enough memory for that many struct pair * pointers. Then, you go through the singly-linked list in each old hash entry, detaching each pair from its old list and prepending it to the list in the correct entry of the new array. Finally, you free the old hash table entry array and replace it with the new one:

int resize_pair_hash_table(struct pair_hash_table *ht, const size_t new_size)
{
    struct pair **entry, *curr, *next;
    size_t        i, k;

    if (!ht || new_size < 1)
        return -1; /* Invalid parameters */

    entry = malloc(new_size * sizeof entry[0]);
    if (!entry)
        return -1; /* Out of memory */

    /* Initialize new entry array to empty. */
    for (i = 0; i < new_size; i++)
        entry[i] = NULL;

    for (i = 0; i < ht->size; i++) {

        /* Detach the singly-linked list. */
        next = ht->entry[i];
        ht->entry[i] = NULL;

        while (next) {
            /* Detach the next element, as 'curr' */
            curr = next;
            next = next->next;

            /* k is the index to this hash in the new array */
            k = curr->hash % new_size;

            /* Prepend to the list in the new array */
            curr->next = entry[k];
            entry[k] = curr;
        }
    }

    /* Old array is no longer needed, */
    free(ht->entry);

    /* so replace it with the new one. */
    ht->entry = entry;
    ht->size = new_size;

    return 0; /* Success */
}

Note that the hash field in struct pair is not modified, nor recalculated.
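
The following self-contained sketch shows why old references stay valid: a pointer taken before the resize still refers to the same pair afterwards, because only the array of pointers is replaced. The function names resize_table and pointer_survives_resize are assumptions for this illustration; resize_table is a compact restatement of the resize logic so that the sketch compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>

struct pair { struct pair *next; size_t hash; size_t offset; char data[]; };
struct pair_hash_table { size_t size; struct pair **entry; };

/* Compact restatement of the resize logic; also usable to set up an
   empty table, since a table with size 0 and entry NULL has no chains. */
static int resize_table(struct pair_hash_table *ht, size_t new_size)
{
    struct pair **entry;
    size_t        i, k;

    if (!ht || new_size < 1)
        return -1;
    assert(ht->size == 0 || ht->entry);

    entry = malloc(new_size * sizeof entry[0]);
    if (!entry)
        return -1;
    for (i = 0; i < new_size; i++)
        entry[i] = NULL;

    for (i = 0; i < ht->size; i++) {
        struct pair *next = ht->entry[i];
        while (next) {
            struct pair *curr = next;
            next = next->next;
            k = curr->hash % new_size;
            curr->next = entry[k];   /* relink; the pair itself never moves */
            entry[k] = curr;
        }
    }

    free(ht->entry);                 /* free(NULL) is a no-op */
    ht->entry = entry;
    ht->size = new_size;
    return 0;
}

/* Returns 1 if a pointer obtained before the resize still identifies
   the same pair afterwards, 0 otherwise (including out of memory). */
static int pointer_survives_resize(void)
{
    struct pair_hash_table ht = { 0, NULL };
    struct pair *p, *old;
    int ok;

    if (resize_table(&ht, 2))
        return 0;

    p = malloc(sizeof *p + 1);
    if (!p) {
        free(ht.entry);
        return 0;
    }
    p->hash = 7;
    p->offset = 0;
    p->data[0] = '\0';
    p->next = ht.entry[p->hash % ht.size];
    ht.entry[p->hash % ht.size] = p;

    old = p;                         /* the "old reference" */
    if (resize_table(&ht, 5)) {
        free(p);
        free(ht.entry);
        return 0;
    }

    /* Slot index changed from 7 % 2 to 7 % 5, the address did not. */
    ok = (ht.entry[7 % ht.size] == old) && (old->hash == 7);

    free(p);
    free(ht.entry);
    return ok;
}
```

Any code that kept a struct pair * across the resize can keep using it; only code that cached a slot index or a pointer into the entry array would have to refresh it.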

Storing the raw hash (as opposed to the hash modulo the table size) means you can speed up the key search even when different keys share the same slot:

struct pair *find_key(struct pair_hash_table *ht,
                      const char *key, const size_t key_hash)
{
    struct pair *curr = ht->entry[key_hash % ht->size];

    while (curr)
        if (curr->hash == key_hash && !strcmp(key, pair_key(curr)))
            return curr;
        else
            curr = curr->next;

    return NULL; /* Not found. */
}

In C, the logical AND operator, &&, is short-circuiting. If the left side is false, the right side is not evaluated at all, because the entire expression can never be true in that case.

Above, this means that the raw hash value of the key is compared, and only when they do match, the actual strings are compared. If your hash algorithm is even halfway good, this means that if the key already exists, typically only one string comparison is done; and if the key does not exist in the table, typically no string comparisons are done.

You can deal with them the same way the standard library (C++) deals with this exact problem:

Some operations on containers (e.g. insertion, erasing, resizing) invalidate iterators.

For instance, std::unordered_map, which is basically a hash table implemented with buckets, has these rules:

  • insertion

unordered_[multi]{set,map}: all iterators invalidated when rehashing occurs, but references unaffected [23.2.5/8]. Rehashing does not occur if the insertion does not cause the container's size to exceed z * B where z is the maximum load factor and B the current number of buckets. [23.2.5/14]

  • erasure

unordered_[multi]{set,map}: only iterators and references to the erased elements are invalidated [23.2.5/13]

Iterator invalidation rules

The C++ concept of iterators is a generalization of pointers. So this concept can be applied to C.


Your only other alternative is that, instead of holding the objects directly in the container, you add another level of indirection and hold some sort of proxy. That way the elements always stay at the same position in memory; it's the proxies that move around on resizing, inserting, etc. But you need to analyze this scenario: are the added double indirection (which will surely affect performance negatively) and the increased implementation complexity worth it? Is it that important to have persistent pointers?
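
A minimal sketch of the proxy idea in C, under the assumption that the "proxies" are plain pointers kept in a growable array: the array may be moved by realloc(), but each element is allocated separately and keeps its address. The types item and item_vec and the function item_vec_add are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct item {
    int value;
};

struct item_vec {
    size_t        used;  /* number of slots in use */
    size_t        cap;   /* number of slots allocated */
    struct item **slot;  /* the proxies: this array may move */
};

/* Append a new item. Returns its address, which stays valid until the
   item itself is freed, or NULL if out of memory. */
static struct item *item_vec_add(struct item_vec *v, int value)
{
    struct item *it;

    assert(v);

    if (v->used >= v->cap) {
        const size_t   cap  = v->cap ? 2 * v->cap : 4;
        struct item  **slot = realloc(v->slot, cap * sizeof slot[0]);
        if (!slot)
            return NULL;
        v->slot = slot;  /* only the pointer array was (possibly) moved */
        v->cap  = cap;
    }

    it = malloc(sizeof *it);
    if (!it)
        return NULL;

    it->value = value;
    v->slot[v->used++] = it;
    return it;
}
```

The cost is the one the answer warns about: every access through the container goes through two pointers instead of one, and each element needs its own allocation.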
