Question
I have to maintain an in-memory data-structure of Key-Value pairs. I have the following constraints:
- Both keys and values are text strings of length 256 and 1024 respectively. A key generally looks like k1k2k3k4k5, each k(i) being a 4-8 byte string in itself.
- As far as possible, the in-memory data-structure should use contiguous memory. I have 400 MB worth of Key-Value pairs and am allowed 120% of that for allocation. (The additional 20% is for metadata, only if needed.)
- The DS will support the following operations:
- Add [infrequent operation]: typical signature:
  void add_kv(void *ds, char *key, char *value);
- Delete [infrequent operation]: typical signature:
  void del_kv(void *ds, char *key);
- Lookup [MOST FREQUENT OPERATION]: typical signature:
  char *lookup(void *ds, char *key);
- Iterate [MOST FREQUENT OPERATION]: this operation is prefix based. It allocates an iterator, i.e. it iterates the whole DS and returns the list of key-values that match prefix_key (e.g. "k1k2k3.*", with k(i) defined as above). Every iteration iterates on this iterator (list); freeing the iterator frees the list. Typically an iterator returns 100 KB worth of key-value pairs out of the 400 MB DS (100 KB : 400 MB :: 1 : 4000). Typical signature:
  void *iterate(void *ds, char *prefix_key);
- Lookup and Iterate, being the most frequent operations, need to be optimized for.
My question is: what is the best-suited data-structure for the above constraints?
I have considered a hash table. Add/delete/lookup can be done in O(1) since I have sufficient memory, but it is not optimal for iteration. A hash-of-hashes (hash on k1, then on k2, then on k3, ...) or an array of hash tables would work, but either one violates the contiguous-memory constraint. What other options do I have?
Answer 1:
I would probably use something like a B+tree for this: https://en.wikipedia.org/wiki/B%2B_tree
Since memory-efficiency is important to you, when a leaf block gets full you should redistribute keys among several blocks if possible to ensure that blocks are always >= 85% full. Block size should be large enough that the overhead from internal nodes is only a few %.
You can also optimize storage in the leaf blocks, since most of the keys in a block will share a long common prefix that you can recover from the higher-level nodes. You can therefore strip all copies of the common prefix from the keys in a leaf block, so your 400 MB of key-value pairs will take substantially less than 400 MB of RAM. This complicates the insert process somewhat.
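As a minimal sketch of the prefix-stripping idea (the helper names are hypothetical; note that for sorted leaf keys, the block-wide common prefix is simply the common prefix of the first and last key):

```c
#include <string.h>

/* Length of the longest common prefix of two NUL-terminated strings. */
static size_t common_prefix_len(const char *a, const char *b)
{
    size_t n = 0;
    while (a[n] && a[n] == b[n])
        n++;
    return n;
}

/* Common prefix length shared by all keys in a sorted leaf block:
   only the first and last key need to be compared. */
static size_t block_prefix_len(const char **keys, size_t count)
{
    if (count < 2)
        return count ? strlen(keys[0]) : 0;
    return common_prefix_len(keys[0], keys[count - 1]);
}
```

Each leaf would then store this length once and keep only the key suffixes.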
There are other things you can do to compress this structure further, but that gets complicated fast and it doesn't sound like you need it.
Answer 2:
I would implement this as a hash table for lookup, plus a separate inverted index for your iteration. Trying to turn those key segments into integers, as you asked in "Ways to convert special-purpose-strings to Integers", seems like a bunch of unnecessary work.
There are plenty of good hash table implementations for C already available, so I won't go into that.
To create the inverted index for iteration, create N hash tables, where N is the number of key segments. Then, for each key, break it into its individual segments and add an entry for each segment into the corresponding hash table. So if you have the key "abcxyzqgx", where:
k1 = abc
k2 = xyz
k3 = qgx
Then in the k1 hash table you add an entry "abc=abcxyzqgx". In the k2 hash table you add an entry "xyz=abcxyzqgx". In the k3 hash table you add "qgx=abcxyzqgx". (The values, of course, wouldn't be the string keys themselves, but rather references to the string keys. Otherwise you'd have O(nk) 256-character strings.)
When you're done, your hash tables each have as keys the unique segment values, and the values are lists of keys in which those segments exist.
When you want to find all of the keys that have k1=abc and k3=qgx, you query the k1 hash table for the list of keys that contain abc, query the k3 hash table for the list of keys that contain qgx. Then you do an intersection of those two lists to obtain the result.
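The intersection step can be sketched as a standard merge over two sorted reference lists. A minimal sketch, assuming key references are ascending indices into a shared key table (the function name is hypothetical):

```c
#include <stddef.h>

/* Intersect two ascending arrays of key references, writing the common
   entries to out. Returns the number of entries written. */
static size_t intersect(const size_t *a, size_t na,
                        const size_t *b, size_t nb,
                        size_t *out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else {
            out[n++] = a[i];    /* present in both lists */
            i++;
            j++;
        }
    }
    return n;
}
```

Keeping each per-segment list sorted by key reference makes the merge linear in the combined list lengths.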
Building the individual hash tables is a one-time cost of O(nk), where n is the total number of keys, and k is the number of key segments. Memory requirement, also, is O(nk). Granted, that's a bit expensive, but you're only talking about 1.6 million keys, total.
The cost of iteration is O(m*x), where m is the average number of keys referenced by an individual key segment, and x is the number of key segments in the query.
An obvious optimization is to put an LRU cache in front of this lookup, so that frequent queries are served from the cache.
Another possible optimization is to create additional indexes that combine segments. For example, if queries frequently ask for k1 and k2 together, and the number of possible combinations is reasonably small, then it makes sense to have a combined k1k2 index. So if somebody searches for k1=abc and k2=xyz, the k1k2 index contains "abcxyz=[list of keys]".
Answer 3:
I would use five parallel hash tables, corresponding to the five possible prefixes one might search for. Each hash table slot would contain zero or more references, with each reference containing the length of the prefix for that particular key-value pair, the hash of that key prefix, and a pointer to the actual key and data structure.
For deletion, the actual key and data structure would contain all five prefix lengths and corresponding hashes, plus the character data for the key and the value.
For example:
#define PARTS 5

struct hashitem {
    size_t           hash[PARTS];
    size_t           hlen[PARTS];
    char            *data;
    char             key[];
};

struct hashref {
    size_t           hash;
    size_t           hlen;
    struct hashitem *item;
};

struct hashrefs {
    size_t           size;
    size_t           used;
    struct hashref   ref[];
};

struct hashtable {
    size_t            size[PARTS];
    struct hashrefs **slot[PARTS];
};
In a struct hashitem, if key is k1k2k3k4k5, then hlen[0]=2, hash[0]=hash("k1"), hlen[1]=4, hash[1]=hash("k1k2"), and so on, until hlen[4]=10, hash[4]=hash("k1k2k3k4k5").
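A minimal sketch of filling hash[] and hlen[] for one key, assuming the individual segment lengths are already known, and using DJB2 (the simple hash mentioned later in this answer) as the example hash function; the helper names are hypothetical:

```c
#include <stddef.h>

#define PARTS 5

/* DJB2 hash over the first len bytes of s. */
static size_t djb2n(const char *s, size_t len)
{
    size_t h = 5381;
    while (len--)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Fill hlen[] with the cumulative prefix lengths of key, and hash[]
   with the hash of each prefix, given each segment's length. */
static void prefix_hashes(const char *key, const size_t seglen[PARTS],
                          size_t hash[PARTS], size_t hlen[PARTS])
{
    size_t p, len = 0;
    for (p = 0; p < PARTS; p++) {
        len += seglen[p];
        hlen[p] = len;
        hash[p] = djb2n(key, len);
    }
}
```

Note that each prefix hash is computed from the start of the key, matching the hlen[0]=2, hash[0]=hash("k1"), ..., hlen[4]=10 layout described above.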
When inserting a new key-value pair, one would first find out the prefix lengths (hlen[]) and their corresponding hashes (hash[]), then call a helper function along the lines of
static int insert_pair(struct hashtable *ht,
                       const char *key,
                       const size_t hash[PARTS],
                       const size_t hlen[PARTS],
                       const char *data,
                       const size_t datalen)
{
    struct hashitem *item;
    size_t p, i;

    /* Verify the key is not already in the hash table. */

    /* Allocate 'item', and copy 'key', 'data', 'hash',
       and 'hlen' to it. */

    for (p = 0; p < PARTS; p++) {
        i = hash[p] % ht->size[p];
        if (!ht->slot[p][i]) {
            /* Allocate a new hashrefs array, with size=1
               or greater, and initialize used=0. */
        } else
        if (ht->slot[p][i]->used >= ht->slot[p][i]->size) {
            /* Reallocate ht->slot[p][i] with size=used+1
               or greater. */
        }
        ht->slot[p][i]->ref[ht->slot[p][i]->used].hash = hash[p];
        ht->slot[p][i]->ref[ht->slot[p][i]->used].hlen = hlen[p];
        ht->slot[p][i]->ref[ht->slot[p][i]->used].item = item;
        ht->slot[p][i]->used++;
    }

    return 0; /* Success, no errors */
}
Prefix lookup works the same way as a hash table lookup with the full key:
int lookup_filter(struct hashtable *ht,
                  const size_t hash,
                  const size_t hashlen,
                  const size_t parts, /* 0 to PARTS-1 */
                  const char *key,
                  int (*func)(struct hashitem *, void *),
                  void *custom)
{
    const struct hashrefs *refs = ht->slot[parts][hash % ht->size[parts]];
    int retval = -1; /* None found */
    size_t i;

    if (!refs)
        return retval;

    for (i = 0; i < refs->used; i++)
        if (refs->ref[i].hash == hash &&
            refs->ref[i].hlen == hashlen &&
            !strncmp(refs->ref[i].item->key, key, hashlen)) {
            if (func) {
                retval = func(refs->ref[i].item, custom);
                if (retval)
                    return retval;
            } else
                retval = 0;
        }

    return retval;
}
Note the callback style used, to allow a single lookup to match all prefixes. A full key match, assuming unique keys, would be slightly simpler:
struct hashitem *lookup(struct hashtable *ht,
                        const size_t hash,
                        const size_t hashlen,
                        const char *key)
{
    const struct hashrefs *refs = ht->slot[PARTS-1][hash % ht->size[PARTS-1]];
    size_t i;

    if (!refs)
        return NULL;

    for (i = 0; i < refs->used; i++)
        if (refs->ref[i].hash == hash &&
            refs->ref[i].hlen == hashlen &&
            !strncmp(refs->ref[i].item->key, key, hashlen))
            return refs->ref[i].item;

    return NULL;
}
Deletion would utilize the lookup, except that a match is removed by replacing its entry with the final entry in the same reference array; if the match is the only entry in the reference array, the entire array is freed instead.
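The swap-with-last removal can be sketched as follows, reusing the struct hashrefs shape from above (with the item pointer reduced to void * for brevity; refs_remove is a hypothetical helper, and freeing a fully emptied array is left to the caller):

```c
#include <stddef.h>

struct hashref {
    size_t  hash;
    size_t  hlen;
    void   *item;
};

struct hashrefs {
    size_t         size;
    size_t         used;
    struct hashref ref[];
};

/* Remove entry i by overwriting it with the final entry; order within
   a slot does not matter. Returns the new used count. */
static size_t refs_remove(struct hashrefs *refs, size_t i)
{
    refs->used--;
    if (i < refs->used)
        refs->ref[i] = refs->ref[refs->used];
    return refs->used;
}
```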
The reason using a reference array (multiple data items per hash table slot) is acceptable is that current processors cache data in chunks (a cacheline being the smallest chunk cached). Because each slot's references carry the full hash and prefix length, actual collisions, where a byte-by-byte comparison must be done to ascertain a match, are exceedingly rare even for fast-and-simple hash functions. (I would expect something like 1.05 to 1.10 string comparisons per matching entry, even with something as simple as a DJB2 hash.)
In other words, this approach tries to minimize the number of cachelines accessed to find the desired pair(s).
Since the initial parts will have lots of duplicate hashes (relatively few unique prefix hashes) and hash lengths, it may be more efficient to make their hash tables smaller. (The reference arrays will be larger.) Because the hashes and hash lengths do not change, one can at any point resize any of the hash tables, without having to recalculate any hashes.
Do note that because all but the last (PARTS-1) hash table are used to scan sets of items, it is not a bad thing that their reference arrays may grow quite long: those arrays will contain almost exclusively the items one is looking up anyway! (In other words, a reference array growing to, say, 10,000 entries is not a problem if it is used to find, say, the desired 9,750 entries.)
I personally did also consider a nested table of some sort, with each key part being an additional level in the table. However, looking up the set of entries with a given prefix then involves a table traversal and rather scattered memory accesses. I believe, but have not verified with a microbenchmark comparing the two approaches, that the hash table with potentially large per-slot reference arrays is more efficient at run time.
Source: https://stackoverflow.com/questions/50671680/best-suited-data-structure-for-prefix-based-searches