How to do fuzzy string matching of a bigger-than-memory dictionary in an ordered key-value store?

Submitted by 南笙酒味 on 2020-06-29 03:54:07

Question


I am looking for an algorithm and storage schema to do string matching over a bigger-than-memory dictionary.

My initial attempt, inspired by https://swtch.com/~rsc/regexp/regexp4.html, was to store trigrams of every word of the dictionary at index time. For instance, the word apple is split into $ap, app, ppl, ple and le$. Each of those trigrams is associated with the word it came from.

Then, at query time, I do the same for the input string that must be matched. I look up each of those trigrams in the database and collect the candidate words in a mapping from word to the number of matching trigrams. Then I compute the Levenshtein distance between the query and every candidate and apply the following formula:

score(query, candidate) = common_trigram_number(query, candidate) - abs(levenshtein(query, candidate))
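That scheme can be sketched in a few lines (an in-memory toy; `rank` stands in for the database lookup and its name is mine, not from the question):

```python
def trigrams(word):
    # Pad with $ sentinels, as described above: apple -> $ap, app, ppl, ple, le$
    padded = "$" + word + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def rank(query, candidates):
    # score = common trigram count - levenshtein distance (the formula above)
    qgrams = set(trigrams(query))
    scored = []
    for word in candidates:
        common = len(qgrams & set(trigrams(word)))
        scored.append((word, common - levenshtein(query, word)))
    return sorted(scored, key=lambda pair: -pair[1])
```

For example, `rank("aple", ["apple", "maple", "orange"])` puts apple and maple (score 2 each) ahead of orange.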

There are two problems with this approach: first, the candidate selection is too broad; second, the Levenshtein distance is too slow to compute.

Fixing the first could make optimizing the second unnecessary.

I thought about another approach: at index time, instead of storing trigrams, I will store words (possibly associated with a frequency). At query time, I can look up successive prefixes of the query string and score the results using Levenshtein distance and frequency.

In particular, I am not looking for an algorithm that gives me strings at a distance of 1, 2, etc. I would just like a paginated list of more-or-less relevant words from the dictionary. The actual selection is made by the user.

Also, it must be possible to represent the schema in an ordered key-value store like RocksDB or WiredTiger.
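A sketch of that second approach, using a sorted list with `bisect` as a stand-in for the ordered store's cursor (the sample data and the longest-prefix-first probing strategy are illustrative assumptions, not part of the question):

```python
import bisect

# Toy stand-in for the ordered key-value store: sorted (word, frequency)
# pairs. With rocksdb/wiredtiger this would be a cursor seek + forward scan.
WORDS = sorted([("appeal", 7), ("apple", 42), ("apply", 13), ("banana", 9)])
KEYS = [w for w, _ in WORDS]

def scan_prefix(prefix):
    """Range-scan every (word, frequency) whose key starts with prefix."""
    lo = bisect.bisect_left(KEYS, prefix)
    out = []
    for word, freq in WORDS[lo:]:
        if not word.startswith(prefix):
            break
        out.append((word, freq))
    return out

def candidates(query):
    """Probe successively shorter prefixes of the query and return the first
    non-empty batch; score it afterwards with Levenshtein and frequency."""
    for n in range(len(query), 0, -1):
        hits = scan_prefix(query[:n])
        if hits:
            return hits
    return []
```

For instance, `candidates("applesauce")` falls back to the prefix "apple" and returns `[("apple", 42)]`.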


Answer 1:


simhash captures similarity between (small) strings, but it does not really solve the problem of querying the most similar string in a bigger-than-RAM dataset. I think the original paper recommends indexing some permutations; that requires a lot of memory and does not take advantage of the ordered nature of an OKVS.
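For reference, simhash itself is only a few lines. This is a minimal sketch over character bigrams; the feature set and the use of md5 as the per-feature hash are illustrative choices of mine, not from the paper:

```python
import hashlib

def simhash(text, bits=64):
    # Sum +1/-1 per bit over the hashes of all character bigrams,
    # then keep the sign of each column as the fingerprint bit.
    v = [0] * bits
    for i in range(len(text) - 1):
        h = int(hashlib.md5(text[i:i + 2].encode()).hexdigest(), 16)
        for b in range(bits):
            v[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if v[b] > 0)

def hamming(a, b):
    # Similar strings yield fingerprints with a small Hamming distance.
    return bin(a ^ b).count("1")
```

The catch, as noted above, is that a small Hamming distance does not translate into key adjacency in an ordered store without extra permutation indexes.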

I think I found a hash that captures similarity inside the prefix of the hash:

In [1]: import fuzz                                                                                      

In [2]: hello = fuzz.bbkh("hello")                                                                       

In [3]: helo = fuzz.bbkh("helo")                                                                         

In [4]: hellooo = fuzz.bbkh("hellooo")                                                                   

In [5]: salut = fuzz.bbkh("salut")                                                                       

In [6]: len(fuzz.lcp(hello.hex(), helo.hex())) # Longest Common Prefix                                                      
Out[6]: 213

In [7]: len(fuzz.lcp(hello.hex(), hellooo.hex()))                                                        
Out[7]: 12

In [8]: len(fuzz.lcp(hello.hex(), salut.hex()))                                                          
Out[8]: 0

After a small test over Wikidata labels, it seems to give good results:

$ time python fuzz.py query 10 france
* most similar according to bbk fuzzbuzz
** france    0
** farrance      -2
** freande   -2
** defrance      -2

real    0m0.054s

$ time python fuzz.py query 10 frnace
* most similar according to bbk fuzzbuzz
** farnace   -1
** france    -2
** fernacre      -2

real    0m0.060s

$ time python fuzz.py query 10 beglium
* most similar according to bbk fuzzbuzz
** belgium   -2

real    0m0.047s


$ time python fuzz.py query 10 belgium
* most similar according to bbk fuzzbuzz
** belgium   0
** ajbelgium     -2

real    0m0.059s

$ time python fuzz.py query 10 begium
* most similar according to bbk fuzzbuzz
** belgium   -1
** beijum    -2

real    0m0.047s

Here is an implementation:

from itertools import product
from string import ascii_lowercase

HASH_SIZE = 2**10
BBKH_LENGTH = int(HASH_SIZE * 2 / 8)

chars = ascii_lowercase + "$"
ONE_HOT_ENCODER = sorted([''.join(x) for x in product(chars, chars)])


def ngram(string, n):
    return [string[i:i+n] for i in range(len(string)-n+1)]


def integer2booleans(integer):
    return [x == '1' for x in bin(integer)[2:].zfill(HASH_SIZE)]


def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


def merkletree(booleans):
    """Serialize an OR merkle tree of the input bits, leaves last."""
    assert len(booleans) == HASH_SIZE
    length = (2 * len(booleans) - 1)
    out = [False] * length
    index = length - 1
    booleans = list(reversed(booleans))
    while len(booleans) > 1:
        for boolean in booleans:
            out[index] = boolean
            index -= 1
        new = []
        for (right, left) in chunks(booleans, 2):
            value = right or left
            new.append(value)
        booleans = new
    out[index] = booleans[0]  # root of the tree
    return out


def bbkh(string):
    integer = 0
    string = "$" + string + "$"
    for gram in ngram(string, 2):
        hotbit = ONE_HOT_ENCODER.index(gram)
        hotinteger = 1 << hotbit
        integer = integer | hotinteger
    booleans = integer2booleans(integer)
    tree = merkletree(booleans)
    fuzz = ''.join('1' if x else '0' for x in tree)
    buzz = int(fuzz, 2)
    return buzz.to_bytes(BBKH_LENGTH, 'big')


def lcp(a, b):
    """Longest Common Prefix between a and b"""
    out = []
    for x, y in zip(a, b):
        if x == y:
            out.append(x)
        else:
            break
    return ''.join(out)
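To see why this layout suits an OKVS, here is a compact, self-contained variant of the same construction (the root-first serialization and all names here are mine; it is not byte-compatible with bbkh above) together with the key layout for range scans:

```python
import bisect
from itertools import product
from string import ascii_lowercase

# One-hot encode the bigrams, then serialize an OR merkle tree root-first,
# so coarse bits come first and similar strings share long key prefixes.
HASH_SIZE = 2 ** 10
CHARS = ascii_lowercase + "$"
ENCODER = sorted(a + b for a, b in product(CHARS, CHARS))

def merkle_or(bits):
    # Build OR levels bottom-up, then emit them top-down (root first).
    levels = [bits]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] | prev[i + 1] for i in range(0, len(prev), 2)])
    return [bit for level in reversed(levels) for bit in level]

def key(word):
    integer = 0
    padded = "$" + word.lower() + "$"
    for i in range(len(padded) - 1):
        integer |= 1 << ENCODER.index(padded[i:i + 2])
    bits = [(integer >> i) & 1 for i in range(HASH_SIZE)]
    return ''.join('01'[b] for b in merkle_or(bits))

def lcp(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

# Store key(word) + separator + word in the OKVS; a query is then a
# cursor seek to key(query) followed by scanning the nearby entries.
index = sorted(key(w) + " " + w for w in ["hello", "helo", "salut"])
pos = bisect.bisect_left(index, key("hellooo"))
```

Because a differing bigram only perturbs bits along one path of the tree, "hello" and "helo" share a much longer key prefix than "hello" and "salut", which is exactly what makes neighboring keys good fuzzy-match candidates.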

Note: computing a simhash of the input string instead only works with a bag of lemmas or stems; simhash is really meant for finding similar documents.



Source: https://stackoverflow.com/questions/58065020/how-to-do-fuzzy-string-matching-of-bigger-than-memory-dictionary-in-an-ordered-k
