Good algorithm and data structure for looking up words with missing letters?

不思量自难忘° · asked 2020-12-07 07:12

So I need to write an efficient algorithm for looking up words with missing letters in a dictionary, and I want back the set of possible words.

For example, if I have th??, I would want to get back words such as "then", "them", "this", and "that".

20 answers
  • 2020-12-07 07:39

    Here's how I'd do it:

    1. Concatenate the words of the dictionary into one long String separated by a non-word character.
    2. Put all words into a TreeMap, where the key is the word and the value is the offset of the start of the word in the big String.
    3. Find the base of the search string; i.e. the largest leading substring that doesn't include a '?'.
    4. Use TreeMap.ceilingKey(base) and TreeMap.lowerKey(next(base)) to find the range within the String between which matches will be found. (The next method computes the smallest string that sorts after every word beginning with base, using the same number or fewer characters; e.g. next("aa") is "ab", and next("az") is "b".)
    5. Create a regex for the search string and use Matcher.find() to search the substring corresponding to the range.

    Steps 1 and 2 are done beforehand, giving a data structure that uses O(N) space, where N is the number of words, and takes O(N log N) time to build.

    This approach degenerates to a brute-force regex search of the entire dictionary when the '?' appears in the first position, but the further to the right it is, the less matching needs to be done.
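
    Here is a rough Python sketch of steps 1-5, using bisect over a sorted word list in place of Java's TreeMap; the names (next_key, find) are mine, not a reference implementation, and it assumes lowercase a-z words:

    import bisect
    import re

    # Steps 1-2 (done beforehand): the big string plus parallel sorted
    # arrays of words and their offsets within it.
    words = sorted(open('/usr/share/dict/words').read().split())
    big = '#' + '#'.join(words) + '#'
    offsets = []                     # offsets[i] = start of words[i] in big
    pos = 1
    for w in words:
        offsets.append(pos)
        pos += len(w) + 1

    def next_key(base):
        # Smallest string sorting after everything prefixed by base:
        # next_key('aa') == 'ab', next_key('az') == 'b'.
        base = base.rstrip('z')
        return base[:-1] + chr(ord(base[-1]) + 1) if base else None

    def find(pattern):
        base = pattern.split('?')[0]             # step 3: '?'-free prefix
        lo = bisect.bisect_left(words, base)     # step 4: candidate range
        nk = next_key(base)
        hi = bisect.bisect_left(words, nk) if nk is not None else len(words)
        start = offsets[lo] - 1 if lo < len(words) else len(big)
        end = offsets[hi] if hi < len(words) else len(big)
        # Step 5: regex over just that slice; each '?' becomes one letter.
        rx = re.compile('#(' + pattern.replace('?', '[a-z]') + ')(?=#)')
        return [m.group(1) for m in rx.finditer(big, start, end)]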

    EDIT:

    To improve the performance in the case where '?' is the first character, create a secondary lookup table that records the start/end offsets of runs of words whose second character is 'a', 'b', and so on. This can be used when the first non-'?' is the second character. You can use a similar approach for cases where the first non-'?' is the third character, the fourth character and so on, but you end up with larger and larger numbers of smaller and smaller runs, and eventually this "optimization" becomes ineffective. A sketch of the second-character table follows.
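
    For illustration, a sketch of that table over the sorted words array from the sketch above (runs_by_second is an invented name):

    from collections import defaultdict
    from itertools import groupby

    # For each second letter, the (start, end) index ranges of consecutive
    # words in the sorted list sharing that second letter.
    runs_by_second = defaultdict(list)
    i = 0
    for key, grp in groupby(words, key=lambda w: w[1:2]):
        n = sum(1 for _ in grp)
        if key:                       # skip one-letter words
            runs_by_second[key].append((i, i + n))
        i += n

    # A query like '?e??' then only needs to scan the runs under 'e'.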

    An alternative approach, which requires significantly more space but is faster in most cases, is to prepare the dictionary data structure as above for all rotations of the words in the dictionary. For instance, the first rotation would consist of all words of 2 characters or more with the first character of the word moved to the end; the second rotation would be words of 3 characters or more with the first two characters moved to the end; and so on. Then, to do the search, look for the longest sequence of non-'?' characters in the search string. If the index of the first character of this substring is N, use the Nth rotation to find the ranges, and search in the Nth rotation word list.

  • 2020-12-07 07:40

    Assuming you have enough memory, you could build a giant hash map to provide the answer in constant time. Note that a word of length L generates 2^L patterns, so this is only practical for modest word lengths. Here is a quick example in Python:

    all_words = open("english-words").read().split()
    big_map = {}

    def populate_map(word):
        # Each bit of mask picks a position to blank out with '?', so
        # every one of the 2**len(word) patterns this word matches gets
        # an entry pointing back at the word.
        for mask in range(2 ** len(word)):
            candidate = list(word)
            for j in range(len(word)):
                if mask & (1 << j):
                    candidate[j] = "?"
            big_map.setdefault("".join(candidate), set()).add(word)

    def run():
        for word in all_words:
            populate_map(word)

    run()

    >>> big_map["y??r"]
    {'your', 'year'}
    >>> big_map["yo?r"]
    {'your'}
    >>> big_map["?o?r"]
    {'four', 'poor', 'door', 'your', 'hour'}
    
  • 2020-12-07 07:41

    If you seriously want something on the order of a billion searches per second (though I can't imagine why anyone outside of someone building the next grandmaster Scrabble AI or a huge web service would want that), I recommend spawning [number of cores on your machine] worker threads plus a master thread that delegates work to all of them. Then apply the best solution you have found so far and hope you don't run out of memory. A sketch of that delegation is below.
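
    A minimal sketch of that fan-out in Python, assuming a plain word list; it uses processes rather than threads, since CPython threads will not parallelize a CPU-bound scan, and scan_chunk/parallel_find are illustrative names:

    import os
    import re
    from concurrent.futures import ProcessPoolExecutor

    def scan_chunk(args):
        # Worker: brute-force regex scan over one slice of the dictionary.
        pattern, chunk = args
        rx = re.compile(pattern.replace('?', '[a-z]'))
        return [w for w in chunk if rx.fullmatch(w)]

    def parallel_find(pattern, words, workers=None):
        # Master: split the dictionary into one chunk per core and
        # delegate each chunk to a worker process.
        workers = workers or os.cpu_count() or 1
        size = -(-len(words) // workers)          # ceiling division
        chunks = [words[i:i + size] for i in range(0, len(words), size)]
        with ProcessPoolExecutor(workers) as ex:
            parts = ex.map(scan_chunk, [(pattern, c) for c in chunks])
        return [w for part in parts for w in part]

    if __name__ == '__main__':   # needed on platforms that spawn processes
        words = open('/usr/share/dict/words').read().split()
        print(parallel_find('th??', words))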

    One idea I had: you can speed up some cases by preparing dictionaries sliced down by letter; then, if you know the first letter of the pattern, you can search a much smaller haystack.

    Another thought: are you trying to brute-force something, perhaps building a database or word list for Scrabble?

  • 2020-12-07 07:42

    Anna's second solution is the inspiration for this one.

    First, load all the words into memory and divide the dictionary into sections based on word length.

    For each length, make n copies of an array of pointers to the words. Sort each array so that the strings appear in order when rotated by a certain number of letters. For example, suppose the original list of 5-letter words is [plane, apple, space, train, happy, stack, hacks]. Then your five arrays of pointers will be:

    rotated by 0 letters: [apple, hacks, happy, plane, space, stack, train]
    rotated by 1 letter:  [hacks, happy, plane, space, apple, train, stack]
    rotated by 2 letters: [space, stack, train, plane, hacks, apple, happy]
    rotated by 3 letters: [space, stack, train, hacks, apple, plane, happy]
    rotated by 4 letters: [apple, plane, space, stack, train, hacks, happy]
    

    (Instead of pointers, you can use integers identifying the words, if that saves space on your platform.)

    To search, just ask how much you would have to rotate the pattern so that the question marks appear at the end. Then you can binary search in the appropriate list.

    If you need to find matches for ??ppy, you would have to rotate that by 2 to make ppy??. So look in the array that is in order when rotated by 2 letters. A quick binary search finds that "happy" is the only match.

    If you need to find matches for th??g, you would have to rotate that by 4 to make gth??. So look in array 4, where a binary search finds that there are no matches.

    This works no matter how many question marks there are, as long as they all appear together.

    Space required in addition to the dictionary itself: For words of length N, this requires space for (N times the number of words of length N) pointers or integers.

    Time per lookup: O(log n) where n is the number of words of the appropriate length.

    Implementation in Python:

    import bisect
    
    class Matcher:
        def __init__(self, words):
            # Sort the words into bins by length.
            bins = []
            for w in words:
                while len(bins) <= len(w):
                    bins.append([])
                bins[len(w)].append(w)
    
            # Make n copies of each list, sorted by rotations.
            for n in range(len(bins)):
                bins[n] = [sorted(bins[n], key=lambda w: w[i:]+w[:i]) for i in range(n)]
            self.bins = bins
    
        def find(self, pattern):
            bins = self.bins
            if len(pattern) >= len(bins):
                return []
    
            # Figure out which array to search. Note: the pattern is
            # assumed to contain at least one '?'.
            r = (pattern.rindex('?') + 1) % len(pattern)
            rpat = (pattern[r:] + pattern[:r]).rstrip('?')
            a = bins[len(pattern)][r]
            if not rpat:
                # Pattern is all '?': every word of this length matches.
                return list(a)
            if '?' in rpat:
                raise ValueError("non-adjacent wildcards in pattern: " + repr(pattern))
    
            # Binary-search the array.
            class RotatedArray:
                def __len__(self):
                    return len(a)
                def __getitem__(self, i):
                    word = a[i]
                    return word[r:] + word[:r]
            ra = RotatedArray()
            start = bisect.bisect(ra, rpat)
            stop = bisect.bisect(ra, rpat[:-1] + chr(ord(rpat[-1]) + 1))
    
            # Return the matches.
            return a[start:stop]
    
    words = open('/usr/share/dict/words').read().split()
    print("Building matcher...")
    m = Matcher(words)  # takes 1-2 seconds, for me
    print("Done.")

    print(m.find("st??k"))
    print(m.find("ov???low"))
    

    On my computer, the system dictionary is 909 KB and this program uses about 3.2 MB of memory in addition to what it takes just to store the words (pointers are 4 bytes). For this dictionary, you could cut that in half by using 2-byte integers instead of pointers, because there are fewer than 2^16 words of each length.

    Measurements: On my machine, m.find("st??k") runs in 0.000032 seconds, m.find("ov???low") in 0.000034 seconds, and m.find("????????????????e") in 0.000023 seconds.

    By writing out the binary search instead of using class RotatedArray and the bisect library, I got those first two numbers down to 0.000016 seconds: twice as fast. Implementing this in C++ would make it faster still.

  • 2020-12-07 07:44

    A Directed Acyclic Word Graph (DAWG) would be the perfect data structure for this problem. It combines the lookup efficiency of a trie (a trie can be seen as a special case of a DAWG) with much better space efficiency: a typical DAWG takes a fraction of the size of the plain text file of words.

    Enumerating the words that meet specific conditions is simple, and the same as in a trie: traverse the graph in depth-first fashion, and wherever the pattern has a '?', follow every outgoing edge. A trie-based sketch of the traversal is below.
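
    For illustration, here is a minimal trie-based sketch of that traversal (the names are mine; a real DAWG would additionally merge shared suffixes, but the search logic is the same):

    class Node:
        def __init__(self):
            self.children = {}
            self.terminal = False

    def build(words):
        root = Node()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, Node())
            node.terminal = True
        return root

    def search(node, pattern, prefix="", out=None):
        # Depth-first walk; '?' branches into every child.
        out = [] if out is None else out
        if not pattern:
            if node.terminal:
                out.append(prefix)
            return out
        ch, rest = pattern[0], pattern[1:]
        if ch == '?':
            for c, child in node.children.items():
                search(child, rest, prefix + c, out)
        elif ch in node.children:
            search(node.children[ch], rest, prefix + ch, out)
        return out

    root = build(["this", "that", "then", "them", "the"])
    print(search(root, "th??"))   # -> ['this', 'that', 'then', 'them']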

  • 2020-12-07 07:44

    First we need a way to compare the query string with a given entry. Let's assume a function using regexes: matches(query, trialstr).

    An O(n) algorithm would be to simply run through every list item (your dictionary would be represented as a list in the program), comparing each to your query string; see the sketch below.
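
    A quick sketch of both, where matches is the regex-based function assumed above and brute_force is an invented name for the linear scan:

    import re

    def matches(query, trialstr):
        # '?' stands for exactly one letter; fullmatch anchors the pattern
        # to the whole word, so lengths must agree too.
        return re.fullmatch(query.replace('?', '[a-z]'), trialstr) is not None

    def brute_force(query, dictionary):
        return [w for w in dictionary if matches(query, w)]

    print(brute_force('b?t', ['bat', 'bit', 'boat', 'but']))  # ['bat', 'bit', 'but']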

    With a bit of pre-calculation, you could improve on this for large numbers of queries by building an additional list of words for each letter, so your dictionary might look like:

    wordsbyletter = { 'a' : ['aardvark', 'abacus', ... ],
                      'b' : ['bat', 'bar', ...],
                      .... }
    

    However, this would be of limited use, particularly if your query string starts with an unknown character. So we can do even better by noting where in a given word a particular letter lies, generating:

    wordsmap = { 'a':{ 0:['aardvark', 'abacus'],
                       1:['bat','bar'],
                       2:['abacus']},
                 'b':{ 0:['bat','bar'],
                       1:['abacus']},
                 ....
               }
    

    As you can see, without using indices you will end up hugely increasing the amount of required storage space - specifically, a dictionary of n words of average length m will require nm² of storage. However, you could now very quickly look up all the words from each set that can match.

    The final optimisation (which you could use off the bat on the naive approach) is to also separate all the words of the same length into separate stores, since you always know how long the word is.

    This version would be O(kx), where k is the number of known letters in the query word and x = x(n) is the time to look up a single item in a dictionary of length n in your implementation (usually O(log n)).

    So with a final dictionary like:

    allmap = { 
               3 : { 
                      'a' : {
                              0 : ['ant','all'],
                              1 : ['bar','pat']
                             },
                      'b' : {
                              0 : ['bar','boy'],
                          ...
                    },
               4 : {
                      'a' : {
                              0 : ['ante'],
                          ....
    

    Then our algorithm is just:

    possiblewords = set()
    firsttime = True
    wordlen = len(query)
    for idx, letter in enumerate(query):
        if letter != '?':
            # All words of the right length with `letter` at position idx.
            matchesthisletter = set(allmap[wordlen][letter][idx])
            if firsttime:
                possiblewords = matchesthisletter
                firsttime = False
            else:
                possiblewords &= matchesthisletter
    

    At the end, the set possiblewords will contain all the matching words. A sketch of building allmap is below.
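
    For completeness, a sketch of building allmap under the same assumptions (build_allmap is an invented helper; it bins by length, then letter, then 0-based position):

    from collections import defaultdict

    def build_allmap(dictionary):
        allmap = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
        for w in dictionary:
            for idx, letter in enumerate(w):
                allmap[len(w)][letter][idx].append(w)
        return allmap

    allmap = build_allmap(['ant', 'all', 'bar', 'boy', 'pat', 'ante'])
    # For query 'b?r', the loop above intersects allmap[3]['b'][0] and
    # allmap[3]['r'][2], leaving {'bar'}.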
