Good algorithm and data structure for looking up words with missing letters?

不思量自难忘° 2020-12-07 07:12

So I need to write an efficient algorithm for looking up words with missing letters in a dictionary, and I want it to return the set of possible matching words.

For example, if I have th??, the lookup should return words such as that, them, then, they, and this.

20 Answers
  • 2020-12-07 07:23

    My first post had an error that Jason found: it did not work well when the ?? was at the beginning. I have now borrowed the cyclic shifts from Anna.

    My solution: introduce an end-of-word character (@) and store all cyclic shifts of each word in sorted arrays, one sorted array per word length. When looking for "th??e@", shift the string cyclically to move the ?-marks to the end (obtaining "e@th??"), pick the array containing words of length 5, and binary-search for the first entry occurring at or after the string "e@th". All consecutive entries sharing that prefix match, i.e., we will find "e@thes" (these), "e@thos" (those), etc.

    The solution has O(log N) lookup time, where N is the size of the dictionary, and it expands the data by a factor of about 6 (the average word length), since every rotation of every word is stored.
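
    For illustration, here is a minimal Ruby sketch of that idea (my own code, not the poster's; it assumes every query contains exactly one run of ?-marks):

    # Build one sorted array of cyclic rotations per word length.
    def build_rotation_index(words)
      index = Hash.new { |h, k| h[k] = [] }
      words.each do |w|
        s = w + "@"                                # end-of-word marker
        s.length.times { |i| index[w.length] << s[i..-1] + s[0...i] }
      end
      index.each_value(&:sort!)
      index
    end

    # Rotate the query so the ?-run is at the end, binary-search the prefix,
    # then un-rotate every entry in the run that shares the prefix.
    def rotation_lookup(index, query)
      arr = index[query.length]
      s = query + "@"
      i = s.rindex("?") + 1                        # position after the ?-run
      rotated = s[i..-1] + s[0...i]                # "th??e@" -> "e@th??"
      prefix = rotated[0, rotated.index("?")]      # "e@th"
      lo = (0...arr.size).bsearch { |j| arr[j] >= prefix } or return []
      results = []
      while lo < arr.size && arr[lo].start_with?(prefix)
        at = arr[lo].index("@")                    # rotate back: '@' goes last
        results << arr[lo][(at + 1)..-1] + arr[lo][0...at]
        lo += 1
      end
      results
    end

    index = build_rotation_index(%w[these those them space slice])
    p rotation_lookup(index, "th??e")   # => ["these", "those"]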

  • 2020-12-07 07:24

    I believe in this case it is best to just use a flat file where each word stands on its own line. With this you can conveniently use the power of regular-expression search, which is highly optimized and will probably beat any data structure you could devise yourself for this problem.

    Solution #1: Using Regex

    This is working Ruby code for this problem:

    def query(str, data)
      # Turn the pattern into a regex: each '?' matches exactly one character,
      # anchored to a whole line of the word list.
      r = Regexp.new("^#{str.gsub("?", ".")}$")
      idx = 0
      begin
        idx = data.index(r, idx)      # find the next matching line
        if idx
          yield data[idx, str.size]   # the match is as long as the pattern
          idx += str.size + 1         # skip past the word and its newline
        end
      end while idx
    end
    
    start_time = Time.now
    query("?r?te", File.read("wordlist.txt")) do |w|
      puts w
    end
    puts Time.now - start_time
    

    The file wordlist.txt contains 45425 words (downloadable here). The program's output for query ?r?te is:

    brute
    crate
    Crete
    grate
    irate
    prate
    write
    wrote
    0.013689
    

    So it takes only about 14 milliseconds to both read the whole file and find all matches in it. And it scales very well for all kinds of query patterns, even ones where a trie is very slow:

    query ????????????????e

    counterproductive
    indistinguishable
    microarchitecture
    microprogrammable
    0.018681
    

    query ?h?a?r?c?l?

    theatricals
    0.013608
    

    This looks fast enough for me.

    Solution #2: Regex with Prepared Data

    If you want to go even faster, you can split the wordlist into strings that contain words of equal lengths and just search the correct one based on your query length. Replace the last 5 lines with this code:

    def query_split(str, data)
      # Search only the chunk that holds words of the query's length.
      query(str, data[str.length]) do |w|
        yield w
      end
    end

    # prepare data: one big string per word length (each line keeps its "\n")
    data = Hash.new("")
    File.read("wordlist.txt").each_line do |w|
      data[w.length - 1] += w
    end
    
    # use prepared data for query
    start_time = Time.now
    query_split("?r?te", data) do |w|
      puts w
    end
    puts Time.now - start_time
    

    Building the data structure now takes about 0.4 seconds, but all queries are about 10 times faster (depending on the number of words of that length):

    • ?r?te 0.001112 sec
    • ?h?a?r?c?l? 0.000852 sec
    • ????????????????e 0.000169 sec

    Solution #3: One Big Hashtable (Updated Requirements)

    Since you have changed your requirements, you can easily expand on your idea to use just one big hashtable that contains all precalculated results. But instead of working around collisions yourself you could rely on the performance of a properly implemented hashtable.

    Here I create one big hashtable, where each possible query maps to a list of its results:

    def create_big_hash(data)
      h = Hash.new { |hash, key| hash[key] = [] }
      data.each_line do |l|
        w = l.strip
        # add all patterns with one ?
        w.length.times do |i|
          q = String.new(w)
          q[i] = "?"
          h[q].push w
        end
        # add all patterns with two adjacent ??
        (w.length - 1).times do |i|
          q = String.new(w)
          q[i, 2] = "??"
          h[q].push w
        end
      end
      h
    end
    
    # prepare data    
    t = Time.new
    h = create_big_hash(File.read("wordlist.txt"))
    puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"
    
    # use prepared data for query
    t = Time.new
    h["?ood"].each do |w|
      puts w
    end
    puts (Time.new - t)
    

    Output is

    4.960255 sec preparing data
    616745 entries in big hash
    food
    good
    hood
    mood
    wood
    2.0e-05
    

    The query performance is O(1); it is just a lookup in the hashtable. The time 2.0e-05 is probably below the timer's precision. Running the query 1000 times, I get an average of 1.958e-6 seconds per query. To get it faster, I would switch to C++ and use Google Sparse Hash, which is extremely memory-efficient and fast.

    Solution #4: Get Really Serious

    All of the above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:

    • Tries for Approximate String Matching - If well implemented, tries can have very compact memory requirements (50% less space than the dictionary itself), and are very fast.
    • Agrep - A Fast Approximate Pattern-Matching Tool - Agrep is based on a new efficient and flexible algorithm for approximate string matching.
    • Google Scholar search for approximate string matching - More than enough to read on this topic.
  • 2020-12-07 07:27

    Given the current limitations:

    • There will be up to 2 question marks
    • When there are 2 question marks, they appear together
    • There are ~100,000 words in the dictionary, average word length is 6.

    I have two viable solutions for you:

    The fast solution: HASH

    You can use a hash whose keys are your words with up to two '?' characters and whose values are lists of fitting words. This hash will have around 100,000 + 100,000*6 + 100,000*5 = 1,200,000 entries (with two question marks you only need the position of the first one, since they are adjacent). Each entry can hold a list of words, or a list of pointers to the existing words. If you save a list of pointers, and we assume that on average fewer than 20 words match each pattern with two '?', then the additional memory is less than 20 * 1,200,000 = 24,000,000 pointers.

    If each pointer is 4 bytes, then the memory requirement here is (24,000,000 + 1,200,000) * 4 bytes = 100,800,000 bytes ≈ 96 megabytes.

    To sum up this solution:

    • Memory Consumption: ~96 MB
    • Time for each search: calculating a hash function, and following a pointer. O(1)

    Note: if you want to use a hash of a smaller size, you can, but then it is better to store a balanced search tree in each entry instead of a linked list, for better performance.
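
    For illustration, a minimal sketch of this hash in Ruby (my own code, not the poster's; wordlist.txt is an assumption). The "pointers" are indices into the word array:

    WORDS = File.readlines("wordlist.txt", chomp: true)

    TABLE = Hash.new { |h, k| h[k] = [] }
    WORDS.each_with_index do |w, wi|
      TABLE[w] << wi                       # the word itself, no '?'
      w.length.times do |i|                # all patterns with one '?'
        q = w.dup; q[i] = "?"; TABLE[q] << wi
      end
      (w.length - 1).times do |i|          # all patterns with two adjacent '?'
        q = w.dup; q[i, 2] = "??"; TABLE[q] << wi
      end
    end

    def lookup(pattern)
      # fetch avoids inserting empty lists for unseen patterns
      TABLE.fetch(pattern, []).map { |wi| WORDS[wi] }
    end

    p lookup("?ood")   # e.g. ["food", "good", "hood", "mood", "wood"]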

    The space savvy, but still very fast solution: TRIE variation

    This solution uses the following observation:

    If the '?' signs were at the end of the word, trie would be an excellent solution.

    The search in the trie would take time proportional to the length of the word, and for the last couple of letters a DFS traversal would yield all of the endings. A very fast and very memory-savvy solution.

    So let's use this observation to build something that works exactly like this.

    You can think of every word in the dictionary as a word ending with @ (or any other symbol that does not appear in your dictionary). So the word 'space' would be 'space@'. Now, if you take the cyclic rotations of each such word, you get the following:

    space@, pace@s, ace@sp, *ce@spa*, e@spac
    

    (the rotation starting with @ is omitted above).

    If you insert all of these variations into a trie, you can easily find the word you are seeking in time proportional to the length of the word, by 'rotating' your word.

    Example: you want to find all words that fit 's??ce' (one of them is space, another is slice). You build the word s??ce@ and rotate it so that the ?-signs are at the end, obtaining 'ce@s??'.

    All of the rotation variations exist inside the trie, and specifically 'ce@spa' (marked with * above). Once the beginning is found, you go over all of the continuations of the appropriate length and save them. Then you rotate them again so that the @ is the last letter, and voilà: you have all of the words you were looking for!
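
    A minimal Ruby sketch of such a rotation trie (my own code, not from this answer; it stores all rotations, including the one beginning with @, so that patterns ending in ? work too):

    class TrieNode
      attr_accessor :children, :word
      def initialize
        @children = {}   # letter => TrieNode
        @word = nil      # original word, set on the node that ends a rotation
      end
    end

    def insert_rotations(root, word)
      s = word + "@"
      s.length.times do |i|
        rot = s[i..-1] + s[0...i]
        node = root
        rot.each_char { |c| node = (node.children[c] ||= TrieNode.new) }
        node.word = word                   # remember the original spelling
      end
    end

    def search(root, query)
      s = query + "@"
      i = s.rindex("?") + 1                # rotate the ?-run to the end
      rot = s[i..-1] + s[0...i]            # "s??ce@" -> "ce@s??"
      prefix = rot[0, rot.index("?")]      # "ce@s"
      node = root
      prefix.each_char { |c| node = node.children[c] or return [] }
      collect(node, rot.length - prefix.length)
    end

    # DFS over all continuations of exactly the remaining depth.
    def collect(node, depth)
      return [node.word].compact if depth.zero?
      node.children.values.flat_map { |child| collect(child, depth - 1) }
    end

    root = TrieNode.new
    %w[space slice spice them].each { |w| insert_rotations(root, w) }
    p search(root, "s??ce")   # => ["space", "spice", "slice"] (order may vary)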

    To sum up this solution:

    • Memory Consumption: all rotations of each word appear in the trie, so the trie stores about 6 times the raw text (the average word length). The trie's overhead is perhaps another factor of 3 (just guessing...), so the total space needed is about 6 * 3 * 100,000 = 1,800,000 word-slots ≈ 6.8 megabytes.

    • Time for each search:

      • rotating the word: O(word length)
      • seeking the beginning in the trie: O(word length)
      • going over all of the endings: O(number of matches)
      • rotating the endings: O(total length of answers)

      To sum up, it is very, very fast: the cost is the word length times a small constant.

    To sum up...

    The second choice has a great time/space complexity, and would be the best option for you to use. There are a few problems with the second solution (in which case you might want to use the first solution):

    • More complex to implement. I'm not sure whether any programming language ships with a trie out of the box; if not, you'll need to implement it yourself...
    • Does not scale well. If tomorrow you decide that you need your question marks spread all over the word, and not necessarily adjacent, you'll need to think hard about how to fit the second solution to it. The first solution, in contrast, is quite easy to generalize.
  • 2020-12-07 07:27

    The data structure you want is called a trie - see the Wikipedia article for a short summary.

    A trie is a tree structure where the paths through the tree form the set of all the words you wish to encode - each node can have up to 26 children, one for each possible letter at the next character position. See the diagram in the Wikipedia article to see what I mean.

  • 2020-12-07 07:28

    To me this problem sounds like a good fit for a trie data structure. Enter the entire dictionary into your trie, and then look the word up. For a missing letter you have to try all sub-tries, which is relatively easy to do with a recursive approach.

    EDIT: I wrote a simple implementation of this in Ruby just now: http://gist.github.com/262667.
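
    For illustration, a minimal recursive sketch along those lines (my own code, not the linked gist; the trie is a nested Hash, and :end marks a complete word):

    def insert(trie, word)
      node = word.each_char.reduce(trie) { |n, c| n[c] ||= {} }
      node[:end] = true
    end

    def match(trie, pattern, prefix = "", results = [])
      if pattern.empty?
        results << prefix if trie[:end]
        return results
      end
      c, rest = pattern[0], pattern[1..-1]
      if c == "?"
        # wildcard: recurse into every child sub-trie
        trie.each do |ch, sub|
          match(sub, rest, prefix + ch, results) unless ch == :end
        end
      elsif trie[c]
        match(trie[c], rest, prefix + c, results)
      end
      results
    end

    trie = {}
    %w[these those them they trees].each { |w| insert(trie, w) }
    p match(trie, "th??e")   # => ["these", "those"]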

  • 2020-12-07 07:28

    A regex-based solution will consider every possible value in your dictionary. If performance is your largest constraint, an index could be built to speed it up considerably.

    You could start with one index per word length, each mapping position=character pairs to the set of words with that character at that position (positions counted from 1). For length-5 words, for example, 2=r : {write, wrote, drate, arete, arite}, 3=o : {wrote, float, group}, etc. To get the possible matches for the original query, say ?ro??, the word sets for 2=r and 3=o would be intersected, resulting in {wrote, group} in this case.

    This assumes that the only wildcard is a single character and that the word length is known up front. If these are not valid assumptions, I can recommend n-gram based text matching, such as that discussed in this paper.
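
    A minimal Ruby sketch of such a positional index (my own code; positions are 1-based to match the example above):

    require 'set'

    def build_position_index(words)
      index = Hash.new { |h, len| h[len] = Hash.new { |g, key| g[key] = Set.new } }
      words.each do |w|
        w.each_char.with_index(1) { |c, pos| index[w.length][[pos, c]] << w }
      end
      index
    end

    def position_lookup(index, pattern)
      by_pos = index[pattern.length]
      keys = pattern.each_char.with_index(1).reject { |c, _| c == "?" }
      return [] if keys.empty?                     # all-wildcard query
      keys.map { |c, pos| by_pos[[pos, c]] }.reduce(:&).to_a
    end

    index = build_position_index(%w[write wrote float group brute])
    p position_lookup(index, "?ro??")   # => ["wrote", "group"]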
