Good algorithm and data structure for looking up words with missing letters?

不思量自难忘° 2020-12-07 07:12

I need to write an efficient algorithm for looking up words with missing letters in a dictionary, and I want to get back the set of possible words.

For example, if I have th??

20 answers

臣服心动 (original poster)

2020-12-07 07:27
Given the current limitations:
- There will be up to 2 question marks
- When there are 2 question marks, they appear together
- There are ~100,000 words in the dictionary, average word length is 6.
I have two viable solutions for you:

The fast solution: HASH

You can use a hash whose keys are your words with up to two '?' in them, and whose values are lists of the fitting words. This hash will have around 100,000 + 100,000*6 + 100,000*5 = 1,200,000 entries: one per word, one per position of a single '?', and one per position of a pair of '?' (since the two question marks are adjacent, you only need the position of the first one). Each entry can hold a list of words, or a list of pointers to the existing words. If you store pointers, and we assume that on average fewer than 20 words match each key with two '?', then the additional memory is less than 20 * 1,200,000 = 24,000,000 pointers.

If each pointer is 4 bytes, then the memory requirement here is (24,000,000+1,200,000)*4 bytes = 100,800,000 bytes ~= 96 megabytes.
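The estimate above can be reproduced in a few lines. This is only a restatement of the arithmetic; the variable names are made up for illustration, and the figures (100,000 words, average length 6, 20 matches per key, 4-byte pointers) are the assumptions stated in the answer.

```python
WORDS = 100_000        # dictionary size assumed in the answer
AVG_LEN = 6            # average word length assumed in the answer
MATCHES_PER_KEY = 20   # assumed upper bound on words per key
POINTER_BYTES = 4      # assumed pointer size

# One key for the word itself, AVG_LEN keys with a single '?',
# and AVG_LEN - 1 keys with an adjacent pair '??'.
entries = WORDS * (1 + AVG_LEN + (AVG_LEN - 1))
pointers = MATCHES_PER_KEY * entries
total_bytes = (pointers + entries) * POINTER_BYTES

print(entries)                # 1200000
print(total_bytes)            # 100800000
print(total_bytes / 2**20)    # a little over 96
```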

To sum up this solution:
- Memory Consumption: ~96 MB
- Time for each search: calculating a hash function, and following a pointer. O(1)
Note: if you want to use a hash of a smaller size, you can, but then it is better to save a balanced search tree in each entry instead of a linked list, for better performance.
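A minimal Python sketch of the hash approach, assuming the limitations listed above (at most two '?', and adjacent when there are two). The function names `patterns`, `build_index` and `lookup` are illustrative, and Python's built-in `dict` stands in for the hash.

```python
from collections import defaultdict

def patterns(word):
    """Yield the word itself, every variant with one '?',
    and every variant with two adjacent '?'."""
    yield word
    for i in range(len(word)):
        yield word[:i] + "?" + word[i + 1:]
    for i in range(len(word) - 1):
        yield word[:i] + "??" + word[i + 2:]

def build_index(words):
    """Map every pattern to the set of words that fit it."""
    index = defaultdict(set)
    for word in words:
        for pattern in patterns(word):
            index[pattern].add(word)
    return index

def lookup(index, query):
    """One hash lookup; returns the matches in sorted order."""
    return sorted(index.get(query, ()))

index = build_index(["space", "slice", "spice", "these", "there"])
print(lookup(index, "s??ce"))  # ['slice', 'space', 'spice']
print(lookup(index, "the?e"))  # ['there', 'these']
```

All of the work happens at build time; a query is a single dictionary lookup, which is where the O(1) search time comes from.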

The space savvy, but still very fast solution: TRIE variation

This solution uses the following observation:

If the '?' signs were at the end of the word, trie would be an excellent solution.

The search in the trie would take time proportional to the length of the word, and for the last couple of letters, a DFS traversal would bring back all of the endings. This is a very fast and very memory-savvy solution.

So let's use this observation to build something that works exactly like this.

You can think about every word you have in the dictionary, as a word ending with @ (or any other symbol that does not exist in your dictionary). So the word 'space' would be 'space@'. Now, if you rotate each of the words, with the '@' sign, you get the following:
```
space@, pace@s, ace@sp, *ce@spa*, e@spac
```
(no @ as first letter).

If you insert all of these variations into a TRIE, you can easily find the words you are seeking in time proportional to the length of the word, by 'rotating' your query.

Example: You want to find all words that fit 's??ce' (one of them is space, another is slice). You build the word s??ce@ and rotate it so that the '?' signs are at the end, i.e. 'ce@s??'.

All of the rotation variations exist inside the trie, and specifically 'ce@spa' (marked with * above). After the beginning is found, you need to go over all of the continuations of the appropriate length and save them. Then you need to rotate them again so that the @ is the last letter, and voilà, you have all of the words you were looking for!
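The rotation idea can be sketched in Python as follows. This is one possible reading of the description, not a reference implementation: the names `Node`, `rotations`, `build_trie`, `walk` and `search` are invented here, and the handling of queries that end in '?' (where rotating would put '@' first, a rotation that is not stored) is an assumption of this sketch: such queries are searched unrotated and must be followed by '@'.

```python
END = "@"  # assumed never to occur in a dictionary word

class Node:
    def __init__(self):
        self.children = {}
        self.terminal = False  # True if a whole rotation ends here

def rotations(word):
    """Rotations of word + '@' that do not start with '@'."""
    marked = word + END
    return [marked[i:] + marked[:i] for i in range(len(word))]

def build_trie(words):
    root = Node()
    for word in words:
        for rotation in rotations(word):
            node = root
            for ch in rotation:
                node = node.children.setdefault(ch, Node())
            node.terminal = True
    return root

def walk(node, text):
    """Follow text from node; return the node reached, or None."""
    for ch in text:
        node = node.children.get(ch)
        if node is None:
            return None
    return node

def search(root, query):
    """Match a query containing at most one run of adjacent '?'."""
    marked = query + END
    first, last = marked.find("?"), marked.rfind("?")
    if first == -1:
        head, gap, tail = marked, 0, ""
    elif last == len(query) - 1:
        # The '?' run already trails the word: keep the order and
        # require '@' right after the wildcards (sketch assumption).
        head, gap, tail = marked[:first], last - first + 1, END
    else:
        # Rotate so the '?' run comes last, e.g. 's??ce@' -> 'ce@s??'.
        head = marked[last + 1:] + marked[:first]
        gap, tail = last - first + 1, ""
    start = walk(root, head)
    if start is None:
        return []
    found = []

    def dfs(node, depth, filled):
        if depth == gap:
            end = walk(node, tail)
            if end is not None and end.terminal:
                found.append(head + filled + tail)
            return
        for ch, child in node.children.items():
            if ch != END:  # a wildcard stands for a letter, never '@'
                dfs(child, depth + 1, filled + ch)

    dfs(start, 0, "")
    # Rotate each match back so that '@' is last, then drop it.
    words = []
    for rotation in found:
        at = rotation.index(END)
        words.append(rotation[at + 1:] + rotation[:at])
    return sorted(words)

root = build_trie(["space", "slice", "spice", "splice"])
print(search(root, "s??ce"))  # ['slice', 'space', 'spice']
```

The `terminal` flag matters because a rotation of a short word can be a prefix of a rotation of a longer one ('ce@spl' is a prefix of 'ce@spli' from 'splice'), so reaching the right depth is not enough to declare a match.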

To sum up this solution:
- Memory Consumption: For each word, all of its rotations appear in the trie, so on average about 6 times the size of the dictionary is stored in the trie. The trie itself takes around 3 times (just guessing...) the space of the data stored in it. So the total space necessary for this trie is 6*3*100,000 = 1,800,000 words ~= 6.8 megabytes.
- Time for each search:
  - rotating the word: O(word length)
  - seeking the beginning in the trie: O(word length)
  - going over all of the endings: O(number of matches)
  - rotating the endings: O(total length of answers)
  Overall, it is very fast: the cost depends on the word length times a small constant.

To sum up...

The second choice has a great time/space complexity, and would be the best option for you to use. There are a few problems with the second solution (in which case you might want to use the first solution):
- More complex to implement. I'm not sure whether there are programming languages with tries built in out of the box. If there aren't, you'll need to implement it yourself...
- Does not scale well. If tomorrow you decide that you need your question marks spread all over the word, and not necessarily joined together, you'll need to think hard of how to fit the second solution to it. In the case of the first solution - it is quite easy to generalize.
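As a rough illustration of that last point, the hash approach could be generalized to question marks at arbitrary positions by generating keys for every combination of masked positions. This is a sketch under assumptions: the names `masked_variants` and `build_index` and the `max_wildcards` parameter are hypothetical, not something from the original answer.

```python
from collections import defaultdict
from itertools import combinations

def masked_variants(word, max_wildcards=2):
    """Yield the word with every choice of up to max_wildcards
    positions replaced by '?', adjacent or not."""
    for k in range(max_wildcards + 1):
        for positions in combinations(range(len(word)), k):
            chars = list(word)
            for p in positions:
                chars[p] = "?"
            yield "".join(chars)

def build_index(words, max_wildcards=2):
    index = defaultdict(set)
    for word in words:
        for key in masked_variants(word, max_wildcards):
            index[key].add(word)
    return index

index = build_index(["space", "spice", "slice", "spate"])
print(sorted(index.get("s?a?e", ())))  # ['space', 'spate']
```

The price is more keys per word: a 6-letter word now produces 1 + 6 + 15 = 22 keys instead of 12, so the memory estimate for the hash grows accordingly.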