so I need to write an efficient algorithm for looking up words with missing letters in a dictionary and I want the set of possible words.
For example, if I have th??
Given the current limitations:
I have two viable solutions for you:
You can use a hash which keys are your words with up to two '?', and the values are a list of fitting words. This hash will have around 100,000 + 100,000*6 + 100,000*5 = 1,200,000 entries (if you have 2 question marks, you just need to find the place of the first one...). Each entry can save a list of words, or a list of pointers to the existing words. If you save a list of pointers, and we assume that there are on average less than 20 words matching each word with two '?', then the additional memory is less than 20 * 1,200,000 = 24,000,000.
If each pointer size is 4 bytes, then the memory requirement here is (24,000,000+1,200,000)*4 bytes = 100,800,000 bytes ~= 96 mega bytes.
To sum up this solution:
Note: if you want to use a hash of a smaller size, you can, but then it is better to save a balanced search tree in each entry instead of a linked list, for better performance.
This solution uses the following observation:
If the '?' signs were at the end of the word, trie would be an excellent solution.
The search in the trie would search at the length of the word, and for the last couple of letters, a DFS traversal would bring all of the endings. Very fast, and very memory-savvy solution.
So lets use this observation, in order to build something to work exactly like this.
You can think about every word you have in the dictionary, as a word ending with @ (or any other symbol that does not exist in your dictionary). So the word 'space' would be 'space@'. Now, if you rotate each of the words, with the '@' sign, you get the following:
space@, pace@s, ace@sp, *ce@spa*, e@spac
(no @ as first letter).
If you insert all of these variations into a TRIE, you can easily find the word you are seeking at the length of the word, by 'rotating' your word.
Example: You want to find all words that fit 's??ce' (one of them is space, another is slice). You build the word: s??ce@, and rotate it so that the ? sign is in the end. i.e. 'ce@s??'
All of the rotation variations exist inside the trie, and specifically 'ce@spa' (marked with * above). After the beginning is found - you need to go over all of the continuations in the appropriate length, and save them. Then, you need to rotate them again so that the @ is the last letter, and walla - you have all of the words you were looking for!
To sum up this solution:
Memory Consumption: For each word, all of its rotations appear in the trie. On average, *6 of the memory size is saved in the trie. The trie size is around *3 (just guessing...) of the space saved inside it. So the total space necessary for this trie is 6*3*100,000 = 1,800,000 words ~= 6.8 mega bytes.
Time for each search:
To sum up, it is very very fast, and depends on the word length * small constant.
The second choice has a great time/space complexity, and would be the best option for you to use. There are a few problems with the second solution (in which case you might want to use the first solution):