It is a Google interview question, and most answers I find online use a HashMap or similar data structure. I am trying to find a solution using a Trie if possible. Could anybody give me some hints?
Disclaimer: this is not a trie solution, but I still think it's an idea worth exploring.
Create some sort of hash function that only accounts for the letters in a word and not their order (no collisions should be possible except in the case of permutations). For example, ABCD and DCBA both generate the same hash (but ABCDD does not). Generate such a hash table containing every word in the dictionary, using chaining to link collisions (on the other hand, unless you have a strict requirement to find "all" longest words and not just one, you can just drop collisions, which are just permutations, and forgo the whole chaining).
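For concreteness, here is a minimal Python sketch of that idea, using the sorted letters of a word as the order-insensitive key (one simple way to realize the "hash" described above; signature and build_table are names I made up):

    from collections import defaultdict

    def signature(word):
        # Sorting the letters makes the key order-insensitive: permutations
        # like "abcd" and "dcba" map to the same key, while a different
        # multiset of letters ("abcdd") maps to a different one.
        return "".join(sorted(word))

    def build_table(dictionary_words):
        # Map each signature to every word that shares it, i.e. chain the
        # permutation "collisions" as described above.
        table = defaultdict(list)
        for word in dictionary_words:
            table[signature(word)].append(word)
        return table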
Now, if your search set is 4 characters long, for example A, B, C, D, then as a naive search you check the following hashes to see if they are already contained in the dictionary:
hash(A), hash(B), hash(C), hash(D) // 1-combinations
hash(AB), hash(AC), hash(AD), hash(BC), hash(BD), hash(CD) // 2-combinations
hash(ABC), hash(ABD), hash(ACD), hash(BCD) // 3-combinations
hash(ABCD) // 4-combinations
If you search the hashes in that order, the last match you find will be the longest one.
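In code, that naive search could look something like this (assuming the table from the sketch above and a search set given as a list of letters; longest_words is my name):

    from itertools import combinations

    def longest_words(table, letters):
        # Naive search: check every k-combination from smallest to largest.
        # Because sizes only increase, the last hit is a longest match.
        best = None
        for k in range(1, len(letters) + 1):
            for combo in combinations(sorted(letters), k):
                key = "".join(combo)   # combo is already sorted, so this is a valid signature
                if key in table:
                    best = table[key]  # keep overwriting; the final value wins
        return best

For instance, with a toy dictionary table = build_table(["cat", "act", "at", "cart"]) and letters = list("tca"), this returns ["cat", "act"].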
This ends up having a run time that depends on the size of the search set rather than the size of the dictionary. If M is the number of characters in the search set, then the number of hash lookups is the sum (M choose 1) + (M choose 2) + (M choose 3) + ... + (M choose M) = 2^M - 1, essentially the size of the power set of the search set, so it's O(2^M). At first glance this sounds really bad since it's exponential, but to put things in perspective, if your search set has size 10 there will only be around 1000 lookups, which is probably a lot smaller than your dictionary size in a practical real-world scenario. At M = 15 we get about 32,000 lookups, and really, how many English words are there that are longer than 15 characters?
There are two (alternate) ways I can think of to optimize it though:
1) Search for longer matches first, e.g. M-combinations, then (M-1)-combinations, etc. As soon as you find a match, you can stop! Chances are you will only cover a small fraction of your search space, probably at worst half. (A sketch of this is given below, after point 2.)
2) Search for shorter matches first (1-combos, 2-combos, etc.). Say you get a miss at level 2 (for example, no string in your dictionary is composed only of A and B). Use an auxiliary data structure (a bitmap, perhaps) that lets you check whether any word in the dictionary is even partially composed of A and B (in contrast to your primary hash table, which checks for complete composition). If you get a miss on the secondary bitmap as well, then you know you can skip all higher-level combinations that include A and B (i.e. you can skip hash(ABC), hash(ABD), and hash(ABCD), because no word contains both A and B). This leverages the Apriori principle and would drastically reduce the search space as M grows and misses become more frequent.
EDIT: I realize that the details I abstracted away relating to the "auxiliary data structure" are significant. As I think more about this idea, I see it leaning toward a complete dictionary scan as a subprocedure, which defeats the point of this entire approach. Still, it seems there should be a way to use the Apriori principle here (the pairwise case is sketched below).