Two words are anagrams if one has exactly the same characters as the other.
Example: "Anagram" and "Nagaram" are anagrams.
Example algorithm:
Open dictionary
Create empty hashmap H
For each word in dictionary:
    Create a key that is the word's letters sorted alphabetically (and forced to one case)
    Add the word to the list of words accessed by the hash key in H

To check for all anagrams of a given word:
    Create a key that is the letters of the word, sorted (and forced to one case)
    Look up that key in H
    You now have a list of all anagrams
Relatively fast to build, blazingly fast on look-up.
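A minimal sketch of this in Python (function names like build_anagram_index are mine, just for illustration):

from collections import defaultdict

def build_anagram_index(words):
    # Key: the word's letters, lower-cased and sorted; value: every word sharing that key.
    index = defaultdict(list)
    for word in words:
        index["".join(sorted(word.lower()))].append(word)
    return index

def lookup_anagrams(index, word):
    # Same key construction, then a single hash lookup.
    return index.get("".join(sorted(word.lower())), [])

index = build_anagram_index(["Salt", "last", "one", "eon", "plod"])
print(lookup_anagrams(index, "Tals"))   # ['Salt', 'last']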
Well, a Trie would make it easy to check whether a word exists. So if you put the whole dictionary in a trie:
http://en.wikipedia.org/wiki/Trie
then you can afterwards take your word and do simple backtracking: take a char and recursively check whether we can "walk" down the Trie with any combination of the remaining chars (adding one char at a time). When all chars are used up in a recursion branch and the path ends on a complete word in the Trie, the word exists.
The Trie helps because it gives a nice stopping condition: we can check whether a partial string, e.g. "Anag", is a valid path in the Trie; if not, we can abandon that particular recursion branch. This means we don't have to check every single permutation of the characters.
In pseudo-code:

checkAllChars(currentPositionInTrie, currentlyUsedChars, restOfWord)
    if (restOfWord.isEmpty() && Trie.IsWord(currentPositionInTrie))
    {
        AddWord(currentlyUsedChars)   // a complete word was spelled along a valid path
    }
    else
    {
        foreach (char in restOfWord)
        {
            nextPositionInTrie = Trie.Walk(currentPositionInTrie, char)
            if (nextPositionInTrie != Positions.NOT_POSSIBLE)
            {
                checkAllChars(nextPositionInTrie, currentlyUsedChars.With(char), restOfWord.Without(char))
            }
        }
    }
Obviously you need a nice Trie data structure that allows you to progressively "walk" down the tree and check at each node whether there is a path with the given char to some next node...
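Here is one way that could look in Python; the node layout (a dict of children plus an end-of-word flag) and the function names are my own choices, a sketch rather than the canonical implementation:

class TrieNode:
    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.is_word = False   # marks a complete dictionary word, not just a prefix

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def find_anagrams(node, used, rest, results):
    # All chars consumed on a path ending at a word node: an anagram was found.
    if not rest and node.is_word:
        results.add(used)
    for ch in set(rest):                      # try each distinct remaining char once
        child = node.children.get(ch)
        if child is not None:                 # dead Trie path: prune this branch
            find_anagrams(child, used + ch, rest.replace(ch, "", 1), results)
    return results

root = TrieNode()
for w in ["salt", "last", "one", "eon"]:
    insert(root, w)
print(find_anagrams(root, "", "alst", set()))   # {'salt', 'last'}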
Generating all permutations is easy; I guess you are worried that checking their existence in the dictionary is the "highly inefficient" part. But that actually depends on what data structure you use for the dictionary: of course, a list of words would be inefficient for your use case. Speaking of Tries, they would probably be an ideal representation, and quite efficient, too.
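For comparison, a small Python sketch of that brute-force route, using a hash set so each membership test is cheap (the factorial number of permutations remains the bottleneck):

from itertools import permutations

def anagrams_by_permutation(word, dictionary):
    words = set(dictionary)      # O(1) membership tests
    found = set()
    for perm in permutations(word.lower()):
        candidate = "".join(perm)
        if candidate in words:
            found.add(candidate)
    return found                 # still visits up to n! permutations

print(anagrams_by_permutation("eon", {"one", "eon", "plod"}))   # {'eon', 'one'}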
Another possibility would be to do some pre-processing on your dictionary, e.g. build a hashtable where the keys are the word's letters sorted, and the values are lists of words. You can even serialize this hashtable so you can write it to a file and reload quickly later. Then to look up anagrams, you simply sort your given word and look up the corresponding entry in the hashtable.
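In Python the serialization step could be as simple as pickling that table (pickle and the file name are my choices here, not the only option):

import pickle

def save_table(table, path="anagram_table.pkl"):
    # Write the sorted-letters -> words table to disk so later runs skip the build.
    with open(path, "wb") as f:
        pickle.dump(dict(table), f)

def load_table(path="anagram_table.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)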
Group (group-by) the dictionary words, lower-cased (clojure.string/lower-case), by letter frequency-map (frequencies). (These are the corresponding functions in the Lisp dialect Clojure.)
The whole function can be expressed like so:
(defn anagrams [dict]
(->> dict
(map clojure.string/lower-case)
(group-by frequencies)
vals))
For example,
(anagrams ["Salt" "last" "one" "eon" "plod"])
;(["salt" "last"] ["one" "eon"] ["plod"])
An indexing function that maps each thing to its collection is
(defn index [xss]
(into {} (for [xs xss, x xs] [x xs])))
So that, for example,
((comp index anagrams) ["Salt" "last" "one" "eon" "plod"])
;{"salt" ["salt" "last"], "last" ["salt" "last"], "one" ["one" "eon"], "eon" ["one" "eon"], "plod" ["plod"]}
... where comp is the functional composition operator.
That depends on how you store your dictionary. If it is a simple array of words, no algorithm will be faster than linear.
If it is sorted, then here's an approach that may work. I've invented it just now, but I guess it's faster than a linear approach. This is how I would do it, anyway; there should be a more conventional approach, but this is faster than linear.
Represent each word as a vector of letter counts (its count vector); anagrams have exactly the same count vector. Pick a random direction, i.e. a random vector of the same dimension. Project each dictionary word's count vector in this random direction and store the value (insert such that the array of values is sorted).
Given a new test word, project it in the same random direction used for the dictionary words. Then binary-search the sorted array for that value: every dictionary word whose projection matches is an anagram candidate (compare the count vectors directly to confirm).
PS: The above procedure is a generalization of the prime-number procedure (assign each letter a prime and multiply the primes of a word's letters; anagrams get the same product), which may potentially lead to large numbers and hence computational-precision issues.
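A Python sketch of the whole procedure, with the unstated details filled in by guesswork (a 26-dimensional count vector, a random float direction, binary search plus an exact count-vector check to guard against precision issues):

import bisect
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
random.seed(42)
DIRECTION = [random.random() for _ in ALPHABET]   # one fixed random direction

def count_vector(word):
    # 26-dimensional letter-count vector; anagrams share it exactly.
    counts = Counter(word.lower())
    return [counts[ch] for ch in ALPHABET]

def project(word):
    return sum(c * d for c, d in zip(count_vector(word), DIRECTION))

def build(words):
    # Sorted array of (projection, word), so look-up can binary-search it.
    return sorted((project(w), w) for w in words)

def anagrams_of(word, entries, eps=1e-9):
    p = project(word)
    i = bisect.bisect_left(entries, (p - eps,))
    hits = []
    while i < len(entries) and entries[i][0] <= p + eps:
        if count_vector(entries[i][1]) == count_vector(word):   # confirm exactly
            hits.append(entries[i][1])
        i += 1
    return hits

entries = build(["Salt", "last", "one", "eon", "plod"])
print(anagrams_of("tals", entries))   # ['Salt', 'last']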