Recently, I attended an interview and faced a good question regarding hash collisions.
Question : Given a list of strings, print out the anagrams together.
E
The other posters suggested converting characters into prime numbers and multiplying them together. If you do this modulo a large prime, you get a good hash function that won't overflow. I tested the following Ruby code against the Unix word list of most English words and found no hash collisions between words that are not anagrams of one another. (On MAC OS X, this file is located here: /usr/share/dict/words.)
My word_hash function takes the ordinal value of each character mod 32. This will make sure that uppercase and lowercase letters have the same code. The large prime I use is 2^58 - 27. Any large prime will do so long as it is less than 2^64 / A where A is my alphabet size. I am using 32 as my alphabet size, so this means I can't use a number larger than about 2^59 - 1. Since ruby uses one bit for sign and a second bit to indicate if the value is a number or an object, I lose a bit over other languages.
def word_hash(w)
# 32 prime numbers so we can use x.ord % 32. Doing this, 'A' and 'a' get the same hash value, 'B' matches 'b', etc for all the upper and lower cased characters.
# Punctuation gets assigned values that overlap the letters, but we don't care about that much.
primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131]
# Use a large prime number as modulus. It must be small enough so that it will not overflow if multiplied by 32 (2^5). 2^64 / 2^5 equals 2^59, so we go a little lower.
prime_modulus = (1 << 58) - 27
h = w.chars.reduce(1) { |memo,letter| memo * primes[letter.ord % 32] % prime_modulus; }
end
words = (IO.readlines "/usr/share/dict/words").map{|word| word.downcase.chomp}.uniq
wordcount = words.size
anagramcount = words.map { |w| w.chars.sort.join }.uniq.count
whash = {}
inverse_hash = {}
words.each do |w|
h = word_hash(w)
whash[w] = h
x = inverse_hash[h]
if x && x.each_char.sort.join != w.each_char.sort.join
puts "Collision between #{w} and #{x}"
else
inverse_hash[h] = w
end
end
hashcount = whash.values.uniq.size
puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."