Generate the same unique hash code for all anagrams

刺人心 2020-12-05 08:34

I recently had an interview and was asked a good question about hash collisions.

Question: Given a list of strings, print the anagrams together.

E

6 answers
  •  轻奢々
    轻奢々 (original poster)
    2020-12-05 08:55

    The other posters suggested mapping each character to a prime number and multiplying the primes together. If you take the product modulo a large prime, you get a good hash function whose value stays bounded. I tested the following Ruby code against the Unix word list, which covers most English words, and found no hash collisions between words that are not anagrams of one another. (On Mac OS X, the file is at /usr/share/dict/words.)

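    My word_hash function takes the ordinal value of each character mod 32, so uppercase and lowercase letters get the same code. The large prime I use is 2^58 - 27. Any large prime will do, but to keep every intermediate product inside a fixed-width integer, the modulus has to be smaller than the integer limit divided by the largest multiplier. That multiplier is the largest prime in the table (131), not the alphabet size (32), so with this modulus the intermediate product can exceed 2^64. Ruby copes by promoting large values to arbitrary-precision integers: on 64-bit builds a small integer holds 62 bits plus a sign, because one bit is a tag that marks the value as a number rather than an object. The results stay correct and only the arithmetic gets slower. In a language with fixed 64-bit integers you would need a smaller modulus or a wider intermediate type.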
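
    The arithmetic above can be checked with a few lines of Ruby. This is only an illustrative sketch and not part of the tested code; the variable names are made up for the example, and 131 is assumed to be the largest prime in the 32-entry table.

    ```ruby
    # Illustrative check only; all variable names here are hypothetical.
    # 'A' is 65 and 'a' is 97; they differ by 32, so both land in slot 1.
    slot_upper = 'A'.ord % 32
    slot_lower = 'a'.ord % 32

    # Non-letters share slots with letters: '!' is 33, which also maps to 1.
    slot_bang = '!'.ord % 32

    prime_modulus = (1 << 58) - 27
    largest_prime = 131 # assumed to be the last entry of the prime table

    # Worst-case intermediate product before the modulo is applied.
    worst_case = (prime_modulus - 1) * largest_prime
    exceeds_64_bits = worst_case >= (1 << 64)

    puts "A -> #{slot_upper}, a -> #{slot_lower}, ! -> #{slot_bang}"
    puts "worst-case product needs #{worst_case.bit_length} bits"
    ```

    Ruby's Integer grows past the machine word as needed, so the hash values come out the same either way.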

    def word_hash(w)
      # 32 primes, indexed by x.ord % 32. 'A' and 'a' get the same value, 'B' matches 'b', and so on for all upper- and lowercase letters.
      # Punctuation is assigned values that overlap the letters, but that matters little here.
      primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131]
      # Use a large prime as the modulus. Each step multiplies a value below the modulus by a table prime (at most 131),
      # so the intermediate product can exceed 64 bits. Ruby's Integer grows as needed, so the result is still exact.
      prime_modulus = (1 << 58) - 27
      w.chars.reduce(1) { |memo, letter| memo * primes[letter.ord % 32] % prime_modulus }
    end
    
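    # Load the word list, normalize case, and drop duplicates.
    words = File.readlines("/usr/share/dict/words").map { |word| word.downcase.chomp }.uniq
    wordcount = words.size
    # Words that are anagrams of each other share the same sorted-letter key.
    anagramcount = words.map { |w| w.chars.sort.join }.uniq.count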
    
    whash = {}
    inverse_hash = {}
    words.each do |w|
      h = word_hash(w)
      whash[w] = h
      x = inverse_hash[h]
      if x && x.each_char.sort.join != w.each_char.sort.join
        puts "Collision between #{w} and #{x}"
      else
        inverse_hash[h] = w
      end
    end
    hashcount = whash.values.uniq.size
    puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."
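
    To get back to the original question of printing the anagrams together, the same hash can serve as a grouping key. The following is a minimal sketch under that assumption. PRIMES and PRIME_MODULUS repeat the values from word_hash above, while anagram_key and group_anagrams are hypothetical helper names introduced only for this example.

    ```ruby
    # Sketch: group words whose prime-product hashes match.
    # anagram_key and group_anagrams are hypothetical names for illustration.
    PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
              59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113,
              127, 131].freeze
    PRIME_MODULUS = (1 << 58) - 27

    # Same computation as word_hash: multiply one prime per character, mod a large prime.
    def anagram_key(word)
      word.each_char.reduce(1) { |memo, ch| memo * PRIMES[ch.ord % 32] % PRIME_MODULUS }
    end

    # Words with equal keys end up in the same group; groups keep first-seen order.
    def group_anagrams(words)
      words.group_by { |w| anagram_key(w) }.values
    end

    sample = %w[listen Silent enlist cat act dog]
    group_anagrams(sample).each { |group| puts group.join(" ") }
    ```

    Equal keys only make an anagram very likely. If a guarantee is needed, comparing the sorted characters within each group confirms it.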
    
