Algorithm for grouping anagram words

悲&欢浪女 2020-12-07 23:30

Given a set of words, we need to find the words that are anagrams of one another and display each group separately, using the most efficient algorithm.

input:

man car kile arc none like
14 Answers
  • 2020-12-07 23:46

    JavaScript version, using hashing.

    Time complexity: O(n · m log m), where n is the number of words and m is the length of the longest word (sorting the letters of each word costs O(m log m)).

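    var words = 'cat act mac tac ten cam net'.split(' '),
        hashMap = {};
    
    // Key each word by its sorted letters; anagrams share the same key.
    // The map stores how many words fall under each key.
    words.forEach(function(w){
        w = w.split('').sort().join('');
        hashMap[w] = (hashMap[w]|0) + 1;
    });
    
    // Print one key together with its word count.
    function print(obj,key){ 
        console.log(key, obj[key]);
    }
    
    Object.keys(hashMap).forEach(print.bind(null,hashMap));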
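
    To list the words of each group rather than a count, the same sorted-letter key can map to a list. A minimal Python sketch on the question's input; the function name group_anagrams is only illustrative:

```python
from collections import defaultdict


def group_anagrams(words):
    """Group words that share the same sorted-letter key."""
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))  # "car" and "arc" both give "acr"
        groups[key].append(word)
    return list(groups.values())


groups = group_anagrams("man car kile arc none like".split())
for group in groups:
    print(" ".join(group))
```

    Groups come out in the order in which their first word appears in the input.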
    
  • 2020-12-07 23:50

    I used a Godel-inspired scheme:

    Assign the primes P_1 to P_26 to the letters (in any order, but to keep the hash values small it is best to give common letters the small primes).

    Build a histogram of the letters in the word.

    Then the hash value is the product of each letter's associated prime raised to the power of its frequency. Because prime factorizations are unique, two words get the same value exactly when they are anagrams of each other.

    Python code:

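    # Primes for a..z; frequent letters (e, a, i, t, ...) get the smallest primes.
    primes = [2, 41, 37, 47, 3, 67, 71, 23, 5, 101, 61, 17, 19, 13, 31, 43, 97, 29, 11, 7, 73, 83, 79, 89, 59, 53]
    
    
    def get_frequency_map(word):
        """Count how often each letter occurs in the word."""
        freq = {}
    
        for letter in word:
            freq[letter] = freq.get(letter, 0) + 1
    
        return freq
    
    
    def anagram_hash(word):
        """Product of prime ** frequency; assumes lowercase a-z only."""
        freq = get_frequency_map(word)
        product = 1
        for letter, count in freq.items():
            product = product * primes[ord(letter) - 97] ** count
        return product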
    

    This cleverly transforms the tricky problem of finding subanagrams into the (also known to be tricky) problem of factoring large numbers...
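
    As a rough illustration of how such a product can be used, the sketch below groups words by their value and tests for a sub-anagram by divisibility. It is self-contained and assigns the first 26 primes to a..z in alphabetical order, which is an arbitrary choice; PRIME_OF, prime_signature and is_subanagram are made-up names, and lowercase a-z input is assumed:

```python
from collections import defaultdict
from string import ascii_lowercase

# First 26 primes, assigned to a..z in alphabetical order (arbitrary choice).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
PRIME_OF = dict(zip(ascii_lowercase, PRIMES))


def prime_signature(word):
    """Product of the primes of all letters; assumes lowercase a-z only."""
    product = 1
    for letter in word:
        product *= PRIME_OF[letter]
    return product


def is_subanagram(small, big):
    """True if every letter of `small` can be taken from `big`."""
    return prime_signature(big) % prime_signature(small) == 0


groups = defaultdict(list)
for word in "cat act mac tac ten cam net".split():
    groups[prime_signature(word)].append(word)

for group in groups.values():
    print(" ".join(group))
```

    Python integers have arbitrary precision, so long words do not overflow here; in a language with fixed-width integers the product would need a big-integer type.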

  • 2020-12-07 23:50

    I have implemented this before with a simple array of letter counts, e.g.:

    unsigned char letter_frequency[26];
    

    Then store that in a database table together with each word. Words that have the same letter frequency 'signature' are anagrams, and a simple SQL query then returns all anagrams of a word directly.

    With some experimentation with a very large dictionary, I found no word that exceeded a frequency count of 9 for any letter, so the 'signature' can be represented as a string of numbers 0..9 (The size could be easily halved by packing into bytes as hex, and further reduced by binary encoding the number, but I didn't bother with any of this so far).
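
    A small Python sketch of the idea, using an in-memory SQLite database. The table name words, its columns, and the helper names are assumptions made for illustration, not the schema actually used:

```python
import sqlite3
from string import ascii_lowercase


def signature(word):
    """26 digits, one per letter a..z; assumes no letter occurs more than 9 times."""
    word = word.lower()
    counts = [word.count(letter) for letter in ascii_lowercase]
    if max(counts) > 9:
        raise ValueError(f"signature overflow: {word}")
    return "".join(str(c) for c in counts)


# Hypothetical schema: one row per word, indexed by its signature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, signature TEXT)")
conn.execute("CREATE INDEX idx_signature ON words (signature)")
for w in ["man", "car", "kile", "arc", "none", "like"]:
    conn.execute("INSERT INTO words VALUES (?, ?)", (w, signature(w)))


def anagrams_of(word):
    """All stored words with the same signature, excluding the word itself."""
    rows = conn.execute(
        "SELECT word FROM words WHERE signature = ? AND word <> ? ORDER BY word",
        (signature(word), word),
    )
    return [row[0] for row in rows]


print(anagrams_of("car"))
```

    The index on the signature column is what lets the lookup avoid scanning the whole table.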

    Here is a Ruby function that computes the signature of a given word and stores it in a Hash, discarding duplicates. From the Hash I later build an SQL table:

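    # Relies on globals defined elsewhere (presumably): $dict, the Hash of words
    # seen so far; $letters, the list of letters to count; $wordid, a running id.
    def processword(word, downcase)
      # Normalize whitespace: drop the newline, collapse runs of spaces,
      # and remove one trailing space.
      word.chomp!
      word.squeeze!(" ") 
      word.chomp!(" ")
      if (downcase)
        word.downcase!
      end
      # Only process words that have not been seen before.
      if ($dict[word]==nil) 
        stdword=word.downcase
        # One count per letter, in the order given by $letters.
        signature=$letters.collect {|letter| stdword.count(letter)}
        signature.each do |cnt|
          if (cnt>9)
            puts "Signature overflow:#{word}|#{signature}|#{cnt}"
          end
        end
        $dict[word]=[$wordid,signature]
        $wordid=$wordid+1
      end
    end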
    
  • 2020-12-07 23:51

    Just want to add a simple Python solution in addition to the other useful answers:

    def check_permutation_group(word_list):
        result = {}
    
        for word in word_list:
            hash_arr_for_word = [0] * 128  # assuming standard ASCII
    
            for char in word:
                char_int = ord(char)
                hash_arr_for_word[char_int] += 1
    
            # Use the counts as a tuple key; joining them into a string would be
            # ambiguous once a count reaches 10 ("1" + "11" vs. "11" + "1").
            hash_for_word = tuple(hash_arr_for_word)
    
            if hash_for_word not in result:
                result[hash_for_word] = [word]
            else:
                result[hash_for_word].append(word)
    
        return list(result.values())
    
  • 2020-12-07 23:52

    You will need large integers (or, really, a bit vector), but the following might work.

    The first occurrence of each letter is assigned the bit number for that letter, the second occurrence gets the bit number for that letter + 26, and so on.

    For example:

    a #1 = 1
    b #1 = 2
    c #1 = 4
    a #2 = 2^26
    b #2 = 2^27

    You can then sum these together to get a value for the word that depends only on its letters, so all anagrams of a word share the same value.

    Your storage requirements for the word values will be:

    n * 26 bits

    where n is the maximum number of occurrences of any single letter.
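
    A minimal Python sketch of this scheme, assuming lowercase a-z input; the function name occurrence_bits is made up here. Python's integers grow as needed, so they can stand in for the bit vector:

```python
def occurrence_bits(word):
    """Bit-vector signature: the k-th occurrence of a letter sets bit
    (k - 1) * 26 + letter_index. Assumes lowercase a-z only."""
    seen = [0] * 26  # how many times each letter has occurred so far
    value = 0
    for ch in word:
        idx = ord(ch) - ord("a")
        value |= 1 << (seen[idx] * 26 + idx)
        seen[idx] += 1
    return value


print(occurrence_bits("kile") == occurrence_bits("like"))
print(occurrence_bits("aab") == occurrence_bits("abb"))
```

    Each bit is set at most once per word, so OR-ing the bits gives the same result as summing them.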

  • 2020-12-07 23:53

    In C, I just implemented the following hash, which builds a 26-bit bitmask recording which letters occur in the word. All anagrams therefore have the same hash. The hash doesn't take repeated letters into account, so there will be some additional collisions between words that are not anagrams, but it still manages to be faster than my Perl implementation.

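    #define BUCKETS 49999
    
    struct bucket {
        char *word;
        struct bucket *next;   /* chain of words that landed in the same bucket */
    };
    
    static struct bucket hash_table[BUCKETS];
    
    /* Returns the bucket index for a word, or 0 if it contains anything
       other than lowercase a-z. */
    static unsigned int hash_word(const char *word)
    {
        const char *p = word;
        unsigned int hash = 0;
    
        while (*p) {
            if (*p < 'a' || *p > 'z') {
                return 0;
            }
            hash |= 1u << (*p - 'a');   /* set the bit for this letter */
            p++;
        }
    
        return hash % BUCKETS;
    }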
    

    Buckets that receive more than one word are chained as a linked list, etc. Then just write a function that checks that two words with the same hash value have the same length and that their letters match one to one, and return those as a match.
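
    The two-step lookup could look roughly like the Python sketch below: a coarse letter-presence mask selects the bucket, and an exact comparison filters the candidates. The names presence_mask, is_anagram and find_anagrams are illustrative, a dict of lists stands in for the chained hash table, and lowercase a-z input is assumed:

```python
from collections import Counter, defaultdict


def presence_mask(word):
    """26-bit mask with one bit per distinct letter; assumes lowercase a-z."""
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord("a"))
    return mask


def is_anagram(a, b):
    """Exact check: same length and the same count of every letter."""
    return len(a) == len(b) and Counter(a) == Counter(b)


def find_anagrams(word, buckets):
    """Look up the bucket by mask, then keep only the true anagrams."""
    candidates = buckets.get(presence_mask(word), [])
    return [w for w in candidates if w != word and is_anagram(w, word)]


buckets = defaultdict(list)
for w in ["aab", "abb", "baa", "ab"]:
    buckets[presence_mask(w)].append(w)

print(find_anagrams("aab", buckets))
```

    All four sample words share one mask because they use the same two letters, which is exactly the kind of collision the exact check has to resolve.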
