Fuzzy Group By, Grouping Similar Words

耶瑟儿~ 2020-12-10 07:44

This question has been asked here before:

What is a good strategy to group similar words?

but no clear answer is given on how to "group" items. The solution based

5 answers
  •  无人及你
    2020-12-10 08:04

    Here is another version using the Affinity Propagation algorithm.

    import itertools

    import numpy as np
    import Levenshtein as leven  # python-Levenshtein package
    from sklearn.cluster import AffinityPropagation
    
    words = np.array(
        ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
         'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
         'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
         'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
         'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
         'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
         'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
         'people', 'into', 'year', 'your', 'good', 'some', 'could',
         'them', 'see', 'other', 'than', 'then', 'now', 'look',
         'only', 'come', 'its', 'over', 'think', 'also', 'back',
         'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
         'way', 'even', 'new', 'want', 'because', 'any', 'these',
         'give', 'day', 'most', 'us'])
    
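    print("calculating distances...")

    dim = len(words)

    # Affinity Propagation works on similarities, so negate each edit distance.
    # A float dtype is used because np.uint8 cannot represent negative values.
    res = np.fromiter(
        (-leven.distance(x, y) for x, y in itertools.product(words, words)),
        dtype=float, count=dim * dim)
    A = res.reshape(dim, dim)

    # affinity="precomputed" tells scikit-learn that A is already a similarity
    # matrix rather than a matrix of feature vectors.
    af = AffinityPropagation(affinity="precomputed").fit(A)
    cluster_centers_indices = af.cluster_centers_indices_
    labels = af.labels_

    # Print the words that belong to each cluster.
    for i in np.unique(labels):
        print(words[labels == i])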
    

    Distances had to be converted to similarities; I did that by negating each distance. The output is:

    ['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two']
    ['it' 'with' 'at' 'if' 'get' 'its' 'first']
    ['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work']
    ['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give']
    ['this' 'his' 'which' 'him']
    ['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any']
    ['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over']
    ['would' 'could']
    ['but' 'out' 'about' 'our' 'most']
    ['make' 'like' 'time' 'take' 'back']
    ['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think'
     'even' 'these']
    ['not' 'no' 'know' 'now' 'how' 'new']
    ['will' 'people' 'year' 'well']
    ['say' 'what' 'way' 'want' 'day']
    ['because']
    ['as' 'up' 'just' 'use' 'us']
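
    To make the distance-to-similarity step concrete, here is a minimal sketch that builds the same kind of negated edit-distance matrix in pure Python, without the Levenshtein package. The helper names levenshtein and similarity_matrix are hypothetical, chosen for illustration only; they are not part of the answer's code or of any library.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity_matrix(words):
    """Negated pairwise edit distances: values closer to 0 mean more similar."""
    return [[-levenshtein(x, y) for y in words] for x in words]


sample = ["would", "could", "because"]
S = similarity_matrix(sample)
for word, row in zip(sample, S):
    print(word, row)
```

    Identical words get similarity 0 (the maximum), and every extra edit lowers the value by one, so "would" and "could" score -1 against each other. A matrix like this could be converted to a NumPy float array and passed to AffinityPropagation(affinity="precomputed").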
    
