Fuzzy Group By, Grouping Similar Words

耶瑟儿~ 2020-12-10 07:44

This question has been asked here before:

What is a good strategy to group similar words?

but no clear answer is given on how to "group" items. The solution based

5 Answers
  • 2020-12-10 07:55

    You have to decide which of the close matches you want to use. You could take the first element of the list that get_close_matches returns, or pick a random element from that list.

    There must be some sort of rule for it.

    In [19]: import difflib
    
    In [20]: a = ['ape', 'appel', 'apple', 'peach', 'puppy']
    
    In [21]: a = ['appel', 'apple', 'peach', 'puppy']
    
    In [22]: b = difflib.get_close_matches('ape',a)
    
    In [23]: b
    Out[23]: ['apple', 'appel']
    
    In [24]: import random
    
    In [25]: c = random.choice(b)
    
    In [26]: c
    Out[26]: 'apple'
    

    Now remove c from the initial list, and that's it. For C++, you can use the Levenshtein distance.
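
    The steps above can be repeated until the list is empty. Here is a minimal sketch, assuming the rule is "the first remaining word seeds a group and takes all of its close matches with it". The function name group_by_close_matches and the cutoff of 0.6 (difflib's default) are illustrative choices, not something the answer prescribes.

```python
import difflib


def group_by_close_matches(words, cutoff=0.6):
    """Greedily group words: each seed word claims its close matches.

    Assumption: the first remaining word is used as the seed of each group.
    """
    remaining = list(words)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        # Ask for every match above the cutoff (the default n is only 3).
        # n must be positive, hence the max(..., 1) for an empty list.
        matches = difflib.get_close_matches(
            seed, remaining, n=max(len(remaining), 1), cutoff=cutoff)
        for match in matches:
            remaining.remove(match)
        groups.append([seed] + matches)
    return groups


groups = group_by_close_matches(['ape', 'appel', 'apple', 'peach', 'puppy'])
print(groups)
```

    Picking the seed differently (randomly, or lexicographically first) would be an equally valid rule; the result depends on that choice and on the cutoff.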

  • 2020-12-10 07:58

    You need to normalize the groups. In each group, pick one word or code that represents the group. Then group the words by their representative.

    Some possible ways:

    • Pick the first encountered word.
    • Pick the lexicographic first word.
    • Derive a pattern for all the words.
    • Pick a unique index.
    • Use the Soundex code as the pattern.

    Grouping the words could be difficult, though. If A is similar to B, and B is similar to C, A and C are not necessarily similar to each other. If B is the representative, both A and C could be included in the group. But if A or C is the representative, the other might not be included.
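
    A small illustration of that caveat, as a sketch only: the three words, the threshold LIMIT = 1 and the helper names levenshtein and members are assumptions chosen to make the effect visible.

```python
def levenshtein(s1, s2):
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


LIMIT = 1  # hypothetical threshold for "similar"
a, b, c = 'cat', 'cot', 'cog'

print(levenshtein(a, b) <= LIMIT)  # A is similar to B
print(levenshtein(b, c) <= LIMIT)  # B is similar to C
print(levenshtein(a, c) <= LIMIT)  # A is not similar to C


def members(representative, words, limit=LIMIT):
    """Words that fall into the group of the given representative."""
    return [w for w in words if levenshtein(representative, w) <= limit]


print(members(b, [a, b, c]))  # 'cot' as representative covers all three
print(members(a, [a, b, c]))  # 'cat' as representative leaves 'cog' out
```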


    Going by the first alternative (first encountered word):

    class Seeder:
        def __init__(self):
            self.seeds = set()
            self.cache = dict()
    
        def get_seed(self, word):
            LIMIT = 2
            seed = self.cache.get(word,None)
            if seed is not None:
                return seed
            for seed in self.seeds:
                if self.distance(seed, word) <= LIMIT:
                    self.cache[word] = seed
                    return seed
            self.seeds.add(word)
            self.cache[word] = word
            return word
    
        def distance(self, s1, s2):
            # Levenshtein (edit) distance computed by dynamic programming.
            l1 = len(s1)
            l2 = len(s2)
            # list(...) keeps each row mutable on Python 3, where range() is lazy.
            matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
            for zz in range(l2):
                for sz in range(l1):
                    # A substitution costs nothing when the characters match.
                    cost = 0 if s1[sz] == s2[zz] else 1
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
                                             matrix[zz][sz+1] + 1,
                                             matrix[zz][sz] + cost)
            return matrix[l2][l1]
    
    import itertools
    
    def group_similar(words):
        seeder = Seeder()
        words = sorted(words, key=seeder.get_seed)
        groups = itertools.groupby(words, key=seeder.get_seed)
        return [list(v) for k,v in groups]
    

    Example:

    import pprint
    
    pprint.pprint(group_similar([
        'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
        'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
        'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
        'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
        'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
        'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
        'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
        'people', 'into', 'year', 'your', 'good', 'some', 'could',
        'them', 'see', 'other', 'than', 'then', 'now', 'look',
        'only', 'come', 'its', 'over', 'think', 'also', 'back',
        'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
        'way', 'even', 'new', 'want', 'because', 'any', 'these',
        'give', 'day', 'most', 'us'
    ]), width=120)
    

    Output:

    [['after'],
     ['also'],
     ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
     ['back'],
     ['because'],
     ['but', 'about', 'get', 'just'],
     ['first'],
     ['from'],
     ['good', 'look'],
     ['have', 'make', 'give'],
     ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
     ['into'],
     ['know', 'new'],
     ['like', 'time', 'take'],
     ['most'],
     ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
     ['only'],
     ['over', 'our', 'even'],
     ['people'],
     ['say', 'she', 'way', 'day'],
     ['some', 'see', 'come'],
     ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
     ['think'],
     ['well'],
     ['what', 'who', 'when', 'than'],
     ['with', 'will', 'which'],
     ['work'],
     ['would', 'could'],
     ['year', 'your']]
    
  • 2020-12-10 08:04

    Here is another version using the Affinity Propagation algorithm.

    import numpy as np
    import scipy.linalg as lin
    import Levenshtein as leven
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.cluster import AffinityPropagation
    import itertools
    
    words = np.array(
        ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
         'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
         'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
         'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
         'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
         'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
         'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
         'people', 'into', 'year', 'your', 'good', 'some', 'could',
         'them', 'see', 'other', 'than', 'then', 'now', 'look',
         'only', 'come', 'its', 'over', 'think', 'also', 'back',
         'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
         'way', 'even', 'new', 'want', 'because', 'any', 'these',
         'give', 'day', 'most', 'us'])
    
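    print("calculating distances...")
    
    (dim,) = words.shape
    
    # Similarity = negative edit distance, so a signed dtype is needed to hold it.
    pairs = itertools.product(words.tolist(), repeat=2)
    res = np.fromiter((-leven.distance(x, y) for x, y in pairs), dtype=np.int16)
    A = np.reshape(res, (dim, dim))
    
    # A already holds pairwise similarities, so scikit-learn must not recompute them.
    af = AffinityPropagation(affinity='precomputed').fit(A)
    cluster_centers_indices = af.cluster_centers_indices_
    labels = af.labels_
    
    unique_labels = set(labels)
    for i in unique_labels:
        print(words[labels == i])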
    

    Distances had to be converted to similarities; I did that by taking the negative of the distance. The output is:

    ['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two']
    ['it' 'with' 'at' 'if' 'get' 'its' 'first']
    ['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work']
    ['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give']
    ['this' 'his' 'which' 'him']
    ['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any']
    ['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over']
    ['would' 'could']
    ['but' 'out' 'about' 'our' 'most']
    ['make' 'like' 'time' 'take' 'back']
    ['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think'
     'even' 'these']
    ['not' 'no' 'know' 'now' 'how' 'new']
    ['will' 'people' 'year' 'well']
    ['say' 'what' 'way' 'want' 'day']
    ['because']
    ['as' 'up' 'just' 'use' 'us']
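
    The distance-to-similarity conversion can be sketched without any third-party package. This is only an illustration under assumptions: the helper names levenshtein and similarity_matrix and the four sample words are made up for the example. If the matrix is stored in a NumPy array instead of nested lists, a signed dtype is needed so that the negative values survive.

```python
def levenshtein(s1, s2):
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity_matrix(words):
    """Pairwise similarities, defined here as the negated edit distance."""
    return [[-levenshtein(x, y) for y in words] for x in words]


words = ['apple', 'appel', 'ape', 'peach']
S = similarity_matrix(words)
for word, row in zip(words, S):
    # Identical words get 0, the highest possible similarity.
    print(word, row)
```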
    
  • 2020-12-10 08:06

    Here is an approach based on medoids. First install mlpy. On Ubuntu:

    sudo apt-get install python-mlpy
    

    Then

    import numpy as np
    import mlpy
    
    class distance:    
        def compute(self, s1, s2):
            l1 = len(s1)
            l2 = len(s2)
            matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
            for zz in xrange(0,l2):
                for sz in xrange(0,l1):
                    if s1[sz] == s2[zz]:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                    else:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
            return matrix[l2][l1]
    
    x =  np.array(['ape', 'appel', 'apple', 'peach', 'puppy'])
    
    km = mlpy.Kmedoids(k=3, dist=distance())
    medoids,clusters,a,b = km.compute(x)
    
    print medoids
    print clusters
    print a
    
    print x[medoids] 
    for i,c in enumerate(x[medoids]):
        print "medoid", c
        print x[clusters[a==i]]
    

    The output is

    [4 3 1]
    [0 2]
    [2 2]
    ['puppy' 'peach' 'appel']
    medoid puppy
    []
    medoid peach
    []
    medoid appel
    ['ape' 'apple']
    

    With the bigger word list and k=10:

    medoid he
    ['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see'
     'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over'
     'after' 'get' 'what' 'I']
    medoid out
    ['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but'
     'about']
    medoid to
    ['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two']
    medoid now
    ['new' 'how' 'know' 'not']
    medoid time
    ['like' 'take' 'come' 'some' 'give']
    medoid because
    []
    medoid an
    ['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day'
     'any']
    medoid look
    ['work' 'good']
    medoid will
    ['with' 'well' 'which']
    medoid then
    ['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there'
     'than' 'them']
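
    For readers unfamiliar with the term, a medoid is the member of a group whose total distance to the other members is smallest. A minimal sketch of that definition follows; it is not mlpy's implementation, and the names levenshtein and medoid are illustrative. It is applied to one of the clusters listed above.

```python
def levenshtein(s1, s2):
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def medoid(words, dist=levenshtein):
    """Return the word with the smallest summed distance to all the others."""
    return min(words, key=lambda w: sum(dist(w, other) for other in words))


cluster = ['new', 'how', 'know', 'not', 'now']
print(medoid(cluster))
```

    A full k-medoids run alternates between assigning each word to its nearest medoid and recomputing the medoid of each group in this way.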
    
  • 2020-12-10 08:10

    Another method is matrix factorization using SVD. First we create a word distance matrix; for 100 words this is a 100 x 100 matrix representing the distance from each word to every other word. Then SVD is run on this matrix, and the u in the resulting u,s,v can be seen as the membership strength of each word in each cluster.

    Code

    import numpy as np
    import scipy.linalg as lin
    import Levenshtein as leven
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    import itertools
    
    words = np.array(
        ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
         'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
         'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
         'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
         'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
         'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
         'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
         'people', 'into', 'year', 'your', 'good', 'some', 'could',
         'them', 'see', 'other', 'than', 'then', 'now', 'look',
         'only', 'come', 'its', 'over', 'think', 'also', 'back',
         'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
         'way', 'even', 'new', 'want', 'because', 'any', 'these',
         'give', 'day', 'most', 'us'])
    
    print("calculating distances...")
    
    (dim,) = words.shape
    
    # Pairwise Levenshtein distances; every value fits in uint8 for these short words.
    pairs = itertools.product(words.tolist(), repeat=2)
    res = np.fromiter((leven.distance(x, y) for x, y in pairs),
                      dtype=np.uint8)
    A = np.reshape(res, (dim, dim))
    
    print("svd...")
    
    u, s, v = lin.svd(A, full_matrices=False)
    
    print(u.shape)
    print(s.shape)
    print(s)
    print(v.shape)
    
    # Cluster the words on their first 10 coordinates in u.
    data = u[:, 0:10]
    k = KMeans(init='k-means++', n_clusters=25, n_init=10)
    k.fit(data)
    centroids = k.cluster_centers_
    labels = k.labels_
    print(labels)
    
    # +1 so that the cluster with the highest label is printed too.
    for i in range(np.max(labels) + 1):
        print(words[labels == i])
    
    def dist(x, y):
        return np.sqrt(np.sum((x - y) ** 2, axis=1))
    
    # For each cluster, print the word closest to the centroid, then all members.
    print("centroid points..")
    for i, c in enumerate(centroids):
        idx = np.argmin(dist(c, data[labels == i]))
        print(words[labels == i][idx])
        print(words[labels == i])
    
    plt.plot(centroids[:, 0], centroids[:, 1], 'x')
    plt.plot(u[:, 0], u[:, 1], '.')
    plt.show()
    
    from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.plot(u[:, 0], u[:, 1], u[:, 2], '.')
    plt.show()
    
    

    The result

    any
    ['and' 'an' 'can' 'any']
    do
    ['to' 'you' 'do' 'so' 'go' 'no' 'two' 'how']
    when
    ['who' 'when' 'well']
    my
    ['be' 'I' 'by' 'we' 'my' 'up' 'me' 'use']
    your
    ['for' 'or' 'out' 'about' 'your' 'our']
    its
    ['it' 'his' 'if' 'him' 'its']
    could
    ['would' 'people' 'could']
    this
    ['this' 'think' 'these']
    she
    ['the' 'he' 'she' 'see']
    back
    ['all' 'back' 'want']
    one
    ['of' 'on' 'one' 'only' 'even' 'new']
    just
    ['but' 'just' 'first' 'most']
    come
    ['some' 'come']
    that
    ['that' 'than']
    way
    ['say' 'what' 'way' 'day']
    like
    ['like' 'time' 'give']
    in
    ['in' 'into']
    get
    ['her' 'get' 'year']
    because
    ['because']
    will
    ['with' 'will' 'which']
    over
    ['other' 'over' 'after']
    as
    ['a' 'as' 'at' 'also' 'us']
    them
    ['they' 'there' 'their' 'them' 'then']
    good
    ['not' 'from' 'know' 'good' 'now' 'look' 'work']
    have
    ['have' 'make' 'take']
    

    The choice of k, the number of clusters, is important; k=25 gives much better results than k=20, for instance.

    The code also selects a representative word for each cluster by picking the word whose u[..] coordinate is closest to the cluster centroid.
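
    The representative-selection step can be shown in isolation. The sketch below uses plain Python and made-up 2-D coordinates (the real code works on the first 10 columns of u); the function name closest_to_centroid and the numbers are hypothetical and only illustrate the rule. It needs Python 3.8 or later for math.dist.

```python
import math


def closest_to_centroid(points):
    """Return the key whose coordinates lie nearest to the mean of all points.

    points maps each word to a tuple of coordinates of equal length.
    """
    dims = len(next(iter(points.values())))
    centroid = [sum(p[d] for p in points.values()) / len(points)
                for d in range(dims)]
    return min(points, key=lambda word: math.dist(points[word], centroid))


# Hypothetical coordinates, not taken from an actual SVD run.
cluster = {
    'would': (0.9, 0.1),
    'could': (1.0, 0.2),
    'people': (1.6, 0.9),
}
print(closest_to_centroid(cluster))
```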
