this question is asked here before
What is a good strategy to group similar words?
but no clear answer is given on how to \"group\" items. The solution based
Here is an approach based on medoids. First install MlPy. On Ubuntu
sudo apt-get install python-mlpy
Then
import numpy as np
import mlpy
class distance:
def compute(self, s1, s2):
l1 = len(s1)
l2 = len(s2)
matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
for zz in xrange(0,l2):
for sz in xrange(0,l1):
if s1[sz] == s2[zz]:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
else:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
return matrix[l2][l1]
x = np.array(['ape', 'appel', 'apple', 'peach', 'puppy'])
km = mlpy.Kmedoids(k=3, dist=distance())
medoids,clusters,a,b = km.compute(x)
print medoids
print clusters
print a
print x[medoids]
for i,c in enumerate(x[medoids]):
print "medoid", c
print x[clusters[a==i]]
The output is
[4 3 1]
[0 2]
[2 2]
['puppy' 'peach' 'appel']
medoid puppy
[]
medoid peach
[]
medoid appel
['ape' 'apple']
The bigger word list and using k=10
medoid he
['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see'
'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over'
'after' 'get' 'what' 'I']
medoid out
['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but'
'about']
medoid to
['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two']
medoid now
['new' 'how' 'know' 'not']
medoid time
['like' 'take' 'come' 'some' 'give']
medoid because
[]
medoid an
['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day'
'any']
medoid look
['work' 'good']
medoid will
['with' 'well' 'which']
medoid then
['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there'
'than' 'them']