python - calculate orthographic similarity between words of a list

Submitted by 牧云@^-^@ on 2019-12-24 02:13:19

Question


I need to calculate orthographic similarity (edit/Levenshtein distance) among words in a given corpus.

As Kirill suggested below, I tried to do the following:

import csv, itertools, Levenshtein
import numpy as np

# import the list of words from csv file
path = '/Users/my path'
file = path + 'file.csv'

with open(file, 'rb') as f:
    reader = csv.reader(f)
    wordlist = list(reader)

wordlist = np.array(wordlist) #make it a np array
wordlist2 = wordlist[:,0] #subset the first column of the imported list

for a, b in itertools.product(wordlist, wordlist):
    if a < b:
        print(a, b, Levenshtein.distance(a, b))

However, the following error pops up:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I understand the ambiguity in the code, but can someone help me figure out how to solve this? Thanks!


Answer 1:


By definition, Levenshtein distance is computed between exactly two strings: it is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. For a list of words you therefore compare them pairwise, which takes n*(n-1)/2 comparisons (where n is the number of unique words in your corpus). Here's how you can do it:

>>> import itertools, Levenshtein
>>> words = sorted(set('little Mary had a little lamb'.split()))
>>> for a, b in itertools.product(words, words):
...     if a < b:
...         print(a, b, Levenshtein.distance(a, b))
... 
Mary a 3
Mary had 3
Mary lamb 3
Mary little 6
a had 2
a lamb 3
a little 6
had lamb 3
had little 6
lamb little 5
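
Because `words` is already sorted and de-duplicated, `itertools.combinations(words, 2)` yields exactly the pairs that pass the `a < b` test, without generating all n*n candidates first. The following is a minimal sketch of that variant. It assumes a hypothetical pure-Python helper, `edit_distance`, standing in for `Levenshtein.distance` so that it runs without the third-party package; `pairwise_distances` is likewise an illustrative name, not part of any library.

```python
import itertools


def edit_distance(s, t):
    """Hypothetical stand-in for Levenshtein.distance (unit-cost edits)."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def pairwise_distances(words):
    """Return (a, b, distance) for every unordered pair of unique words."""
    unique = sorted(set(words))
    return [(a, b, edit_distance(a, b))
            for a, b in itertools.combinations(unique, 2)]


pairs = pairwise_distances('little Mary had a little lamb'.split())
for a, b, d in pairs:
    print(a, b, d)
```

With the real package installed, `edit_distance` can simply be replaced by `Levenshtein.distance`.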



Answer 2:


Here's the code I came up with, thanks to Kirill's help.

import csv
import itertools
import os

import Levenshtein

# open the newline-separated list of words
path = '/Users/your path'
file = os.path.join(path, 'wordlists.txt')
output = os.path.join(path, 'ortho_similarities.txt')
with open(file) as f:
    words = sorted(set(s.strip() for s in f))

# Loop over all pairs of words, keep each unordered pair once (a < b),
# compute the Levenshtein distance, and write one CSV row per pair.
# In Python 3 the csv module needs a text-mode file opened with newline=''.
with open(output, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    for a, b in itertools.product(words, words):
        if a < b:
            writer.writerow([a, b, Levenshtein.distance(a, b)])
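
The question read its words from the first column of a CSV file. The `ValueError` there comes from iterating over the two-dimensional array `wordlist` instead of `wordlist2`: each item is then a whole row, and `a < b` on two rows produces an array of booleans that `if` cannot reduce to a single truth value. Below is a sketch of loading that column without NumPy. `first_column` is a hypothetical helper, and the in-memory sample is made-up data standing in for the real file.

```python
import csv
import io


def first_column(lines):
    """Return the sorted unique values in the first column of CSV input."""
    reader = csv.reader(lines)
    # Skip empty rows; keep only the first field of each remaining row.
    return sorted({row[0].strip() for row in reader if row})


# Made-up sample data; with a real file, pass open(path, newline='') instead.
sample = io.StringIO("little,1\nMary,2\nhad,3\nlittle,4\n")
words = first_column(sample)
print(words)
```

The resulting `words` list holds plain strings, so it can be fed straight into the pairwise loop above.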


Source: https://stackoverflow.com/questions/47680897/python-calculate-orthographic-similarity-between-words-of-a-list
