Similarity between two lists of documents

Posted by 老子叫甜甜 on 2020-01-25 08:57:06

Question


I need to find the similarity between two lists of short texts in Python. Each text is 1-4 words long, and each list can contain about 10K items. I couldn't find an efficient way to do this in spaCy. Maybe other packages can do it? I assume each word is represented by a 300-dimensional vector, but other representations are fine too. This can be done in a loop, but there must be a more efficient way. The task seems to suit TensorFlow, PyTorch, and similar packages, but I'm not familiar with the details of those packages.


Answer 1:


I think your question is ambiguous. You might want a single score that compares the average of list 1 with the average of list 2. I'm assuming instead that you want a similarity score for each combination of items from the two lists. For 10K items per list, that produces 10K * 10K = 100M similarity scores.

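import spacy

# 'en' is a shortcut link from spaCy 2.x; spaCy 3+ removed shortcut links,
# so load an installed package name there (for example 'en_core_web_md').
spacyModel = spacy.load('en')

list1 = ["hello, example 1", "right, second example"]
list2 = ["hello, example 1 in the second list", "And now for something completely different"]

# Parse every text once so each Doc can be reused for all comparisons.
list1SpacyDocs = [spacyModel(x) for x in list1]
list2SpacyDocs = [spacyModel(x) for x in list2]

# Rows follow list2 and columns follow list1: similarityMatrix[j][i] compares list1[i] with list2[j].
similarityMatrix = [[x.similarity(y) for x in list1SpacyDocs] for y in list2SpacyDocs]

print(similarityMatrix)

Output:
[[0.8537950408055295, 0.8852732956832498], [0.5802435148988874, 0.7643245611465626]]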
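
For lists of 10K items each, the nested list comprehension above makes 100M separate similarity calls in Python. If each text is already represented by one vector, all pairwise cosine similarities can be computed with a single matrix product instead. The following is only a sketch of that idea in NumPy, not something taken from the original answer: the function name pairwise_cosine is made up for illustration, and random 300-d vectors stand in for real document vectors.

```python
import numpy as np

def pairwise_cosine(a, b, eps=1e-12):
    """Return a (len(a), len(b)) matrix of cosine similarities between rows of a and b."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    # Scale every row to unit length; eps avoids division by zero for all-zero vectors.
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    # The dot product of two unit vectors is their cosine similarity.
    return a_unit @ b_unit.T

# Assumption: random 300-d vectors used as placeholders for real text vectors.
rng = np.random.default_rng(0)
vectors1 = rng.normal(size=(1000, 300))
vectors2 = rng.normal(size=(800, 300))

sims = pairwise_cosine(vectors1, vectors2)
print(sims.shape)  # (1000, 800)
```

With spaCy, the input matrices could probably be built with something like np.stack([doc.vector for doc in spacyModel.pipe(list1)]), provided the loaded model ships word vectors. Note that sims[i, j] compares item i of the first list with item j of the second, which is the transpose of similarityMatrix above. A full 10K by 10K result in float32 takes roughly 400 MB, so processing the first list in chunks of rows may be worthwhile if memory is tight.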


Source: https://stackoverflow.com/questions/53309192/similarity-between-two-lists-of-documents
