Pairwise Earth Mover Distance across all documents (word2vec representations)

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-11 00:48:07

Question


Is there a library that will take a list of documents and compute the n×n matrix of pairwise distances en masse, given a supplied word2vec model? I can see that gensim allows you to do this between two documents, but I need a fast comparison across all docs, like sklearn's cosine_similarity.


Answer 1:


The "Word Mover's Distance" (earth-mover's distance applied to groups of word-vectors) is a fairly involved optimization calculation dependent on every word in each document.
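
For reference, the optimization behind this distance can be written roughly as below, following the formulation in the original Word Mover's Distance paper (Kusner et al., 2015). The symbols are notation assumed for this sketch: x_i and y_j are the word vectors of the two documents, d_i and d'_j are their normalized word frequencies, and T_ij is the amount of "mass" moved from word i to word j.

```latex
\min_{T \ge 0} \; \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \, \lVert x_i - y_j \rVert_2
\quad \text{subject to} \quad
\sum_{j=1}^{m} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j
```

Because the cost terms and constraints involve every word of both documents, each pair of documents is its own transport problem.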

I'm not aware of any tricks that would help it go faster when calculating many at once – even many distances to the same document.

So the only thing needed to calculate pairwise distances is a pair of nested loops that considers each unique (order-ignoring) pairing.

For example, assuming your list of documents (each a list-of-words) is docs, a gensim word-vector model in model, and numpy imported as np, you could calculate the array of pairwise distances D with:

import numpy as np

D = np.zeros((len(docs), len(docs)))
for i in range(len(docs)):
    for j in range(len(docs)):
        if i == j:
            continue  # self-distance is 0.0
        if i > j:
            D[i, j] = D[j, i]  # re-use earlier calc
            continue  # skip recomputing the mirrored pair
        D[i, j] = model.wmdistance(docs[i], docs[j])

It may take a while, but you'll then have all pairwise distances in array D.
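
The same pairing logic can also be sketched with itertools.combinations, which visits each unordered pair exactly once. This is only an illustrative variant: pairwise_distance_matrix and toy_distance are hypothetical names, and toy_distance is a cheap stand-in so the sketch runs without a trained model.

```python
from itertools import combinations

import numpy as np


def pairwise_distance_matrix(docs, distance_fn):
    """Fill a symmetric n x n matrix, calling distance_fn once per unordered pair."""
    n = len(docs)
    D = np.zeros((n, n))  # diagonal stays 0.0 (self-distance)
    for i, j in combinations(range(n), 2):  # only pairs with i < j
        D[i, j] = D[j, i] = distance_fn(docs[i], docs[j])
    return D


def toy_distance(a, b):
    """Hypothetical stand-in for a real distance: count of words not shared."""
    return float(len(set(a) ^ set(b)))


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "bird", "flew"]]
D = pairwise_distance_matrix(docs, toy_distance)
```

With a gensim word-vector model, you would presumably pass model.wmdistance as distance_fn instead of the toy function; the number of distance calls is then n(n-1)/2 rather than n².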



Source: https://stackoverflow.com/questions/44380199/pairwise-earth-mover-distance-across-all-documents-word2vec-representations
