Getting different results from deeplearning4j and word2vec

时光毁灭记忆、已成空白 · submitted on 2019-12-12 05:56:53

Question


I trained a word embedding model using Google's word2vec. The output is a text file with one word and its vector per line.

I loaded this trained model in deeplearning4j:

    // Load the word2vec text-format model, then query the 10 nearest words
    WordVectors vec = WordVectorSerializer.loadTxtVectors(new File("vector.txt"));
    Collection<String> lst = vec.wordsNearest("someWord", 10);

But the two lists of similar words, one from deeplearning4j's package and one from word2vec's distance function, are totally different, even though I used the same vector file.

Does anyone have a good understanding of how things work in deeplearning4j and where these differences are coming from?
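One way to find out which tool is right is to compute the similarity ranking yourself, directly from the raw vectors, and see which tool's list it matches. The sketch below uses hardcoded toy vectors standing in for rows of vector.txt (parsing the actual file is left out); the words and values are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineCheck {
    // Cosine similarity between two raw (unnormalised) vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Toy stand-ins for lines of vector.txt ("word v1 v2 v3 ...").
        Map<String, double[]> vecs = new HashMap<>();
        vecs.put("king",  new double[]{0.8, 0.3, 0.1});
        vecs.put("queen", new double[]{0.7, 0.4, 0.1});
        vecs.put("car",   new double[]{0.1, 0.1, 0.9});

        double[] query = vecs.get("king");
        // Rank every other word by cosine similarity to the query word.
        vecs.entrySet().stream()
            .filter(e -> !e.getKey().equals("king"))
            .sorted((x, y) -> Double.compare(cosine(query, y.getValue()),
                                             cosine(query, x.getValue())))
            .forEach(e -> System.out.println(
                e.getKey() + " " + cosine(query, e.getValue())));
    }
}
```

Whichever tool agrees with this hand-computed ranking on your real file is computing the usual cosine-similarity neighbours; the other one is doing something different.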


Answer 1:


Are the lists similar at all? Does either set seem more reasonable as similar words?

By my understanding, the lists should match almost exactly - they should be implementing the same calculation on the same input vectors. If they don't, and especially if the original word2vec.c similar-list looks more reasonable, then I would suspect a bug in DL4J.

Looking at the method doing the calculation – https://github.com/deeplearning4j/deeplearning4j/blob/f943ea879ab362f66b57b00754b71fb2ff3677a1/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/wordvectors/WordVectorsImpl.java#L385 :

  • the code in the if (lookupTable() instanceof InMemoryLookupTable) {...} branch may be correct – I'm not familiar with the nd4j API – but it almost seems too ornate for a calculation of ranked cosine-similarity values;
  • the fallback case that follows does not appear to use unit-normalized vector values (as would be usual) – it uses getWordVectorMatrix() instead of getWordVectorMatrixNormalized()
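To see why skipping normalization matters: ranking by raw dot product and by cosine similarity can disagree, because the dot product rewards vector length as well as direction. A small self-contained sketch with made-up vectors:

```java
public class NormDemo {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double norm(double[] a) { return Math.sqrt(dot(a, a)); }

    // Cosine similarity = dot product of the unit-normalised vectors.
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (norm(a) * norm(b));
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0};
        double[] a = {10.0, 10.0}; // long vector, 45 degrees from the query
        double[] b = {1.0, 0.1};   // short vector, nearly parallel to the query

        // Raw dot product prefers the longer vector a...
        System.out.println(dot(query, a) > dot(query, b));       // true
        // ...but cosine similarity (length-invariant) prefers b.
        System.out.println(cosine(query, b) > cosine(query, a)); // true
    }
}
```

So if the fallback path really ranks by similarity over unnormalized vectors, words with large-magnitude vectors (typically frequent words) would be over-ranked relative to the word2vec.c distance tool, which normalizes every vector first.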



Answer 2:


There can be multiple reasons why you get different vectors from different implementations (and hence different similar-word lists). To mention a few:

  • random initialisation of vectors
  • negative sampling
  • threading

If your number of documents (training data) >> number of unique words (vocabulary size), the vectors will stabilise after a few iterations, and the most-similar lists from the two implementations should largely agree.
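The threading point deserves a note: word2vec-style trainers let multiple threads update shared weights without locks, so the order in which floating-point updates are accumulated varies from run to run, and floating-point addition is not associative. A minimal sketch of the underlying effect (the values are chosen to make the rounding visible):

```java
public class FloatOrder {
    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1f;

        // Summing the same three "updates" in a different order gives a
        // different result, because float addition is not associative.
        float leftFirst  = (a + b) + c; // 0 + 1       == 1.0
        float rightFirst = a + (b + c); // 1e8 + -1e8  == 0.0 (c is absorbed)

        System.out.println(leftFirst);  // 1.0
        System.out.println(rightFirst); // 0.0
    }
}
```

This is why even the same implementation, run twice with more than one worker thread, can produce slightly different vectors; combined with random initialisation and negative sampling, exact reproducibility across implementations should not be expected.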



Source: https://stackoverflow.com/questions/32749330/getting-different-results-from-deeplearning4j-and-word2vec
