Combining/adding vectors from different word2vec models


Question


I am using gensim to create Word2Vec models trained on large text corpora. I have some models based on StackExchange data dumps. I also have a model trained on a corpus derived from English Wikipedia.

Assume that a vocabulary term is in both models, and that the models were created with the same parameters to Word2Vec. Is there any way to combine or add the vectors from the two separate models to create a single new model that has the same word vectors that would have resulted if I had combined both corpora initially and trained on this data?

The reason I want to do this is that I want to be able to generate a model with a specific corpus, and then if I process a new corpus later, I want to be able to add this information to an existing model rather than having to combine corpora and retrain everything from scratch (i.e. I want to avoid reprocessing every corpus each time I want to add information to the model).

Are there builtin functions in gensim or elsewhere that will allow me to combine models like this, adding information to existing models instead of retraining?


Answer 1:


Generally, only word vectors that were trained together are meaningfully comparable. (It's the interleaved tug-of-war during training that moves them to relative orientations that are meaningful, and there's enough randomness in the process that even models trained on the same corpus will vary in where they place individual words.)

Using words from both corpora as guideposts, it is possible to learn a transformation from one space A to the other space B that tries to move those known shared words to their corresponding positions in the other space. Then, applying that same transformation to the words in A that aren't in B, you can find B coordinates for those words, making them comparable to other native-B words.

This technique has been used with some success in word2vec-driven language translation (where the guidepost pairs are known translations), and as a means of growing a limited word-vector set with word vectors from elsewhere. Whether it would work well enough for your purposes, I don't know. I imagine it could go astray especially where the two training corpora use shared tokens in wildly different senses.
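The mapping itself is usually just a linear transform fitted over the shared-word pairs. Here is a minimal numpy sketch of that idea (not the exact gensim implementation); model_a, model_b, shared_words, and the example word are placeholders, with the models assumed to be gensim KeyedVectors:

import numpy as np

# Learn a linear map W that sends vectors from model A's space to model B's
# space, using words present in both vocabularies as anchors.
X = np.vstack([model_a[w] for w in shared_words])  # source-space vectors
Y = np.vstack([model_b[w] for w in shared_words])  # target-space vectors

# Least-squares solution to X @ W ≈ Y.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Project a word that exists only in model A into model B's space, then look
# for its nearest neighbours among native B vectors.
projected = model_a["some_word_only_in_A"] @ W  # placeholder word
print(model_b.similar_by_vector(projected, topn=5))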

There's a class in the gensim library, TranslationMatrix, that may be able to do this for you. See:

https://radimrehurek.com/gensim/models/translation_matrix.html

There's a demo notebook of its use at:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
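For reference, usage looks roughly like the following. This is a minimal sketch based on the linked documentation; model_a, model_b, and the word pairs are placeholders, and exact argument names may vary between gensim versions:

from gensim.models import translation_matrix

# Guidepost pairs: words known to correspond across the two models
# (here, identical shared tokens rather than translations).
word_pairs = [("king", "king"), ("queen", "queen"), ("bank", "bank")]

# model_a and model_b are assumed to be previously trained KeyedVectors.
transmat = translation_matrix.TranslationMatrix(model_a, model_b, word_pairs)
transmat.train(word_pairs)

# Find the closest model-B words for some model-A words.
print(transmat.translate(["king", "river"], topn=3))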

(Whenever practical, doing a full training on a mixed-together corpus, with all word examples, is likely to do better.)
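If that route is feasible, the baseline is simply retraining on everything at once. A sketch, assuming corpus_a and corpus_b are re-iterable sequences of tokenized sentences:

from gensim.models import Word2Vec

# Combine the two corpora and train a single model from scratch.
combined = list(corpus_a) + list(corpus_b)

# Note: the parameter is `vector_size` in gensim 4.x but `size` in 3.x.
model = Word2Vec(combined, vector_size=300, window=5, min_count=5, workers=4)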




Answer 2:


If you want to avoid training a new model on large mixed corpora with translations, I'd recommend checking out my new Python package (transvec) that allows you to convert word embeddings between pre-trained word2vec models. All you need to do is provide a representative set of individual words in the target language along with their translations in the source language as training data, which is much more manageable (I just took a few thousand words and threw them into Google translate for some pretty good results).

It works similarly to the TranslationMatrix mentioned in the other answer, in that it operates on pre-trained word2vec models, but in addition to translations it can also give you the translated word vectors, allowing you to do things like nearest-neighbour clustering on mixed-language corpora.

It also supports using regularisation in the training phase to help improve translations when your training data is limited.

Here's a small example:

import gensim.downloader
from transvec.transformers import TranslationWordVectorizer

# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")

# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
    ("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
    ("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]

bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)

# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]

Installation guidance and more details can be found on PyPI: https://pypi.org/project/transvec/.



Source: https://stackoverflow.com/questions/54243797/combining-adding-vectors-from-different-word2vec-models
