Gensim Word2Vec select minor set of word vectors from pretrained model

生来不讨喜 2020-12-19 06:32

I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model.

The problem is that the pretrained model is very large, and I only need the vectors for the relatively small set of words that actually appear in my own data. Is there a way to select just that subset of word vectors?

2 Answers
  •  醉酒成梦
    2020-12-19 07:03

    There's no built-in feature that does exactly that, but it shouldn't require much code, and could be modeled on existing gensim code. A few possible alternative strategies:

    1. Load the full vectors, then save them in an easy-to-parse format - such as via .save_word2vec_format(..., binary=False). This format is nearly self-explanatory; write your own code to drop all lines from the file whose words aren't on your whitelist (being sure to update the leading declaration of the entry count). The existing source code for load_word2vec_format() & save_word2vec_format() may be instructive. You'll then have a subset file; a minimal sketch of such a filter appears after this list.

    2. Or, pretend you were going to train a new Word2Vec model, using your corpus-of-interest (with just the interesting words), but only create the model and do the build_vocab() step. Now you have an untrained model with random vectors but exactly the right vocabulary. Grab the model's wv property - a KeyedVectors instance with that right vocabulary. Then separately load the oversized vector set and, for each word in the right-sized KeyedVectors, copy over the actual vector from the larger set. Then save the right-sized subset (see the second sketch after this list).

    3. Or, look at the (possibly-broken-since-gensim-3.4) Word2Vec method intersect_word2vec_format(). It more-or-less tries to do what's described in (2) above: with an in-memory model that has the vocabulary you want, merge in just the overlapping words from another word2vec-format set on disk. It'll either work or provide a template for what you'd want to do; a usage sketch follows the list.
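
    Here's a minimal sketch of strategy (1), assuming the big set has already been re-saved in the plain-text word2vec format (the filenames and the whitelist are hypothetical placeholders):

    ```python
    from gensim.models import KeyedVectors

    # One-time conversion of the big pretrained set to the text format
    # ('big_model.bin' / 'big_model.txt' are placeholder filenames).
    big = KeyedVectors.load_word2vec_format('big_model.bin', binary=True)
    big.save_word2vec_format('big_model.txt', binary=False)

    def filter_word2vec_text(src_path, dst_path, whitelist):
        """Keep only whitelisted words from a text-format vector file."""
        with open(src_path, encoding='utf8') as src:
            # The first line is a header: "<entry_count> <vector_size>".
            _, vector_size = src.readline().split()
            kept = [line for line in src
                    if line.split(' ', 1)[0] in whitelist]
        with open(dst_path, 'w', encoding='utf8') as dst:
            # Update the leading entry-count declaration to match.
            dst.write('%d %s\n' % (len(kept), vector_size))
            dst.writelines(kept)

    filter_word2vec_text('big_model.txt', 'subset.txt', {'apple', 'banana'})
    ```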
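
    And a sketch of strategy (2), using the gensim 3.x attribute names of the era (wv.vocab, size=; gensim 4.x renames these to key_to_index and vector_size), again with hypothetical filenames and corpus:

    ```python
    from gensim.models import Word2Vec, KeyedVectors

    # Corpus containing just the words of interest (placeholder data).
    corpus = [['apple', 'banana'], ['banana', 'cherry']]

    # Create the model and only do the build_vocab() step - no training.
    # The dimensionality must match the pretrained set.
    model = Word2Vec(size=300, min_count=1)
    model.build_vocab(corpus)
    small = model.wv  # KeyedVectors: random vectors, right vocabulary

    # Separately load the oversized set & copy over the actual vectors.
    big = KeyedVectors.load_word2vec_format('big_model.bin', binary=True)
    for word, vocab in small.vocab.items():
        if word in big:
            small.vectors[vocab.index] = big[word]

    small.save('subset.kv')  # reload later with KeyedVectors.load()
    ```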
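
    Finally, a usage sketch of strategy (3), again with gensim 3.x naming; as noted, the method may be broken in some releases, so treat this as a template to verify against your installed version:

    ```python
    from gensim.models import Word2Vec

    corpus = [['apple', 'banana'], ['banana', 'cherry']]  # placeholder

    model = Word2Vec(size=300, min_count=1)
    model.build_vocab(corpus)

    # Merge in vectors for overlapping words only; lockf=1.0 leaves the
    # imported vectors free to be adjusted by any later training.
    model.intersect_word2vec_format('big_model.bin', binary=True, lockf=1.0)

    model.wv.save('subset.kv')
    ```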
