Gensim Word2Vec select minor set of word vectors from pretrained model

生来不讨喜 2020-12-19 06:32

I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model.

The problem is that the pretrained model is very large, and I only need the vectors for the relatively small set of words that actually appear in my own data. Is there a way to select just that subset of word vectors?

2 Answers
  •  醉酒成梦
    2020-12-19 07:03

    There's no built-in feature that does exactly that, but it shouldn't require much code, and could be modeled on existing gensim code. A few possible alternative strategies:

    1. Load the full vectors, then save them in an easy-to-parse format - such as via .save_word2vec_format(..., binary=False). This format is nearly self-explanatory; write your own code to drop all lines from the file whose words aren't on your whitelist (being sure to update the leading declaration of the entry count). The existing source code for load_word2vec_format() & save_word2vec_format() may be instructive. You'll then have a subset file; a minimal sketch of such a filter appears after this list.

    2. Or, pretend you were going to train a new Word2Vec model, using your corpus-of-interest (with just the interesting words), but only create the model and do the build_vocab() step. Now you have an untrained model with random vectors but exactly the right vocabulary. Grab the model's wv property - a KeyedVectors instance with that right vocabulary. Then separately load the oversized vector set and, for each word in the right-sized KeyedVectors, copy over the actual vector from the larger set. Then save the right-sized subset (see the second sketch after this list).

    3. Or, look at the (possibly-broken-since-gensim-3.4) Word2Vec method intersect_word2vec_format(). It more-or-less tries to do what's described in (2) above: with an in-memory model that has the vocabulary you want, merge in just the overlapping words from another word2vec-format set on disk. It'll either work or provide a template for what you'd want to do; a usage sketch follows the list.
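
    Here's a minimal sketch of strategy (1), assuming the big set has already been re-saved in the plain-text word2vec format (the filenames and the whitelist are hypothetical placeholders):

    ```python
    from gensim.models import KeyedVectors

    # One-time conversion of the big pretrained set to the text format
    # ('big_model.bin' / 'big_model.txt' are placeholder filenames).
    big = KeyedVectors.load_word2vec_format('big_model.bin', binary=True)
    big.save_word2vec_format('big_model.txt', binary=False)

    def filter_word2vec_text(src_path, dst_path, whitelist):
        """Keep only whitelisted words from a text-format vector file."""
        with open(src_path, encoding='utf8') as src:
            # The first line is a header: "<entry_count> <vector_size>".
            _, vector_size = src.readline().split()
            kept = [line for line in src
                    if line.split(' ', 1)[0] in whitelist]
        with open(dst_path, 'w', encoding='utf8') as dst:
            # Update the leading entry-count declaration to match.
            dst.write('%d %s\n' % (len(kept), vector_size))
            dst.writelines(kept)

    filter_word2vec_text('big_model.txt', 'subset.txt', {'apple', 'banana'})
    ```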
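
    And a sketch of strategy (2), using the gensim 3.x attribute names of the era (wv.vocab, size=; gensim 4.x renames these to key_to_index and vector_size), again with hypothetical filenames and corpus:

    ```python
    from gensim.models import Word2Vec, KeyedVectors

    # Corpus containing just the words of interest (placeholder data).
    corpus = [['apple', 'banana'], ['banana', 'cherry']]

    # Create the model and only do the build_vocab() step - no training.
    # The dimensionality must match the pretrained set.
    model = Word2Vec(size=300, min_count=1)
    model.build_vocab(corpus)
    small = model.wv  # KeyedVectors: random vectors, right vocabulary

    # Separately load the oversized set & copy over the actual vectors.
    big = KeyedVectors.load_word2vec_format('big_model.bin', binary=True)
    for word, vocab in small.vocab.items():
        if word in big:
            small.vectors[vocab.index] = big[word]

    small.save('subset.kv')  # reload later with KeyedVectors.load()
    ```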
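
    Finally, a usage sketch of strategy (3), again with gensim 3.x naming; as noted, the method may be broken in some releases, so treat this as a template to verify against your installed version:

    ```python
    from gensim.models import Word2Vec

    corpus = [['apple', 'banana'], ['banana', 'cherry']]  # placeholder

    model = Word2Vec(size=300, min_count=1)
    model.build_vocab(corpus)

    # Merge in vectors for overlapping words only; lockf=1.0 leaves the
    # imported vectors free to be adjusted by any later training.
    model.intersect_word2vec_format('big_model.bin', binary=True, lockf=1.0)

    model.wv.save('subset.kv')
    ```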
