Export gensim doc2vec embeddings into separate file to use with keras Embedding layer later

隐身守侯 提交于 2021-02-10 07:05:49

问题


I am a bit new to gensim and right now I am trying to solve the problem which involves using the doc2vec embeddings in keras. I wasn't able to find existing implementation of doc2vec in keras - as far as I see in all examples I found so far everyone just uses the gensim to get the document embeddings.

Once I trained my doc2vec model in gensim I need to export embeddings weights from genim into keras somehow and it is not really clear on how to do that. I see that

model.syn0

Supposedly gives the word2vec embedding weights (according to this). But it is unclear how to do the same export for document embeddings. Any advise?

I know that in general I can just get the embeddings for each document directly from gensim model but I want to fine-tune the embedding layer in keras later on, since doc embeddings will be used as a part of a larger task hence they might be fine-tuned a bit.


回答1:


I figured this out.

Assuming you already trained the gensim model and used string tags as document ids:

#get vector of doc
model.docvecs['2017-06-24AEON']
#raw docvectors (all of them)
model.docvecs.doctag_syn0
#docvector names in model
model.docvecs.offset2doctag

You can export this doc vectors into keras embedding layer as below, assuming your DataFrame df has all of the documents out there. Notice that in the embedding matrix you need to pass only integers as inputs. I use raw number in dataframe as the id of the doc for input. Also notice that embedding layer requires to not touch index 0 - it is reserved for masking, so when I pass the doc id as input to my network I need to ensure it is >0

#creating embedding matrix
embedding_matrix = np.zeros((len(df)+1, text_encode_dim))
for i, row in df.iterrows():
    embedding = modelDoc2Vec.docvecs[row['docCode']]
    embedding_matrix[i+1] = embedding

#input with id of document
doc_input = Input(shape=(1,),dtype='int16', name='doc_input')
#embedding layer intialized with the matrix created earlier
embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df)+1,weights=[embedding_matrix], input_length=1, trainable=False)(doc_input)

UPDATE

After late 2017, with the introduction of Keras 2.0 API very last line should be changed to:

embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df)+1,embeddings_initializer=Constant(embedding_matrix), input_length=1, trainable=False)(doc_input)


来源:https://stackoverflow.com/questions/48999199/export-gensim-doc2vec-embeddings-into-separate-file-to-use-with-keras-embedding

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!