Why do we use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?

Posted by 六月ゝ 毕业季﹏ on 2019-12-24 00:48:54

Question


In word2vec, after training, we get two weight matrices: 1. the input-hidden weight matrix; 2. the hidden-output weight matrix. People then use the input-hidden weight matrix as the word vectors (each row corresponds to a word, i.e., is that word's vector). This leads to my confusions:

  1. Why do people use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?
  2. Why don't we just apply the softmax activation at the hidden layer rather than the output layer, thus avoiding the time-consuming computation over the whole vocabulary?

Additionally, any remarks clarifying the intuition behind why word vectors can be obtained this way would be appreciated.
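To make the question concrete, here is a minimal sketch of the two matrices in a skip-gram-style network. The sizes and the names `W_in`/`W_out` are my own choices for illustration, not anything from the question:

```python
import numpy as np

# Hypothetical sizes, chosen just to illustrate the two matrices
# (the names W_in / W_out are mine, not standard).
vocab_size, embed_dim = 10000, 300

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(vocab_size, embed_dim))   # input-hidden
W_out = rng.normal(scale=0.01, size=(embed_dim, vocab_size))  # hidden-output

# "Looking up" the word vector for word id 42 is just taking row 42
# of the input-hidden matrix:
word_id = 42
hidden = W_in[word_id]                  # shape: (embed_dim,)

# The output scores over the whole vocabulary come from the second
# matrix, followed by a softmax -- the expensive step question 2 asks about.
scores = hidden @ W_out                 # shape: (vocab_size,)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
```

Note that the softmax only makes sense at the output layer: it normalizes scores over the whole vocabulary into a probability distribution over predicted context words, which is exactly what the training objective needs.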


Answer 1:


For question 1:

Because the input weight matrix holds a word's vector in its role as the target word, while the output weight matrix holds its vector as a context word. The vector we want to learn for a word is its vector as the target word, since the intuition behind word2vec is that words (as target words!) which occur in similar contexts learn similar vector representations.

So the vector for a context word exists only for training's sake. One might think of using the same vector for both roles, but learning the two separately works better. For example, if we used the same vector representations, our model would assign the highest probability to a word occurring in a context of itself (the dot product of two identical vectors), which is obviously wrong: how often do we use many identical words in a row?
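That last point can be demonstrated with a toy example of my own (not from the answer): if both roles share one unit-normalised embedding matrix, every word's score with itself is 1.0, while its score with any other word is a cosine similarity strictly below 1 for non-parallel vectors:

```python
import numpy as np

# Toy illustration: one shared matrix for both target and context roles.
rng = np.random.default_rng(1)
V = rng.normal(size=(5, 8))                     # 5 words, 8 dimensions
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-norm rows

scores = V @ V.T                                # scores[i, j] = v_i . v_j

# Each row's maximum lies on the diagonal: the model would rank every
# word as the most likely context of itself, which is clearly wrong.
best_context = np.argmax(scores, axis=1)
```

With two separate matrices, the score of a word with itself is the dot product of two *different* vectors, so nothing forces this degenerate self-prediction.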




Answer 2:


Regarding the two matrices, the input-hidden weight matrix and the hidden-output weight matrix, there is an interesting research paper: 'A Dual Embedding Space Model for Document Ranking', Mitra et al., arXiv 2016 (https://arxiv.org/pdf/1602.01137.pdf). Similar to your question, this paper studies how these two weight matrices differ and claims that they encode different characteristics of words.

Overall, from my understanding, it is your choice to use the input-hidden weight matrix (the convention), the hidden-output weight matrix, or a combination of the two as word embeddings, depending on your data and the problem to solve.
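As a minimal sketch of the "combined" option, one simple choice is to average a word's input and output vectors (the averaging is my assumption here; the paper also explores using the two spaces jointly in other ways). The random matrices below merely stand in for trained weights:

```python
import numpy as np

vocab_size, embed_dim = 1000, 50
rng = np.random.default_rng(2)
W_in = rng.normal(size=(vocab_size, embed_dim))    # stand-in for trained IN matrix
W_out = rng.normal(size=(vocab_size, embed_dim))   # stand-in for trained OUT matrix

# Option 1 (convention): rows of W_in.
# Option 2: rows of W_out.
# Option 3: average the two, giving one combined vector per word.
W_combined = (W_in + W_out) / 2.0
```

Note this assumes the hidden-output matrix has been transposed so that each word again corresponds to one row of the same dimensionality.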



Source: https://stackoverflow.com/questions/46065773/why-we-use-input-hidden-weight-matrix-to-be-the-word-vectors-instead-of-hidden-o
