how to preserve number of records in word2vec?

Submitted by 梦想与她 on 2020-01-07 03:48:06

Question


I have 45,000 text records in my dataframe. I want to convert those 45,000 records into word vectors so that I can train a classifier on them. I am not tokenizing the sentences; I just split each entry into a list of words.

After training a word2vec model with 300 features, the model ended up with only 26,000 vectors. How can I preserve all of my 45,000 records?

For the classifier, I need vectors for all 45,000 records so that they line up with my 45,000 output labels.


Answer 1:


If you are splitting each entry into a list of words, that's essentially 'tokenization'.

Word2Vec just learns vectors for each word, not for each text example ('record') – so there's nothing to 'preserve': no vectors for the 45,000 records are ever created. But if there are 26,000 unique words among the records (after applying min_count), you will have 26,000 word vectors at the end.
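To illustrate the distinction, here is a minimal sketch using gensim (assuming gensim 4.x parameter names such as vector_size; the toy texts list just stands in for your 45,000 records). The number of vectors in model.wv tracks the vocabulary size, not the record count:

```python
from gensim.models import Word2Vec

# Toy stand-in for the 45,000 records, each already split into a list of words.
texts = [
    "the cat sat on the mat".split(),
    "dogs chase cats".split(),
    "the dog sat".split(),
]

model = Word2Vec(sentences=texts, vector_size=300, min_count=1, epochs=10)

print(len(texts))      # number of records
print(len(model.wv))   # number of word vectors = vocabulary size, not record count
```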

Gensim's Doc2Vec (the 'Paragraph Vector' algorithm) can create a vector for each text example, so you may want to try that.
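A rough sketch of that approach (again assuming a recent gensim, where the document vectors live under model.dv; older versions use model.docvecs): tag each record with its index, and the trained model then holds exactly one vector per record.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Stand-in data: in practice, iterate over your 45,000 dataframe rows.
records = ["the cat sat on the mat", "dogs chase cats", "the dog sat"]
tagged = [TaggedDocument(words=r.split(), tags=[i]) for i, r in enumerate(records)]

model = Doc2Vec(tagged, vector_size=300, min_count=1, epochs=20)

print(len(model.dv))                              # one vector per record
X = [model.dv[i] for i in range(len(records))]    # features for a downstream classifier
```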

If you only have word-vectors, one simplistic way to create a vector for a larger text is to add all of its individual word vectors together. Further options include choosing between the unit-normed word-vectors or the raw word-vectors of varying magnitudes; whether to then unit-norm the sum; and whether to weight the words by some other importance factor (such as TF/IDF).
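A minimal sketch of that summing approach, assuming you already have a trained model's model.wv (the text_vector helper name is made up for illustration); out-of-vocabulary words are simply skipped, and unit-norming the result is optional as noted above:

```python
import numpy as np

def text_vector(words, wv, normalize=True):
    """Sum the word vectors of one record; skip words missing from the vocabulary."""
    vecs = [wv[w] for w in words if w in wv]
    if not vecs:
        return np.zeros(wv.vector_size)   # record had no in-vocabulary words
    total = np.sum(vecs, axis=0)
    if normalize:
        norm = np.linalg.norm(total)
        if norm > 0:
            total = total / norm          # optional unit-norming of the sum
    return total

# One fixed-size vector per record, so all 45,000 records keep a row:
# X = np.vstack([text_vector(words, model.wv) for words in texts])
```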

Note that unless your documents are very long, this is quite a small training set for either Word2Vec or Doc2Vec.



Source: https://stackoverflow.com/questions/44740161/how-to-preserve-number-of-records-in-word2vec
