how to preserve number of records in word2vec?

Submitted by 梦想与她 on 2020-01-07 03:48:06

Question


I have 45,000 text records in my dataframe. I want to convert those 45,000 records into word vectors so that I can train a classifier on them. I am not tokenizing the sentences; I just split each entry into a list of words.

After training a word2vec model with 300 features, the model ended up with only 26,000 vectors. How can I preserve all of my 45,000 records?

For the classifier, I need vectors for all 45,000 records so that they line up with my 45,000 output labels.


Answer 1:


If you are splitting each entry into a list of words, that's essentially 'tokenization'.

Word2Vec just learns vectors for each word, not for each text example ('record') – so there's nothing to 'preserve': no vectors for the 45,000 records are ever created. But if there are 26,000 unique words among the records (after applying min_count), you will have 26,000 word vectors at the end.
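To illustrate the distinction, here is a minimal sketch using gensim (assuming gensim 4.x parameter names such as vector_size; the toy texts list just stands in for your 45,000 records). The number of vectors in model.wv tracks the vocabulary size, not the record count:

```python
from gensim.models import Word2Vec

# Toy stand-in for the 45,000 records, each already split into a list of words.
texts = [
    "the cat sat on the mat".split(),
    "dogs chase cats".split(),
    "the dog sat".split(),
]

model = Word2Vec(sentences=texts, vector_size=300, min_count=1, epochs=10)

print(len(texts))      # number of records
print(len(model.wv))   # number of word vectors = vocabulary size, not record count
```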

Gensim's Doc2Vec (the 'Paragraph Vector' algorithm) can create a vector for each text example, so you may want to try that.
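A rough sketch of that approach (again assuming a recent gensim, where the document vectors live under model.dv; older versions use model.docvecs): tag each record with its index, and the trained model then holds exactly one vector per record.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Stand-in data: in practice, iterate over your 45,000 dataframe rows.
records = ["the cat sat on the mat", "dogs chase cats", "the dog sat"]
tagged = [TaggedDocument(words=r.split(), tags=[i]) for i, r in enumerate(records)]

model = Doc2Vec(tagged, vector_size=300, min_count=1, epochs=20)

print(len(model.dv))                              # one vector per record
X = [model.dv[i] for i in range(len(records))]    # features for a downstream classifier
```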

If you only have word-vectors, one simplistic way to create a vector for a larger text is to add all of its individual word vectors together. Further options include choosing between the unit-normed word-vectors or the raw word-vectors of varying magnitudes; whether to then unit-norm the sum; and whether to weight the words by some other importance factor (such as TF/IDF).
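A minimal sketch of that summing approach, assuming you already have a trained model's model.wv (the text_vector helper name is made up for illustration); out-of-vocabulary words are simply skipped, and unit-norming the result is optional as noted above:

```python
import numpy as np

def text_vector(words, wv, normalize=True):
    """Sum the word vectors of one record; skip words missing from the vocabulary."""
    vecs = [wv[w] for w in words if w in wv]
    if not vecs:
        return np.zeros(wv.vector_size)   # record had no in-vocabulary words
    total = np.sum(vecs, axis=0)
    if normalize:
        norm = np.linalg.norm(total)
        if norm > 0:
            total = total / norm          # optional unit-norming of the sum
    return total

# One fixed-size vector per record, so all 45,000 records keep a row:
# X = np.vstack([text_vector(words, model.wv) for words in texts])
```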

Note that unless your documents are very long, this is quite a small training set for either Word2Vec or Doc2Vec.



Source: https://stackoverflow.com/questions/44740161/how-to-preserve-number-of-records-in-word2vec
