Understanding the output of Doc2Vec from Gensim package

血红的双手。 提交于 2019-12-01 02:09:02

问题


I have some sample sentences that I want to run through a Doc2Vec model. My end goal is a matrix of size (num_sentences, num_features).

I'm using the Gensim package.

from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
# warning: long sample of data. It's just 40 sentences really though.
labeled_sents = [TaggedDocument(words=['u0644', 'u0646', 'u062f', 'u0646', 'u060c', 'u0628', 'u0631', 'u0637', 'u0627', 'u0646', 'u06cc', 'u06c1', 'u06a9', 'u0627'], tags='400'), TaggedDocument(words=['do', 'pan', 'en', '1713', 'o', 'soar', 'onde', 'se', 'sit', 'xfaa'], tags='401'), TaggedDocument(words=['u0420', 'u044c', 'u043e', 'u043d', 'u0442', 'u0433', 'u0435', 'u043d', '1901', 'xa0', 'u2022', 'u041b', 'u043e', 'u0440', 'u0435', 'u043d', 'u0446', 'xa0', 'u0417', 'u0435', 'u0435', 'u043c', 'u0430', 'u043d', '1902', 'xa0', 'u2022', 'u0411', 'u0435', 'u043a', 'u0435', 'u0440', 'u0435', 'u043b', 'xa0', 'u041f', 'u0438', 'u0435', 'u0440', 'u041a', 'u044e', 'u0440', 'u0438', 'xa0', 'u041c', 'u0430', 'u0440', 'u0438', 'u044f', 'u041a', 'u044e', 'u0440', 'u0438', '1903', 'xa0', 'u2022', 'u0420', 'u0435', 'u043b', 'u0435', 'u0439', '1904', 'xa0', 'u2022', 'u041b', 'u0435', 'u043d', 'u0430', 'u0440', 'u0434', '1905', 'xa0', 'u2022', 'u0414', 'u0436', 'u0414', 'u0436', 'u0422', 'u043e', 'u043c', 'u0441', 'u044a', 'u043d', '1906', 'xa0', 'u2022', 'u041c', 'u0430', 'u0439', 'u043a', 'u0435', 'u043b', 'u0441', 'u044a', 'u043d', '1907', 'xa0', 'u2022', 'u041b', 'u0438', 'u043f', 'u043c', 'u0430', 'u043d', '1908', 'xa0', 'u2022', 'u041c', 'u0430', 'u0440', 'u043a', 'u043e', 'u043d', 'u0438', 'xa0', 'u0411', 'u0440', 'u0430', 'u0443', 'u043d', '1909', 'xa0', 'u2022', 'u0412', 'u0430', 'u043d', 'xa0', 'u0434', 'u0435', 'u0440', 'xa0', 'u0412', 'u0430', 'u0430', 'u043b', 'u0441', '1910', 'xa0', 'u2022', 'u0412', 'u0438', 'u043d', '1911', 'xa0', 'u2022', 'u0414', 'u0430', 'u043b', 'u0435', 'u043d', '1912', 'xa0', 'u2022', 'u041a', 'u0430', 'u043c', 'u0435', 'u0440', 'u043b', 'u0438', 'u043d', 'u0433', 'xa0', 'u041e', 'u043d', 'u0435', 'u0441', '1913', 'xa0', 'u2022', 'u0424', 'u043e', 'u043d', 'xa0', 'u041b', 'u0430', 'u0443', 'u0435', '1914', 'xa0', 'u2022', 'u0423', 'u0438', 'u043b', 'u044f', 'u043c', 'u041b', 'u0411', 'u0440', 'u0430', 'u0433', 'xa0', 'u0423', 'u0438', 'u043b', 'u044f', 'u043c', 'u0425', 'u0411', 'u0440', 'u0430', 'u0433', '1915', 'xa0', 'u2022', 'u0411', 'u0430', 'u0440', 'u043a', 'u043b', 'u0430', '1917', 'xa0', 'u2022', 'u041f', 'u043b', 'u0430', 'u043d', 'u043a', '1918', 'xa0', 'u2022', 'u0429', 'u0430', 'u0440', 'u043a', '1919'], tags='402'), TaggedDocument(words=['nagusia', 'da'], tags='403'), TaggedDocument(words=['sino', 'que', 'los', 'ciudadanos', 'pueden', 'elegir', 'detraer', 'un', 'porcentaje', 'de', 'sus', 'impuestos', 'para', 'esta', 'causa', '68', '69', 'un', 'sistema', 'similar', 'se', 'da', 'en', 'alemania', 'o', 'austria', 'aunque', 'all', 'xed', 'se', 'impone', 'un', 'impuesto', 'eclesi', 'xe1stico'], tags='404'), TaggedDocument(words=['1244', 'c', 'xfc'], tags='405'), TaggedDocument(words=['u062a', 'u063a', 'u064a', 'u064a', 'u0631', 'u0644', 'u0641', 'u0638', 'u0627', 'u0644', 'u0643', 'u0644', 'u0645', 'u0629', 'u060c', 'u0641', 'u0645', 'u062b', 'u0644', 'u0627', 'u064b', 'rat', 'u062a', 'u0644', 'u0641', 'u0638', 'u0631', 'u0627', 'u062a'], tags='406'), TaggedDocument(words=['d', 'xfcrziler'], tags='407'), TaggedDocument(words=['xung', 'quanh', 'u0111', 'xf3'], tags='408'), TaggedDocument(words=['oblika', 'u0161to'], tags='409'), TaggedDocument(words=['u0432', 'u0430', 'u043b', 'u044e', 'u0442', 'u043d', 'u043e', 'u0433', 'u043e', 'u0441', 'u043e', 'u044e', 'u0437', 'u0443'], tags='410'), TaggedDocument(words=['sacerdotal', 'es'], tags='411'), TaggedDocument(words=['natoque', 'nisi'], tags='412'), TaggedDocument(words=['u0631', 'u0627', 'u0645', 'u06cc', 'u200c', 'u062a', 'u0648', 'u0627', 'u0646', 'u062f', 'u0631', 'u0627', 'u06cc', 'u0627', 'u0644', 'u0627', 'u062a', 'u0645', 'u062a', 'u062d', 'u062f', 'u0647', 'u0622', 'u0645', 'u0631', 'u06cc', 'u06a9', 'u0627', 'u06a9', 'u0627', 'u0646', 'u0627', 'u062f', 'u0627', 'u0628', 'u0631', 'u0632', 'u06cc', 'u0644', 'u0648', 'u0622', 'u0631', 'u0698', 'u0627', 'u0646', 'u062a', 'u06cc', 'u0646'], tags='413'), TaggedDocument(words=['u0423', 'u0439', 'u0433', 'u0443', 'u0440', 'u0441', 'u044c', 'u043a', 'u0430', 'u043c', 'u043e', 'u0432', 'u0430'], tags='414'), TaggedDocument(words=['termin', 'poznat', 'kao'], tags='415'), TaggedDocument(words=['les', 'fr', 'xe8res', 'lumi', 'xe8re'], tags='416'), TaggedDocument(words=['26', 'u03c0', 'u03b5', 'u03c1', 'u03af', 'u03c0', 'u03bf', 'u03c5', 'u03b1', 'u03b9', 'u03ce', 'u03bd', 'u03b5', 'u03c2', 'u03b7', 'u03c0', 'u03cc', 'u03bb', 'u03b7', 'u03c4', 'u03b7', 'u03c2', 'u0391', 'u03c5', 'u03bb', 'u03ce', 'u03bd', 'u03b1', 'u03c2', 'u03b5', 'u03af', 'u03bd', 'u03b1', 'u03b9', 'u03c3', 'u03ae', 'u03bc', 'u03b5', 'u03c1', 'u03b1'], tags='417'), TaggedDocument(words=['xcen', '13'], tags='418'), TaggedDocument(words=['acts', 'of', 'civil', 'disobedience', 'forced', 'the', 'head', 'of', 'the', 'local'], tags='419'), TaggedDocument(words=['hugo', 'az', 'xe1llamcs', 'xedny'], tags='420'), TaggedDocument(words=['f', 'xf8rste', 'nu', 'uofficielle', 'vers', 'forbindes', 'ofte', 'med', 'nynazistiske', 'synspunkter'], tags='421'), TaggedDocument(words=['gisulti', 'kanila', 'sa', 'mga', 'langyaw', 'nagtuong', 'gipangutana', 'sila', 'kon'], tags='422'), TaggedDocument(words=['u043d', 'u0430', 'u0438', 'u0432', 'u0440', 'u0438', 'u0442'], tags='423'), TaggedDocument(words=['its', 'influence'], tags='424'), TaggedDocument(words=['a', 'b', 'azerbaijan', 'homeowners', 'evicted', 'for', 'city'], tags='425'), TaggedDocument(words=['dinast', 'xeda', 'lunar', 'de'], tags='426'), TaggedDocument(words=['2', 'wyznawa', 'u0142o', 'judaizmu', '5', 'ponad'], tags='427'), TaggedDocument(words=['quyosh', 'vaqt', 'degani'], tags='428'), TaggedDocument(words=['u306e', 'u884c', 'u4fe1', 'u30fb', 'u91cd', 'u5f18', 'u3001', 'u9678', 'u5965', 'u56fd', 'u306e', 'u821e', 'u8349', 'u6d3e', 'u3001', 'u51fa', 'u7fbd', 'u56fd', 'u306e', 'u6708', 'u5c71', 'u6d3e', 'u3001', 'u4f2f', 'u8006', 'u56fd', 'u306e', 'u5b89', 'u92fc', 'u6d3e', 'u3001', 'u5099', 'u4e2d', 'u56fd', 'u306e', 'u53e4', 'u9752', 'u6c5f', 'u6d3e', 'u306e', 'u5b88', 'u6b21', 'u30fb', 'u6052', 'u6b21', 'u30fb', 'u5eb7', 'u6b21', 'u30fb', 'u8c9e', 'u6b21', 'u30fb', 'u52a9', 'u6b21', 'u30fb', 'u5bb6', 'u6b21', 'u30fb', 'u6b63', 'u6052', 'u3001', 'u8c4a', 'u5f8c', 'u56fd', 'u306e', 'u5b9a', 'u79c0', 'u6d3e', 'u3001', 'u85a9', 'u6469', 'u56fd', 'u306e', 'u53e4', 'u6ce2', 'u5e73', 'u6d3e', 'u306e', 'u884c', 'u5b89', 'u306a', 'u3069', 'u304c', 'u5b58', 'u5728', 'u3059', 'u308b', '7', '8', '9'], tags='429'), TaggedDocument(words=['p', 'xe5', '4'], tags='430'), TaggedDocument(words=['editovat'], tags='431'), TaggedDocument(words=['u0437', 'u0437', 'u0430', 'u0431', 'u043e', 'u0439', 'u0441', 'u0442', 'u0432', 'u0430', 'u043c', 'u0443'], tags='432'), TaggedDocument(words=['10', 'u043b', 'u0438', 'u043f', 'u043d', 'u044f', '1943', 'u0440', 'u043e', 'u043a', 'u0443', 'u0441', 'u043e', 'u044e', 'u0437', 'u043d', 'u0438', 'u043a', 'u0438', 'u0432', 'u0438', 'u0441', 'u0430', 'u0434', 'u0438', 'u043b', 'u0438', 'u0441', 'u044f', 'u0432', 'u0421', 'u0438', 'u0446', 'u0438', 'u043b', 'u0456', 'u0457', 'u0406', 'u0442', 'u0430', 'u043b', 'u0456', 'u0439', 'u0441', 'u044c', 'u043a', 'u0456'], tags='433'), TaggedDocument(words=['136', 'selvom', 'det', 'egentligt', 'ligger', 'i', 'sundby', 'p', 'xe5', 'lollandssiden', 'af', 'guldborgsund', 'centret', 'blev', 'grundlagt', 'i', '1989', 'da', 'byen', 'fejrede', '700', 'xe5rs', 'jubil', 'xe6um', 'bymuseet', 'rekonstruerede', 'som', 'de', 'f', 'xf8rste', 'i', 'verden', 'en', 'middelalderlig', 'kastemaskine', 'kaldet', 'en', 'blide'], tags='434'), TaggedDocument(words=['latine', 'redditur'], tags='435'), TaggedDocument(words=['ljubljani', 'in', 'njeni'], tags='436'), TaggedDocument(words=['u0442', 'u0430', 'u043d', 'u044b', 'u043c', 'u0430', 'u043b', 'u049b', 'u043e', 'u043d', 'u0430', 'u049b', 'u04af', 'u0439', 'u043b', 'u0435', 'u0440'], tags='437'), TaggedDocument(words=['u2022', 'hassib', 'ben'], tags='438'), TaggedDocument(words=['kurtulmu', 'u015f', 'olan', 'u0130talya'], tags='439')]

model = Doc2Vec(documents=labeled_sents, size=10, alpha=.035, window=4, 
    sample=1e-5, workers=4, min_count=1)

Now, I thought that model.docvecs would give me a list of arrays, with the first array corresponding to the vector for sentence 1, the second array corresponding to the vector for sentence 2, etc. But instead, it's got length 10!

I get model.docvecs[0] = array([ 0.02312995, -0.00339695, -0.01273827, 0.01944644, -0.03247212, -0.04663946, 0.01369059, 0.03289782, 0.03516903, -0.03435936], dtype=float32)

What are these docvecs then? How do I get the output desired, which is a matrix of dimensions (40, 10) in this example?


I saw this here, and the correct answer says at the bottom "where 99 is the document id whose vector we want." So this makes me even more confused, as he seems to say that model.docvecs SHOULD be indexing a matrix where each row is a document vector!


回答1:


TaggedDocument expects tags to be a list of tags related to document.

In your case,

sentence = TaggedDocument(words=['a', 'b'], tags='400')

gets interpreted as sentence having 3 tags ['4','0','0'], and hence model.docvecs returns vectors corresponding to 10 tags - ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

Try changing this to

sentence = TaggedDocument(words=['a', 'b'], tags=['400'])



回答2:


model.docvecs is an iterable with length equal to the number of documents you supplied the model. Each docvec is a vector representation of a single document. Its length is determined by the size parameter that you gave it when you trained the model. size is commonly between 100 and 300, and sometimes longer. A vector of length 10 would do a poor job at representing the documents you fed it.

Thus, something like this would be more productive:

for i in range(0, len(lot)):
    docs.append(gn.models.doc2vec.TaggedDocument(words=lot[i], tags=[i]))

Where lot is a list of lists of tokens (words) like this:

lot = [['the','cat','sat'],['the','dog','ran']]

Running the model:

gn.models.doc2vec.Doc2Vec(docs, size=300, window=8, dm=1, hs=1, alpha=.025, min_alpha=.0001)


来源:https://stackoverflow.com/questions/37196520/understanding-the-output-of-doc2vec-from-gensim-package

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!