doc2vec

Gensim: how to retrain doc2vec model using previous word2vec model

不羁岁月 submitted on 2019-12-08 04:41:27
Question: With Doc2Vec modelling, I have trained a model and saved the following files: 1. model 2. model.docvecs.doctag_syn0.npy 3. model.syn0.npy 4. model.syn1.npy 5. model.syn1neg.npy However, I now have a new way to label the documents and want to train the model again. Since the word vectors were already obtained from the previous run, is there any way to reuse that model (e.g., taking the previous w2v results as the initial vectors for training)? Does anyone know how to do it? Answer 1: I've figured out that we can just
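A hedged sketch (not the answer truncated above) of one way to reuse the earlier model's word vectors: load the saved model, build a fresh Doc2Vec over the newly labelled corpus, and copy the old vectors for every word the two vocabularies share before training. Attribute names assume a recent gensim (4.x); the new corpus here is a placeholder.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

old_model = Doc2Vec.load("model")  # the previously saved model from the question

new_corpus = [TaggedDocument(words=["some", "tokens"], tags=["new_label_0"])]  # placeholder corpus
new_model = Doc2Vec(vector_size=old_model.vector_size, min_count=1)
new_model.build_vocab(new_corpus)

# seed the new word vectors with the old ones wherever the vocabularies overlap
for word, idx in new_model.wv.key_to_index.items():
    if word in old_model.wv.key_to_index:
        new_model.wv.vectors[idx] = old_model.wv[word]

new_model.train(new_corpus, total_examples=new_model.corpus_count, epochs=new_model.epochs)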

Convert a column in a dask dataframe to a TaggedDocument for Doc2Vec

那年仲夏 submitted on 2019-12-08 03:47:29
Question: Intro: I am currently trying to use dask in concert with gensim for NLP document computation, and I'm running into an issue when converting my corpus into a "TaggedDocument". Because I've tried so many different ways to wrangle this problem, I'll list my attempts; each one runs into slightly different trouble. First, some initial givens. The data: df.info() <class 'dask.dataframe.core.DataFrame'> Columns: 5 entries, claim_no to litigation dtypes: object(2), int64
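A minimal sketch of the conversion, assuming a hypothetical text column named claim_text: gensim itself is not dask-aware, so the simplest route is to materialise the column and build an ordinary list of TaggedDocument objects, tagging each row by its index.

import dask.dataframe as dd
import pandas as pd
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

pdf = pd.DataFrame({"claim_text": ["first claim text", "second claim text"]})  # stand-in data
df = dd.from_pandas(pdf, npartitions=1)

# pull the text column back into memory; Doc2Vec expects a plain iterable of TaggedDocument
texts = df["claim_text"].compute()
corpus = [TaggedDocument(words=simple_preprocess(t), tags=[str(i)]) for i, t in texts.items()]
print(corpus[0])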

5 Bag-of-Words and Word Vector Models

拜拜、爱过 submitted on 2019-12-06 10:19:43
The Bag-of-Words Model. The concept: the bag-of-words model looks like a bag into which all the words are thrown, but it is not quite that simple. As a simplifying assumption in natural language processing and information retrieval, the bag-of-words model treats a text (a paragraph or a document) as an unordered collection of words, ignoring grammar and even word order; every word is counted and the number of times it occurs is recorded. It is commonly used in text classification, for example with naive Bayes, LDA, and LSA. Hands-on with the bag-of-words model: (1) Bag-of-words model. In this example we write the code ourselves to see how the bag-of-words model works. First, import the jieba tokenizer, the corpus, and the stop words (here a set of punctuation marks; you can add to it by hand or replace it with a text dictionary). import jieba # define the stop words / punctuation punctuation = [",","。", ":", ";", "?"] # define the corpus content = ["机器学习带动人工智能飞速的发展。", "深度学习带动人工智能飞速的发展。", "机器学习和深度学习带动人工智能飞速的发展。" ] Next, we tokenize the corpus using the lcut() method: # tokenize segs_1 = [jieba.lcut(con) for con in content] print(segs_1) The tokenized result is: [['机器', '学习', '带动', '人工智能', '飞速', '的', '发展', '
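The listing cuts the tutorial off here; the following is a hedged continuation of the same example showing the usual next steps, not the original article's exact code: drop the punctuation "stop words", collect the vocabulary (the bag), and count each token per document to get bag-of-words vectors.

import jieba

punctuation = [",", "。", ":", ";", "?"]
content = ["机器学习带动人工智能飞速的发展。",
           "深度学习带动人工智能飞速的发展。",
           "机器学习和深度学习带动人工智能飞速的发展。"]

segs = [jieba.lcut(con) for con in content]
tokenized = [[w for w in seg if w not in punctuation] for seg in segs]  # remove punctuation tokens

bag_of_words = sorted(set(w for doc in tokenized for w in doc))  # the vocabulary ("bag")
vectors = [[doc.count(w) for w in bag_of_words] for doc in tokenized]  # one count vector per document
print(bag_of_words)
print(vectors)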

Doc2vec: model.docvecs is only of length 10

只谈情不闲聊 submitted on 2019-12-06 06:02:53
I am trying doc2vec on 600000 rows of sentences and my code is below: model = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, window=4, iter=50, workers=cores) model.build_vocab(res) model.train(res, total_examples=model.corpus_count, epochs=model.iter) # len(res) = 663406 # number of unique words is 15581 print(len(model.wv.vocab)) # number of doc vectors is 10 len(model.docvecs) # each of length 100 len(model.docvecs[1]) How do I interpret this result? Why are there only 10 document vectors, each of size 100? When the length of 'res' is 663406, that does not make sense. I know something
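A hedged illustration of how the docvecs count follows the tags rather than the raw rows: if every element of res is a TaggedDocument carrying its own unique tag, len(model.docvecs) equals the number of documents. Sketch only, written for a recent gensim where the constructor arguments are vector_size/epochs instead of the older size/iter used in the question.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["some", "tokenised", "sentence"], ["another", "sentence"]]  # stand-in corpus
res = [TaggedDocument(words=words, tags=[i]) for i, words in enumerate(texts)]

model = Doc2Vec(vector_size=100, min_count=1, window=4, epochs=5)
model.build_vocab(res)
model.train(res, total_examples=model.corpus_count, epochs=model.epochs)

print(len(model.dv))     # == len(res), one vector per unique tag (model.docvecs in older gensim)
print(len(model.dv[0]))  # == vector_size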

What are doc2vec training iterations?

為{幸葍}努か submitted on 2019-12-06 02:14:49
I am new to doc2vec. I was initially trying to understand doc2vec, and my code using Gensim is shown below. As intended, I get a trained model and document vectors for the two documents. However, I would like to know the benefits of retraining the model over several epochs and how to do it in Gensim. Can we do it using the iter or alpha parameter, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs. Also, I am interested in knowing whether multiple training iterations are needed for a word2vec model as well.
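A minimal sketch, assuming a recent gensim: the number of passes over the corpus is set by the epochs parameter (iter in older releases), so a single train() call with epochs=20 already performs 20 passes; no hand-written for loop with manual alpha decay is needed.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["first", "toy", "document"], tags=[0]),
        TaggedDocument(words=["second", "toy", "document"], tags=[1])]

model = Doc2Vec(vector_size=50, min_count=1, epochs=20)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)  # 20 passes
print(model.dv[0][:5])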

ELKI Kmeans clustering Task failed error for high dimensional data

只愿长相守 submitted on 2019-12-04 05:51:43
Question: I have 60000 documents which I processed in gensim, obtaining a 60000*300 matrix. I exported this as a CSV file. When I import it into the ELKI environment and run KMeans clustering, I get the error below. Task failed de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation
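A hedged note on the usual cause: "DoubleVector,variable,mindim=266,maxdim=300" means ELKI parsed rows of varying length, so it sees a variable-dimensional vector rather than the fixed 300-dimensional field KMeans requires; this typically comes from header lines, empty cells, or stray labels in the CSV. One way to export a strictly rectangular file from Python (a sketch, with a stand-in matrix):

import numpy as np

doc_vectors = np.random.rand(1000, 300)  # stand-in for the 60000x300 gensim matrix
# exactly 300 numeric values per row, no header, no index column, no empty cells
np.savetxt("doc_vectors.csv", doc_vectors, delimiter=",", fmt="%.6f")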

Updating training documents for gensim Doc2Vec model

隐身守侯 submitted on 2019-12-04 05:41:59
Question: I have an existing gensim Doc2Vec model, and I'm trying to do iterative updates to the training set and, by extension, the model. I take the new documents and perform preprocessing as usual: stoplist = nltk.corpus.stopwords.words('english') train_corpus = [] for i, document in enumerate(corpus_update['body'].values.tolist()): train_corpus.append(gensim.models.doc2vec.TaggedDocument([word for word in gensim.utils.simple_preprocess(document) if word not in stoplist], [i])) I then load the
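A hedged sketch, not the truncated answer above: gensim's Doc2Vec has no clean way to add new document tags to an already-trained model, so a common alternative is to keep the trained model and call infer_vector() on each new, preprocessed document (or retrain from scratch on the combined corpus when the new documents must influence the model itself). The model path below is hypothetical.

import gensim

model = gensim.models.doc2vec.Doc2Vec.load("my_doc2vec.model")  # hypothetical path

new_doc_words = gensim.utils.simple_preprocess("some new document text")
vector = model.infer_vector(new_doc_words)  # vector for the unseen document
print(vector[:5])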

How does gensim calculate doc2vec paragraph vectors

♀尐吖头ヾ submitted on 2019-12-03 10:40:39
I am going through this paper http://cs.stanford.edu/~quocle/paragraph_vector.pdf and it states that "The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. In the experiments, we use concatenation as the method to combine the vectors." How does concatenation or averaging work? Example (if paragraph 1 contains word1 and word2): word1 vector = [0.1,0.2,0.3] word2 vector = [0.4,0.5,0.6] Does the concat method give paragraph vector = [0.1+0.4, 0.2+0.5, 0.3+0.6]? Does the average method give paragraph vector = [(0.1+0.4)/2, (0.2+0.5)/2, (0.3+0.6)/2]? Also from this
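A small worked illustration of the two combination modes, using the toy vectors from the question (sketch only; in PV-DM the separate paragraph vector is combined with the word vectors in the same way, it is not the result of combining them). Note that concatenation is not element-wise addition:

import numpy as np

word1 = np.array([0.1, 0.2, 0.3])
word2 = np.array([0.4, 0.5, 0.6])

concat = np.concatenate([word1, word2])  # [0.1 0.2 0.3 0.4 0.5 0.6]: keeps every component, doubles the dimensionality
average = (word1 + word2) / 2            # [0.25 0.35 0.45]: keeps the dimensionality, blends the components

print(concat)
print(average)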

Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words'

Anonymous (unverified) submitted on 2019-12-03 01:45:01
Question: I am learning the Doc2Vec model from the gensim library and using it as follows: class MyTaggedDocument(object): def __init__(self, dirname): self.dirname = dirname def __iter__(self): for fname in os.listdir(self.dirname): with open(os.path.join(self.dirname, fname),encoding='utf-8') as fin: print(fname) for item_no, sentence in enumerate(fin): yield LabeledSentence([w for w in sentence.lower().split() if w in stopwords.words('english')], [fname.split('.')[0].strip() + '_%s' % item_no]) sentences = MyTaggedDocument(dirname) model = Doc2Vec
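A hedged sketch (the snippet is cut off before the Doc2Vec call): the "'str' object has no attribute 'words'" error generally means plain strings reached the model where TaggedDocument objects were expected, so whatever is passed to build_vocab()/train() must be an iterable yielding TaggedDocument(words=[...], tags=[...]) items, for example:

import os
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

class MyCorpus:
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), encoding="utf-8") as fin:
                for item_no, line in enumerate(fin):
                    yield TaggedDocument(words=simple_preprocess(line),
                                         tags=["%s_%s" % (fname.split(".")[0], item_no)])

# corpus = MyCorpus("/path/to/texts")  # hypothetical directory
# model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
# model.build_vocab(corpus)
# model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)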