doc2vec

Doc2vec: How to get document vectors

Question: How do I get the document vectors of two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction or help me with a tutorial. I am using the gensim Python library.

    doc1 = ["This is a sentence", "This is another sentence"]
    documents1 = [doc.strip().split(" ") for doc in doc1]
    model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4)

I get AttributeError: 'list' object has no attribute 'words' whenever I run this.

Answer 1: Gensim was updated. The syntax of LabeledSentence changed: training documents must now be wrapped as TaggedDocument objects, which is why passing plain lists of words raises the 'words' AttributeError.
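A minimal sketch of the fix, assuming gensim 4.x (where LabeledSentence has become TaggedDocument, size is vector_size, and doc-vectors live on model.dv); min_count is lowered to 1 so this toy corpus keeps its words:

    from gensim.models import doc2vec

    doc1 = ["This is a sentence", "This is another sentence"]
    # Each training document must be a TaggedDocument carrying .words and .tags.
    documents1 = [doc2vec.TaggedDocument(words=doc.strip().split(" "), tags=[i])
                  for i, doc in enumerate(doc1)]

    model = doc2vec.Doc2Vec(documents1, vector_size=100, window=5, min_count=1, workers=4)

    print(model.dv[0])                                       # vector of the first document
    print(model.infer_vector("This is a sentence".split()))  # vector for unseen text

With only two tiny documents the vectors are of course meaningless; the point is the wrapping, not the hyperparameters.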

How to interpret cluster results after using Doc2vec?

Question: I am using doc2vec to convert the top 100 tweets of my followers into vector representations (say v1...v100). After that I am using the vector representations for K-Means clustering.

    model = Doc2Vec(documents=t, size=100, alpha=.035, window=10, workers=4, min_count=2)

I can see that cluster 0 is dominated by some values (say v10, v12, v23, ...). My question is: what do these v10, v12, etc. represent? Can I deduce that these specific columns cluster specific keywords of the documents?

Answer 1: Don't use the individual variables. They should only be analyzed together, because of the way these embeddings are trained: no single dimension carries a meaning of its own; only distances and directions in the full vector space do.
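Rather than reading individual dimensions, a more informative inspection is to look at which documents sit closest to each cluster centroid. A minimal sketch, assuming doc_vectors is a NumPy array with one row per tweet (a name introduced here for illustration) and scikit-learn is installed:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    # doc_vectors: one row per tweet, e.g. np.array([model.dv[i] for i in range(100)])
    kmeans = KMeans(n_clusters=5, random_state=0).fit(doc_vectors)

    for c in range(kmeans.n_clusters):
        sims = cosine_similarity(doc_vectors, kmeans.cluster_centers_[c:c+1]).ravel()
        top = np.argsort(-sims)[:5]  # the 5 tweets most central to cluster c
        print("cluster %d: tweets %s" % (c, top.tolist()))

Reading those central tweets tells you what a cluster is "about" far more reliably than any single coordinate.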

ELKI KMeans clustering "Task failed" error for high-dimensional data

I have 60000 documents which I processed in gensim, giving a 60000*300 matrix that I exported as a CSV file. When I import it into the ELKI environment and run KMeans clustering, I get the error below:

    Task failed
    de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
    Available types:
      DBID
      DoubleVector,variable,mindim=266,maxdim=300
      LabelList
    at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
    at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
    at
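The mindim=266,maxdim=300 line is the clue: some rows of the CSV parsed to fewer than 300 numbers, so ELKI sees variable-length vectors instead of the fixed-dimensionality vector field that KMeans requires. A minimal diagnostic sketch, assuming the export is named vectors.csv (the filename is an assumption):

    import csv

    # Count how many columns each exported row actually has.
    with open('vectors.csv') as f:
        rows = list(csv.reader(f))

    widths = sorted({len(row) for row in rows})
    print("row widths found:", widths)  # anything other than [300] will break ELKI

Once the short rows are located, the robust fix is to re-export so every vector really has 300 components (checking the writer and any NaN or empty fields), rather than padding rows with zeros after the fact.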

Updating training documents for gensim Doc2Vec model

I have an existing gensim Doc2Vec model, and I'm trying to do iterative updates to the training set and, by extension, the model. I take the new documents and perform preprocessing as normal:

    stoplist = nltk.corpus.stopwords.words('english')
    train_corpus = []
    for i, document in enumerate(corpus_update['body'].values.tolist()):
        train_corpus.append(gensim.models.doc2vec.TaggedDocument(
            [word for word in gensim.utils.simple_preprocess(document) if word not in stoplist],
            [i]))

I then load the original model, update the vocabulary, and retrain:

    #### Original model ##
    model = gensim.models.doc2vec
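The entry is cut off before the retraining code, so what follows is a sketch of the two approaches commonly suggested, not the author's code. Note that expanding a Doc2Vec model with new document tags via build_vocab(update=True) has historically been unsupported in gensim, so the safe options are inferring vectors with the frozen model or retraining on the combined corpus (old_corpus and the model path are placeholder names):

    import gensim

    model = gensim.models.doc2vec.Doc2Vec.load('original_model.d2v')

    # Option 1: keep the trained model frozen and infer vectors for new documents.
    new_vectors = [model.infer_vector(doc.words) for doc in train_corpus]

    # Option 2: retrain from scratch on old and new documents together
    # (tags must be unique across the combined corpus).
    combined = old_corpus + train_corpus
    model = gensim.models.doc2vec.Doc2Vec(combined, vector_size=300, min_count=2, workers=4)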

gensim doc2vec “intersect_word2vec_format” command

I was just reading through the doc2vec commands on the gensim page, and I am curious about the command "intersect_word2vec_format". My understanding is that it lets me inject vector values from a pretrained word2vec model into my doc2vec model, and then train the doc2vec model using the pretrained word2vec values rather than generating the word vectors from my own document corpus. The result would be a more accurate doc2vec model, because the pretrained w2v values were generated from a much larger corpus than my relatively small document corpus. Is my understanding correct?
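That is broadly the intent: after build_vocab, intersect_word2vec_format overwrites the vectors of words that also appear in the pretrained file, while words missing from the file keep their random initialization. A minimal sketch, assuming gensim 3.x where the method lives on the model itself (newer releases may relocate it), with train_corpus and the filename as placeholders:

    import gensim

    model = gensim.models.doc2vec.Doc2Vec(vector_size=300, min_count=2, workers=4)
    model.build_vocab(train_corpus)  # the vocabulary still comes from your own corpus

    # Inject pretrained vectors for overlapping words; lockf=0.0 freezes them
    # during training, lockf=1.0 lets them continue to be updated.
    model.intersect_word2vec_format('pretrained_w2v.bin', binary=True, lockf=0.0)

    model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)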

Single-pass clustering method

1. Text clustering is usually done on an existing batch of historical data, with common methods such as kmeans and dbscan. If the requirement is to cluster streaming text (i.e., cluster each item as it arrives), those methods no longer apply. There are of course many other methods for dynamically clustering streaming data, and dynamic clustering brings its own challenges: the number of clusters is not fixed, and the similarity threshold is hard to set. Both remain open research questions. This article implements a simple single-pass clustering method. Similarity between texts is measured with cosine distance, and text vectors can be built with tfidf (the idf statistics can be collected on a large document set and then applied directly to the words of new texts) or with pretrained Chinese models such as word2vec or bert.

2. Program

    import numpy as np
    import os
    import sys
    import pickle
    import collections
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from gensim import corpora, models, matutils
    from utils.tokenizer import load_stopwords, load_samples,
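The listing is cut off after the imports, so here is a minimal illustration of the single-pass idea described above (a sketch, not the author's program). It assumes the input vectors are already L2-normalized, so a dot product against a normalized centroid is cosine similarity:

    import numpy as np

    def single_pass(vectors, threshold=0.7):
        # Assign each vector to the most similar existing cluster if the
        # centroid's cosine similarity clears the threshold; otherwise
        # open a new cluster. One pass, no fixed number of clusters.
        centroids, counts, labels = [], [], []
        for v in vectors:
            if centroids:
                sims = [np.dot(v, c) / np.linalg.norm(c) for c in centroids]
                best = int(np.argmax(sims))
                if sims[best] >= threshold:
                    counts[best] += 1
                    centroids[best] += (v - centroids[best]) / counts[best]  # running mean
                    labels.append(best)
                    continue
            centroids.append(np.array(v, dtype=float))
            counts.append(1)
            labels.append(len(centroids) - 1)
        return labels

The similarity threshold plays the role that k plays in kmeans: raising it produces more, tighter clusters, which is exactly the tuning difficulty the text mentions.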

Is there pre-trained doc2vec model?

Question: Is there a pre-trained doc2vec model with a large data set, like Wikipedia or similar?

Answer 1: I don't know of any good one. There's one linked from this project, but:

- it's based on a custom fork of an older gensim, so it won't load in recent code
- it's not clear what parameters or data it was trained with, and the associated paper may have made uninformed choices about the effects of parameters
- it doesn't appear to be the right size to include actual doc-vectors for either Wikipedia articles (4-million-plus) or article paragraphs (tens-of-millions), or a significant number of word-vectors, so it's

How to break conversation data into pairs of (Context, Response)

I'm using a Gensim Doc2Vec model to cluster portions of customer support conversations. My goal is to give the support team auto-response suggestions.

Figure 1 shows a sample conversation where the user's question is answered in the next conversation line, making it easy to extract the data; during the conversation, "hello" and "Our offices are located in NYC" should be suggested.

Figure 2 describes a conversation where the questions and answers are not in sync; here too, "hello" and "Our offices are located in NYC" should be suggested.

Figure 3 describes a conversation
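The entry is truncated, but the Figure 1 case (each customer line answered by the next agent line) already suggests a simple pairing rule. A minimal sketch, where the (speaker, text) turn format is an assumption made for illustration:

    def extract_pairs(turns):
        # turns: list of (speaker, text) tuples in conversation order.
        # Emit a (context, response) pair wherever a customer line is
        # immediately followed by an agent line (the Figure 1 case).
        pairs = []
        for (spk_a, text_a), (spk_b, text_b) in zip(turns, turns[1:]):
            if spk_a == 'customer' and spk_b == 'agent':
                pairs.append((text_a, text_b))
        return pairs

    conversation = [
        ('customer', 'hello'),
        ('agent', 'hello'),
        ('customer', 'where are your offices?'),
        ('agent', 'Our offices are located in NYC'),
    ]
    print(extract_pairs(conversation))

The out-of-sync conversations of Figures 2 and 3 are exactly where this rule breaks down, and where a similarity model such as Doc2Vec is needed to match questions to the right answers.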