doc2vec

How to break conversation data into pairs of (Context , Response)

孤街醉人 提交于 2019-11-29 07:01:33
问题 I'm using Gensim Doc2Vec model, trying to cluster portions of a customer support conversations. My goal is to give the support team an auto response suggestions. Figure 1: shows a sample conversations where the user question is answered in the next conversation line, making it easy to extract the data: during the conversation "hello" and "Our offices are located in NYC" should be suggested Figure 2: describes a conversation where the questions and answers are not in sync during the

Doc2vec学习总结(三)

江枫思渺然 提交于 2019-11-27 08:35:47
这篇是 七月在线 问答系统项目中使用到的一个算法,由于当时有总结,就先放上来了后期再整理。 Doc2vec ​ Doc2vec又叫Paragraph Vector是Tomas Mikolov基于word2vec模型提出的,其具有一些优点,比如 不用固定句子长度,接受不同长度的句子做训练样本 ,Doc2vec是一个无监督学习算法,该算法用于预测一个向量来表示不同的文档,该模型的结构潜在的克服了词袋模型的缺点。 ​ Doc2vec模型是受到了word2vec模型的启发,word2vec里预测词向量时,预测出来的词是含有词义的,比如上文提到的词向量'powerful'会相对于'Paris'离'strong'距离更近,在Doc2vec中也构建了相同的结构。所以Doc2vec克服了词袋模型中没有语义的去缺点。假设现在存在训练样本,每个句子是训练样本。和word2vec一样,Doc2vec也有两种训练方式,一种是PV-DM(Distributed Memory Model of paragraphvectors)类似于word2vec中的CBOW模型,另一种是PV-DBOW(Distributed Bag of Words of paragraph vector)类似于word2vec中的skip-gram模型 1. A distributed memory model ​