word2vec

权威的我的

我与影子孤独终老i submitted on 2019-12-23 05:56:32
    import logging
    import gensim
    from gensim.models import word2vec
    # Configure log output
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    # Read the txt file directly with the API gensim provides; the available
    # readers include LineSentence, Text8Corpus, PathLineSentences, and others.
    sentences = word2vec.LineSentence("/data4T/share/jiangxinyang848/textClassifier/data/preProcess/wordEmbdiing.txt")
    # Train the model: vector length 300, 8 iterations, skip-gram (sg=1);
    # the vectors are saved in binary (.bin) format
    model = gensim.models.Word2Vec(sentences, size=300, sg=1, iter=8)
    model.wv.save_word2vec_format("./word2Vec" + ".bin", binary=True)
    # Load the model saved in bin format
    wordVec = gensim.models.KeyedVectors

gensim word2vec - updating word embeddings with newcoming data

冷暖自知 submitted on 2019-12-23 04:52:51
Question: I have trained skip-gram word embeddings on 26 million tweets as follows:
    sentences = gensim.models.word2vec.LineSentence('/.../data/tweets_26M.txt')
    model = gensim.models.word2vec.Word2Vec(sentences, window=2, sg=1, size=200, iter=20)
    model.save_word2vec_format('/.../savedModel/Tweets26M_All.model.bin', binary=True)
However, I am continuously collecting more tweets in my database. For example, when I have 2 million more tweets, I want to update my embeddings with also

cosine similarity between two words in a list

醉酒当歌 submitted on 2019-12-23 04:15:12
Question: I am defining a function that takes a list of words and returns the pairs of words in the list that have a non-zero cosine similarity with each other, along with the similarity value. Can anyone help me with this? I was thinking that a precomputed word2vec vector file would be very helpful, but I could not find one on the internet. Answer 1: You could define these two functions:
    def word2vec(word):
        from collections import Counter
        from math import sqrt
        # count the characters
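The truncated answer above appears to build a vector from character counts rather than from trained embeddings. A minimal sketch of that idea follows, assuming each word is represented only by its character counts; the names `char_vector`, `cosine`, and `similar_pairs` are hypothetical and are not taken from the original answer.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def char_vector(word):
    """Represent a word by its character counts (not a trained embedding)."""
    counts = Counter(word)
    norm = sqrt(sum(c * c for c in counts.values()))
    return counts, norm


def cosine(v1, v2):
    """Cosine similarity between two character-count vectors."""
    (c1, n1), (c2, n2) = v1, v2
    if n1 == 0 or n2 == 0:
        return 0.0
    # Only characters present in both words contribute to the dot product.
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    return dot / (n1 * n2)


def similar_pairs(words):
    """Return (word_a, word_b, similarity) for every pair with non-zero similarity."""
    vectors = {w: char_vector(w) for w in words}
    result = []
    for a, b in combinations(words, 2):
        sim = cosine(vectors[a], vectors[b])
        if sim > 0:
            result.append((a, b, sim))
    return result


pairs = similar_pairs(["abc", "abd", "xyz"])
```

This measures spelling overlap, not meaning, so it is only a stand-in when no pretrained vectors are at hand; the same `similar_pairs` loop would work with any vector lookup that supplies a cosine function.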

add new words to GoogleNews by gensim

拜拜、爱过 submitted on 2019-12-22 22:27:43
Question: I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors through the gensim library. But my corpus contains some words that are not in the GoogleNews vocabulary. For each of these missing words, I want to use the arithmetic mean of the n most similar words to it among the GoogleNews words. First I load GoogleNews and check whether the word "to" is in it:
    #Load GoogleNews pretrained word2vec model
    model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin"
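The averaging step described in the question can be sketched in plain Python. This is only an illustration under stated assumptions: the toy two-dimensional vectors stand in for the pretrained ones, the helper names `mean_vector` and `fill_missing` are hypothetical, and the list of neighbour words is supplied by the caller, because the excerpt does not say how the most similar words are chosen for a word that has no vector yet.

```python
def mean_vector(vectors):
    """Element-wise arithmetic mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fill_missing(word, neighbours, embeddings):
    """Assign `word` the mean vector of the neighbours found in `embeddings`."""
    # Neighbours that are themselves out of vocabulary are skipped.
    known = [embeddings[n] for n in neighbours if n in embeddings]
    embeddings[word] = mean_vector(known)
    return embeddings[word]


# Toy vectors with made-up values; real pretrained vectors have far more dimensions.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fish": [1.0, 1.0]}
vec = fill_missing("catdog", ["cat", "dog", "unicorn"], toy_embeddings)
```

Here "unicorn" is ignored because it has no vector, so the new entry is the mean of the "cat" and "dog" vectors.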

[The Secrets Behind Wenzhi] series: automatic text classification

社会主义新天地 submitted on 2019-12-22 19:19:02
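Copyright notice: this is an original article by Wenzhi (文智); please credit the source when reposting. Original article link: https://www.qcloud.com/community/article/132 Source: Tengyunge (腾云阁) https://www.qcloud.com/community
I. Automatic text classification
Overview: Text classification, as the name suggests, assigns a document to one or more known categories. Automating it usually involves the following steps: (1) build the category taxonomy; (2) obtain training data with category labels; (3) choose a text representation for the training data and select features; (4) select and train a classifier; (5) apply the classifier to new data.
To classify a given document automatically, the document usually has to be expressed as a data type that a machine can process. A commonly used text representation is the vector space model (VSM), which maps the document to a feature vector, where ti is a term obtained by segmenting the document into words and w(ti) is the weight of that term.
Our automatic text classification system provides automatic text classification as a service. The platform wraps the classification models and algorithms, so users only need to supply the text to be classified and do not need to care about the implementation; the platform returns the category of the supplied text. It currently recognizes more than 40 categories, including software, film and television, music, health and wellness, finance, advertising and promotion, crime, and politics, and the system's algorithms support rapid iterative updates to existing categories as well as the addition of new ones.
II. The automatic text classification system
1. Main system framework
The current framework of our automatic classification system is shown in Figure 1.1. The system is divided into three main parts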
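The vector space model mentioned in the excerpt can be illustrated with a short sketch. The excerpt does not say which weighting scheme the platform uses, so the weight w(ti) below is assumed to be the relative term frequency, the document is assumed to be already segmented into tokens, and the function name `vsm_vector` is hypothetical.

```python
from collections import Counter


def vsm_vector(tokens, vocabulary):
    """Map a tokenized document to a weight vector over a fixed vocabulary.

    The weight w(t_i) used here is the relative term frequency, which is an
    assumed choice for illustration only.
    """
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    # Terms outside the vocabulary (such as "review" below) are ignored.
    return [counts[t] / total for t in vocabulary]


vocabulary = ["music", "film", "software"]
doc_tokens = ["music", "film", "music", "review"]
vector = vsm_vector(doc_tokens, vocabulary)
```

A classifier would then be trained on such vectors together with their category labels; other weightings such as TF-IDF fit the same interface.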

What are doc2vec training iterations?

泄露秘密 submitted on 2019-12-22 10:29:50
Question: I am new to doc2vec. I was initially trying to understand doc2vec, and my code using Gensim is shown below. As intended, I get a trained model and document vectors for the two documents. However, I would like to know the benefits of retraining the model over several epochs and how to do that in Gensim. Can we do it using the iter or alpha parameter, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs. Also,

How to generate a sentence from feature vector or words?

筅森魡賤 submitted on 2019-12-22 06:38:42
Question: I used the VGG 16-layer Caffe model for image captions and I have several captions per image. Now I want to generate a sentence from those captions (words). I read in a paper on LSTMs that I should remove the softmax layer from the training network and feed the 4096-dimensional feature vector from the fc7 layer directly to the LSTM. I am new to LSTMs and RNNs. Where should I begin? Is there any tutorial showing how to generate a sentence by sequence labeling? Answer 1: AFAIK the master branch of BVLC/caffe does not

How to obtain antonyms through word2vec?

懵懂的女人 submitted on 2019-12-22 04:00:15
Question: I am currently working on a word2vec model using gensim in Python, and want to write a function that can help me find the antonyms and synonyms of a given word. For example: antonym("sad")="happy" synonym("upset")="enraged" Is there a way to do that in word2vec? Answer 1: In word2vec you can find analogies in the following way:
    model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    model.most_similar(positive=['good', 'sad'], negative=['bad'])
    [(u
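The arithmetic behind such an analogy query can be sketched with hand-made vectors. This is a simplification under stated assumptions: the two-dimensional toy vectors are invented so that the example works out cleanly, the function `analogy` is hypothetical, and it sums raw vectors, which is not necessarily how gensim combines them internally.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


# Invented vectors: the first axis separates emotion words, the second is polarity.
toy = {
    "good": [0.0, 1.0],
    "bad": [0.0, -1.0],
    "sad": [1.0, -1.0],
    "happy": [1.0, 1.0],
    "table": [-1.0, 0.0],
}
best = analogy(toy, positive=["good", "sad"], negative=["bad"])
```

With these toy values the query "good is to bad as ? is to sad" ranks "happy" first. Vectors trained on real text are noisier, so the top result of such a query should be treated as a candidate rather than a guaranteed antonym.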

Loss does not decrease during training (Word2Vec, Gensim)

*爱你&永不变心* submitted on 2019-12-22 00:26:27
Question: What can cause the loss returned by model.get_latest_training_loss() to increase on each epoch? The code used for training:
    class EpochSaver(CallbackAny2Vec):
        '''Callback to save model after each epoch and show training parameters '''
        def __init__(self, savedir):
            self.savedir = savedir
            self.epoch = 0
            os.makedirs(self.savedir, exist_ok=True)
        def on_epoch_end(self, model):
            savepath = os.path.join(self.savedir, "model_neg{}_epoch.gz".format(self.epoch))
            model.save(savepath)
            print( "Epoch saved: {}".format(self
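The excerpt does not include an answer. One possible explanation, offered here as an assumption to verify against the gensim version in use, is that the reported value is a running total accumulated over the whole training call rather than a per-epoch figure, in which case it grows every epoch even when training is going well. Under that assumption, the per-epoch loss is the difference between consecutive readings, as this sketch with made-up numbers shows; the helper `per_epoch_losses` is hypothetical.

```python
def per_epoch_losses(cumulative_losses):
    """Convert running-total loss readings into per-epoch losses."""
    deltas = []
    previous = 0.0
    for total in cumulative_losses:
        deltas.append(total - previous)
        previous = total
    return deltas


# Made-up readings taken at the end of three epochs (not real training output).
running_totals = [100.0, 180.0, 240.0]
epoch_losses = per_epoch_losses(running_totals)
```

In a callback, the same subtraction could be done by storing the previous reading on the callback object and printing the difference at the end of each epoch.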

AttributeError: module 'tensorflow.models.embedding.gen_word2vec' has no attribute 'skipgram_word2vec'

青春壹個敷衍的年華 submitted on 2019-12-21 22:05:30
Question: I am new to TensorFlow and I am running the word2vec embedding tutorial code (https://github.com/tensorflow/models/tree/master/tutorials/embedding) on TensorFlow (CPU-only) under OS X 10.11.6. I installed TensorFlow via pip install. Running word2vec_basic.py produces the expected result, but running word2vec.py or word2vec_optimized.py displays the following error: Answer 1: You'll need to use bazel to build the directory, since the op 'skipgram_word2vec' is defined in C++ and not