word2vec

Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus

烈酒焚心 submitted on 2020-01-14 10:16:06
Question: I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that. It works for me, but what I don't like about the resulting word2vec model is that named entities are split into separate tokens, which makes the model unusable for my specific application. The model I need has to represent named entities as a single vector. That's why I planned to parse the Wikipedia articles with spaCy and merge entities like…
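One way to get the merging the asker describes is sketched below. This is my own minimal illustration, not the thread's accepted approach; it assumes spaCy 3.x with the en_core_web_sm model installed, gensim 4.x, and a hypothetical input file.

import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")  # built-in component: retokenizes each entity into one token

def tokenize(text):
    # Multi-word entities such as "New York" survive as single tokens.
    return [tok.text.replace(" ", "_") for tok in nlp(text) if not tok.is_space]

# "wiki_sentences.txt" is a hypothetical file of one sentence per line.
sentences = [tokenize(line) for line in open("wiki_sentences.txt", encoding="utf-8")]
model = Word2Vec(sentences, vector_size=300, min_count=5)
print(model.wv.most_similar("New_York"))  # the entity now has a single vector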

Understanding input and labels in word2vec (TensorFlow)

萝らか妹 submitted on 2020-01-13 19:07:39
Question: I am trying to properly understand batch_input and batch_labels from the TensorFlow "Vector Representations of Words" tutorial. For instance, my data

1 1 1 1 1 1 1 1 5 251 371 371 1685 ...

starts with

skip_window = 2  # How many words to consider left and right.
num_skips = 1    # How many times to reuse an input to generate a label.

Then the generated input array is:

batch_input = 1 1 1 1 1 1 5 251 371 ...

This makes sense: it starts after 2 (= window size) and then continues. The…
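For reference, here is a condensed paraphrase of the tutorial's generate_batch logic (my own rewording, not the asker's code). With skip_window = 2 the first usable center word is data[2], which is exactly why the batch starts after two positions.

import collections
import random
import numpy as np

def generate_batch(data, data_index, batch_size, num_skips, skip_window):
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.zeros(batch_size, dtype=np.int32)
    labels = np.zeros((batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1                      # [ skip_window | center | skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):                           # fill the sliding window
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        # Sample num_skips context positions around the center word.
        contexts = random.sample([j for j in range(span) if j != skip_window], num_skips)
        for j, ctx in enumerate(contexts):
            batch[i * num_skips + j] = buffer[skip_window]   # input = center word
            labels[i * num_skips + j, 0] = buffer[ctx]       # label = a context word
        buffer.append(data[data_index])             # slide the window one step
        data_index = (data_index + 1) % len(data)
    return batch, labels, data_index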

Spark word2vec

被刻印的时光 ゝ submitted on 2020-01-09 17:24:26
1. Concepts
* word2vec is an open-source tool from Google for producing word vectors.
* It optimizes a language-model objective, iteratively updating the word vectors over the training text until they converge.
* Word vectors are an important feature in text analysis, with significant applications in classification, sequence labeling, and similar tasks.
* Because words are represented as vectors, and vectors produced by a good training algorithm generally carry spatial meaning, putting all of them together forms a word-vector space in which each vector is a point; the distance between two word vectors in this space can then measure the "distance" between the corresponding words.
* This "distance" between two words is their syntactic and semantic similarity.
* A practical use case is finding synonyms. Finding the words similar to a given word is not an easy task even for a person, since it is rather subjective; but once word vectors are built, a computer only needs to compute the Euclidean or cosine distance between that word's vector and every other word's vector, and the words whose distance falls below some threshold are its synonyms.
* This property makes word vectors genuinely useful and naturally attracts research; Google's word2vec model is built on this idea.
* Word2Vec is a well-known word embedding method that computes, for each word, a distributed representation from its surrounding context in a given corpus…
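The synonym lookup described above is exposed directly by Spark's API. A minimal sketch of my own (assuming a local PySpark session; the toy corpus and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.master("local[*]").appName("w2v-demo").getOrCreate()

df = spark.createDataFrame(
    [("i love spark".split(" "),),
     ("i love word vectors".split(" "),),
     ("spark computes word vectors".split(" "),)],
    ["text"])

w2v = Word2Vec(vectorSize=10, minCount=1, inputCol="text", outputCol="vec")
model = w2v.fit(df)

model.findSynonyms("spark", 2).show()  # ranks the vocabulary by cosine similarity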

[Deep Learning] Implementing the word2vec CBOW model with Keras

久未见 submitted on 2020-01-07 08:37:48
Preface: Although the word2vec in gensim is already very good, someone else's model rarely solves your own problem directly, so I searched for a directly usable Keras version and found two: version 1, "keras训练word2vec代码" (Keras word2vec training code), and version 2, "【不可思议的Word2Vec】6. Keras版的Word2Vec" (The Incredible Word2Vec, part 6: the Keras edition). Both are well written; version 1's code can be used as-is, and version 2's framework is clearer. But both authors' datasets are built from many articles: version 1 pulls its data from a WeChat API whose server I could not reach, and version 2 does not provide its dataset, and since I am not very familiar with its object-oriented style it was hard to adapt. So, while studying version 2's framework and theory, I made my changes on top of version 1, ending up with word-vector training on a single text file.

Data: Any UTF-8 Chinese document will do. To keep things quick I took the first chapter of 《天龙八部》 (Demi-Gods and Semi-Devils); remember to change the encoding to UTF-8.

Stopwords: A stopword list also scraped from the web, reportedly very complete: 最全中文停用词表整理(1893个) (the most complete Chinese stopword list, 1,893 entries).

import os

def stopwordslist():
    # Load the stopword list; return an empty list if the file is missing.
    stopwords = []
    if not os.path.exists('./stopwords.txt'):
        print('Stopword list not found!')
    else:
        stopwords = [line.strip() for line in open('./stopwords.txt', encoding='utf-8')]
    return stopwords
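For context, the CBOW architecture such a post builds can be sketched in a few lines of Keras. This is a minimal illustration of my own, not the post's full code; the vocabulary size and dimensions are placeholders.

import tensorflow.keras.backend as K
from tensorflow.keras.layers import Dense, Embedding, Input, Lambda
from tensorflow.keras.models import Model

vocab_size = 10000   # placeholder: size of the corpus vocabulary
embed_dim = 128      # placeholder: word-vector dimensionality
window = 2           # context words taken on each side of the target

context = Input(shape=(2 * window,), dtype='int32')        # context word ids
embedded = Embedding(vocab_size, embed_dim)(context)        # (batch, 2*window, dim)
averaged = Lambda(lambda x: K.mean(x, axis=1))(embedded)    # CBOW: average the context
output = Dense(vocab_size, activation='softmax')(averaged)  # predict the center word

model = Model(context, output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# After training, the Embedding layer's weights are the word vectors.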

Unpickling Error while using Word2Vec.load()

爱⌒轻易说出口 submitted on 2020-01-07 03:58:57
Question: I am trying to load a binary file using gensim.Word2Vec.load(fname) but I get the error:

File "file.py", line 24, in <module>
    model = gensim.models.Word2Vec.load('ammendment_vectors.model.bin')
File "/home/hp/anaconda3/lib/python3.6/site-packages/gensim/models/word2vec.py", line 1396, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
File "/home/hp/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 271, in load
    obj = unpickle(fname)
File "/home/hp/anaconda3/lib/python3.6/site…
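The excerpt ends before the answer, but one frequent cause worth ruling out (my suggestion, not confirmed from the thread): Word2Vec.load() only reads files written by model.save(), which are pickles; a .bin file written by save_word2vec_format() or by the original C word2vec tool is a different format and needs a different loader. A sketch, reusing the asker's filename purely for illustration:

from gensim.models import KeyedVectors, Word2Vec

# If the file was written with model.save(...): it is a pickle.
model = Word2Vec.load('ammendment_vectors.model.bin')

# If it was written with model.wv.save_word2vec_format(..., binary=True)
# (or by the C word2vec tool), it is NOT a pickle; load it as key vectors:
kv = KeyedVectors.load_word2vec_format('ammendment_vectors.model.bin', binary=True)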

how to preserve number of records in word2vec?

梦想与她 submitted on 2020-01-07 03:48:06
Question: I have 45,000 text records in my dataframe. I wanted to convert those 45,000 records into word vectors so that I can train a classifier on them. I am not tokenizing the sentences; I just split each entry into a list of words. After training a word2vec model with 300 features, the model's shape came out as only 26,000. How can I preserve all of my 45,000 records? In the classifier model I need all 45,000 of them, so that they can match the 45,000 output labels. Answer 1: If you are…
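The answer is cut off above, but the key point behind the question: word2vec learns one vector per word, so the 26,000 rows are the vocabulary, not the records. To get one vector per record, a common approach is to combine each record's word vectors, for example by averaging. A minimal sketch of my own, where the toy records stand in for the 45,000:

import numpy as np
from gensim.models import Word2Vec

records = [["good", "movie"], ["bad", "movie"], ["great", "film"]]  # stand-in records
model = Word2Vec(records, vector_size=300, min_count=1)

def record_vector(tokens, model):
    # Average the vectors of the in-vocabulary words of one record.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    if not vecs:                           # record with no known words
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

X = np.vstack([record_vector(r, model) for r in records])
print(X.shape)  # one 300-dim row per record: (len(records), 300)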

Semantic Similarity across multiple languages

感情迁移 submitted on 2020-01-05 05:36:06
Question: I am using word embeddings for finding the similarity between two sentences. Using word2vec, I also get a similarity measure if one sentence is in English and the other one in Dutch (though not a very good one). So I started wondering whether it's possible to compute the similarity between two sentences in two different languages (without an explicit translation), especially if the languages have some similarities (English/Dutch)? Answer 1: Let's assume that your sentence-similarity scheme uses only word-vectors…
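The answer is truncated above; for concreteness, here is the bare-bones word-vector-only scheme it alludes to: average each sentence's word vectors and compare by cosine. A caveat of my own, not the thread's: this is only meaningful across languages if both sentences' words come from a single shared vector space, e.g. a jointly trained or aligned multilingual model.

import numpy as np

def sentence_vector(tokens, vectors):
    # 'vectors' is any mapping word -> np.ndarray, e.g. a gensim KeyedVectors.
    vecs = [vectors[w] for w in tokens if w in vectors]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# sim = cosine(sentence_vector(english_tokens, vectors),
#              sentence_vector(dutch_tokens, vectors))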

how to load a word2vec model and call its functions inside a mapper

半城伤御伤魂 submitted on 2020-01-04 03:54:45
Question: I want to load a word2vec model and evaluate it by executing word-analogy tasks (e.g. a is to b as c is to what?). To do this, first I load my w2v model:

model = Word2VecModel.load(spark.sparkContext, str(sys.argv[1]))

and then I call the mapper to evaluate the model:

rdd_lines = spark.read.text("questions-words.txt").rdd.map(getAnswers)

The getAnswers function reads one line at a time from questions-words.txt, where each line contains the question and the answer used to evaluate my model…
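The excerpt stops there, but the usual obstacle with this pattern is that the Spark model (and the SparkContext it references) cannot be serialized into a mapper. One common workaround, sketched below as an assumption rather than the thread's confirmed answer, is to extract the raw vectors and broadcast them as a plain dict:

import sys
import numpy as np
from pyspark.sql import SparkSession
from pyspark.mllib.feature import Word2VecModel

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

model = Word2VecModel.load(sc, str(sys.argv[1]))
vectors = {w: np.array(v) for w, v in model.getVectors().items()}  # plain picklable dict
bc_vectors = sc.broadcast(vectors)                                 # shipped to executors once

def get_answers(row):
    a, b, c, expected = row.value.split()          # one analogy question per line
    v = bc_vectors.value
    if not all(w in v for w in (a, b, c)):
        return None
    target = v[b] - v[a] + v[c]                    # a : b :: c : ?
    norm_t = np.linalg.norm(target)
    best = max((w for w in v if w not in (a, b, c)),
               key=lambda w: float(np.dot(v[w], target)) /
                             (np.linalg.norm(v[w]) * norm_t + 1e-12))
    return (expected, best)

rdd_lines = spark.read.text("questions-words.txt").rdd.map(get_answers)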

C extension not loaded for Word2Vec

我是研究僧i submitted on 2020-01-03 09:04:24
Question: I reinstalled the gensim package and Cython, but it continuously shows this warning. Does anybody know about this? I am using Python 3.6 and PyCharm on Linux Mint.

UserWarning: C extension not loaded for Word2Vec, training will be slow. Install a C compiler and reinstall gensim for fast training.
warnings.warn("C extension not loaded for Word2Vec, training will be slow. "

It also shows this line when I create or load a model:

Slow version of gensim.models.doc2vec is being used

Answer 1: There is some problem…
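The answer is truncated above. As a quick diagnostic of my own (not from the thread), gensim exposes a flag that reports whether the compiled routines actually loaded:

from gensim.models.word2vec import FAST_VERSION

# FAST_VERSION is -1 when the pure-Python (slow) paths are in use;
# any value >= 0 means the optimized C extension is active.
print(FAST_VERSION)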