word2vec

Spark Word2Vec example using text8 file

Anonymous (unverified), submitted 2019-12-03 01:57:01
Question: I'm trying to run this example from spark.apache.org (code is below; the full tutorial is here: https://spark.apache.org/docs/latest/mllib-feature-extraction.html ) using the text8 file that they reference on their site ( http://mattmahoney.net/dc/text8.zip ):

import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("/Users/rkita/Documents/Learning/random/spark/MLlib/examples/text8", 4).map(line => line.split(" ").toSeq)
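A minimal PySpark sketch of the same example, assuming a running SparkContext named sc and a local copy of text8 in the working directory (the path, vector size, and query word here are illustrative choices, not from the question):

from pyspark.mllib.feature import Word2Vec

# Read text8 and split each line into a sequence of words
inp = sc.textFile("text8").map(lambda line: line.split(" "))

# Train the model; these parameters are illustrative, not tuned
word2vec = Word2Vec().setVectorSize(100).setMinCount(5)
model = word2vec.fit(inp)

# Query the nearest neighbours of a word that occurs in the corpus
for word, cosine_sim in model.findSynonyms("china", 10):
    print(word, cosine_sim)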

What is a projection layer in the context of neural networks?

帅比萌擦擦* submitted 2019-12-03 01:56:34
Question: I am currently trying to understand the architecture behind the word2vec neural net learning algorithm, which represents words as vectors based on their context. After reading Tomas Mikolov's paper I came across what he defines as a projection layer. Even though this term is widely used when referring to word2vec, I couldn't find a precise definition of what it actually is in the neural net context. My question is: in the neural net context, what is a projection layer? Is it the name given to…
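For reference, in word2vec-style models the projection layer amounts to a lookup into a shared embedding matrix with no nonlinearity; in CBOW the looked-up context vectors are then averaged. A minimal NumPy sketch, with vocabulary size, dimension, and word indices chosen arbitrarily:

import numpy as np

vocab_size, embed_dim = 10000, 300

# The "projection layer" is just this weight matrix: each row is one word's vector.
embedding = np.random.randn(vocab_size, embed_dim) * 0.01

# Projecting a one-hot word onto the matrix is a table lookup; in CBOW the
# looked-up context vectors are averaged, with no activation function applied.
context_ids = [12, 845, 3, 991]
projected = embedding[context_ids].mean(axis=0)  # shape: (embed_dim,)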

How can a sentence or a document be converted to a vector?

情到浓时终转凉″ submitted 2019-12-03 01:51:58
Question: We have models for converting words to vectors (for example the word2vec model). Do similar models exist which convert sentences/documents into vectors, perhaps using the vectors learnt for the individual words?

Answer 1:
1) The skip-gram method: paper here, and the tool that uses it, Google's word2vec.
2) Using an LSTM-RNN to form semantic representations of sentences.
3) Representations of sentences and documents: the Paragraph Vector, introduced in this paper. It is basically an unsupervised algorithm…
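A minimal sketch of the simplest of these ideas, averaging pretrained word vectors to obtain a sentence vector (the gensim KeyedVectors API and the model path are assumptions here; any word-to-vector lookup works the same way):

import numpy as np
from gensim.models import KeyedVectors

# Load pretrained word vectors (path is illustrative)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def sentence_vector(sentence, wv):
    """Average the vectors of all in-vocabulary words in the sentence."""
    words = [w for w in sentence.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

vec = sentence_vector("the quick brown fox jumps over the lazy dog", wv)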

Spark Word2vec vector mathematics

Anonymous (unverified), submitted 2019-12-03 01:48:02
Question: I was looking at the example on the Spark site for Word2Vec:

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms = model.findSynonyms("country name here", 40)

How do I do the interesting vector arithmetic such as king - man + woman = queen? I can use model.getVectors, but I'm not sure how to proceed further.

Answer 1: Here is an example in pyspark, which I guess is straightforward to port to Scala - the key is the use of model.transform. First, we train the model as in the…
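A hedged PySpark sketch of the analogy computation, continuing from a trained pyspark.mllib.feature.Word2VecModel like the one above (the word choices are illustrative; this relies on PySpark's findSynonyms also accepting a raw vector, not just a word):

import numpy as np

vectors = model.getVectors()  # mapping: word -> vector components

def analogy(a, b, c, topn=5):
    """Return words closest to vec(b) - vec(a) + vec(c), e.g. king - man + woman."""
    v = np.array(vectors[b]) - np.array(vectors[a]) + np.array(vectors[c])
    return model.findSynonyms(v.tolist(), topn)

print(analogy("man", "king", "woman"))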

Convert word2vec bin file to text

Anonymous (unverified), submitted 2019-12-03 01:18:02
Question: From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is a binary format that is not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c . Supposedly gensim can do this as well, but all the tutorials I've…
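A short gensim sketch of the conversion (the input file name is from the question; the output path is illustrative):

from gensim.models import KeyedVectors

# Load the binary-format vectors, then re-save them as plain text
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
wv.save_word2vec_format("GoogleNews-vectors-negative300.txt", binary=False)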

TensorFlow 'module' object has no attribute 'global_variables_initializer'

梦想的初衷 submitted 2019-12-03 01:10:22
I'm new to TensorFlow. I'm running a deep learning assignment from Udacity in an IPython notebook (link), and it raises an error:

AttributeError Traceback (most recent call last)
<ipython-input-18-3446420b5935> in <module>()
      2
      3 with tf.Session(graph=graph) as session:
----> 4     tf.global_variables_initializer().run()
AttributeError: 'module' object has no attribute 'global_variables_initializer'

Please help! How can I fix this? Thank you.

In older versions it was called tf.initialize_all_variables. It seems you're using TensorFlow 0.11 or an older version. If you look at this git commit, they…
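A small compatibility sketch (assuming a TF 1.x-era graph-mode setup like the assignment's; in 0.11 and earlier the op was named initialize_all_variables, and the variable below is only a placeholder for whatever the notebook defines):

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    weights = tf.Variable(tf.zeros([10]))
    # Use the newer op name when it exists, otherwise fall back to the old one
    init_op = (tf.global_variables_initializer()
               if hasattr(tf, "global_variables_initializer")
               else tf.initialize_all_variables())

with tf.Session(graph=graph) as session:
    session.run(init_op)
    print(session.run(weights))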

CBOW v.s. skip-gram: why invert context and target words?

核能气质少年 submitted 2019-12-03 00:49:47
Question: On this page, it is said that: [...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...] However, looking at the training dataset it produces, the contents of the X and Y pairs seem to be interchangeable, as in these two (X, Y) pairs: (quick, brown), (brown, quick). So why distinguish so much between context and target if it is the same thing in the end? Also, doing Udacity's Deep Learning course exercise on word2vec, I wonder why they…
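A tiny sketch of how the two formulations generate training pairs from the same window (the sentence and window size are illustrative); it shows why the pairs look symmetric even though the prediction direction differs:

sentence = "the quick brown fox jumps".split()
window = 1

skipgram_pairs = []   # (target, context): predict each context word from the target
cbow_pairs = []       # (context_words, target): predict the target from its context

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    for c in context:
        skipgram_pairs.append((target, c))
    cbow_pairs.append((context, target))

print(skipgram_pairs)  # ('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...
print(cbow_pairs)      # (['the', 'brown'], 'quick'), ...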

CS224n Notes 2: word2vec

Anonymous (unverified), submitted 2019-12-03 00:41:02
In linguistics, "meaning" is roughly "denotation, reference, symbol". In the past, taxonomic lexicons were used; the common approach in computational linguistics is a lexical database like WordNet. For example, in NLTK you can query WordNet for the hypernyms of "panda" and get hypernyms such as "carnivore" and "animal"; you can also query synonyms of "good", such as "just".

This kind of discrete representation is not very accurate and loses nuance; for instance, the following "synonyms" still differ subtly in meaning: adept, expert, good, practiced, proficient, skillful. It also misses new words, requires manual labor, and cannot accurately compute word similarity.

Most NLP researchers treat the word as the smallest unit. In practice, a word is just a one-hot vector whose length equals the vocabulary size, which is a localist representation. Vocabulary size differs across corpora; for Google's 1TB corpus the vocabulary is about 13 million words, so this vector is really far too long. Symbolic representations also fail to express similarity of meaning: the one-hot vectors of "motel" and "hotel" are orthogonal, so no similarity can be computed from them.

Distributional similarity based representations: the linguist J. R. Firth proposed that the meaning of a word can be obtained from its context. Firth even suggested that only when you can place a word into the correct contexts have you truly grasped its meaning. This is one of the most successful ideas of modern statistical natural language processing:
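A tiny sketch of the orthogonality problem mentioned above (the vocabulary is illustrative):

import numpy as np

vocab = ["motel", "hotel", "dog"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Any two distinct one-hot vectors are orthogonal, so their dot product
# (and hence cosine similarity) is 0 no matter how related the words are.
print(np.dot(one_hot("motel"), one_hot("hotel")))  # 0.0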

Implementing word2vec with TensorFlow (skip-gram + NCE model)

Anonymous (unverified), submitted 2019-12-03 00:32:02
The code in this post is mainly based on the open-source "Basic word2vec example" on GitHub, but it keeps almost only the parts needed to build the network; to make it easier for me to understand as a beginner, I simplified some of the wording (not the model) and added my own annotations. The main goal is to get familiar with TensorFlow and to deepen my understanding of word2vec, so I'm writing it down here.

The data that is read in is ultimately stored word by word in vocabulary. Note that the words must keep the original word order of the text and must not be shuffled (the way word2vec works requires this), e.g. vocabulary = [I, like, eating, ChongQing, food, ...].

import os
import urllib.request
from tempfile import gettempdir

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    local_filename = os.path.join(gettempdir(), filename)
    if not os.path.exists(local_filename):
        local_filename, _ = urllib.request.urlretrieve(url + filename, local_filename)
    assert os.stat(local_filename).st_size == expected_bytes, 'Failed to verify ' + filename
    return local_filename
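For context, a condensed sketch of the core network the post goes on to build (TF 1.x graph-mode API; the vocabulary size, embedding size, and number of negative samples are illustrative values following the structure of the "Basic word2vec example"):

import math
import tensorflow as tf

vocabulary_size = 50000
embedding_size = 128
num_sampled = 64  # negative samples drawn per batch for NCE

graph = tf.Graph()
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[None])      # target word ids
    train_labels = tf.placeholder(tf.int32, shape=[None, 1])   # context word ids

    # Embedding matrix: the input-side word vectors, looked up by target id
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Output-side weights and biases for the NCE (noise-contrastive estimation) loss
    nce_weights = tf.Variable(tf.truncated_normal(
        [vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    loss = tf.reduce_mean(tf.nn.nce_loss(
        weights=nce_weights, biases=nce_biases, labels=train_labels,
        inputs=embed, num_sampled=num_sampled, num_classes=vocabulary_size))
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)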

What does a weighted word embedding mean?

蓝咒 submitted 2019-12-03 00:08:14
In the paper that I am trying to implement, it says: In this work, tweets were modeled using three types of text representation. The first one is a bag-of-words model weighted by tf-idf (term frequency - inverse document frequency) (Section 2.1.1). The second represents a sentence by averaging the word embeddings of all words (in the sentence), and the third represents a sentence by averaging the weighted word embeddings of all words, where the weight of a word is given by tf-idf (Section 2.1.2). I am not sure about the third representation, the one referred to as weighted word embeddings, which is…
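A minimal sketch of that third representation, a tf-idf weighted average of word embeddings (the tiny corpus and the random stand-in vectors wv are illustrative; in practice wv would be pretrained embeddings):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["i love this phone", "this phone is terrible", "love love love"]

# Fit tf-idf over the corpus so every word gets a per-document weight
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(tweets)
vocab = tfidf.get_feature_names_out()

# wv: any mapping word -> embedding vector; random vectors as a stand-in here
dim = 50
rng = np.random.default_rng(0)
wv = {w: rng.normal(size=dim) for w in vocab}

def weighted_sentence_vector(doc_index):
    """Average word embeddings, each weighted by its tf-idf score in this document."""
    row = tfidf_matrix[doc_index].toarray().ravel()
    weights = {vocab[j]: row[j] for j in row.nonzero()[0]}
    vecs = [weights[w] * wv[w] for w in weights]
    return np.sum(vecs, axis=0) / (sum(weights.values()) or 1.0)

print(weighted_sentence_vector(0)[:5])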