word2vec

How does PyTorch BigGraph manage to train extremely large-scale graph models efficiently?

Submitted by 不问归期 on 2021-02-08 05:46:18
From Medium. Author: Jesus Rodriguez. Compiled by 机器之心 (Synced). Editor: Panda.

Facebook has proposed BigGraph, a framework that can efficiently train graph models with billions of nodes and trillions of edges, and has open-sourced its PyTorch implementation. This article explains what is novel about it and why it can extract knowledge from large-scale graph networks so efficiently.

Graphs are one of the most fundamental data structures in machine learning applications. Specifically, graph embedding methods are unsupervised learning methods that use the local graph structure to learn representations of nodes. Training data in mainstream scenarios such as social-media prediction, IoT pattern detection, or drug sequence modeling can be naturally represented as graphs, and each of these scenarios can easily yield graphs with billions of connected nodes. Graph structures are rich and inherently navigable, which makes them well suited to machine learning models. Nevertheless, graphs are also complex and hard to scale, and as a result modern deep learning frameworks still offer only limited support for large-scale graph data structures.

Facebook has released a framework called PyTorch BigGraph (https://github.com/facebookresearch/PyTorch-BigGraph) that makes it faster and easier to generate graph embeddings for extremely large graphs in PyTorch models.

To some extent, a graph can be seen as a substitute for a labeled training dataset, because the connections between nodes can be used to infer specific relationships. This follows the pattern of unsupervised graph embedding methods, which learn a vector representation for every node in the graph by optimizing the embeddings of node pairs.
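The excerpt contains no code, but the core idea it describes, learning node vectors by optimizing the embeddings of connected node pairs, can be sketched in a few lines of plain PyTorch. This is only an illustrative toy with made-up data, not PyTorch BigGraph's actual training code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy edge list: (source, destination) node-id pairs for a 4-node cycle.
    edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 0]])
    num_nodes, dim = 4, 16

    emb = nn.Embedding(num_nodes, dim)
    opt = torch.optim.Adam(emb.parameters(), lr=0.01)

    for step in range(200):
        src, dst = edges[:, 0], edges[:, 1]
        neg = torch.randint(0, num_nodes, dst.shape)    # random "negative" endpoints
        pos_score = (emb(src) * emb(dst)).sum(dim=1)    # connected pairs: pull together
        neg_score = (emb(src) * emb(neg)).sum(dim=1)    # random pairs: push apart
        loss = -F.logsigmoid(pos_score).mean() - F.logsigmoid(-neg_score).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    node_vectors = emb.weight.detach()                  # one embedding per node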

How to get both the word embeddings vector and context vector of a given word by using word2vec?

Submitted by 天涯浪子 on 2021-02-07 03:52:50
Question:

    from gensim.models import word2vec
    sentences = word2vec.Text8Corpus('TextFile')
    model = word2vec.Word2Vec(sentences, size=200, min_count=2, workers=4)
    print model['king']

Is the output vector the context vector of 'king' or the word embedding vector of 'king'? How can I get both the context vector and the word embedding vector of 'king'? Thanks!

Answer 1: It is the embedding vector for 'king'. If you use hierarchical softmax, the context vectors are in model.syn1, and if you use negative sampling they are in model.syn1neg.
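For reference, here is a small sketch of reading out both vectors with the current gensim (4.x) API; the syn1/syn1neg attributes still exist there, but the constructor argument is vector_size rather than size, and the corpus below is a made-up toy standing in for Text8Corpus:

    from gensim.models import Word2Vec

    # Toy corpus standing in for word2vec.Text8Corpus('TextFile').
    sentences = [["the", "king", "rules"], ["the", "queen", "rules"]] * 50

    # With the default negative sampling, context vectors live in model.syn1neg;
    # with hs=1 they would live in model.syn1 instead.
    model = Word2Vec(sentences, vector_size=200, min_count=2, workers=4)

    idx = model.wv.key_to_index["king"]
    word_vector = model.wv["king"]        # input ("word embedding") vector
    context_vector = model.syn1neg[idx]   # output ("context") vector
    print(word_vector.shape, context_vector.shape)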

Difference between Fasttext .vec and .bin file

Submitted by 孤人 on 2021-02-06 09:45:11
Question: I recently downloaded the FastText pretrained model for English. I got two files:

    wiki.en.vec
    wiki.en.bin

I am not sure what the difference is between the two files.

Answer 1: The .vec file contains only the aggregated word vectors, in plain text. The .bin file in addition contains the model parameters and, crucially, the vectors for all the n-grams. So if you want to encode words you did not train on using those n-grams (FastText's famous "subword information"), you need to find an API that can load the .bin file.
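As an illustration (assuming gensim is available and the two files from the question are on disk; loading the full wiki.en model needs a lot of RAM), the difference shows up in out-of-vocabulary lookups:

    from gensim.models import KeyedVectors
    from gensim.models.fasttext import load_facebook_vectors

    # .vec: plain-text word vectors only -- unknown words raise a KeyError.
    vec = KeyedVectors.load_word2vec_format("wiki.en.vec")

    # .bin: full model including n-gram vectors -- unknown words get a vector
    # composed from their character n-grams ("subword information").
    ft = load_facebook_vectors("wiki.en.bin")
    print(ft["unseenmisspelledword"].shape)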

NLP Series, Part 3: Word Vectors

Submitted by 烈酒焚心 on 2021-02-03 07:01:13
Some people will probably say that little Dream is being lazy: word vectors are something you can find all over Baidu, so why write about them again? In my view, word vectors are an extremely important link in natural language processing. Although they received little attention at first, once neural networks became popular again they came to be regarded as foundational work in NLP. On the other hand, most online articles about word vectors are copied from one another; very few explain the topic in a way that is accessible yet still has depth. Finally, for the sake of the completeness and coherence of this series, I decided to give a proper treatment of word vectors, a very basic yet important piece of work.

1. Text vectorization

First, let us pose a question: after a text has been segmented into words, how should it be represented before being fed into an NLP model? Take, for example, "人/如果/没有/梦想/,/跟/咸鱼/还有/什么/差别" ("if a person has no dream, what difference is there from a salted fish"). Feeding the raw string directly into a machine learning model is clearly unwise: it is not convenient for the model to compute with, nor for comparing texts. So we need a way of representing a text that makes comparison and computation between texts easy. The most obvious idea is to represent the text as a vector. For example, based on the segmented corpus, build a vocabulary and represent each word with a vector; the text can then be vectorized.

2. The bag-of-words model

To talk about word vectors, we first have to talk about the bag-of-words model. The bag-of-words model treats a text as nothing more than a bag of words. For example, consider these two texts:

(1) "人/如果/没有/梦想/,/跟/咸鱼/还有/什么/差别"
(2) "人生/短短/几十/年/,/差别/不大/,/开心/最/重要"

These two texts can form a dictionary like: {"人",
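The post is cut off here, but the bag-of-words idea it is introducing can be sketched directly from the two example sentences above (a toy illustration, not code from the original post):

    # Build a vocabulary from segmented texts, then represent each text as word counts.
    texts = [
        ["人", "如果", "没有", "梦想", "跟", "咸鱼", "还有", "什么", "差别"],
        ["人生", "短短", "几十", "年", "差别", "不大", "开心", "最", "重要"],
    ]

    vocab = sorted({w for t in texts for w in t})      # the shared dictionary
    index = {w: i for i, w in enumerate(vocab)}

    def bow_vector(tokens):
        vec = [0] * len(vocab)
        for w in tokens:
            vec[index[w]] += 1
        return vec

    for t in texts:
        print(bow_vector(t))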

word2vec cosine similarity greater than 1 arabic text

Submitted by 自作多情 on 2021-01-29 22:01:22
Question: I have trained my word2vec model with gensim and I am getting the nearest neighbours for some words in the corpus. Here are the similarity scores:

    top neighbors for الاحتلال:
    الاحتلال: 1.0000001192092896
    الاختلال: 0.9541053175926208
    الاهتلال: 0.872565507888794
    الاحثلال: 0.8386293649673462
    الاكتلال: 0.8209128379821777

It is odd to get a similarity greater than 1. I cannot apply any stemming to my text because the text includes many OCR spelling mistakes (I got the text from OCR-ed documents).
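The excerpt ends before any answer, but a value like 1.0000001192092896 for a word compared with itself is the classic signature of float32 rounding: cosine similarity is computed on unit-normalised float32 vectors, and the dot product can land a hair above 1.0. A small sketch of the effect (illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    v = rng.random(300).astype(np.float32)
    v /= np.linalg.norm(v)

    sim = float(np.dot(v, v))   # cosine similarity of a unit vector with itself
    print(sim)                  # often not exactly 1.0, e.g. 1.0000001 or 0.99999994
    print(min(sim, 1.0))        # clip if a hard upper bound of 1 is required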

what is workers parameter in word2vec in NLP

Submitted by 天大地大妈咪最大 on 2021-01-29 14:57:52
Question: In the code below, I don't understand the meaning of the workers parameter.

    model = Word2Vec(sentences, size=300000, window=2, min_count=5, workers=4)

Answer 1: workers = use this many worker threads to train the model (= faster training on multicore machines). If your system has 2 cores and you specify workers=2, then the data will be trained in two parallel ways. By default, workers=1, i.e. no parallelization.

Answer 2: As others have mentioned, workers controls the number of independent threads
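A common pattern is to set workers to the machine's core count. A minimal sketch with a made-up toy corpus (note that in recent gensim versions the default is actually workers=3, and the size argument from the question is called vector_size in gensim 4.x):

    import multiprocessing
    from gensim.models import Word2Vec

    sentences = [["hello", "world"], ["word2vec", "workers", "example"]] * 100

    # One worker thread per CPU core; more threads only help up to the core count.
    model = Word2Vec(
        sentences,
        vector_size=100,
        window=2,
        min_count=1,
        workers=multiprocessing.cpu_count(),
    )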

How to incrementally train a word2vec model with new vocabularies

Submitted by 有些话、适合烂在心里 on 2021-01-29 04:33:08
Question: I have a dataset of over 40 GB. My tokenizer process is killed due to limited memory, so I am trying to split my dataset. How can I train the word2vec model incrementally, that is, how can I use separate datasets to train one word2vec model? My current word2vec code is:

    model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=1, workers=10)
    model.train(documents, total_examples=len(documents), epochs=epochs)
    model.save("./word2vec150d/word2vec_{}.model".format(epochs))

Any help would be appreciated!
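The excerpt stops before any answer, but gensim does support growing the vocabulary and continuing training via build_vocab(..., update=True). A sketch, assuming gensim 4.x and two made-up corpus chunks (training chunk by chunk is not mathematically identical to a single pass over the whole corpus):

    from gensim.models import Word2Vec

    chunk1 = [["first", "chunk", "of", "sentences"]] * 100
    chunk2 = [["second", "chunk", "with", "new", "vocabulary"]] * 100

    model = Word2Vec(vector_size=150, window=10, min_count=1, workers=10)
    model.build_vocab(chunk1)
    model.train(chunk1, total_examples=model.corpus_count, epochs=model.epochs)

    # Add the new chunk's words to the existing vocabulary, then keep training.
    model.build_vocab(chunk2, update=True)
    model.train(chunk2, total_examples=len(chunk2), epochs=model.epochs)

    model.save("word2vec150d.model")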

Different models with gensim Word2Vec on python

Submitted by 我们两清 on 2021-01-28 14:02:40
Question: I am trying to apply the word2vec model implemented in the gensim library in Python. I have a list of sentences (each sentence is a list of words). For instance:

    sentences = [['first', 'second', 'third', 'fourth']] * n

and I build two identical models:

    model = gensim.models.Word2Vec(sentences, min_count=1, size=2)
    model2 = gensim.models.Word2Vec(sentences, min_count=1, size=2)

I realize that the models are sometimes the same and sometimes different, depending on the value of n. For
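The excerpt is cut off, but the behaviour it describes is the usual non-determinism of word2vec training. If run-to-run reproducibility matters, gensim's own guidance is to fix the seed, use a single worker thread, and pin PYTHONHASHSEED. A sketch (assuming gensim 4.x, where size is called vector_size):

    from gensim.models import Word2Vec

    sentences = [["first", "second", "third", "fourth"]] * 1000

    # Same seed + a single worker thread -> the two models come out identical.
    model = Word2Vec(sentences, min_count=1, vector_size=2, seed=42, workers=1)
    model2 = Word2Vec(sentences, min_count=1, vector_size=2, seed=42, workers=1)

    print((model.wv.vectors == model2.wv.vectors).all())  # True under these settings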

Understanding gensim word2vec's most_similar

Submitted by 為{幸葍}努か on 2021-01-28 10:50:30
Question: I am unsure how I should use the most_similar method of gensim's Word2Vec. Let's say you want to test the tried-and-true example: man stands to king as woman stands to X; find X. I thought that is what you could do with this method, but from the results I am getting I don't think that is true. The documentation reads:

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively. This method computes cosine similarity between a
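The excerpt stops mid-sentence, but the standard way to phrase that analogy with most_similar is to put 'king' and 'woman' in positive and 'man' in negative. An illustrative sketch using a small pretrained model from gensim-data (downloads on first use):

    import gensim.downloader as api

    # "man is to king as woman is to X"
    wv = api.load("glove-wiki-gigaword-50")
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # The top hit is typically 'queen'.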