word2vec

Gensim Word2Vec uses too much memory

蹲街弑〆低调 · submitted on 2019-12-11 06:46:48

Question: I want to train a word2vec model on a tokenized file of size 400 MB. I have been trying to run this Python code:

```python
import operator
import gensim, logging, os
from gensim.models import Word2Vec
from gensim.models import *

class Sentences(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for line in open(self.filename):
            yield line.split()

def runTraining(input_file, output_file):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                        level=logging.INFO)
```
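The `Sentences` iterator above already streams the corpus, so gensim's memory use is dominated by the model matrices, whose size depends on vocabulary size and vector dimensionality, not on the 400 MB file. A rough back-of-the-envelope sketch, assuming float32 vectors and two weight matrices (input word vectors plus output weights); the vocabulary sizes below are illustrative assumptions, not measurements:

```python
def estimate_word2vec_memory_mb(vocab_size, vector_size):
    """Rough lower bound on gensim Word2Vec RAM: two float32
    matrices of shape (vocab_size, vector_size)."""
    bytes_per_float = 4
    matrices = 2  # word vectors + output-layer weights
    return vocab_size * vector_size * bytes_per_float * matrices / (1024 ** 2)

# A 2M-word vocabulary with 300-d vectors needs several GB:
big = estimate_word2vec_memory_mb(2_000_000, 300)
# Raising min_count so the vocabulary shrinks to 200k words cuts that 10x:
small = estimate_word2vec_memory_mb(200_000, 300)
```

This is why raising `min_count` (or lowering `size`) is usually the first lever when Word2Vec exhausts RAM.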

Gensim Phrases usage to filter n-grams

久未见 · submitted on 2019-12-11 06:07:25

Question: I am using Gensim Phrases to identify important n-grams in my text, as follows:

```python
bigram = Phrases(documents, min_count=5)
trigram = Phrases(bigram[documents], min_count=5)

for sent in documents:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
```

However, this detects uninteresting n-grams such as "special issue", "important matter", and "high risk". I am particularly interested in detecting concepts in the text, such as "machine learning" and "human computer interaction". Is there a way to
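Phrases keeps a bigram when its association score clears `threshold`, so one lever is raising that threshold (or using a normalized scorer such as `scoring='npmi'`, where supported) so that only strongly associated pairs survive. A minimal sketch of the underlying idea, using a hand-rolled PMI score on invented counts:

```python
from math import log

def pmi(count_ab, count_a, count_b, total):
    """Pointwise mutual information of a bigram (a, b):
    how much more often the pair occurs than chance predicts."""
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return log(p_ab / (p_a * p_b))

total = 10_000
# "machine learning": the pair co-occurs far more often than chance,
# so its PMI is high.
strong = pmi(count_ab=50, count_a=80, count_b=70, total=total)
# "high risk": both words are individually common and co-occur
# near chance, so its PMI is low.
weak = pmi(count_ab=8, count_a=900, count_b=850, total=total)
```

Filtering candidates by a PMI-style score (which is what a higher Phrases `threshold` does) keeps the concept-like pairs and drops the incidental ones.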

How is SpaCy's similarity computed?

无人久伴 · submitted on 2019-12-11 01:36:09

Question: Beginner NLP question here: how does the .similarity method work? Wow, spaCy is great! Its tf-idf model could be easier to preprocess, but w2v with only one line of code (token.vector)?! Awesome! In his 10-line tutorial on spaCy, andrazhribernik shows us the .similarity method that can be run on tokens, sents, word chunks, and docs. After `nlp = spacy.load('en')` and `doc = nlp(raw_text)` we can do .similarity queries between tokens and chunks. However, what is being calculated behind the scenes
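In spaCy's classic vector models, `.similarity` is the cosine similarity of the two `.vector` attributes, and a Doc or Span vector is the average of its token vectors. A dependency-free sketch of that computation (the 3-d "token vectors" are invented stand-ins for real model vectors):

```python
def average(vectors):
    """Span/Doc vector: the element-wise mean of its token vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

doc1 = average([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])  # mean of two token vectors
doc2 = average([[1.0, 1.0, 2.0]])                   # single-token "doc"
sim = cosine(doc1, doc2)  # doc2 is a scaled copy of doc1, so sim is ~1.0
```

Because a document vector is just a mean, two texts with similar word vectors score high even when word order differs entirely.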

Pairwise Earth Mover Distance across all documents (word2vec representations)

倾然丶 夕夏残阳落幕 · submitted on 2019-12-11 00:48:07

Question: Is there a library that will take a list of documents and compute, en masse, the n×n matrix of distances, where the word2vec model is supplied? I can see that gensim allows you to do this between two documents, but I need a fast comparison across all docs, like sklearn's cosine_similarity.

Answer 1: The "Word Mover's Distance" (earth mover's distance applied to groups of word vectors) is a fairly involved optimization calculation dependent on every word in each document. I'm not aware of any tricks
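Since gensim's `wmdistance` compares one pair at a time and WMD is symmetric, an all-pairs matrix can be filled by looping over the upper triangle and mirroring it, halving the work. A sketch of that loop with a stand-in distance function; substitute `model.wv.wmdistance` for `dist` when a gensim model is on hand:

```python
def pairwise_matrix(docs, dist):
    """Symmetric n x n distance matrix; each pair computed once."""
    n = len(docs)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(docs[i], docs[j])
            m[i][j] = m[j][i] = d
    return m

# Stand-in distance for the demo: size of the symmetric difference
# of the word sets (NOT real WMD, just something symmetric).
dist = lambda a, b: len(set(a) ^ set(b))

docs = [["obama", "speaks"], ["president", "greets"], ["obama", "greets"]]
m = pairwise_matrix(docs, dist)
```

Even with the symmetry trick this is O(n²) WMD evaluations, which is why the answer above warns that there is no cheap shortcut for large collections.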

Spark MLlib Word2Vec error: The vocabulary size should be > 0

荒凉一梦 · submitted on 2019-12-11 00:18:39

Question: I am trying to implement word vectorization using Spark's MLlib. I am following the example given here. I have a bunch of sentences which I want to give as input to train the model, but I am not sure whether this model takes sentences or just takes all the words as one sequence of strings. My input is as below:

```scala
scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from,
```
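Spark's Word2Vec expects one token sequence (`Seq[String]`) per sentence, and its `minCount` parameter defaults to 5: every word seen fewer than 5 times is dropped before training, and if that removes every word the vocabulary is empty and training aborts with exactly this "vocabulary size should be > 0" error. A pure-Python sketch of that filtering step (the sentences and thresholds are illustrative):

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Mimic Word2Vec vocabulary pruning: count words across all
    sentences, keep only those seen at least min_count times."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, c in counts.items() if c >= min_count}

sentences = [["big", "baller", "shoe"], ["big", "win"], ["big", "shoe"]]

# With the default minCount=5, every word in this small sample is
# filtered out -> empty vocabulary -> the error in the title.
empty = build_vocab(sentences, min_count=5)

# Lowering the threshold (setMinCount(1) on the Spark estimator)
# leaves a usable vocabulary.
vocab = build_vocab(sentences, min_count=1)
```

So for small or long-tailed corpora, calling `setMinCount` with a lower value is the usual fix.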

Creating a word-vector model combining words from other models

女生的网名这么多〃 · submitted on 2019-12-10 22:13:47

Question: I have two different word-vector models created using the word2vec algorithm. The issue I am facing is that a few words from the first model are not present in the second model. I want to create a third model from the two different word-vector models, where I can use word vectors from both models without losing the meaning and context of the word vectors. Can I do this, and if so, how?

Answer 1: You could potentially translate the vectors for the words only in one model into the other model's coordinate space, using other
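The translation idea in the answer is usually done by learning a linear map W from model A's space to model B's space on the words both models share (as in Mikolov et al.'s translation-matrix work, which gensim also wraps as a TranslationMatrix helper), then applying W to the words only in A. A minimal NumPy sketch with invented 2-d vectors, where model B happens to be an exact rotation of model A so the fit recovers the map perfectly:

```python
import numpy as np

# Vectors (one row per word) for words shared by both models; invented data.
A_shared = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])

# Pretend model B represents the same words rotated by 90 degrees.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
B_shared = A_shared @ R.T

# Least-squares fit of the linear map A -> B on the shared words.
W, *_ = np.linalg.lstsq(A_shared, B_shared, rcond=None)

# Project a word that exists only in model A into B's coordinate space.
a_only = np.array([2.0, 3.0])
projected = a_only @ W
```

With real models the map is only approximate, so the projected vectors land near, not exactly on, where the word "should" be in the target space.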

Use pre-trained word2vec in an LSTM language model?

北城余情 · submitted on 2019-12-10 18:26:36

Question: I used TensorFlow to train an LSTM language model; the code is from here. According to the article here, it seems that it works better if I use pre-trained word2vec: "Using word embeddings such as word2vec and GloVe is a popular method to improve the accuracy of your model. Instead of using one-hot vectors to represent our words, the low-dimensional vectors learned using word2vec or GloVe carry semantic meaning – similar words have similar vectors. Using these vectors is a form of pre-training." So, I
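The usual pattern is to build an embedding matrix indexed by your vocabulary's word ids, copying the pre-trained vector where one exists and initializing the remaining rows randomly; in TensorFlow that matrix then seeds the embedding variable (for example via an initializer or an assign op). A NumPy sketch of the matrix-building step, with an invented vocabulary and invented pre-trained vectors:

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "zyzzyva": 2}            # word -> id
pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}   # from word2vec/GloVe
dim = 2

# Random init covers out-of-vocabulary words like "zyzzyva".
rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(len(vocab), dim))

# Overwrite rows for words the pre-trained model knows.
for word, idx in vocab.items():
    if word in pretrained:
        emb[idx] = pretrained[word]

# Embedding lookup is then plain row indexing by word id.
sentence_ids = [0, 1]
vectors = emb[sentence_ids]
```

Whether you then freeze `emb` or fine-tune it during LSTM training is a separate choice; fine-tuning usually helps when the task corpus is large enough.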

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

佐手、 · submitted on 2019-12-10 14:53:51

Question: I am working on node2vec. When I use a small dataset the code works well, but as soon as I run the same code on a large dataset, it crashes with the error: Process finished with exit code 134 (interrupted by signal 6: SIGABRT). The line that raises the error is:

```python
model = Word2Vec(walks, size=args.dimensions, window=args.window_size,
                 min_count=0, sg=1, workers=args.workers, iter=args.iter)
```

I am using PyCharm and Python 3.5. Any idea what is happening? I could not find any post that could
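A SIGABRT that appears only on large inputs very often means the process ran out of memory: here `walks` is typically a fully materialized Python list of walks, which alone can exhaust RAM before training starts. One mitigation (a sketch, assuming the walks can be dumped to disk one per line) is to stream them through a restartable iterable; a plain generator is not enough, because Word2Vec iterates over the corpus multiple times:

```python
import os
import tempfile

class WalkCorpus(object):
    """Restartable iterable: streams one walk per line from disk,
    so the walks never all sit in RAM at once."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.split()

# Tiny demo file standing in for a large node2vec walk dump.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("n1 n2 n3\nn2 n4\n")

walks = WalkCorpus(path)
first_pass = [w for w in walks]
second_pass = [w for w in walks]  # a bare generator would be exhausted here
os.remove(path)
```

Note also that `min_count=0` keeps every node in the vocabulary; on a large graph that inflates the model matrices and makes the memory problem worse.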

How to do keyword mapping in pandas

≯℡__Kan透↙ · submitted on 2019-12-10 12:05:15

Question: I have the keywords India, Japan, United States, Germany, and China. Here's a sample dataframe:

```
id  Address
1   Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan
2   Arcisstraße 21, 80333 München, Germany
3   Liberty Street, Manhattan, New York, United States
4   30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
5   Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India
```

My goal is to make:

```
id  Address                                             India  Japan  United States  Germany  China
1   Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan  0      1      0              0        0
2
```
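One common approach (a sketch, assuming a plain substring match per keyword is acceptable) is to build one indicator column per keyword with `str.contains` and cast the boolean result to int:

```python
import pandas as pd

keywords = ["India", "Japan", "United States", "Germany", "China"]

df = pd.DataFrame({
    "id": [1, 2],
    "Address": [
        "Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan",
        "Arcisstraße 21, 80333 München, Germany",
    ],
})

# One 0/1 column per keyword; regex=False treats the keyword literally,
# so multi-word keys like "United States" need no escaping.
for kw in keywords:
    df[kw] = df["Address"].str.contains(kw, regex=False).astype(int)
```

For case-insensitive or word-boundary matching you would switch to a regex pattern instead of the literal `regex=False` match.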

How to use pretrained Word2Vec model in Tensorflow

让人想犯罪 __ · submitted on 2019-12-10 04:01:52

Question: I have a Word2Vec model which was trained in gensim. How can I use it in TensorFlow for word embeddings? I don't want to train embeddings from scratch in TensorFlow. Can someone tell me how to do it, with some example code?

Answer 1: Let's assume you have a dictionary and an inverse_dict list, with the list index corresponding to the most common words:

```python
vocab = {'hello': 0, 'world': 2, 'neural': 1, 'networks': 3}
inv_dict = ['hello', 'neural', 'world', 'networks']
```

Notice how the inverse_dict index
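Continuing that setup, the embedding matrix is assembled row by row in `inv_dict` order, so that row i holds the vector for word id i; `tf.nn.embedding_lookup(embedding_matrix, ids)` is then just row indexing, which the NumPy below mimics (the 2-d vectors are stand-ins for a real gensim model's word vectors):

```python
import numpy as np

vocab = {'hello': 0, 'world': 2, 'neural': 1, 'networks': 3}
inv_dict = ['hello', 'neural', 'world', 'networks']

# Stand-in for a trained gensim model's vectors (model.wv in gensim).
w2v = {'hello': [1.0, 0.0], 'neural': [0.0, 1.0],
       'world': [1.0, 1.0], 'networks': [0.5, 0.5]}

# Row i of the matrix is the vector of the word whose id is i,
# which is exactly the ordering inv_dict encodes.
embedding_matrix = np.array([w2v[word] for word in inv_dict])

ids = [vocab['hello'], vocab['world']]   # word ids for a sentence
looked_up = embedding_matrix[ids]        # what embedding_lookup returns
```

In TensorFlow the only extra step is wrapping `embedding_matrix` in a variable or constant before the lookup, so the pre-trained vectors are never retrained from scratch.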