word2vec

Gensim Word2Vec uses too much memory

蹲街弑〆低调 · submitted on 2019-12-11 06:46:48

Question: I want to train a word2vec model on a tokenized file of size 400 MB. I have been trying to run this Python code:

```python
import operator
import gensim, logging, os
from gensim.models import Word2Vec
from gensim.models import *

class Sentences(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for line in open(self.filename):
            yield line.split()

def runTraining(input_file, output_file):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                        level=logging.INFO)
```
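The `Sentences` iterator above already streams the corpus, so gensim's memory use is dominated by the model matrices, whose size depends on vocabulary size and vector dimensionality, not on the 400 MB file. A rough back-of-the-envelope sketch, assuming float32 vectors and two weight matrices (input word vectors plus output weights); the vocabulary sizes below are illustrative assumptions, not measurements:

```python
def estimate_word2vec_memory_mb(vocab_size, vector_size):
    """Rough lower bound on gensim Word2Vec RAM: two float32
    matrices of shape (vocab_size, vector_size)."""
    bytes_per_float = 4
    matrices = 2  # word vectors + output-layer weights
    return vocab_size * vector_size * bytes_per_float * matrices / (1024 ** 2)

# A 2M-word vocabulary with 300-d vectors needs several GB:
big = estimate_word2vec_memory_mb(2_000_000, 300)
# Raising min_count so the vocabulary shrinks to 200k words cuts that 10x:
small = estimate_word2vec_memory_mb(200_000, 300)
```

This is why raising `min_count` (or lowering `size`) is usually the first lever when Word2Vec exhausts RAM.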

Gensim Phrases usage to filter n-grams

久未见 · submitted on 2019-12-11 06:07:25

Question: I am using Gensim Phrases to identify important n-grams in my text, as follows:

```python
bigram = Phrases(documents, min_count=5)
trigram = Phrases(bigram[documents], min_count=5)

for sent in documents:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
```

However, this detects uninteresting n-grams such as "special issue", "important matter", and "high risk". I am particularly interested in detecting concepts in the text, such as "machine learning" and "human computer interaction". Is there a way to
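Phrases keeps a bigram when its association score clears `threshold`, so one lever is raising that threshold (or using a normalized scorer such as `scoring='npmi'`, where supported) so that only strongly associated pairs survive. A minimal sketch of the underlying idea, using a hand-rolled PMI score on invented counts:

```python
from math import log

def pmi(count_ab, count_a, count_b, total):
    """Pointwise mutual information of a bigram (a, b):
    how much more often the pair occurs than chance predicts."""
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return log(p_ab / (p_a * p_b))

total = 10_000
# "machine learning": the pair co-occurs far more often than chance,
# so its PMI is high.
strong = pmi(count_ab=50, count_a=80, count_b=70, total=total)
# "high risk": both words are individually common and co-occur
# near chance, so its PMI is low.
weak = pmi(count_ab=8, count_a=900, count_b=850, total=total)
```

Filtering candidates by a PMI-style score (which is what a higher Phrases `threshold` does) keeps the concept-like pairs and drops the incidental ones.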

How is SpaCy's similarity computed?

无人久伴 · submitted on 2019-12-11 01:36:09

Question: Beginner NLP question here: how does the .similarity method work? Wow, spaCy is great! Its tf-idf model could be easier to preprocess, but w2v with only one line of code (token.vector)?! Awesome! In his 10-line tutorial on spaCy, andrazhribernik shows us the .similarity method that can be run on tokens, sents, word chunks, and docs. After `nlp = spacy.load('en')` and `doc = nlp(raw_text)` we can do .similarity queries between tokens and chunks. However, what is being calculated behind the scenes
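In spaCy's classic vector models, `.similarity` is the cosine similarity of the two `.vector` attributes, and a Doc or Span vector is the average of its token vectors. A dependency-free sketch of that computation (the 3-d "token vectors" are invented stand-ins for real model vectors):

```python
def average(vectors):
    """Span/Doc vector: the element-wise mean of its token vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

doc1 = average([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])  # mean of two token vectors
doc2 = average([[1.0, 1.0, 2.0]])                   # single-token "doc"
sim = cosine(doc1, doc2)  # doc2 is a scaled copy of doc1, so sim is ~1.0
```

Because a document vector is just a mean, two texts with similar word vectors score high even when word order differs entirely.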

Pairwise Earth Mover Distance across all documents (word2vec representations)

倾然丶 夕夏残阳落幕 · submitted on 2019-12-11 00:48:07

Question: Is there a library that will take a list of documents and compute, en masse, the n×n matrix of distances, where the word2vec model is supplied? I can see that gensim allows you to do this between two documents, but I need a fast comparison across all docs, like sklearn's cosine_similarity.

Answer 1: The "Word Mover's Distance" (earth mover's distance applied to groups of word vectors) is a fairly involved optimization calculation dependent on every word in each document. I'm not aware of any tricks
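Since gensim's `wmdistance` compares one pair at a time and WMD is symmetric, an all-pairs matrix can be filled by looping over the upper triangle and mirroring it, halving the work. A sketch of that loop with a stand-in distance function; substitute `model.wv.wmdistance` for `dist` when a gensim model is on hand:

```python
def pairwise_matrix(docs, dist):
    """Symmetric n x n distance matrix; each pair computed once."""
    n = len(docs)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(docs[i], docs[j])
            m[i][j] = m[j][i] = d
    return m

# Stand-in distance for the demo: size of the symmetric difference
# of the word sets (NOT real WMD, just something symmetric).
dist = lambda a, b: len(set(a) ^ set(b))

docs = [["obama", "speaks"], ["president", "greets"], ["obama", "greets"]]
m = pairwise_matrix(docs, dist)
```

Even with the symmetry trick this is O(n²) WMD evaluations, which is why the answer above warns that there is no cheap shortcut for large collections.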

Spark MLlib Word2Vec error: The vocabulary size should be > 0

荒凉一梦 · submitted on 2019-12-11 00:18:39

Question: I am trying to implement word vectorization using Spark's MLlib. I am following the example given here. I have a bunch of sentences which I want to give as input to train the model, but I am not sure whether this model takes sentences or just takes all the words as one sequence of strings. My input is as below:

```scala
scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from,
```
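Spark's Word2Vec expects one token sequence (`Seq[String]`) per sentence, and its `minCount` parameter defaults to 5: every word seen fewer than 5 times is dropped before training, and if that removes every word the vocabulary is empty and training aborts with exactly this "vocabulary size should be > 0" error. A pure-Python sketch of that filtering step (the sentences and thresholds are illustrative):

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Mimic Word2Vec vocabulary pruning: count words across all
    sentences, keep only those seen at least min_count times."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, c in counts.items() if c >= min_count}

sentences = [["big", "baller", "shoe"], ["big", "win"], ["big", "shoe"]]

# With the default minCount=5, every word in this small sample is
# filtered out -> empty vocabulary -> the error in the title.
empty = build_vocab(sentences, min_count=5)

# Lowering the threshold (setMinCount(1) on the Spark estimator)
# leaves a usable vocabulary.
vocab = build_vocab(sentences, min_count=1)
```

So for small or long-tailed corpora, calling `setMinCount` with a lower value is the usual fix.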

Creating a word-vector model combining words from other models

女生的网名这么多〃 · submitted on 2019-12-10 22:13:47

Question: I have two different word-vector models created using the word2vec algorithm. The issue I am facing is that a few words from the first model are not present in the second model. I want to create a third model from the two different word-vector models, where I can use word vectors from both models without losing the meaning and context of the word vectors. Can I do this, and if so, how?

Answer 1: You could potentially translate the vectors for the words only in one model into the other model's coordinate space, using other
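The translation idea in the answer is usually done by learning a linear map W from model A's space to model B's space on the words both models share (as in Mikolov et al.'s translation-matrix work, which gensim also wraps as a TranslationMatrix helper), then applying W to the words only in A. A minimal NumPy sketch with invented 2-d vectors, where model B happens to be an exact rotation of model A so the fit recovers the map perfectly:

```python
import numpy as np

# Vectors (one row per word) for words shared by both models; invented data.
A_shared = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])

# Pretend model B represents the same words rotated by 90 degrees.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
B_shared = A_shared @ R.T

# Least-squares fit of the linear map A -> B on the shared words.
W, *_ = np.linalg.lstsq(A_shared, B_shared, rcond=None)

# Project a word that exists only in model A into B's coordinate space.
a_only = np.array([2.0, 3.0])
projected = a_only @ W
```

With real models the map is only approximate, so the projected vectors land near, not exactly on, where the word "should" be in the target space.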

Use pre-trained word2vec in an LSTM language model?

北城余情 · submitted on 2019-12-10 18:26:36

Question: I used TensorFlow to train an LSTM language model; the code is from here. According to the article here, it seems that it works better if I use pre-trained word2vec: "Using word embeddings such as word2vec and GloVe is a popular method to improve the accuracy of your model. Instead of using one-hot vectors to represent our words, the low-dimensional vectors learned using word2vec or GloVe carry semantic meaning – similar words have similar vectors. Using these vectors is a form of pre-training." So, I
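The usual pattern is to build an embedding matrix indexed by your vocabulary's word ids, copying the pre-trained vector where one exists and initializing the remaining rows randomly; in TensorFlow that matrix then seeds the embedding variable (for example via an initializer or an assign op). A NumPy sketch of the matrix-building step, with an invented vocabulary and invented pre-trained vectors:

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "zyzzyva": 2}            # word -> id
pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}   # from word2vec/GloVe
dim = 2

# Random init covers out-of-vocabulary words like "zyzzyva".
rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(len(vocab), dim))

# Overwrite rows for words the pre-trained model knows.
for word, idx in vocab.items():
    if word in pretrained:
        emb[idx] = pretrained[word]

# Embedding lookup is then plain row indexing by word id.
sentence_ids = [0, 1]
vectors = emb[sentence_ids]
```

Whether you then freeze `emb` or fine-tune it during LSTM training is a separate choice; fine-tuning usually helps when the task corpus is large enough.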

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

佐手、 · submitted on 2019-12-10 14:53:51

Question: I am working on node2vec. When I use a small dataset the code works well, but as soon as I run the same code on a large dataset, it crashes with the error: Process finished with exit code 134 (interrupted by signal 6: SIGABRT). The line that raises the error is:

```python
model = Word2Vec(walks, size=args.dimensions, window=args.window_size,
                 min_count=0, sg=1, workers=args.workers, iter=args.iter)
```

I am using PyCharm and Python 3.5. Any idea what is happening? I could not find any post that could
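A SIGABRT that appears only on large inputs very often means the process ran out of memory: here `walks` is typically a fully materialized Python list of walks, which alone can exhaust RAM before training starts. One mitigation (a sketch, assuming the walks can be dumped to disk one per line) is to stream them through a restartable iterable; a plain generator is not enough, because Word2Vec iterates over the corpus multiple times:

```python
import os
import tempfile

class WalkCorpus(object):
    """Restartable iterable: streams one walk per line from disk,
    so the walks never all sit in RAM at once."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.split()

# Tiny demo file standing in for a large node2vec walk dump.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("n1 n2 n3\nn2 n4\n")

walks = WalkCorpus(path)
first_pass = [w for w in walks]
second_pass = [w for w in walks]  # a bare generator would be exhausted here
os.remove(path)
```

Note also that `min_count=0` keeps every node in the vocabulary; on a large graph that inflates the model matrices and makes the memory problem worse.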

How to do keyword mapping in pandas

≯℡__Kan透↙ · submitted on 2019-12-10 12:05:15

Question: I have the keywords India, Japan, United States, Germany, and China. Here's a sample dataframe:

```
id  Address
1   Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan
2   Arcisstraße 21, 80333 München, Germany
3   Liberty Street, Manhattan, New York, United States
4   30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
5   Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India
```

My goal is to make:

```
id  Address                                             India  Japan  United States  Germany  China
1   Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan  0      1      0              0        0
2
```
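One common approach (a sketch, assuming a plain substring match per keyword is acceptable) is to build one indicator column per keyword with `str.contains` and cast the boolean result to int:

```python
import pandas as pd

keywords = ["India", "Japan", "United States", "Germany", "China"]

df = pd.DataFrame({
    "id": [1, 2],
    "Address": [
        "Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan",
        "Arcisstraße 21, 80333 München, Germany",
    ],
})

# One 0/1 column per keyword; regex=False treats the keyword literally,
# so multi-word keys like "United States" need no escaping.
for kw in keywords:
    df[kw] = df["Address"].str.contains(kw, regex=False).astype(int)
```

For case-insensitive or word-boundary matching you would switch to a regex pattern instead of the literal `regex=False` match.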

How to use pretrained Word2Vec model in Tensorflow

让人想犯罪 __ · submitted on 2019-12-10 04:01:52

Question: I have a Word2Vec model which was trained in gensim. How can I use it in TensorFlow for word embeddings? I don't want to train embeddings from scratch in TensorFlow. Can someone tell me how to do it, with some example code?

Answer 1: Let's assume you have a dictionary and an inverse_dict list, with the list index corresponding to the most common words:

```python
vocab = {'hello': 0, 'world': 2, 'neural': 1, 'networks': 3}
inv_dict = ['hello', 'neural', 'world', 'networks']
```

Notice how the inverse_dict index
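Continuing that setup, the embedding matrix is assembled row by row in `inv_dict` order, so that row i holds the vector for word id i; `tf.nn.embedding_lookup(embedding_matrix, ids)` is then just row indexing, which the NumPy below mimics (the 2-d vectors are stand-ins for a real gensim model's word vectors):

```python
import numpy as np

vocab = {'hello': 0, 'world': 2, 'neural': 1, 'networks': 3}
inv_dict = ['hello', 'neural', 'world', 'networks']

# Stand-in for a trained gensim model's vectors (model.wv in gensim).
w2v = {'hello': [1.0, 0.0], 'neural': [0.0, 1.0],
       'world': [1.0, 1.0], 'networks': [0.5, 0.5]}

# Row i of the matrix is the vector of the word whose id is i,
# which is exactly the ordering inv_dict encodes.
embedding_matrix = np.array([w2v[word] for word in inv_dict])

ids = [vocab['hello'], vocab['world']]   # word ids for a sentence
looked_up = embedding_matrix[ids]        # what embedding_lookup returns
```

In TensorFlow the only extra step is wrapping `embedding_matrix` in a variable or constant before the lookup, so the pre-trained vectors are never retrained from scratch.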