gensim

Gensim: “C extension not loaded, training will be slow.”

Submitted by 情到浓时终转凉 on 2019-12-12 14:45:22
Question: I am running gensim on SUSE Linux. My Python program starts, but on startup I get: "C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training." GCC is installed. Does anyone know what I have to do?

Answer 1: Try the following:

Python 3.x

    $ pip3 uninstall gensim
    $ apt-get install python3-dev build-essential
    $ pip3 install --upgrade gensim

Python 2.x

    $ pip uninstall gensim
    $ apt-get install python-dev build-essential
    $ pip install --upgrade gensim
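After reinstalling, a quick way to confirm the optimized routines are actually in use is to check gensim's FAST_VERSION flag. This is a minimal sketch, assuming a gensim version (3.x or later) that still exposes the flag:

    # Minimal check that gensim's compiled (Cython) training routines loaded.
    # FAST_VERSION is -1 when only the slow pure-Python fallback is available.
    from gensim.models.word2vec import FAST_VERSION

    print(FAST_VERSION)  # 0 or a positive value means fast training is enabled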

Issues in doc2vec tags in Gensim

Submitted by 大城市里の小女人 on 2019-12-12 04:46:56
Question: I am using gensim doc2vec as below:

    from gensim.models import doc2vec
    from collections import namedtuple
    import re

    my_d = {'recipe__001__1': 'recipe 1 details should come here',
            'recipe__001__2': 'Ingredients of recipe 2 need to be added'}

    docs = []
    analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
    for key, value in my_d.items():
        value = re.sub("[^a-zA-Z]", " ", value)
        words = value.lower().split()
        tags = key
        docs.append(analyzedDocument(words, tags))

    model = doc2vec.Doc2Vec(docs
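A common problem with code shaped like this is that tags is a plain string, so gensim treats every character of the key as a separate tag. The sketch below is an assumption about the intended fix rather than an accepted answer: it wraps each tag in a list and uses gensim's own TaggedDocument class; the vector_size, min_count, and epochs values are placeholders.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    import re

    my_d = {'recipe__001__1': 'recipe 1 details should come here',
            'recipe__001__2': 'Ingredients of recipe 2 need to be added'}

    docs = [
        TaggedDocument(words=re.sub("[^a-zA-Z]", " ", value).lower().split(),
                       tags=[key])          # tags must be a list, not a bare string
        for key, value in my_d.items()
    ]

    model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
    print(model.dv['recipe__001__1'])       # document vector for that tag (gensim 4.x API)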

How to find most similar terms/words of a document in doc2vec? [duplicate]

Submitted by 爱⌒轻易说出口 on 2019-12-12 04:08:49
Question: This question already has answers here: How to interpret Clusters results after using Doc2vec? (3 answers). Closed 2 years ago.

I have applied Doc2vec to convert documents into vectors. After that, I used the vectors for clustering and found the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster. My question is: is there any way to figure out the most dominant terms for each of these documents?
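One way to do this directly with doc2vec is to ask for the vocabulary words whose vectors lie closest to a document's vector. The sketch below assumes a gensim 4.x model trained in a mode where word and document vectors share the same space (the default PV-DM mode does; plain PV-DBOW without word training does not); model and tag are placeholders for a trained model and one of its document tags.

    # List the vocabulary words whose vectors are closest to one document's vector.
    doc_vector = model.dv[tag]                       # model.docvecs[tag] in gensim 3.x
    similar_words = model.wv.similar_by_vector(doc_vector, topn=10)
    for word, score in similar_words:
        print(word, score)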

Memory efficient LDA training using gensim library

Submitted by 社会主义新天地 on 2019-12-12 02:58:02
Question: Today I started writing a script that trains LDA models on large corpora (at least 30M sentences) using the gensim library. Here is the current code I am using:

    import logging
    from gensim import corpora, models, similarities, matutils

    def train_model(fname):
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                            level=logging.INFO)
        dictionary = corpora.Dictionary(line.lower().split() for line in open(fname))
        print "DOC2BOW"
        corpus = [dictionary.doc2bow(line.lower().split()) for line
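Building corpus as an in-memory list is what usually exhausts memory at this scale. A common gensim pattern, sketched below as an assumption about how the script could be restructured (class, function, and parameter names are placeholders), is to stream the corpus from disk with a class that re-reads the file on every pass, so only one document is in memory at a time.

    from gensim import corpora, models

    class StreamedCorpus(object):
        """Yield one bag-of-words document per line, without holding the corpus in RAM."""
        def __init__(self, fname, dictionary):
            self.fname = fname
            self.dictionary = dictionary

        def __iter__(self):
            with open(self.fname) as f:
                for line in f:
                    yield self.dictionary.doc2bow(line.lower().split())

    def train_model(fname, num_topics=100):
        dictionary = corpora.Dictionary(line.lower().split() for line in open(fname))
        dictionary.filter_extremes(no_below=5, no_above=0.5)   # optional pruning to save memory
        corpus = StreamedCorpus(fname, dictionary)
        return models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)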

How to get cython and gensim to work with pyspark

Submitted by …衆ロ難τιáo~ on 2019-12-12 02:15:34
Question: I'm running a Lubuntu 16.04 machine with gcc installed. I can't get gensim to work with Cython: when I train a doc2vec model, it is only ever trained with one worker, which is dreadfully slow. As I said, gcc was installed from the start. I may have made the mistake of installing gensim before Cython; I corrected that by forcing a reinstall of gensim via pip, but with no effect, still just one worker. The machine is set up as a Spark master and I interface with Spark via pyspark.
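A useful first diagnostic in a setup like this, sketched below, is to run a few checks from inside the pyspark shell itself, since pyspark may be using a different Python interpreter (and therefore a different gensim install) than the one that was reinstalled.

    import sys
    import gensim
    from gensim.models.word2vec import FAST_VERSION

    print(sys.executable)       # the interpreter pyspark is actually running
    print(gensim.__version__)   # the gensim version that interpreter sees
    print(FAST_VERSION)         # -1 means the compiled routines are missing and
                                # training falls back to the slow pure-Python path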

How to Cluster words and phrases with pre-trained model on Gensim

Submitted by 你。 on 2019-12-11 19:45:26
Question: What I want is to cluster words and phrases, e.g. knitting / knit loom / loom knitting / weaving loom / rainbow loom / home decoration accessories / loom knit / knitting loom / ... I don't have a corpus; I only have the words/phrases. Could I use a pre-trained model, like the one from GoogleNews/Wikipedia/..., to achieve this? I am now trying to use Gensim to load the GoogleNews pre-trained model to get phrase similarity. I've been told that the GoogleNews model includes vectors of both phrases and words.
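A minimal sketch of that approach is below, assuming the GoogleNews binary has been downloaded locally. Phrases in that model are keyed with underscores (e.g. loom_knitting), and there is no guarantee that every phrase from your list is in its vocabulary, so out-of-vocabulary entries have to be filtered out before clustering.

    from gensim.models import KeyedVectors

    # Load the pre-trained GoogleNews vectors (the file is several GB).
    kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    phrases = ['knitting', 'loom_knitting', 'weaving_loom', 'rainbow_loom']
    present = [p for p in phrases if p in kv]          # keep only in-vocabulary entries
    vectors = [kv[p] for p in present]                 # these vectors can be fed to any clusterer
    print(kv.similarity('knitting', 'loom_knitting'))  # pairwise similarity check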

gensim.LDAMulticore throwing exception:

Submitted by 放肆的年华 on 2019-12-11 17:24:31
Question: I am running LdaMulticore from the Python gensim library, and the script cannot seem to create more than one worker. Here is the error:

    Traceback (most recent call last):
      File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib64/python2.7/multiprocessing/pool.py", line 97, in worker
        initializer(*initargs)
      File "/usr/lib64/python2
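For reference, this is roughly the minimal LdaMulticore call shape; the num_topics and workers values are placeholders, not taken from the question. The workers parameter counts worker processes spawned in addition to the main process, so workers=3 suits a 4-core machine.

    from gensim.models import LdaMulticore

    # `corpus` is any iterable of bag-of-words documents and `dictionary` a gensim Dictionary.
    lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=20, workers=3)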

Issues in Gensim WordRank Embeddings

Submitted by 混江龙づ霸主 on 2019-12-11 16:59:37
Question: I am using the Gensim wrapper to obtain WordRank embeddings (I am following their tutorial) as follows:

    from gensim.models.wrappers import Wordrank

    model = Wordrank.train(wr_path="models", corpus_file="proc_brown_corp.txt", out_name="wr_model")
    model.save("wordrank")
    model.save_word2vec_format("wordrank_in_word2vec.vec")

However, I am getting the following error: FileNotFoundError: [WinError 2] The system cannot find the file specified. I am just wondering what I have done wrong.
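The wrapper shells out to the compiled WordRank executables, so [WinError 2] usually means that the directory passed as wr_path (which should be the WordRank installation directory) or the corpus file cannot be found from the current working directory. A small diagnostic sketch, reusing the paths from the question, is below.

    import os

    wr_path = "models"                  # should point at the WordRank installation directory
    corpus_file = "proc_brown_corp.txt"

    print(os.path.isdir(wr_path),
          os.listdir(wr_path) if os.path.isdir(wr_path) else None)
    print(os.path.isfile(corpus_file))  # the corpus file must exist relative to the CWD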

TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()

Submitted by 十年热恋 on 2019-12-11 15:26:32
Question: There is a dataframe like this:

    index  terms
    1345   ['jays', 'place', 'great', 'subway']
    1543   ['described', 'communicative', 'friendly']
    9874   ['great', 'sarahs', 'apartament', 'back']
    2456   ['great', 'sarahs', 'apartament', 'back']

I try to create a dictionary from the corpus of comments['terms'], but I get an error message:

    from gensim import corpora, models
    dictionary = corpora.Dictionary(comments['terms'])

    TypeError: doc2bow expects an array of unicode tokens on input, not a single string
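A common cause of this error, though not the only possible one, is that the 'terms' column holds the string "['jays', 'place', ...]" rather than an actual Python list (typical after reading the frame back from CSV), so Dictionary sees each row as one long string. A minimal sketch of that fix, assuming comments is the pandas DataFrame from the question:

    import ast
    from gensim import corpora

    # Parse string representations of lists back into real lists before building the dictionary.
    comments['terms'] = comments['terms'].apply(
        lambda t: ast.literal_eval(t) if isinstance(t, str) else t)

    dictionary = corpora.Dictionary(comments['terms'])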

How to get similar words related to one word?

Submitted by 橙三吉。 on 2019-12-11 15:03:19
Question: I am trying to solve an NLP problem where I have a dict of words like:

    list_1 = {'phone':'android','chair':'netflit','charger':'macbook','laptop','sony'}

Now if the input is 'phone', I can easily use the 'in' operator to get the description of phone and its data by key, but the problem is when the input is something like 'phones' or 'Phones'. I want, if I input 'phone', to match words like:

    'phone' ==> 'Phones', 'phones', 'Phone', "Phone's", "phone's"

I don't know which word2vec model I can use, or which NLP module can handle this.
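Matching these surface variants is less a word2vec task than a normalization one. Below is a minimal sketch, assuming NLTK is installed and its WordNet data has been downloaded, that lowercases, strips a possessive suffix, and lemmatizes the query before the dictionary lookup; the dict used here is a cleaned-up, hypothetical version of the one in the question.

    from nltk.stem import WordNetLemmatizer   # requires: nltk.download('wordnet')

    descriptions = {'phone': 'android', 'chair': 'netflit', 'charger': 'macbook'}
    lemmatizer = WordNetLemmatizer()

    def lookup(query):
        q = query.lower()
        if q.endswith("'s"):                  # "Phone's" -> "phone"
            q = q[:-2]
        q = lemmatizer.lemmatize(q)           # "phones" -> "phone"
        return descriptions.get(q)

    print(lookup("Phones"))    # -> 'android'
    print(lookup("phone's"))   # -> 'android'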