gensim

UnicodeDecodeError error when loading word2vec

夙愿已清 submitted on 2020-01-24 15:11:04
Question: Full Description I am starting to work with word embeddings and have found a great amount of information about them. I understand, so far, that I can train my own word vectors or use previously trained ones, such as Google's or Wikipedia's, which are available for English and aren't useful to me, since I am working with texts in Brazilian Portuguese. Therefore, I went on a hunt for pre-trained word vectors in Portuguese and ended up finding Hirosan's List of Pretrained Word

how to add tokens to gensim dictionary

爷,独闯天下 submitted on 2020-01-23 11:52:15
Question: I use gensim to build a dictionary from a collection of documents. Each document is a list of tokens. This is my code:
    def constructModel(self, docTokens):
        """ Given document tokens, constructs the tf-idf and similarity models"""
        #construct dictionary for the BOW (vector-space) model : Dictionary = a mapping between words and their integer ids = collection of (word_index,word_string) pairs
        #print "dictionary"
        self.dictionary = corpora.Dictionary(docTokens)
        # prune dictionary: remove words that

Using Gensim shows “Slow version of gensim.models.doc2vec being used”

荒凉一梦 submitted on 2020-01-23 08:27:10
Question: I am trying to run a program that uses the Gensim library with Python 3.6. Whenever I run the program, I get these messages:
    C:\Python36\lib\site-packages\gensim-2.0.0-py3.6-win32.egg\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
      warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
    Slow version of gensim.models.doc2vec is being used
I do not understand the meaning of Slow version of gensim

Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words'

≯℡__Kan透↙ submitted on 2020-01-17 07:49:08
Question: I am learning the Doc2Vec model from the gensim library and using it as follows:
    class MyTaggedDocument(object):
        def __init__(self, dirname):
            self.dirname = dirname
        def __iter__(self):
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname),encoding='utf-8') as fin:
                    print(fname)
                    for item_no, sentence in enumerate(fin):
                        yield LabeledSentence([w for w in sentence.lower().split() if w in stopwords.words('english')], [fname.split('.')[0].strip() + '_%s' % item_no])
    sentences =

Calculate perplexity of word2vec model

牧云@^-^@ submitted on 2020-01-15 04:51:53
Question: I trained a Gensim W2V model on 500K sentences (around 60K words) and I want to calculate the perplexity. What would be the best way to do so? For 60K words, how can I check what a proper amount of data would be? Thanks
Answer 1: If you want to calculate the perplexity, you first have to retrieve the loss. In the gensim.models.word2vec.Word2Vec constructor, pass the compute_loss=True parameter; this way, gensim will store the loss for you while training. Once trained, you can call the get_latest
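The answer above stops at retrieving the loss. As a minimal sketch of the remaining arithmetic, assuming the stored loss can be treated as a summed negative log-likelihood in nats over a known number of predictions, perplexity is the exponential of the mean loss. That assumption may not hold exactly for the training loss gensim reports, so the result is best read as a rough indicator. The name `perplexity_from_loss` and its parameters are hypothetical and not part of gensim.

```python
import math


def perplexity_from_loss(total_loss, n_predictions):
    """Return exp(mean loss), assuming total_loss is a summed
    negative log-likelihood in nats over n_predictions predictions.

    Hypothetical helper for illustration; not a gensim function.
    """
    if n_predictions <= 0:
        raise ValueError("n_predictions must be positive")
    return math.exp(total_loss / n_predictions)


# Illustrative numbers only, not results from a real model.
example = perplexity_from_loss(total_loss=2500.0, n_predictions=1000)
print(example)
```

Comparing this value across runs is only meaningful when the loss was accumulated the same way in each run, for example over the same number of epochs.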

Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus

烈酒焚心 submitted on 2020-01-14 10:16:06
Question: I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that. It works for me, but what I don't like about the resulting word2vec model is that named entities are split, which makes the model unusable for my specific application. The model I need has to represent named entities as a single vector. That's why I planned to parse the Wikipedia articles with spaCy and merge entities like

how to install gensim on windows 8.1

大城市里の小女人 submitted on 2020-01-07 09:00:52
Question: I just got acquainted with gensim and tried to install it. I followed all the steps on the page https://radimrehurek.com/gensim/install.html but could not install it. I have installed Python 2.7, scipy, and numpy successfully on Windows 8.1 64-bit, but when I run setup.py in gensim it doesn't run. Please help; I need gensim immediately. Please tell me the installation steps in more detail, along with any other software that needs to be installed first. Thanks
Answer 1: Gensim depends on scipy and numpy
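Since the answer names scipy and numpy as dependencies, one way to confirm they are visible to the interpreter before installing gensim could look like the sketch below. It assumes Python 3.4 or later (the question uses Python 2.7, where `importlib.util.find_spec` is not available), and the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found
    by the current interpreter, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both prerequisites are importable.
print(missing_modules(["numpy", "scipy"]))
```

Running the check with the same interpreter that will run gensim matters, because packages installed for one Python installation are not visible to another.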

Unpickling Error while using Word2Vec.load()

爱⌒轻易说出口 submitted on 2020-01-07 03:58:57
Question: I am trying to load a binary file using gensim.Word2Vec.load(fname) but I get the error:
    File "file.py", line 24, in
      model = gensim.models.Word2Vec.load('ammendment_vectors.model.bin')
    File "/home/hp/anaconda3/lib/python3.6/site-packages/gensim/models/word2vec.py", line 1396, in load
      model = super(Word2Vec, cls).load(*args, **kwargs)
    File "/home/hp/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 271, in load
      obj = unpickle(fname)
    File "/home/hp/anaconda3/lib/python3.6/site

IndexError while using Gensim package for LDA Topic Modelling

馋奶兔 submitted on 2020-01-06 04:05:39
Question: I have a total of 54892 documents, which have 360331 unique tokens. The length of the dictionary is 88.
    mm = corpora.MmCorpus('PRC.mm')
    dictionary = corpora.Dictionary('PRC.dict')
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=dictionary, num_topics=50, update_every=0, chunksize=19188, passes=650)
Whenever I run this script I get this error:
    Traceback (most recent call last):
      File "C:\Users\modelDeTopics.py", line 19, in <module>
        lda = gensim.models.ldamodel.LdaModel(corpus=mm,

IndexError while using Gensim package for LDA Topic Modelling

拥有回忆 submitted on 2020-01-06 04:04:06
Question: I have a total of 54892 documents, which have 360331 unique tokens. The length of the dictionary is 88.
    mm = corpora.MmCorpus('PRC.mm')
    dictionary = corpora.Dictionary('PRC.dict')
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=dictionary, num_topics=50, update_every=0, chunksize=19188, passes=650)
Whenever I run this script I get this error:
    Traceback (most recent call last):
      File "C:\Users\modelDeTopics.py", line 19, in <module>
        lda = gensim.models.ldamodel.LdaModel(corpus=mm,