gensim

Understanding parameters in Gensim LDA Model

久未见 submitted on 2020-03-18 05:32:04
Question: I am using gensim.models.ldamodel.LdaModel to perform LDA, but I do not understand some of the parameters and cannot find explanations in the documentation. If someone has experience working with this, I would love further details on what these parameters signify. Specifically, I do not understand: random_state, update_every, chunksize, passes, alpha, and per_word_topics. I am working with a corpus of 500 documents of roughly 3-5 pages each (I unfortunately cannot share a snapshot of the …
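
A sketch annotating those parameters with their documented meanings; the toy corpus and the particular values below are illustrative, not from the post:

from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

docs = [["human", "interface", "computer"],
        ["survey", "user", "computer", "system"],
        ["graph", "trees", "minors"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    random_state=42,       # seed, so repeated runs produce reproducible topics
    update_every=1,        # how often the model is updated during online
                           # training (0 switches to pure batch learning)
    chunksize=100,         # number of documents held in memory per training chunk
    passes=10,             # number of full passes over the whole corpus
    alpha='auto',          # prior on document-topic density; 'auto' learns an
                           # asymmetric prior from the corpus itself
    per_word_topics=True,  # also compute the most likely topics for each word
)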

Word2Vec: Effect of window size used

自古美人都是妖i submitted on 2020-03-17 06:05:40
Question: I am trying to train a word2vec model on very short phrases (5-grams). Since each sentence or example is very short, I believe the window size I can use is at most 2. I am trying to understand the implications of such a small window size for the quality of the learned model, so that I can tell whether my model has learned something meaningful. I tried training a word2vec model on 5-grams, but it appears the learned model does not capture semantics very well. I am …
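
For illustration, a hedged sketch of what such a training run looks like (toy data; parameter names follow gensim 4.x, where older versions used size and iter):

from gensim.models import Word2Vec

# Toy 5-token "sentences": with examples this short, window=2 already covers
# nearly the whole phrase on either side of each center word.
phrases = [["the", "quick", "brown", "fox", "jumps"],
           ["a", "fast", "brown", "dog", "runs"]] * 50

model = Word2Vec(
    phrases,
    vector_size=100,  # dimensionality of the learned word vectors
    window=2,         # max distance between center word and context word
    min_count=1,      # keep every token in this toy corpus
    sg=1,             # skip-gram, often preferred for small or sparse data
    epochs=50,        # extra passes can help when there is little data
)
print(model.wv.most_similar("brown", topn=3))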

[NLP] [6] doc2vec in gensim

爷，独闯天下 submitted on 2020-03-15 02:45:40
[1] Overview: doc2vec represents a sentence, paragraph, or whole document as a vector, which makes it convenient to compute similarities between sentences, paragraphs, and documents.

[2] Usage

1. Corpus preparation

def read_corpus(fname, tokens_only=False):
    with open(fname, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags:
                # for doc2vec, gensim expects the corpus as TaggedDocument objects,
                # each pairing the raw text (a sentence, paragraph, or document)
                # with a corresponding id (sentence/paragraph/document id) as its tag
                yield gensim.models.doc2vec.TaggedDocument(
                    gensim.utils.simple_preprocess(line), [i])

2. Model training

Method 1:

def train_doc2vec2():
    train_file = "E:/nlp_data/in_the_name_of_people/in_the_name_of_people.txt"
    train_corpus = list(read_corpus(train_file))
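
The post stops after building train_corpus; a hedged sketch of how training typically proceeds from there, following gensim's doc2vec tutorial (hyperparameter values are illustrative; vector_size and model.dv are the gensim 4.x names):

import gensim

# Assumes read_corpus() and train_corpus from the snippet above.
model = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(train_corpus)   # one scan over the tagged corpus to build the vocabulary
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen piece of text (a token list, not a raw string)
# and rank the most similar training documents by tag:
vec = model.infer_vector(["some", "new", "tokens"])
print(model.dv.most_similar([vec], topn=5))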

How to break the Word2vec training from a callback function?

老子叫甜甜 submitted on 2020-03-03 09:07:24
Question: I am training a skip-gram model using gensim word2vec. I would like to exit training before reaching the number of epochs passed in the parameters, based on an accuracy test against a separate dataset, in order to avoid overfitting the model. Is there a way in gensim to interrupt word2vec training from a callback function? Answer 1: If more training in fact makes your Word2Vec model worse on some external evaluation, there is likely something else wrong with your setup. (For …
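
A sketch of the workaround the question is after, not an official gensim stop mechanism: gensim exposes epoch-level hooks via CallbackAny2Vec, and an exception raised inside one propagates out of train() and halts it. The evaluation function and all hyperparameters below are placeholders you would replace:

from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class EarlyStop(Exception):
    pass

class StopOnTarget(CallbackAny2Vec):
    """Interrupt training once an external score reaches a target."""
    def __init__(self, evaluate, target):
        self.evaluate = evaluate  # user-supplied scoring function
        self.target = target

    def on_epoch_end(self, model):
        if self.evaluate(model) >= self.target:
            raise EarlyStop

def my_eval(model):
    return 0.0  # placeholder: accuracy test on a held-out dataset

sentences = [["hello", "world"], ["gensim", "word2vec"]] * 50
model = Word2Vec(vector_size=20, min_count=1, sg=1)
model.build_vocab(sentences)
try:
    model.train(sentences, total_examples=model.corpus_count, epochs=20,
                callbacks=[StopOnTarget(my_eval, 0.9)])
except EarlyStop:
    pass  # the model keeps the weights it had when training was interrupted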

gensim word2vec

喜你入骨 submitted on 2020-02-28 03:26:20
The official demo file is fairly large; you can download it with Thunder (Xunlei) or a cloud drive and then put it in this folder: C:\Users\Ace\gensim-data\word2vec-google-news-300. Loading it is CPU-intensive: it is a 1.62 GB model file that strains even my 16 GB of RAM, alas... the GPU goes unused. Link: https://pan.baidu.com/s/1qEoMqJDBOMYXDPHq7hsDMQ Extraction code: mj5j Source: oschina Link: https://my.oschina.net/ahaoboy/blog/3166440
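
A sketch of loading these vectors through gensim's built-in downloader (the most_similar query is illustrative):

import gensim.downloader as api

# First use downloads ~1.6 GB into ~/gensim-data (C:\Users\<name>\gensim-data
# on Windows); a copy placed there by hand, as suggested above, should be
# picked up instead of being re-downloaded.
wv = api.load("word2vec-google-news-300")  # returns a KeyedVectors instance
print(wv.most_similar("king", topn=3))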

Get bigrams and trigrams in word2vec Gensim

不想你离开。 submitted on 2020-02-26 07:23:54
Question: I am currently using uni-grams in my word2vec model as follows.

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words
    #
    # Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    return sentences
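
The standard answer is to transform the token streams with gensim's Phrases model before training word2vec, applying it twice to get trigrams; a sketch with toy sentences and deliberately permissive settings:

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [["new", "york", "is", "a", "big", "city"],
             ["i", "love", "new", "york"]] * 20

# First pass joins frequent pairs ("new_york"); a second pass over the
# bigrammed text can then join pairs-of-pairs into trigrams. With toy
# settings this permissive almost every frequent pair joins; the real
# defaults are min_count=5, threshold=10.
bigram = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
trigram = Phraser(Phrases(bigram[sentences], min_count=1, threshold=0.1))
phrased = [trigram[bigram[s]] for s in sentences]
print(phrased[0])  # tokens joined with "_" wherever a pair scored above threshold

model = Word2Vec(phrased, vector_size=50, min_count=1)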

LDA Mallet CalledProcessError

痴心易碎 submitted on 2020-02-24 10:14:39
Question: I am trying to implement the following code:

import os
os.environ.update({'MALLET_HOME': r'c:/mallet-2.0.8/'})
mallet_path = 'C:\\mallet-2.0.8\\bin\\mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow, num_topics=20, id2word=dictionary)

However, I keep getting this error:

CalledProcessError: Command 'C:\mallet-2.0.8\bin\mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\Joshua\AppData\Local\Temp\98094d_corpus.txt …
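
Not a confirmed fix for this exact traceback, but common causes with this wrapper are an inconsistent MALLET_HOME (the post mixes c:/mallet-2.0.8/ with backslash paths), spaces in paths, or Java missing from PATH. A sketch with the two paths made consistent (bow and dictionary as in the post's code); note also that gensim.models.wrappers was removed in gensim 4.0, so LdaMallet requires gensim 3.x:

import os
import gensim

os.environ['MALLET_HOME'] = r'C:\mallet-2.0.8'   # no trailing separator
mallet_path = r'C:\mallet-2.0.8\bin\mallet'      # same root as MALLET_HOME

ldamallet = gensim.models.wrappers.LdaMallet(
    mallet_path, corpus=bow, num_topics=20, id2word=dictionary)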

Calculating topic distribution of an unseen document on GenSim

雨燕双飞 submitted on 2020-02-22 08:48:48
Question: I am trying to use the LDA module of GenSim for the following task: "Train an LDA model on one big document, keeping track of 10 latent topics. Given a new, unseen document, predict the probability distribution over the 10 latent topics." As per the tutorial here: http://radimrehurek.com/gensim/tut2.html, this seems possible for a document in the corpus, but I am wondering whether it would also be possible for an unseen document. Thank you! Answer 1: From the documentation you posted it looks like you can train your model …
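
A minimal sketch on toy data: passing a new document's bag-of-words vector to a trained LdaModel (via indexing or get_document_topics) returns its topic distribution:

from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

train_docs = [["topic", "model", "inference", "text"],
              ["graph", "trees", "minors", "survey"]]
dictionary = Dictionary(train_docs)
corpus = [dictionary.doc2bow(d) for d in train_docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=10, passes=5)

# Unseen document: convert it with the SAME dictionary; tokens never seen
# during training are simply dropped from the bag-of-words vector.
unseen = dictionary.doc2bow(["new", "text", "about", "topic", "inference"])
print(lda[unseen])                       # [(topic_id, probability), ...]
print(lda.get_document_topics(unseen))   # the equivalent explicit call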