mallet

CalledProcessError: Returned non-zero exit status 1

北城余情 submitted on 2021-02-04 08:27:28
Question: When I try to run:

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod1[doc] for doc in texts]

# Remove Stop Words
data_words_nostops1 = remove_stopwords(data_words1)
# Form Bigrams
data_words_bigrams1 = make_bigrams(data_words_nostops1)
# Create Dictionary
id2word1 = corpora.Dictionary(data_words_bigrams1)
# Create Corpus
texts1 = data_words_bigrams1
# Term Document
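For context, a minimal sketch of how this pipeline typically continues before gensim's Mallet wrapper is invoked, reusing id2word1 and texts1 from the snippet above; the mallet_path below is an assumption, and since LdaMallet shells out to the Mallet binary, a CalledProcessError with exit status 1 usually means Java or that path is misconfigured:

import os
from gensim.models.wrappers import LdaMallet  # gensim 3.x; removed in 4.x

# Term-document matrix: one bag-of-words vector per document
corpus1 = [id2word1.doc2bow(text) for text in texts1]

# Hypothetical install location -- point this at the real Mallet launcher
mallet_path = '/opt/mallet-2.0.8/bin/mallet'
assert os.path.exists(mallet_path), 'Mallet launcher not found'

# LdaMallet runs Mallet as a subprocess; a missing Java runtime or a
# wrong launcher path is the usual cause of a non-zero exit status
ldamallet1 = LdaMallet(mallet_path, corpus=corpus1,
                       num_topics=20, id2word=id2word1)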

Mallet DMR negative probability for feature-based topic distribution?

删除回忆录丶 submitted on 2021-01-28 05:09:37
Question: I've created a DMR topic model (via the Java API) which calculates the topic distribution based on the publication year of the documents. The resulting distribution is a bit confusing, because there are a lot of negative probabilities. Sometimes all probabilities for a whole topic are negative values. Q1: Why are there negative values? The lowest possible probability for a topic distribution given a feature should be at least 0.0, I guess? Additionally, I built an LDA model where the
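A hedged note, not taken from the question: Mallet's DMR model parameterizes per-feature topic preferences in log space, which would make negative values expected rather than broken; the sketch below shows how such log-weights could be normalized into a proper distribution (the lambda values are invented for illustration):

import math

# Hypothetical per-topic DMR parameters (lambda) for one feature value,
# e.g. publication year 1990 -- log-space values, so negatives are normal
log_weights = [-2.31, -0.47, -3.05, -1.12]

# Softmax: exponentiate and normalize to get probabilities in [0, 1]
exps = [math.exp(w) for w in log_weights]
total = sum(exps)
topic_dist = [e / total for e in exps]

print(topic_dist)       # every entry non-negative
print(sum(topic_dist))  # sums to 1.0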

How to understand the output of Topic Model class in Mallet?

可紊 submitted on 2020-02-26 06:36:22
Question: As I try out the example code from the topic modeling developer's guide, I really want to understand the meaning of the output of that code. First, during the running process, it gives out:

Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353

0   0,5   battle union confederate tennessee american states
1   0,5   hawes sunderland echo war paper commonwealth
2   0,5   test
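One pointer for reading those lines (an interpretation, not part of the question): LL/token is the model log-likelihood per token, printed here with a comma decimal separator by a non-English locale, and it should climb toward zero as the Gibbs sampler converges; a common conversion is perplexity = exp(-LL/token):

import math

# LL/token values printed by Mallet every 10 Gibbs iterations (from above)
ll_per_token = [-9.24097, -9.1026, -8.95386, -8.75353]

# Perplexity = exp(-LL/token); it should fall as the model fits better
for iteration, ll in zip(range(10, 50, 10), ll_per_token):
    print(f'<{iteration}> perplexity: {math.exp(-ll):.1f}')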

LDA Mallet CalledProcessError

痴心易碎 submitted on 2020-02-24 10:14:39
Question: I am trying to implement the following code:

import os
os.environ.update({'MALLET_HOME': r'c:/mallet-2.0.8/'})
mallet_path = 'C:\\mallet-2.0.8\\bin\\mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow,
                                             num_topics=20, id2word=dictionary)

However, I keep getting this error:

CalledProcessError: Command 'C:\mallet-2.0.8\bin\mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\Joshua\AppData\Local\Temp\98094d_corpus.txt
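A hedged diagnostic sketch, not from the question: the wrapper launches Mallet as a subprocess, so on Windows the usual culprits are a missing Java runtime and a launcher path that doesn't exist; the checks below assume only the paths already shown above:

import os
import shutil
import subprocess

# Mallet is a Java program, so the wrapper needs java on PATH
assert shutil.which('java') is not None, 'Java runtime not found on PATH'

# The path handed to LdaMallet must point at the launcher itself;
# on Windows the executable script is mallet.bat
mallet_path = 'C:\\mallet-2.0.8\\bin\\mallet'
assert os.path.exists(mallet_path), f'launcher not found: {mallet_path}'

# MALLET_HOME should name the install root
os.environ['MALLET_HOME'] = r'C:\mallet-2.0.8'

# Running the launcher directly prints Mallet's command list if the setup
# is sound, and surfaces the real error that CalledProcessError hides
subprocess.run([mallet_path + '.bat'])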

Using topic modeling Java toolkit

这一生的挚爱 submitted on 2020-01-17 09:05:11
Question: I'm working on text classification and I want to use topic models (LDA). My corpus consists of at least 24,000 Persian news documents. Each doc in the corpus is in the format of (keyword, weight) pairs extracted from the news. I saw two Java toolkits: Mallet and LingPipe. I've read the Mallet tutorial on importing data, and it takes data as plain text, not the format that I have. Is there any way I could convert it? I also read a little about LingPipe; the example from its tutorial was using
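One possible workaround, offered as a sketch rather than anything the toolkits document for this exact case: Mallet's import-file expects one instance per line (name, label, and text separated by tabs), so weighted keyword pairs can be approximated by repeating each keyword in proportion to its weight:

# Hypothetical input: documents as (keyword, weight) pairs
docs = {
    'doc1': [('economy', 0.6), ('election', 0.3), ('sports', 0.1)],
    'doc2': [('football', 0.8), ('league', 0.2)],
}

SCALE = 10  # repetitions per unit of weight; tune for coarseness

with open('corpus.txt', 'w', encoding='utf-8') as out:
    for name, pairs in docs.items():
        # Repeat each keyword so its term frequency roughly tracks its weight
        tokens = []
        for keyword, weight in pairs:
            tokens.extend([keyword] * max(1, round(weight * SCALE)))
        # Mallet import-file line format: name<TAB>label<TAB>text
        out.write(f"{name}\tnews\t{' '.join(tokens)}\n")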

Mallet basic usage. First steps

孤者浪人 submitted on 2020-01-04 07:55:27
Question: I'm trying to use Mallet with literally no experience in topic modeling. My goal is to get N topics from the M documents I have right now, label every document with one or more topics (doc 1 = topic 1; doc 2 = topic 2 and possibly topic 3), and use those results to classify new documents in the future. I first tried BigARTM for this, but found nothing for classification in that program, only topic modeling. So, on to Mallet: I created a corpus.txt file with the following format: Doc.num. \t
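A hedged sketch of the standard Mallet command-line workflow for exactly this goal; the file names are placeholders, but the subcommands and flags are standard Mallet 2.x, driven here from Python for consistency with the rest of this page:

import subprocess

mallet = 'bin/mallet'  # path to the Mallet launcher; adjust as needed

# 1. Import the tab-separated corpus into Mallet's binary format
subprocess.run([mallet, 'import-file',
                '--input', 'corpus.txt',
                '--output', 'corpus.mallet',
                '--keep-sequence', '--remove-stopwords'], check=True)

# 2. Train N topics and save an inferencer for future documents
subprocess.run([mallet, 'train-topics',
                '--input', 'corpus.mallet',
                '--num-topics', '20',
                '--output-doc-topics', 'doc-topics.txt',   # topic mix per doc
                '--output-topic-keys', 'topic-keys.txt',   # top words per topic
                '--inferencer-filename', 'inferencer.mallet'], check=True)

# 3. Later: import new documents through the SAME pipe, then infer topics
subprocess.run([mallet, 'import-file',
                '--input', 'new_docs.txt',
                '--output', 'new_docs.mallet',
                '--keep-sequence', '--remove-stopwords',
                '--use-pipe-from', 'corpus.mallet'], check=True)
subprocess.run([mallet, 'infer-topics',
                '--inferencer', 'inferencer.mallet',
                '--input', 'new_docs.mallet',
                '--output-doc-topics', 'new-doc-topics.txt'], check=True)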

LDA: Why sampling for inference of a new document?

风流意气都作罢 submitted on 2020-01-04 06:03:50
Question: Given a standard LDA model with a few thousand topics and a few million documents, trained with Mallet's collapsed Gibbs sampler: when inferring a new document, why not just skip sampling and simply use the term-topic counts of the model to determine the topic assignments of the new document? I understand that applying Gibbs sampling to the new document takes into account the topic mixture of the new document, which in turn influences how topics are composed (beta, term-freq. distributions)
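For reference, the collapsed Gibbs update makes the trade-off explicit; the naive shortcut would keep only the second factor and drop the document-specific first one (standard LDA notation, not taken from the question):

% Probability of assigning topic k to token i (word w) of new document d.
% n_{d,k}: topic counts within the document itself, updated while sampling;
% n_{k,w}, n_k: global term-topic and topic counts from the trained model.
P(z_i = k \mid \mathbf{z}_{-i}, w) \;\propto\;
  \left( n_{d,k}^{-i} + \alpha \right)
  \cdot \frac{n_{k,w} + \beta}{n_k + V\beta}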
