mallet

CalledProcessError: Returned non-zero exit status 1

北城余情 submitted on 2021-02-04 08:27:28
Question: When I try to run:

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod1[doc] for doc in texts]

# Remove Stop Words
data_words_nostops1 = remove_stopwords(data_words1)
# Form Bigrams
data_words_bigrams1 = make_bigrams(data_words_nostops1)
# Create Dictionary
id2word1 = corpora.Dictionary(data_words_bigrams1)
# Create Corpus
texts1 = data_words_bigrams1
# Term Document
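For context, a minimal sketch of how this pipeline typically continues before gensim's Mallet wrapper is invoked, reusing id2word1 and texts1 from the snippet above; the mallet_path below is an assumption, and since LdaMallet shells out to the Mallet binary, a CalledProcessError with exit status 1 usually means Java or that path is misconfigured:

import os
from gensim.models.wrappers import LdaMallet  # gensim 3.x; removed in 4.x

# Term-document matrix: one bag-of-words vector per document
corpus1 = [id2word1.doc2bow(text) for text in texts1]

# Hypothetical install location -- point this at the real Mallet launcher
mallet_path = '/opt/mallet-2.0.8/bin/mallet'
assert os.path.exists(mallet_path), 'Mallet launcher not found'

# LdaMallet runs Mallet as a subprocess; a missing Java runtime or a
# wrong launcher path is the usual cause of a non-zero exit status
ldamallet1 = LdaMallet(mallet_path, corpus=corpus1,
                       num_topics=20, id2word=id2word1)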

Mallet DMR negative probability for feature-based topic distribution?

删除回忆录丶 submitted on 2021-01-28 05:09:37
Question: I've created a DMR topic model (via the Java API) which calculates the topic distribution based on the publication year of the documents. The resulting distribution is a bit confusing, because there are a lot of negative probabilities. Sometimes all probabilities for a whole topic are negative values. Q1: Why are there negative values? The lowest possible probability for a topic distribution given a feature should be at least 0.0, I guess? Additionally, I built an LDA model where the
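A hedged note, not taken from the question: Mallet's DMR model parameterizes per-feature topic preferences in log space, which would make negative values expected rather than broken; the sketch below shows how such log-weights could be normalized into a proper distribution (the lambda values are invented for illustration):

import math

# Hypothetical per-topic DMR parameters (lambda) for one feature value,
# e.g. publication year 1990 -- log-space values, so negatives are normal
log_weights = [-2.31, -0.47, -3.05, -1.12]

# Softmax: exponentiate and normalize to get probabilities in [0, 1]
exps = [math.exp(w) for w in log_weights]
total = sum(exps)
topic_dist = [e / total for e in exps]

print(topic_dist)       # every entry non-negative
print(sum(topic_dist))  # sums to 1.0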

How to understand the output of Topic Model class in Mallet?

可紊 submitted on 2020-02-26 06:36:22
Question: As I try out the example code from the topic modeling developer's guide, I really want to understand the meaning of the output of that code. First, during the running process, it gives out:

Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353

0   0,5   battle union confederate tennessee american states
1   0,5   hawes sunderland echo war paper commonwealth
2   0,5   test
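One pointer for reading those lines (an interpretation, not part of the question): LL/token is the model log-likelihood per token, printed here with a comma decimal separator by a non-English locale, and it should climb toward zero as the Gibbs sampler converges; a common conversion is perplexity = exp(-LL/token):

import math

# LL/token values printed by Mallet every 10 Gibbs iterations (from above)
ll_per_token = [-9.24097, -9.1026, -8.95386, -8.75353]

# Perplexity = exp(-LL/token); it should fall as the model fits better
for iteration, ll in zip(range(10, 50, 10), ll_per_token):
    print(f'<{iteration}> perplexity: {math.exp(-ll):.1f}')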

LDA Mallet CalledProcessError

痴心易碎 submitted on 2020-02-24 10:14:39
Question: I am trying to implement the following code:

import os
os.environ.update({'MALLET_HOME': r'c:/mallet-2.0.8/'})
mallet_path = 'C:\\mallet-2.0.8\\bin\\mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow,
                                             num_topics=20, id2word=dictionary)

However, I keep getting this error:

CalledProcessError: Command 'C:\mallet-2.0.8\bin\mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\Joshua\AppData\Local\Temp\98094d_corpus.txt
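A hedged diagnostic sketch, not from the question: the wrapper launches Mallet as a subprocess, so on Windows the usual culprits are a missing Java runtime and a launcher path that doesn't exist; the checks below assume only the paths already shown above:

import os
import shutil
import subprocess

# Mallet is a Java program, so the wrapper needs java on PATH
assert shutil.which('java') is not None, 'Java runtime not found on PATH'

# The path handed to LdaMallet must point at the launcher itself;
# on Windows the executable script is mallet.bat
mallet_path = 'C:\\mallet-2.0.8\\bin\\mallet'
assert os.path.exists(mallet_path), f'launcher not found: {mallet_path}'

# MALLET_HOME should name the install root
os.environ['MALLET_HOME'] = r'C:\mallet-2.0.8'

# Running the launcher directly prints Mallet's command list if the setup
# is sound, and surfaces the real error that CalledProcessError hides
subprocess.run([mallet_path + '.bat'])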

Using topic modeling Java toolkit

这一生的挚爱 submitted on 2020-01-17 09:05:11
Question: I'm working on text classification and I want to use topic models (LDA). My corpus consists of at least 24,000 Persian news documents. Each doc in the corpus is in the format of (keyword, weight) pairs extracted from the news. I saw two Java toolkits: Mallet and LingPipe. I've read the Mallet tutorial on importing data, and it takes data as plain text, not the format that I have. Is there any way I could convert it? I also read a little about LingPipe; the example from its tutorial was using
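One possible workaround, offered as a sketch rather than anything the toolkits document for this exact case: Mallet's import-file expects one instance per line (name, label, and text separated by tabs), so weighted keyword pairs can be approximated by repeating each keyword in proportion to its weight:

# Hypothetical input: documents as (keyword, weight) pairs
docs = {
    'doc1': [('economy', 0.6), ('election', 0.3), ('sports', 0.1)],
    'doc2': [('football', 0.8), ('league', 0.2)],
}

SCALE = 10  # repetitions per unit of weight; tune for coarseness

with open('corpus.txt', 'w', encoding='utf-8') as out:
    for name, pairs in docs.items():
        # Repeat each keyword so its term frequency roughly tracks its weight
        tokens = []
        for keyword, weight in pairs:
            tokens.extend([keyword] * max(1, round(weight * SCALE)))
        # Mallet import-file line format: name<TAB>label<TAB>text
        out.write(f"{name}\tnews\t{' '.join(tokens)}\n")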

Mallet basic usage. First steps

孤者浪人 submitted on 2020-01-04 07:55:27
Question: I'm trying to use Mallet with literally no experience in topic modeling. My goal is to get N topics from the M documents I have right now, label every document with one or more topics (doc 1 = topic 1; doc 2 = topic 2 and possibly topic 3), and use those results to classify new documents in the future. I first tried BigARTM for this, but found nothing for classification in that program, only topic modeling. So, on to Mallet: I created a corpus.txt file with the following format: Doc.num. \t
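A hedged sketch of the standard Mallet command-line workflow for exactly this goal; the file names are placeholders, but the subcommands and flags are standard Mallet 2.x, driven here from Python for consistency with the rest of this page:

import subprocess

mallet = 'bin/mallet'  # path to the Mallet launcher; adjust as needed

# 1. Import the tab-separated corpus into Mallet's binary format
subprocess.run([mallet, 'import-file',
                '--input', 'corpus.txt',
                '--output', 'corpus.mallet',
                '--keep-sequence', '--remove-stopwords'], check=True)

# 2. Train N topics and save an inferencer for future documents
subprocess.run([mallet, 'train-topics',
                '--input', 'corpus.mallet',
                '--num-topics', '20',
                '--output-doc-topics', 'doc-topics.txt',   # topic mix per doc
                '--output-topic-keys', 'topic-keys.txt',   # top words per topic
                '--inferencer-filename', 'inferencer.mallet'], check=True)

# 3. Later: import new documents through the SAME pipe, then infer topics
subprocess.run([mallet, 'import-file',
                '--input', 'new_docs.txt',
                '--output', 'new_docs.mallet',
                '--keep-sequence', '--remove-stopwords',
                '--use-pipe-from', 'corpus.mallet'], check=True)
subprocess.run([mallet, 'infer-topics',
                '--inferencer', 'inferencer.mallet',
                '--input', 'new_docs.mallet',
                '--output-doc-topics', 'new-doc-topics.txt'], check=True)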

LDA: Why sampling for inference of a new document?

风流意气都作罢 submitted on 2020-01-04 06:03:50
Question: Given a standard LDA model with a few thousand topics and a few million documents, trained with Mallet's collapsed Gibbs sampler: when inferring a new document, why not just skip sampling and simply use the term-topic counts of the model to determine the topic assignments of the new document? I understand that applying Gibbs sampling to the new document takes into account the topic mixture of the new document, which in turn influences how topics are composed (beta, term-freq. distributions)
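For reference, the collapsed Gibbs update makes the trade-off explicit; the naive shortcut would keep only the second factor and drop the document-specific first one (standard LDA notation, not taken from the question):

% Probability of assigning topic k to token i (word w) of new document d.
% n_{d,k}: topic counts within the document itself, updated while sampling;
% n_{k,w}, n_k: global term-topic and topic counts from the trained model.
P(z_i = k \mid \mathbf{z}_{-i}, w) \;\propto\;
  \left( n_{d,k}^{-i} + \alpha \right)
  \cdot \frac{n_{k,w} + \beta}{n_k + V\beta}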
