gensim

Python NLP British English vs American English

Submitted by 我的梦境 on 2021-02-04 13:48:26
Question: I'm currently working on NLP in Python. However, my corpus contains both British and American English spellings (realize/realise), and I'm thinking of converting the British forms to American. However, I did not find a good tool/package to do that. Any suggestions?

Answer 1: I've not been able to find a package either, but try this: (Note that I've had to trim the us2gb dictionary substantially for it to fit within the Stack Overflow character limit - you'll have to rebuild this yourself).

# Based on Shengy's code: #
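The answer's actual code and dictionary are cut off above; as a minimal sketch of the dictionary-based approach it describes (the gb2us mapping and to_american name are illustrative, and a real mapping would need thousands of entries):

import re

# tiny illustrative mapping; the answer's real dictionary is far larger
gb2us = {'realise': 'realize', 'colour': 'color', 'analyse': 'analyze'}

def to_american(text):
    # match whole words only; inflected forms ('colours') need their own entries
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, gb2us)) + r')\b')
    return pattern.sub(lambda m: gb2us[m.group(1)], text)

print(to_american('they realise the colour is off'))
# -> they realize the color is off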

CalledProcessError: Returned non-zero exit status 1

Submitted by 北城余情 on 2021-02-04 08:27:28
Question: When I try to run:

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod1[doc] for doc in texts]

# Remove Stop Words
data_words_nostops1 = remove_stopwords(data_words1)
# Form Bigrams
data_words_bigrams1 = make_bigrams(data_words_nostops1)
# Create Dictionary
id2word1 = corpora.Dictionary(data_words_bigrams1)
# Create Corpus
texts1 = data_words_bigrams1
# Term Document
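The excerpt is cut off before the error itself, but for context, a self-contained version of that pipeline might look like the sketch below; the imports, stop_words, data_words1, and the Phrases-based bigram_mod1 are assumptions here, not the poster's actual definitions:

from gensim import corpora
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
data_words1 = [['human', 'machine', 'interface'], ['survey', 'of', 'user', 'computer']]

# bigram_mod1 detects frequent word pairs and joins them with '_'
bigram_mod1 = Phraser(Phrases(data_words1, min_count=1, threshold=1))

def remove_stopwords(texts):
    return [[w for w in simple_preprocess(str(doc)) if w not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod1[doc] for doc in texts]

data_words_nostops1 = remove_stopwords(data_words1)
data_words_bigrams1 = make_bigrams(data_words_nostops1)
id2word1 = corpora.Dictionary(data_words_bigrams1)
corpus1 = [id2word1.doc2bow(text) for text in data_words_bigrams1]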

Implementing topic modeling and the LDA algorithm in Python (with links)

Submitted by 半腔热情 on 2021-02-02 08:29:46
Topic modeling is a statistical model for discovering the abstract "topics" that occur in a collection of documents. LDA (Latent Dirichlet Allocation) is one example of a topic model and is used to classify the text in a document under specific topics. The LDA algorithm builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions. Let's get started!

Data

The dataset used here is a list of more than one million news headlines published over 15 years, which can be downloaded from Kaggle. A first look at the data shows 1,048,575 rows (Figure 1).

Data preprocessing

Perform the following steps:

Tokenization — split the text into sentences and the sentences into words; lowercase the words and strip punctuation.
Remove words with fewer than 3 characters.
Remove all stopwords.
Lemmatization — change words in the third person to the first person, and verbs in the past and future tenses to the present tense.
Stemming — reduce words to their root form.

Load the gensim and nltk libraries:

[nltk_data] Downloading package wordnet to
[nltk_data]   C:\Users\SusanLi\AppData\Roaming\nltk_data…
[nltk_data]   Package wordnet is already up-to-date!
True

Write a function that performs lemmatization and stemming preprocessing on the dataset, then pick a document to preview after preprocessing.

Original document: [‘rain’, ‘helps’, ‘dampen’, ‘bushfires’
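A minimal sketch of that lemmatize-and-stem preprocessing function, assuming the gensim/nltk combination the text describes (the names lemmatize_stemming and preprocess are illustrative):

import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

nltk.download('wordnet')
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # lemmatize verbs to the present tense first, then reduce to the root form
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # tokenize and lowercase, drop stopwords and words shorter than 3 characters
    return [lemmatize_stemming(token) for token in simple_preprocess(text)
            if token not in STOPWORDS and len(token) >= 3]

print(preprocess('rain helps dampen bushfires'))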

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Submitted by 半世苍凉 on 2021-01-29 16:38:43
Question: Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to deploy my LDA model in a web application, and if there were a way to use matrix multiplication to get a similar result, I could use the model in JavaScript. For example, I tried the following:

import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import
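As a rough illustration of the matrix route: gensim's LdaModel.get_topics() exposes the topic-word matrix, and a single matrix product against a document's bag-of-words counts gives an approximate topic affinity. This is only an approximation — lda_model[bow] runs iterative variational inference, which one multiplication cannot reproduce exactly. A sketch, assuming lda_model, dictionary, and unseen_tokens already exist:

import numpy as np

topics = lda_model.get_topics()      # shape (num_topics, vocab_size); rows sum to 1

bow = np.zeros(topics.shape[1])      # dense bag-of-words counts for the unseen doc
for token_id, count in dictionary.doc2bow(unseen_tokens):
    bow[token_id] = count

scores = topics @ bow                # how strongly each topic explains the counts
approx_dist = scores / scores.sum()  # normalize into a distribution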

Keep punctuation and casing in gensim wikicorpus text

Submitted by 孤人 on 2021-01-29 15:33:55
Question: I have a Wikipedia dump as an xml.bz2 file and want to convert it to txt for later processing with BERT. The goal is to have each sentence on a new line and an empty line between articles (requirements of BERT training). I tried to follow this post (How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?) and did a lot of research of my own. This is what I have so far:

from __future__ import print_function
import sys
from gensim.corpora import WikiCorpus
from
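One way to keep punctuation and casing, assuming a reasonably recent gensim (3.6+) where WikiCorpus accepts the tokenizer_func and lower parameters — a sketch, not the poster's final script:

from gensim.corpora import WikiCorpus

def raw_tokenize(content, token_min_len=1, token_max_len=1024, lower=False):
    # whitespace split only, so punctuation and casing survive untouched
    return [t for t in content.split() if token_min_len <= len(t) <= token_max_len]

wiki = WikiCorpus('dump.xml.bz2', tokenizer_func=raw_tokenize, lower=False)

with open('wiki.txt', 'w', encoding='utf-8') as out:
    for tokens in wiki.get_texts():
        out.write(' '.join(tokens) + '\n\n')  # empty line separates articles

Putting each sentence on its own line would still need a sentence tokenizer (e.g. nltk's sent_tokenize) applied to each article's text before writing.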

Python connect composed keywords in texts

Submitted by 狂风中的少年 on 2021-01-29 10:11:38
Question: So, I have a lowercase keyword list. Let's say:

keywords = ['machine learning', 'data science', 'artificial intelligence']

and a list of texts in lowercase. Let's say:

texts = [
    'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
    'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed
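A minimal sketch of the usual approach: replace each multi-word keyword with an underscore-joined token before tokenization (connect_keywords is an illustrative name):

keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = ['the new machine learning model built by google is revolutionary '
         'for the current state of artificial intelligence']

def connect_keywords(text, keywords):
    # longest keywords first, so a short phrase cannot break a longer one containing it
    for kw in sorted(keywords, key=len, reverse=True):
        text = text.replace(kw, kw.replace(' ', '_'))
    return text

connected = [connect_keywords(t, keywords) for t in texts]
# -> ['the new machine_learning model ... state of artificial_intelligence']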

How to tune the parameters for gensim `LdaMulticore` in Python

Submitted by 夙愿已清 on 2021-01-29 08:24:41
Question: I was running gensim's LdaMulticore package for topic modelling in Python. I tried to understand the meaning of the parameters of LdaMulticore and found a website that explains their usage. As a non-expert, I have some difficulty understanding them intuitively. I also consulted other materials, but I think this page gives relatively full explanations of every parameter. From that page: chunksize — Number of documents to be used in
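For reference, the commonly tuned parameters look like this in practice; the values below are illustrative starting points, not recommendations, and corpus and dictionary are assumed to exist:

from gensim.models import LdaMulticore

lda = LdaMulticore(
    corpus=corpus,        # bag-of-words corpus
    id2word=dictionary,
    num_topics=10,
    chunksize=2000,       # documents held in memory per training chunk
    passes=10,            # full sweeps over the whole corpus
    iterations=100,       # max inference iterations per document
    eval_every=None,      # disable perplexity evaluation (it is slow)
    workers=3,            # worker processes; cores - 1 is a common choice
)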

How to use cossim in gensim

Submitted by 大兔子大兔子 on 2021-01-29 08:07:15
Question: My question is about cossim usage. I have this fragment of a very big function:

for elem in lList:
    temp = []
    try:
        x = dict(np.ndenumerate(np.asarray(model[elem])))
    except KeyError:
        # note: the original tested `x not in embedDict.keys()`, but `x` may be
        # undefined here; the intended membership check is on the key `elem`
        if elem not in embedDict:
            x = np.random.uniform(low=0.0, high=1.0, size=300)
            embedDict[elem] = x
        else:
            x = dict(np.ndenumerate(np.asarray(embedDict[elem])))
    for w in ListWords:
        try:
            y = dict(np.ndenumerate(np.asarray(model[w])))
        except KeyError:
            if w not in embedDict:
                y = np.random.uniform(low=0.0, high=1.0, size=300)
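For the cossim call itself: gensim.matutils.cossim expects sparse vectors, i.e. lists of (index, value) pairs, so dense embeddings need converting first. A minimal sketch (dense_to_sparse is an illustrative helper, not part of gensim):

import numpy as np
from gensim import matutils

def dense_to_sparse(vec):
    # cossim wants (index, value) pairs rather than a dense array
    return [(i, float(v)) for i, v in enumerate(vec)]

a = np.random.uniform(low=0.0, high=1.0, size=300)
b = np.random.uniform(low=0.0, high=1.0, size=300)

sim = matutils.cossim(dense_to_sparse(a), dense_to_sparse(b))
print(sim)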

Python: Gensim Memory Error

Submitted by 大城市里の小女人 on 2021-01-29 01:37:04
Question:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import corpora, models, similarities
from nltk.corpus import stopwords
import codecs

documents = []
with codecs.open("Master_File_for_Docs.txt", encoding='utf-8', mode="r") as fid:
    for line in fid:
        documents.append(line)

stoplist = []
x = stopwords.words('english')
for word in x:
    stoplist.append(word)

# Removes Stopwords
texts = [[word for word in document.lower().split()
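The usual fix for gensim memory errors is to stream documents rather than holding the whole file (and every intermediate list) in RAM. A sketch of that pattern, assuming the same input file; StreamedDocs is an illustrative name:

from gensim import corpora
from nltk.corpus import stopwords
import codecs

stoplist = set(stopwords.words('english'))  # set lookup is also far faster than a list

class StreamedDocs:
    # yields one tokenized document at a time instead of materializing them all
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with codecs.open(self.path, encoding='utf-8', mode='r') as fid:
            for line in fid:
                yield [w for w in line.lower().split() if w not in stoplist]

dictionary = corpora.Dictionary(StreamedDocs('Master_File_for_Docs.txt'))
corpus = (dictionary.doc2bow(doc) for doc in StreamedDocs('Master_File_for_Docs.txt'))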

Different models with gensim Word2Vec on python

Submitted by 我们两清 on 2021-01-28 14:02:40
Question: I am trying to apply the word2vec model implemented in the gensim library in Python. I have a list of sentences (each sentence is a list of words). For instance, let us have:

sentences = [['first', 'second', 'third', 'fourth']] * n

and I build two identical models:

model = gensim.models.Word2Vec(sentences, min_count=1, size=2)
model2 = gensim.models.Word2Vec(sentences, min_count=1, size=2)

I realize that the models are sometimes the same and sometimes different, depending on the value of n. For
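Word2Vec training is randomized (random vector initialization plus multi-threaded scheduling), so identical calls need not produce identical vectors. A sketch of how to make runs reproducible, using the question's gensim 3.x API (size was renamed vector_size in gensim 4):

import gensim

n = 100
sentences = [['first', 'second', 'third', 'fourth']] * n

# a fixed seed plus a single worker removes both sources of randomness;
# PYTHONHASHSEED may also need pinning, since the default hash function
# used to seed each word's vector depends on it
model = gensim.models.Word2Vec(sentences, min_count=1, size=2, seed=1, workers=1)
model2 = gensim.models.Word2Vec(sentences, min_count=1, size=2, seed=1, workers=1)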