nltk

Pandas NLTK tokenizing “unhashable type: 'list'”

Submitted by 我们两清 on 2021-02-05 07:55:10
Question: Following this example: Twitter data mining with Python and Gephi: Case synthetic biology. I read a CSV into a DataFrame df with two columns, 'Country' (Italy, Italy, France, Germany, ...) and 'Responses' ("Lorem ipsum...", ...). I want to: 1. tokenize the text in 'Responses'; 2. remove the 100 most common words (based on the Brown corpus); 3. identify the remaining 100 most frequent words. I can get through steps 1 and 2, but get an error on step 3: TypeError: unhashable type: 'list' I believe it's because
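The usual cause of this error is calling FreqDist (or set) on a Series whose cells are lists of tokens, rather than on one flat list of tokens. A minimal sketch, assuming the column layout from the excerpt (the sample data is invented; nltk.download('brown') and nltk.download('punkt') are required):

    import pandas as pd
    from nltk import word_tokenize, FreqDist
    from nltk.corpus import brown

    # Hypothetical frame mirroring the question's layout.
    df = pd.DataFrame({
        'Country': ['Italy', 'Italy', 'France', 'Germany'],
        'Responses': ['Lorem ipsum dolor sit amet'] * 4,
    })

    # Step 1: tokenize each response (every cell becomes a list of tokens).
    df['tokens'] = df['Responses'].str.lower().apply(word_tokenize)

    # Step 2: the 100 most common words in the Brown corpus.
    common = {w for w, _ in FreqDist(w.lower() for w in brown.words()).most_common(100)}

    # Step 3: flatten the per-row token lists before counting -- passing a
    # Series of lists straight to FreqDist/set is what raises
    # TypeError: unhashable type: 'list'.
    flat = [t for tokens in df['tokens'] for t in tokens if t not in common]
    print(FreqDist(flat).most_common(100))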

Removing specific word in a string in pandas

Submitted by 空扰寡人 on 2021-02-05 07:53:28
Question: I'm trying to remove several words in each value of a column, but nothing is happening. stop_words = ["and","lang","naman","the","sa","ko","na", "yan","n","yang","mo","ung","ang","ako","ng", "ndi","pag","ba","on","un","Me","at","to", "is","sia","kaya","I","s","sla","dun","po","b","pro"] newdata['Verbatim'] = newdata['Verbatim'].replace(stop_words, '', inplace=True) I'm trying to generate a word cloud from the result of the replacement but I am getting the same words (that doesn't mean
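Two things go wrong here: Series.replace matches whole cell values rather than words inside a string, and inplace=True makes it return None, which the assignment then stores in the column. A sketch of one fix, using a regex with word boundaries (sample data invented, stop_words trimmed for brevity):

    import re
    import pandas as pd

    stop_words = ["and", "lang", "naman", "the", "sa", "ko", "na"]
    newdata = pd.DataFrame({'Verbatim': ["and the lang ng test", "naman po sample"]})

    # Build one pattern that matches any stop word as a whole word.
    pattern = r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b'

    # str.replace works inside each string; assign the result back instead
    # of using inplace=True (which returns None).
    newdata['Verbatim'] = (newdata['Verbatim']
                           .str.replace(pattern, '', regex=True)
                           .str.split().str.join(' '))   # collapse leftover spaces
    print(newdata)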

Python 3: .remove returns None when I try to remove a variable from a list

Submitted by 生来就可爱ヽ(ⅴ<●) on 2021-02-05 05:52:45
Question: I am working on a part of a program that turns a statement into a question. When I try to remove x it returns None; I want it to print the sentence with that item removed. What is it I'm doing wrong? def Ask(Question): Auxiliary = ("will","might","would","do","were","are","did") for x in Auxiliary: if x in Question: Question_l = Question.lower() Question_tk_l = word_tokenize(Question) Aux_Rem = Question_tk_l.remove(x) print (Aux_Rem) Example of the behaviour wanted: "what we are doing in the
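list.remove() mutates the list in place and returns None, so printing its return value always prints None. A sketch of the likely intent, printing the list after the removal (assuming word_tokenize comes from nltk and punkt is downloaded):

    from nltk import word_tokenize

    def ask(question):
        auxiliaries = ("will", "might", "would", "do", "were", "are", "did")
        tokens = word_tokenize(question.lower())
        for aux in auxiliaries:
            if aux in tokens:
                tokens.remove(aux)        # mutates tokens, returns None
                print(" ".join(tokens))   # print the list, not the return value

    ask("what we are doing in the garden")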

How to restore punctuation using Python? [closed]

Submitted by 半腔热情 on 2021-02-04 21:57:26
Question: I would like to restore commas and full stops in text without punctuation. For example, let's take this sentence: I am XYZ I want to execute I have a doubt And I would like to detect that there should be 1 comma and 1 full stop in the above example: I am XYZ,
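One possible approach (not from the question itself) is a pre-trained punctuation-restoration model; the sketch below assumes the third-party deepmultilingualpunctuation package and its documented PunctuationModel API:

    # pip install deepmultilingualpunctuation
    from deepmultilingualpunctuation import PunctuationModel

    model = PunctuationModel()  # downloads a transformer model on first use
    print(model.restore_punctuation("I am XYZ I want to execute I have a doubt"))
    # Expected output along the lines of:
    # "I am XYZ, I want to execute, I have a doubt."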

Python NLP British English vs American English

Submitted by 我的梦境 on 2021-02-04 13:48:26
Question: I'm currently working on NLP in Python. However, my corpus contains both British and American English (realize/realise). I'm thinking of converting British to American. However, I did not find a good tool/package to do that. Any suggestions? Answer 1: I've not been able to find a package either, but try this: (Note that I've had to trim the us2gb dictionary substantially for it to fit within the Stack Overflow character limit - you'll have to rebuild this yourself). # Based on Shengy's code: #
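The trimmed answer boils down to a dictionary of spelling pairs applied as whole-word replacements. A minimal sketch with an invented three-entry map (a real run would load the full spelling dictionary from a file):

    import re

    gb2us = {"realise": "realize", "colour": "color", "analyse": "analyze"}

    def americanize(text):
        # Replace whole words only, leaving everything else untouched.
        pattern = re.compile(r'\b(' + '|'.join(map(re.escape, gb2us)) + r')\b')
        return pattern.sub(lambda m: gb2us[m.group(1)], text)

    print(americanize("I realise the colour is off"))  # I realize the color is off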

Chaquopy problems with nltk and download

Submitted by 六眼飞鱼酱① on 2021-02-04 08:08:48
Question: According to Chaquopy Not able to download Resource, I'm not sure whether the problem was solved, so here is the question in the nltk context. After including one of the nltk.download lines: nltk.download('popular') or nltk.download('punkt') or nltk.download('all') I get this stack trace: 2020-08-26 13:33:45.742 19765-19765/com.pro.useyournotes E/ExceptionTag: com.chaquo.python.PyException: BadZipFile: File is not a zip file com.chaquo.python.PyException: BadZipFile: File is not a zip file at <python>
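BadZipFile at this point usually means an earlier, interrupted download left a corrupt zip in the nltk_data directory, so every later nltk.download() call trips over it. A sketch of one recovery path (an assumption about the cause, not a confirmed Chaquopy fix; note it deletes the cached NLTK data):

    import os
    import shutil
    import nltk

    # Clear any writable nltk_data directories, then retry the download.
    for path in nltk.data.path:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            shutil.rmtree(path)
    nltk.download('punkt')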

Topic Modeling and the LDA Algorithm in Python (with links)

Submitted by 半腔热情 on 2021-02-02 08:29:46
Topic modeling is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. LDA (Latent Dirichlet Allocation) is one example of a topic model; it assigns the text in each document to particular topics, building a per-document topic mixture and per-topic word distributions, both modeled as Dirichlet distributions. Let's get started!

Data. The dataset used here is a list of over one million news headlines published over fifteen years; it can be downloaded from Kaggle. First, a look at the data: 1048575 rows (Figure 1).

Data preprocessing. Perform the following steps (a sketch of the resulting function follows below):
- Tokenization: split the text into sentences and the sentences into words; lowercase the words and remove punctuation.
- Remove words with fewer than 3 characters.
- Remove all stopwords.
- Lemmatization: change third-person words to first person and verbs in past and future tenses to the present tense.
- Stemming: reduce words to their root form.

Load the gensim and nltk libraries. [nltk_data] Downloading package wordnet to [nltk_data] C:\Users\SusanLi\AppData\Roaming\nltk_data… [nltk_data] Package wordnet is already up-to-date! True

Write a function to perform the lemmatization and stemming preprocessing on the dataset, then pick a document to preview after preprocessing. Original document: ['rain', 'helps', 'dampen', 'bushfires'
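A sketch of the preprocessing function the article describes, combining gensim's tokenizer and stopword list with NLTK's lemmatizer and stemmer (nltk.download('wordnet') is required):

    from gensim.utils import simple_preprocess
    from gensim.parsing.preprocessing import STOPWORDS
    from nltk.stem import WordNetLemmatizer, SnowballStemmer

    stemmer = SnowballStemmer('english')
    lemmatizer = WordNetLemmatizer()

    def lemmatize_stemming(token):
        # Lemmatize as a verb first, then reduce to the root form.
        return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

    def preprocess(text):
        # simple_preprocess lowercases, strips punctuation and tokenizes.
        return [lemmatize_stemming(token)
                for token in simple_preprocess(text)
                if token not in STOPWORDS and len(token) > 3]

    print(preprocess('rain helps dampen bushfires'))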

Unable to detect gibberish names using Python

Submitted by 浪子不回头ぞ on 2021-01-29 10:32:16
Question: I am trying to build a Python model that could classify account names as either legitimate or gibberish. Capitalization is not important in this particular case, as some legitimate account names could be comprised of all upper-case or all lower-case letters. Disclaimer: this is just an internal research experiment and no real action will be taken on the classifier outcome. In my particular case, there are 2 possible characteristics that can reveal an account name as suspicious, gibberish or both:
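One common way to score names like these is character n-gram frequency: bigrams that rarely occur in real words suggest gibberish. A rough sketch, trained on NLTK's English word list as a stand-in for a corpus of known-good account names (an assumption, not the asker's data; requires nltk.download('words')):

    from collections import Counter
    from nltk.corpus import words

    def bigrams(name):
        s = '^' + name.lower() + '$'   # mark word boundaries
        return [s[i:i + 2] for i in range(len(s) - 1)]

    # Count bigrams over a list of real English words.
    counts = Counter(bg for w in words.words() for bg in bigrams(w))
    total = sum(counts.values())

    def score(name):
        # Mean bigram probability; gibberish tends to score near zero.
        bgs = bigrams(name)
        return sum(counts[bg] / total for bg in bgs) / len(bgs)

    print(score('jonathan') > score('xkqzvbnt'))  # True, typically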

Keras Sequential Model - SGD - Neural Network - NLTK

Submitted by 女生的网名这么多〃 on 2021-01-29 09:29:31
Question: I am creating a bot, and here I faced an error after training. I trained using the Keras Sequential model and the SGD optimizer, with an NLTK lemmatizer: lemmatizer = WordNetLemmatizer() words = pickle.load(open("words.pkl", 'rb'))  # reading in binary mode classes = pickle.load(open("classes.pkl", 'rb')) model = load_model('chatbot.model') print(classes) def clean_up_sentence(sentence): sentence_words = nltk.word_tokenize(sentence) sentence_words = [lemmatizer.lemmatize(word) for word in sentence_words] return sentence_words def bag_of_words
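The excerpt cuts off at bag_of_words. In this family of chatbot tutorials the function usually builds a one-hot vector over the training vocabulary; a typical completion (an assumption, not necessarily the asker's exact code) looks like:

    import numpy as np

    def bag_of_words(sentence, words):
        # 1 at position i if vocabulary word i occurs in the cleaned sentence.
        sentence_words = clean_up_sentence(sentence)
        bag = [0] * len(words)
        for sw in sentence_words:
            for i, word in enumerate(words):
                if word == sw:
                    bag[i] = 1
        return np.array(bag)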