nltk

How do I obtain the individual centroids of a K-means clustering using nltk (Python)?

[亡魂溺海] Submitted on 2020-01-25 07:32:05
Question: I have used nltk to perform k-means clustering because I would like to change the distance metric to cosine distance. However, how do I obtain the centroids of all the clusters? kclusterer = KMeansClusterer(8, distance = nltk.cluster.util.cosine_distance, repeats = 1) predict = kclusterer.cluster(features, assign_clusters = True) centroids = kclusterer._centroid df_clustering['cluster'] = predict #df_clustering['centroid'] = centroids[df_clustering['cluster'] - 1].tolist() df_clustering['centroid'
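A minimal sketch of one way to read the centroids back out, assuming NLTK's KMeansClusterer and a small random matrix as a stand-in for the asker's features; the clusterer's means() method returns one centroid vector per cluster:

    import numpy as np
    import nltk
    from nltk.cluster import KMeansClusterer

    # Stand-in feature matrix: 40 samples, 5 dimensions (replace with real features)
    features = np.random.rand(40, 5)

    kclusterer = KMeansClusterer(
        8,
        distance=nltk.cluster.util.cosine_distance,
        repeats=1,
        avoid_empty_clusters=True,  # guards against empty-cluster failures on small data
    )
    labels = kclusterer.cluster(features, assign_clusters=True)

    # means() returns the list of centroid vectors, indexed by cluster label
    centroids = kclusterer.means()
    print(len(centroids))        # 8
    print(centroids[labels[0]])  # centroid of the cluster assigned to the first sample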

Train a language model using Google Ngrams

痞子三分冷 Submitted on 2020-01-23 18:00:08
Question: I want to find the conditional probability of a word given its previous set of words. I plan to use the Google N-grams for this. However, being such a huge resource, I don't think it is computationally feasible on my PC to process all the N-grams in order to train a language model. So is there any way I can train a language model using Google Ngrams? (Even the Python NLTK library does not support an ngram language model anymore) Note - I know that a language model can be trained using ngrams,
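As a side note, newer NLTK releases do ship an ngram language-model module again (nltk.lm). A minimal sketch of training a bigram maximum-likelihood model on a tiny made-up corpus (just to show the API, not Google Ngrams):

    from nltk.lm import MLE
    from nltk.lm.preprocessing import padded_everygram_pipeline

    # Toy corpus: a list of tokenized sentences (stand-in for real training text)
    corpus = [
        ["i", "want", "to", "train", "a", "language", "model"],
        ["i", "want", "to", "compute", "conditional", "probabilities"],
    ]

    n = 2  # bigram model
    train_data, vocab = padded_everygram_pipeline(n, corpus)

    lm = MLE(n)
    lm.fit(train_data, vocab)

    # Conditional probability P("want" | "i")
    print(lm.score("want", ["i"]))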

Python: Find a list of words in a text and return its index

蓝咒 Submitted on 2020-01-23 15:15:11
Question: I have to process a plain-text document, looking for a list of words and returning a text window around each word found. I'm using NLTK. I found posts on Stack Overflow where regular expressions are used to find the words, but without getting their indices, just printing them. I don't think regular expressions are the right tool here, because I have to find specific words. Answer 1: This is what you are looking for: You can either use str.index or str.find: Contents of file: Lorem ipsum dolor sit amet, consectetur adipiscing elit
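A minimal sketch of the idea in that answer, using str.find in a loop to get each occurrence's character index and slice a fixed-size window around it (the word list, window size, and sample text below are made up for illustration):

    text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
            "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.")
    word_list = ["dolor", "tempor"]
    window = 20  # characters of context on each side

    for word in word_list:
        start = text.find(word)  # index of the first occurrence, or -1 if absent
        while start != -1:
            lo = max(0, start - window)
            hi = start + len(word) + window
            print(word, start, repr(text[lo:hi]))
            # note: plain substring search also matches "dolore"; word-boundary
            # handling is left out of this sketch
            start = text.find(word, start + 1)  # next occurrence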

NLTK was unable to find the gs file

谁说胖子不能爱 Submitted on 2020-01-22 10:38:25
Question: I'm trying to use NLTK, the Stanford natural language toolkit. After installing the required files, I started executing the demo code: http://www.nltk.org/index.html >>> import nltk >>> sentence = """At eight o'clock on Thursday morning ... Arthur didn't feel very good.""" >>> tokens = nltk.word_tokenize(sentence) >>> tokens ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] >>> tagged = nltk.pos_tag(tokens) >>> tagged[0:6] [('At', 'IN'),
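For context, the "unable to find the gs file" message usually appears when NLTK tries to render a parse or chunk tree as an image through Ghostscript (the gs executable), for example when a notebook displays the tree graphically. A minimal sketch that stays with text output, which avoids that rendering path (it assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources have been downloaded):

    import nltk

    sentence = """At eight o'clock on Thursday morning Arthur didn't feel very good."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    entities = nltk.ne_chunk(tagged)

    # print() shows the chunk tree as plain text, so Ghostscript is never invoked;
    # graphical display of the tree is what goes looking for the gs binary.
    print(entities)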

Machine Learning Basics: Hands-on Text Classification with the Naive Bayes Model

风格不统一 Submitted on 2020-01-22 09:26:00
This article was first published on my personal public account: TechFlow. In the previous article we introduced the basic principles of the Naive Bayes model. The core idea of Naive Bayes is to assume that the variables in a sample follow some distribution, and then use conditional probability to compute the probability that the sample belongs to each class. A sample usually carries many features, and those features are very likely correlated with one another; to simplify the model, Naive Bayes assumes the variables are independent, which makes computing a sample's probability straightforward. Readers who want to review the details can follow the link back to the earlier article: Machine Learning Basics: Learn the Naive Bayes Model in One Article.

When studying an algorithm, reading only the principles and theory always feels a little shallow. Often the reasoning sounds perfectly convincing, yet the moment we try to implement it ourselves we are at a loss, or we manage to cobble something together but keep running into unexpected problems along the way. Part of the cause is too little hands-on practice, and part is that our understanding is not deep enough. In this article we implement the model ourselves, run it on a real dataset, and look at how it actually performs.

Naive Bayes and Text Classification

Generally speaking, an event in the narrow sense has a finite set of outcomes; that is, the outcome should be a discrete value rather than a continuous one. So early Bayesian models, before the idea of Gaussian mixture models was introduced, also targeted samples with discrete values (this is unverified and is the author's own conjecture). Setting continuous features aside for now, let us look at the practical uses of the Naive Bayes model on discrete samples. Among the many application scenarios of machine learning, there is one very classic scenario
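To make the independence assumption above concrete, here is a tiny illustrative sketch (the word counts and class priors are made up, not taken from the article) that scores a document by adding the log class prior to the per-word log conditional probabilities:

    import math

    # Made-up per-class word counts and priors, purely for illustration
    word_counts = {
        "sports": {"ball": 30, "team": 25, "win": 20},
        "tech":   {"code": 40, "model": 30, "data": 35},
    }
    class_priors = {"sports": 0.5, "tech": 0.5}
    vocab_size = len({w for counts in word_counts.values() for w in counts})

    def log_posterior(doc_tokens, cls, alpha=1.0):
        """log P(cls) + sum of log P(word | cls), with Laplace smoothing."""
        counts = word_counts[cls]
        total = sum(counts.values())
        score = math.log(class_priors[cls])
        for w in doc_tokens:
            score += math.log((counts.get(w, 0) + alpha) / (total + alpha * vocab_size))
        return score

    doc = ["model", "data", "win"]
    print(max(word_counts, key=lambda cls: log_posterior(doc, cls)))  # "tech" here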

How to use word_tokenize on a data frame

只谈情不闲聊 Submitted on 2020-01-22 04:38:05
Question: I have recently started using the nltk module for text analysis. I am stuck at one point: I want to use word_tokenize on a dataframe, so as to obtain all the words used in a particular row of the dataframe. data example: text 1. This is a very good site. I will recommend it to others. 2. Can you please give me a call at 9983938428. have issues with the listings. 3. good work! keep it up 4. not a very helpful site in finding home decor. expected output: 1. 'This','is','a','very','good','site','.
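A minimal sketch of one way to do this with pandas, applying nltk.word_tokenize to every row of a text column (the column name and data frame mirror the example above; it assumes the punkt tokenizer data has been downloaded):

    import pandas as pd
    import nltk

    df = pd.DataFrame({
        "text": [
            "This is a very good site. I will recommend it to others.",
            "Can you please give me a call at 9983938428. have issues with the listings.",
            "good work! keep it up",
            "not a very helpful site in finding home decor.",
        ]
    })

    # apply() runs word_tokenize on each cell, producing a list of tokens per row
    df["tokens"] = df["text"].apply(nltk.word_tokenize)
    print(df["tokens"].iloc[0])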

Determine if text is in English?

假如想象 Submitted on 2020-01-20 04:37:39
Question: I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true: [ "this is some text written in English", "this is some more text written in English", "Ce n'est pas en anglais" ] For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. However, is there a good way to do this? I have been Googling, but
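A minimal sketch of one crude heuristic: count what fraction of a sentence's tokens are English stopwords and drop sentences below a threshold (the 0.15 threshold is arbitrary, and the stopwords and punkt resources are assumed to be downloaded). A dedicated language-identification library would be more robust, but this stays inside NLTK:

    import nltk
    from nltk.corpus import stopwords

    ENGLISH_STOPWORDS = set(stopwords.words("english"))

    def looks_english(text, threshold=0.15):
        # Heuristic: fraction of alphabetic tokens that are English stopwords
        tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
        if not tokens:
            return False
        hits = sum(1 for t in tokens if t in ENGLISH_STOPWORDS)
        return hits / len(tokens) >= threshold

    docs = [
        "this is some text written in English",
        "this is some more text written in English",
        "Ce n'est pas en anglais",
    ]
    print([d for d in docs if looks_english(d)])  # drops the French sentence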

Using NLTK

馋奶兔 Submitted on 2020-01-18 21:39:55
To install nltk, see: http://www.cnblogs.com/kylinsblog/p/7755843.html
NLTK is a powerful third-party Python library that makes many natural language processing (NLP) tasks easy, including tokenization, part-of-speech tagging, named entity recognition (NER), and syntactic parsing. The following shows how to use NLTK to quickly complete basic NLP tasks.

1. Tokenization with NLTK

Functions used: nltk.sent_tokenize(text) # split the text into sentences; nltk.word_tokenize(sent) # tokenize a sentence into words

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    print('nlp2 test')
    import nltk

    text = 'PathonTip.com is a very good website. We can learn a lot from it.'
    # split the text into a list of sentences
    sens = nltk.sent_tokenize(text)
    print(sens)
    # NLTK tokenization works at the sentence level, so split into sentences first
    # and then tokenize each one; otherwise the results will be poor
    words = []
    for sent in sens:
        words.append(nltk.word_tokenize(sent))
    print(words)

Output:

2. Part-of-speech tagging with NLTK
Functions used:
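The part-of-speech section is cut off above; a minimal sketch of what nltk.pos_tag looks like on the same example sentence (assuming the averaged_perceptron_tagger resource has been downloaded):

    import nltk

    text = 'PathonTip.com is a very good website. We can learn a lot from it.'
    words = nltk.word_tokenize(nltk.sent_tokenize(text)[0])

    # pos_tag returns (token, tag) pairs using the Penn Treebank tag set
    print(nltk.pos_tag(words))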