nltk

How do I obtain the individual centroids of a K-means clustering using nltk (Python)?

[亡魂溺海] Submitted on 2020-01-25 07:32:05
Question: I have used nltk to perform k-means clustering because I would like to change the distance metric to cosine distance. However, how do I obtain the centroids of all the clusters? kclusterer = KMeansClusterer(8, distance = nltk.cluster.util.cosine_distance, repeats = 1) predict = kclusterer.cluster(features, assign_clusters = True) centroids = kclusterer._centroid df_clustering['cluster'] = predict #df_clustering['centroid'] = centroids[df_clustering['cluster'] - 1].tolist() df_clustering['centroid'
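A minimal sketch of one way to read the centroids back out, assuming NLTK's KMeansClusterer and a small random matrix as a stand-in for the asker's features; the clusterer's means() method returns one centroid vector per cluster:

    import numpy as np
    import nltk
    from nltk.cluster import KMeansClusterer

    # Stand-in feature matrix: 40 samples, 5 dimensions (replace with real features)
    features = np.random.rand(40, 5)

    kclusterer = KMeansClusterer(
        8,
        distance=nltk.cluster.util.cosine_distance,
        repeats=1,
        avoid_empty_clusters=True,  # guards against empty-cluster failures on small data
    )
    labels = kclusterer.cluster(features, assign_clusters=True)

    # means() returns the list of centroid vectors, indexed by cluster label
    centroids = kclusterer.means()
    print(len(centroids))        # 8
    print(centroids[labels[0]])  # centroid of the cluster assigned to the first sample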

Train a language model using Google Ngrams

痞子三分冷 Submitted on 2020-01-23 18:00:08
Question: I want to find the conditional probability of a word given its previous set of words. I plan to use the Google N-grams for this. However, being such a huge resource, I don't think it is computationally feasible on my PC to process all the N-grams in order to train a language model. So is there any way I can train a language model using Google Ngrams? (Even the Python NLTK library does not support an ngram language model anymore) Note - I know that a language model can be trained using ngrams,
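As a side note, newer NLTK releases do ship an ngram language-model module again (nltk.lm). A minimal sketch of training a bigram maximum-likelihood model on a tiny made-up corpus (just to show the API, not Google Ngrams):

    from nltk.lm import MLE
    from nltk.lm.preprocessing import padded_everygram_pipeline

    # Toy corpus: a list of tokenized sentences (stand-in for real training text)
    corpus = [
        ["i", "want", "to", "train", "a", "language", "model"],
        ["i", "want", "to", "compute", "conditional", "probabilities"],
    ]

    n = 2  # bigram model
    train_data, vocab = padded_everygram_pipeline(n, corpus)

    lm = MLE(n)
    lm.fit(train_data, vocab)

    # Conditional probability P("want" | "i")
    print(lm.score("want", ["i"]))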

Python: Find a list of words in a text and return its index

蓝咒 Submitted on 2020-01-23 15:15:11
Question: I have to process a plain-text document, looking for a list of words and returning a text window around each word found. I'm using NLTK. I found posts on Stack Overflow where regular expressions are used to find the words, but without getting their indices, just printing them. I don't think regular expressions are the right tool here, because I have to find specific words. Answer 1: This is what you are looking for: You can either use str.index or str.find: Contents of file: Lorem ipsum dolor sit amet, consectetur adipiscing elit
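A minimal sketch of the idea in that answer, using str.find in a loop to get each occurrence's character index and slice a fixed-size window around it (the word list, window size, and sample text below are made up for illustration):

    text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
            "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.")
    word_list = ["dolor", "tempor"]
    window = 20  # characters of context on each side

    for word in word_list:
        start = text.find(word)  # index of the first occurrence, or -1 if absent
        while start != -1:
            lo = max(0, start - window)
            hi = start + len(word) + window
            print(word, start, repr(text[lo:hi]))
            # note: plain substring search also matches "dolore"; word-boundary
            # handling is left out of this sketch
            start = text.find(word, start + 1)  # next occurrence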

NLTK was unable to find the gs file

谁说胖子不能爱 Submitted on 2020-01-22 10:38:25
Question: I'm trying to use NLTK, the Stanford natural language toolkit. After installing the required files, I started executing the demo code: http://www.nltk.org/index.html >>> import nltk >>> sentence = """At eight o'clock on Thursday morning ... Arthur didn't feel very good.""" >>> tokens = nltk.word_tokenize(sentence) >>> tokens ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] >>> tagged = nltk.pos_tag(tokens) >>> tagged[0:6] [('At', 'IN'),
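For context, the "unable to find the gs file" message usually appears when NLTK tries to render a parse or chunk tree as an image through Ghostscript (the gs executable), for example when a notebook displays the tree graphically. A minimal sketch that stays with text output, which avoids that rendering path (it assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources have been downloaded):

    import nltk

    sentence = """At eight o'clock on Thursday morning Arthur didn't feel very good."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    entities = nltk.ne_chunk(tagged)

    # print() shows the chunk tree as plain text, so Ghostscript is never invoked;
    # graphical display of the tree is what goes looking for the gs binary.
    print(entities)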

Machine Learning Basics: Hands-on Text Classification with the Naive Bayes Model

风格不统一 Submitted on 2020-01-22 09:26:00
This article was first published on my personal public account: TechFlow. In the previous article we introduced the basic principles of the Naive Bayes model. The core idea of Naive Bayes is to assume that the variables in a sample follow some distribution, and then use conditional probability to compute the probability that the sample belongs to each class. A sample usually carries many features, and those features are very likely correlated with one another; to simplify the model, Naive Bayes assumes the variables are independent, which makes computing a sample's probability straightforward. Readers who want to review the details can follow the link back to the earlier article: Machine Learning Basics: Learn the Naive Bayes Model in One Article.

When studying an algorithm, reading only the principles and theory always feels a little shallow. Often the reasoning sounds perfectly convincing, yet the moment we try to implement it ourselves we are at a loss, or we manage to cobble something together but keep running into unexpected problems along the way. Part of the cause is too little hands-on practice, and part is that our understanding is not deep enough. In this article we implement the model ourselves, run it on a real dataset, and look at how it actually performs.

Naive Bayes and Text Classification

Generally speaking, an event in the narrow sense has a finite set of outcomes; that is, the outcome should be a discrete value rather than a continuous one. So early Bayesian models, before the idea of Gaussian mixture models was introduced, also targeted samples with discrete values (this is unverified and is the author's own conjecture). Setting continuous features aside for now, let us look at the practical uses of the Naive Bayes model on discrete samples. Among the many application scenarios of machine learning, there is one very classic scenario
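To make the independence assumption above concrete, here is a tiny illustrative sketch (the word counts and class priors are made up, not taken from the article) that scores a document by adding the log class prior to the per-word log conditional probabilities:

    import math

    # Made-up per-class word counts and priors, purely for illustration
    word_counts = {
        "sports": {"ball": 30, "team": 25, "win": 20},
        "tech":   {"code": 40, "model": 30, "data": 35},
    }
    class_priors = {"sports": 0.5, "tech": 0.5}
    vocab_size = len({w for counts in word_counts.values() for w in counts})

    def log_posterior(doc_tokens, cls, alpha=1.0):
        """log P(cls) + sum of log P(word | cls), with Laplace smoothing."""
        counts = word_counts[cls]
        total = sum(counts.values())
        score = math.log(class_priors[cls])
        for w in doc_tokens:
            score += math.log((counts.get(w, 0) + alpha) / (total + alpha * vocab_size))
        return score

    doc = ["model", "data", "win"]
    print(max(word_counts, key=lambda cls: log_posterior(doc, cls)))  # "tech" here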

How to use word_tokenize on a data frame

只谈情不闲聊 Submitted on 2020-01-22 04:38:05
Question: I have recently started using the nltk module for text analysis. I am stuck at one point: I want to use word_tokenize on a dataframe, so as to obtain all the words used in a particular row of the dataframe. data example: text 1. This is a very good site. I will recommend it to others. 2. Can you please give me a call at 9983938428. have issues with the listings. 3. good work! keep it up 4. not a very helpful site in finding home decor. expected output: 1. 'This','is','a','very','good','site','.
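A minimal sketch of one way to do this with pandas, applying nltk.word_tokenize to every row of a text column (the column name and data frame mirror the example above; it assumes the punkt tokenizer data has been downloaded):

    import pandas as pd
    import nltk

    df = pd.DataFrame({
        "text": [
            "This is a very good site. I will recommend it to others.",
            "Can you please give me a call at 9983938428. have issues with the listings.",
            "good work! keep it up",
            "not a very helpful site in finding home decor.",
        ]
    })

    # apply() runs word_tokenize on each cell, producing a list of tokens per row
    df["tokens"] = df["text"].apply(nltk.word_tokenize)
    print(df["tokens"].iloc[0])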

Determine if text is in English?

假如想象 Submitted on 2020-01-20 04:37:39
Question: I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true: [ "this is some text written in English", "this is some more text written in English", "Ce n'est pas en anglais" ] For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. However, is there a good way to do this? I have been Googling, but
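A minimal sketch of one crude heuristic: count what fraction of a sentence's tokens are English stopwords and drop sentences below a threshold (the 0.15 threshold is arbitrary, and the stopwords and punkt resources are assumed to be downloaded). A dedicated language-identification library would be more robust, but this stays inside NLTK:

    import nltk
    from nltk.corpus import stopwords

    ENGLISH_STOPWORDS = set(stopwords.words("english"))

    def looks_english(text, threshold=0.15):
        # Heuristic: fraction of alphabetic tokens that are English stopwords
        tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
        if not tokens:
            return False
        hits = sum(1 for t in tokens if t in ENGLISH_STOPWORDS)
        return hits / len(tokens) >= threshold

    docs = [
        "this is some text written in English",
        "this is some more text written in English",
        "Ce n'est pas en anglais",
    ]
    print([d for d in docs if looks_english(d)])  # drops the French sentence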

Using NLTK

馋奶兔 Submitted on 2020-01-18 21:39:55
To install nltk, see: http://www.cnblogs.com/kylinsblog/p/7755843.html
NLTK is a powerful third-party Python library that makes many natural language processing (NLP) tasks easy, including tokenization, part-of-speech tagging, named entity recognition (NER), and syntactic parsing. The following shows how to use NLTK to quickly complete basic NLP tasks.

1. Tokenization with NLTK

Functions used: nltk.sent_tokenize(text) # split the text into sentences; nltk.word_tokenize(sent) # tokenize a sentence into words

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    print('nlp2 test')
    import nltk

    text = 'PathonTip.com is a very good website. We can learn a lot from it.'
    # split the text into a list of sentences
    sens = nltk.sent_tokenize(text)
    print(sens)
    # NLTK tokenization works at the sentence level, so split into sentences first
    # and then tokenize each one; otherwise the results will be poor
    words = []
    for sent in sens:
        words.append(nltk.word_tokenize(sent))
    print(words)

Output:

2. Part-of-speech tagging with NLTK
Functions used:
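The part-of-speech section is cut off above; a minimal sketch of what nltk.pos_tag looks like on the same example sentence (assuming the averaged_perceptron_tagger resource has been downloaded):

    import nltk

    text = 'PathonTip.com is a very good website. We can learn a lot from it.'
    words = nltk.word_tokenize(nltk.sent_tokenize(text)[0])

    # pos_tag returns (token, tag) pairs using the Penn Treebank tag set
    print(nltk.pos_tag(words))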