nltk

How to grab streaming data from Twitter with pycurl and parse it using NLTK and regular expressions

旧时模样 submitted on 2020-01-01 19:05:12
Question: I am a newbie in Python and was given this task by my boss:

1. Grab streaming data from Twitter, connecting with pycurl, and output it as JSON
2. Parse it using NLTK and regular expressions
3. Save it to a database (MySQL) or to a flat file (txt)

Note: this is the URL that I want to grab ('http://search.twitter.com/search.json?geocode=-0.789275%2C113.921327%2C1.0km&q=+near%3Aindonesia+within%3A1km&result_type=recent&rpp=10')

Does anyone know how to grab streaming data from Twitter following the steps above?
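A minimal sketch of that pipeline, assuming Python 3 with pycurl and nltk installed; note that the search.twitter.com/search.json endpoint above was retired long ago, so the URL is kept purely for illustration:

import json
from io import BytesIO

import pycurl
import nltk  # requires: nltk.download('punkt')

URL = ('http://search.twitter.com/search.json?geocode=-0.789275%2C113.921327%2C1.0km'
       '&q=+near%3Aindonesia+within%3A1km&result_type=recent&rpp=10')

# Fetch the raw response body with pycurl
buf = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, URL)
c.setopt(c.WRITEDATA, buf)
c.perform()
c.close()

# Parse the JSON payload (the old search API returned tweets under 'results')
data = json.loads(buf.getvalue().decode('utf-8'))
for tweet in data.get('results', []):
    tokens = nltk.word_tokenize(tweet.get('text', ''))  # NLTK tokenization step
    print(tokens)  # from here, write rows to MySQL or append to a txt file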

How to clean text belonging to different languages in Python

南笙酒味 submitted on 2020-01-01 15:33:50
Question: I have a collection of text whose sentences are entirely in English, Hindi, or Marathi, with ids 0, 1, 2 respectively attached to each sentence to indicate its language. Regardless of the language, the text may contain HTML tags, punctuation, etc. I could clean the English sentences using my code below:

import HTMLParser
import re
from nltk.corpus import stopwords
from collections import Counter
import pickle
from string import punctuation

#creating html_parser
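A minimal language-agnostic cleaning sketch, assuming Python 3 (where HTMLParser lives in html.parser rather than the Python 2 module used above); the Devanagari Unicode block U+0900 to U+097F covers both Hindi and Marathi, so a single character-class regex can keep Latin and Devanagari letters while dropping tags and punctuation:

import re
from html.parser import HTMLParser  # Python 3 location of HTMLParser

class TagStripper(HTMLParser):
    """Collects only text content, discarding HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)
    def text(self):
        return ''.join(self.parts)

def clean(sentence):
    stripper = TagStripper()
    stripper.feed(sentence)
    no_html = stripper.text()
    # Keep Latin letters (English) and Devanagari (Hindi/Marathi); drop the rest
    kept = re.sub(r"[^A-Za-z\u0900-\u097F\s]", " ", no_html)
    return re.sub(r"\s+", " ", kept).strip()

print(clean("<b>यह एक वाक्य है!</b>"))  # -> 'यह एक वाक्य है'
print(clean("<p>Hello, world!</p>"))     # -> 'Hello world'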

Setting up NLTK with Stanford NLP (both StanfordNERTagger and StanfordPOSTagger) for Spanish

China☆狼群 submitted on 2020-01-01 12:11:32
Question: The NLTK documentation is rather poor on this integration. The steps I followed were:

Download http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip to /home/me/stanford
Download http://nlp.stanford.edu/software/stanford-spanish-corenlp-2015-01-08-models.jar to /home/me/stanford

Then, in an IPython console:

In [11]: import nltk
In [12]: nltk.__version__
Out[12]: '3.1'
In [13]: from nltk.tag import StanfordNERTagger

Then st = StanfordNERTagger('/home/me/stanford/stanford
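A sketch of how the two taggers are typically wired up for Spanish, assuming NLTK 3.1 and the downloads above. The model paths are assumptions based on the 2015 releases (the Spanish NER model normally has to be extracted from the models jar first, e.g. with jar xf, and the NER classifier jar comes from the separate stanford-ner download), so adjust them to what you actually find on disk:

from nltk.tag import StanfordNERTagger, StanfordPOSTagger

STANFORD = '/home/me/stanford'

# POS: the model and the jar both ship inside the stanford-postagger-full zip;
# the Spanish models live under its models/ directory
pos = StanfordPOSTagger(
    STANFORD + '/stanford-postagger-full-2015-04-20/models/spanish.tagger',
    STANFORD + '/stanford-postagger-full-2015-04-20/stanford-postagger.jar')

# NER: extract the CRF model from stanford-spanish-corenlp-2015-01-08-models.jar
# (e.g. `jar xf stanford-spanish-corenlp-2015-01-08-models.jar`) first
ner = StanfordNERTagger(
    STANFORD + '/edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz',
    STANFORD + '/stanford-ner/stanford-ner.jar')

print(pos.tag('Juan vive en Madrid'.split()))
print(ner.tag('Juan vive en Madrid'.split()))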

NLTK Data installation issues

流过昼夜 submitted on 2020-01-01 09:45:54
Question: I am trying to install NLTK Data on Mac OS X 10.9. The download directory to be set, as mentioned in the NLTK 3.0 documentation, is /usr/share/nltk_data for a central installation. But for this path, I get the error:

OSError: [Errno 13] Permission denied: '/usr/share/nltk_data'

Can I set the download directory to /Users/ananya/nltk_data for a central installation? I have Python 2.7 installed on my machine.

Thanks,
Ananya

Answer 1: Have you tried:

$ sudo python
>>> import nltk
>>> nltk.download()

To check if
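If sudo is not an option, NLTK also supports per-user data directories; a minimal sketch, valid for any recent NLTK, is below. ~/nltk_data is one of the paths NLTK searches by default, and nltk.data.path can be extended for any other location:

import nltk

# Download into the user's home directory instead of /usr/share/nltk_data
nltk.download('punkt', download_dir='/Users/ananya/nltk_data')

# ~/nltk_data is searched by default; other locations must be registered
nltk.data.path.append('/Users/ananya/nltk_data')

# The NLTK_DATA environment variable works as well:
#   export NLTK_DATA=/Users/ananya/nltk_data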

Kneser-Ney smoothing of trigrams using Python NLTK

百般思念 submitted on 2020-01-01 09:18:29
Question: I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the documentation is rather sparse. What I'm trying to do is this: I parse a text into a list of trigram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution. I'm pretty sure, though, that the result is totally wrong: when I sum up the individual probabilities I get something way beyond 1. Take this code example:
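The question's code example is cut off above, but the described setup maps directly onto NLTK's KneserNeyProbDist, which consumes a FreqDist of trigram tuples; a minimal sketch follows. One plausible explanation for sums above 1 (an assumption about the cause, not a verdict on the asker's exact code) is that the smoothed values behave like conditional probabilities P(w3 | w1, w2), which sum to roughly 1 per (w1, w2) context rather than across all trigrams:

from nltk import FreqDist, KneserNeyProbDist
from nltk.util import ngrams

text = "the cat sat on the mat and the cat ate the rat".split()
trigrams = list(ngrams(text, 3))

freq = FreqDist(trigrams)
kn = KneserNeyProbDist(freq)

# Probabilities of individual trigrams
for tg in kn.samples():
    print(tg, kn.prob(tg))

# Summing over *all* trigrams can exceed 1 ...
print(sum(kn.prob(tg) for tg in kn.samples()))

# ... but restricting to one (w1, w2) context stays near 1
context = ('the', 'cat')
print(sum(kn.prob(tg) for tg in kn.samples() if tg[:2] == context))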

NLTK classify interface using trained classifier

偶尔善良 submitted on 2020-01-01 06:54:22
Question: I have this little chunk of code I found here:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in
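Picking up where the snippet is cut off, a sketch of the usual continuation: train the NaiveBayesClassifier on those feature sets, then reuse word_feats to classify new text. This completion is the standard movie-reviews tutorial pattern, not necessarily the asker's exact code:

from nltk.tokenize import word_tokenize

posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Train on all labeled feature sets
classifier = NaiveBayesClassifier.train(negfeats + posfeats)

# Classify unseen text with the same feature extractor used for training
sentence = "This movie was a wonderful surprise"
print(classifier.classify(word_feats(word_tokenize(sentence))))  # e.g. 'pos'
classifier.show_most_informative_features(5)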

FreqDist in NLTK not sorting output

*爱你&永不变心* submitted on 2020-01-01 04:40:08
Question: I'm new to Python and I'm trying to teach myself language processing. NLTK in Python has a function called FreqDist that gives the frequency of words in a text, but for some reason it's not working properly. This is what the tutorial has me write:

fdist1 = FreqDist(text1)
vocabulary1 = fdist1.keys()
vocabulary1[:50]

So basically it's supposed to give me a list of the 50 most frequent words in the text. When I run the code, though, the result is the 50 least frequent words in order of least
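In NLTK 3, FreqDist became a subclass of collections.Counter, so keys() no longer returns words sorted by frequency (and in Python 3 it is a view that cannot be sliced at all); the usual fix is most_common, sketched here with toy data:

from nltk import FreqDist

text1 = "the quick brown fox jumps over the lazy dog the fox".split()
fdist1 = FreqDist(text1)

# (word, count) pairs sorted by descending frequency
print(fdist1.most_common(50))

# Just the words, if a plain list is needed
vocabulary1 = [word for word, count in fdist1.most_common(50)]
print(vocabulary1[:50])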

Get synonyms from synset returns error - Python

◇◆丶佛笑我妖孽 submitted on 2020-01-01 03:42:09
Question: I'm trying to get synonyms of a given word using WordNet. The problem is that even though I'm doing the same as is written here, it returns an error. Here is my code:

from nltk.corpus import wordnet as wn
import nltk

dog = wn.synset('dog.n.01')
print dog.lemma_names
>>> <bound method Synset.lemma_names of Synset('dog.n.01')>

for i,j in enumerate(wn.synsets('small')):
    print "Synonyms:", ", ".join(j.lemma_names)
>>> Synonyms: Traceback (most recent call last):
File "C:/Users/Python
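The printed '<bound method ...>' is the giveaway: in NLTK 3, Synset.lemma_names changed from an attribute to a method, so it must be called with parentheses. A sketch of the corrected loop, shown in Python 3 syntax:

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print(dog.lemma_names())  # note the parentheses: a method in NLTK 3

for i, syn in enumerate(wn.synsets('small')):
    # join() needs strings; lemma_names() returns a list of them
    print("Synonyms:", ", ".join(syn.lemma_names()))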

Extracting relations from text

不问归期 submitted on 2020-01-01 03:31:17
Question: I want to extract relations from unstructured text in the form of (SUBJECT, OBJECT, ACTION) triples. For instance, "The boy is sitting on the table eating the chicken" would give me (boy, chicken, eat), (boy, table, LOCATION), etc. Although a Python program plus NLTK could process a sentence as simple as the one above, I'd like to know if any of you have used tools or libraries, preferably open source, to extract relations from a much wider domain such as a large collection of text documents or the web.

Answer 1:
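For a toy version of the simple NLTK approach the question alludes to, a sketch using POS tagging plus a chunk grammar is below. This is a naive pattern-matcher (serious systems use dependency parsing or tools such as Stanford OpenIE), and both the grammar and the triple-building heuristic are illustrative assumptions:

import nltk  # requires: punkt and averaged_perceptron_tagger data

sentence = "The boy is sitting on the table eating the chicken"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunk noun phrases with a small regex grammar
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
tree = nltk.RegexpParser(grammar).parse(tagged)

# Collect NP head words and verbs in surface order
nps = [subtree.leaves()[-1][0] for subtree in tree.subtrees()
       if subtree.label() == 'NP']                    # ['boy', 'table', 'chicken']
verbs = [word for word, tag in tagged if tag.startswith('VB')]

# Naive heuristic: first NP is the subject, later NPs are candidate objects
subject, objects = nps[0], nps[1:]
print((subject, objects[-1], verbs[-1]))  # ('boy', 'chicken', 'eating')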

Python NLTK keyword extraction from a sentence

Deadly submitted on 2020-01-01 03:21:10
问题 "First thing we do, let's kill all the lawyers." - William Shakespeare Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags: [["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]] The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the