nltk

Calculating tf-idf among documents using Python 2.7

非 Y 不嫁゛ submitted on 2019-12-29 08:08:27
Question: I have a scenario where I have retrieved information/raw data from the internet and placed it into respective JSON or .txt files. From there I would like to calculate the frequency of each term in each document and their cosine similarity using tf-idf. For example: there are 50 different documents/text files that consist of 5,000 words/strings each. I would like to take the first word from the first document/text, compare it against all 250,000 words in total, find its frequencies, then do…
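
A minimal sketch of one standard way to do this (not the poster's code), using scikit-learn's TfidfVectorizer plus cosine_similarity; the docs list below is a hypothetical stand-in for the contents of the 50 files:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in: in practice, read each of the 50 .txt/.json files.
docs = ["text of the first document ...",
        "text of the second document ..."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # one tf-idf row per document
sims = cosine_similarity(tfidf)          # pairwise cosine similarity matrix
print(sims[0, 1])                        # similarity of document 0 and document 1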

POS tagging - NLTK thinks noun is adjective

时间秒杀一切 submitted on 2019-12-29 07:52:17
问题 In the following code, why does nltk think 'fish' is an adjective and not a noun? >>> import nltk >>> s = "a woman needs a man like a fish needs a bicycle" >>> nltk.pos_tag(s.split()) [('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')] 回答1: I am not sure what is the workaround but you can check the source here https://nltk.googlecode.com/svn/trunk/nltk/nltk/tag/ Meanwhile I
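
One hedged workaround (not necessarily what the truncated answer goes on to say) is to train your own tagger on the Brown corpus with backoff; a unigram tagger assigns 'fish' its most frequent corpus tag, which is a noun:

import nltk
from nltk.corpus import brown

# Note: Brown uses its own tagset, so tags differ from pos_tag's Penn tags.
train = brown.tagged_sents(categories='news')
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)
print(t2.tag("a woman needs a man like a fish needs a bicycle".split()))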

NLTK POS tagger not working

你离开我真会死。 submitted on 2019-12-29 07:47:07
Question: If I try this: import nltk text = nltk.word_tokenize("And now for something completely different") nltk.pos_tag(text) Output: Traceback (most recent call last): File "C:/Python27/pos.py", line 3, in <module> nltk.pos_tag(text) File "C:\Python27\lib\site-packages\nltk-2.0.4-py2.7.egg\nltk\tag\__init__.py", in pos_tag tagger = load(_POS_TAGGER) File "C:\Python27\lib\site-packages\nltk-2.0.4-py2.7.egg\nltk\data.py", line 605, in resource_val = pickle.load(_open(resource_url)) ImportError: No module…
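
The traceback is cut off, but with NLTK 2.0.4 an ImportError raised inside pickle.load usually means a module the pickled tagger depends on (typically numpy) is missing; a LookupError would instead mean the tagger model itself was never downloaded. A minimal sketch of the usual fixes, assuming that diagnosis:

# First, from a shell: pip install numpy
import nltk
nltk.download('maxent_treebank_pos_tagger')  # default tagger model in NLTK 2.x

text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))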

Error using Stanford POS Tagger in NLTK Python

北城余情 submitted on 2019-12-29 06:45:08
Question: I am trying to use the Stanford POS Tagger in NLTK, but I am not able to run the example code given here: http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford import nltk from nltk.tag.stanford import POSTagger st = POSTagger(r'english-bidirectional-distim.tagger', r'D:/stanford-postagger/stanford-postagger.jar') st.tag('What is the airspeed of an unladen swallow?'.split()) I have already added environment variables as CLASSPATH = D:/stanford-postagger/stanford-postagger.jar STANFORD…
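
For reference, a minimal sketch of a working setup under NLTK 3, where the class is named StanfordPOSTagger; note the model file shipped with the tagger is spelled english-bidirectional-distsim.tagger, and the paths below are hypothetical:

from nltk.tag.stanford import StanfordPOSTagger

jar = r'D:/stanford-postagger/stanford-postagger.jar'
model = r'D:/stanford-postagger/models/english-bidirectional-distsim.tagger'

st = StanfordPOSTagger(model, jar)
print(st.tag('What is the airspeed of an unladen swallow?'.split()))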

Coreference resolution in python nltk using Stanford coreNLP

末鹿安然 submitted on 2019-12-29 06:20:19
Question: Stanford CoreNLP provides coreference resolution as mentioned here; this thread and this also provide some insight into its implementation in Java. However, I am using Python and NLTK, and I am not sure how I can use the coreference resolution functionality of CoreNLP in my Python code. I have been able to set up StanfordParser in NLTK; this is my code so far. from nltk.parse.stanford import StanfordDependencyParser stanford_parser_dir = 'stanford-parser/' eng_model_path = stanford_parser_dir +…
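
NLTK's StanfordDependencyParser does not expose coreference, so one hedged alternative is to run the CoreNLP server and query its coref annotator over HTTP with requests; this sketch assumes a server already started locally on port 9000 (java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer):

import json
import requests

text = 'Barack Obama was born in Hawaii. He was elected president in 2008.'
props = {'annotators': 'coref', 'outputFormat': 'json'}
resp = requests.post('http://localhost:9000',
                     params={'properties': json.dumps(props)},
                     data=text.encode('utf-8'))
for chain in resp.json()['corefs'].values():
    # Each chain lists the mentions that refer to the same entity.
    print([mention['text'] for mention in chain])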

Python can't find module NLTK

北城余情 submitted on 2019-12-29 05:50:26
Question: I followed these instructions http://www.nltk.org/install.html to install the nltk module on my Mac (10.6). I have installed Python 2.7, but when I open IDLE and type import nltk it gives me this error: Traceback (most recent call last): File "<pyshell#0>", line 1, in <module> import nltk ImportError: No module named nltk The problem is that the module is installed in another Python version, 2.6. How can I install the package in Python 2.7? I have tried some of the solutions suggested in various…
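
A minimal diagnostic sketch: run this inside IDLE to see exactly which interpreter it uses, then install nltk with that same binary from a terminal (the printed path varies per machine):

import sys
print(sys.version)     # should start with 2.7
print(sys.executable)  # then install with: <this path> -m pip install nltk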

How to navigate a nltk.tree.Tree?

坚强是说给别人听的谎言 submitted on 2019-12-29 04:14:06
Question: I've chunked a sentence using: grammar = ''' NP: {<DT>*(<NN.*>|<JJ.*>)*<NN.*>} NVN: {<NP><VB.*><NP>} ''' chunker = nltk.chunk.RegexpParser(grammar) tree = chunker.parse(tagged) print tree The result looks like: (S (NVN (NP The_Pigs/NNS) are/VBP (NP a/DT Bristol-based/JJ punk/NN rock/NN band/NN)) that/WDT formed/VBN in/IN 1977/CD ./.) But now I'm stuck trying to figure out how to navigate that. I want to be able to find the NVN subtree and access the left-side noun phrase ("The_Pigs"), the…
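
A minimal sketch of one way to navigate it, assuming NLTK 3's .label() accessor (older releases used .node) and re-creating the question's tree from hypothetical tags:

import nltk

tagged = [('The_Pigs', 'NNS'), ('are', 'VBP'), ('a', 'DT'),
          ('Bristol-based', 'JJ'), ('punk', 'NN'), ('rock', 'NN'),
          ('band', 'NN'), ('that', 'WDT'), ('formed', 'VBN'),
          ('in', 'IN'), ('1977', 'CD'), ('.', '.')]
grammar = '''
NP: {<DT>*(<NN.*>|<JJ.*>)*<NN.*>}
NVN: {<NP><VB.*><NP>}
'''
tree = nltk.chunk.RegexpParser(grammar).parse(tagged)

# Find each NVN subtree; its children are NP, (verb, tag), NP.
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NVN'):
    left_np, verb, right_np = subtree[0], subtree[1], subtree[2]
    print(' '.join(w for w, t in left_np.leaves()))   # The_Pigs
    print(verb[0])                                    # are
    print(' '.join(w for w, t in right_np.leaves()))  # a Bristol-based punk rock band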

nltk language model (ngram) calculate the prob of a word from context

青春壹個敷衍的年華 submitted on 2019-12-29 03:21:40
Question: I am using Python and NLTK to build a language model as follows: from nltk.corpus import brown from nltk.model import NgramModel from nltk.probability import LidstoneProbDist, WittenBellProbDist estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2) lm = NgramModel(3, brown.words(categories='news'), estimator) # Thanks to miku, I fixed this problem print lm.prob("word", ["This is a context which generates a word"]) >> 0.00493261081006 # But I got another problem like this one... print lm.prob("b", ["This is a context…
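
For context, a hedged sketch of how the call was meant to work in the old NLTK 2.x API (NgramModel was removed in NLTK 3): for a trigram model, the context should be the preceding n-1 = 2 tokens, not a whole sentence as one string:

from nltk.corpus import brown
from nltk.model import NgramModel          # NLTK 2.x only
from nltk.probability import LidstoneProbDist

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)

# Context = the two tokens immediately preceding the word being scored.
print(lm.prob('County', ('The', 'Fulton')))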

Find multi-word terms in a tokenized text in Python

雨燕双飞 submitted on 2019-12-29 01:49:07
Question: I have a text that I have tokenized, or in general a list of words is fine as well. For example: >>> from nltk.tokenize import word_tokenize >>> s = '''Good muffins cost $3.88\nin New York. Please buy me ... two of them.\n\nThanks.''' >>> word_tokenize(s) ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.'] If I have a Python dict that contains single-word as well as multi-word keys, how can I efficiently and…
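
One hedged approach is NLTK's MWETokenizer, which re-merges known multi-word expressions after tokenization so they can be looked up as single dict keys; the terms dict here is a hypothetical example:

from nltk.tokenize import MWETokenizer, word_tokenize

terms = {'New York': 'CITY', 'muffins': 'FOOD'}  # hypothetical dict

# Register every multi-word key as a merge rule, keeping spaces as the joiner.
mwe = MWETokenizer([tuple(k.split()) for k in terms if ' ' in k],
                   separator=' ')
tokens = mwe.tokenize(word_tokenize('Good muffins cost $3.88 in New York.'))
print(tokens)                             # [..., 'in', 'New York', '.']
print([t for t in tokens if t in terms])  # ['muffins', 'New York']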