nltk

Calculating tf-idf among documents using Python 2.7

非 Y 不嫁゛ submitted on 2019-12-29 08:08:27
Question: I have a scenario where I have retrieved information/raw data from the internet and placed it into respective JSON or .txt files. From there I would like to calculate the frequency of each term in each document and their cosine similarity using tf-idf. For example: there are 50 different documents/text files that consist of 5,000 words/strings each. I would like to take the first word from the first document/text, compare it against all 250,000 words in total, find its frequencies, then do…
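
A minimal sketch of one standard way to do this (not the poster's code), using scikit-learn's TfidfVectorizer plus cosine_similarity; the docs list below is a hypothetical stand-in for the contents of the 50 files:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in: in practice, read each of the 50 .txt/.json files.
docs = ["text of the first document ...",
        "text of the second document ..."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # one tf-idf row per document
sims = cosine_similarity(tfidf)          # pairwise cosine similarity matrix
print(sims[0, 1])                        # similarity of document 0 and document 1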

POS tagging - NLTK thinks noun is adjective

时间秒杀一切 submitted on 2019-12-29 07:52:17
问题 In the following code, why does nltk think 'fish' is an adjective and not a noun? >>> import nltk >>> s = "a woman needs a man like a fish needs a bicycle" >>> nltk.pos_tag(s.split()) [('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')] 回答1: I am not sure what is the workaround but you can check the source here https://nltk.googlecode.com/svn/trunk/nltk/nltk/tag/ Meanwhile I
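
One hedged workaround (not necessarily what the truncated answer goes on to say) is to train your own tagger on the Brown corpus with backoff; a unigram tagger assigns 'fish' its most frequent corpus tag, which is a noun:

import nltk
from nltk.corpus import brown

# Note: Brown uses its own tagset, so tags differ from pos_tag's Penn tags.
train = brown.tagged_sents(categories='news')
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)
print(t2.tag("a woman needs a man like a fish needs a bicycle".split()))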

NLTK POS tagger not working

你离开我真会死。 submitted on 2019-12-29 07:47:07
Question: If I try this: import nltk text = nltk.word_tokenize("And now for something completely different") nltk.pos_tag(text) Output: Traceback (most recent call last): File "C:/Python27/pos.py", line 3, in <module> nltk.pos_tag(text) File "C:\Python27\lib\site-packages\nltk-2.0.4-py2.7.egg\nltk\tag\__init__.py", in pos_tag tagger = load(_POS_TAGGER) File "C:\Python27\lib\site-packages\nltk-2.0.4-py2.7.egg\nltk\data.py", line 605, in resource_val = pickle.load(_open(resource_url)) ImportError: No module…
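
The traceback is cut off, but with NLTK 2.0.4 an ImportError raised inside pickle.load usually means a module the pickled tagger depends on (typically numpy) is missing; a LookupError would instead mean the tagger model itself was never downloaded. A minimal sketch of the usual fixes, assuming that diagnosis:

# First, from a shell: pip install numpy
import nltk
nltk.download('maxent_treebank_pos_tagger')  # default tagger model in NLTK 2.x

text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))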

Error using Stanford POS Tagger in NLTK Python

北城余情 submitted on 2019-12-29 06:45:08
Question: I am trying to use the Stanford POS Tagger in NLTK, but I am not able to run the example code given here: http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford import nltk from nltk.tag.stanford import POSTagger st = POSTagger(r'english-bidirectional-distim.tagger', r'D:/stanford-postagger/stanford-postagger.jar') st.tag('What is the airspeed of an unladen swallow?'.split()) I have already added environment variables as CLASSPATH = D:/stanford-postagger/stanford-postagger.jar STANFORD…
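
For reference, a minimal sketch of a working setup under NLTK 3, where the class is named StanfordPOSTagger; note the model file shipped with the tagger is spelled english-bidirectional-distsim.tagger, and the paths below are hypothetical:

from nltk.tag.stanford import StanfordPOSTagger

jar = r'D:/stanford-postagger/stanford-postagger.jar'
model = r'D:/stanford-postagger/models/english-bidirectional-distsim.tagger'

st = StanfordPOSTagger(model, jar)
print(st.tag('What is the airspeed of an unladen swallow?'.split()))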

Coreference resolution in python nltk using Stanford coreNLP

末鹿安然 submitted on 2019-12-29 06:20:19
Question: Stanford CoreNLP provides coreference resolution as mentioned here; this thread and this also provide some insight into its implementation in Java. However, I am using Python and NLTK, and I am not sure how I can use the coreference resolution functionality of CoreNLP in my Python code. I have been able to set up StanfordParser in NLTK; this is my code so far. from nltk.parse.stanford import StanfordDependencyParser stanford_parser_dir = 'stanford-parser/' eng_model_path = stanford_parser_dir +…
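
NLTK's StanfordDependencyParser does not expose coreference, so one hedged alternative is to run the CoreNLP server and query its coref annotator over HTTP with requests; this sketch assumes a server already started locally on port 9000 (java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer):

import json
import requests

text = 'Barack Obama was born in Hawaii. He was elected president in 2008.'
props = {'annotators': 'coref', 'outputFormat': 'json'}
resp = requests.post('http://localhost:9000',
                     params={'properties': json.dumps(props)},
                     data=text.encode('utf-8'))
for chain in resp.json()['corefs'].values():
    # Each chain lists the mentions that refer to the same entity.
    print([mention['text'] for mention in chain])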

Python can't find module NLTK

北城余情 submitted on 2019-12-29 05:50:26
Question: I followed these instructions http://www.nltk.org/install.html to install the nltk module on my Mac (10.6). I have installed Python 2.7, but when I open IDLE and type import nltk it gives me this error: Traceback (most recent call last): File "<pyshell#0>", line 1, in <module> import nltk ImportError: No module named nltk The problem is that the module is installed in another Python version, 2.6. How can I install the package in Python 2.7? I have tried some of the solutions suggested in various…
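
A minimal diagnostic sketch: run this inside IDLE to see exactly which interpreter it uses, then install nltk with that same binary from a terminal (the printed path varies per machine):

import sys
print(sys.version)     # should start with 2.7
print(sys.executable)  # then install with: <this path> -m pip install nltk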

How to navigate a nltk.tree.Tree?

坚强是说给别人听的谎言 submitted on 2019-12-29 04:14:06
Question: I've chunked a sentence using: grammar = ''' NP: {<DT>*(<NN.*>|<JJ.*>)*<NN.*>} NVN: {<NP><VB.*><NP>} ''' chunker = nltk.chunk.RegexpParser(grammar) tree = chunker.parse(tagged) print tree The result looks like: (S (NVN (NP The_Pigs/NNS) are/VBP (NP a/DT Bristol-based/JJ punk/NN rock/NN band/NN)) that/WDT formed/VBN in/IN 1977/CD ./.) But now I'm stuck trying to figure out how to navigate that. I want to be able to find the NVN subtree and access the left-side noun phrase ("The_Pigs"), the…
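
A minimal sketch of one way to navigate it, assuming NLTK 3's .label() accessor (older releases used .node) and re-creating the question's tree from hypothetical tags:

import nltk

tagged = [('The_Pigs', 'NNS'), ('are', 'VBP'), ('a', 'DT'),
          ('Bristol-based', 'JJ'), ('punk', 'NN'), ('rock', 'NN'),
          ('band', 'NN'), ('that', 'WDT'), ('formed', 'VBN'),
          ('in', 'IN'), ('1977', 'CD'), ('.', '.')]
grammar = '''
NP: {<DT>*(<NN.*>|<JJ.*>)*<NN.*>}
NVN: {<NP><VB.*><NP>}
'''
tree = nltk.chunk.RegexpParser(grammar).parse(tagged)

# Find each NVN subtree; its children are NP, (verb, tag), NP.
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NVN'):
    left_np, verb, right_np = subtree[0], subtree[1], subtree[2]
    print(' '.join(w for w, t in left_np.leaves()))   # The_Pigs
    print(verb[0])                                    # are
    print(' '.join(w for w, t in right_np.leaves()))  # a Bristol-based punk rock band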

nltk language model (ngram) calculate the prob of a word from context

青春壹個敷衍的年華 submitted on 2019-12-29 03:21:40
Question: I am using Python and NLTK to build a language model as follows: from nltk.corpus import brown from nltk.model import NgramModel from nltk.probability import LidstoneProbDist, WittenBellProbDist estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2) lm = NgramModel(3, brown.words(categories='news'), estimator) # Thanks to miku, I fixed this problem print lm.prob("word", ["This is a context which generates a word"]) >> 0.00493261081006 # But I got another problem like this one... print lm.prob("b", ["This is a context…
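
For context, a hedged sketch of how the call was meant to work in the old NLTK 2.x API (NgramModel was removed in NLTK 3): for a trigram model, the context should be the preceding n-1 = 2 tokens, not a whole sentence as one string:

from nltk.corpus import brown
from nltk.model import NgramModel          # NLTK 2.x only
from nltk.probability import LidstoneProbDist

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)

# Context = the two tokens immediately preceding the word being scored.
print(lm.prob('County', ('The', 'Fulton')))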

Find multi-word terms in a tokenized text in Python

雨燕双飞 submitted on 2019-12-29 01:49:07
Question: I have a text that I have tokenized, or in general a list of words is fine as well. For example: >>> from nltk.tokenize import word_tokenize >>> s = '''Good muffins cost $3.88\nin New York. Please buy me ... two of them.\n\nThanks.''' >>> word_tokenize(s) ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.'] If I have a Python dict that contains single-word as well as multi-word keys, how can I efficiently and…
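
One hedged approach is NLTK's MWETokenizer, which re-merges known multi-word expressions after tokenization so they can be looked up as single dict keys; the terms dict here is a hypothetical example:

from nltk.tokenize import MWETokenizer, word_tokenize

terms = {'New York': 'CITY', 'muffins': 'FOOD'}  # hypothetical dict

# Register every multi-word key as a merge rule, keeping spaces as the joiner.
mwe = MWETokenizer([tuple(k.split()) for k in terms if ' ' in k],
                   separator=' ')
tokens = mwe.tokenize(word_tokenize('Good muffins cost $3.88 in New York.'))
print(tokens)                             # [..., 'in', 'New York', '.']
print([t for t in tokens if t in terms])  # ['muffins', 'New York']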