nltk

NLTK Context-Free Grammar Generation

﹥>﹥吖頭↗ Submitted on 2020-01-01 02:44:16

Question: I'm working on a non-English parser with Unicode characters. For that, I decided to use NLTK. But it requires a predefined context-free grammar, such as:

S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"

In my app, I am supposed to minimize hard-coding through the use of a rule-based grammar. For example, I can
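
A minimal sketch of one way to reduce the hard-coding (the lexicon dict, the Name nonterminal, and the structural/lexical split below are illustrative, not from the question): keep the terminals in ordinary data structures and assemble the grammar string at runtime.

    import nltk

    # Terminals kept as ordinary data instead of being written into the grammar
    lexicon = {
        "V": ["saw", "ate", "walked"],
        "Name": ["John", "Mary", "Bob"],
        "Det": ["a", "an", "the", "my"],
        "N": ["man", "dog", "cat", "telescope", "park"],
        "P": ["in", "on", "by", "with"],
    }

    structural = [
        "S -> NP VP",
        "VP -> V NP | V NP PP",
        "PP -> P NP",
        "NP -> Name | Det N | Det N PP",
    ]

    # One lexical rule per part of speech, e.g. V -> "saw" | "ate" | "walked"
    lexical = ['%s -> %s' % (pos, " | ".join('"%s"' % w for w in words))
               for pos, words in lexicon.items()]

    grammar = nltk.CFG.fromstring("\n".join(structural + lexical))
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("John saw a dog".split()):
        print(tree)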

Are there any classes in NLTK for text normalization and canonicalization?

≡放荡痞女 Submitted on 2019-12-31 08:29:37

Question: Most of the NLTK documentation and examples are devoted to lemmatization and stemming, but they are very sparse on such normalization matters as:

- converting all letters to lower or upper case
- removing punctuation
- converting numbers into words
- removing accent marks and other diacritics
- expanding abbreviations
- removing stopwords or "too common" words
- text canonicalization (tumor = tumour, it's = it is)

Please point me to where in NLTK to dig. Any NLTK equivalents (Java or any other) for
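
NLTK has no single normalizer class; a minimal sketch (Python 3) combines the piece NLTK does ship (stopword lists) with the standard library, covering case folding, diacritics, punctuation, and stopwords. Number-to-word conversion and abbreviation expansion need extra data or third-party packages and are left out here.

    import string
    import unicodedata
    from nltk.corpus import stopwords   # requires nltk.download('stopwords')

    def normalize(text):
        text = text.lower()                                     # case folding
        text = unicodedata.normalize("NFKD", text)              # split off accents
        text = "".join(c for c in text if not unicodedata.combining(c))
        text = text.translate(str.maketrans("", "", string.punctuation))
        stop = set(stopwords.words("english"))                  # "too common" words
        return [t for t in text.split() if t not in stop]

    print(normalize("It's the Tumour, naively speaking!"))
    # -> ['tumour', 'naively', 'speaking']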

How to output NLTK chunks to file?

泪湿孤枕 Submitted on 2019-12-31 00:03:49

Question: I have this Python script where I am using the nltk library to parse, tokenize, tag, and chunk some, let's say, random text from the web. I need to format and write to a file the output of chunked1, chunked2, and chunked3, which have the type class 'nltk.tree.Tree'. More specifically, I need to write only the lines that match the regular expressions chunkGram1, chunkGram2, and chunkGram3. How can I do that?

#! /usr/bin/python2.7
import nltk
import re
import codecs

xstring = ["An electronic library (also
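
A minimal, self-contained sketch of one approach (the sample sentence and the single chunk rule stand in for the question's chunkGram1-3): after chunking, walk the tree with Tree.subtrees(), keep only subtrees carrying the label the chunk rule assigned, and write those to a file.

    import nltk  # needs nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

    sentence = "An electronic library is a collection of digital objects"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Illustrative rule; substitute your own chunkGram1/2/3
    chunked = nltk.RegexpParser(r"Chunk: {<DT>?<JJ>*<NN.*>+}").parse(tagged)

    with open("chunks.txt", "w") as out:
        for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk"):
            out.write(" ".join(word for word, tag in subtree.leaves()) + "\n")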

How do I print out just the word itself in a WordNet synset using Python NLTK?

為{幸葍}努か Submitted on 2019-12-30 18:51:21

Question: Is there a way in Python 2.7, using NLTK, to get just the word and not the extra formatting that includes "Synset", the parentheses, the "n.01", etc.? For instance, if I do

wn.synsets('dog')

my results look like:

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

How can I instead get a list like this?

dog frump cad frank pawl andiron chase

Is there a way to do this using
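
A minimal sketch of the usual answer (NLTK 3 API, where Synset.name is a method; very old releases exposed it as a plain attribute): the synset name is a 'word.pos.nn' string, so splitting on the dot recovers just the headword.

    from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

    words = [s.name().split(".")[0] for s in wn.synsets("dog")]
    print(words)   # ['dog', 'frump', 'dog', 'cad', 'frank', 'pawl', 'andiron', 'chase']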

NLTK Stanford POS tagger error: Java command failed

北战南征 Submitted on 2019-12-30 04:22:08

Question: I'm trying to use the nltk.tag.stanford module for tagging a sentence (like the wiki's example), but I keep getting the following error:

Traceback (most recent call last):
  File "test.py", line 28, in <module>
    print st.tag(word_tokenize('What is the airspeed of an unladen swallow ?'))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tag/stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tag/stanford.py", line 81, in tag_sents
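
A sketch of the two usual fixes, assuming NLTK 3's StanfordPOSTagger (older releases called the class POSTagger; all paths below are placeholders): make sure NLTK can find a working java binary, and raise the JVM heap via java_options, since "Java command failed" is often an out-of-memory exit.

    import os
    from nltk.tag import StanfordPOSTagger
    from nltk.tokenize import word_tokenize

    os.environ["JAVAHOME"] = "/usr/bin/java"   # NLTK checks JAVAHOME/JAVA_HOME

    st = StanfordPOSTagger(
        "/path/to/english-bidirectional-distsim.tagger",   # placeholder path
        "/path/to/stanford-postagger.jar",                 # placeholder path
        java_options="-mx2g",   # default heap is small; raise it
    )
    print(st.tag(word_tokenize("What is the airspeed of an unladen swallow ?")))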

What to download in order to make nltk.tokenize.word_tokenize work?

若如初见. Submitted on 2019-12-30 02:50:08

Question: I am going to use nltk.tokenize.word_tokenize on a cluster where my account is very limited by a space quota. At home, I downloaded all NLTK resources with nltk.download(), but, as I found out, that takes ~2.5GB, which seems like overkill to me. Could you suggest the minimal (or almost minimal) set of dependencies for nltk.tokenize.word_tokenize? So far I've seen nltk.download('punkt'), but I am not sure whether it is sufficient, or what its size is. What exactly should I run in order to make it
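
A sketch of the minimal setup: word_tokenize needs only the Punkt sentence-tokenizer models, which are on the order of tens of megabytes rather than gigabytes (recent NLTK releases may additionally ask for 'punkt_tab'; the download_dir path below is a placeholder).

    import nltk

    nltk.download("punkt")   # tokenizer models only, not the full ~2.5GB collection
    # nltk.download("punkt", download_dir="/path/within/quota")  # placeholder path

    from nltk.tokenize import word_tokenize
    print(word_tokenize("Just punkt is enough for this call."))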

What is the connection or difference between a lemma and a synset in WordNet?

限于喜欢 Submitted on 2019-12-30 02:12:13

Question: I am a complete beginner in NLP and NLTK. I was not able to understand the exact difference between lemmas and synsets in WordNet, because both produce nearly the same output. For example, for the word cake they produce this output:

lemmas: [Lemma('cake.n.01.cake'), Lemma('patty.n.01.cake'), Lemma('cake.n.03.cake'), Lemma('coat.v.03.cake')]
synsets: [Synset('cake.n.01'), Synset('patty.n.01'), Synset('cake.n.03'), Synset('coat.v.03')]

Please help me understand this concept. Thank you.
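
A short sketch that makes the relationship visible: a synset is one sense, i.e. a set of near-synonymous words, while a lemma is one word form inside such a sense, so each synset owns one or more lemmas.

    from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

    for synset in wn.synsets("cake"):
        print(synset.name(), "->", [lemma.name() for lemma in synset.lemmas()])

    # e.g. patty.n.01 -> ['patty', 'cake']: the sense "patty" also contains
    # the word form "cake", which is why the lemma list for 'cake' includes
    # Lemma('patty.n.01.cake')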

Generating random sentences from custom text in Python's NLTK?

你说的曾经没有我的故事 Submitted on 2019-12-29 14:59:03

Question: I'm having trouble with NLTK under Python, specifically the .generate() method:

generate(self, length=100)
    Print random text, generated using a trigram language model.
    Parameters: length (int) - the length of text to generate (default=100)

Here is a simplified version of what I am attempting:

import nltk
words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print text.generate(3)

This will always generate Building ngram index..
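
A hand-rolled trigram sketch (not the internals of Text.generate()) that shows why a nine-word corpus can only replay itself: every bigram has exactly one recorded continuation, so there is nothing for the sampler to choose between.

    import random
    import nltk

    words = "The quick brown fox jumps over the lazy dog".split()

    model = {}
    for w1, w2, w3 in nltk.trigrams(words):
        model.setdefault((w1, w2), []).append(w3)   # continuations per bigram

    out = list(words[:2])
    while len(out) < len(words):
        nxt = model.get((out[-2], out[-1]))
        if not nxt:
            break
        out.append(random.choice(nxt))   # only one choice per bigram here
    print(" ".join(out))   # reproduces the input sentence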

Efficient Context-Free Grammar parser, preferably Python-friendly

浪子不回头ぞ Submitted on 2019-12-29 14:21:59

Question: I need to parse a small subset of English for one of my projects, described as a context-free grammar with (1-level) feature structures (example), and I need to do it efficiently. Right now I'm using NLTK's parser, which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending, it seems, on the number of resulting trees. Lexical
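
Before reaching for an external parser, one cheap experiment (a sketch with a toy grammar standing in for the question's ~450-rule one) is to time NLTK's different chart-parsing strategies, whose speed on ambiguous grammars can differ substantially.

    import time
    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        VP -> V NP
        NP -> Det N | 'John'
        Det -> 'the'
        N -> 'dog'
        V -> 'saw'
    """)
    sentence = "John saw the dog".split()

    # Compare bottom-up, top-down, and left-corner chart strategies
    for cls in (nltk.ChartParser, nltk.BottomUpChartParser, nltk.LeftCornerChartParser):
        start = time.time()
        trees = list(cls(grammar).parse(sentence))
        print("%s: %d tree(s) in %.4fs" % (cls.__name__, len(trees), time.time() - start))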