nltk

NLTK Context-Free Grammar Generation

﹥>﹥吖頭↗ Submitted on 2020-01-01 02:44:16

Question: I'm working on a non-English parser with Unicode characters. For that, I decided to use NLTK. But it requires a predefined context-free grammar, such as:

S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"

In my app, I am supposed to minimize hard-coding through the use of a rule-based grammar. For example, I can
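
A minimal sketch of one way to reduce the hard-coding (the lexicon dict, the Name nonterminal, and the structural/lexical split below are illustrative, not from the question): keep the terminals in ordinary data structures and assemble the grammar string at runtime.

    import nltk

    # Terminals kept as ordinary data instead of being written into the grammar
    lexicon = {
        "V": ["saw", "ate", "walked"],
        "Name": ["John", "Mary", "Bob"],
        "Det": ["a", "an", "the", "my"],
        "N": ["man", "dog", "cat", "telescope", "park"],
        "P": ["in", "on", "by", "with"],
    }

    structural = [
        "S -> NP VP",
        "VP -> V NP | V NP PP",
        "PP -> P NP",
        "NP -> Name | Det N | Det N PP",
    ]

    # One lexical rule per part of speech, e.g. V -> "saw" | "ate" | "walked"
    lexical = ['%s -> %s' % (pos, " | ".join('"%s"' % w for w in words))
               for pos, words in lexicon.items()]

    grammar = nltk.CFG.fromstring("\n".join(structural + lexical))
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("John saw a dog".split()):
        print(tree)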

Are there any classes in NLTK for text normalization and canonicalization?

≡放荡痞女 Submitted on 2019-12-31 08:29:37

Question: Most of the NLTK documentation and examples are devoted to lemmatization and stemming, but they are very sparse on such normalization matters as:

- converting all letters to lower or upper case
- removing punctuation
- converting numbers into words
- removing accent marks and other diacritics
- expanding abbreviations
- removing stopwords or "too common" words
- text canonicalization (tumor = tumour, it's = it is)

Please point me to where in NLTK to dig. Any NLTK equivalents (Java or any other) for
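
NLTK has no single normalizer class; a minimal sketch (Python 3) combines the piece NLTK does ship (stopword lists) with the standard library, covering case folding, diacritics, punctuation, and stopwords. Number-to-word conversion and abbreviation expansion need extra data or third-party packages and are left out here.

    import string
    import unicodedata
    from nltk.corpus import stopwords   # requires nltk.download('stopwords')

    def normalize(text):
        text = text.lower()                                     # case folding
        text = unicodedata.normalize("NFKD", text)              # split off accents
        text = "".join(c for c in text if not unicodedata.combining(c))
        text = text.translate(str.maketrans("", "", string.punctuation))
        stop = set(stopwords.words("english"))                  # "too common" words
        return [t for t in text.split() if t not in stop]

    print(normalize("It's the Tumour, naively speaking!"))
    # -> ['tumour', 'naively', 'speaking']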

How to output NLTK chunks to file?

泪湿孤枕 Submitted on 2019-12-31 00:03:49

Question: I have this Python script where I am using the nltk library to parse, tokenize, tag, and chunk some, let's say, random text from the web. I need to format and write to a file the output of chunked1, chunked2, and chunked3, which have the type class 'nltk.tree.Tree'. More specifically, I need to write only the lines that match the regular expressions chunkGram1, chunkGram2, and chunkGram3. How can I do that?

#! /usr/bin/python2.7
import nltk
import re
import codecs

xstring = ["An electronic library (also
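
A minimal, self-contained sketch of one approach (the sample sentence and the single chunk rule stand in for the question's chunkGram1-3): after chunking, walk the tree with Tree.subtrees(), keep only subtrees carrying the label the chunk rule assigned, and write those to a file.

    import nltk  # needs nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

    sentence = "An electronic library is a collection of digital objects"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Illustrative rule; substitute your own chunkGram1/2/3
    chunked = nltk.RegexpParser(r"Chunk: {<DT>?<JJ>*<NN.*>+}").parse(tagged)

    with open("chunks.txt", "w") as out:
        for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk"):
            out.write(" ".join(word for word, tag in subtree.leaves()) + "\n")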

How do I print out just the word itself in a WordNet synset using Python NLTK?

為{幸葍}努か Submitted on 2019-12-30 18:51:21

Question: Is there a way in Python 2.7, using NLTK, to get just the word and not the extra formatting that includes "Synset", the parentheses, the "n.01", etc.? For instance, if I do

wn.synsets('dog')

my results look like:

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

How can I instead get a list like this?

dog frump cad frank pawl andiron chase

Is there a way to do this using
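
A minimal sketch of the usual answer (NLTK 3 API, where Synset.name is a method; very old releases exposed it as a plain attribute): the synset name is a 'word.pos.nn' string, so splitting on the dot recovers just the headword.

    from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

    words = [s.name().split(".")[0] for s in wn.synsets("dog")]
    print(words)   # ['dog', 'frump', 'dog', 'cad', 'frank', 'pawl', 'andiron', 'chase']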

NLTK Stanford POS tagger error: Java command failed

北战南征 Submitted on 2019-12-30 04:22:08

Question: I'm trying to use the nltk.tag.stanford module for tagging a sentence (like the wiki's example), but I keep getting the following error:

Traceback (most recent call last):
  File "test.py", line 28, in <module>
    print st.tag(word_tokenize('What is the airspeed of an unladen swallow ?'))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tag/stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tag/stanford.py", line 81, in tag_sents
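
A sketch of the two usual fixes, assuming NLTK 3's StanfordPOSTagger (older releases called the class POSTagger; all paths below are placeholders): make sure NLTK can find a working java binary, and raise the JVM heap via java_options, since "Java command failed" is often an out-of-memory exit.

    import os
    from nltk.tag import StanfordPOSTagger
    from nltk.tokenize import word_tokenize

    os.environ["JAVAHOME"] = "/usr/bin/java"   # NLTK checks JAVAHOME/JAVA_HOME

    st = StanfordPOSTagger(
        "/path/to/english-bidirectional-distsim.tagger",   # placeholder path
        "/path/to/stanford-postagger.jar",                 # placeholder path
        java_options="-mx2g",   # default heap is small; raise it
    )
    print(st.tag(word_tokenize("What is the airspeed of an unladen swallow ?")))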

What to download in order to make nltk.tokenize.word_tokenize work?

若如初见. Submitted on 2019-12-30 02:50:08

Question: I am going to use nltk.tokenize.word_tokenize on a cluster where my account is very limited by a space quota. At home, I downloaded all NLTK resources with nltk.download(), but, as I found out, that takes ~2.5GB, which seems like overkill to me. Could you suggest the minimal (or almost minimal) set of dependencies for nltk.tokenize.word_tokenize? So far I've seen nltk.download('punkt'), but I am not sure whether it is sufficient, or what its size is. What exactly should I run in order to make it
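
A sketch of the minimal setup: word_tokenize needs only the Punkt sentence-tokenizer models, which are on the order of tens of megabytes rather than gigabytes (recent NLTK releases may additionally ask for 'punkt_tab'; the download_dir path below is a placeholder).

    import nltk

    nltk.download("punkt")   # tokenizer models only, not the full ~2.5GB collection
    # nltk.download("punkt", download_dir="/path/within/quota")  # placeholder path

    from nltk.tokenize import word_tokenize
    print(word_tokenize("Just punkt is enough for this call."))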

What is the connection or difference between a lemma and a synset in WordNet?

限于喜欢 Submitted on 2019-12-30 02:12:13

Question: I am a complete beginner in NLP and NLTK. I was not able to understand the exact difference between lemmas and synsets in WordNet, because both produce nearly the same output. For example, for the word cake they produce this output:

lemmas: [Lemma('cake.n.01.cake'), Lemma('patty.n.01.cake'), Lemma('cake.n.03.cake'), Lemma('coat.v.03.cake')]
synsets: [Synset('cake.n.01'), Synset('patty.n.01'), Synset('cake.n.03'), Synset('coat.v.03')]

Please help me understand this concept. Thank you.
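
A short sketch that makes the relationship visible: a synset is one sense, i.e. a set of near-synonymous words, while a lemma is one word form inside such a sense, so each synset owns one or more lemmas.

    from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

    for synset in wn.synsets("cake"):
        print(synset.name(), "->", [lemma.name() for lemma in synset.lemmas()])

    # e.g. patty.n.01 -> ['patty', 'cake']: the sense "patty" also contains
    # the word form "cake", which is why the lemma list for 'cake' includes
    # Lemma('patty.n.01.cake')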

Generating random sentences from custom text in Python's NLTK?

你说的曾经没有我的故事 Submitted on 2019-12-29 14:59:03

Question: I'm having trouble with NLTK under Python, specifically the .generate() method:

generate(self, length=100)
    Print random text, generated using a trigram language model.
    Parameters: length (int) - the length of text to generate (default=100)

Here is a simplified version of what I am attempting:

import nltk
words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print text.generate(3)

This will always generate Building ngram index..
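
A hand-rolled trigram sketch (not the internals of Text.generate()) that shows why a nine-word corpus can only replay itself: every bigram has exactly one recorded continuation, so there is nothing for the sampler to choose between.

    import random
    import nltk

    words = "The quick brown fox jumps over the lazy dog".split()

    model = {}
    for w1, w2, w3 in nltk.trigrams(words):
        model.setdefault((w1, w2), []).append(w3)   # continuations per bigram

    out = list(words[:2])
    while len(out) < len(words):
        nxt = model.get((out[-2], out[-1]))
        if not nxt:
            break
        out.append(random.choice(nxt))   # only one choice per bigram here
    print(" ".join(out))   # reproduces the input sentence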

Efficient Context-Free Grammar parser, preferably Python-friendly

浪子不回头ぞ Submitted on 2019-12-29 14:21:59

Question: I need to parse a small subset of English for one of my projects, described as a context-free grammar with (1-level) feature structures (example), and I need to do it efficiently. Right now I'm using NLTK's parser, which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending, it seems, on the number of resulting trees. Lexical
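
Before reaching for an external parser, one cheap experiment (a sketch with a toy grammar standing in for the question's ~450-rule one) is to time NLTK's different chart-parsing strategies, whose speed on ambiguous grammars can differ substantially.

    import time
    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        VP -> V NP
        NP -> Det N | 'John'
        Det -> 'the'
        N -> 'dog'
        V -> 'saw'
    """)
    sentence = "John saw the dog".split()

    # Compare bottom-up, top-down, and left-corner chart strategies
    for cls in (nltk.ChartParser, nltk.BottomUpChartParser, nltk.LeftCornerChartParser):
        start = time.time()
        trees = list(cls(grammar).parse(sentence))
        print("%s: %d tree(s) in %.4fs" % (cls.__name__, len(trees), time.time() - start))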