nltk

How can I create my own corpus in the Python Natural Language Toolkit? [duplicate]

巧了我就是萌 submitted on 2020-01-14 13:40:32
Question: This question already has answers here: Creating a new corpus with NLTK (3 answers). Closed 6 years ago.

I have recently expanded the names corpus in NLTK and would like to know how I can turn the two files I have (male.txt, female.txt) into a corpus, so I can access them using the existing nltk.corpus methods. Does anyone have any suggestions? Many thanks, James.

Answer 1: As the readme says, the names corpus is not in the public domain -- you should send an email with any changes you make to
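The question essentially comes down to pointing the right corpus reader at the edited files. A minimal sketch, assuming the two lists sit together in a local directory (the path below is hypothetical); WordListCorpusReader is the same reader class NLTK uses for the built-in names corpus:

from nltk.corpus.reader import WordListCorpusReader

# Point the reader at the directory holding the expanded files.
reader = WordListCorpusReader('/path/to/names_corpus', ['male.txt', 'female.txt'])

print(reader.fileids())               # ['male.txt', 'female.txt']
print(reader.words('male.txt')[:10])  # first ten names in male.txt

Because the reader exposes the usual .fileids() and .words() accessors, code written against nltk.corpus.names should work against it unchanged.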

Python NLTK multi threading

守給你的承諾、 submitted on 2020-01-14 12:43:29
Question: I am writing an algorithm which identifies sentences in a given text, splits each sentence into words, and returns these words after some validations. I want to implement the same with the help of multithreading. I'm calling my function, which deals with each sentence, in threading.Thread(), and it throws an error: AttributeError: 'WordListCorpusReader' object has no attribute '_LazyCorpusLoader__args'. However, there are a few blogs which suggest using the wn.ensure_loaded() function. But python
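The error arises because NLTK corpora are LazyCorpusLoader proxies that replace themselves with a real reader on first access, and that swap is not thread-safe. A minimal sketch of the commonly suggested fix, forcing the load in the main thread before any workers start (the worker function and the sample sentences are hypothetical):

import threading
from nltk.corpus import wordnet as wn

# Resolve the lazy proxy into a real corpus reader while still single-threaded.
wn.ensure_loaded()

def process(sentence):
    # Safe now: wn was fully loaded before the threads were spawned.
    print(sentence, wn.synsets(sentence.split()[0].lower())[:1])

threads = [threading.Thread(target=process, args=(s,))
           for s in ['dogs bark loudly', 'cats sleep all day']]
for t in threads:
    t.start()
for t in threads:
    t.join()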

Python: Tokenizing with phrases

感情迁移 submitted on 2020-01-14 07:55:10
Question: I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want tokenized as a single token, instead of the regular tokenization. For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006," and adding the phrase to the tokenizer "the
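NLTK ships a tokenizer for exactly this: MWETokenizer retokenizes an already tokenized stream, merging registered multi-word expressions into single tokens. A minimal sketch (lowercasing the input and using a space separator are illustrative assumptions; word_tokenize needs the punkt model downloaded):

from nltk.tokenize import MWETokenizer, word_tokenize

# Register the phrase as a tuple of its component tokens.
tokenizer = MWETokenizer([('the', 'west', 'wing')], separator=' ')

sentence = 'The West Wing is an American television serial drama created by Aaron Sorkin.'
print(tokenizer.tokenize(word_tokenize(sentence.lower())))
# ['the west wing', 'is', 'an', 'american', ...]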

The Earley Chart Parsing Algorithm

时光怂恿深爱的人放手 submitted on 2020-01-14 02:14:45
The Earley algorithm was proposed by Earley in 1970. It works in a manner similar to top-down parsing, but it can handle left recursion and does not require conversion to CNF (Chomsky Normal Form). The Earley algorithm fills in the chart from left to right.

Consider an example of parsing with the Earley chart parser:

>>> import nltk
>>> nltk.parse.earleychart.demo(print_times=False, trace=1, sent='I saw a dog', numparses=2)

Consider an example of parsing with the chart parser in NLTK:

>>> import nltk
>>> nltk.parse.chart.demo(2, print_times=False, trace=1, sent='John saw a dog', numparses=1)

Consider an example of parsing with the stepping chart parser in NLTK:

>>> import nltk
>>> nltk.parse.chart.demo(5, print_times=False, trace=1, sent='John saw a dog', numparses=2)

Let's look at the feature chart parsing code in NLTK:

>>> import nltk
>>> nltk.parse
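Beyond the bundled demos, the same parser can be driven directly with an explicit grammar. A minimal sketch (the toy CFG below is made up for illustration):

from nltk import CFG
from nltk.parse import EarleyChartParser

grammar = CFG.fromstring('''
S -> NP VP
NP -> 'I' | Det N
Det -> 'a'
N -> 'dog'
VP -> V NP
V -> 'saw'
''')

parser = EarleyChartParser(grammar)
for tree in parser.parse('I saw a dog'.split()):
    print(tree)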

using nltk regex example in scikit-learn CountVectorizer

徘徊边缘 submitted on 2020-01-14 02:05:08
Question: I was trying to use an example from the NLTK book for a regex pattern inside the CountVectorizer from scikit-learn. I see examples with simple regexes, but not with something like this:

pattern = r'''(?x)    # set flag to allow verbose regexps
    ([A-Z]\.)+        # abbreviations (e.g. U.S.A.)
  | \w+(-\w+)*        # words with optional internal hyphens
  | \$?\d+(\.\d+)?%?  # currency & percentages
  | \.\.\.            # ellipses
'''
text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
vectorizer =
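CountVectorizer applies token_pattern with re.findall, so a pattern containing capturing groups returns the groups rather than the whole match, which is exactly why patterns like this one misbehave there. One common workaround (a sketch, not necessarily the answer the thread settled on) is to bypass token_pattern and pass an NLTK regexp tokenizer via the tokenizer argument; the groups are made non-capturing for the same findall reason, and lowercase=False keeps capitals so the abbreviation branch can match:

import nltk
from sklearn.feature_extraction.text import CountVectorizer

pattern = r'''(?x)          # verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency & percentages
  | \.\.\.                  # ellipses
'''

vectorizer = CountVectorizer(
    tokenizer=lambda doc: nltk.regexp_tokenize(doc, pattern),
    token_pattern=None,     # silence the "token_pattern unused" warning
    lowercase=False,        # keep capitals so (?:[A-Z]\.)+ can fire
)
X = vectorizer.fit_transform(
    ['I love N.Y.C. 100% even with all of its traffic-ridden streets...'])
print(vectorizer.get_feature_names_out())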

NLTK Wordnet Synset for word phrase

房东的猫 submitted on 2020-01-13 10:08:20
Question: I'm working with the Python NLTK WordNet API. I'm trying to find the best synset that represents a group of words. If I need to find the best synset for something like "school & office supplies", I'm not sure how to go about this. So far I've tried finding the synsets for the individual words and then computing the best lowest common hypernym, like this:

def find_best_synset(category_name):
    text = word_tokenize(category_name)
    tags = pos_tag(text)
    node_synsets = []
    for word, tag in tags:
        pos =
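The pairwise step the question is reaching for exists directly on Synset as lowest_common_hypernyms(). A minimal sketch (taking the first noun sense of each word is a simplifying assumption; real code would need word-sense disambiguation):

from nltk.corpus import wordnet as wn

school = wn.synsets('school', pos=wn.NOUN)[0]
supply = wn.synsets('supply', pos=wn.NOUN)[0]

# The most specific ancestor synset the two senses share.
print(school.lowest_common_hypernyms(supply))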

How to install models/download packages on Google Colab?

丶灬走出姿态 submitted on 2020-01-13 05:13:59
Question: I am using the text analytics library spaCy. I've installed spaCy on a Google Colab notebook without any issue, but to use it I need to download the "en" model. Generally, that command looks like this: python -m spacy download en. I tried a few ways but I am not able to get it to install on the notebook. Looking for help. Cheers.

Answer 1: If you have a Python interpreter but not a terminal, you could try:

import spacy.cli
spacy.cli.download("en_core_web_sm")

More manual alternatives can be found
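In Colab specifically, a notebook cell can also run shell commands with a leading !, so the usual CLI form works too. A short sketch (the model name and the test sentence are just examples; if spacy.load fails right after downloading, restarting the runtime usually resolves it):

!python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load('en_core_web_sm')
print([(t.text, t.pos_) for t in nlp('This is a test.')])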

How to use NLTK to generate sentences from an induced grammar?

余生长醉 submitted on 2020-01-11 17:43:54
Question: I have a (large) list of parsed sentences (which were parsed using the Stanford parser). For example, the sentence "Now you can be entertained" has the following tree:

(ROOT (S (ADVP (RB Now)) (, ,) (NP (PRP you)) (VP (MD can) (VP (VB be) (VP (VBN entertained)))) (. .)))

I am using the set of sentence trees to induce a grammar using NLTK:

import nltk
# ... for each sentence tree t, add its productions to allProductions
allProductions += t.productions()
# Induce the grammar
S = nltk.Nonterminal
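The excerpt breaks off at the induction step. A sketch of how that pipeline is typically finished with nltk.induce_pcfg and NLTK's generate helper (the start symbol, depth cap, and sentence count are assumptions; generate ignores the induced probabilities and needs a depth limit here because the VP productions are recursive):

import nltk
from nltk.parse.generate import generate

trees = [nltk.Tree.fromstring(
    '(ROOT (S (ADVP (RB Now)) (, ,) (NP (PRP you)) '
    '(VP (MD can) (VP (VB be) (VP (VBN entertained)))) (. .)))')]

allProductions = []
for t in trees:
    allProductions += t.productions()

S = nltk.Nonterminal('ROOT')                  # the trees are rooted at ROOT
grammar = nltk.induce_pcfg(S, allProductions)

for sent in generate(grammar, depth=10, n=5): # depth cap avoids infinite recursion
    print(' '.join(sent))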