nltk

Separating nltk.FreqDist words into two lists?

泄露秘密 submitted on 2021-01-28 02:53:08
Question: I have a series of texts that are instances of a custom WebText class. Each text is an object that has a rating (-10 to +10) and a word count (nltk.FreqDist) associated with it:

    >>> trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
    >>> trainingTexts[1].rating
    10
    >>> trainingTexts[1].freq_dist
    <FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34,...>

How can you now get two lists (or dictionaries) containing every word used exclusively in the
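The question is cut off before it names the two groups, so the following is only a minimal sketch under a stated assumption: that the split is between positively and negatively rated texts. `Text` is a hypothetical stand-in for the asker's `WebText` class, and `collections.Counter` stands in for `nltk.FreqDist` (which subclasses `Counter` in NLTK 3), so the same set operations should carry over.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the question's WebText class: only the two
# attributes the question mentions are modelled here.
Text = namedtuple("Text", ["rating", "freq_dist"])


def exclusive_words(texts):
    """Return (only_positive, only_negative) word sets.

    Assumption: "positive" means rating > 0 and "negative" means rating < 0;
    the original question is truncated and may have meant a different split.
    """
    positive, negative = set(), set()
    for text in texts:
        if text.rating > 0:
            positive.update(text.freq_dist)  # iterating a Counter yields its keys
        elif text.rating < 0:
            negative.update(text.freq_dist)
    # Set difference keeps the words that never occur on the other side.
    return positive - negative, negative - positive


texts = [
    Text(10, Counter({"the": 60, "great": 3})),
    Text(-7, Counter({"the": 41, "awful": 2})),
]
only_pos, only_neg = exclusive_words(texts)
print(sorted(only_pos), sorted(only_neg))
```

Sets are used because the question asks only which words are exclusive; if counts are needed too, the resulting sets can be used to filter the original frequency distributions.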

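Natural Language Processing with Python, Chinese edition PDF

寵の児 submitted on 2021-01-10 06:46:21
Natural Language Processing with Python (reprint edition) offers a highly accessible introduction to natural language processing, a field that covers language technologies ranging from predictive text and email filtering to automatic summarization and translation. With this book you will learn to write Python programs that work with large collections of unstructured text. You will also access richly annotated datasets using a comprehensive range of linguistic data structures, and come to understand the main algorithms for analyzing the content and structure of written communication.

With its ample examples and exercises, Natural Language Processing with Python will help you: extract information from unstructured text, either to guess the topic or to identify "named entities"; analyze linguistic structure in text, including parsing and semantic analysis; access popular linguistic databases, including WordNet and treebanks; integrate techniques drawn from linguistics, artificial intelligence, and other fields.

The book will help you gain practical natural language processing skills using the Python programming language and the Natural Language Toolkit (NLTK). If you are interested in developing web applications, analyzing multilingual news sources, or documenting endangered languages, or if you simply want to see how human language works from a programmer's perspective, you will find Natural Language Processing with Python both fascinating and extremely useful.

Steven Bird is an associate professor in the Department of Computer Science and Software Engineering at the University of Melbourne and a senior research associate at the Linguistic Data Consortium at the University of Pennsylvania. Ewan Klein is a professor of language technology in the School of Informatics at the University of Edinburgh. Edward Loper recently received a PhD in machine learning for natural language processing from the University of Pennsylvania and is currently at BBN in Boston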

How to stem a pandas dataframe using nltk ? The output should be a stemmed dataframe

試著忘記壹切 submitted on 2021-01-07 03:12:55
Question: I'm trying to pre-process a dataset that contains text data. I have created a pandas DataFrame from it. How can I apply stemming to the DataFrame and get a stemmed DataFrame as output?

Answer 1: Given a pandas DataFrame df, you can stem its contents by tokenizing the words and then applying a stemming function to the whole df. As an example, I used the Snowball stemmer from nltk.

    from nltk.stem.snowball import SnowballStemmer
    englishStemmer=SnowballStemmer("english
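Since the answer's code is truncated, here is one possible way to finish the idea, offered as a sketch rather than the answerer's actual solution. It assumes every text cell can be tokenized by splitting on whitespace (a simplification, chosen so that no NLTK data download is needed), and the column names and sample values are made up for illustration.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def stem_text(value):
    """Stem every whitespace-separated token in a string; pass other values through."""
    if not isinstance(value, str):
        return value
    return " ".join(stemmer.stem(token) for token in value.split())


# Illustrative data; the column names are hypothetical.
df = pd.DataFrame({"title": ["running cats"], "body": ["dogs jumping"]})

# Series.map applies stem_text to each cell, one column at a time,
# and DataFrame.apply reassembles the columns into a new DataFrame.
stemmed_df = df.apply(lambda column: column.map(stem_text))
print(stemmed_df)
```

Going column by column with `Series.map` avoids depending on an element-wise DataFrame method whose name differs between pandas versions. Non-string cells such as numbers or NaN are returned unchanged.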

Select Constituents to parse tree representation

吃可爱长大的小学妹 submitted on 2021-01-05 07:24:10
Question: Suppose we have the following spans for a sentence:

    s = "Our intent is to promote the best alternative he says"
    spans = [(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]

I delete (0, 3) and (8, 10). I want to bracket the remaining spans like this:

    (((0 1 2) (3 (4 ((5 6 7) 8)))) 9 10)

where 0, 1, ..., 10 are the indices of the individual words of the sentence. For instance, if we were to remove ONLY "he says" and "Our intent is". Here, the span of "Our intent is" corresponds to (0, 3)
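One way to produce the bracketing shown in the question is sketched below. It rests on several assumptions read off the expected output rather than stated in the question: span ends are inclusive, the remaining spans are properly nested (no two spans cross), the indices run from 0 to 10, and the whole sentence gets one outer pair of brackets. The function name `bracket_spans` is made up for this sketch.

```python
from collections import Counter


def bracket_spans(spans, n_tokens):
    """Render token indices 0..n_tokens-1 with one bracket pair per span.

    Assumptions: spans are (start, end) pairs with an inclusive end, they are
    properly nested, and the full range is always wrapped in an outer pair.
    """
    all_spans = set(spans) | {(0, n_tokens - 1)}
    # Count how many spans open and close at each token index.
    opens = Counter(start for start, _ in all_spans)
    closes = Counter(end for _, end in all_spans)
    return " ".join(
        "(" * opens[i] + str(i) + ")" * closes[i] for i in range(n_tokens)
    )


spans = [(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]
kept = [s for s in spans if s not in {(0, 3), (8, 10)}]
print(bracket_spans(kept, 11))
# (((0 1 2) (3 (4 ((5 6 7) 8)))) 9 10)
```

Counting opens and closes per index is enough here because nested spans never need their brackets interleaved; with crossing spans the output would still have balanced parentheses but would no longer represent the spans that were given.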

How to ignore punctuation in-between words using word_tokenize in NLTK?

和自甴很熟 submitted on 2021-01-04 06:41:40
Question: I'm looking to ignore characters between words when using NLTK's word_tokenize. If I have a sentence:

    test = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email test@testing.com'

word_tokenize splits the S&P into 'S', '&', 'P', '?'. Is there a way to have this library ignore punctuation between words or letters? Expected output: 'S&P', '?'

Answer 1: Let me know how this works with your sentences. I added an additional test with a bunch of punctuation. The
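The answer is cut off before its code, so the sketch below is not the answerer's solution. It swaps `word_tokenize` for a plain regular expression from the standard library, on the assumption that "punctuation between words" means punctuation with word characters on both sides. The pattern and the function name are illustrative choices.

```python
import re

# A token is either a run of word characters that may be glued together by
# punctuation (S&P, 333-445-6635, test@testing.com), or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[^\w\s]+\w+)*|[^\w\s]")


def tokenize_keep_inner_punct(text):
    """Split text into tokens without breaking on punctuation inside a word."""
    return TOKEN_PATTERN.findall(text)


test = ('Should I trade on the S&P? This works with a phone number '
        '333-445-6635 and email test@testing.com')
tokens = tokenize_keep_inner_punct(test)
print(tokens[5:7])
# ['S&P', '?']
```

The same pattern could probably be passed to `nltk.tokenize.RegexpTokenizer` to stay inside NLTK. Note one trade-off: contractions such as "don't" also stay in one piece, whereas `word_tokenize` splits them.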

Problem with NLTK, collocations: too many values to unpack (expected 2)

跟風遠走 submitted on 2020-12-30 07:44:57
Question: I tried to retrieve collocations with NLTK, but I get an error. I used the built-in Gutenberg corpus. I wrote:

    alice = nltk.corpus.gutenberg.fileids()[7]
    al = nltk.corpus.gutenberg.words(alice)
    al_text = nltk.Text(al)
    al_text.collocations(25)

I got:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-16-a6905d575410> in <module>
    ----> 1 al_text.collocations(25)

    C:\ProgramData\Anaconda3\lib\site-packages\nltk\text
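The traceback is truncated, so the cause is not visible in this snippet. This error has been reported against one particular NLTK release (3.4.5), in which case upgrading NLTK may be enough. Independently of that, a possible workaround is to bypass `Text.collocations()` and use the collocation finder directly, as sketched below. The helper name `top_collocations`, the sample tokens, and the `min_freq` default are assumptions for illustration, and the sketch does not reproduce the stopword filtering that `Text.collocations()` applies.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures


def top_collocations(words, num=25, min_freq=2):
    """Return up to `num` bigrams (as 2-tuples) ranked by likelihood ratio."""
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(min_freq)  # drop bigrams seen fewer than min_freq times
    return finder.nbest(BigramAssocMeasures.likelihood_ratio, num)


# Tiny illustrative token list; in the question this would be `al`,
# the word list of the Gutenberg text.
words = "the white rabbit ran and the white rabbit hid from the red queen".split()
for w1, w2 in top_collocations(words, num=5):
    print(w1, w2)
```

Because `nbest` returns a list of 2-tuples, unpacking each result into `w1, w2` is safe here.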

WordNet - What does n and the number represent?

半世苍凉 submitted on 2020-12-29 13:14:21
Question: My question is related to the WordNet interface.

    >>> wn.synsets('cat')
    [Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]
    >>>

I could not find what the n and the number that follows it mean in cat.n.01 or caterpillar.n.02.

Answer 1: Per the NLTK docs, a <lemma>.<pos>.<number> Synset string
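The answer breaks off after naming the `<lemma>.<pos>.<number>` pattern. As an illustration of that pattern, the sketch below splits a synset name into its three parts using only string operations, so it runs without the WordNet corpus installed. The helper `parse_synset_name` and the `POS_NAMES` table are written for this example and are not part of NLTK; the table lists the part-of-speech codes WordNet uses.

```python
# WordNet part-of-speech codes as they appear in synset names.
POS_NAMES = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}


def parse_synset_name(name):
    """Split '<lemma>.<pos>.<number>' into (lemma, pos_name, sense_number)."""
    # Split from the right so a lemma that itself contains dots stays intact.
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, POS_NAMES.get(pos, pos), int(number)


for name in ["cat.n.01", "caterpillar.n.02", "vomit.v.01", "cat-o'-nine-tails.n.01"]:
    print(parse_synset_name(name))
```

Read this way, `cat.n.01` is the first noun sense listed under the lemma "cat", and `caterpillar.n.02` is the second noun sense listed under "caterpillar".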
