nltk

NLP 18.2: NLTK Named Entity Recognition

Submitted by 六月ゝ 毕业季﹏ on 2020-01-18 05:38:26
Reference: http://blog.csdn.net/u010718606/article/details/50148261

NLTK offers out-of-the-box APIs for many natural language processing tasks, but the results can leave you puzzled. The example below uses NLTK for named entity recognition. In the first sentence, "Apple" is successfully recognized; in the second it is not. What causes this difference? Let's find out.

In [1]: import nltk
In [2]: tokens = nltk.word_tokenize('I am very excited about the next generation of Apple products.')
In [3]: tokens = nltk.pos_tag(tokens)
In [4]: print(tokens)
[('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), (

NLP 22: Wordnet with NLTK

Submitted by 跟風遠走 on 2020-01-18 04:38:58
Wordnet with NLTK: synonym and antonym functions for English.

# -*- coding: utf-8 -*-
"""
Spyder Editor
Synonym and antonym functions for English
"""
import nltk
from nltk.corpus import wordnet

syns = wordnet.synsets('program')

syns
Out[11]:
[Synset('plan.n.01'),
 Synset('program.n.02'),
 Synset('broadcast.n.02'),
 Synset('platform.n.02'),
 Synset('program.n.05'),
 Synset('course_of_study.n.01'),
 Synset('program.n.07'),
 Synset('program.n.08'),
 Synset('program.v.01'),

How to convert token list into wordnet lemma list using nltk?

Submitted by 淺唱寂寞╮ on 2020-01-17 15:02:16
Question: I have a list of tokens extracted from a PDF source. I am able to preprocess the text and tokenize it, but I want to loop through the tokens and convert each token in the list to its lemma in the WordNet corpus. My token list looks like this:

['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]

There are no lemmas for tokens like 'Everyone', '0000', 'Þ' and many more, which I need to eliminate. But for words like 'age',

NLTK Wordnet Download Out of Date

Submitted by 佐手、 on 2020-01-17 03:44:12
Question: New to Python, trying to get started with NLTK. After a rough time installing Python on my Windows 7 64-bit system, I am now having a rough time downloading WordNet and the other NLTK data packages located here: http://nltk.org/nltk_data/ Some packages download; some say "Out of Date".

import nltk
nltk.download()

When I use the above to download, the program doesn't let me cancel when I hit the cancel button, so I just shut it down and go directly to the link above to try to download the data manually.
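Rather than the Tk GUI that `nltk.download()` with no arguments opens (the window that hangs on cancel above), individual packages can be fetched non-interactively; a sketch, assuming network access to nltk_data:

```python
import nltk

# Fetch a single package without the GUI; returns True on success,
# False on failure (e.g. no network or unknown package id).
ok = nltk.download('wordnet', quiet=True)
print('wordnet downloaded:', ok)
```

The equivalent from a shell is `python -m nltk.downloader wordnet`, which avoids the GUI entirely.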

Troubleshooting: [nltk_data] Error loading brown: urlopen error [Errno 111] Connection refused

Submitted by 北城余情 on 2020-01-16 20:14:22
1. Error message
Error location:
if nltk.download('brown'):  # download the specified corpus from the NLTK data site
Error message: [nltk_data] Error loading brown: <urlopen error [Errno 111] Connection refused> -- in other words, the connection to the download page was refused.

2. Cause
Step one: searching online for [Errno 111], some answers blamed a proxy or VPN on the machine, but that did not apply to my computer.
Step two: I tried downloading the data manually (see http://www.nltk.org/data.html#installing-via-a-proxy-web-server), which also failed, but this revealed the problem. Clicking the lock icon with the red slash to the left of the address bar, then the arrow, then "More Information" showed the cause: insufficient page permissions.

3. Fix
The fix is to change the page permissions: set both "Install add-ons" and "Open pop-up windows" to Allow (see the Baidu guide on configuring trusted sites in Firefox). After that change, rerunning the program downloads normally.

Source: CSDN Author: lyumoon Link: https://blog.csdn.net/dengzhuo8077
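When the download site stays unreachable, an alternative to the browser-permission fix above is to unpack the data by hand and point NLTK's search path at it. A sketch, where `~/my_nltk_data` is a hypothetical directory holding manually extracted packages (e.g. corpora/brown):

```python
import os
import nltk

# Hypothetical directory containing manually unpacked nltk_data folders
# such as corpora/, taggers/, tokenizers/.
custom_dir = os.path.expanduser('~/my_nltk_data')
os.makedirs(custom_dir, exist_ok=True)

# NLTK searches every entry in nltk.data.path when loading a resource,
# so appending the directory makes the manual copy visible.
if custom_dir not in nltk.data.path:
    nltk.data.path.append(custom_dir)

print(nltk.data.path[-1])
```

This requires no network access at runtime; the data only needs to be copied into place once.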

What is the difference between TfidfVectorizer and TfidfTransformer?

Submitted by 跟風遠走 on 2020-01-16 19:12:24
Question: I know that the formula for the tf-idf vectorizer is

(count of word / total count) * log(number of documents / number of documents where the word is present)

I saw there's a TfidfTransformer in scikit-learn, and I just wanted to know the difference between them. I couldn't find anything helpful.

Answer 1: TfidfVectorizer is used on sentences, while TfidfTransformer is used on an existing count matrix, such as one returned by CountVectorizer.

Answer 2: Artem's answer pretty much sums up the difference. To make things
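The relationship described in Answer 1 can be verified directly: with default settings, TfidfVectorizer on raw text and TfidfTransformer applied to CountVectorizer's output produce the same matrix.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ['the cat sat', 'the cat sat on the mat']

# One step: raw text -> tf-idf matrix.
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> count matrix -> tf-idf matrix.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# With identical defaults (same tokenization, norm='l2', smooth_idf=True)
# the two routes agree element-wise.
diff = abs(tfidf_direct.toarray() - tfidf_two_step.toarray()).max()
print(diff)
```

So TfidfVectorizer is essentially CountVectorizer followed by TfidfTransformer; the transformer exists for when you already have (or want to reuse) a count matrix.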

How to run naive Bayes from NLTK with Python Pandas?

Submitted by 人盡茶涼 on 2020-01-15 12:16:07
Question: I have a CSV file with a feature (people's names) and a label (people's ethnicities). I am able to set up the data frame using Python Pandas, but when I try to link that with the NLTK module to run a naive Bayes classifier, I get the following error:

Traceback (most recent call last):
  File "C:\Users\Desktop\file.py", line 19, in <module>
    classifier = nbc.train(train_set)
  File "E:\Program Files Extra\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
    for fname, fval in featureset.items()
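The `.items()` failure in the traceback typically means the training examples are not (feature_dict, label) pairs. A minimal sketch with hypothetical data standing in for the CSV (the column names and features are illustrative, not from the original question):

```python
import pandas as pd
from nltk.classify import NaiveBayesClassifier

# Hypothetical stand-in for the CSV: name -> ethnicity label.
df = pd.DataFrame({
    'name': ['Anna', 'Boris', 'Chen', 'Dmitri'],
    'ethnicity': ['A', 'B', 'C', 'B'],
})

def features(name):
    # Each training example must be a (dict, label) pair; passing a bare
    # string where the dict belongs is what triggers the .items() error.
    return {'last_letter': name[-1].lower(), 'length': len(name)}

train_set = [(features(row['name']), row['ethnicity'])
             for _, row in df.iterrows()]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(features('Elena')))
```

`NaiveBayesClassifier.train` needs no downloaded corpora, so the bridge from pandas is just the list comprehension that builds `train_set`.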

Word tokenizing from a list of words in Python?

Submitted by 天涯浪子 on 2020-01-15 09:36:30
Question: My program has a list of words, and within it I need a few specific phrases to be tokenized as single words. My program splits a string into words, e.g.

str = "hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single word."

The output will be:

list = ['hello', 'my', 'name', 'is', 'vishal', 'can', 'you', 'please', 'help', 'me', 'with', 'the', 'red', 'blood', 'cells', 'and', 'platelet', 'count', 'the', 'white', 'blood', 'cell', 'is', 'a', 'single', 'word']

Now I
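NLTK's MWETokenizer handles exactly this: it re-merges listed multi-word expressions after ordinary tokenization, and it needs no downloaded data. A sketch using the phrases from the question:

```python
from nltk.tokenize import MWETokenizer

# Multi-word expressions to keep as single tokens; separator=' ' joins
# them with a space instead of the default underscore.
mwe = MWETokenizer([('red', 'blood', 'cells'),
                    ('white', 'blood', 'cell'),
                    ('platelet', 'count')], separator=' ')

words = ('hello my name is vishal can you please help me with the '
         'red blood cells and platelet count').split()
print(mwe.tokenize(words))
```

The tokenizer leaves every other word untouched, so it can be dropped in after whatever splitting step the program already uses.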

Definition of the CESS_ESP tags

Submitted by 狂风中的少年 on 2020-01-14 14:31:53
Question: I'm using the NLTK CESS_ESP data package, and I've been able to use an adaptation of the spaghetti tagger and a HiddenMarkovModelTagger to POS-tag sentences. However, the tags it produces are not at all like the ones used when tagging en_US sentences. Here's a link to the Categorizing and Tagging documentation for NLTK; you'll notice that the tags used there are uppercase and don't contain numbers or punctuation. Some CESS tags: vsip3s0, da0fs0. Does someone know a reference that
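The CESS corpora use EAGLES-style morphological tags, where the first character encodes the coarse part of speech and later positions encode features such as mood, tense, person, number and gender. A small illustrative decoder for the first position only (the category mapping is taken from the EAGLES tagset convention and should be checked against the official table):

```python
# First character of an EAGLES tag -> coarse part of speech.
# Later positions are category-specific morphological features.
EAGLES_CATEGORY = {
    'a': 'adjective', 'c': 'conjunction', 'd': 'determiner',
    'f': 'punctuation', 'i': 'interjection', 'n': 'noun',
    'p': 'pronoun', 'r': 'adverb', 's': 'adposition',
    'v': 'verb', 'z': 'number', 'w': 'date',
}

def coarse_pos(tag):
    """Decode only the coarse category from an EAGLES-style tag."""
    return EAGLES_CATEGORY.get(tag[0].lower(), 'unknown')

print(coarse_pos('vsip3s0'))  # verb
print(coarse_pos('da0fs0'))   # determiner
```

This is enough to map CESS output onto coarse classes comparable with the uppercase en_US tags; decoding the remaining positions requires the full EAGLES tables for each category.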