nltk

Extracting Related Date and Location from a sentence

魔方 西西 submitted on 2020-08-10 22:58:52
Question: I'm working with written text (paragraphs of articles and books) that includes both locations and dates. I want to extract from the texts pairs of locations and dates that are associated with one another. For example, given the phrase "The man left Amsterdam on January and reached Nepal on October 21st", I would like an output such as: [(Amsterdam, January), (Nepal, October 21st)]. I tried splitting the text on "connecting words" (such as "and") and …
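A hedged sketch of one possible approach (not the asker's attempt): use NLTK's named-entity chunker to find location entities and a simple regular expression for date phrases, then pair them up in order of appearance. The regex and the pairing-by-order heuristic are illustrative assumptions, and the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words resources must be downloaded first.

import re
import nltk

# One-time NLTK resource downloads (uncomment on first run):
# for r in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
#     nltk.download(r)

sentence = "The man left Amsterdam on January and reached Nepal on October 21st"

# Locations via NLTK's NE chunker (entity labels such as GPE/LOCATION).
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
locations = [" ".join(word for word, tag in subtree.leaves())
             for subtree in tree.subtrees()
             if subtree.label() in ("GPE", "LOCATION")]

# Dates via a crude pattern: a month name, optionally followed by a day like "21st".
months = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
dates = re.findall(r"(?:%s)(?:\s+\d{1,2}(?:st|nd|rd|th)?)?" % months, sentence)

# Pair by order of appearance (a heuristic; assumes one date per location).
pairs = list(zip(locations, dates))
print(pairs)  # expected, if the chunker tags both cities: [('Amsterdam', 'January'), ('Nepal', 'October 21st')]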

Mining the Social Web (社交网站的数据挖掘与分析), PDF edition | cloud-drive download link with extraction code

魔方 西西 submitted on 2020-08-10 13:44:02
Click here for the cloud-drive download link. Extraction code: btqx. About the author: Matthew A. Russell, Vice President of Technology at Digital Reasoning Systems and principal at Zaffra, is a computer scientist passionate about data mining, open source, and web application technologies. He is also the author of Dojo: The Definitive Guide (O'Reilly). Connect with him on LinkedIn or follow @ptwobrussell on Twitter to keep up with his latest work. Details: Publisher: China Machine Press (机械工业出版社); ISBN: 9787111369608; Edition: 1; Item code: 10922249; Imprint: 机工出版; Binding: paperback; Series: O'Reilly; Format: 16mo; Publication date: 2012-02-01; Paper: offset; Pages: 316. Table of contents: Preface. Chapter 1, Introduction: working with Twitter data — installing Python development tools, collecting and processing Twitter data, summary. Chapter 2, Microformats: semantic markup and common sense collide — XFN and friends, exploring social connections with XFN, geocoordinates: a common thread of interests, cross-analyzing recipes (in the name of health), collecting restaurant reviews, summary. Chapter 3, Mailboxes: old-fashioned but effective — mbox: the entry-level Unix mailbox, mbox + CouchDB = relaxed email analysis, threading conversations together, visualizing mail "events" with SIMILE Timeline …

cs224u Natural Language Inference: Tasks and Datasets, Part 3

生来就可爱ヽ(ⅴ<●) submitted on 2020-08-10 08:08:51
cs224u Natural Language Inference: Tasks and Datasets, Part 3. nli_01_task_and_data.ipynb __author__ = "Christopher Potts" __version__ = "CS224u, Stanford, Fall 2020". Contents: the NLIExample class; labels; tree representations; the annotated MultiNLI subset; other NLI datasets. The NLIExample class: all of the readers have a read method that yields NLIExample instances, which have the following attributes:
- annotator_labels: list of str
- captionID: str
- gold_label: str
- pairID: str
- sentence1: str
- sentence1_binary_parse: nltk.tree.Tree
- sentence1_parse: nltk.tree.Tree
Source: oschina Link: https://my.oschina.net/u/4406332/blog/4467703
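As a minimal, self-contained illustration of what such an example object holds (the course's own reader does more; the field names follow the published SNLI/MultiNLI jsonl format, the toy record is invented, and the dummy-label trick for the unlabeled binary parse is an assumption about one reasonable way to build an nltk Tree from it):

import json
import re
from nltk.tree import Tree

class SimpleNLIExample:
    """Toy stand-in for NLIExample: wraps one SNLI/MultiNLI jsonl record."""
    def __init__(self, d):
        self.annotator_labels = d["annotator_labels"]   # list of str
        self.gold_label = d["gold_label"]                # str
        self.pairID = d["pairID"]                        # str
        self.sentence1 = d["sentence1"]                  # str
        # The labeled parse string can be read directly into an nltk Tree.
        self.sentence1_parse = Tree.fromstring(d["sentence1_parse"])
        # The binary parse has no node labels, so insert a dummy "X" label
        # before parsing (an illustrative choice, not the course's exact code).
        self.sentence1_binary_parse = Tree.fromstring(
            re.sub(r"\(", "(X", d["sentence1_binary_parse"]))

record = ('{"annotator_labels": ["entailment"], "gold_label": "entailment", '
          '"pairID": "toy-1", "sentence1": "A dog runs.", '
          '"sentence1_binary_parse": "( ( A dog ) runs. )", '
          '"sentence1_parse": "(ROOT (S (NP (DT A) (NN dog)) (VP (VBZ runs)) (. .)))"}')

ex = SimpleNLIExample(json.loads(record))
print(ex.gold_label, ex.sentence1_parse.label(), ex.sentence1_binary_parse.leaves())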

nltk.probability.FreqDist: tallying the frequency distribution of words in a corpus

折月煮酒 submitted on 2020-08-07 07:03:03
Tallying the frequency distribution of words in a corpus. Methods and descriptions:
- fdist = FreqDist(samples): create a frequency distribution over the given samples (samples can be an nltk.text.Text, a whitespace-separated string, a list, and so on)
- fdist.inc(sample): increment the count for a sample
- fdist[word]: number of times word occurs in the samples
- fdist.freq(word): relative frequency of word in the samples
- fdist.N(): total number of samples
- fdist.keys(): the samples as a list
- for sample in fdist: iterate over the samples in decreasing order of frequency
- fdist.max(): the sample with the highest count
- fdist.plot(): plot the frequency distribution
- fdist.plot(cumulative=True): plot the cumulative frequency distribution
>>> fdist = FreqDist(text1)
>>> fdist.plot(50, cumulative=True)
Source: oschina Link: https://my.oschina.net/u/4397718/blog/4284388
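A short, self-contained sketch of the same API on a toy token list. This assumes NLTK 3, where FreqDist also behaves like collections.Counter and the older fdist.inc() method is no longer available, so plain indexing is used instead:

from nltk.probability import FreqDist

tokens = "the quick brown fox jumps over the lazy dog the fox".split()

fdist = FreqDist(tokens)
print(fdist["the"])          # count of "the" -> 3
print(fdist.freq("the"))     # relative frequency -> 3/11
print(fdist.N())             # total number of samples -> 11
print(fdist.max())           # most frequent sample -> "the"
print(fdist.most_common(3))  # Counter-style top-3 in NLTK 3
fdist["the"] += 1            # NLTK 3 replacement for the old fdist.inc("the")
# fdist.plot(10, cumulative=True)  # uncomment to draw the cumulative plot (needs matplotlib)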

How to invert the lemmatization process, given a lemma and a token?

人走茶凉 submitted on 2020-08-06 12:45:45
Question: Generally, in natural language processing, we want to get the lemma of a token. For example, we can map 'eaten' to 'eat' using WordNet lemmatization. Are there any tools in Python that can do the inverse, mapping a lemma to a specific form? For example, mapping 'go' to 'gone' given the target form 'eaten'. PS: someone mentioned that we would have to store such mappings; see "How to un-stem a word in Python?" Answer 1: Turning a base form such as a lemma into a situation-appropriate form is called realization (or "surface realization").
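One possible route, sketched under the assumption that the third-party lemminflect package (pip install lemminflect) is acceptable; it is not part of NLTK, and Penn Treebank tags are used to select the target form:

from lemminflect import getInflection  # assumed third-party package, not NLTK

# VBN = past participle, VBD = simple past, VBZ = 3rd-person singular present.
print(getInflection("go", tag="VBN"))   # expected: ('gone',)
print(getInflection("eat", tag="VBN"))  # expected: ('eaten',)
print(getInflection("run", tag="VBZ"))  # expected: ('runs',)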

cs224u assignment: relation extraction with distant supervision, Part 3

六眼飞鱼酱① submitted on 2020-07-29 10:35:26
cs224u assignment: relation extraction with distant supervision, Part 3. hw_rel_ext.ipynb __author__ = "Bill MacCartney and Christopher Potts" __version__ = "CS224u, Stanford, Spring 2020". Contents: a prototype system. A prototype system: there are many choices here, and this assignment can easily grow into a full project. Some suggestions:
- Try different classifier models, built from sklearn and elsewhere.
- Add a feature that represents the length of the middle (the words between the two entity mentions).
- Augment the bag-of-words representation to include bigrams or trigrams rather than just unigrams (a minimal sketch appears after this entry).
- Entity-based features.
- Experiment with features based on the context of the two mentions (not the middle), that is, the words before the first mention or after the second.
- Try adding features that capture syntactic information, such as the dependency-path features used by Mintz et al.; the NLTK toolkit contains a variety of parsing algorithms that may help.
- The bag-of-words representation does not generalize across word categories such as person names, locations, or company names; GloVe word embeddings can help here.
#1. try on stacking existing featurizer featurizers_1 …
Source: oschina Link: https://my.oschina.net/u/4355739/blog/4443776
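A minimal, self-contained sketch of the bigram (plus middle-length) idea. The middle_text argument and the function shape are illustrative assumptions, not the course's exact featurizer interface, which works over a knowledge-base triple, a corpus, and a feature counter:

from collections import Counter

def middle_bigram_featurizer(middle_text, feature_counter=None):
    """Count unigrams and bigrams of the words between the two entity mentions."""
    if feature_counter is None:
        feature_counter = Counter()
    words = middle_text.split()
    for w in words:                          # unigram features
        feature_counter[w] += 1
    for w1, w2 in zip(words, words[1:]):     # bigram features
        feature_counter[w1 + " " + w2] += 1
    feature_counter["middle_length={}".format(len(words))] += 1  # middle-length feature
    return feature_counter

print(middle_bigram_featurizer("was born in"))
# Counter({'was': 1, 'born': 1, 'in': 1, 'was born': 1, 'born in': 1, 'middle_length=3': 1})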

Why is the number of stems from the NLTK stemmer output different from the expected output?

穿精又带淫゛_ submitted on 2020-07-23 06:42:03
Question: I have to perform stemming on a text. The tasks are as follows:
1. Tokenize all the words given in tc. A word should contain only letters, digits, or underscores. Store the tokenized list of words in tw.
2. Convert all the words to lowercase. Store the result in the variable tw.
3. Remove all the stop words from the unique set of tw. Store the result in the variable fw.
4. Stem each word present in fw with PorterStemmer, and store the result in the list psw.
Below is my code: import re import …
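A hedged sketch of the four steps (this is not the asker's truncated code; the contents of tc are invented here, and the NLTK stopwords corpus must be downloaded first):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("stopwords")  # one-time download of the stop-word list

tc = "The quick brown foxes were jumping over two lazy_dogs and running fast"

tw = re.findall(r"\w+", tc)                  # 1. tokens of letters, digits, or underscores
tw = [w.lower() for w in tw]                 # 2. lowercase
stop = set(stopwords.words("english"))
fw = [w for w in set(tw) if w not in stop]   # 3. unique tokens with stop words removed
stemmer = PorterStemmer()
psw = [stemmer.stem(w) for w in fw]          # 4. Porter-stemmed words
print(psw)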
