nlp

nltk StanfordNERTagger : NoClassDefFoundError: org/slf4j/LoggerFactory (In Windows)

Submitted by 自闭症网瘾萝莉.ら on 2019-12-18 02:47:42
Question: NOTE: I am using Python 2.7 as part of the Anaconda distribution. I hope this is not a problem for nltk 3.1. I am trying to use nltk for NER as import nltk from nltk.tag.stanford import StanfordNERTagger #st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar') st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') print st.tag(str) but I get Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory at edu.stanford.nlp …
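The missing class `org/slf4j/LoggerFactory` ships in slf4j's `slf4j-api.jar`, which newer Stanford NER jars depend on but do not bundle; one common remedy is to put that jar on the `CLASSPATH` next to `stanford-ner.jar` before constructing the tagger. A minimal sketch, with hypothetical Windows paths that you would adjust to your own install locations:

```python
import os

# Hypothetical paths -- adjust to where you unpacked Stanford NER and
# where you downloaded slf4j-api.jar (which provides org/slf4j/LoggerFactory).
STANFORD_DIR = r'C:\stanford-ner-2015-12-09'
SLF4J_JAR = r'C:\libs\slf4j-api.jar'

# nltk's Stanford wrappers read the CLASSPATH environment variable when
# they launch the Java subprocess, so both jars must be listed there.
os.environ['CLASSPATH'] = os.pathsep.join(
    [os.path.join(STANFORD_DIR, 'stanford-ner.jar'), SLF4J_JAR])

model = os.path.join(STANFORD_DIR, 'classifiers',
                     'english.all.3class.distsim.crf.ser.gz')
if os.path.exists(model):  # actually tagging also needs Java on the PATH
    from nltk.tag.stanford import StanfordNERTagger
    st = StanfordNERTagger(model)
    print(st.tag('Rami Eid is studying at Stony Brook University'.split()))
```

An alternative, if you cannot change the classpath, is to use an older Stanford NER release that predates the slf4j dependency.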

NLTK: corpus-level BLEU vs sentence-level BLEU score

Submitted by 梦想与她 on 2019-12-18 02:42:59
Question: I have imported nltk in Python to calculate the BLEU score on Ubuntu. I understand how the sentence-level BLEU score works, but I don't understand how the corpus-level BLEU score works. Below is my code for the corpus-level BLEU score: import nltk hypothesis = ['This', 'is', 'cat'] reference = ['This', 'is', 'a', 'cat'] BLEUscore = nltk.translate.bleu_score.corpus_bleu([reference], [hypothesis], weights = [1]) print(BLEUscore) For some reason, the BLEU score is 0 for the above code. I was expecting a corpus …
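The likely culprit is the nesting of the references: `corpus_bleu` expects one *list of references* per hypothesis, i.e. a list of lists of token lists. Passing `[reference]` instead of `[[reference]]` makes NLTK treat each word string as its own reference, so no n-gram matches are found and the score collapses to 0. A sketch of the corrected call:

```python
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu

hypothesis = ['This', 'is', 'cat']
reference = ['This', 'is', 'a', 'cat']

# Note the double brackets: [[reference]] means "hypothesis 0 has one
# reference, which is this token list".
score = corpus_bleu([[reference]], [hypothesis], weights=[1])

# Unigram precision is 3/3 = 1.0; the brevity penalty exp(1 - 4/3)
# then gives roughly 0.7165.
print(score)

# With a single sentence pair this matches sentence_bleu:
print(sentence_bleu([reference], hypothesis, weights=[1]))
```

With only one hypothesis the corpus-level and sentence-level scores coincide; they diverge on multiple sentences because `corpus_bleu` pools n-gram counts across the whole corpus rather than averaging per-sentence scores.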

Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format

Submitted by 故事扮演 on 2019-12-18 02:42:39
Question: I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce results like this: [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')] Is it possible to chunk things together using it? What I want is this: u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION' Thanks! Answer 1: You can use the standard NLTK way of representing chunks using nltk.Tree . This might mean that you …
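Grouping the flat (token, tag) output into entity chunks can be done by merging contiguous runs that share the same non-'O' tag. A minimal sketch using only the standard library (the answer's `nltk.Tree` approach would wrap each `words` list in `Tree(tag, words)` instead of a tuple):

```python
from itertools import groupby

tagged = [('Remaking', 'O'), ('The', 'O'),
          ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]

def chunk(tagged_tokens):
    """Merge contiguous tokens with the same non-'O' NER tag."""
    chunks = []
    # groupby collapses consecutive items whose tag (tok[1]) is equal.
    for tag, group in groupby(tagged_tokens, key=lambda tok: tok[1]):
        words = [word for word, _ in group]
        if tag == 'O':
            # Outside tokens stay as individual (word, 'O') pairs.
            chunks.extend((word, 'O') for word in words)
        else:
            # Entity runs become a single ((w1, w2, ...), TAG) chunk.
            chunks.append((tuple(words), tag))
    return chunks

print(chunk(tagged))
```

One caveat: two *different* entities of the same type that happen to be adjacent (with no 'O' token between them) would be merged by this scheme; the Stanford tagger's flat output does not carry enough information to separate them.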

Tag generation from small text content (such as tweets)

Submitted by 核能气质少年 on 2019-12-17 23:13:14
Question: I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets, such as user tweets, to generate tags (keywords). And it seems the accepted suggestion (the point-wise mutual information algorithm) is meant to work on bigger documents. With this constraint (working on a small set of texts), how can I generate tags? Regards Answer 1: Two-Stage Approach for Multiword Tags You could pool all the tweets into a single larger document and then …
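The pooling idea can be sketched in a few lines: concatenate the tweets into one corpus, then score adjacent word pairs by point-wise mutual information (PMI) to surface multiword tags. The tiny corpus below is illustrative, not from the question:

```python
import math
from collections import Counter

tweets = [
    'machine learning on small text',
    'deep learning for text data',
    'machine learning for tweets',
]

# Stage 1: pool the tweets, but count bigrams per tweet so we never
# create a spurious bigram across a tweet boundary.
tweet_tokens = [t.split() for t in tweets]
tokens = [w for ts in tweet_tokens for w in ts]
unigrams = Counter(tokens)
bigrams = Counter(b for ts in tweet_tokens for b in zip(ts, ts[1:]))
n_uni = len(tokens)
n_bi = sum(len(ts) - 1 for ts in tweet_tokens)

# Stage 2: rank candidate multiword tags by PMI.
def pmi(bigram):
    w1, w2 = bigram
    p_xy = bigrams[bigram] / n_bi
    return math.log2(p_xy / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

print(sorted(bigrams, key=pmi, reverse=True)[:3])
```

On data this small, PMI is biased toward pairs of rare words (a pair seen once whose words never occur elsewhere scores maximally), so in practice a minimum-frequency threshold on the bigram counts is usually added.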

Finding meaningful sub-sentences from a sentence

Submitted by 一个人想着一个人 on 2019-12-17 22:35:10
Question: Is there a way to find all the sub-sentences of a sentence that are still meaningful and contain at least one subject, verb, and predicate/object? For example, if we have a sentence like "I am going to do a seminar on NLP at SXSW in Austin next month", we can extract the following meaningful sub-sentences from it: "I am going to do a seminar", "I am going to do a seminar on NLP", "I am going to do a seminar on NLP at SXSW", "I am going to do a seminar at SXSW", "I am going to …

Is there a tutorial about giza++? [closed]

Submitted by 隐身守侯 on 2019-12-17 22:34:01
Question: Closed. This question is off-topic and not currently accepting answers. Closed 5 years ago. The URLs in its 'readme' file are not valid (http://www.fjoch.com/mkcls.html and http://www.fjoch.com/GIZA++.html). Is there a good tutorial about giza++? Or are there alternatives that have complete documentation? Answer 1: The following is excerpted from a tutorial I'm putting together for a class. (NB: This …

Semantic similarity between sentences

Submitted by 我怕爱的太早我们不能终老 on 2019-12-17 22:25:47
Question: I am doing a project. I need an open-source tool or technique to find the semantic similarity between sentences, where I give two sentences as input and get a score (i.e., semantic similarity) as output. Does anyone know about this? I hope I will get a reply soon. Thank you all. Answer 1: Salma, I'm afraid this is not the right forum for your question as it's not directly related to programming. I recommend that you ask your question again on the corpora list. You may also want to search their archives first. …

Is wordnet path similarity commutative?

Submitted by 谁说我不能喝 on 2019-12-17 19:33:49
Question: I am using the WordNet API from nltk. When I compare one synset with another I get None, but when I compare them the other way around I get a float value. Shouldn't they give the same value? Is there an explanation, or is this a bug in WordNet? Example: wn.synset('car.n.01').path_similarity(wn.synset('automobile.v.01')) # None wn.synset('automobile.v.01').path_similarity(wn.synset('car.n.01')) # 0.06666666666666667 Answer 1: Technically, without the dummy root, both the car and automobile synsets would …

Natural language time parser

Submitted by ⅰ亾dé卋堺 on 2019-12-17 19:06:09
Question: I'm trying to parse strings containing (natural language) times into hh:mm time objects. For example: "ten past five" "quarter to three" "half past noon" "15 past 3" "13:35" "ten fourteen am" I've looked into Chronic for Ruby and Natty for Java (as well as some other libraries), but both seem to focus on parsing dates. Strings like "ten past five" are not parsed correctly by either. Does anyone know of a library which suits my needs? Or should I start working on my own parser? Answer 1: …
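A hand-rolled parser for this is not unreasonable: the "past"/"to" constructions follow a small grammar. The sketch below handles a few of the question's examples with a toy word-to-number table; the vocabulary and patterns are illustrative assumptions, not a complete grammar (it does not handle "ten fourteen am", for instance):

```python
NUMBERS = {
    'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6,
    'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10, 'eleven': 11,
    'twelve': 12, 'quarter': 15, 'half': 30, 'noon': 12, 'midnight': 0,
}

def word_num(tok):
    """Map a token to an integer, accepting digits or number words."""
    return int(tok) if tok.isdigit() else NUMBERS.get(tok)

def parse_time(text):
    """Return (hour, minute) for expressions like 'ten past five', else None."""
    toks = text.lower().replace(':', ' : ').split()
    # "<minutes> past <hour>" and "<minutes> to <hour>"
    for kw in ('past', 'to'):
        if kw in toks:
            i = toks.index(kw)
            minute, hour = word_num(toks[i - 1]), word_num(toks[i + 1])
            if minute is None or hour is None:
                return None
            if kw == 'to':  # "quarter to three" = 15 minutes before 3:00
                hour, minute = (hour - 1) % 24, 60 - minute
            return hour % 24, minute
    # literal "hh:mm"
    if ':' in toks:
        i = toks.index(':')
        return int(toks[i - 1]) % 24, int(toks[i + 1])
    return None

print(parse_time('ten past five'))     # (5, 10)
print(parse_time('quarter to three'))  # (2, 45)
print(parse_time('13:35'))             # (13, 35)
```

Note the deliberate ambiguity left unresolved: "five" could be 05:00 or 17:00; a real parser would need am/pm handling and context to disambiguate.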

Kaggle spooky NLP

Submitted by 只愿长相守 on 2019-12-17 18:54:41
https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial

Introduction: In this notebook I will make a very basic attempt at topic modelling on the Spooky Author dataset. Topic modelling is the process of discovering abstract topics or "themes" from the words in an underlying collection of documents and text. I will cover two standard topic-modelling techniques here: the first is Latent Dirichlet Allocation (LDA), and the second is Non-negative Matrix Factorization (NMF). I will also take the opportunity to introduce some natural language processing basics, such as tokenization, stemming, and vectorization of raw text, which should also come in handy when making predictions with a learned model.

The notebook is organized as follows:
1. Exploratory Data Analysis (EDA) and word clouds: analyze the data by generating simple statistics (such as word frequencies for the different authors) and by plotting some word clouds (with image masks).
2. Natural Language Processing (NLP) with NLTK (the Natural Language Toolkit): introduce basic text-processing methods such as tokenization, stop-word removal, and vectorizing text with term frequency (TF) and TF-IDF weighting.
3. Topic modelling with LDA and NMF: implement the two topic-modelling techniques, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

According to the competition page, we are given three sets of author initials, which map to the actual authors as follows (each name links to the author's Wikipedia profile page): EAP: Edgar …