nlp

Coreference resolution in Python NLTK using Stanford CoreNLP

末鹿安然 submitted on 2019-12-29 06:20:19
Question: Stanford CoreNLP provides coreference resolution as mentioned here; this thread and this one also provide some insights about its implementation in Java. However, I am using Python and NLTK, and I am not sure how I can use the coreference resolution functionality of CoreNLP in my Python code. I have been able to set up StanfordParser in NLTK; this is my code so far:

```python
from nltk.parse.stanford import StanfordDependencyParser

stanford_parser_dir = 'stanford-parser/'
eng_model_path = stanford_parser_dir +
```
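One way to reach the coref annotator from Python, sketched below on the assumption that a Stanford CoreNLP server is running locally on port 9000 (the example text and port are illustrative), is to call the server's HTTP API directly and read the coreference chains out of the JSON response:

```python
# A minimal sketch: query a locally running Stanford CoreNLP server for
# coreference chains over its HTTP API. Assumes the server was started with
# something like:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
import json
import requests

text = "Barack Obama was born in Hawaii. He was elected president in 2008."
props = {'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,coref',
         'outputFormat': 'json'}
resp = requests.post('http://localhost:9000/',
                     params={'properties': json.dumps(props)},
                     data=text.encode('utf-8'))
ann = resp.json()

# 'corefs' maps chain ids to the mentions that refer to the same entity.
for chain in ann['corefs'].values():
    print([mention['text'] for mention in chain])
```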

How does language detection work?

泄露秘密 submitted on 2019-12-29 04:42:31
Question: I have been wondering for some time how Google Translate (or a hypothetical translator) detects the language of the string entered in the "from" field. I have been thinking about this, and the only thing I can think of is looking for words that are unique to a language in the input string. Another way could be to check sentence formation or other semantics in addition to keywords. But this seems a very difficult task, considering the number of languages and their semantics. I did some
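A classic technique that many detectors build on is the character n-gram profile method (Cavnar and Trenkle): build a ranked frequency profile of character trigrams per language from sample text, then compare an input's trigram ranks against each profile. A toy sketch, with made-up training snippets (a real detector would train on large corpora per language):

```python
# A toy sketch of character-trigram language detection. The training snippets
# are illustrative and far too small for real use.
from collections import Counter

def trigram_profile(text, top=300):
    text = ' ' + text.lower() + ' '
    grams = Counter(text[i:i+3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]

samples = {
    'english': "the quick brown fox jumps over the lazy dog and then some",
    'german':  "der schnelle braune fuchs springt ueber den faulen hund",
    'spanish': "el rapido zorro marron salta sobre el perro perezoso",
}
profiles = {lang: trigram_profile(txt) for lang, txt in samples.items()}

def detect(text):
    grams = trigram_profile(text)
    def distance(profile):
        # Out-of-place distance: how far each trigram's rank differs between
        # the input profile and the language profile.
        return sum(abs(i - profile.index(g)) if g in profile else len(profile)
                   for i, g in enumerate(grams))
    return min(profiles, key=lambda lang: distance(profiles[lang]))

print(detect("ein brauner hund"))  # 'german' (on this toy data)
```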

Simple Natural Language Processing Startup for Java [duplicate]

假如想象 submitted on 2019-12-29 03:33:32
Question: This question already has answers here: Is there a good natural language processing library [closed] (3 answers). Closed 5 years ago. I want to start developing an NLP project. I don't know much about the available tools. After googling for about a month, I realized that OpenNLP could be my solution. Unfortunately, I don't see any complete tutorial on using the API; all of them are missing some general steps. I need a tutorial from the ground up. I have seen a lot of downloads over the

NLTK language model (ngram): calculate the probability of a word from context

青春壹個敷衍的年華 submitted on 2019-12-29 03:21:40
Question: I am using Python and NLTK to build a language model as follows:

```python
from nltk.corpus import brown
from nltk.probability import LidstoneProbDist, WittenBellProbDist
# NgramModel lived in nltk.model in the old NLTK 2.x API (it was removed in NLTK 3):
from nltk.model import NgramModel

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)

# Thanks to miku, I fixed this problem
print lm.prob("word", ["This is a context which generates a word"])
>> 0.00493261081006

# But I got another problem like this one...
print lm.prob("b", ["This is a context
```
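Note that an n-gram model conditions only on the preceding n-1 tokens, not on a whole sentence passed as context, which is likely the source of the confusion above. NgramModel is also gone from modern NLTK; a rough equivalent under the nltk.lm package (NLTK 3.4+), with the corpus category and the scored trigram chosen for illustration, looks like this:

```python
# A sketch using the modern nltk.lm API: a Lidstone-smoothed trigram model
# trained on the Brown news category.
from nltk.corpus import brown
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

sents = brown.sents(categories='news')
train_ngrams, vocab = padded_everygram_pipeline(3, sents)

lm = Lidstone(0.2, 3)          # gamma=0.2, trigram order
lm.fit(train_ngrams, vocab)

# Probability of a word given the two preceding words (not a whole sentence):
print(lm.score('jury', ['the', 'grand']))
```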

Find multi-word terms in a tokenized text in Python

雨燕双飞 submitted on 2019-12-29 01:49:07
Question: I have a text that I have tokenized, or in general a list of words. For example:

```python
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
```

If I have a Python dict that contains single-word as well as multi-word keys, how can I efficiently and
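One way to make multi-word keys matchable, sketched below with an illustrative phrase list, is NLTK's MWETokenizer: it retokenizes a token list, merging known multi-word expressions into single tokens that can then be looked up in the dict directly:

```python
# A minimal sketch: merge known multi-word expressions in a token list.
# The phrase list stands in for the multi-word keys of a hypothetical dict.
from nltk.tokenize import MWETokenizer, word_tokenize

phrases = [('New', 'York'), ('Good', 'muffins')]
mwe = MWETokenizer(phrases, separator=' ')

tokens = word_tokenize("Good muffins cost $3.88 in New York.")
print(mwe.tokenize(tokens))
# ['Good muffins', 'cost', '$', '3.88', 'in', 'New York', '.']
```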

Split string into sentences using regex

有些话、适合烂在心里 submitted on 2019-12-28 12:14:11
Question: I have random text stored in $sentences. Using regex, I want to split the text into sentences:

```php
function splitSentences($text) {
    $re = '/              # Split sentences on whitespace between them.
        (?<=              # Begin positive lookbehind.
            [.!?]         # Either an end of sentence punct,
          | [.!?][\'"]    # or end of sentence punct and quote.
        )                 # End positive lookbehind.
        (?<!              # Begin negative lookbehind.
            Mr\.          # Skip either "Mr."
          | Mrs\.         # or "Mrs.",
          | T\.V\.A\.     # or "T.V.A.",
                          # or... (you get the idea).
        )                 # End negative
```
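The same lookbehind idea carries over to Python, with one wrinkle: Python's re module only allows fixed-width lookbehinds, so the punct-plus-quote alternative is dropped here and each abbreviation gets its own negative lookbehind. The abbreviation list and sample text are illustrative:

```python
# A minimal sketch of lookbehind-based sentence splitting in Python.
import re

def split_sentences(text):
    # Split on whitespace preceded by sentence-ending punctuation,
    # unless that punctuation belongs to a known abbreviation.
    pattern = r'(?<!Mr\.)(?<!Mrs\.)(?<=[.!?])\s+'
    return re.split(pattern, text)

print(split_sentences("Mr. Smith left. He said hi! Then he went home."))
# ['Mr. Smith left.', 'He said hi!', 'Then he went home.']
```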

What is the true difference between lemmatization and stemming?

≡放荡痞女 submitted on 2019-12-28 07:36:31
Question: When do I use each? Also, is NLTK's lemmatization dependent on part of speech? Wouldn't it be more accurate if it were? Answer 1: Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off
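A quick NLTK illustration of both points, assuming the WordNet data has been downloaded; the sample words are illustrative:

```python
# Contrast a rule-based stemmer with a dictionary-based lemmatizer, and show
# that WordNet lemmatization depends on the POS tag (it defaults to noun).
# Requires: nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'))                   # 'studi'  (crude suffix chopping)
print(lemmatizer.lemmatize('studies'))           # 'study'  (dictionary lookup)

print(lemmatizer.lemmatize('running'))           # 'running' (treated as a noun)
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'     (treated as a verb)
```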

How to train the Stanford NLP Sentiment Analysis tool

左心房为你撑大大i submitted on 2019-12-28 01:55:20
Question: Hello everyone! I'm using the Stanford CoreNLP package, and my goal is to perform sentiment analysis on a live stream of tweets. Using the sentiment analysis tool as-is returns a very poor analysis of the text's "attitude": many positives are labeled neutral, and many negatives are rated positive. I've gone ahead and acquired well over a million tweets in a text file, but I haven't a clue how to actually train the tool and create my own model. Link to the Stanford Sentiment Analysis page. "Models can be
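For orientation, the workflow the Stanford sentiment documentation describes: training data must be fully labeled binarized trees in PTB format (a 0-4 sentiment label on every node, one tree per line), so a raw file of tweets cannot be fed in directly; each tweet first has to be parsed and labeled. Training then runs from the command line. The file names and hidden-layer size below are illustrative:

```sh
# Each training example is a labeled binarized tree, e.g.:
#   (3 (2 It) (3 (2 was) (4 great)))
java -mx8g edu.stanford.nlp.sentiment.SentimentTraining \
     -numHid 25 -trainPath train.txt -devPath dev.txt \
     -train -model model.ser.gz
```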

NLP Series (7): The Transformer Explained

这一生的挚爱 submitted on 2019-12-27 16:31:12
Ref: https://jalammar.github.io/illustrated-transformer/ , https://blog.csdn.net/han_xiaoyang/article/details/86560459

Editor's note: A while ago, Google's BERT model achieved SOTA results on 11 NLP tasks, taking the NLP community by storm. A key factor behind BERT's success is the power of the Transformer. Google's Transformer model was first used for machine translation, where it achieved SOTA results at the time. The Transformer fixes the most criticized weakness of RNNs, slow training, by using a self-attention mechanism to achieve fast parallelism. It can also be scaled to very deep networks, fully exploiting the capacity of DNN models and improving accuracy. In this article, we will examine the Transformer model, take it apart piece by piece, and understand how it works.

Main text: The Transformer was proposed in the paper "Attention is All You Need" and is now the reference model recommended for Google Cloud TPUs. The TensorFlow code for the paper is available on GitHub as part of the Tensor2Tensor package. Harvard's NLP group has also implemented an annotated PyTorch version of the paper. In this article, we will try to simplify the model a bit and introduce its core concepts one by one, in the hope that ordinary readers can easily understand it. Attention is All
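To make the parallelism claim concrete, here is a minimal NumPy sketch of the scaled dot-product self-attention at the Transformer's core: every position attends to every other position in one matrix multiply, with no sequential recurrence. The shapes and random inputs are illustrative:

```python
# A minimal sketch of single-head scaled dot-product self-attention.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project inputs to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                        # weighted sum of value vectors

seq_len, d_model, d_k = 4, 8, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))       # one token embedding per row
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```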