nlp

Stanford Core NLP - understanding coreference resolution

Submitted by 安稳与你 on 2019-12-17 17:38:23
Question: I'm having some trouble understanding the changes made to the coreference resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation: The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons. {1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]} I am not sure I understand the meaning of these numbers. Looking at the source doesn't really help either. Thank you. Answer 1:

“Stop words” list for English? [closed]

Submitted by 自闭症网瘾萝莉.ら on 2019-12-17 17:33:36
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 7 years ago. I'm generating some statistics for some English-language text and I would like to skip uninteresting words such as "a" and "the". Where can I find some lists of these uninteresting words? Is a list of these words the same as a list of the most frequently used words in English? Update: these are apparently called
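NLTK ships a ready-made English stop-word list that answers this kind of question directly; a minimal sketch, assuming NLTK is installed and its 'stopwords' corpus has been downloaded:

```python
# Minimal sketch: filter out NLTK's English stop words from a piece of text.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
text = "The atom is a basic unit of matter"
interesting = [w for w in text.lower().split() if w not in stop_words]
print(interesting)  # ['atom', 'basic', 'unit', 'matter']
```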

Best Algorithmic Approach to Sentiment Analysis [closed]

Submitted by …衆ロ難τιáo~ on 2019-12-17 17:24:59
Question: Closed. This question is opinion-based and is not currently accepting answers. Closed 6 years ago. My requirement is taking in news articles and determining whether they are positive or negative about a subject. I am taking the approach outlined below, but I keep reading that NLP may be of use here. Everything I have read points to NLP detecting opinion from fact, which I don't
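For illustration only (this is not the asker's outlined approach): a lexicon/rule-based scorer such as NLTK's VADER is a common baseline for labelling a short text positive or negative:

```python
# Illustration: lexicon/rule-based sentiment scoring with NLTK's VADER.
# Assumes NLTK is installed and the 'vader_lexicon' resource is available.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The company reported surprisingly strong earnings.")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
print("positive" if scores["compound"] >= 0 else "negative")
```

For full news articles, a common refinement is to score only the sentences that mention the subject of interest rather than the whole document.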

How to use Gensim doc2vec with pre-trained word vectors?

Submitted by 谁说我不能喝 on 2019-12-17 17:24:36
Question: I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g., those found on the original word2vec website) with doc2vec? Or does doc2vec get its word vectors from the same sentences it uses for paragraph-vector training? Thanks. Answer 1: Note that the "DBOW" (dm=0) training mode doesn't require or even create word vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram
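A minimal sketch of the DBOW mode (dm=0) the answer refers to, assuming a gensim 4.x install; the toy corpus here is invented for illustration:

```python
# Minimal gensim Doc2Vec sketch in DBOW mode (dm=0), which learns document
# vectors without relying on pre-trained word vectors.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    ["the", "atom", "is", "a", "basic", "unit", "of", "matter"],
    ["electrons", "carry", "a", "negative", "charge"],
]
corpus = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(texts)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, dm=0, epochs=40)

print(model.dv["0"][:5])                 # learned paragraph vector for document "0"
print(model.infer_vector(texts[1])[:5])  # vector inferred for a new document
```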

Python: How to prepend the string 'ub' to every pronounced vowel in a string?

Submitted by 北战南征 on 2019-12-17 16:37:13
Question: Example: Speak -> Spubeak, more info here. Don't give me a solution, but point me in the right direction or tell me which Python library I could use. I am thinking of regex since I have to find a vowel, but then which method could I use to insert 'ub' in front of a vowel? Answer 1: It is more complex than a simple regex, e.g., "Hi, how are you?" → "Hubi, hubow ubare yubou?" A simple regex won't catch that the e in "are" is not pronounced. You need a library that provides a pronunciation dictionary
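A sketch of both ideas above, assuming NLTK and its CMU Pronouncing Dictionary ('cmudict') resource are available: the naive regex inserts 'ub' before every vowel letter, while the dictionary lookup shows that "are" has only one pronounced vowel:

```python
# Contrast a naive letter-based regex with a pronunciation-dictionary lookup.
import re
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)

def naive_ubbify(word):
    # Inserts 'ub' before each vowel letter, including silent ones.
    return re.sub(r"[aeiouAEIOU]", lambda m: "ub" + m.group(0), word)

print(naive_ubbify("are"))  # 'ubarube', but the final 'e' is never pronounced

pron = cmudict.dict()
# Phonemes ending in a stress digit (0/1/2) are the pronounced vowel sounds.
vowel_sounds = [p for p in pron["are"][0] if p[-1].isdigit()]
print(vowel_sounds)  # a single vowel phoneme, so only one 'ub' belongs in "are"
```

Mapping phonemes back to letter positions is the genuinely hard part and is left open here, as in the original answer.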

Java library for keywords extraction from input text

Submitted by 拜拜、爱过 on 2019-12-17 15:27:45
Question: I'm looking for a Java library to extract keywords from a block of text. The process should be as follows: stop-word cleaning -> stemming -> searching for keywords based on statistical information about English, meaning that if a word appears in the text more often than its probability in general English would predict, then it's a keyword candidate. Is there a library that performs this task? Answer 1: Here is a possible solution using Apache Lucene. I didn't use the latest version but the 3.6.2 one
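The answer itself is a Lucene (Java) solution; purely as an illustration of the same pipeline, here is a Python sketch (stop-word removal, stemming, frequency counting). A real implementation of the asker's idea would also compare the counts against word probabilities from a background English corpus, which this sketch omits:

```python
# Illustrative pipeline only, not the Lucene solution from the answer:
# stop-word removal -> stemming -> counting candidate keywords.
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

def keyword_candidates(text, top_n=5):
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [stemmer.stem(t) for t in tokens if t not in stop]
    return Counter(stems).most_common(top_n)

print(keyword_candidates("Atoms and atoms again: the atom is a basic unit of matter."))
# [('atom', 3), ('basic', 1), ('unit', 1), ('matter', 1)]
```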

How to tweak the NLTK sentence tokenizer

Submitted by 泄露秘密 on 2019-12-17 10:22:46
Question: I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick: import nltk sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle') ''' (Chapter 16) A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?" ''' sample = 'A clam for supper? a cold clam; is THAT what you
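One common tweak for this symptom (an assumption about what the asker ultimately needs) is to register "Mrs.", "Mr.", and similar titles as abbreviations so the Punkt tokenizer does not split after them:

```python
# Tell Punkt that "mr", "mrs", etc. are abbreviations (lowercase, no period),
# so "Mrs. Hussey" is not treated as a sentence boundary.
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_params = PunktParameters()
punkt_params.abbrev_types = set(["mr", "mrs", "dr", "st"])
tokenizer = PunktSentenceTokenizer(punkt_params)

sample = 'is THAT what you mean, Mrs. Hussey?" says I. "A rather cold reception, Mrs. Hussey."'
for sentence in tokenizer.tokenize(sample):
    print(sentence)
```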

Natural Language Processing in Ruby [closed]

Submitted by 不想你离开。 on 2019-12-17 10:08:07
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 3 years ago. I'm looking to do some sentence analysis (mostly for Twitter apps) and infer some general characteristics. Are there any good natural language processing libraries for this sort of thing in Ruby? Similar to "Is there a good natural language processing library", but for Ruby. I'd prefer something very general, but

LDA model generates different topics every time I train on the same corpus

Submitted by 試著忘記壹切 on 2019-12-17 09:34:11
Question: I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate different topics every time? And how do I stabilize the topic generation? I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here's my code: from gensim import corpora, models, similarities from
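LDA training is randomly initialised, so repeated runs differ unless the seed is pinned. A sketch using gensim's random_state parameter (available in reasonably recent gensim releases); the toy corpus below stands in for the asker's 231 sentences:

```python
# Pin the random seed so repeated LDA runs on the same corpus produce the same topics.
from gensim import corpora, models

texts = [
    ["atom", "matter", "nucleus"],
    ["electron", "charge", "nucleus"],
    ["language", "model", "topic"],
]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=42)
for topic_id, topic in lda.show_topics(num_topics=2, num_words=3):
    print(topic_id, topic)
```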

Natural language date/time parser for .NET? [closed]

Submitted by 痞子三分冷 on 2019-12-17 09:22:05
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 2 years ago. Does anyone know of a .NET date/time parser similar to Chronic for Ruby (it handles things like "tomorrow" or "3pm next Thursday")? Note: I do write Ruby (which is how I know about Chronic), but this project must use .NET. Answer 1: We developed exactly what you are looking for on an internal project. We are thinking of