nlp

NLP: Building (small) corpora, or “Where to get lots of not-too-specialized English-language text files?”

Submitted by 一笑奈何 on 2019-12-19 07:49:48
Question: Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of Usenet movie reviews, which hadn't occurred to me and is very good. For this particular program, technical Usenet archives or programming mailing lists would tilt the results and be hard…

How to determine subject, object and other words?

Submitted by 别来无恙 on 2019-12-19 05:48:41
Question: I'm trying to implement an application that can determine the meaning of a sentence by dividing it into smaller pieces. So I need to know which words are the subject, the object, etc., so that my program knows how to handle the sentence.
Answer 1: This is an open research problem. You can get an overview on Wikipedia: http://en.wikipedia.org/wiki/Natural_language_processing. Consider phrases like "Time flies like an arrow; fruit flies like a banana": unambiguously classifying words is not easy.
Answer 2: You should…
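In practice, subject and object are read off a dependency parse. A toy sketch: the parse below is hand-written (spaCy-style labels such as nsubj and dobj); in a real application a parser like spaCy or Stanford CoreNLP would produce it.

```python
# Read grammatical roles off a dependency parse given as
# (token, dependency label, head token) triples.
parse = [
    ("The",    "det",   "dog"),
    ("dog",    "nsubj", "chased"),
    ("chased", "ROOT",  "chased"),
    ("the",    "det",   "cat"),
    ("cat",    "dobj",  "chased"),
]

def find_role(parse, role):
    """Return all tokens carrying the given dependency label."""
    return [tok for tok, dep, _head in parse if dep == role]

print(find_role(parse, "nsubj"))  # → ['dog']
print(find_role(parse, "dobj"))   # → ['cat']
```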

CFG using POS tags in NLTK [closed]

Submitted by 无人久伴 on 2019-12-19 05:00:52
Question: [Closed 6 years ago as not a good fit for the Q&A format.] I am trying to check if a given sentence is grammatical using NLTK. Ex: OK: "The whale licks the sadness". NOT OK: "The best I ever had…
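One way to check grammaticality against a CFG in NLTK is to accept a sentence exactly when the chart parser finds at least one parse. A minimal sketch; the tiny grammar below is illustrative, not a real model of English:

```python
# Grammaticality check with an NLTK context-free grammar: a sentence is
# "grammatical" iff the parser yields at least one parse tree.
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N  -> 'whale' | 'sadness'
V  -> 'licks'
""")
parser = nltk.ChartParser(grammar)

def is_grammatical(sentence):
    tokens = sentence.lower().split()
    try:
        return any(True for _ in parser.parse(tokens))
    except ValueError:  # a token the grammar does not cover
        return False

print(is_grammatical("The whale licks the sadness"))  # True
```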

Replace words in corpus according to dictionary data frame

Submitted by 可紊 on 2019-12-19 04:56:31
Question: I am interested in replacing all words in a tm Corpus object according to a dictionary built from a two-column data frame, where the first column is the word to be matched and the second column is the replacement word. I am stuck with the translate function. I saw this answer but I can't turn it into a function to be passed to tm_map. Please consider the following MWE:

library(tm)
docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))
dictionary <- data.frame(word = c(…
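The underlying technique is independent of tm: compile the dictionary keys into one alternation pattern and do a single regex pass per document. A sketch of the same idea in Python for illustration; the dictionary entries are placeholders:

```python
# Dictionary-driven word replacement: one compiled regex pass per document,
# with word boundaries so only whole tokens are matched.
import re

docs = ["first text", "second text"]
dictionary = {"first": "1st", "second": "2nd"}  # hypothetical entries

pattern = re.compile(r"\b(" + "|".join(map(re.escape, dictionary)) + r")\b")
replaced = [pattern.sub(lambda m: dictionary[m.group(1)], d) for d in docs]
print(replaced)  # → ['1st text', '2nd text']
```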

Remove a verb as a stopword

Submitted by 时光怂恿深爱的人放手 on 2019-12-19 04:54:25
Question: Some words are used sometimes as a verb and sometimes as another part of speech. For example, a sentence where the word is a verb: "I blame myself for what happened". And a sentence where the same word is a noun: "For what happened, the blame is yours". The word I want to detect is known to me; in the example above it is "blame". I would like to detect and remove it as a stopword only when it is used as a verb. Is there an easy way to do this?
Answer 1: You can install TreeTagger…
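Whatever tagger is used (TreeTagger, nltk.pos_tag, spaCy), the removal step itself is simple: drop the word only when its tag is a verb tag. A sketch with hand-tagged input (Penn Treebank tags); in practice the tags would come from the tagger:

```python
# POS-conditioned stopword removal: delete a listed word only when it is
# tagged as a verb (any Penn Treebank VB* tag).
def drop_verb_stopwords(tagged, stopwords):
    """Keep (word, tag) pairs unless word is a stopword used as a verb."""
    return [(w, t) for w, t in tagged
            if not (w.lower() in stopwords and t.startswith("VB"))]

verb_use = [("I", "PRP"), ("blame", "VBP"), ("myself", "PRP")]
noun_use = [("the", "DT"), ("blame", "NN"), ("is", "VBZ"), ("yours", "PRP")]

print(drop_verb_stopwords(verb_use, {"blame"}))  # 'blame' removed
print(drop_verb_stopwords(noun_use, {"blame"}))  # 'blame' kept ('is' removed)
```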

Head-finding rules for noun phrases [closed]

Submitted by ⅰ亾dé卋堺 on 2019-12-19 04:46:26
Question: [Closed 7 years ago as off-topic for Stack Overflow.] The Penn Treebank format does not annotate the internal structure of a noun phrase, e.g.

(NP (JJ crude) (NN oil) (NNS prices))

or

(NP (NP (DT the) (JJ big) (JJ blue) (NN house)) (SBAR (WHNP (WDT that)) (S (VP (VBD was) (VP (VBN built) (PP (IN near) (NP (DT the) (NN river))))))))

I would like to extract the heads…
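For flat NPs like the first example, a common heuristic is a simplified version of the Collins/Magerman head rules: take the rightmost nominal child. A sketch over an NLTK Tree, assuming nltk is installed:

```python
# Simplified head-finding for a flat NP: return the word of the rightmost
# child whose tag starts with NN (a reduced Collins-style head rule).
from nltk import Tree

def np_head(np):
    for child in reversed(np):
        if isinstance(child, Tree) and child.label().startswith("NN"):
            return child[0]
    return None

np = Tree.fromstring("(NP (JJ crude) (NN oil) (NNS prices))")
print(np_head(np))  # → prices
```

Recursive NPs like the second example need the rule applied top-down, recursing into the head child at each level.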

quicker way to detect n-grams in a string?

Submitted by 半世苍凉 on 2019-12-19 04:39:32
Question: I found this solution on SO to detect n-grams in a string (here: N-gram generation from a sentence):

import java.util.*;

public class Test {
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append(i > start ? " " : "").append(words[i]);
        return sb.toString();
    }
}
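For comparison, the same sliding-window idea is a one-liner in Python; zip over offset slices avoids building each n-gram with an inner concatenation loop:

```python
# n-gram extraction via a sliding window: zip the word list against
# copies of itself shifted by 1..n-1 positions.
def ngrams(n, text):
    words = text.split()
    return [" ".join(gram) for gram in zip(*(words[i:] for i in range(n)))]

print(ngrams(2, "this is a simple sentence"))
# → ['this is', 'is a', 'a simple', 'simple sentence']
```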

NER model to recognize Indian names

Submitted by 主宰稳场 on 2019-12-19 04:25:50
Question: I am planning to use Named Entity Recognition (NER) to identify person names (most of which are Indian names) in a given text. I have already explored the CRF-based NER model from Stanford NLP; however, it is not very accurate at recognizing Indian names. Hence I decided to create my own custom NER model via supervised training. I have a fair idea of how to create my own NER model using the Stanford NER CRF, but creating a large training corpus with manual annotation is something I…
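One common shortcut for the annotation bottleneck is to bootstrap training data from a name gazetteer. A sketch that emits the tab-separated token/label lines Stanford's CRF trainer consumes; the gazetteer and sentences are placeholders, and bootstrapped labels of this kind are noisy and usually need manual review:

```python
# Bootstrap NER training data: label tokens found in a person-name
# gazetteer as PERSON and everything else as O, one "token<TAB>label"
# line per token, blank line between sentences.
gazetteer = {"Ramesh", "Priya", "Arjun"}  # hypothetical name list
sentences = [
    "Ramesh met Priya in Mumbai",
    "Arjun wrote the report",
]

lines = []
for sent in sentences:
    for token in sent.split():
        label = "PERSON" if token in gazetteer else "O"
        lines.append(f"{token}\t{label}")
    lines.append("")  # blank line separates sentences

print("\n".join(lines))
```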

Stanford Parser - Traversing the typed dependencies graph

Submitted by 别等时光非礼了梦想. on 2019-12-19 04:01:35
Question: Basically I want to find a path between two NP tokens in the dependencies graph. However, I can't seem to find a good way to do this in the Stanford Parser. Any help? Thank you very much.
Answer 1: The Stanford Parser just returns a list of dependencies between word tokens. (We do this to avoid external library dependencies.) But if you want to manipulate the dependencies, you'll almost certainly want to put them in a graph data structure. We usually use jgrapht: http://jgrapht.sourceforge.net/
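The graph step the answer describes can also be done without an external library: treat each (governor, dependent) pair as an undirected edge and BFS for the shortest path. A sketch where the dependency list is a hand-written stand-in for the parser's output:

```python
# Shortest path between two tokens over dependency edges, via BFS on an
# undirected adjacency map built from (governor, dependent) pairs.
from collections import deque

deps = [("chased", "dog"), ("dog", "the"), ("chased", "cat"), ("cat", "a")]

def shortest_path(edges, src, dst):
    graph = {}
    for g, d in edges:
        graph.setdefault(g, []).append(d)
        graph.setdefault(d, []).append(g)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path

print(shortest_path(deps, "the", "a"))  # → ['the', 'dog', 'chased', 'cat', 'a']
```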