nlp

Reusable version of DKPro Core pipeline

痴心易碎 submitted on 2020-01-03 00:27:15

Question: I have set up DKPro Core as a web service to take an input and provide a tokenised output. The service itself is set up as a Jersey resource:

    @Path("/")
    public class MyResource {

        public MyResource() {
            // Nothing here
        }

        @GET
        public String generate(@QueryParam("q") final String input) {
            try {
                final JCasIterable en = iteratePipeline(
                        createReaderDescription(StringReader.class,
                                StringReader.PARAM_DOCUMENT_TEXT, input,
                                StringReader.PARAM_LANGUAGE, "en"),
                        createEngineDescription(StanfordSegmenter…
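
The code above rebuilds the entire pipeline on every GET request; the usual fix is to construct the expensive pipeline object once at startup and reuse it per request. A minimal sketch of that pattern in Python, using spaCy and Flask as stand-ins (DKPro Core itself is Java-only, so the framework, route, and model name here are illustrative assumptions, not the question's actual stack):

    # "Build once, reuse per request": the costly object is created a single
    # time at import, and each request only runs text through it.
    import spacy
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    nlp = spacy.load("en_core_web_sm")   # expensive: load exactly once

    @app.route("/")
    def tokenize():
        text = request.args.get("q", "")
        doc = nlp(text)                  # cheap: reuse the loaded pipeline
        return jsonify([token.text for token in doc])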

Unscrambling words in a sentence using Natural Language Generation

╄→гoц情女王★ submitted on 2020-01-03 00:06:39

Question: I have a sentence in English. Now I want to jumble the words up and feed that set of words into a program, which should unscramble the words according to the normal rules of English grammar and output the original sentence. I can vaguely assume it would require Natural Language Generation algorithms. For example:

    Sentence: Mary has gone for a walk with her dog.
    Set of words: {has, for, a, with, her, dog, Mary, gone, walk}

The output should be the same sentence. I can assume only the set of words will…
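
A brute-force baseline makes the problem concrete: score every permutation of the word set with a language model and keep the best-scoring order. The sketch below uses hand-made bigram counts purely for illustration; a real system would estimate them from a large corpus and use beam search, since n! permutations explode quickly:

    # Toy unscrambler: pick the permutation with the highest bigram score.
    # Words are case-folded for simplicity; counts are made up for the demo.
    from itertools import permutations

    BIGRAMS = {
        ("mary", "has"): 5, ("has", "gone"): 7, ("gone", "for"): 6,
        ("for", "a"): 9, ("a", "walk"): 8, ("walk", "with"): 4,
        ("with", "her"): 6, ("her", "dog"): 5,
    }

    def score(order):
        # Sum the counts of all adjacent word pairs in this ordering.
        return sum(BIGRAMS.get(pair, 0) for pair in zip(order, order[1:]))

    words = ["has", "for", "a", "with", "her", "dog", "mary", "gone", "walk"]
    best = max(permutations(words), key=score)
    print(" ".join(best))  # mary has gone for a walk with her dog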

Stanford CoreNLP: Use partial existing annotation

烈酒焚心 submitted on 2020-01-02 15:25:14

Question: We are trying to use our existing tokenization, sentence splitting, and named entity tagging, while we would like to use Stanford CoreNLP to additionally provide us with part-of-speech tagging, lemmatization, and parsing. Currently, we are trying it the following way:

1) Make an annotator for "pos, lemma, parse":

    Properties pipelineProps = new Properties();
    pipelineProps.put("annotators", "pos, lemma, parse");
    pipelineProps.setProperty("parse.maxlen", "80");
    pipelineProps.setProperty("pos.maxlen", "80");
    …
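
The same idea is easy to show in Python with Stanza (Stanford's Python library, not the Java CoreNLP API the question uses): hand the pipeline your own tokens and sentence splits and let it add tagging, lemmas, and parses on top. The sentence below is illustrative; tokenize_pretokenized is a real Stanza option:

    # Assumes the English models were fetched once via stanza.download("en").
    import stanza

    nlp = stanza.Pipeline(
        lang="en",
        processors="tokenize,pos,lemma,depparse",
        tokenize_pretokenized=True,  # trust the given tokens/sentences as-is
    )
    pretokenized = [["Mary", "has", "gone", "for", "a", "walk", "."]]
    doc = nlp(pretokenized)
    for word in doc.sentences[0].words:
        print(word.text, word.upos, word.lemma, word.head, word.deprel)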

How to use Stanford LexParser for Chinese text?

白昼怎懂夜的黑 submitted on 2020-01-02 09:16:22

Question: I can't seem to get the correct input encoding for Stanford NLP's LexParser. How do I use the Stanford LexParser for Chinese text? I've done the following to download the tool:

    $ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip
    $ unzip stanford-parser-full-2015-04-20.zip
    $ cd stanford-parser-full-2015-04-20/

And my input text is in UTF-8:

    $ echo "应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事 。" > input.txt
    $ echo "应有尽有#VV 的#DEC 丰富#JJ 选择#NN 定#VV 将#AD 为#P 您#PN 的#DEG 旅程#NN 增添…
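
A hedged sketch of one way to run the parser on Chinese input: load a Chinese grammar instead of the default English one and pass an explicit UTF-8 encoding flag. The model path and flags below follow the conventions of the 2015 parser distribution but should be verified against the actual jar contents before relying on them:

    # Invoke the LexParser from Python with the Chinese PCFG grammar.
    # The classpath wildcard is expanded by the JVM, not the shell.
    import subprocess

    subprocess.run([
        "java", "-mx2g",
        "-cp", "stanford-parser-full-2015-04-20/*",
        "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-encoding", "utf-8",
        "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz",
        "input.txt",
    ], check=True)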

Older versions of spaCy throw a “KeyError: 'package'” error when trying to install a model

南楼画角 submitted on 2020-01-02 08:05:29

Question: I use spaCy 1.6.0 on Ubuntu 14.04.4 LTS x64 with Python 3.5. To install the English model of spaCy, I tried to run python3.5 -m spacy.en.download. This gives me the error message:

    ubun@ner-3:~/NeuroNER-master/src$ python3.5 -m spacy.en.download
    Downloading parsing model
    Traceback (most recent call last):
      File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.5/dist-packages/spacy…
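
For comparison, a hedged sketch of the modern flow: on current spaCy releases the sputnik-based downloader that produced this traceback no longer exists, and models install as ordinary pip packages. This is an upgrade path rather than a fix for 1.6.0 itself, and may not suit a project pinned to the old API:

    # After: pip install -U spacy
    #        python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    print([token.text for token in nlp("This is a test.")])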

Split text file at sentence boundary

杀马特。学长 韩版系。学妹 submitted on 2020-01-02 08:05:04

Question: I have to process a text file (an e-book). I'd like to process it so that there is one sentence per line (a "newline-separated file", yes?). How would I do this task using the UNIX utility sed? Does it have a symbol for "sentence boundary", like the symbol for "word boundary" (I think the GNU version has one)? Please note that a sentence can end in a period, ellipsis, question mark, or exclamation mark, the last two also in combination (for example, ?, !, !?, !!!!! are all valid "sentence terminators").
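
For reference, sed only offers word-boundary escapes (GNU sed's \b and \<, \>); there is no sentence-boundary symbol. A minimal stand-in in Python, swapped in for the sed approach the question asks about, splits at whitespace that follows one of the terminators the question lists (abbreviations such as "Mr." would need extra handling):

    import re

    def split_sentences(text):
        # Split at whitespace preceded by ., !, ?, or an ellipsis character;
        # runs like "!?" or "!!!!!" end in one of these, so they match too.
        return [s for s in re.split(r"(?<=[.!?\u2026])\s+", text) if s]

    book = "He left. Really?! Yes!!! What a day..."
    print("\n".join(split_sentences(book)))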

Stanford.NLP for .NET not loading models

让人想犯罪 __ submitted on 2020-01-02 07:11:23

Question: I am trying to run the sample code provided here for Stanford.NLP for .NET. I installed the package via NuGet, downloaded the CoreNLP zip archive, and extracted stanford-corenlp-3.7.0-models.jar. After extracting, I located the "models" directory in stanford-corenlp-full-2016-10-31\edu\stanford\nlp\models. Here is the code that I am trying to run:

    public static void Test1()
    {
        // Path to the folder with models extracted from `stanford-corenlp-3.6.0-models.jar`
        var jarRoot = @"..\..\..\stanford…

Understanding LDA Transformed Corpus in Gensim

*爱你&永不变心* submitted on 2020-01-02 06:53:12

Question: I tried to examine the contents of the BOW corpus vs. the LDA[BOW corpus] (transformed by an LDA model trained on that corpus with, say, 35 topics). I found the following output:

    DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]
    LDA 1 : [(29, 0.80571428571428572)]
    DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]
    LDA 2 : [(29, 0.83809523809523812)]
    DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]
    LDA 3 : [(34, 0.75714285714285712)]
    DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]
    LDA 4 : [(22,…
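
One detail worth knowing when reading such output: gensim suppresses topics whose probability falls below the model's minimum_probability cutoff, which is why each document appears to belong to a single topic. A small sketch with an illustrative toy corpus (the texts and num_topics are made up; minimum_probability is a real LdaModel parameter):

    from gensim import corpora, models

    texts = [["human", "interface", "computer"],
             ["graph", "trees", "minors"],
             ["human", "trees", "graph"]]
    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]

    # minimum_probability=0.0 reports the (near-)full topic distribution
    # instead of hiding low-probability topics.
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2,
                          minimum_probability=0.0)
    for bow in bow_corpus:
        print(bow, "->", lda[bow])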

PyTorch: Relation between Dynamic Computational Graphs - Padding - DataLoader

て烟熏妆下的殇ゞ submitted on 2020-01-02 05:27:06

Question: As far as I understand, the strength of PyTorch is supposed to be that it works with dynamic computational graphs. In the context of NLP, that means that sequences of variable length do not necessarily need to be padded to the same length. But if I want to use the PyTorch DataLoader, I need to pad my sequences anyway, because the DataLoader only takes tensors, given that I, as a total beginner, do not want to build a customized collate_fn. Now this makes me wonder: doesn't this wash away…
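
The customized collate_fn the question wants to avoid is in fact quite small: it pads each batch only to the length of that batch's longest sequence, so padding stays per-batch rather than corpus-wide. A minimal sketch (the toy sequences are illustrative; pad_sequence and DataLoader's collate_fn parameter are real PyTorch APIs):

    import torch
    from torch.nn.utils.rnn import pad_sequence
    from torch.utils.data import DataLoader

    sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]),
                 torch.tensor([6, 7, 8, 9]), torch.tensor([10])]

    def collate(batch):
        # Keep original lengths so a model can use pack_padded_sequence later.
        lengths = torch.tensor([len(seq) for seq in batch])
        padded = pad_sequence(batch, batch_first=True, padding_value=0)
        return padded, lengths

    loader = DataLoader(sequences, batch_size=2, collate_fn=collate)
    for padded, lengths in loader:
        print(padded, lengths)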