stanford-nlp

PCFG vs SR Parser

北城以北 submitted on 2019-12-25 16:39:06
Question: It looks like stanfordnlp has had these SR models for some time. I am really new to NLP, but we are currently using the PCFG parser and having serious performance issues (to the point that we cut the maximum parse length down to 35), so I was wondering whether we could try the SR parser instead. I tried it with the Stanford POS tagger (english-left3words-distsim.tagger). Do you know how SR compares with PCFG on accuracy? I also see sentence root detection issues with SR and the dependency parse. Example: Michael Jeffrey Jordan, also known by his
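
For readers who want to try the switch, the sketch below shows one way to point a CoreNLP pipeline at the shift-reduce model instead of the default PCFG, which is the usual suggestion when the PCFG's parse length has to be capped for speed. The model paths are assumptions based on the standard 3.x distributions (the SR model ships in a separate models jar), so adjust them to whatever is actually on your classpath.

    import java.util.Properties;

    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class SrParserSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, parse");
            // Assumed model paths from the 3.x distributions; the shift-reduce model
            // is distributed separately from the default models jar.
            props.setProperty("pos.model",
                "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
            props.setProperty("parse.model",
                "edu/stanford/nlp/models/srparser/englishSR.ser.gz");

            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            Annotation doc = new Annotation("Michael Jeffrey Jordan is a former basketball player.");
            pipeline.annotate(doc);
            // Per-sentence constituency trees and dependencies can be read off doc as usual.
        }
    }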

ConllReader (Like RothCONLL04Reader) throws exception while reading relation training data with custom NER and custom relation

我只是一个虾纸丫 submitted on 2019-12-25 14:38:27
Question: In continuation of the following question: How to generate custom training data for Stanford relation extraction. Thanks to StanfordNLPHelp I am able to generate relation data with custom NER and, on top of it, regexner. I had to run my custom model at the end because otherwise it would misclassify lots of ORGANIZATION, PERSON, etc. Example custom NER classes: "DEGREE", "DESG". Example of relation training data: 0 ELECTEDBODY 0 O NNP/IN/NNP BOARD/OF/DIRECTORS O O O 0 ORGANIZATION 1 O NNP Board O O O
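
As a side note on chaining a custom model with the stock ones: the ner annotator accepts a comma-separated list of CRF models through the ner.model property and applies them in order, which is one way to control which labels win before regexner runs. The sketch below is only an illustration of that mechanism; custom-ner-model.ser.gz is a hypothetical path, not the questioner's model.

    import java.util.Properties;

    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class CustomNerOrderSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
            // Hypothetical custom CRF listed alongside a stock model. As I understand the
            // default combination mode, labels assigned by earlier models are not
            // overwritten by later ones, so the order of this list matters.
            props.setProperty("ner.model",
                "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz,"
                + "custom-ner-model.ser.gz");

            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            // The pipeline can now be used to tag text for relation-extraction training data.
        }
    }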

How to set delimiters for PTB tokenizer?

落爺英雄遲暮 submitted on 2019-12-25 07:39:41
Question: I'm using the Stanford CoreNLP library for my project. It uses the PTB tokenizer for tokenization. For a statement like this - go to room no. #2145 or go to room no. *2145 - the tokenizer splits #2145 into two tokens: # and 2145. Is there any way to configure the tokenizer so that it doesn't treat # and * as delimiters? Answer 1: A quick solution is to use this option: (command line) -tokenize.whitespace (in Java code) props.setProperty("tokenize.whitespace", "true"); This will cause the tokenizer to
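
Put into a runnable form, the suggested property looks like the sketch below. Keep in mind that with whitespace tokenization the tokenizer no longer separates punctuation for you, so the input needs to be pre-split accordingly.

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class WhitespaceTokenizeSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize");
            // Split only on whitespace, so "#2145" and "*2145" survive as single tokens.
            props.setProperty("tokenize.whitespace", "true");

            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            Annotation doc = new Annotation("go to room no. #2145 or go to room no. *2145");
            pipeline.annotate(doc);

            List<CoreLabel> tokens = doc.get(CoreAnnotations.TokensAnnotation.class);
            for (CoreLabel token : tokens) {
                System.out.println(token.word());
            }
        }
    }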

Spark Scala - java.util.NoSuchElementException & Data Cleaning

半腔热情 submitted on 2019-12-25 07:25:03
Question: I have had a similar problem before, but I am looking for a generalizable answer. I am using spark-corenlp to get sentiment scores on e-mails. Sometimes, sentiment() crashes on some input (maybe it's too long, maybe it has an unexpected character). It does not tell me that it crashed on some instances, and just returns the Column sentiment('email). Thus, when I try to show() beyond a certain point or save() my data frame, I get a java.util.NoSuchElementException because sentiment() must have
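
The question concerns spark-corenlp's sentiment('email) column, but the underlying failure can be contained generically by wrapping each per-document annotation in a try/catch and returning a sentinel value instead of letting the row fail. The Java sketch below illustrates that pattern; the scoreOrDefault helper and the character cap are my own additions rather than anything from spark-corenlp, and the SentimentClass annotation key is the one used in recent CoreNLP releases.

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
    import edu.stanford.nlp.util.CoreMap;

    public class DefensiveSentimentSketch {
        private static final int MAX_CHARS = 10000; // arbitrary cap for very long e-mails

        // Returns the sentiment label of the first sentence, or a fallback if anything fails.
        static String scoreOrDefault(StanfordCoreNLP pipeline, String text, String fallback) {
            try {
                String clipped = text.length() > MAX_CHARS ? text.substring(0, MAX_CHARS) : text;
                Annotation doc = new Annotation(clipped);
                pipeline.annotate(doc);
                List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
                if (sentences == null || sentences.isEmpty()) {
                    return fallback;
                }
                return sentences.get(0).get(SentimentCoreAnnotations.SentimentClass.class);
            } catch (Exception e) {
                return fallback; // swallow per-document failures instead of failing the whole job
            }
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, parse, sentiment");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            System.out.println(scoreOrDefault(pipeline, "I love this.", "unknown"));
        }
    }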

Error in Stanford POS Tagger

和自甴很熟 submitted on 2019-12-25 07:10:06
Question: Hello, I am trying to POS-tag a sentence using the Stanford POS tagger. I am using Python 3.4 and nltk 3.1 on Windows 7. The following is the code I used: import nltk from nltk.tag.stanford import POSTagger import os java_path = r"C:\Program Files\Java\jre1.8.0_66\bin\java.exe" os.environ['JAVAHOME'] = java_path St=POSTagger(r"C:\Python34\Scripts\stanford-postagger-2015-12-09\models\english-bidirectional-distsim.tagger", r"C:\Python34\Scripts\stanford-postagger-2015-12-09\stanford-postagger

CoreNLP on Apache Spark

谁都会走 submitted on 2019-12-25 06:14:45
Question: I'm not sure if this is related to Spark or NLP. Please help. I'm currently trying to run the Stanford CoreNLP library on Apache Spark, and when I try to run it on multiple cores I get the following exception. I'm using the latest NLP library, which is thread safe. This is happening during the map phase on the line pipeline.annotate(document); java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java
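
One common mitigation, offered as an assumption rather than a confirmed fix for this exact report, is to avoid sharing a single pipeline instance across worker threads and to build one per thread instead, for example via a ThreadLocal as sketched below.

    import java.util.Properties;

    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class PerThreadPipelineSketch {
        // Each worker thread lazily builds its own pipeline instead of sharing one instance.
        private static final ThreadLocal<StanfordCoreNLP> PIPELINE =
            ThreadLocal.withInitial(() -> {
                Properties props = new Properties();
                props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
                return new StanfordCoreNLP(props);
            });

        static void annotate(String text) {
            Annotation document = new Annotation(text);
            PIPELINE.get().annotate(document); // no mutable state shared across threads
        }

        public static void main(String[] args) throws InterruptedException {
            Runnable task = () -> annotate("Stanford CoreNLP runs inside each worker thread.");
            Thread a = new Thread(task);
            Thread b = new Thread(task);
            a.start();
            b.start();
            a.join();
            b.join();
        }
    }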

NER CRF, Exception in thread “main” java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory [duplicate]

蹲街弑〆低调 submitted on 2019-12-25 04:44:05
Question: This question already has answers here: Why am I getting a NoClassDefFoundError in Java? (23 answers). Closed 3 years ago. I have downloaded the latest version of the NER package from this link. After extracting it, I ran this command: java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop This is not working and I get the following exception: CRFClassifier invoked on Mon Jul 25 06:56:22 EDT 2016 with arguments: -prop austen.prop Exception in thread "main" java.lang
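
A frequent cause of this particular NoClassDefFoundError is that no slf4j jar is on the classpath. Assuming the extracted download ships the required jars in a lib directory (if it does not, add an slf4j-api jar there yourself), widening the classpath usually resolves it; on Windows, use ; instead of : as the separator.

    java -cp "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop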

Failed to execute goal

老子叫甜甜 submitted on 2019-12-25 04:29:11
Question: I'm new to Maven. I tried mvn clean, which worked successfully, but after mvn package I got the following error: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3:compile (default-compile) on project L: Compilation failure: Compilation failure: [ERROR] /home/user/L/src/main/java/edu/stanford/nlp/pipeline/NLP.java:[4,34] package edu.stanford.nlp.neural.rnn does not exist [ERROR] [ERROR] /home/user/L/src/main/java/edu/stanford/nlp/pipeline/NLP.java:[6,32] cannot find
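
The missing edu.stanford.nlp.neural.rnn package suggests the CoreNLP jar is not declared as a dependency of the project. A typical pom.xml entry is sketched below; the version shown is an assumption, so match it to the release you actually build against, and the models classifier entry is only needed if the model files are not provided some other way.

    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.6.0</version>
    </dependency>
    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.6.0</version>
      <classifier>models</classifier>
    </dependency>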

Stanford PTBTokenizer token's split delimiter

陌路散爱 submitted on 2019-12-25 02:07:37
Question: Is there a way to provide the PTBTokenizer with a set of delimiter characters on which to split a token? I was testing the behaviour of this tokenizer and I realized that there are some characters, like the vertical bar '|', for which the tokenizer divides a substring into two tokens, and others, like the slash or the hyphen, for which the tokenizer returns a single token. Answer 1: There's not any simple way to do this with the PTBTokenizer, no. You can do some pre-processing and post-processing to get what
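
Following the answer's pre/post-processing suggestion, one workaround is to tokenize with PTBTokenizer as usual and then re-split each token on a caller-supplied set of extra delimiter characters. The splitFurther helper below is hypothetical, not a PTBTokenizer feature; it is only a sketch of that idea.

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.process.PTBTokenizer;

    public class PostSplitSketch {
        // Hypothetical helper: re-split every PTB token on any of the given delimiter characters.
        static List<String> splitFurther(String text, String delimiters) {
            PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                new StringReader(text), new CoreLabelTokenFactory(), "");
            List<String> out = new ArrayList<>();
            while (tokenizer.hasNext()) {
                String word = tokenizer.next().word();
                StringBuilder piece = new StringBuilder();
                for (char c : word.toCharArray()) {
                    if (delimiters.indexOf(c) >= 0) { // extra delimiter: close the current piece
                        if (piece.length() > 0) {
                            out.add(piece.toString());
                            piece.setLength(0);
                        }
                    } else {
                        piece.append(c);
                    }
                }
                if (piece.length() > 0) {
                    out.add(piece.toString());
                }
            }
            return out;
        }

        public static void main(String[] args) {
            // Treat '/' and '-' as extra delimiters on top of the normal PTB behaviour.
            System.out.println(splitFurther("a high-risk asset/liability split", "/-"));
        }
    }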