stanford-nlp

NLTK can't interpret grammar category PRP$ output by Stanford parser

浪子不回头ぞ submitted on 2019-12-02 18:52:17
Question: I want to generate sentences from a grammar retrieved from the Stanford parser, but NLTK is not able to interpret PRP$.

    from nltk.parse.stanford import StanfordParser
    from nltk.grammar import CFG
    from nltk.parse.generate import generate

    sp = StanfordParser(model_path='/home/aman/stanford_resource/stanford-parser-full-2014-06-16/stanford-parser-3.4-models/edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', path_to_jar='/home/aman/stanford_resource/stanford-parser-full-2014-06-16/stanford-parser.jar'
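
The stumbling block is likely NLTK's string-based grammar reader, which rejects '$' inside a nonterminal name such as PRP$. A minimal sketch of one workaround, with a toy tree standing in for real StanfordParser output: build the CFG directly from the tree's Production objects so the string reader is never involved.

    from nltk import Tree
    from nltk.grammar import CFG, Nonterminal
    from nltk.parse.generate import generate

    # Toy stand-in for a tree returned by StanfordParser; PRP$ (possessive
    # pronoun) is the tag that NLTK's string-based grammar reader rejects.
    tree = Tree.fromstring("(S (NP (PRP$ My) (NN dog)) (VP (VBZ barks)))")

    # CFG.fromstring() chokes on '$' in a nonterminal, but constructing the
    # grammar from Production objects bypasses the string reader entirely.
    grammar = CFG(Nonterminal('S'), tree.productions())

    for sent in generate(grammar, n=5):
        print(' '.join(sent))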

Multi-term named entities in Stanford Named Entity Recognizer

旧城冷巷雨未停 submitted on 2019-12-02 18:39:59
I'm using the Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) and it's working fine. This is the relevant code:

    List<List<CoreLabel>> out = classifier.classify(text);
    for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
            if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
                namedEntities.add(word.word().trim());
            }
        }
    }

However, the problem I'm finding is identifying names and surnames. If the recognizer encounters "Joe Smith", it returns "Joe" and "Smith" separately. I'd really like it to return "Joe Smith" as one term. Could this be achieved
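
One common workaround is post-processing: merge consecutive tokens that carry the same non-O tag into a single entity. A sketch of that grouping idea, in Python over hypothetical (token, tag) pairs of the kind the classifier produces; the Java loop above can apply the same logic.

    from itertools import groupby

    # Hypothetical classifier output as (token, tag) pairs.
    tagged = [('Joe', 'PERSON'), ('Smith', 'PERSON'), ('works', 'O'),
              ('at', 'O'), ('Stanford', 'ORGANIZATION')]

    # Merge each run of identical non-O tags into one multi-token entity.
    entities = [(' '.join(tok for tok, _ in group), tag)
                for tag, group in groupby(tagged, key=lambda x: x[1])
                if tag != 'O']

    print(entities)  # [('Joe Smith', 'PERSON'), ('Stanford', 'ORGANIZATION')]

Note that this conflates two distinct entities of the same type that happen to be adjacent; for most name/surname cases it does what is wanted.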

How to detect that two sentences are similar?

拈花ヽ惹草 submitted on 2019-12-02 18:09:55
I want to compute how similar two arbitrary sentences are to each other. For example:

    A mathematician found a solution to the problem.
    The problem was solved by a young mathematician.

I can use a tagger, a stemmer, and a parser, but I don't know how to detect that these sentences are similar. These two sentences are not just similar, they are almost paraphrases, i.e., two alternative ways of expressing the same meaning. It is also a very simple case of paraphrase, in which both utterances use the same words, with the only exception being that one is in active form while the other is passive. (The two
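
As a starting point, a purely lexical baseline measures how much vocabulary the two sentences share, which is exactly what makes this pair an easy case. A minimal sketch, assuming the NLTK data packages (punkt, the POS tagger, WordNet) are installed:

    from nltk import word_tokenize, pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def content_lemmas(sentence):
        # Drop punctuation, determiners, prepositions, and conjunctions;
        # lemmatize what remains.
        stop_tags = {'DT', 'IN', 'TO', 'CC'}
        return {lemmatizer.lemmatize(word.lower())
                for word, tag in pos_tag(word_tokenize(sentence))
                if word.isalpha() and tag not in stop_tags}

    a = content_lemmas("A mathematician found a solution to the problem.")
    b = content_lemmas("The problem was solved by a young mathematician.")
    print(len(a & b) / len(a | b))  # Jaccard overlap of content lemmas, in [0, 1]

A bag-of-words overlap cannot tell "the dog bit the man" from "the man bit the dog", so real paraphrase detection usually combines this kind of score with syntactic features (e.g., from the parser) or semantic similarity (e.g., WordNet distances or embeddings).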

Is it possible to train the Stanford NER system to recognize more named entity types?

泄露秘密 submitted on 2019-12-02 15:13:05
I'm using some NLP libraries now (Stanford and NLTK). I saw the Stanford demo, but I just want to ask whether it is possible to use it to identify more entity types. Currently the Stanford NER system (as the demo shows) can recognize entities as person (name), organization, or location, but the organizations recognized are limited to universities or some big organizations. I'm wondering if I can use its API to write a program for more entity types, so that if my input is "Apple" or "Square" it can recognize them as companies. Do I have to make my own training dataset? Furthermore, if I ever want to extract
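
Yes, but you have to provide your own training data; there is no pre-trained COMPANY class. Following the Stanford CRF FAQ (http://nlp.stanford.edu/software/crf-faq.shtml), training boils down to a tab-separated token/label file plus a properties file passed to CRFClassifier. A rough sketch, with all file names as placeholders:

    import subprocess

    # One token per line, tab-separated from its class, with a blank line
    # between sentences. Every token of a multi-word company gets the label.
    with open('company.tsv', 'w') as f:
        f.write('Apple\tCOMPANY\nreleased\tO\nthe\tO\niPhone\tO\n\n')

    # company.prop is a copy of the FAQ's austen.prop with trainFile and
    # serializeTo pointed at company.tsv / company-model.ser.gz.
    subprocess.run(['java', '-cp', 'stanford-ner.jar',
                    'edu.stanford.nlp.ie.crf.CRFClassifier',
                    '-prop', 'company.prop'], check=True)

How well this works depends almost entirely on how much labeled data you can assemble; a handful of sentences will not generalize.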

Training n-gram NER with Stanford NLP

巧了我就是萌 submitted on 2019-12-02 14:10:59
Recently I have been trying to train n-gram entities with Stanford Core NLP. I have followed this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b With this, I am able to specify only unigram tokens and the class each belongs to. Can anyone guide me so that I can extend it to n-grams? I am trying to extract known entities like movie names from a chat data set. Please guide me in case I have misinterpreted the Stanford tutorials and the same can be used for n-gram training. What I am stuck with is the following property #structure of your training file;
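
As far as the training file format goes, nothing extra is needed for n-grams: the CRF learns multi-token entities whenever every token of the span carries the same class label. A hypothetical tab-separated excerpt for movie names:

    I           O
    watched     O
    The         MOVIE
    Shawshank   MOVIE
    Redemption  MOVIE
    yesterday   O

At tagging time the classifier still labels tokens individually, so adjacent tokens with the same class have to be merged afterwards, e.g. with the kind of grouping shown above for "Joe Smith".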

edu.stanford.nlp.io.RuntimeIOException: Could not connect to server

拟墨画扇 submitted on 2019-12-02 14:08:33
Question: I'm trying to annotate multiple sentences using the CoreNLP server. However, if I try to do that with too many sentences, I get:

    Exception in thread "Thread-48" edu.stanford.nlp.io.RuntimeIOException: Could not connect to server: 192.168.108.60:9000
        at edu.stanford.nlp.pipeline.StanfordCoreNLPClient$2.run(StanfordCoreNLPClient.java:393)
    Caused by: java.io.IOException: Server returned HTTP response code: 500 for URL: http://192.168.108.60:9000?properties=%7B+%22inputFormat%22%3A+
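
An HTTP 500 on a large request usually means the server hit one of its limits rather than the network failing. Two things worth trying: restart the server with more generous -timeout and -maxCharLength settings, and send the text in smaller batches. A sketch of the batched client in Python (the Java client follows the same pattern); the annotator list and chunk contents are placeholders:

    import json
    import requests

    # Assumes the server was (re)started with more generous limits, e.g.:
    #   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    #        -port 9000 -timeout 30000 -maxCharLength 1000000
    URL = 'http://192.168.108.60:9000'
    PROPS = {'annotators': 'tokenize,ssplit,pos', 'outputFormat': 'json'}

    def annotate(text):
        # One modest request per call instead of a single oversized request.
        resp = requests.post(URL, params={'properties': json.dumps(PROPS)},
                             data=text.encode('utf-8'))
        resp.raise_for_status()
        return resp.json()

    results = [annotate(chunk) for chunk in ['First batch of sentences.',
                                             'Second batch of sentences.']]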

NLTK Stanford Segmenter, how to set CLASSPATH

心不动则不痛 submitted on 2019-12-02 11:54:02
I'm trying to use the Stanford Segmenter bit from the NLTK Tokenize package. However, I run into issues just trying to use the basic test set. Running the following:

    # -*- coding: utf-8 -*-
    from nltk.tokenize.stanford_segmenter import StanfordSegmenter

    seg = StanfordSegmenter()
    seg.default_config('zh')
    sent = u'这是斯坦福中文分词器测试'
    print(seg.segment(sent))

results in an error. I got as far as adding

    import os
    javapath = "C:/Users/User/Folder/stanford-segmenter-2017-06-09/*"
    os.environ['CLASSPATH'] = javapath

to the front of my code, but that didn't seem to help. How do I get the segmenter to
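
For what it's worth, one configuration that has worked is to set the environment variables before NLTK constructs the segmenter: CLASSPATH listing the jar files themselves (NLTK searches for specific jar names, so the bare wildcard may not be picked up), plus a STANFORD_SEGMENTER variable pointing at the install directory, which recent NLTK versions consult when default_config('zh') locates the models. A sketch, with every path a placeholder and the jar name depending on your release:

    import os

    # Placeholders: point these at your own install; os.pathsep is ';' on Windows.
    stanford_dir = "C:/Users/User/Folder/stanford-segmenter-2017-06-09"
    os.environ['STANFORD_SEGMENTER'] = stanford_dir
    os.environ['CLASSPATH'] = os.pathsep.join([
        os.path.join(stanford_dir, 'stanford-segmenter-3.8.0.jar'),
        os.path.join(stanford_dir, 'slf4j-api.jar'),
    ])

    # Import after the environment is set: the jar lookup happens when the
    # segmenter object is constructed.
    from nltk.tokenize.stanford_segmenter import StanfordSegmenter

    seg = StanfordSegmenter()
    seg.default_config('zh')
    print(seg.segment(u'这是斯坦福中文分词器测试'))

If the environment route keeps failing, StanfordSegmenter also accepts explicit path_to_jar, path_to_slf4j, path_to_model, path_to_dict, and path_to_sihan_corpora_dict keyword arguments, which sidesteps the lookup entirely.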

How to store NER results in JSON / a database

偶尔善良 submitted on 2019-12-02 09:39:49
    import nltk
    from itertools import groupby

    def get_continuous_chunks(tagged_sent):
        continuous_chunk = []
        current_chunk = []
        for token, tag in tagged_sent:
            if tag != "O":
                current_chunk.append((token, tag))
            else:
                if current_chunk:  # if the current chunk is not empty
                    continuous_chunk.append(current_chunk)
                    current_chunk = []
        # Flush the final current_chunk into the continuous_chunk, if any.
        if current_chunk:
            continuous_chunk.append(current_chunk)
        return continuous_chunk

    ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), (
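
Once get_continuous_chunks has grouped the tokens, serializing to JSON is just a matter of flattening each chunk to a string plus its tag. A sketch, assuming the (truncated) ne_tagged_sent list above is completed; the entity/type keys and the file name are one possible schema, not a fixed format:

    import json

    chunks = get_continuous_chunks(ne_tagged_sent)
    named_entities = [{'entity': ' '.join(token for token, _ in chunk),
                       'type': chunk[0][1]}
                      for chunk in chunks]

    with open('entities.json', 'w') as f:
        json.dump(named_entities, f, ensure_ascii=False, indent=2)

From there a database insert is one row per entity, with the same two fields as columns.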

Mute Stanford CoreNLP logging

天大地大妈咪最大 submitted on 2019-12-02 06:20:42
Question: First of all, Java is not my usual language, so I'm quite basic at it. I need to use it for this particular project, so please be patient, and if I have omitted any relevant information, please ask for it; I will be happy to provide it. I have been able to implement CoreNLP and, seemingly, have it working right, but it is generating lots of messages like:

    ene 20, 2017 10:38:42 AM edu.stanford.nlp.process.PTBLexer next
    ADVERTENCIA: Untokenizable: 【 (U+3010, decimal: 12304)

After some research
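
Two knobs are commonly suggested for this, sketched below in Java since the logging configuration lives on the Java side: clear CoreNLP's Redwood logging configuration before the pipeline is built, and tell the tokenizer how to handle untokenizable characters so PTBLexer stops warning about them. The annotator list is a placeholder; adapt it to your pipeline.

    import java.util.Properties;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.logging.RedwoodConfiguration;

    public class QuietPipeline {
        public static void main(String[] args) {
            // Silence Redwood, CoreNLP's internal logger, before the
            // pipeline is constructed.
            RedwoodConfiguration.current().clear().apply();

            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos");
            // Keep characters like 【 silently instead of warning about them.
            props.setProperty("tokenize.options", "untokenizable=noneKeep");

            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        }
    }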