named-entity-extraction

How to store NER results in JSON / a database

偶尔善良 submitted on 2019-12-02 09:39:49
import nltk
from itertools import groupby

def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []
    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk:  # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), (
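The question itself is about persisting the NER output. A minimal sketch, assuming the get_continuous_chunks helper above and only Python's standard json module, that collapses each chunk into an entity/type record and serialises it:

import json

# Hypothetical input; the ne_tagged_sent above is cut off in the excerpt.
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
          ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION')]

chunks = get_continuous_chunks(tagged)

# Collapse each chunk into {"entity": ..., "type": ...}; assumes every token
# in a chunk carries the same tag, as produced by the function above.
records = [{"entity": " ".join(tok for tok, _ in chunk), "type": chunk[0][1]}
           for chunk in chunks]

print(json.dumps(records, indent=2))
# [{"entity": "Rami Eid", "type": "PERSON"}, {"entity": "Stony", "type": "ORGANIZATION"}]

The same records list can be handed as rows to any database driver or ORM.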

Name Extraction - CV/Resume - Stanford NER/OpenNLP

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-01 14:51:53
I'm currently on a learning project to extract an individual's name from their CV/Resume. Currently I'm working with Stanford-NER and OpenNLP, which both perform with a degree of success out of the box but tend to struggle on "non-western" type names (no offence intended towards anybody). My question is: given the general lack of sentence structure or context around an individual's name in a CV/Resume, am I likely to gain any significant improvement in name identification by creating something akin to a CV corpus? My initial thoughts are that I'd probably have more success by
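As a point of reference for the out-of-the-box behaviour mentioned above, a minimal sketch of driving Stanford NER from NLTK and keeping only PERSON tokens; the jar and model file names are assumptions and must point at a local Stanford NER download:

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Paths are assumptions; adjust to wherever Stanford NER is unpacked locally.
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner.jar')

text = "Rami Eid is a software engineer with five years of experience."
tagged = st.tag(word_tokenize(text))

# Keep only the tokens the model labelled as PERSON.
names = [token for token, tag in tagged if tag == 'PERSON']
print(names)  # expected: ['Rami', 'Eid']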

Fast algorithm to extract thousands of simple patterns out of large amounts of text

帅比萌擦擦* submitted on 2019-12-01 06:26:52
I want to be able to efficiently match thousands of regexps against GBs of text, knowing that most of these regexps will be fairly simple, like: \bBarack\s(Hussein\s)?Obama\b or \b(John|J\.)\sBoehner\b etc. My current idea is to extract some kind of longest literal substring out of each regexp, use Aho-Corasick to match those substrings and eliminate most of the regexps, and then match the few remaining regexps in full. Can anyone think of something better? You can use (f)lex to generate a DFA, which recognises all the literals in parallel. This might get tricky if there are too many wildcards
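A minimal sketch of the prefilter idea described in the question, assuming the third-party pyahocorasick package, hand-picked literal anchors for each regexp, and an arbitrary ±100-character window around each literal hit:

import re
import ahocorasick  # third-party pyahocorasick package; an assumption, not named in the question

# Map a distinctive literal substring (chosen by hand here) to its full regexp.
patterns = {
    "Obama":   re.compile(r"\bBarack\s(Hussein\s)?Obama\b"),
    "Boehner": re.compile(r"\b(John|J\.)\sBoehner\b"),
}

automaton = ahocorasick.Automaton()
for literal, regex in patterns.items():
    automaton.add_word(literal, (literal, regex))
automaton.make_automaton()

def find_matches(text):
    """Prefilter with Aho-Corasick, then confirm with the full regexp."""
    hits = set()
    for end_idx, (literal, regex) in automaton.iter(text):
        # Only run the expensive regexp near the literal hit; the 100-char
        # window is an arbitrary choice for the sketch.
        window = text[max(0, end_idx - 100):end_idx + 100]
        if regex.search(window):
            hits.add(regex.pattern)
    return hits

print(find_matches("President Barack Obama met with John Boehner."))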

Methods for extracting locations from text?

≡放荡痞女 submitted on 2019-11-29 02:57:36
Question: What are the recommended methods for extracting locations from free text? What I can think of is to use regex rules like "words ... in location". But are there better approaches than this? Also, I can think of having a lookup hash table with names of countries and cities and then comparing every token extracted from the text against that hash table. Does anybody know of better approaches? Edit: I'm trying to extract locations from tweet text, so the issue of the high number of tweets might
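A minimal sketch of the lookup-table idea from the question, assuming a tiny hand-coded gazetteer; in practice the names would come from a larger resource such as GeoNames:

# Tiny illustrative gazetteer; in practice this would be loaded from a
# larger resource (e.g. GeoNames) rather than hard-coded.
GAZETTEER = {"paris", "london", "france", "germany", "new york"}

def extract_locations(text):
    tokens = text.lower().replace(",", " ").split()
    found = set()
    # Check single tokens and adjacent token pairs against the gazetteer,
    # so multi-word names like "new york" are still caught.
    for i, tok in enumerate(tokens):
        if tok in GAZETTEER:
            found.add(tok)
        if i + 1 < len(tokens) and f"{tok} {tokens[i + 1]}" in GAZETTEER:
            found.add(f"{tok} {tokens[i + 1]}")
    return found

print(extract_locations("Just landed in New York, heading to Paris next week"))
# expected: {'new york', 'paris'}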

How to use DBPedia to extract Tags/Keywords from content?

巧了我就是萌 submitted on 2019-11-28 15:09:57
Question: I am exploring how I can use Wikipedia's taxonomy information to extract tags/keywords from my content. I found articles about DBpedia. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. Has anyone used their web services? Do you know how they work and how reliable they are? Answer 1: DBpedia is a fantastic, high-quality resource. In order to turn your content into a set of relevant DBpedia concepts, however, you will need
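One common route from raw content to DBpedia concepts is the DBpedia Spotlight annotation service. A minimal sketch, assuming the requests package and the public Spotlight demo endpoint (the URL and response fields are assumptions and the service may change or be rate-limited; a locally hosted Spotlight instance is safer for real use):

import requests  # third-party; pip install requests

# Assumed public DBpedia Spotlight demo endpoint.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def dbpedia_tags(text, confidence=0.5):
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    # Each "Resources" entry carries a DBpedia URI usable as a tag/keyword.
    return [r["@URI"] for r in resp.json().get("Resources", [])]

print(dbpedia_tags("Barack Obama studied law at Harvard University."))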