named-entity-extraction

How to store NER results in JSON / a database

偶尔善良 submitted on 2019-12-02 09:39:49
import nltk
from itertools import groupby

def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []
    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk:  # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), (
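The question itself is about persisting the NER output. A minimal sketch, assuming the get_continuous_chunks helper above and only Python's standard json module, that collapses each chunk into an entity/type record and serialises it:

import json

# Hypothetical input; the ne_tagged_sent above is cut off in the excerpt.
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
          ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION')]

chunks = get_continuous_chunks(tagged)

# Collapse each chunk into {"entity": ..., "type": ...}; assumes every token
# in a chunk carries the same tag, as produced by the function above.
records = [{"entity": " ".join(tok for tok, _ in chunk), "type": chunk[0][1]}
           for chunk in chunks]

print(json.dumps(records, indent=2))
# [{"entity": "Rami Eid", "type": "PERSON"}, {"entity": "Stony", "type": "ORGANIZATION"}]

The same records list can be handed as rows to any database driver or ORM.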

Name Extraction - CV/Resume - Stanford NER/OpenNLP

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-01 14:51:53
I'm currently on a learning project to extract an individual's name from their CV/Resume. Currently I'm working with Stanford-NER and OpenNLP, which both perform with a degree of success out of the box but tend to struggle on "non-western" type names (no offence intended towards anybody). My question is: given the general lack of sentence structure or context around an individual's name in a CV/Resume, am I likely to gain any significant improvement in name identification by creating something akin to a CV corpus? My initial thoughts are that I'd probably have more success by
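As a point of reference for the out-of-the-box behaviour mentioned above, a minimal sketch of driving Stanford NER from NLTK and keeping only PERSON tokens; the jar and model file names are assumptions and must point at a local Stanford NER download:

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Paths are assumptions; adjust to wherever Stanford NER is unpacked locally.
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner.jar')

text = "Rami Eid is a software engineer with five years of experience."
tagged = st.tag(word_tokenize(text))

# Keep only the tokens the model labelled as PERSON.
names = [token for token, tag in tagged if tag == 'PERSON']
print(names)  # expected: ['Rami', 'Eid']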

Fast algorithm to extract thousands of simple patterns out of large amounts of text

帅比萌擦擦* submitted on 2019-12-01 06:26:52
I want to be able to efficiently match thousands of regexps against GBs of text, knowing that most of these regexps will be fairly simple, like: \bBarack\s(Hussein\s)?Obama\b or \b(John|J\.)\sBoehner\b etc. My current idea is to extract some kind of longest literal substring out of each regexp, use Aho-Corasick to match those substrings and eliminate most of the regexps, and then match the few remaining regexps in full. Can anyone think of something better? You can use (f)lex to generate a DFA, which recognises all the literals in parallel. This might get tricky if there are too many wildcards
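A minimal sketch of the prefilter idea described in the question, assuming the third-party pyahocorasick package, hand-picked literal anchors for each regexp, and an arbitrary ±100-character window around each literal hit:

import re
import ahocorasick  # third-party pyahocorasick package; an assumption, not named in the question

# Map a distinctive literal substring (chosen by hand here) to its full regexp.
patterns = {
    "Obama":   re.compile(r"\bBarack\s(Hussein\s)?Obama\b"),
    "Boehner": re.compile(r"\b(John|J\.)\sBoehner\b"),
}

automaton = ahocorasick.Automaton()
for literal, regex in patterns.items():
    automaton.add_word(literal, (literal, regex))
automaton.make_automaton()

def find_matches(text):
    """Prefilter with Aho-Corasick, then confirm with the full regexp."""
    hits = set()
    for end_idx, (literal, regex) in automaton.iter(text):
        # Only run the expensive regexp near the literal hit; the 100-char
        # window is an arbitrary choice for the sketch.
        window = text[max(0, end_idx - 100):end_idx + 100]
        if regex.search(window):
            hits.add(regex.pattern)
    return hits

print(find_matches("President Barack Obama met with John Boehner."))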

Methods for extracting locations from text?

≡放荡痞女 submitted on 2019-11-29 02:57:36
Question: What are the recommended methods for extracting locations from free text? What I can think of is to use regex rules like "words ... in location". But are there better approaches than this? Also, I can think of having a lookup hash table with names of countries and cities and then comparing every token extracted from the text against that hash table. Does anybody know of better approaches? Edit: I'm trying to extract locations from tweet text, so the issue of the high number of tweets might
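A minimal sketch of the lookup-table idea from the question, assuming a tiny hand-coded gazetteer; in practice the names would come from a larger resource such as GeoNames:

# Tiny illustrative gazetteer; in practice this would be loaded from a
# larger resource (e.g. GeoNames) rather than hard-coded.
GAZETTEER = {"paris", "london", "france", "germany", "new york"}

def extract_locations(text):
    tokens = text.lower().replace(",", " ").split()
    found = set()
    # Check single tokens and adjacent token pairs against the gazetteer,
    # so multi-word names like "new york" are still caught.
    for i, tok in enumerate(tokens):
        if tok in GAZETTEER:
            found.add(tok)
        if i + 1 < len(tokens) and f"{tok} {tokens[i + 1]}" in GAZETTEER:
            found.add(f"{tok} {tokens[i + 1]}")
    return found

print(extract_locations("Just landed in New York, heading to Paris next week"))
# expected: {'new york', 'paris'}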

How to use DBPedia to extract Tags/Keywords from content?

巧了我就是萌 submitted on 2019-11-28 15:09:57
Question: I am exploring how I can use Wikipedia's taxonomy information to extract tags/keywords from my content. I found articles about DBpedia. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. Has anyone used their web services? Do you know how they work and how reliable they are? Answer 1: DBpedia is a fantastic, high-quality resource. In order to turn your content into a set of relevant DBpedia concepts, however, you will need
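One common route from raw content to DBpedia concepts is the DBpedia Spotlight annotation service. A minimal sketch, assuming the requests package and the public Spotlight demo endpoint (the URL and response fields are assumptions and the service may change or be rate-limited; a locally hosted Spotlight instance is safer for real use):

import requests  # third-party; pip install requests

# Assumed public DBpedia Spotlight demo endpoint.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def dbpedia_tags(text, confidence=0.5):
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    # Each "Resources" entry carries a DBpedia URI usable as a tag/keyword.
    return [r["@URI"] for r in resp.json().get("Resources", [])]

print(dbpedia_tags("Barack Obama studied law at Harvard University."))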