named-entity-recognition

How does spaCy use word embeddings for Named Entity Recognition (NER)?

我只是一个虾纸丫 submitted on 2020-01-11 18:54:27
Question: I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations. I'm trying to understand how spaCy recognises entities in text, and I've not been able to find an answer. From this issue on GitHub and this example, it appears that spaCy uses a number of features present in the text, such as POS tags, prefixes, suffixes, and other character- and word-based features, to train an Averaged Perceptron. However, nowhere in the code does it appear that
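The feature-based approach the question describes can be sketched in plain Python. This is a hypothetical illustration of the kind of features (prefix, suffix, capitalisation, neighbouring words) a perceptron-style tagger scores, not spaCy's actual implementation; the feature names and the tiny weight table are invented for the example.

```python
# Hypothetical sketch of feature extraction for a perceptron-style NER tagger.
# Not spaCy's real code: feature names and weights here are invented.

def extract_features(tokens, i):
    """Character/word features for the token at position i, as in classic
    averaged-perceptron taggers (prefix, suffix, shape, neighbours)."""
    word = tokens[i]
    return {
        "word=" + word.lower(),
        "prefix3=" + word[:3].lower(),
        "suffix3=" + word[-3:].lower(),
        "is_title=" + str(word.istitle()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"),
    }

def score(weights, feats, label):
    # A linear model: sum the weights of the active (feature, label) pairs.
    return sum(weights.get((f, label), 0.0) for f in feats)

tokens = ["Alexander", "lived", "in", "Macedon"]
# Toy weight table, standing in for what training would produce.
weights = {("is_title=True", "PERSON"): 1.0, ("prev=in", "LOCATION"): 2.0}

feats = extract_features(tokens, 3)  # features for "Macedon"
best = max(["O", "PERSON", "LOCATION"], key=lambda y: score(weights, feats, y))
```

The perceptron part of training would just bump the weights of the features active at each mistake; word embeddings, which the question asks about, would replace or augment these sparse indicator features with dense vectors.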

Entities on my gazette are not recognized

大兔子大兔子 submitted on 2020-01-11 09:05:10
Question: I would like to create a custom NER model. Here is what I did: TRAINING DATA (stanford-ner.tsv): Hello O ! O My O name O is O Damiano PERSON . O PROPERTIES (stanford-ner.prop): trainFile = stanford-ner.tsv serializeTo = ner-model.ser.gz map = word=0,answer=1 maxLeft=1 useClassFeature=true useWord=true useNGrams=true noMidNGrams=true maxNGramLeng=6 usePrev=true useNext=true useDisjunctive=true useSequences=true usePrevSequences=true useTypeSeqs=true useTypeSeqs2=true useTypeySequences=true
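Stanford NER's training file, as shown above, is one token per line with a tab-separated label. That format can be generated from plain text plus a gazette with a short script. A minimal sketch (whitespace tokenisation and an exact-match gazette lookup are simplifications; Stanford's own tokeniser is more careful):

```python
def to_stanford_tsv(sentence, gazette, label):
    """Emit Stanford NER training lines: 'token<TAB>label', one per token.
    Tokens found in `gazette` get `label`; everything else gets 'O'."""
    lines = []
    for token in sentence.split():
        tag = label if token in gazette else "O"
        lines.append(token + "\t" + tag)
    return "\n".join(lines)

tsv = to_stanford_tsv("My name is Damiano .", {"Damiano"}, "PERSON")
```

Note that tagging only exact gazette hits, as here, is precisely why gazette entities that appear in unseen inflections or multi-token variants may go unrecognised at prediction time.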

Why is a self-trained NER model incompatible with this version of OpenNLP?

*爱你&永不变心* submitted on 2020-01-03 05:48:28
Question: I trained an OpenNLP NER model to detect a new entity, but when I use this model I encounter the following exception: Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: Model version 1.6.0 is not supported by this (1.5.3) version of OpenNLP! I am using OpenNLP version 1.6.0, and my source code is this: String [] sentences = Fragmentation.getSentences(Document); InputStream modelIn = new FileInputStream("Models/en-ner-cvskill.bin");
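A mismatch like this usually means an older opennlp-tools jar (here 1.5.3) is still on the runtime classpath even though the model was trained with 1.6.0. Since an OpenNLP .bin model is a ZIP archive containing a manifest.properties, you can inspect which version wrote it before loading. A sketch (the OpenNLP-Version manifest key is what I believe OpenNLP writes; treat it as an assumption):

```python
import zipfile

def model_opennlp_version(path):
    """Read the OpenNLP-Version entry from a model's manifest.properties.
    OpenNLP .bin models are ZIP archives; returns None if the key is absent."""
    with zipfile.ZipFile(path) as zf:
        manifest = zf.read("manifest.properties").decode("utf-8")
    for line in manifest.splitlines():
        if line.startswith("OpenNLP-Version"):
            return line.split("=", 1)[1].strip()
    return None
```

If the reported model version is newer than the opennlp-tools jar you run with, align the two (typically by removing the stale 1.5.x jar from the classpath).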

How do I classify words in a text into categories like names, numbers, money, dates, etc.?

人走茶凉 submitted on 2020-01-01 07:30:52
Question: I asked some questions about text mining a week ago; I was a bit confused then and still am, but now I know what I want to do. The situation: I have a lot of downloaded pages with HTML content. Some of them can be text from a blog, for example. They are not structured and come from different sites. What I want to do: I will split all the words on whitespace, and I want to classify each one, or a group of them, into pre-defined items like names, numbers, phone, email, url, date, money,
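For pre-defined items like these, a first pass with regular expressions gets you surprisingly far before reaching for a full NER system. A rough sketch (the patterns are deliberately simple and illustrative, not production-grade; order matters because the first match wins):

```python
import re

# Ordered (category, pattern) pairs: first match wins, so put the
# more specific patterns (email, money, date) before the generic ones.
PATTERNS = [
    ("email",  re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")),
    ("url",    re.compile(r"^https?://\S+$")),
    ("money",  re.compile(r"^[$€£]\d+(?:[.,]\d+)?$")),
    ("date",   re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("phone",  re.compile(r"^\+?\d[\d\s()-]{6,}\d$")),
    ("number", re.compile(r"^\d+(?:[.,]\d+)?$")),
    ("name",   re.compile(r"^[A-Z][a-z]+$")),  # crude: any capitalised word
]

def classify(token):
    for category, pattern in PATTERNS:
        if pattern.match(token):
            return category
    return "other"
```

Names are the weak spot here (any capitalised word matches), which is exactly where a statistical NER model earns its keep.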

Methods for Geotagging or Geolabelling Text Content

℡╲_俬逩灬. submitted on 2019-12-30 00:38:07
Question: What are some good algorithms for automatically labelling text with its city / region of origin? That is, if a blog is about New York, how can I tell programmatically? Are there packages / papers that claim to do this with any degree of certainty? I have looked at some tf-idf-based approaches and proper-noun intersections, but so far no spectacular successes, and I'd appreciate ideas! The more general question is about assigning texts to topics, given some list of topics. Simple / naive approaches
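One of the simpler baselines for this is a gazetteer count: look up place names in the text and pick the most frequent. A naive sketch (the tiny gazetteer here is made up for illustration; a real one would come from a resource such as GeoNames, and real systems must also disambiguate names like Paris, Texas):

```python
from collections import Counter

def guess_location(text, gazetteer):
    """Return the gazetteer place mentioned most often in `text`, or None
    if no place is mentioned. Naive: exact phrase matching only."""
    lowered = text.lower()
    counts = Counter({place: lowered.count(place.lower()) for place in gazetteer})
    place, n = counts.most_common(1)[0]
    return place if n > 0 else None

gazetteer = ["New York", "Paris", "Berlin"]  # stand-in for a real gazetteer
text = "Walking through New York, from Brooklyn to Harlem... New York never sleeps."
```

The tf-idf and proper-noun-intersection approaches the question mentions generalise this: weight each place mention by how distinctive it is rather than counting raw occurrences.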

Stanford NER tagger generates 'file not found' exception with provided models

风流意气都作罢 submitted on 2019-12-29 08:25:28
Question: I downloaded Stanford NER 3.4.1, unpacked it, and tried to run named entity recognition on a local file using the default (provided) trained model. I got this: `java.io.FileNotFoundException: /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters (No such file or directory) at edu.stanford.nlp.io.IOUtils.inputStreamFromFile(IOUtils.java:481)` What's wrong and how can I fix it? Answer 1: It turns out that the provided models use "distributional similarity features" that require a .clusters file at

Learning NER using a category list

狂风中的少年 submitted on 2019-12-25 05:19:17
Question: In the template for training CRF++, how can I include a custom dictionary.txt file of listed companies, another of popular European foods, for example, or just about any category? Then provide sample training data for each category, whereby it learns how those specific named entities are used in context for that category. This way, both I and the system can be sure it has correctly understood how certain named entities are structured in a text, whether a tweet or a Pulitzer-prize-winning
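CRF++ has no built-in dictionary mechanism; the usual workaround is to pre-compute a gazetteer-membership feature as an extra column in the training file, and then reference that column from the template. A sketch of the pre-processing step (the column layout and the IN_DICT/O values are my own convention, not anything CRF++ mandates):

```python
def add_gazetteer_column(rows, gazetteer):
    """rows: list of (token, label) pairs. Returns CRF++-style lines
    'token<TAB>dict_feature<TAB>label' with a gazetteer-membership column."""
    out = []
    for token, label in rows:
        feat = "IN_DICT" if token in gazetteer else "O"
        out.append(f"{token}\t{feat}\t{label}")
    return out

companies = {"Acme", "Globex"}  # contents of a hypothetical dictionary.txt
rows = [("Shares", "O"), ("of", "O"), ("Acme", "B-ORG"), ("rose", "O")]
lines = add_gazetteer_column(rows, companies)
```

A template unigram rule over column 1 (e.g. `U10:%x[0,1]`) then lets the CRF learn how much weight dictionary membership deserves in context, which is exactly the behaviour the question asks for.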

Training Stanford NER CRF: controlling the number of iterations and the regularisation (L1, L2) parameters

China☆狼群 submitted on 2019-12-24 23:15:53
Question: I was looking through the Stanford NER documentation/FAQ, but I can't find anything about specifying the maximum number of training iterations or the values of the L1 and L2 regularisation parameters. I saw an answer which suggested setting, for instance, maxIterations=10 in the properties file, but that did not give any results. Is it possible to set these parameters? Answer 1: I had to dig into the code but found it; basically Stanford NER supports many different numerical

How to feed CoreNLP some pre-labeled Named Entities?

假装没事ソ submitted on 2019-12-24 20:14:19
Question: I want to use Stanford CoreNLP to pull out coreferences and start working on the dependencies of pre-labelled text. I eventually hope to build graph nodes and edges between related named entities. I am working in Python, but using NLTK's Java functions to call the "edu.stanford.nlp.pipeline.StanfordCoreNLP" jar directly (which is what NLTK does behind the scenes anyway). My pre-labelled text is in this format: PRE-LABELED: During his youth, [PERSON: Alexander III of Macedon] was tutored by
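Before handing pre-labelled text of this form to a pipeline, it helps to strip the markup while keeping the character spans of the entities. A sketch of a parser for the bracketed format shown (the [LABEL: text] syntax is taken from the question; the function name and span representation are mine):

```python
import re

LABELLED = re.compile(r"\[([A-Z]+):\s*([^\]]+)\]")

def parse_prelabelled(text):
    """Return (plain_text, entities) where entities are
    (start, end, label) character spans into plain_text."""
    plain, entities = [], []
    last = offset = 0
    for m in LABELLED.finditer(text):
        plain.append(text[last:m.start()])
        offset += m.start() - last
        span_text = m.group(2)
        entities.append((offset, offset + len(span_text), m.group(1)))
        plain.append(span_text)
        offset += len(span_text)
        last = m.end()
    plain.append(text[last:])
    return "".join(plain), entities

sentence = "During his youth, [PERSON: Alexander III of Macedon] was tutored."
plain, ents = parse_prelabelled(sentence)
```

The plain text can then go to the pipeline as usual, and the recovered spans can seed the graph nodes the question describes.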