named-entity-recognition

How does spaCy use word embeddings for Named Entity Recognition (NER)?

我只是一个虾纸丫 submitted on 2020-01-11 18:54:27
Question: I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations. I'm trying to understand how spaCy recognises entities in text, and I've not been able to find an answer. From this issue on GitHub and this example, it appears that spaCy uses a number of features present in the text, such as POS tags, prefixes, suffixes, and other character- and word-based features, to train an Averaged Perceptron. However, nowhere in the code does it appear that
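The feature-based approach the question describes can be sketched in plain Python. This is a hypothetical illustration of the kind of features (prefix, suffix, capitalisation, neighbouring words) a perceptron-style tagger scores, not spaCy's actual implementation; the feature names and the tiny weight table are invented for the example.

```python
# Hypothetical sketch of feature extraction for a perceptron-style NER tagger.
# Not spaCy's real code: feature names and weights here are invented.

def extract_features(tokens, i):
    """Character/word features for the token at position i, as in classic
    averaged-perceptron taggers (prefix, suffix, shape, neighbours)."""
    word = tokens[i]
    return {
        "word=" + word.lower(),
        "prefix3=" + word[:3].lower(),
        "suffix3=" + word[-3:].lower(),
        "is_title=" + str(word.istitle()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"),
    }

def score(weights, feats, label):
    # A linear model: sum the weights of the active (feature, label) pairs.
    return sum(weights.get((f, label), 0.0) for f in feats)

tokens = ["Alexander", "lived", "in", "Macedon"]
# Toy weight table, standing in for what training would produce.
weights = {("is_title=True", "PERSON"): 1.0, ("prev=in", "LOCATION"): 2.0}

feats = extract_features(tokens, 3)  # features for "Macedon"
best = max(["O", "PERSON", "LOCATION"], key=lambda y: score(weights, feats, y))
```

The perceptron part of training would just bump the weights of the features active at each mistake; word embeddings, which the question asks about, would replace or augment these sparse indicator features with dense vectors.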

Entities on my gazette are not recognized

大兔子大兔子 submitted on 2020-01-11 09:05:10
Question: I would like to create a custom NER model. Here is what I did: TRAINING DATA (stanford-ner.tsv): Hello O ! O My O name O is O Damiano PERSON . O PROPERTIES (stanford-ner.prop): trainFile = stanford-ner.tsv serializeTo = ner-model.ser.gz map = word=0,answer=1 maxLeft=1 useClassFeature=true useWord=true useNGrams=true noMidNGrams=true maxNGramLeng=6 usePrev=true useNext=true useDisjunctive=true useSequences=true usePrevSequences=true useTypeSeqs=true useTypeSeqs2=true useTypeySequences=true
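Stanford NER's training file, as shown above, is one token per line with a tab-separated label. That format can be generated from plain text plus a gazette with a short script. A minimal sketch (whitespace tokenisation and an exact-match gazette lookup are simplifications; Stanford's own tokeniser is more careful):

```python
def to_stanford_tsv(sentence, gazette, label):
    """Emit Stanford NER training lines: 'token<TAB>label', one per token.
    Tokens found in `gazette` get `label`; everything else gets 'O'."""
    lines = []
    for token in sentence.split():
        tag = label if token in gazette else "O"
        lines.append(token + "\t" + tag)
    return "\n".join(lines)

tsv = to_stanford_tsv("My name is Damiano .", {"Damiano"}, "PERSON")
```

Note that tagging only exact gazette hits, as here, is precisely why gazette entities that appear in unseen inflections or multi-token variants may go unrecognised at prediction time.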

Why is a self-trained NER model incompatible with this version of OpenNLP?

*爱你&永不变心* submitted on 2020-01-03 05:48:28
Question: I trained an OpenNLP NER model to detect a new entity, but when I use this model I encounter the following exception: Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: Model version 1.6.0 is not supported by this (1.5.3) version of OpenNLP! I am using OpenNLP version 1.6.0, and my source code is this: String [] sentences = Fragmentation.getSentences(Document); InputStream modelIn = new FileInputStream("Models/en-ner-cvskill.bin");
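A mismatch like this usually means an older opennlp-tools jar (here 1.5.3) is still on the runtime classpath even though the model was trained with 1.6.0. Since an OpenNLP .bin model is a ZIP archive containing a manifest.properties, you can inspect which version wrote it before loading. A sketch (the OpenNLP-Version manifest key is what I believe OpenNLP writes; treat it as an assumption):

```python
import zipfile

def model_opennlp_version(path):
    """Read the OpenNLP-Version entry from a model's manifest.properties.
    OpenNLP .bin models are ZIP archives; returns None if the key is absent."""
    with zipfile.ZipFile(path) as zf:
        manifest = zf.read("manifest.properties").decode("utf-8")
    for line in manifest.splitlines():
        if line.startswith("OpenNLP-Version"):
            return line.split("=", 1)[1].strip()
    return None
```

If the reported model version is newer than the opennlp-tools jar you run with, align the two (typically by removing the stale 1.5.x jar from the classpath).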

How do I classify words in a text into categories like names, numbers, money, dates, etc.?

人走茶凉 submitted on 2020-01-01 07:30:52
Question: I asked some questions about text mining a week ago; I was a bit confused then and still am, but now I know what I want to do. The situation: I have a lot of downloaded pages with HTML content. Some of them can be text from a blog, for example. They are not structured and come from different sites. What I want to do: I will split all the words on whitespace, and I want to classify each one, or a group of them, into pre-defined items like names, numbers, phone, email, url, date, money,
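For pre-defined items like these, a first pass with regular expressions gets you surprisingly far before reaching for a full NER system. A rough sketch (the patterns are deliberately simple and illustrative, not production-grade; order matters because the first match wins):

```python
import re

# Ordered (category, pattern) pairs: first match wins, so put the
# more specific patterns (email, money, date) before the generic ones.
PATTERNS = [
    ("email",  re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")),
    ("url",    re.compile(r"^https?://\S+$")),
    ("money",  re.compile(r"^[$€£]\d+(?:[.,]\d+)?$")),
    ("date",   re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("phone",  re.compile(r"^\+?\d[\d\s()-]{6,}\d$")),
    ("number", re.compile(r"^\d+(?:[.,]\d+)?$")),
    ("name",   re.compile(r"^[A-Z][a-z]+$")),  # crude: any capitalised word
]

def classify(token):
    for category, pattern in PATTERNS:
        if pattern.match(token):
            return category
    return "other"
```

Names are the weak spot here (any capitalised word matches), which is exactly where a statistical NER model earns its keep.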

Methods for Geotagging or Geolabelling Text Content

℡╲_俬逩灬. submitted on 2019-12-30 00:38:07
Question: What are some good algorithms for automatically labelling text with its city / region of origin? That is, if a blog is about New York, how can I tell programmatically? Are there packages / papers that claim to do this with any degree of certainty? I have looked at some tf-idf-based approaches and proper-noun intersections, but so far no spectacular successes, and I'd appreciate ideas! The more general question is about assigning texts to topics, given some list of topics. Simple / naive approaches
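One of the simpler baselines for this is a gazetteer count: look up place names in the text and pick the most frequent. A naive sketch (the tiny gazetteer here is made up for illustration; a real one would come from a resource such as GeoNames, and real systems must also disambiguate names like Paris, Texas):

```python
from collections import Counter

def guess_location(text, gazetteer):
    """Return the gazetteer place mentioned most often in `text`, or None
    if no place is mentioned. Naive: exact phrase matching only."""
    lowered = text.lower()
    counts = Counter({place: lowered.count(place.lower()) for place in gazetteer})
    place, n = counts.most_common(1)[0]
    return place if n > 0 else None

gazetteer = ["New York", "Paris", "Berlin"]  # stand-in for a real gazetteer
text = "Walking through New York, from Brooklyn to Harlem... New York never sleeps."
```

The tf-idf and proper-noun-intersection approaches the question mentions generalise this: weight each place mention by how distinctive it is rather than counting raw occurrences.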

Stanford NER tagger generates 'file not found' exception with provided models

风流意气都作罢 submitted on 2019-12-29 08:25:28
Question: I downloaded Stanford NER 3.4.1, unpacked it, and tried to run named entity recognition on a local file using the default (provided) trained model. I got this: `java.io.FileNotFoundException: /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters (No such file or directory) at edu.stanford.nlp.io.IOUtils.inputStreamFromFile(IOUtils.java:481)` What's wrong and how can I fix it? Answer 1: It turns out that the provided models use "distributional similarity features" that require a .clusters file at

Learning NER using a category list

狂风中的少年 submitted on 2019-12-25 05:19:17
Question: In the template for training CRF++, how can I include a custom dictionary.txt file of listed companies, another of popular European foods, for example, or just about any category? Then provide sample training data for each category, whereby it learns how those specific named entities are used in context for that category. This way, both I and the system can be sure it has correctly understood how certain named entities are structured in a text, whether a tweet or a Pulitzer-prize-winning
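CRF++ has no built-in dictionary mechanism; the usual workaround is to pre-compute a gazetteer-membership feature as an extra column in the training file, and then reference that column from the template. A sketch of the pre-processing step (the column layout and the IN_DICT/O values are my own convention, not anything CRF++ mandates):

```python
def add_gazetteer_column(rows, gazetteer):
    """rows: list of (token, label) pairs. Returns CRF++-style lines
    'token<TAB>dict_feature<TAB>label' with a gazetteer-membership column."""
    out = []
    for token, label in rows:
        feat = "IN_DICT" if token in gazetteer else "O"
        out.append(f"{token}\t{feat}\t{label}")
    return out

companies = {"Acme", "Globex"}  # contents of a hypothetical dictionary.txt
rows = [("Shares", "O"), ("of", "O"), ("Acme", "B-ORG"), ("rose", "O")]
lines = add_gazetteer_column(rows, companies)
```

A template unigram rule over column 1 (e.g. `U10:%x[0,1]`) then lets the CRF learn how much weight dictionary membership deserves in context, which is exactly the behaviour the question asks for.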

Training Stanford NER CRF: controlling the number of iterations and the regularisation (L1, L2) parameters

China☆狼群 submitted on 2019-12-24 23:15:53
Question: I was looking through the Stanford NER documentation/FAQ, but I can't find anything about specifying the maximum number of training iterations or the values of the L1 and L2 regularisation parameters. I saw an answer which suggested setting, for instance, maxIterations=10 in the properties file, but that did not give any results. Is it possible to set these parameters? Answer 1: I had to dig into the code but found it; basically Stanford NER supports many different numerical

How to feed CoreNLP some pre-labeled Named Entities?

假装没事ソ submitted on 2019-12-24 20:14:19
Question: I want to use Stanford CoreNLP to pull out coreferences and start working on the dependencies of pre-labelled text. I eventually hope to build graph nodes and edges between related named entities. I am working in Python, but using NLTK's Java functions to call the "edu.stanford.nlp.pipeline.StanfordCoreNLP" jar directly (which is what NLTK does behind the scenes anyway). My pre-labelled text is in this format: PRE-LABELED: During his youth, [PERSON: Alexander III of Macedon] was tutored by
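Before handing pre-labelled text of this form to a pipeline, it helps to strip the markup while keeping the character spans of the entities. A sketch of a parser for the bracketed format shown (the [LABEL: text] syntax is taken from the question; the function name and span representation are mine):

```python
import re

LABELLED = re.compile(r"\[([A-Z]+):\s*([^\]]+)\]")

def parse_prelabelled(text):
    """Return (plain_text, entities) where entities are
    (start, end, label) character spans into plain_text."""
    plain, entities = [], []
    last = offset = 0
    for m in LABELLED.finditer(text):
        plain.append(text[last:m.start()])
        offset += m.start() - last
        span_text = m.group(2)
        entities.append((offset, offset + len(span_text), m.group(1)))
        plain.append(span_text)
        offset += len(span_text)
        last = m.end()
    plain.append(text[last:])
    return "".join(plain), entities

sentence = "During his youth, [PERSON: Alexander III of Macedon] was tutored."
plain, ents = parse_prelabelled(sentence)
```

The plain text can then go to the pipeline as usual, and the recovered spans can seed the graph nodes the question describes.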