grouping all Named entities in a Document

问题

I would like to group all named entities in a given document. For Example,

**Barack Hussein Obama** II  is the 44th and current President of the United States, and the first African American to hold the office.

I do not want to use OpenNLP APIs as it might not be able to recognize all named entities. Is there any way to generate such n-grams using other services or may be a way to group all noun terms together.

回答1:

If you want to avoid using NER, you could use a sentence chunker or parser. This will extract noun phrases generically. OpenNLP has a sentence chunker and parser, but if you are for some reason adverse to using OpenNLP, you can try others. If you are interested in using the OpenNLP chunker i will post some code that extracts noun phrases using OpenNLP.

Here is the code. You will need to download the models from sourceforge here

http://opennlp.sourceforge.net/models-1.5/

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

/**
 *
 * Extracts noun phrases from a sentence. To create sentences using OpenNLP use
 * the SentenceDetector classes.
 */
public class OpenNLPNounPhraseExtractor {

  static final int N = 2;

  public static void main(String[] args) {

    try {
      String modelPath = "c:\\temp\\opennlpmodels\\";
      TokenizerModel tm = new TokenizerModel(new FileInputStream(new File(modelPath + "en-token.zip")));
      TokenizerME wordBreaker = new TokenizerME(tm);
      POSModel pm = new POSModel(new FileInputStream(new File(modelPath + "en-pos-maxent.zip")));
      POSTaggerME posme = new POSTaggerME(pm);
      InputStream modelIn = new FileInputStream(modelPath + "en-chunker.zip");
      ChunkerModel chunkerModel = new ChunkerModel(modelIn);
      ChunkerME chunkerME = new ChunkerME(chunkerModel);
      //this is your sentence
      String sentence = "Barack Hussein Obama II  is the 44th and current President of the United States, and the first African American to hold the office.";
      //words is the tokenized sentence
      String[] words = wordBreaker.tokenize(sentence);
      //posTags are the parts of speech of every word in the sentence (The chunker needs this info of course)
      String[] posTags = posme.tag(words);
      //chunks are the start end "spans" indices to the chunks in the words array
      Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
      //chunkStrings are the actual chunks
      String[] chunkStrings = Span.spansToStrings(chunks, words);
      for (int i = 0; i < chunks.length; i++) {
        if (chunks[i].getType().equals("NP")) {
          System.out.println("NP: \n\t" + chunkStrings[i]);
          String[] split = chunkStrings[i].split(" ");

          List<String> ngrams = ngram(Arrays.asList(split), N, " ");
          System.out.println("ngrams:");
          for (String gram : ngrams) {
            System.out.println("\t" + gram);
          }

        }
      }


    } catch (IOException e) {
    }
  }

  public static List<String> ngram(List<String> input, int n, String separator) {
    if (input.size() <= n) {
      return input;
    }
    List<String> outGrams = new ArrayList<String>();
    for (int i = 0; i < input.size() - (n - 2); i++) {
      String gram = "";
      if ((i + n) <= input.size()) {
        for (int x = i; x < (n + i); x++) {
          gram += input.get(x) + separator;
        }
        gram = gram.substring(0, gram.lastIndexOf(separator));
        outGrams.add(gram);
      }
    }
    return outGrams;
  }
}

the output I get with your sentence is this (with N set to 2 (bigram)

NP: 
    Barack Hussein Obama II
ngrams:
    Barack Hussein
    Hussein Obama
    Obama II
NP: 
    the 44th and current President
ngrams:
    the 44th
    44th and
    and current
    current President
NP: 
    the United States
ngrams:
    the United
    United States
NP: 
    the first African American
ngrams:
    the first
    first African
    African American
NP: 
    the office
ngrams:
    the
    office

this does not explicitly handle the case of when an adjective falls outside of the NP... if so you can get this info from the POS tags and integrate it. What I gave you should send you in the right direction.

来源：https://stackoverflow.com/questions/21552647/grouping-all-named-entities-in-a-document

标签

n-gram

named-entity-recognition

part-of-speech