How do I classify the words of a text into categories such as names, numbers, money, dates, etc.?

Question

I asked some questions about text mining a week ago, and I was (and still am) a bit confused, but now I know what I want to do.

The situation: I have a lot of downloaded pages with HTML content. Some of them may be, for example, text from a blog. They are not structured and come from different sites.

What I want to do: I will split the text into words on whitespace, and I want to classify each word, or group of words, into predefined categories such as names, numbers, phone, email, URL, date, money, temperature, etc.

What I know: I know (or have heard about) the concepts of Natural Language Processing, Named Entity Recognition, POS tagging, Naive Bayes, HMMs, training, and many other things used for classification, but there are several NLP libraries with different classifiers and different ways to do this, and I don't know which to use or what to do.

WHAT I NEED: I need a code example using a classifier, an NLP library, whatever, that can classify each word of a text separately rather than the text as a whole. Something like this:

// This is pseudo-code for what I want, not an implementation

classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for(String word: words){
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}

Can somebody help me? I'm confused by the various APIs, classifiers and algorithms.

Answer 1:

You should try Apache OpenNLP. It is easy to use and customize.

If you are doing it for Portuguese, the project documentation explains how to do it using the Amazonia Corpus. The types supported are:

Person, Organization, Group, Place, Event, ArtProd, Abstract, Thing, Time and Numeric.

Download OpenNLP and the Amazonia Corpus. Extract both and copy the file amazonia.ad to the apache-opennlp-1.5.1-incubating folder.

Execute the TokenNameFinderConverter tool to convert the Amazonia corpus to the OpenNLP format:

bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt

Train your model (change the encoding to the encoding of the corpus.txt file, which should be your system default encoding; this command can take several minutes):
```
bin/opennlp TokenNameFinderTrainer -lang pt -encoding UTF-8 -data corpus.txt -model pt-ner.bin -cutoff 20
```

Running it from the command line (you should pass one sentence at a time, with the tokens already separated by spaces):

$ bin/opennlp TokenNameFinder pt-ner.bin 
Loading Token Name Finder model ... done (1,112s)
Meu nome é João da Silva , moro no Brasil . Trabalho na Petrobras e tenho 50 anos .
Meu nome é <START:person> João da Silva <END> , moro no <START:place> Brasil <END> . <START:abstract> Trabalho <END> na <START:abstract> Petrobras <END> e tenho <START:numeric> 50 anos <END> .
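If you want to consume that tagged line from a program instead of reading it by eye, a little parsing is enough. The following is a minimal sketch, assuming the <START:type> ... <END> layout shown above; the class TaggedLineParser and its parse method are hypothetical helpers, not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of OpenNLP): extracts "type=text" pairs
// from a line in the <START:type> ... <END> format printed by the tool.
final class TaggedLineParser {
    // group 1 = entity type, group 2 = the tokens between the markers
    private static final Pattern ENTITY =
        Pattern.compile("<START:([^>]+)>\\s+(.*?)\\s+<END>");

    static List<String> parse(String taggedLine) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(taggedLine);
        while (m.find()) {
            entities.add(m.group(1) + "=" + m.group(2));
        }
        return entities;
    }

    public static void main(String[] args) {
        String line = "moro no <START:place> Brasil <END> . e tenho <START:numeric> 50 anos <END> .";
        for (String entity : parse(line)) {
            System.out.println(entity);
        }
    }
}
```

For anything beyond a quick script, using the API directly is likely the cleaner route, since it gives you token offsets instead of text to re-parse.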

Executing it using the API:

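import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NameFinderExample {
  public static void main(String[] args) throws IOException {
    // load the trained model; it is declared outside the try block so it
    // stays in scope, and try-with-resources closes the stream for us
    TokenNameFinderModel model;
    try (InputStream modelIn = new FileInputStream("pt-ner.bin")) {
      model = new TokenNameFinderModel(modelIn);
    }

    // create the name finder from the model
    NameFinderME nameFinder = new NameFinderME(model);

    // pass the token array to the name finder
    String[] toks = {"Meu","nome","é","João","da","Silva",",","moro","no","Brasil",".","Trabalho","na","Petrobras","e","tenho","50","anos","."};

    // each Span holds the start (inclusive) and end (exclusive) token index of a name, plus its type
    Span[] nameSpans = nameFinder.find(toks);
    for (Span span : nameSpans) {
      System.out.println(span.getType() + ": tokens " + span.getStart() + " to " + (span.getEnd() - 1));
    }
  }
}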
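Since the question asks for one label per word, the spans returned by find can be projected back onto the tokens. Here is a minimal sketch of that step. SimpleSpan is a hypothetical stand-in for OpenNLP's Span (start inclusive, end exclusive, plus a type) so the example runs without the library, and "other" is an arbitrary label I chose for tokens outside any span.

```java
import java.util.Arrays;

final class SpanLabels {
    // Hypothetical stand-in for opennlp.tools.util.Span:
    // start is inclusive, end is exclusive, both are token indexes.
    static final class SimpleSpan {
        final int start;
        final int end;
        final String type;

        SimpleSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Returns one label per token: the span type if the token is covered
    // by a span, otherwise "other" (an arbitrary placeholder label).
    static String[] labelTokens(int tokenCount, SimpleSpan[] spans) {
        String[] labels = new String[tokenCount];
        Arrays.fill(labels, "other");
        for (SimpleSpan span : spans) {
            for (int i = span.start; i < span.end; i++) {
                labels[i] = span.type;
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String[] toks = {"moro", "no", "Brasil", ".", "tenho", "50", "anos", "."};
        SimpleSpan[] spans = {
            new SimpleSpan(2, 3, "place"),
            new SimpleSpan(5, 7, "numeric")
        };
        String[] labels = labelTokens(toks.length, spans);
        for (int i = 0; i < toks.length; i++) {
            System.out.println(toks[i] + "\t" + labels[i]);
        }
    }
}
```

With the real library you would read the same three values from getStart(), getEnd() and getType() on each Span.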

To evaluate your model you can use 10-fold cross-validation (only available in 1.5.2-INCUBATOR; to use it today you need the SVN trunk; it can take several hours):
```
bin/opennlp TokenNameFinderCrossValidator -lang pt -encoding UTF-8 -data corpus.txt -cutoff 20
```
You can improve the precision/recall by using Custom Feature Generation (check the documentation), for example by adding a name dictionary.
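In case precision and recall are unfamiliar, here is a small sketch of how they can be computed for entity spans. It assumes exact-match scoring, where a predicted span counts only if its boundaries and type both agree with the reference annotation. The class SpanScores and the "start-end:type" string encoding are made up for illustration and are not how OpenNLP represents spans internally.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: spans are encoded as strings like "3-6:person".
final class SpanScores {
    // Number of predicted spans that also appear in the gold annotation.
    private static int truePositives(Set<String> gold, Set<String> predicted) {
        Set<String> hits = new HashSet<>(predicted);
        hits.retainAll(gold);
        return hits.size();
    }

    // Fraction of predicted spans that are correct.
    static double precision(Set<String> gold, Set<String> predicted) {
        if (predicted.isEmpty()) return 0.0;
        return (double) truePositives(gold, predicted) / predicted.size();
    }

    // Fraction of gold spans that were found.
    static double recall(Set<String> gold, Set<String> predicted) {
        if (gold.isEmpty()) return 0.0;
        return (double) truePositives(gold, predicted) / gold.size();
    }

    // Harmonic mean of precision and recall.
    static double f1(Set<String> gold, Set<String> predicted) {
        double p = precision(gold, predicted);
        double r = recall(gold, predicted);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }
}
```

A name dictionary would typically help recall on names the model has not seen in training, at the risk of some precision if the dictionary contains ambiguous words.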

Answer 2:

You can use a Named Entity Recognizer (NER) approach for this task. I would highly recommend taking a look at the Stanford Core NLP page and using the ner functionality in its modules. You can break your sentences up into tokens and then pass them to the Stanford NER system. I think the Stanford Core NLP page has a lot of examples that can help you; otherwise, please let me know if you need some toy code.

Here is some sample code; it is just a snippet of a larger program:

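import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// creates a StanfordCoreNLP object with NER; the ner annotator
// needs the tokenize, ssplit, pos and lemma annotators to run first
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// annotate the whole text at once (text holds your document), since NER relies on context
Annotation document = new Annotation(text);
pipeline.annotate(document);

// print each token together with its entity type ("O" means no entity)
for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
    String word = token.get(CoreAnnotations.TextAnnotation.class);
    String type = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
    System.out.println(word + "\t" + type);
}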

Answer 3:

This problem falls at the intersection of several ideas from different areas. You mention named entity recognition; that is one of them. However, you are probably looking at a mixture of part-of-speech tagging (for nouns, names and the like) and information extraction (for numbers, phone numbers, emails).

Unfortunately, doing this and making it work on real-world data will require some effort; it is not as simple as using this or that API.
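For the pattern-like categories (emails, URLs, dates, numbers), a rule-based pass is often a reasonable place to start before bringing in a trained model. The following is a rough sketch rather than a production-grade recognizer: the regular expressions are deliberately simplified assumptions, and the class name PatternClassifier and the category labels are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class PatternClassifier {
    // Rules are tried in insertion order; the first full match wins.
    // All patterns are simplified assumptions, not complete grammars.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("EMAIL", Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+"));
        RULES.put("URL", Pattern.compile("(https?://|www\\.)\\S+"));
        RULES.put("DATE", Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}|\\d{4}-\\d{2}-\\d{2}"));
        RULES.put("NUMBER", Pattern.compile("[+-]?\\d+([.,]\\d+)*"));
    }

    // Returns the label of the first rule matching the whole token,
    // or "OTHER" when no rule applies.
    static String classify(String token) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(token).matches()) {
                return rule.getKey();
            }
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        String text = "Mail john@example.com before 2011-05-20 , see http://example.com , it costs 50";
        for (String word : text.split("\\s+")) {
            System.out.println(word + "\t" + classify(word));
        }
    }
}
```

Note that splitting on whitespace leaves punctuation attached to words in real text, so a proper tokenizer should run first. Names and other context-dependent categories still need a tagger or NER model; rules like these only cover tokens whose shape gives them away.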

Answer 4:

You have to create specific functions for extracting and detecting each data type and their errors.

This is what is commonly known as the object-oriented way. For example, to detect currency, we check for a dollar sign at the beginning or end of the token and then check whether any non-numeric characters are attached, which indicates an error.
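The currency check described above could look roughly like this. It is only a sketch under my own assumptions: the class CurrencyDetector, the three result values, and the choice to accept plain digits with an optional decimal part are all illustrative, and only the dollar sign is handled.

```java
final class CurrencyDetector {
    enum Result { MONEY, MALFORMED, NOT_MONEY }

    static Result check(String token) {
        boolean leading = token.startsWith("$");
        boolean trailing = token.endsWith("$");

        // No dollar sign at either end: not a currency candidate.
        if (!leading && !trailing) {
            return Result.NOT_MONEY;
        }
        // A sign at both ends (this includes a lone "$") is treated as an error.
        if (leading && trailing) {
            return Result.MALFORMED;
        }

        // Strip the single sign and inspect what is attached to it.
        String amount = leading
            ? token.substring(1)
            : token.substring(0, token.length() - 1);

        // Assumed amount format: digits with an optional decimal part.
        // Anything else attached to the sign counts as an error.
        return amount.matches("\\d+(\\.\\d+)?") ? Result.MONEY : Result.MALFORMED;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"$50", "19.99$", "$5o", "50"}) {
            System.out.println(token + "\t" + check(token));
        }
    }
}
```

Returning a separate MALFORMED value keeps "looks like money but is broken" distinct from "not money at all", which matches the idea of detecting each data type together with its errors.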

You should write down what you already do in your head. It's not that hard if you follow the rules. There are 3 golden rules in Robotics/AI: