Question
I want to start developing an NLP project. I don't know much about the tools available. After googling for about a month, I realized that OpenNLP could be my solution.
Unfortunately, I can't find any complete tutorial on using the API. All of them are missing some general steps. I need a tutorial that starts from the ground level. I have seen a lot of downloads on the site but don't know how to use them. Do I need to train something? Here is what I want to know:
How do I install / set up an NLP system which can:
- parse the words of an English sentence
- identify the different parts of speech
Answer 1:
You say that you need to 'parse' each sentence. You probably already know this, but just to be explicit, in NLP, the term 'parse' usually means to recover some hierarchical syntactic structure. The most common types are constituent structure (e.g., via a context-free grammar) and dependency structure.
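To make the two structure types concrete, here is a minimal hand-built sketch in plain Java. It does not use any parser library: the class and method names (`ParseStructures`, `Node`, `describeDependencies`) are made up for illustration, and the tree, tags, and arc labels for "The dog barks" are written by hand rather than produced by a model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy illustration of constituent vs. dependency structure (hypothetical names, no NLP library). */
final class ParseStructures {

    /** A node in a constituent (phrase-structure) tree; leaves carry a word. */
    static final class Node {
        final String label;        // phrase label ("S", "NP") or POS tag ("DT", "NN")
        final String word;         // non-null only for leaves
        final List<Node> children;

        Node(String label, String word, Node... children) {
            this.label = label;
            this.word = word;
            this.children = Arrays.asList(children);
        }

        /** Renders the tree in bracketed form. */
        String toBracketed() {
            if (word != null) {
                return "(" + label + " " + word + ")";
            }
            StringBuilder sb = new StringBuilder("(" + label);
            for (Node child : children) {
                sb.append(' ').append(child.toBracketed());
            }
            return sb.append(')').toString();
        }
    }

    static Node leaf(String tag, String word) {
        return new Node(tag, word);
    }

    static Node phrase(String label, Node... children) {
        return new Node(label, null, children);
    }

    /** Hand-built constituent tree for "The dog barks": words are grouped into nested phrases. */
    static Node constituentExample() {
        return phrase("S",
                phrase("NP", leaf("DT", "The"), leaf("NN", "dog")),
                phrase("VP", leaf("VBZ", "barks")));
    }

    /** Hand-built dependency structure for the same sentence: every word points at its head word. */
    static List<String> describeDependencies() {
        String[] words = {"The", "dog", "barks"};
        int[] heads = {1, 2, -1};                  // index of each word's head, -1 marks the root
        String[] labels = {"det", "nsubj", "root"};
        List<String> arcs = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            String head = heads[i] < 0 ? "ROOT" : words[heads[i]];
            arcs.add(labels[i] + "(" + head + ", " + words[i] + ")");
        }
        return arcs;
    }

    public static void main(String[] args) {
        System.out.println(constituentExample().toBracketed());
        System.out.println(describeDependencies());
    }
}
```

The constituent tree nests phrases inside phrases, while the dependency version has no phrase nodes at all, only word-to-word links. Note that the leaves of the constituent tree already carry POS tags, which is why a parser's output usually gives you tagging as well.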
If you need hierarchical structure, I'd recommend you consider just starting with a parser. Most parsers I'm aware of include POS tagging during parsing, and may provide higher accuracy tagging than finite-state POS taggers (Caveat - I'm much more familiar with constituent parsers than with dependency parsers. It's possible some or most dependency parsers would require POS tags as input).
The big downside to parsing is the time complexity. Finite-state POS taggers often run at thousands of words per second. Even greedy dependency parsers are considerably slower, and constituent parsers generally run at 1-5 sentences per second. So if you don't need hierarchical structure, you probably want to stick with a finite-state POS tagger for efficiency.
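As a rough back-of-the-envelope sketch of what those speeds mean in practice, the snippet below estimates wall-clock time for a corpus. The corpus size (100,000 sentences), the average sentence length (20 words), and the tagger speed of 5,000 words per second are assumptions picked only for illustration; the 1-5 sentences per second range for constituent parsers is the one quoted above. `ThroughputEstimate` and its methods are hypothetical names.

```java
/** Rough processing-time estimates; all input figures are illustrative assumptions. */
final class ThroughputEstimate {

    /** Seconds for a tagger whose speed is given in words per second. */
    static double taggerSeconds(long sentences, double wordsPerSentence, double wordsPerSecond) {
        return sentences * wordsPerSentence / wordsPerSecond;
    }

    /** Seconds for a parser whose speed is given in sentences per second. */
    static double parserSeconds(long sentences, double sentencesPerSecond) {
        return sentences / sentencesPerSecond;
    }

    public static void main(String[] args) {
        long sentences = 100_000;          // assumed corpus size
        double wordsPerSentence = 20.0;    // assumed average sentence length

        double tagger = taggerSeconds(sentences, wordsPerSentence, 5_000.0);
        double parserFast = parserSeconds(sentences, 5.0);
        double parserSlow = parserSeconds(sentences, 1.0);

        System.out.printf("tagger: %.0f s%n", tagger);
        System.out.printf("parser: %.1f to %.1f hours%n",
                parserFast / 3600.0, parserSlow / 3600.0);
    }
}
```

Under these assumptions the tagger needs 400 seconds, while the parser needs between 20,000 and 100,000 seconds, i.e. several hours to more than a day. Your own numbers will differ, but the gap of a couple of orders of magnitude is the point.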
If you do decide you need parse structure, a few recommendations:
- I think the Stanford parser suggested by @aab includes both a constituent parser and a dependency parser.
- The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ) is a well-known PCFG constituent parser that achieves state-of-the-art accuracy (equal or superior to the Stanford parser, I believe) and is reasonably efficient (~3-5 sentences per second).
- The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, and improves efficiency to around 15-20 sentences per second. Full disclosure: I'm one of the primary researchers working on this parser.
Warning: both of these parsers are research code, with all the problems that engenders. But I'd love to see people actually using BUBS, so if it's of use to you, give it a try and contact me with problems, comments, suggestions, etc.
And a couple of Wikipedia references for background, if needed:
Context-free grammars: http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
Dependency grammars: http://en.wikipedia.org/wiki/Dependency_grammar
Answer 2:
Generally you'd do these two tasks in the other order:
- Do part-of-speech tagging
- Run a parser using the POS tags as input
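The two steps above can be sketched as a tiny pipeline in plain Java. This is only a toy stand-in, not how OpenNLP or the Stanford tools work internally: the "tagger" is a dictionary lookup over a seven-word made-up lexicon (unknown words default to `NN`), and the "parser" is a shallow bracketer that groups runs of `DT`/`JJ`/`NN` tags into noun phrases. `TagThenParse` and its methods are hypothetical names. What it does show is the data flow: the second step looks only at the tags produced by the first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy tag-then-parse pipeline (hypothetical names; a real system would use trained models). */
final class TagThenParse {

    // Tiny hand-written lexicon standing in for a trained POS model.
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("the", "DT");
        LEXICON.put("a", "DT");
        LEXICON.put("quick", "JJ");
        LEXICON.put("dog", "NN");
        LEXICON.put("cat", "NN");
        LEXICON.put("chases", "VBZ");
        LEXICON.put("barks", "VBZ");
    }

    /** Step 1: assign one tag per token by lookup; unknown words default to NN. */
    static String[] tag(String[] tokens) {
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            tags[i] = LEXICON.getOrDefault(tokens[i].toLowerCase(Locale.ROOT), "NN");
        }
        return tags;
    }

    private static boolean isNominal(String tag) {
        return tag.equals("DT") || tag.equals("JJ") || tag.equals("NN");
    }

    /** Step 2: bracket noun phrases by looking only at the tags from step 1. */
    static String bracket(String[] tokens, String[] tags) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (isNominal(tags[i])) {
                // Collect a maximal run of determiner/adjective/noun tokens into one NP.
                StringBuilder np = new StringBuilder("[NP");
                while (i < tokens.length && isNominal(tags[i])) {
                    np.append(' ').append(tokens[i]).append('/').append(tags[i]);
                    i++;
                }
                parts.add(np.append(']').toString());
            } else {
                parts.add(tokens[i] + "/" + tags[i]);
                i++;
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String[] tokens = "The quick dog chases a cat".split("\\s+");
        String[] tags = tag(tokens);
        System.out.println(bracket(tokens, tags));
    }
}
```

With a real toolkit you would swap both steps for calls into the library, loading the pre-trained model files that ship with the download instead of writing a lexicon by hand.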
OpenNLP's documentation isn't that thorough, and some of it has become hard to find since the project moved to Apache. Some (potentially slightly out-of-date) tutorials are available in the old SourceForge wiki.
You might want to take a look at the Stanford NLP tools, in particular the Stanford POS Tagger and the Stanford Parser. Both have downloads that include pre-trained model files. They also include demo files in the top-level directory that show how to get started with the API, and short shell scripts that show how to use the tools from the command line.
LingPipe might be another good toolkit to check out. A quick search here will lead you to a number of similar questions with links to other alternatives, too!
Answer 3:
See Illinois-Curator: http://cogcomp.cs.illinois.edu/page/software_view/Curator
Demo: http://cogcomp.cs.illinois.edu/curator/demo/
It gives you almost everything in one place.
Answer 4:
The most popular are:
- GATE: easy to use and fairly quick to start with
- UIMA: steeper learning curve, but more efficient and more generic
Source: https://stackoverflow.com/questions/5833030/simple-natural-language-processing-startup-for-java