Question
spaCy tags each of the Tokens in a Document with a part of speech (in two different formats: one stored in the pos and pos_ properties of the Token, the other stored in the tag and tag_ properties) and a syntactic dependency to its .head token (stored in the dep and dep_ properties).
Some of these tags are self-explanatory, even to somebody like me without a linguistics background:
>>> import spacy
>>> en_nlp = spacy.load('en')
>>> document = en_nlp("I shot a man in Reno just to watch him die.")
>>> document[1]
shot
>>> document[1].pos_
'VERB'
Others... are not:
>>> document[1].tag_
'VBD'
>>> document[2].pos_
'DET'
>>> document[3].dep_
'dobj'
Worse, the official docs don't even contain a list of the possible tags for most of these properties, nor the meanings of any of them. They sometimes mention which tagging standard they use, but these claims aren't currently entirely accurate, and on top of that the standards themselves are tricky to track down.
What are the possible values of the tag_, pos_, and dep_ properties, and what do they mean?
Answer 1:
Part of speech tokens
The spaCy docs currently claim:
The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS Tag set.
More precisely, the .tag_ property exposes Treebank tags, and the pos_ property exposes tags based upon the Google Universal POS Tags (although spaCy extends the list).
spaCy's docs seem to recommend that users who just want to dumbly use its results, rather than training their own models, should ignore the tag_ attribute and use only the pos_ one, stating that the tag_ attributes...
are primarily designed to be good features for subsequent models, particularly the syntactic parser. They are language and treebank dependent.
That is to say, if spaCy releases an improved model trained on a new treebank, the tag_ attribute may have different values from those it had before. This clearly makes it unhelpful for users who want a consistent API across version upgrades. However, since the current tags are a variant of Penn Treebank, they are likely to mostly intersect with the set described in any Penn Treebank POS tag documentation, like this: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
The more useful pos_ tags are
A coarse-grained, less detailed tag that represents the word-class of the token
based upon the Google Universal POS Tag set. For English, a list of the tags in the Universal POS Tag set can be found here, complete with links to their definitions: http://universaldependencies.org/en/pos/index.html
The list is as follows:
- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary verb
- CONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other
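The list above can be captured as a plain lookup table for turning a coarse pos_ value into a readable description. This is just a convenience sketch mirroring the list in this answer, not part of spaCy's API:

```python
# Human-readable descriptions for the Universal POS tags listed above.
# This table mirrors the list in the answer; it is not part of spaCy itself.
UNIVERSAL_POS = {
    'ADJ': 'adjective',
    'ADP': 'adposition',
    'ADV': 'adverb',
    'AUX': 'auxiliary verb',
    'CONJ': 'coordinating conjunction',
    'DET': 'determiner',
    'INTJ': 'interjection',
    'NOUN': 'noun',
    'NUM': 'numeral',
    'PART': 'particle',
    'PRON': 'pronoun',
    'PROPN': 'proper noun',
    'PUNCT': 'punctuation',
    'SCONJ': 'subordinating conjunction',
    'SYM': 'symbol',
    'VERB': 'verb',
    'X': 'other',
}

def describe_pos(tag):
    """Return a readable description for a coarse pos_ value."""
    return UNIVERSAL_POS.get(tag, 'unknown tag: %s' % tag)

print(describe_pos('DET'))   # determiner
```

With such a table you can annotate a parsed sentence's pos_ values without hunting through the Universal Dependencies site each time.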
However, we can see from spaCy's parts of speech module that it extends this schema with three additional POS constants, EOL, NO_TAG and SPACE, that are not part of the Universal POS Tag set. Of these:
- From searching the source code, I don't think EOL gets used at all, although I'm not sure.
- NO_TAG is an error code. If you try parsing a sentence with a model you don't have installed, all Tokens get assigned this POS. For instance, I don't have spaCy's German model installed, and I see this on my local if I try to use it:
>>> import spacy
>>> de_nlp = spacy.load('de')
>>> document = de_nlp('Ich habe meine Lederhosen verloren')
>>> document[0]
Ich
>>> document[0].pos_
''
>>> document[0].pos
0
>>> document[0].pos == spacy.parts_of_speech.NO_TAG
True
>>> document[1].pos == spacy.parts_of_speech.NO_TAG
True
>>> document[2].pos == spacy.parts_of_speech.NO_TAG
True
- SPACE is used for any spacing besides single normal ASCII spaces (which don't get their own token):
>>> document = en_nlp("This\nsentence\thas some weird spaces in\n\n\n\n\t\t it.")
>>> for token in document:
...     print('%r (%s)' % (str(token), token.pos_))
...
'This' (DET)
'\n' (SPACE)
'sentence' (NOUN)
'\t' (SPACE)
'has' (VERB)
' ' (SPACE)
'some' (DET)
'weird' (ADJ)
'spaces' (NOUN)
'in' (ADP)
'\n\n\n\n\t\t ' (SPACE)
'it' (PRON)
'.' (PUNCT)
Dependency tokens
As noted in the docs, the dependency tag scheme is based upon the ClearNLP project; the meanings of the tags (as of version 3.2.0 of ClearNLP, released in 2015, which remains the latest release and seems to be what spaCy uses) can be found at https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md. That document lists these tokens:
- ACL: Clausal modifier of noun
- ACOMP: Adjectival complement
- ADVCL: Adverbial clause modifier
- ADVMOD: Adverbial modifier
- AGENT: Agent
- AMOD: Adjectival modifier
- APPOS: Appositional modifier
- ATTR: Attribute
- AUX: Auxiliary
- AUXPASS: Auxiliary (passive)
- CASE: Case marker
- CC: Coordinating conjunction
- CCOMP: Clausal complement
- COMPOUND: Compound modifier
- CONJ: Conjunct
- CSUBJ: Clausal subject
- CSUBJPASS: Clausal subject (passive)
- DATIVE: Dative
- DEP: Unclassified dependent
- DET: Determiner
- DOBJ: Direct object
- EXPL: Expletive
- INTJ: Interjection
- MARK: Marker
- META: Meta modifier
- NEG: Negation modifier
- NOUNMOD: Modifier of nominal
- NPMOD: Noun phrase as adverbial modifier
- NSUBJ: Nominal subject
- NSUBJPASS: Nominal subject (passive)
- NUMMOD: Number modifier
- OPRD: Object predicate
- PARATAXIS: Parataxis
- PCOMP: Complement of preposition
- POBJ: Object of preposition
- POSS: Possession modifier
- PRECONJ: Pre-correlative conjunction
- PREDET: Pre-determiner
- PREP: Prepositional modifier
- PRT: Particle
- PUNCT: Punctuation
- QUANTMOD: Modifier of quantifier
- RELCL: Relative clause modifier
- ROOT: Root
- XCOMP: Open clausal complement
The linked ClearNLP documentation also contains brief descriptions of what each of the terms above means.
In addition to the above documentation, if you'd like to see some examples of these dependencies in real sentences, you may be interested in the 2012 work of Jinho D. Choi: either his Optimization of Natural Language Processing Components for Robustness and Scalability or his Guidelines for the CLEAR Style Constituent to Dependency Conversion (which seems to just be a subsection of the former paper). Both list all the CLEAR dependency labels that existed in 2012 along with definitions and example sentences. (Unfortunately, the set of CLEAR dependency labels has changed a little since 2012, so some of the modern labels are not listed or exemplified in Choi's work - but it remains a useful resource despite being slightly outdated.)
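To see how these labels hang together, here is a spaCy-free sketch: each token carries the index of its head plus a dependency label, and the ROOT token is its own head. The parse below, for a fragment of the question's example sentence, is hand-written for illustration; a real spaCy parse may differ:

```python
# (word, head_index, dep_label) rows; a hand-written illustrative parse,
# not necessarily what spaCy would actually produce.
PARSE = [
    ('I',    1, 'nsubj'),  # nominal subject of "shot"
    ('shot', 1, 'ROOT'),   # the root token is its own head
    ('a',    3, 'det'),    # determiner of "man"
    ('man',  1, 'dobj'),   # direct object of "shot"
]

def children_of(parse, head_index):
    """Indices of tokens whose head is head_index (excluding the root's self-link)."""
    return [i for i, (_, head, _) in enumerate(parse)
            if head == head_index and i != head_index]

root = next(i for i, (_, _, dep) in enumerate(PARSE) if dep == 'ROOT')
print(PARSE[root][0])                                    # shot
print([PARSE[i][0] for i in children_of(PARSE, root)])   # ['I', 'man']
```

In spaCy itself the same traversal is exposed directly via token.head and token.children; the point here is just that the dep_ labels name the edges of this head-link tree.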
Answer 2:
Just a quick tip on getting the detailed meaning of these short forms. You can use the explain method, like so:
spacy.explain('pobj')
which will give you output like:
'object of preposition'
Answer 3:
The official documentation now provides much more details for all those annotations at https://spacy.io/api/annotation (and the list of other attributes for tokens can be found at https://spacy.io/api/token).
As the documentation shows, their part-of-speech (POS) and dependency tags have both Universal and language-specific variations, and the explain() function is a very useful shortcut for getting a better description of a tag's meaning without consulting the documentation, e.g.
spacy.explain("VBD")
which gives "verb, past tense".
Answer 4:
At present, dependency parsing and tagging in SpaCy appear to be implemented only at the word level, and not at the phrase (other than noun phrase) or clause level. This means SpaCy can be used to identify things like nouns (NN, NNS), adjectives (JJ, JJR, JJS), and verbs (VB, VBD, VBG, etc.), but not adjective phrases (ADJP), adverbial phrases (ADVP), or questions (SBARQ, SQ).
For illustration, when you use SpaCy to parse the sentence "Which way is the bus going?", you get a flat, word-level dependency tree (shown as an image in the original post).
By contrast, if you use the Stanford parser you get a much more deeply structured syntax tree (also shown as an image in the original post).
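The word-level tree SpaCy does produce can still be rendered from the head links alone. A minimal spaCy-free sketch, using a hand-written parse of the sentence above (indices and attachments are illustrative, not actual SpaCy output):

```python
# (word, head_index) pairs for "Which way is the bus going?";
# this parse is hand-written for illustration only.
TOKENS = [
    ('Which', 1), ('way', 5), ('is', 5), ('the', 4),
    ('bus', 5), ('going', 5), ('?', 5),
]

def print_tree(tokens, index, depth=0):
    """Recursively print a token and its dependents, indented by depth."""
    print('  ' * depth + tokens[index][0])
    for i, (_, head) in enumerate(tokens):
        if head == index and i != index:
            print_tree(tokens, i, depth + 1)

# The root is the token that is its own head.
root = next(i for i, (_, head) in enumerate(TOKENS) if head == i)
print_tree(TOKENS, root)
```

Because every non-root token hangs directly off another word, the output is a tree of words only; there are no ADJP/SBARQ-style phrase or clause nodes to print, which is the limitation this answer describes.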
Source: https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean