Is it possible to find uncertainties of spaCy POS tags?

Submitted by 眉间皱痕 on 2021-01-05 09:01:16

Question


I am trying to build a non-English spell checker that relies on classification of sentences by spaCy, which allows my algorithm to then use the POS tags and the grammatical dependencies of the individual tokens to determine incorrect spelling (in my case more specifically: incorrect splits in Dutch compound words).

However, spaCy appears to classify sentences incorrectly if they contain grammatical errors, for example classifying a noun as a verb, even though the classified word doesn't even look like a verb.

Because of this I'm wondering if it is possible to obtain the uncertainties of spaCy's classification, to make it possible to tell if spaCy is struggling with a sentence. After all, if spaCy is struggling with a classification, that would provide my spell checker with more confidence that the sentence contains errors.

Is there any way to know whether spaCy thinks a sentence is grammatically correct (without having to specify patterns of all correct sentence structures in my language), or to obtain classification certainties?


Edit, based on suggestions in the comments by @Sergey Bushmanov:

I found https://spacy.io/api/tagger#predict, which might be useful to get the probabilities for the tags. However, I'm not really sure what I am looking at, and I'm not really following what the docs mean about the output. I'm using the following code:

import spacy

nlp = spacy.load('en_core_web_sm')
text = "This is an example sentence for the Spacy tagger."

# Run the pipeline with the tagger disabled, then call the tagger's
# predict() directly to get its raw output for the tokenized doc:
doc = nlp(text, disable=['tagger'])
scores, tensors = nlp.tagger.predict([doc])

print(scores)
probs = tensors[0]
for p in probs:
    print(p, max(p), p.tolist().index(max(p)))

This prints what I am guessing are integer representations of the predicted tags (for instance, 'integer' and 'representation' get the same score, which makes sense if both are tagged as nouns), followed by an array of 96 floats for every word in the sentence. The code also prints the highest value in each array and its position, but for most words several entries in the p array have similar values. Now I'm wondering what these arrays mean, and how to extract a probability for each classification from them.
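One way to interpret those 96 floats per token is as unnormalized scores, one per tag in the tagger's label set, which a softmax can turn into a probability distribution. The following is a minimal sketch of that conversion; the spaCy-specific parts (`tensors[0]` as the per-token score matrix, `nlp.tagger.labels` as the tag names) are assumptions about the v2 API shown in the code above, so the demo below runs the softmax on a small dummy score row instead of a real model output:

```python
import numpy as np

def softmax(scores):
    """Normalize a 1-D array of raw tag scores into probabilities."""
    exp = np.exp(scores - np.max(scores))  # shift by the max for numerical stability
    return exp / exp.sum()

# Dummy row standing in for one token's 96 raw scores (i.e. tensors[0][i]):
row = np.array([2.0, 0.5, -1.0, 3.0])
probs = softmax(row)
best = int(np.argmax(probs))
print(probs, best, probs[best])

# With a loaded pipeline, the per-token loop would then look roughly like
# (assuming nlp.tagger.labels exposes the tag names in score order):
#   for row in tensors[0]:
#       probs = softmax(row)
#       i = int(np.argmax(probs))
#       print(nlp.tagger.labels[i], probs[i])
```

Under this reading, a token where the probability mass is spread across several tags (several similar entries in `p`, as observed above) would signal exactly the kind of uncertainty the spell checker could exploit.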


The question is: How can I interpret this output to get the specific probabilities for specific tags found by spaCy's tagger? Or another way to put this same question is: What does the output generated by the above code mean?

Source: https://stackoverflow.com/questions/65218606/is-it-possible-to-find-uncertainties-of-spacy-pos-tags
