How to find invalid Link Grammar tokens?

Submitted by 爱⌒轻易说出口 on 2019-12-13 02:25:42

Question


I'd like to use the Link Grammar Python3 bindings for a simple grammar checker. While the linkage API is relatively well documented, there doesn't seem to be a way to access all the tokens that prevent linkages.

This is what I have so far:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from linkgrammar import Sentence, ParseOptions, Dictionary, __version__
print('Link Grammar Version:', __version__)

# Load the dictionary and the options once instead of once per sentence.
lgdict = Dictionary()
po = ParseOptions()

for sentence in ['This is a valid sample sentence.', 'I Can Has Cheezburger?']:
    sent = Sentence(sentence, lgdict, po)
    linkages = sent.parse()
    # A sentence counts as valid if at least one linkage was found.
    if len(linkages) > 0:
        print('Valid:', sentence)
    else:
        print('Invalid:', sentence)

(I used link-grammar-5.4.3 for my tests.)

When I analyzed the invalid sample sentence using the Link Parser command line tool, I got the following output:

linkparser> I Can Has Cheezburger?
No complete linkages found.
Found 1 linkage (1 had no P.P. violations) at null count 1
    Unique linkage, cost vector = (UNUSED=1 DIS= 0.10 LEN=7)

    +------------------Xp------------------+
    +------------->Wa--------------+       |
    |            +---G--+-----G----+       |
    |            |      |          |       |
LEFT-WALL [I] Can[!] Has[!] Cheezburger[!] ?

How do I get all potentially invalid tokens, i.e. those marked with [!] or [?], with Python3?


Answer 1:


See how it is done in bindings/python-examples/sentence-check.py. It is better to look at the latest version in the repository, as this demo program had a bug in 5.4.3.

Specifically, the following extracts the word list:

words = list(linkage.words())
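
For context, here is a minimal sketch of where linkage could come from. It assumes that ParseOptions accepts the min_null_count and max_null_count keyword arguments, as sentence-check.py uses them, and that without a non-zero max_null_count no linkage is returned for a sentence like the second one in the question. The helper name words_of_first_linkage is hypothetical, and the import is kept inside the function only so that the snippet loads where the bindings are not installed.

```python
def words_of_first_linkage(text, lang='en'):
    """Return the word list of the first linkage found for text, or [].

    Sketch only: assumes the linkgrammar Python bindings are installed and
    that ParseOptions takes min_null_count / max_null_count (assumption
    based on sentence-check.py).
    """
    from linkgrammar import Sentence, ParseOptions, Dictionary

    # Allow null-linked words so that an ungrammatical sentence still
    # yields a linkage whose words can be inspected.
    po = ParseOptions(min_null_count=0, max_null_count=999)
    sent = Sentence(text, Dictionary(lang), po)

    # Take the first linkage, if there is one.
    linkage = next(iter(sent.parse()), None)
    if linkage is None:
        return []
    return list(linkage.words())


if __name__ == '__main__':
    print(words_of_first_linkage('I Can Has Cheezburger?'))
```

The returned strings carry the markers described in the legend that follows, so they can be inspected with ordinary string handling.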

Unlinked words are wrapped in []. Words with a bracketed marker appended, such as [!] or [?], are guessed ones. For example, [!] means that the word was classified by a regex (defined in the file 4.0.regex) and that this classification was then looked up in the dictionary. If you set the parse option display_morphology to True, the name of the classifying regex appears after the !.

Here is the full legend of the word output format:

 [word]            Null-linked word
 word[!]           word classified by a regex
 word[!REGEX_NAME] word classified by REGEX_NAME (turned on by morphology=1)
 word[~]           word generated by a spell guess (unknown original word)
 word[&]           word run-on separated by a spell guess
 word[?]           word is unknown (looked up in the dict as UNKNOWN-WORD)
 word.POS          word found in the dictionary as word.POS
 word.#CORRECTION  word is probably a typo - got linked as CORRECTION

For dictionaries that support morphology (turned on by morphology=1):
 word=             A prefix morpheme
 =word             A suffix morpheme
 word.=            A stem

It may be useful to match the output words to the original sentence words, especially when there are spell corrections or when morphology is turned on. The demo program sentence-check.py does that when you call it with -p (see the code under "if arg.position:").

In your demo sentence "I Can Has Cheezburger?", only the word I has no linkage. The other words were classified as capitalized-words and linked as proper nouns (the G link type).

You can find more information on the link types in summarize-links.



Source: https://stackoverflow.com/questions/49335828/how-to-find-invalid-link-grammar-tokens
