ANTLR on a noisy data stream

对着背影说爱祢 提交于 2019-12-20 03:35:13

问题


I'm very new in the ANTLR world and I'm trying to figure out how can I use this parsing tool to interpret a set of "noisy" string. What I would like to achieve is the following.

let's take for example this phrase : It's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV

What I would like to extract is CAT, SLEEPING and SOFA and have a grammar that match easily the following pattern : SUBJECT - VERB - INDIRECT OBJECT... where I could define

VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';

etc.. I don't want to ends up with a permanent "NoViableException" as I can't describe all the possibilities around the language structure. I just want to tear apart useless words and just keep the one that are interesting.

It's more like if I had a tokeniser and asked the parser "Ok, read the stream until you find a SUBJECT, then ignore the rest until you find a VERB, etc.."

I need to extract an organized structure in an un-organized set... For example, I would like to be able to interpret (I'm not judging the pertinence of this utterly basic and incorrect view of 'english grammar')
SUBJECT - VERB - INDIRECT OBJECT
INDIRECT OBJECT - SUBJECT - VERB

so I will parse sentences like

It's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV

or

It's 10PM and, on the SOFA in front of the TV, the Lazy CAT is currently SLEEPING heavily


回答1:


You could create only a couple of lexer rules (the ones you posted, for example), and as a last lexer rule, you could match any character and skip() it:

VERB            : 'SLEEPING' | 'WALKING';
SUBJECT         : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
ANY             : . {skip();};

The order is important here: the lexer tries to match tokens from top to bottom, so if it can't match any of the tokens VERB, SUBJECT or INDIRECT_OBJECT, it "falls through" to the ANY rule and skips this token. You can then use these parser rules to filter your input stream:

parse
  :  sentenceParts+ EOF
  ;

sentenceParts
  :  SUBJECT VERB INDIRECT_OBJECT
  ;  

which will parse the input text:

It's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV. The DOG is WALKING on the SOFA.

as follows:



来源:https://stackoverflow.com/questions/4310699/antlr-on-a-noisy-data-stream

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!