Parsing latex-like language in Java

Submitted by 我是研究僧i on 2019-12-18 18:28:06

Question


I'm trying to write a parser in Java for a simple language similar to Latex, i.e. it contains lots of unstructured text with a couple of \commands[with]{some}{parameters} in between. Escape sequences like \\ also have to be taken into account.

I've tried to generate a parser for that with JavaCC, but it looks as if compiler-compilers like JavaCC were only suitable for highly structured code (typical for general-purpose programming languages), not for messy Latex-like markup. So far, it seems I have to go low level and write my own finite state machine.

So my question is, what's the easiest way to parse input that is mostly unstructured, with only a few Latex-like commands in between?

EDIT: Going low level with a finite state machine is difficult because the Latex commands can be nested, e.g. \cmd1{\cmd2{\cmd3{...}}}


Answer 1:


You can define a grammar to accept the Latex input, using just characters as tokens in the worst case. JavaCC should be just fine for this purpose.

The good thing about a grammar and a parser generator is that it can parse things that FSAs have trouble with, especially nested structures.

A first cut at your grammar could be (I'm not sure this is valid JavaCC, but it is reasonable EBNF):

 Latex = item* ;
 item = command | rawtext ;
 command = commandname arguments ;
 commandname = '\' letter ( letter | digit )* ;  -- might pick this up as lexeme
 letter = 'a' | 'b' | ... | 'z' ;
 digit = '0' | ... | '9' ;
 arguments = epsilon | '{' item* '}' ;
 rawtext = ( letter | digit | whitespace | punctuationminusbackslash )+ ; -- might pick this up as lexeme
 whitespace = ' ' | '\t' | '\n' | '\r' ;
 punctuationminusbackslash = '!' | ... | '^' ;
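
If a parser generator feels like too much machinery, the same grammar can also be followed by hand as a small recursive-descent parser. The following is only a rough sketch under a few assumptions of mine, not something from the question: the class name LatexLikeParser and the node types Text and Command are made up for illustration, a command may take several consecutive {...} groups, upper-case ASCII letters are allowed in command names, and a backslash followed by a non-letter is treated as an escape for that single character (so \\ yields a literal backslash). Optional [..] parameters are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recursive-descent parser following the EBNF sketch above. */
final class LatexLikeParser {

    /** A parsed item: either raw text or a command with arguments. */
    interface Node {}

    static final class Text implements Node {
        final String value;
        Text(String value) { this.value = value; }
        @Override public String toString() { return "\"" + value + "\""; }
    }

    static final class Command implements Node {
        final String name;
        final List<List<Node>> args = new ArrayList<>();
        Command(String name) { this.name = name; }
        @Override public String toString() { return "\\" + name + args; }
    }

    private final String src;
    private int pos = 0;

    private LatexLikeParser(String src) { this.src = src; }

    /** Latex = item* ; rejects a '}' that has no matching '{'. */
    static List<Node> parse(String input) {
        LatexLikeParser p = new LatexLikeParser(input);
        List<Node> items = p.items();
        if (p.pos < p.src.length()) {
            throw new IllegalArgumentException("unmatched '}' at " + p.pos);
        }
        return items;
    }

    /** item* : reads until end of input or the next unescaped '}'. */
    private List<Node> items() {
        List<Node> out = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        while (pos < src.length() && src.charAt(pos) != '}') {
            char c = src.charAt(pos);
            if (c == '{') {
                throw new IllegalArgumentException("unexpected '{' at " + pos);
            } else if (c != '\\') {
                text.append(c);                      // plain raw text
                pos++;
            } else if (pos + 1 >= src.length()) {
                throw new IllegalArgumentException("dangling backslash at " + pos);
            } else if (isLetter(src.charAt(pos + 1))) {
                flush(text, out);
                out.add(command());                  // recursion handles nesting
            } else {
                text.append(src.charAt(pos + 1));    // escape such as \\ or \{
                pos += 2;
            }
        }
        flush(text, out);
        return out;
    }

    /** command = '\' letter ( letter | digit )* followed by zero or more '{' item* '}'. */
    private Command command() {
        int start = ++pos;                           // skip the backslash
        while (pos < src.length()
                && (isLetter(src.charAt(pos)) || isDigit(src.charAt(pos)))) {
            pos++;
        }
        Command cmd = new Command(src.substring(start, pos));
        while (pos < src.length() && src.charAt(pos) == '{') {
            pos++;                                   // consume '{'
            cmd.args.add(items());
            if (pos >= src.length()) {
                throw new IllegalArgumentException("missing '}' for \\" + cmd.name);
            }
            pos++;                                   // consume '}'
        }
        return cmd;
    }

    private static void flush(StringBuilder text, List<Node> out) {
        if (text.length() > 0) {
            out.add(new Text(text.toString()));
            text.setLength(0);
        }
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

Calling LatexLikeParser.parse on the text a \cmd1{\cmd2{x}}{y} \\ z should give three top-level items: a Text node, a Command named cmd1 with two arguments (the first containing the nested cmd2), and a trailing Text node that contains a single literal backslash. The nesting falls out of items() and command() calling each other, which is the part a flat state machine cannot express without an explicit stack.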


Source: https://stackoverflow.com/questions/3495019/parsing-latex-like-language-in-java
