ANTLR4 Tokenizing a Huge Set of Keywords

Submitted by 只愿长相守 on 2019-12-12 00:28:21

Question


I want to embed some known identifier names into my grammar. For example, the class names of my project are known, and I want to tell the lexer which identifiers are known keywords that actually belong to the class-name token. But since I have a long list of class names (hundreds of names), I don't want to create a class-name lexer rule that lists every known class name, because that would make my grammar file too large.

Is it possible to place my keywords in a separate file? One possibility I am considering is to place the keywords in a Java class that the generated lexer class subclasses. In that case, my lexer's semantic predicate can simply call a method in the custom lexer superclass to check whether the input token matches one of the names in my long list, and the list itself can live in that superclass's source code.

However, the ANTLR4 book says that in a combined grammar the 'superClass' grammar option only sets the parser's superclass. How can I set my lexer's superclass if I still want to use a combined grammar? Or is there a better way to put my long list of keywords into a separate "keyword file"?


Answer 1:


If you want each keyword to have its own token type, you can do the following:

  1. Add a tokens{} block to the grammar to create tokens for each keyword. This ensures unique token types are created for each of your keywords.

    tokens {
        Keyword1,
        Keyword2,
        ...
    }
    
  2. Create a separate class MyLanguageKeywords similar to the following:

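    import java.util.HashMap;
    import java.util.Map;

    public final class MyLanguageKeywords {
        // Maps each keyword's text to the token type declared in the tokens{} block.
        private static final Map<String, Integer> KEYWORDS =
            new HashMap<String, Integer>();
        static {
            KEYWORDS.put("keyword1", MyLanguageParser.Keyword1);
            KEYWORDS.put("keyword2", MyLanguageParser.Keyword2);
            ...
        }

        // Returns the keyword's token type, or Identifier if the text is not a keyword.
        public static int getKeywordOrIdentifierType(String text) {
            Integer type = KEYWORDS.get(text);
            if (type == null) {
                return MyLanguageParser.Identifier;
            }

            return type;
        }
    }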
    
  3. Add an Identifier lexer rule to your grammar that handles keywords and identifiers.

    Identifier
        :   [a-zA-Z_] [a-zA-Z0-9_]*
            {_type = MyLanguageKeywords.getKeywordOrIdentifierType(getText());}
        ;
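
The question's original scenario (hundreds of class names that should all map to one class-name token, with the list kept outside the grammar) can be handled with a variant of the same idea. The following is a minimal sketch and not part of the answer above: `ClassNameTable`, `fromString`, `typeOf`, and the constants `CLASS_NAME` and `IDENTIFIER` are hypothetical names, and the two constants merely stand in for the token types ANTLR would generate. It assumes a keyword file with one name per line, where blank lines and lines starting with `#` are ignored.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: all names and token type values here are illustrative only.
final class ClassNameTable {
    // Stand-ins for the constants ANTLR would generate for the two token types.
    static final int CLASS_NAME = 1;
    static final int IDENTIFIER = 2;

    private final Set<String> names = new HashSet<>();

    // Reads one name per line; blank lines and lines starting with '#' are skipped.
    ClassNameTable(BufferedReader reader) {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty() && !name.startsWith("#")) {
                    names.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience factory for building a table from in-memory text.
    static ClassNameTable fromString(String text) {
        return new ClassNameTable(new BufferedReader(new StringReader(text)));
    }

    // Returns CLASS_NAME for a known name, otherwise IDENTIFIER.
    int typeOf(String text) {
        return names.contains(text) ? CLASS_NAME : IDENTIFIER;
    }
}
```

In a real lexer, the reader could come from something like `Files.newBufferedReader(path)`, and the action in the Identifier rule would set the token type to the result of `typeOf(getText())`. How the table instance is made reachable from the lexer is left open here.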
    


Source: https://stackoverflow.com/questions/16419707/antlr4-tokenizing-a-huge-set-of-keywords
