lexer | 易学教程

ANTLR: Unicode Character Scanning


Question: I can't get a Unicode character to print correctly. Here is my grammar:

    options {
        k=1;
        filter=true;
        // Allow any char but \uFFFF (16 bit -1)
        charVocabulary='\u0000'..'\uFFFE';
    }

    ANYCHAR
        : '$'
        | '_' { System.out.println("Found underscore: "+getText()); }
        | 'a'..'z' { System.out.println("Found alpha: "+getText()); }
        | '\u0080'..'\ufffe' { System.out.println("Found unicode: "+getText()); }
        ;

Code snippet of the main method invoking the lexer:

    public static void main(String[] args) { SimpleLexer

Different lexer rules in different state


Question: I've been working on a parser for a template language embedded in HTML (FreeMarker). Here is a piece of an example:

    ${abc}
    <html>
    <head>
      <title>Welcome!</title>
    </head>
    <body>
      <h1>
        Welcome ${user}<#if user == "Big Joe">, our beloved leader</#if>!
      </h1>
      <p>Our latest product:
      <a href="${latestProduct}">${latestProduct}</a>!
    </body>
    </html>

The template language sits between specific tags, e.g. '${' '}' and '<#' '>'. The raw text in between can all be treated as the same token (RAW). The key
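The idea of lexing raw text and template code differently can be sketched as a small two-mode tokenizer. This is only an illustration in Python, not FreeMarker's or ANTLR's actual lexer: apart from RAW, every token name here is hypothetical, treating '</#' as an opener is an assumption added for the closing tag in the example, and the code between the delimiters is returned as a single CODE token instead of being tokenized further.

```python
import re

# Hypothetical token kinds; only RAW is named in the question.
RAW = "RAW"
INTERP_OPEN = "INTERP_OPEN"        # '${'
DIRECTIVE_OPEN = "DIRECTIVE_OPEN"  # '<#' or '</#' (the latter is an assumption)
CODE = "CODE"                      # untokenized text between the delimiters
CLOSE = "CLOSE"                    # '}' or '>'

OPENERS = re.compile(r"\$\{|</?#")


def tokenize(text):
    """Return (kind, lexeme) pairs, switching between raw and template mode."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = OPENERS.search(text, pos)
        if m is None:
            # No further template tags: the rest of the input is raw text.
            tokens.append((RAW, text[pos:]))
            break
        if m.start() > pos:
            tokens.append((RAW, text[pos:m.start()]))
        opener = m.group()
        closer = "}" if opener == "${" else ">"
        # Simplification: the first closer ends the tag, so a '>' or '}'
        # inside a string literal would be misread.
        end = text.find(closer, m.end())
        if end == -1:
            raise SyntaxError(f"unterminated {opener!r} at offset {m.start()}")
        kind = INTERP_OPEN if opener == "${" else DIRECTIVE_OPEN
        tokens.append((kind, opener))
        tokens.append((CODE, text[m.end():end]))
        tokens.append((CLOSE, closer))
        pos = end + 1
    return tokens
```

Calling `tokenize('Welcome ${user}!')` yields a RAW token, the three tokens of the interpolation, and a trailing RAW token for the exclamation mark. A grammar-based lexer would typically express the same switch with lexer states or modes rather than a hand-written loop.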

Where can I learn the basics of writing a lexer?


Question: I want to learn how to write a lexer. My university course had an assignment where we had to write a parser (and a lexer to go along with it), but it was given to us with no instruction or feedback (beyond the mark), so I didn't really learn much from it. After searching for this topic, I can only find fairly advanced write-ups that focus on areas which I feel are a few steps ahead of where I am. I want a discussion on the basics of writing a lexer for a very simple language which I can

error[E0507]: Cannot move out of borrowed content


Question: I'm trying to make a lexer in Rust while being relatively new to it but with a background in C/C++. I'm having problems with how Rust allocates memory in the following code, which generates the error "Cannot move out of borrowed content". I've read cargo --explain E0507, which details possible solutions, but I'm struggling to grasp the underlying differences between how Rust and C/C++ manage memory. In essence, I want to understand how to manage dynamic memory in Rust (or a better way to

Writing a custom Xtext/ANTLR lexer without a grammar file


I'm writing an Eclipse/Xtext plugin for CoffeeScript, and I realized I'll probably need to write a lexer for it by hand. The CoffeeScript parser also uses a hand-written lexer to handle indentation and other tricks in the grammar. Xtext generates a class that extends org.eclipse.xtext.parser.antlr.Lexer, which in turn extends org.antlr.runtime.Lexer. So I suppose I'll have to extend it. I can see two ways to do that:

1. Override mTokens(). This is done by the generated code, changing the internal state.
2. Override nextToken(), which seems a natural approach, but then I'll have to keep track of the internal

Parsing optional semicolon at statement end


I am writing a parser for C-like grammars. So far, it can parse code like:

    a = 1;
    b = 2;

Now I want to make the semicolon at the end of a line optional. The original YACC rule was:

    stmt: expr ';' { ... }

Newlines are handled by a lexer that I wrote myself (the code is simplified):

    rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }

The instruction :PASS here is equivalent to returning nothing in LEX: it drops the currently matched text and skips to the next rule, just as is usually done with whitespace. Because of this, I can't simply change my YACC rule into:

Is there a working C++ grammar file for ANTLR?


Are there any existing C++ grammar files for ANTLR? I'm looking to lex, not parse, some C++ source code files. I've looked on the ANTLR grammar page, and there is one listed that was created by Sun Microsystems. However, it seems to be a generated parser. Can anyone point me to a C++ ANTLR lexer or grammar file?

C++ parsers are tough to build. I can't speak from experience about using ANTLR's C++ grammars. Here I discuss what I learned by reading the notes attached to the one I did see at the ANTLR site; in essence, the author produced an incomplete grammar. And that was for just

Where should I draw the line between lexer and parser?


I'm writing a lexer for the IMAP protocol for educational purposes, and I'm stumped as to where I should draw the line between lexer and parser. Take this example of an IMAP server response:

    * FLAGS (\Answered \Deleted)

This response is defined in the formal syntax like this:

    mailbox-data = "FLAGS" SP flag-list
    flag-list    = "(" [flag *(SP flag)] ")"
    flag         = "\Answered" / "\Deleted"

Since they are specified as string literals (aka "terminal" tokens), would it be more correct for the lexer to emit a unique token for each, like:

    (TknAnsweredFlag) (TknSpace) (TknDeletedFlag)

Or would it be just as
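For contrast with one token kind per flag, here is a minimal Python sketch of a different split: the lexer emits a single generic FLAG kind that carries the flag text, and deciding which flags are valid is left to a later stage. The token names and the `flag_names` helper are hypothetical, and the patterns cover only the example response above, not the full IMAP syntax.

```python
import re

# One named group per token kind; all names are hypothetical.
TOKEN_RE = re.compile(r"""
    (?P<STAR>\*)
  | (?P<SP>\x20)
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<FLAG>\\[A-Za-z]+)   # a backslash followed by letters, e.g. \Answered
  | (?P<ATOM>[A-Za-z]+)     # a bare word such as FLAGS
""", re.VERBOSE)


def tokenize(line):
    """Split one response line into (kind, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character {line[pos]!r} at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def flag_names(tokens):
    """Collect the text of every FLAG token; validation would happen here or later."""
    return [lexeme for kind, lexeme in tokens if kind == "FLAG"]
```

With this split, `flag_names(tokenize(r"* FLAGS (\Answered \Deleted)"))` returns the two flag strings, and supporting a new flag needs no change to the lexer. The per-flag alternative from the question moves that knowledge into the token set instead.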

Examples / tutorials for usage of javax.lang.model or ANTLR JavaParser to get information on Java Source Code


I would like to create an automatic flowchart-like visualization of simple Java logic. For this I need to parse Java source code, and I have two candidates: ANTLR and javax.lang.model from Java 6. Neither is easy. I have yet to find a single working example that is even remotely close to what I want to achieve. I want to find simple variable declarations, assignments, and flows (if, for, switch, boolean conditions, etc.). Is there a simple example or tutorial for either of these? I found very few ANTLR examples (none of them work out of the box without significant "homework") and absolutely

How to make lex/flex recognize tokens not separated by whitespace?


I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens, that is, tokens not separated by whitespace. E.g., the string 39if is supposed to be recognized as the number 39 and the keyword if. At the same time, the lexer must also exit(1) when it encounters invalid input. A simplified version of the code I have:

    %{
    #include <stdio.h>
    %}
    %option main warn debug
    %%
    if |
    then |
    else    printf("keyword: %s\n", yytext);
    [[:digit:]]+
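The required behaviour can be sketched outside flex with a small regex-driven scanner. This is a Python illustration, not the flex solution the assignment asks for: the token names and the `exit_status` helper are hypothetical, and Python tries alternatives in order rather than picking the longest match as flex does, which makes no difference for this tiny token set.

```python
import re

# Token kinds are hypothetical names; the patterns mirror the flex rules above.
TOKEN_RE = re.compile(r"""
    (?P<KEYWORD>if|then|else)
  | (?P<NUMBER>[0-9]+)
  | (?P<SKIP>[ \t\r\n]+)
""", re.VERBOSE)


def tokenize(text):
    """Scan text left to right; tokens need no whitespace between them."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # Nothing matches at this position: the input is invalid.
            raise ValueError(f"invalid input {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def exit_status(text):
    """Return 0 if the whole input lexes cleanly, 1 otherwise."""
    try:
        for kind, lexeme in tokenize(text):
            print(f"{kind.lower()}: {lexeme}")
    except ValueError:
        return 1
    return 0
```

Because each match resumes exactly where the previous one ended, `tokenize("39if")` produces a NUMBER followed by a KEYWORD with no separator needed. A script could pass the result of `exit_status` to `sys.exit` to get the exit(1) behaviour on invalid input.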