I currently have a working, simple language implemented in Java using ANTLR. What I want to do is embed it in plain text, in a similar fashion to PHP.
For example:
Lorem ipsum dolor sit amet
<% print('consectetur adipiscing elit'); %>
Phasellus volutpat dignissim sapien.
I anticipate that the resulting token stream would look something like:
CDATA OPEN PRINT OPAREN APOS STRING APOS CPAREN SEMI CLOSE CDATA
How can I achieve this, or is there a better way?
There is no restriction on what might be outside the <% block. I assumed something like <% print('%>'); %>, as per Michael Mrozek's answer, would be possible, but outside of a situation like that, <% would always indicate the start of a code block.
Sample Implementation
I developed a solution based on ideas given in Michael Mrozek's answer, simulating Flex's start conditions using ANTLR's gated semantic predicates:
lexer grammar Lexer;
@members {
    boolean codeMode = false;
}
OPEN    : {!codeMode}?=> '<%' { codeMode = true; } ;
CLOSE   : {codeMode}?=> '%>' { codeMode = false;} ;
LPAREN  : {codeMode}?=> '(';
//etc.
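// ANTLR 3 can only negate single characters, not the two-character literal '<%',
// so match any char except '<', or a '<' that is not followed by '%'.
CHAR    : {!codeMode}?=> ( ~('<') | {input.LA(2) != '%'}?=> '<' );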
parser grammar Parser;
options {
    tokenVocab = Lexer;
    output = AST;
}
tokens {
    VERBATIM;
}
program :
    (code | verbatim)+
    ;   
code :
    OPEN statement+ CLOSE -> statement+
    ;
verbatim :
    CHAR -> ^(VERBATIM CHAR)
    ;
The actual concept looks fine, although it's unlikely you'd have a PRINT token; the lexer would probably emit something like IDENTIFIER, and the parser would be responsible for figuring out that it's a function call (e.g. by looking for IDENTIFIER OPAREN ... CPAREN) and doing the appropriate thing.
As for how to do it, I don't know anything about ANTLR, but it probably has something like flex's start conditions. If so, you can have the INITIAL start condition do nothing but look for <%, which would switch to the CODE state where all the actual tokens are defined; then '%>' would switch back. In flex it would be:
%s CODE
%%
<INITIAL>{
    "<%"      {BEGIN(CODE);}
    .         {}
}
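 /* These rules have no start condition. Because CODE was declared with %s
    (inclusive), they are active in CODE but also in INITIAL; declare CODE
    with %x and wrap them in <CODE>{} to restrict them to code blocks.
  */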
"%>"          {BEGIN(INITIAL);}
"("           {return OPAREN;}
"'"           {return APOS;}
...
You need to be careful about matching %> in a context where it is not a closing marker, such as inside a string literal. Whether to allow <% print('%>'); %> is up to you, but most likely you do want to.
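For readers who want the same idea without Flex, here is a minimal hand-written sketch in Java of the two start conditions: a text mode that only looks for <%, and a code mode that emits real tokens. The class name TwoModeLexer, the token names, and the tiny token set are assumptions chosen for illustration, not part of ANTLR or Flex. Unlike the APOS STRING APOS stream in the question, this sketch lexes a quoted string as a single STRING token, which is what keeps a %> inside quotes from closing the block.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-mode lexer: text mode outside <% ... %>, code mode inside. */
class TwoModeLexer {
    /** A token is just a type name plus the text it matched. */
    static final class Token {
        final String type;
        final String text;
        Token(String type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type; }
    }

    static List<Token> lex(String src) {
        List<Token> out = new ArrayList<>();
        boolean codeMode = false;   // plays the role of the INITIAL/CODE start condition
        int i = 0;
        while (i < src.length()) {
            if (!codeMode) {
                // Text mode: everything up to the next "<%" is one CDATA token.
                int open = src.indexOf("<%", i);
                int end = (open < 0) ? src.length() : open;
                if (end > i) out.add(new Token("CDATA", src.substring(i, end)));
                if (open >= 0) {
                    out.add(new Token("OPEN", "<%"));
                    codeMode = true;
                    i = open + 2;
                } else {
                    i = end;
                }
                continue;
            }
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                    // skip whitespace inside code
            } else if (src.startsWith("%>", i)) {
                out.add(new Token("CLOSE", "%>"));
                codeMode = false;
                i += 2;
            } else if (c == '(') {
                out.add(new Token("OPAREN", "(")); i++;
            } else if (c == ')') {
                out.add(new Token("CPAREN", ")")); i++;
            } else if (c == ';') {
                out.add(new Token("SEMI", ";")); i++;
            } else if (c == '\'') {
                // Consume the whole string literal, honouring backslash escapes,
                // so a "%>" inside quotes is never seen as a close tag.
                int j = i + 1;
                while (j < src.length() && src.charAt(j) != '\'') {
                    j += (src.charAt(j) == '\\') ? 2 : 1;
                }
                if (j >= src.length()) {
                    throw new IllegalArgumentException("unterminated string at " + i);
                }
                out.add(new Token("STRING", src.substring(i, j + 1)));
                i = j + 1;
            } else if (Character.isLetter(c) || c == '_') {
                int j = i;
                while (j < src.length()
                        && (Character.isLetterOrDigit(src.charAt(j)) || src.charAt(j) == '_')) {
                    j++;
                }
                out.add(new Token("IDENTIFIER", src.substring(i, j)));
                i = j;
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("Lorem <% print('%>'); %> ipsum"));
    }
}
```

Calling TwoModeLexer.lex("Lorem <% print('%>'); %> ipsum") yields the token types CDATA OPEN IDENTIFIER OPAREN STRING CPAREN SEMI CLOSE CDATA, with print arriving as a plain IDENTIFIER for the parser to interpret.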
"but outside of a situation like that, <% would always indicate the start of a code block."
In that case, first scan the file for your embedded code, and once you have those, parse your embedded code with a dedicated parser (without the noise before the <% and after the %> tags).
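As a rough illustration of that two-pass idea, the first pass could be sketched with a regular expression in plain Java. This is an assumption-laden stand-in rather than an ANTLR feature: the class name EmbeddedBlockScanner and the method extract are hypothetical, strings are assumed to be single-quoted with backslash escapes, and each extracted block would then be handed to whatever dedicated parser you already have.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical first pass: pull the code between <% and %> out of the surrounding text. */
class EmbeddedBlockScanner {
    // "<%", then any mix of:
    //   a single-quoted string with backslash escapes,
    //   any character other than % and ',
    //   a % that is not followed by >,
    // and finally "%>". Group 1 is the code without the tags.
    private static final Pattern BLOCK = Pattern.compile(
            "<%((?:'(?:\\\\.|[^'\\\\])*'|[^%']|%(?!>))*+)%>",
            Pattern.DOTALL);

    /** Returns the contents of every embedded block, in document order. */
    static List<String> extract(String source) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(source);
        while (m.find()) {
            blocks.add(m.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String source = "foo <% a = 10 % 3; print('x %> y'); %> bar <% more %>";
        for (String block : extract(source)) {
            // A real program would pass each block to its dedicated parser here.
            System.out.println("[" + block + "]");
        }
    }
}
```

The three alternatives inside the loop start with different characters, so the possessive quantifier (*+) does not change what is matched; it only avoids needless backtracking. A block containing an unterminated string is skipped by this sketch instead of being reported as an error.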
ANTLR has the option to let the lexer parse just a (small) part of an input file and ignore the rest. Note that you cannot create a "combined grammar" (parser and lexer in one) in that case. Here's how you can create such a "partial lexer":
// file EmbeddedCodeLexer.g
lexer grammar EmbeddedCodeLexer;
options{filter=true;} // <- enables the partial lexing!
EmbeddedCode
  :  '<%'                            // match an open tag
     (  String                       // ( match a string literal
     |  ~('%' | '\'')                //   OR match any char except `%` and `'`
     |  {input.LT(2) != '>'}?=> '%'  //   OR only match a `%` if `>` is not ahead of it
     )*                              // ) <- zero or more times
     '%>'                            // match a close tag
  ;
fragment
String
  :  '\'' ('\\' . | ~('\'' | '\\'))* '\''
  ;
If you now create a lexer from it:
java -cp antlr-3.2.jar org.antlr.Tool EmbeddedCodeLexer.g 
and create a little test harness:
import org.antlr.runtime.*;
public class Main {
    public static void main(String[] args) throws Exception {
        String source = "Lorem ipsum dolor sit amet       \n"+
                "<%                                       \n"+
                "a = 2 > 1 && 10 % 3;                     \n"+
                "print('consectetur %> adipiscing elit'); \n"+
                "%>                                       \n"+
                "Phasellus volutpat dignissim sapien.     \n"+
                "foo <% more code! %> bar                 \n";
        ANTLRStringStream in = new ANTLRStringStream(source);
        EmbeddedCodeLexer lexer = new EmbeddedCodeLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        for(Object o : tokens.getTokens()) {
            System.out.println("=======================================\n"+
                    "EmbeddedCode = "+((Token)o).getText());
        }
    }
}
compile it all:
javac -cp antlr-3.2.jar *.java
and finally run the Main class by doing:
# *nix/macOS
java -cp .:antlr-3.2.jar Main
REM Windows
java -cp .;antlr-3.2.jar Main
it will produce the following output:
=======================================
EmbeddedCode = <%                                       
a = 2 > 1 && 10 % 3;                     
print('consectetur %> adipiscing elit'); 
%>
=======================================
EmbeddedCode = <% more code! %>
Source: https://stackoverflow.com/questions/2798545/how-do-i-lex-this-input