Output of Lexer

隐瞒了意图╮ 2021-01-15 11:30

I am currently writing a compiler and I'm in the lexer phase.

I know that the lexer tokenizes the input stream.

However, consider the following stream:

3 Answers
  •  情歌与酒
    2021-01-15 11:54

    In general, your lexer should produce a stream of structs that represent language elements: operators, identifiers, keywords, comments, etc. Each struct should be marked with the type of the lexeme it represents and carry the content relevant to that type.

    To enable good error reporting, each lexeme should carry its starting line and column, its ending line and column (some lexemes span multiple lines), and the originating source file (sometimes a parser has to handle included files as well as the main file).

    For those language elements that contain variable content (numbers, identifiers, etc.), the struct should contain the variable content.

    For compiling or program analysis, the lexer can throw whitespace and comments away. If you intend to parse/modify the code, you'll need to capture comments.
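As a rough sketch of the struct described above, here is one way it might look for a tiny C-like subset. Every name here (`Lexeme`, `Lexer`, `TokenKind`, `ValueKind`, `next_lexeme`) is hypothetical and is not taken from DMS or any other real lexer. The sketch assumes the "compiling" case, so whitespace and comments are simply skipped.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for a tiny C-like subset. */
typedef enum { TOK_EOF, TOK_KW_INT, TOK_IDENT, TOK_INT_LITERAL,
               TOK_ASSIGN, TOK_SEMI, TOK_ERROR } TokenKind;

/* Which member of the value union is meaningful, if any. */
typedef enum { VAL_VOID, VAL_STRING, VAL_NATURAL } ValueKind;

typedef struct {
    TokenKind kind;
    const char *file;        /* originating source file */
    int line, col;           /* 1-based start position */
    int end_line, end_col;   /* position just past the last character */
    ValueKind value_kind;
    union {
        char text[32];           /* identifier text (truncated if longer) */
        unsigned long natural;   /* integer literal value (overflow unchecked) */
    } value;
} Lexeme;

typedef struct {
    const char *file;
    const char *src;   /* NUL-terminated input */
    size_t pos;
    int line, col;     /* position of src[pos] */
} Lexer;

/* Consume one character, keeping line and column up to date. */
static void advance(Lexer *lx) {
    if (lx->src[lx->pos] == '\n') { lx->line++; lx->col = 1; }
    else lx->col++;
    lx->pos++;
}

/* Skip whitespace, line comments, and block comments. */
static void skip_trivia(Lexer *lx) {
    for (;;) {
        const char *p = lx->src + lx->pos;
        if (isspace((unsigned char)p[0])) {
            advance(lx);
        } else if (p[0] == '/' && p[1] == '/') {
            while (lx->src[lx->pos] != '\0' && lx->src[lx->pos] != '\n')
                advance(lx);
        } else if (p[0] == '/' && p[1] == '*') {
            advance(lx); advance(lx);
            while (lx->src[lx->pos] != '\0' &&
                   !(lx->src[lx->pos] == '*' && lx->src[lx->pos + 1] == '/'))
                advance(lx);
            if (lx->src[lx->pos] != '\0') { advance(lx); advance(lx); }
        } else {
            return;
        }
    }
}

/* Produce the next lexeme, with its location and any variable content. */
Lexeme next_lexeme(Lexer *lx) {
    skip_trivia(lx);
    Lexeme t = {0};
    t.file = lx->file;
    t.line = lx->line;
    t.col = lx->col;
    t.value_kind = VAL_VOID;
    unsigned char c = (unsigned char)lx->src[lx->pos];
    if (c == '\0') {
        t.kind = TOK_EOF;
    } else if (isalpha(c) || c == '_') {
        size_t n = 0;
        while (isalnum((unsigned char)lx->src[lx->pos]) || lx->src[lx->pos] == '_') {
            if (n + 1 < sizeof t.value.text) t.value.text[n++] = lx->src[lx->pos];
            advance(lx);
        }
        t.value.text[n] = '\0';
        if (strcmp(t.value.text, "int") == 0) {
            t.kind = TOK_KW_INT;          /* keywords carry no value */
        } else {
            t.kind = TOK_IDENT;
            t.value_kind = VAL_STRING;
        }
    } else if (isdigit(c)) {
        unsigned long v = 0;
        while (isdigit((unsigned char)lx->src[lx->pos])) {
            v = v * 10 + (unsigned long)(lx->src[lx->pos] - '0');
            advance(lx);
        }
        t.kind = TOK_INT_LITERAL;
        t.value_kind = VAL_NATURAL;
        t.value.natural = v;
    } else {
        t.kind = c == '=' ? TOK_ASSIGN : c == ';' ? TOK_SEMI : TOK_ERROR;
        advance(lx);
    }
    t.end_line = lx->line;
    t.end_col = lx->col;
    return t;
}
```

A caller would initialize a `Lexer` with `pos = 0`, `line = 1`, `col = 1` and call `next_lexeme` until it returns `TOK_EOF`. A lexer meant for source-to-source transformation would need to keep the comments instead of discarding them in `skip_trivia`.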

    An example output can be instructive. For a variant of OP's example:

    /* My test file */
    
    int foo
        = 0; // a declaration
    

    ... DMS's C front end produces the following lexemes (this is a debug output, really handy to have when designing a complex lexer):

    C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>run ../domainlexer C:\temp\test.c
    Lexer Stream Display 1.5.1
    Using encoding Unicode-UTF-8?ANSI +CRLF +1 /^I
    !! Lexer:ResetLexicalModeStack
    !! after Lexer:PushLexicalMode:
    Lexical Mode Stack:
    1 C
    File "C:/temp/test.c", line 1: /* My test file */
    File "C:/temp/test.c", line 2:
    File "C:/temp/test.c", line 3: int foo
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 3 Col 1 ELine 3 ECol 4 Token 23: 'int' [VOID]=0000
      <<< PreComments:
    Comment 1 Type 1 Line 1 Column 1 `/* My test file */'
    !! Lexeme @ Line 3 Col 4 ELine 3 ECol 5 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 3 Col 5 ELine 3 ECol 8 Token 210: IDENTIFIER [STRING]=`foo'
    File "C:/temp/test.c", line 4:     = 0; // a declaration
    !! Lexer:GotoLexicalMode 1 C
    !! Lexeme @ Line 3 Col 8 ELine 4 ECol 5 Token 2: whitespace [VOID]=0000
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 4 Col 5 ELine 4 ECol 6 Token 113: '=' [VOID]=0000
    !! Lexeme @ Line 4 Col 6 ELine 4 ECol 7 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 4 Col 7 ELine 4 ECol 8 Token 138: INT_LITERAL [NATURAL]=0
    File "C:/temp/test.c", line 5:
    !! Lexeme @ Line 4 Col 8 ELine 4 ECol 9 Token 98: ';' [VOID]=0000
      >>> PostComments:
    Comment 1 Type 2 Line 4 Column 10 `// a declaration'
    File "C:/temp/test.c", line 5:
    File "C:/temp/test.c", line 6:
    File "C:/temp/test.c", line 7:
    !! Lexer:GotoLexicalMode 1 C
    !! Lexeme @ Line 4 Col 26 ELine 7 ECol 1 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 4: end_of_input_stream [VOID]=0000
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 0: EndOfFile
    11 lexemes processed.
    0 lexical errors detected.
    
    C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>
    

    The main output is the lines marked !!, each of which shows the contents of a lexeme struct produced by the lexer. Each lexeme carries:

    • source file location information (for the main file, "test.c" in this case, that is not printed to make the debug output a bit more readable)
    • a "token number" (lexeme type) and the human-readable token name (makes debugging a lot easier)
    • the type of value carried by the token: [VOID] means "none", [STRING] means the token carries a string value, [NATURAL] means it carries an integral value, etc.
    • precomments: Comments preceding the token. This is unusual for classic lexers, but necessary if one is trying to transform source code. You can't lose the comments! Note the precomment is attached to a token; because comments are not semantically meaningful, one can argue where they should be placed. This is our particular choice.
    • postcomments: Comments that follow the token and belong to it.
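To make the comment-attachment idea concrete, here is a minimal sketch of one possible rule. It is an assumption for illustration, not DMS's actual algorithm: a comment that starts on the same line as the previous token becomes that token's postcomment, and any other comment becomes a precomment of the next token. The names `RawItem`, `Token`, `attach_comments`, and `MAX_COMMENTS` are hypothetical.

```c
#include <stddef.h>

#define MAX_COMMENTS 4   /* arbitrary cap for this sketch; extras are dropped */

/* Flat lexer output before attachment: tokens and comments interleaved. */
typedef struct {
    int is_comment;      /* 1 for a comment, 0 for an ordinary token */
    const char *text;
    int line;            /* line on which the item starts */
} RawItem;

/* A token that carries the comments attached to it. */
typedef struct {
    const char *text;
    int line;
    const char *pre[MAX_COMMENTS];   size_t n_pre;
    const char *post[MAX_COMMENTS];  size_t n_post;
} Token;

/* Writes tokens (with comments attached) to out; returns the token count.
   out must have room for every non-comment item. Comments still pending
   at the end of input are dropped here; a real lexer would likely attach
   them to its end-of-file token. */
size_t attach_comments(const RawItem *items, size_t n, Token *out) {
    size_t n_out = 0;
    const char *pending[MAX_COMMENTS];
    size_t n_pending = 0;

    for (size_t i = 0; i < n; i++) {
        const RawItem *it = &items[i];
        if (it->is_comment) {
            Token *prev = n_out > 0 ? &out[n_out - 1] : NULL;
            if (prev && prev->line == it->line && n_pending == 0) {
                /* Trailing comment on the same line: postcomment. */
                if (prev->n_post < MAX_COMMENTS)
                    prev->post[prev->n_post++] = it->text;
            } else if (n_pending < MAX_COMMENTS) {
                /* Otherwise hold it for the next token. */
                pending[n_pending++] = it->text;
            }
        } else {
            Token *t = &out[n_out++];
            t->text = it->text;
            t->line = it->line;
            t->n_pre = 0;
            t->n_post = 0;
            for (size_t k = 0; k < n_pending; k++)
                t->pre[t->n_pre++] = pending[k];
            n_pending = 0;
        }
    }
    return n_out;
}
```

Applied to the sample file, this rule puts `/* My test file */` on `int` as a precomment and `// a declaration` on `;` as a postcomment, which agrees with the trace. Other placements are defensible, since comments carry no semantics.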

    The last "token", EndOfFile, is implicitly defined in every DMS lexer.

    This debug trace also notes transitions of the lexer across lexical modes (many lexer generators have multiple modes in which they lex various parts of a language). It shows source lines as they are read.
