I am currently writing a compiler and I\'m in the Lexer phase.
I know that the lexer tokenizes the input stream.
However, consider the following stream:
In general, your lexer should produce a stream of structs that contain language elements: operators, identifiers, keywords, comments, etc. These structs should be marked with type of the lexeme, and carry content relevant to the type of lexeme it represents.
To enable good error reporting, it is good if each lexeme carries information about starting line and column, endline line and column (some lexemes span multiple lines), and the originating source file (sometimes a parser has to handle included files as well as the main file).
For those language elements that contain variable content (numbers, identifiers, etc.), the struct should contain the variable content.
For compiling or program analysis, the lexer can throw whitespace and comments away. If you intend to parse/modify the code, you'll need to capture comments.
An example output can be instructive. For a variant of OP's example:
/* My test file */
int foo
= 0; // a declaration
... DMS's C front end produces the following lexemes (this is a debug output, really handy to have when designing a complex lexer):
C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>run ../domainlexer C:\temp\test.c
Lexer Stream Display 1.5.1
Using encoding Unicode-UTF-8?ANSI +CRLF +1 /^I
!! Lexer:ResetLexicalModeStack
!! after Lexer:PushLexicalMode:
Lexical Mode Stack:
1 C
File "C:/temp/test.c", line 1: /* My test file */
File "C:/temp/test.c", line 2:
File "C:/temp/test.c", line 3: int foo
!! Lexer:GotoLexicalMode 2 CMain
!! Lexeme @ Line 3 Col 1 ELine 3 ECol 4 Token 23: 'int' [VOID]=0000
<<< PreComments:
Comment 1 Type 1 Line 1 Column 1 `/* My test file */'
!! Lexeme @ Line 3 Col 4 ELine 3 ECol 5 Token 2: whitespace [VOID]=0000
!! Lexeme @ Line 3 Col 5 ELine 3 ECol 8 Token 210: IDENTIFIER [STRING]=`foo'
File "C:/temp/test.c", line 4: = 0; // a declaration
!! Lexer:GotoLexicalMode 1 C
!! Lexeme @ Line 3 Col 8 ELine 4 ECol 5 Token 2: whitespace [VOID]=0000
!! Lexer:GotoLexicalMode 2 CMain
!! Lexeme @ Line 4 Col 5 ELine 4 ECol 6 Token 113: '=' [VOID]=0000
!! Lexeme @ Line 4 Col 6 ELine 4 ECol 7 Token 2: whitespace [VOID]=0000
!! Lexeme @ Line 4 Col 7 ELine 4 ECol 8 Token 138: INT_LITERAL [NATURAL]=0
File "C:/temp/test.c", line 5:
!! Lexeme @ Line 4 Col 8 ELine 4 ECol 9 Token 98: ';' [VOID]=0000
>>> PostComments:
Comment 1 Type 2 Line 4 Column 10 `// a declaration'
File "C:/temp/test.c", line 5:
File "C:/temp/test.c", line 6:
File "C:/temp/test.c", line 7:
!! Lexer:GotoLexicalMode 1 C
!! Lexeme @ Line 4 Col 26 ELine 7 ECol 1 Token 2: whitespace [VOID]=0000
!! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 4: end_of_input_stream [VOID]=0000
!! Lexer:GotoLexicalMode 2 CMain
!! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 0: EndOfFile
11 lexemes processed.
0 lexical errors detected.
C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>
The main output are lines marked !!, each of which represents the contents of a lexeme struct produced by the lexer. Each lexeme carries:
The last "token" EndOfFile is implicit defined in every DMS lexer.
This debug trace also notes transitions of the lexer across lexical modes (many lexer generators have multiple modes in which they lex various parts of a language). It shows source lines as they are read.