I'm working on a lexical analyzer program right now and I'm using Java. I've been searching for answers to this problem, but so far I have failed to find any. Here's my
You can use tools like Lex & Bison in C or ANTLR in Java. Lexical analysis can also be done by building an automaton by hand. I'll give you a small example:
Suppose you need to tokenize a string in a language whose only keywords are {'echo', '.', ' ', 'end'}. By keywords I mean that the language consists of these tokens and nothing else. So if I input
echo .
end .
My lexer should output
echo ECHO
SPACE
. DOT
end END
SPACE
. DOT
Now, to build an automaton for such a tokenizer, I can start with
  ->(SPACE) (Back)
  |
(S)-------------E->C->H->O->(ECHO) (Back)
  |             |
  .->(DOT)(Back) ->N->D->(END) (Back to Start)
The diagram above is probably very rough, but the idea is that you have a start state, represented by S. You consume E and move to another state, where you expect either N or C to come next, for END and ECHO respectively. You keep consuming characters and move through the states of this simple finite state machine. Eventually you reach an emit state: for example, after consuming E, N, D you reach the emit state for END, which emits the token, and then you go back to the start state. This cycle continues for as long as characters keep streaming into your tokenizer. On an invalid character you can either throw an error or ignore it, depending on your design.
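Here is a minimal sketch of that state machine in Java, just to make the idea concrete. The class name `TinyLexer`, the state names, and the choice to return token names as strings are all my own assumptions, not part of any library. It matches lowercase letters as in the sample input (the diagram uses uppercase only for readability), skips line breaks because the sample input spans two lines, and throws on any other unexpected character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer for the language {'echo', '.', ' ', 'end'}.
class TinyLexer {
    // One state per prefix consumed so far; START corresponds to (S) in the diagram.
    enum State { START, E, EC, ECH, EN }

    // Runs the automaton over the input and returns the emitted token names in order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.START;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                    if (c == ' ') tokens.add("SPACE");        // emit, stay in START
                    else if (c == '.') tokens.add("DOT");     // emit, stay in START
                    else if (c == 'e') state = State.E;       // could be "echo" or "end"
                    else if (c == '\n' || c == '\r') { }      // assumption: skip line breaks
                    else throw unexpected(c, i);
                    break;
                case E:
                    if (c == 'c') state = State.EC;           // on the way to ECHO
                    else if (c == 'n') state = State.EN;      // on the way to END
                    else throw unexpected(c, i);
                    break;
                case EC:
                    if (c == 'h') state = State.ECH;
                    else throw unexpected(c, i);
                    break;
                case ECH:
                    if (c == 'o') { tokens.add("ECHO"); state = State.START; }
                    else throw unexpected(c, i);
                    break;
                case EN:
                    if (c == 'd') { tokens.add("END"); state = State.START; }
                    else throw unexpected(c, i);
                    break;
            }
        }
        if (state != State.START) {
            throw new IllegalArgumentException("Input ended in the middle of a keyword");
        }
        return tokens;
    }

    private static IllegalArgumentException unexpected(char c, int pos) {
        return new IllegalArgumentException("Unexpected character '" + c + "' at position " + pos);
    }

    public static void main(String[] args) {
        // Prints one token name per line for the sample input.
        for (String token : tokenize("echo .\nend .")) {
            System.out.println(token);
        }
    }
}
```

If you would rather ignore invalid characters than fail, you could replace the `throw` in each branch with a reset to `State.START`; which one is right depends on your design.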