For argument's sake, let's assume an HTML parser.
I've read that it tokenizes everything first, and then parses it.
What does tokenize mean?
Tokenizing can be composed of a few steps. For example, if you have this HTML code:
<html>
<head>
    <title>My HTML Page</title>
</head>
<body>
    <p style="special">
        This paragraph has special style
    </p>
    <p>
        This paragraph is not special
    </p>
</body>
</html>
the tokenizer may convert that string into a flat list of significant tokens, discarding whitespace (thanks, SasQ, for the correction):
["<", "html", ">",
    "<", "head", ">",
        "<", "title", ">", "My HTML Page", "</", "title", ">",
    "</", "head", ">",
    "<", "body", ">",
        "<", "p", "style", "=", "\"", "special", "\"", ">",
            "This paragraph has special style",
        "</", "p", ">",
        "<", "p", ">",
            "This paragraph is not special",
        "</", "p", ">",
    "</", "body", ">",
"</", "html", ">"
]
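As a rough illustration of this first pass, here is a minimal Python sketch. The function name `tokenize` and both regular expressions are assumptions made for this toy example only; it handles just the simple markup shown above (quoted attribute values without spaces, no comments or entities) and is not how a real HTML tokenizer works.

```python
import re

# Hypothetical pattern for the pieces inside a tag: punctuation first,
# then any run of characters that is not punctuation or whitespace.
TAG_PIECE = re.compile(r'</|<|>|=|"|[^\s<>="]+')

# Splits the source into "<...>" tag chunks and the text between them.
CHUNK = re.compile(r'<[^>]*>|[^<]+')

def tokenize(source):
    """Turn a simple HTML string into a flat list of tokens."""
    tokens = []
    for chunk in CHUNK.findall(source):
        if chunk.startswith("<"):
            # Break the tag into "<", names, "=", quotes, ">" and so on.
            tokens.extend(TAG_PIECE.findall(chunk))
        elif chunk.strip():
            # Keep text, trimming and collapsing insignificant whitespace.
            tokens.append(" ".join(chunk.split()))
        # Whitespace-only chunks between tags are discarded.
    return tokens

tokens = tokenize('<p style="special">\n    This paragraph has special style\n</p>')
```

Note that the tokenizer has no idea whether tags are balanced or nested correctly; it only chops the string into pieces.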
There may be multiple tokenizing passes that convert a list of tokens into a list of even higher-level tokens, as the following hypothetical HTML parser might do (the result is still a flat list):
[("<html>", {}),
    ("<head>", {}),
        ("<title>", {}), "My HTML Page", "</title>",
    "</head>",
    ("<body>", {}),
        ("<p>", {"style": "special"}),
            "This paragraph has special style",
        "</p>",
        ("<p>", {}),
            "This paragraph is not special",
        "</p>",
    "</body>",
"</html>"
]
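A second pass like this could be sketched in Python as follows. The name `group_tokens` is hypothetical, and the code assumes every attribute arrives as exactly five low-level tokens (name, `=`, quote, value, quote), which holds for the example but not for HTML in general.

```python
def group_tokens(tokens):
    """Second pass: fold punctuation-level tokens into tag-level tokens."""
    grouped = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "</":
            # "</", name, ">"  becomes  "</name>"
            grouped.append("</" + tokens[i + 1] + ">")
            i += 3
        elif tok == "<":
            name = tokens[i + 1]
            attrs = {}
            i += 2
            while tokens[i] != ">":
                # name, "=", '"', value, '"'  becomes  attrs[name] = value
                attrs[tokens[i]] = tokens[i + 3]
                i += 5
            grouped.append(("<" + name + ">", attrs))
            i += 1  # skip the closing ">"
        else:
            # Plain text passes through unchanged.
            grouped.append(tok)
            i += 1
    return grouped

low_level = ["<", "p", "style", "=", '"', "special", '"', ">",
             "This paragraph has special style",
             "</", "p", ">"]
high_level = group_tokens(low_level)
```

The output is still a flat list; the opening and closing tags have not yet been matched to each other.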
Then the parser converts that list of tokens into a tree or graph that represents the source text in a form that is more convenient for the program to access and manipulate:
("<html>", {}, [
    ("<head>", {}, [
        ("<title>", {}, ["My HTML Page"]),
    ]),
    ("<body>", {}, [
        ("<p>", {"style": "special"}, ["This paragraph has special style"]),
        ("<p>", {}, ["This paragraph is not special"]),
    ]),
])
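One common way to build such a tree from a flat list is to keep a stack of currently open elements. The sketch below assumes the tag-level token format used above (tuples for opening tags, strings for closing tags and text); the function name `parse` and the `"<root>"` sentinel are made up for illustration, and real HTML parsers apply far more forgiving error-recovery rules.

```python
def parse(tokens):
    """Build a nested (tag, attrs, children) tree from tag-level tokens."""
    root_children = []
    # Hypothetical sentinel node so that the stack is never empty.
    stack = [("<root>", {}, root_children)]
    for tok in tokens:
        if isinstance(tok, tuple):
            # Opening tag: create a node, attach it to its parent, descend.
            tag, attrs = tok
            node = (tag, attrs, [])
            stack[-1][2].append(node)
            stack.append(node)
        elif tok.startswith("</"):
            # Closing tag: it must match the innermost open element.
            open_tag = stack.pop()[0]
            if open_tag != "<" + tok[2:]:
                raise ValueError(f"{tok} does not close {open_tag}")
        else:
            # Text: becomes a child of the innermost open element.
            stack[-1][2].append(tok)
    if len(stack) != 1:
        raise ValueError("unclosed tag " + stack[-1][0])
    if len(root_children) != 1:
        raise ValueError("expected exactly one top-level element")
    return root_children[0]

tree = parse([
    ("<p>", {"style": "special"}),
    "This paragraph has special style",
    "</p>",
])
```

This is the step where structure errors such as mismatched or unclosed tags can first be detected, because the tokenizer never looks at nesting.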
At this point, parsing is complete, and it is then up to the user to interpret the tree, modify it, etc.
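To give a sense of what "interpreting the tree" can look like, here is a small sketch that walks a tree of `(tag, attrs, children)` tuples and collects every node with a given tag. The helper name `find_all` is an assumption for this example and does not come from any particular library.

```python
def find_all(node, tag):
    """Yield every node whose tag matches, in depth-first order."""
    name, attrs, children = node
    if name == tag:
        yield node
    for child in children:
        # Children are either nested nodes (tuples) or plain text (str).
        if isinstance(child, tuple):
            yield from find_all(child, tag)

tree = ("<html>", {}, [
    ("<head>", {}, [
        ("<title>", {}, ["My HTML Page"]),
    ]),
    ("<body>", {}, [
        ("<p>", {"style": "special"}, ["This paragraph has special style"]),
        ("<p>", {}, ["This paragraph is not special"]),
    ]),
])

# Text of every paragraph, and the attributes of the styled ones.
paragraph_texts = [children[0] for _, _, children in find_all(tree, "<p>")]
styled = [attrs for _, attrs, _ in find_all(tree, "<p>") if "style" in attrs]
```

Because the nesting is explicit in the tree, queries like this need no knowledge of angle brackets or quoting; that work was done once by the tokenizer and parser.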