Accessing tokenization of a C++ source file

Submitted by 浪子不回头ぞ on 2021-01-25 23:02:49

Question


My understanding is that one step of the compilation of a program (irrespective of the language, I guess) is parsing the source file into some kind of space-separated tokens (this tokenization would be done by what's referred to as a scanner in this answer). For instance, I understand that at some point in the compilation process, a line containing x += fun(nullptr); is separated into something like

  • x
  • +=
  • fun
  • (
  • nullptr
  • )
  • ;

Is this true? If so, is there a way to have access to this tokenization of a C++ source code?

I'm asking this question mostly out of curiosity, and I do not intend to write a lexer myself.

And the reason I'm curious to know whether one can leverage the compiler is that, to give an example, before meeting [[noreturn]] & Co. I would never have considered [[ a valid token, had I been writing a lexer myself.

Do we necessarily need an actual use case? I don't think so, if I'm simply curious about whether an existing tool can do something.

However, if we really need a use case,

let's say my target is to write a C++ function which reads a C++ source file and returns a std::vector of the lexemes it's made up of. Clearly, a requirement is that concatenating the elements of the output should reproduce the whole text, including line breaks and every other byte of it.
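Something along these lines, where lex_file is just a made-up name and the body is only a placeholder (one "lexeme" per byte) standing in for a real lexer, to show the round-trip requirement:

    #include <cassert>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Invented name. A real implementation would group characters into C++
    // lexemes; this placeholder just returns one element per byte, which
    // trivially satisfies the round-trip requirement.
    std::vector<std::string> lex_file(const std::string& path)
    {
        std::ifstream in(path, std::ios::binary);
        std::string text((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

        std::vector<std::string> lexemes;
        for (char c : text)
            lexemes.emplace_back(1, c);
        return lexemes;
    }

    int main()
    {
        std::ifstream in("foo.cc", std::ios::binary);
        const std::string original((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());

        std::string joined;
        for (const std::string& lexeme : lex_file("foo.cc"))
            joined += lexeme;

        // The requirement: concatenating the output reproduces every byte,
        // including whitespace and line breaks.
        assert(joined == original);
    }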


Answer 1:


With the restriction mentioned in the comments (a tokenization that keeps __DATE__ as-is), this seems rather manageable. What you need are the preprocessing tokens. The Boost.Wave preprocessor necessarily builds such a token list, because it has to work on those tokens.
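For illustration, here is a small sketch modelled on the "lexed_tokens" sample in the Boost.Wave documentation. It runs only the raw C++ lexer (no preprocessing, so macros such as __DATE__ stay single tokens) and prints the text of every preprocessing token; error handling is omitted and I have not tested this exact snippet.

    #include <boost/wave.hpp>
    #include <boost/wave/cpplexer/cpp_lex_token.hpp>
    #include <boost/wave/cpplexer/cpp_lex_iterator.hpp>

    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>

    int main(int argc, char* argv[])
    {
        // Read the whole translation unit into memory.
        std::ifstream in(argv[1]);
        std::string input((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());

        typedef boost::wave::cpplexer::lex_token<> token_type;
        typedef boost::wave::cpplexer::lex_iterator<token_type> lexer_type;
        typedef token_type::position_type position_type;

        position_type pos(argv[1]);
        lexer_type it(input.begin(), input.end(), pos,
            boost::wave::language_support(
                boost::wave::support_cpp | boost::wave::support_option_long_long));
        lexer_type end;

        // Whitespace and comments come back as tokens too, so concatenating
        // get_value() over all tokens should rebuild the source text.
        for (; it != end; ++it)
            std::cout << it->get_value();
    }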

Basile correctly points out that it's hard to assign a meaning to those tokens.




Answer 2:


C++ is a very complex programming language.

Be sure to read the C++11 draft standard n3337 before even attempting to parse C++ code.

Look inside the source code of existing open-source C++ compilers, such as GCC (at least GCC 10 in October 2020) or Clang (at least Clang 10 in October 2020).

If you have to write your C++ parser from scratch, be sure to have the budget for at least a full person-year of work.

Look also into existing C++ static source code analyzers, such as Frama-C++ or the Clang static analyzer. Consider adapting one of them to your needs, but document your needs in writing before you start coding. Be aware of Rice's theorem.

I would suggest instead writing your own GCC plugin.

Indeed, it would be tied to some major version of GCC, but you'll save months of work.
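To give an idea of the scaffolding involved, here is a minimal plugin skeleton following the documented GCC plugin API; the callback, file names, and build command are illustrative only, and real work on the compiler's internal representations would go inside the callback.

    // Minimal GCC plugin skeleton (illustrative sketch).
    // Build roughly like:
    //   g++ -shared -fPIC -fno-rtti \
    //       -I`g++ -print-file-name=plugin`/include plugin.cc -o plugin.so
    //   g++ -fplugin=./plugin.so foo.cc
    #include "gcc-plugin.h"
    #include "plugin-version.h"

    // GCC refuses to load plugins that do not declare GPL compatibility.
    int plugin_is_GPL_compatible;

    static void on_finish(void* /*gcc_data*/, void* /*user_data*/)
    {
        // Real work (walking GENERIC/GIMPLE, dumping information, ...) goes here.
    }

    int plugin_init(struct plugin_name_args* plugin_info,
                    struct plugin_gcc_version* version)
    {
        // Plugins are tied to a particular GCC version; check it up front.
        if (!plugin_default_version_check(version, &gcc_version))
            return 1;

        register_callback(plugin_info->base_name, PLUGIN_FINISH,
                          on_finish, /*user_data=*/nullptr);
        return 0;
    }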

Is this true? If so, is there a way to have access to this tokenization of a C++ source code?

Yes, by patching some existing open-source C++ compiler, or by extending it with your own plugin (there are licensing conditions attached to both approaches).

let's say my target is to write a C++ function which reads in a C++ source file and returns a std::vector of the lexemes it's made up of.

The above specification is ambiguous.

Do you want the lexemes before or after the C++ preprocessing phase? In other words, what would be the lexeme for e.g. __DATE__ or __TIME__? Read e.g. the documentation of GNU cpp. If you happen to use GCC on Linux (see gcc(1)) and have some C++ translation unit foo.cc, try running g++ -C -E -Wall foo.cc > foo.ii and look (using less(1)) at the generated preprocessed form foo.ii. And what about template expansion, preprocessor conditionals, or preprocessor stringizing?

I would suggest writing your GCC plugin to work on GENERIC representations. You could also start PhD work related to your goals.

Notice that generating C++ code is a lot easier than parsing it.

Look inside Qt for an example of software that generates C++ code. You could consider using GNU m4, GNU gawk, GNU autoconf, GPP, or your own C++ source generator (perhaps with the help of GNU bison or ANTLR) to generate some of your C++ code.
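As a toy illustration of why emitting C++ is so much simpler than parsing it, here is a tiny generator; the output file name and its content are invented for this example.

    // Toy generator: writes a header with an enum and a matching to_string().
    #include <fstream>
    #include <string>
    #include <vector>

    int main()
    {
        const std::vector<std::string> colors = {"Red", "Green", "Blue"};

        std::ofstream out("colors.gen.hh");
        out << "// Generated file - do not edit.\n";
        out << "enum class Color {\n";
        for (const std::string& c : colors)
            out << "    " << c << ",\n";
        out << "};\n\n";

        out << "inline const char* to_string(Color c) {\n";
        out << "    switch (c) {\n";
        for (const std::string& c : colors)
            out << "    case Color::" << c << ": return \"" << c << "\";\n";
        out << "    }\n    return \"?\";\n";
        out << "}\n";
    }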

PS. On my home page you'll find a hyperlink to a draft report related to your question, and another hyperlink to an open-source program generating C++ code. Sadly it seems that I am forbidden to give those hyperlinks here, but you can find them in two mouse clicks. You might also look into the two European H2020 projects funding that draft report: CHARIOT & DECODER.



Source: https://stackoverflow.com/questions/64220138/accessing-tokenization-of-a-c-source-file
