Is it possible to generate ANSI C functions with type information for a moving GC implementation?

Question

I am wondering what methods there are to add typing information to generated C functions. I'm transpiling a higher-level programming language to C and I'd like to add a moving garbage collector. However, to do that I need the local variables of each function to carry typing information; otherwise the collector could modify a primitive value that merely looks like a pointer.

An obvious approach would be to wrap every variable (primitive and non-primitive) in a struct that has an extra (enum) field for typing information. However, this would cause memory and performance overhead, and the transpiled code is meant for embedded platforms. If I were to accept the memory overhead, the obvious option would be to use a heap handle for every object, which would let me move heap blocks freely. However, I'm wondering whether there's a more efficient approach.
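For reference, the heap handle option mentioned above could look roughly like the sketch below. All names (`handle_t`, `gc_alloc`, `gc_deref`, `gc_compact`) and the fixed sizes are made up for illustration, alignment is ignored, and a real collector would decide liveness itself instead of relying on an explicit `gc_release`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names and sizes; a fixed-size table keeps the sketch small. */
#define MAX_HANDLES 16
#define HEAP_BYTES 256

typedef unsigned short handle_t;      /* an index into the table, not an address */

static unsigned char heap_a[HEAP_BYTES], heap_b[HEAP_BYTES];
static unsigned char *heap = heap_a;  /* semispace currently in use */
static size_t heap_top = 0;

static struct { unsigned char *addr; size_t size; int used; } table[MAX_HANDLES];

/* Allocate size bytes and return a handle, or MAX_HANDLES on failure. */
static handle_t gc_alloc(size_t size) {
    handle_t h;
    if (heap_top + size > HEAP_BYTES) return MAX_HANDLES;
    for (h = 0; h < MAX_HANDLES; h++) {
        if (!table[h].used) {
            table[h].addr = heap + heap_top;
            table[h].size = size;
            table[h].used = 1;
            heap_top += size;
            return h;
        }
    }
    return MAX_HANDLES;
}

/* Mark a block as dead (stands in for the collector's liveness analysis). */
static void gc_release(handle_t h) { table[h].used = 0; }

/* Every access pays one extra indirection through the table. */
static void *gc_deref(handle_t h) {
    assert(h < MAX_HANDLES && table[h].used);
    return table[h].addr;
}

/* Copy live blocks into the other semispace; only table entries change. */
static void gc_compact(void) {
    unsigned char *to = (heap == heap_a) ? heap_b : heap_a;
    size_t top = 0;
    handle_t h;
    for (h = 0; h < MAX_HANDLES; h++) {
        if (table[h].used) {
            memcpy(to + top, table[h].addr, table[h].size);
            table[h].addr = to + top;
            top += table[h].size;
        }
    }
    heap = to;
    heap_top = top;
}
```

Because generated code only ever stores `handle_t` values, the stack never needs to be scanned for addresses to patch; the price is one table entry per object and one extra load per access.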

I've come up with a potential solution: predeclare the variables and group them by whether they're primitives or not (I can do that in the transpiler), and add an offset variable at the end of each function (I need to be able to find it accurately when scanning the stack area) that tells me where the non-primitive variables begin and end, so that I scan only those. This means that each function uses an additional 16 or 32 bits of memory (depending on the architecture), but this should still be more memory-efficient than the heap handle approach.

Example:

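#include <stdbool.h>

void my_func(void) {
  /* primitives, grouped first */
  int i = 5;
  int z = 3;
  bool b = false;
  /* non-primitives (GC-managed pointers), grouped last */
  void* person;
  void* person_info = ...;
  .... // logic
  /* marker telling the stack scanner where the pointer group begins and ends */
  volatile int offset = 0x034;
}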

My aim is for something that works universally across GCC compilers, thus my concerns are:

Can the compiler reorder the variables from how they're declared in the source code?
Can I force the compiler to put some data in the method's stack frame (using volatile)?
Can I find the offset accurately when scanning the stack?

I'd like to avoid assembly so that this approach works (by default) across multiple platforms, but I'm open to methods that involve assembly if they're reliable.

Answer 1:

Typing information could be somehow encoded in the C function name; this is done by C++ and other implementations and called name mangling.

Actually, since all your C code is generated, you could adopt a different convention: generate long C identifiers that are practically unique and more or less random program-wide, such as tiziw_7oa7eIzzcxv03TmmZ, and keep their typing information elsewhere (e.g. in some database). On Linux, such an approach is friendly to both libbacktrace and dlsym(3) + dladdr(3) (and of course nm(1), readelf(1) or gdb(1)), and it is used in both the bismon and RefPerSys projects.

Typing information is practically tied to calling conventions and ABIs. For example, the x86-64 ABI for Linux mandates different processor registers for passing floating points or pointers.

Read The Garbage Collection Handbook, or at least Paul Wilson's survey Uniprocessor Garbage Collection Techniques. You could decide to use tagged integers instead of boxing them, and you could decide to have a conservative GC (e.g. Boehm's GC) instead of a precise one. In my old GCC MELT project I generated C or C++ code for a generational copying GC. Similar techniques are used both in Bismon and in RefPerSys.
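To illustrate the tagged integer idea, here is a minimal sketch of low-bit tagging. The names (`value_t`, `box_int`, `unbox_int`, `box_ptr`, `unbox_ptr`) are hypothetical, and it assumes that heap objects are at least 2-byte aligned and that converting an out-of-range `uintptr_t` to `intptr_t` wraps around, which is what GCC documents for its targets.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t value_t;   /* hypothetical universal value word */

/* Low bit 1: small integer. Low bit 0: pointer to an aligned heap object. */
static int is_int(value_t v) { return (v & 1u) != 0; }

/* The integer loses one bit of range; the shift is done on the unsigned type. */
static value_t box_int(intptr_t n) { return ((uintptr_t)n << 1) | 1u; }

static intptr_t unbox_int(value_t v) {
    /* Divide instead of shifting: right-shifting a negative value is
       implementation-defined in C. Assumes wrap-around conversion. */
    return ((intptr_t)v - 1) / 2;
}

static value_t box_ptr(void *p) {
    assert(((uintptr_t)p & 1u) == 0);   /* alignment keeps the tag bit free */
    return (uintptr_t)p;
}

static void *unbox_ptr(value_t v) { return (void *)v; }
```

With this encoding a precise collector can test `is_int(v)` on any slot and skip it, so a primitive can never be mistaken for a pointer, at the cost of one bit of integer range and a few extra instructions per arithmetic operation.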

Since you are transpiling to C, consider also alternatives, such as libgccjit or LLVM. Look into libjit and asmjit.

Study also the implementation of other transpilers (compilers to C), including Chicken/Scheme and Bigloo.

Can the GCC compiler reorder the variables from how they're declared in the source code?

Yes, of course, depending on the optimizations you ask for. Some variables won't even exist in the binary (e.g. those that stay in registers).

Can I force the compiler to put some data in the method's stack frame (using volatile)?

Better generate a single struct variable containing all your language variables, and leave optimizations to the compiler. You will be surprised (see this draft report).
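As a rough sketch of the single-struct idea, the generated code for the question's `my_func` might look like the following. The frame header, the `gc_top` chain and the visitor are assumptions added here so the example can show how a collector would find and update the pointer slots; they are not taken from any of the projects mentioned, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Header placed in every generated frame struct (hypothetical layout). */
struct gc_frame {
    struct gc_frame *prev;   /* caller's frame */
    void **roots;            /* first pointer slot of this frame */
    size_t nroots;           /* number of pointer slots */
};

static struct gc_frame *gc_top = NULL;   /* innermost active frame */

typedef void (*root_visitor)(void **slot);

/* Walk every registered frame and hand each non-null pointer slot to the collector. */
static void gc_visit_roots(root_visitor visit) {
    struct gc_frame *f;
    size_t k;
    for (f = gc_top; f != NULL; f = f->prev)
        for (k = 0; k < f->nroots; k++)
            if (f->roots[k] != NULL) visit(&f->roots[k]);
}

/* All locals of my_func live in one struct; only roots[] is ever scanned. */
struct my_func_frame {
    struct gc_frame hdr;
    void *roots[2];          /* person, person_info */
    int i, z, b;             /* primitives: invisible to the collector */
};

static int old_obj, new_obj; /* stand-ins for an object before and after a move */
static int slots_seen = 0;

/* Toy collector step: pretend old_obj moved to new_obj and patch the slot. */
static void relocate(void **slot) {
    slots_seen++;
    if (*slot == (void *)&old_obj) *slot = &new_obj;
}

static int my_func(void) {
    struct my_func_frame fr;
    int moved;
    fr.hdr.prev = gc_top;
    fr.hdr.roots = fr.roots;
    fr.hdr.nroots = 2;
    fr.roots[0] = NULL;             /* person */
    fr.roots[1] = &old_obj;         /* person_info */
    fr.i = 5; fr.z = 3; fr.b = 0;
    gc_top = &fr.hdr;               /* push: the frame's address escapes here */
    gc_visit_roots(relocate);       /* stands in for a collection at a safe point */
    moved = (fr.roots[1] == (void *)&new_obj);
    gc_top = fr.hdr.prev;           /* pop before returning */
    return moved;
}
```

Because the address of `fr` is stored in a global, the compiler has to keep the struct in memory across calls and reload `fr.roots[1]` afterwards, so no `volatile` and no knowledge of the real stack layout is needed. The cost is the header words per frame and the push/pop on every call.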

Can I find the offset accurately when scanning the stack?

This is the most difficult part, and it depends a lot on compiler optimizations (e.g. whether you run gcc with -O1 or -O3 on the generated C code; in some cases a recent GCC, e.g. GCC 9 or GCC 10 on x86-64 for Linux, is capable of tail-call optimizations; check by compiling with gcc -O3 -S -fverbose-asm and then looking at the generated assembler code). If you accept a few small tricks specific to the target processor and compiler, this is doable. Study the implementation of the OCaml compiler.

Send me (to basile@starynkevitch.net) an email for discussion. Please mention the URL of your question in it.

If you want to have an efficient generational copying GC with multi-threading, things become extremely tricky. The question is then how many years of development can you afford spending.

If you have exceptions in your language, take great care as well. You could, with great caution, generate calls to longjmp.
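One reason for the caution: if the generated code keeps a linked chain of frames for the collector (an assumption of this sketch, as are all the names in it), a `longjmp` skips the code that pops those frames, so the handler has to restore the chain itself. A minimal illustration:

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical GC frame chain, reduced to the link field. */
struct gc_frame { struct gc_frame *prev; };
static struct gc_frame *gc_top = NULL;

/* One record per active try block. */
struct handler {
    jmp_buf env;
    struct gc_frame *saved_top;   /* GC chain as it was when the try began */
    struct handler *prev;
};
static struct handler *handler_top = NULL;
static int thrown_code = 0;

static void throw_exc(int code) {
    assert(handler_top != NULL);
    thrown_code = code;
    longjmp(handler_top->env, 1);
}

static void callee(void) {
    struct gc_frame fr;
    fr.prev = gc_top;
    gc_top = &fr;                 /* push */
    throw_exc(42);
    gc_top = fr.prev;             /* pop: never reached because of the throw */
}

/* Returns 0 if callee returned normally, else the thrown code. */
static int try_call(void) {
    struct handler h;
    h.saved_top = gc_top;
    h.prev = handler_top;
    handler_top = &h;
    if (setjmp(h.env) == 0) {
        callee();
        handler_top = h.prev;
        return 0;
    }
    /* Reached via longjmp: h was not modified after setjmp, so it is intact. */
    handler_top = h.prev;
    gc_top = h.saved_top;         /* drop the frames that longjmp skipped */
    return thrown_code;
}
```

Without the `gc_top = h.saved_top` line, the chain would still point into `callee`'s dead stack frame and the next collection would scan garbage. Locals that are modified between `setjmp` and `longjmp` would additionally need to be `volatile`.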

See of course this answer of mine.

With transpiling techniques, the devil is in the details.

On Linux (specifically!) see also my manydl.c program. It demonstrates that on a Linux x86-64 laptop you can, in practice, generate hundreds of thousands of dlopen(3)-ed plugins. Then read How to write shared libraries.

Study also the implementation of SBCL and of GNU Prolog, at least for inspiration.

PS. The dream of a totally architecture-neutral and operating-system independent transpiler is an illusion.

Source: https://stackoverflow.com/questions/61648498/is-it-possible-to-generate-ansi-c-functions-with-type-information-for-a-moving-g

Tags

gcc

compiler-construction

garbage-collection

c99