How to create a lightweight C code sandbox?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-11-27 02:52:55
Rutger Nijlunsing

Since the C standard is much too broad to be allowed, you would need to go the other way around: specify the minimum subset of C which you need, and try to implement that. Even ANSI C is already too complicated and allows unwanted behaviour.

The most problematic aspect of C is its pointers: the language requires pointer arithmetic, and pointer accesses are not checked. For example:

char a[100];
printf("%p %p\n", (void *)&a[10], (void *)&10[a]);

prints the same address twice, since a[10] == 10[a] == *(10 + a) == *(a + 10).

All these pointer accesses cannot be checked at compile time. That's the same complexity as asking the compiler for 'all bugs in a program' which would require solving the halting problem.

Since you want this function to be able to run in the same process (potentially in a different thread) you share memory between your application and the 'safe' module since that's the whole point of having a thread: share data for faster access. However, this also means that both threads can read and write the same memory.

And since you cannot prove at compile time where pointers end up, you have to check them at runtime. That means code like 'a[10]' has to be translated to something like 'get_byte(a + 10)', at which point I wouldn't call it C anymore.
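As a rough illustration of what such a translation could look like, here is a minimal sketch of a runtime-checked accessor. The checked_buf type and the get_byte/set_byte signatures are hypothetical (the answer only names 'get_byte'); the idea is that every pointer carries its bounds and every access goes through a check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "fat pointer": base address plus the size of the object. */
typedef struct {
    unsigned char *base;
    size_t size;
} checked_buf;

/* Read one byte; refuses to touch memory outside the buffer. */
bool get_byte(const checked_buf *b, size_t index, unsigned char *out)
{
    assert(out != NULL);               /* caller error, not sandboxed input */
    if (b == NULL || b->base == NULL || index >= b->size)
        return false;                  /* out of bounds: access denied */
    *out = b->base[index];
    return true;
}

/* Write one byte under the same bounds check. */
bool set_byte(checked_buf *b, size_t index, unsigned char value)
{
    if (b == NULL || b->base == NULL || index >= b->size)
        return false;
    b->base[index] = value;
    return true;
}
```

With this scheme, 'a[10]' on a 100-byte buffer would become a get_byte call with index 10, and an access at index 100 would be rejected instead of reading a neighbour's memory. The cost is a branch on every access, which is the overhead the answer is warning about.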

Google Native Client

So if that's true, how does Google do it then? Well, in contrast to the requirements here (cross-platform, including embedded systems), Google concentrates on x86, which has segment registers in addition to paging with page protections. This allows it to create a sandbox where another thread does not share the same memory in the same way: segmentation limits the sandbox to changing only its own memory range. Furthermore:

  • a list of safe x86 assembly constructs is assembled
  • gcc is changed to emit those safe constructs
  • this list is constructed in a way that is verifiable.
  • after loading a module, this verification is done

So this is platform specific and is not a 'simple' solution, although a working one. Read more at their research paper.

Conclusion

So whatever route you go, you need to start out with something new which is verifiable, and only then can you start adapting an existing compiler or writing a new one. However, trying to mimic ANSI C requires one to think about the pointer problem. Google modelled their sandbox not on ANSI C but on a subset of x86, which allowed them to use existing compilers to a great extent, with the disadvantage of being tied to x86.

I think you would get a lot out of reading about some of the implementation concerns and choices Google made when designing Native Client, a system for executing x86 code (safely, we hope) in the browser. You may need to do some source rewriting or source-to-source compilation to make the code safe if it's not, but you should be able to rely on the NaCl sandbox to catch your generated assembly code if it tries to do anything too funky.

If I were going to do this, I would investigate one of two approaches:

  • Use CERN's CINT to run sandboxed code in an interpreter and see about restricting what the interpreter permits. This would probably not give terribly good performance.
  • Use LLVM to create an intermediate representation of the C++ code and then see if it's feasible to run that bytecode in a sandboxed Java-style VM.

However, I agree with others that this is probably a horribly involved project. Look at the problems that web browsers have had with buggy or hung plugins destabilizing the entire browser. Or look at the release notes for the Wireshark project; almost every release, it seems, contains security fixes for problems in one of its protocol dissectors that then affect the entire program. If a C/C++ sandbox were feasible, I'd expect these projects to have latched onto one by now.

This isn't trivial, but it's not that hard.

You can run binary code in a sand box. Every operating system does this all day long.

They're going to have to use your standard library (vs a generic C lib). Your standard library will enforce whatever controls you want to impose.

Next, you'll want to ensure that they cannot create "runnable code" at run time. That is, the stack isn't executable, they can't allocate any memory that's executable, etc. That means that only the code generated by the compiler (YOUR compiler) will be executable.

If your compiler signs its executable cryptographically, your runtime will be able to detect tampered binaries, and simply not load them. This prevents them from "poking" things in to the binaries that you simply don't want them to have.

With a controlled compiler generating "safe" code, and a controlled system library, that should give a reasonably controlled sandbox, even with actual machine language code.

Want to impose memory limits? Put a check in to malloc. Want to restrict how much stack is allocated? Limit the stack segment.
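A minimal sketch of the "check in malloc" idea, assuming the sandboxed code is linked against your own library so that it can only reach these wrappers. The names sandbox_malloc, sandbox_free and SANDBOX_HEAP_LIMIT are made up for illustration, and the code is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SANDBOX_HEAP_LIMIT ((size_t)1024 * 1024)  /* hypothetical 1 MiB quota */

static size_t sandbox_heap_used = 0;

/* Header stored in front of each block so sandbox_free knows its size. */
typedef union {
    size_t size;
    max_align_t align;   /* keeps the user pointer suitably aligned */
} block_header;

void *sandbox_malloc(size_t n)
{
    /* Reject any request that would push usage past the quota. */
    if (n > SANDBOX_HEAP_LIMIT - sandbox_heap_used)
        return NULL;
    block_header *h = malloc(sizeof *h + n);
    if (h == NULL)
        return NULL;
    h->size = n;
    sandbox_heap_used += n;
    return h + 1;        /* user memory starts right after the header */
}

void sandbox_free(void *p)
{
    if (p == NULL)
        return;
    block_header *h = (block_header *)p - 1;
    assert(h->size <= sandbox_heap_used);  /* accounting must stay consistent */
    sandbox_heap_used -= h->size;
    free(h);
}
```

Note that the size header sits next to user data, so a module that can write out of bounds can corrupt the accounting. The quota is only as trustworthy as the memory-safety guarantees around it.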

Operating systems create these kinds of constrained environments using their Virtual Memory managers all day long, so you can readily do these things on modern OS's.

Whether the effort to do this is worthwhile vs using an off the shelf Virtual Machine and byte code runtime, I can't say.

I stumbled upon Tiny C Compiler (TCC). This may be what I need:

*  SMALL! You can compile and execute C code everywhere, for example on rescue disks (about 100KB for x86 TCC executable, including C preprocessor, C compiler, assembler and linker).
* FAST! tcc generates x86 code. No byte code overhead. Compile, assemble and link several times faster than GCC.
* UNLIMITED! Any C dynamic library can be used directly. TCC is heading toward full ISO C99 compliance. TCC can of course compile itself.
* SAFE! tcc includes an optional memory and bound checker. Bound checked code can be mixed freely with standard code.
* Compile and execute C source directly. No linking or assembly necessary. Full C preprocessor and GNU-like assembler included.
* C script supported : just add '#!/usr/local/bin/tcc -run' at the first line of your C source, and execute it directly from the command line.
* With libtcc, you can use TCC as a backend for dynamic code generation.

It's a very small program which makes hacking on it a viable option (hack GCC?, not in this lifetime!). I suspect it will make an excellent base to build my own restricted compiler from. I'll remove support for language features I can't make safe and wrap or replace the memory allocation and loop handling.

TCC can already do bounds checking on memory accesses, which is one of my requirements.

libtcc is also a great feature, since I can then manage code compilation internally.

I don't expect it to be easy, but it gives me hope I can get performance close to C with fewer risks.

Still want to hear other ideas though.

Perfectly impossible. The language just doesn't work this way. The concept of classes is lost very early in most compilers, including GCC. Even if it weren't, there would be no way to associate each memory allocation with a live object, let alone a "module".

I haven't investigated this very closely, but the guys working on Chromium (aka Google Chrome) are working on a sandbox almost like this already, which might be worth looking into.

http://dev.chromium.org/developers/design-documents/sandbox/Sandbox-FAQ

It's open source, so should be possible to use it.

If the language is Turing complete, it is impossible to write a static code verifier that can decide, for every possible program, whether that program is safe or unsafe. The problem is equivalent to the halting problem.

Of course this point is moot if you have supervisor code running at a lower ring level, or if the language is interpreted (i.e. the machine resources are emulated).

The best way to do this is to start the code in another process (IPC is not that bad) and trap its system calls, for example with ptrace on Linux: http://linux.die.net/man/2/ptrace

SpliFF

Liran pointed out codepad.org in a comment above. It isn't suitable because it relies on a very heavy environment (consisting of ptrace, chroot, and an outbound firewall). However, I found a few g++ safety switches there which I thought I'd share here:

gcc 4.1.2 flags: -O -fmessage-length=0 -fno-merge-constants -fstrict-aliasing -fstack-protector-all

g++ 4.1.2 flags: -O -std=c++98 -pedantic-errors -Wfatal-errors -Werror -Wall -Wextra -Wno-missing-field-initializers -Wwrite-strings -Wno-deprecated -Wno-unused -Wno-non-virtual-dtor -Wno-variadic-macros -fmessage-length=0 -ftemplate-depth-128 -fno-merge-constants -fno-nonansi-builtins -fno-gnu-keywords -fno-elide-constructors -fstrict-aliasing -fstack-protector-all -Winvalid-pch

The options are explained in the GCC manual

What really caught my eye was the stack-protector flag. I believe it is a merge of this IBM research project (Stack-Smashing Protector) with the official GCC.

The protection is realized by buffer overflow detection and the variable reordering feature to avoid the corruption of pointers. The basic idea of buffer overflow detection comes from StackGuard system.

The novel features are (1) the reordering of local variables to place buffers after pointers to avoid the corruption of pointers that could be used to further corrupt arbitrary memory locations, (2) the copying of pointers in function arguments to an area preceding local variable buffers to prevent the corruption of pointers that could be used to further corrupt arbitrary memory locations, and the (3) omission of instrumentation code from some functions to decrease the performance overhead.

Eight years later, I've discovered a new platform that meets all of my original requirements. WebAssembly allows you to run a C/C++ subset safely inside a browser and comes with safety restrictions similar to my requirements, such as restricting memory access and preventing unsafe operations on the OS and parent process. It's been implemented in Firefox 52 and there are promising signs other browsers will support it in the future.

Nice idea, but I'm fairly sure what you're trying to do is impossible with C or C++. If you dropped the sandbox idea it might work.

Java's already got a similar (as in a large library of 3rd party code) system in Maven2

If you want to be really sure, I think the best and perhaps only way to do this is to go down the line of separate processes and let the OS handle the access control. It's not that painful to write a generic threaded loader and once you have it, you can override some functions to load specific libraries.
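To make the separate-process route concrete, here is a minimal POSIX sketch. The run_isolated helper and the two sample "modules" are hypothetical names; the sketch only shows that a crash or runaway loop in the child cannot take the host down. It does not restrict file or network access, which would need extra measures such as chroot or system-call filtering.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run fn() in a child process with a CPU-time cap (in seconds).
   Returns the child's exit code, or -1 if it was killed by a signal
   or could not be started. */
int run_isolated(int (*fn)(void), unsigned cpu_seconds)
{
    assert(fn != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: the kernel delivers SIGXCPU once the CPU budget is spent. */
        struct rlimit cpu = { cpu_seconds, cpu_seconds };
        if (setrlimit(RLIMIT_CPU, &cpu) != 0)
            _exit(127);
        _exit(fn() & 0xff);
    }
    /* Parent: the host's own memory is untouched whatever the child does. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Two stand-ins for untrusted modules. */
int module_ok(void)    { return 7; }
int module_crash(void) { abort(); }   /* dies from SIGABRT inside the child */
```

Calling run_isolated(module_ok, 1) yields the module's return value, while run_isolated(module_crash, 1) reports -1 and the host keeps running. Results larger than an exit code have to come back over a pipe or shared memory, which is the IPC cost other answers mention.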

You appear to be trying to solve two non-problems. In my own code I have no memory allocation problems or issues with recursion or infinite loops.

What you seem to be proposing is a different, more limited language than C++. This is something you can pursue of course, but as others have noted, you will have to write a compiler for it; simple textual processing will not give you what you want.
