seccomp — how to EXIT_SUCCESS?

孤城傲影 2020-12-18 19:18

How do I exit with EXIT_SUCCESS after strict-mode seccomp has been set? Is it correct practice to call syscall(SYS_exit, EXIT_SUCCESS); at the end of main?

2 answers
  •  醉酒成梦
    2020-12-18 19:44

    The problem occurs because, on Linux, the GNU C library implements the _exit() function with the exit_group syscall (when it is available) instead of exit (see sysdeps/unix/sysv/linux/_exit.c for verification), and, as documented in man 2 prctl, the exit_group syscall is not allowed by the strict seccomp filter.

    Because the _exit() function call occurs inside the C library, we cannot interpose it with our own version (that would just do the exit syscall). (The normal process cleanup is done elsewhere; in Linux, the _exit() function only does the final syscall that terminates the process.)
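    To make the difference concrete, here is a minimal sketch that forks a child, puts it into strict mode, and lets it terminate either through the raw exit syscall (as the question proposes) or through glibc's _exit(). The helper name strict_child_status is made up for this illustration, and the sketch assumes a single-threaded process on a kernel and sandbox that permit SECCOMP_MODE_STRICT.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that enters strict seccomp mode and
   then terminates either via the raw exit syscall (use_raw_exit != 0)
   or via glibc's _exit(), which issues exit_group.
   Returns the child's wait status, or -1 if fork/waitpid failed. */
static int strict_child_status(int use_raw_exit)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child. If strict mode cannot be enabled, report it with a
           distinctive exit code while _exit() still works normally. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0))
            _exit(127);

        if (use_raw_exit)
            syscall(SYS_exit, EXIT_SUCCESS);  /* allowed by strict mode */

        _exit(EXIT_SUCCESS);                  /* exit_group: child is killed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

    With this helper, the parent should see a normal exit with status EXIT_SUCCESS for strict_child_status(1), and a child killed by SIGKILL for strict_child_status(0). Note that the exit syscall terminates only the calling thread, so the raw call ends the process only when no other threads exist.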

    We could ask the GNU C library developers to use the exit_group syscall in Linux only when there is more than one thread in the current process, but unfortunately that would not be easy, and even if it were added right now, it would take quite some time for the feature to become available on most Linux distributions.

    Fortunately, we can ditch the default strict filter, and instead define our own. There is a small difference in behaviour: the apparent signal that kills the process will change from SIGKILL to SIGSYS. (The signal is not actually delivered, as the kernel does kill the process; only the apparent signal number that caused the process to die changes.)

    Furthermore, this is not even that difficult. I did waste a bit of time looking into some GCC macro trickery that would make it trivial to manage the allowed syscalls' list, but I decided it would not be a good approach: the list of allowed syscalls should be carefully considered -- we only add exit_group() compared to the strict filter, here! -- so making it a bit difficult is okay.

    The following code, say example.c, has been verified to work on a 4.4 kernel (should work on kernels 3.5 or later) on x86-64 (for both x86 and x86-64, i.e. 32-bit and 64-bit binaries). It should work on all Linux architectures, however, and it does not require or use the libseccomp library.

    #define  _GNU_SOURCE
    #include <stddef.h>
    #include <stdlib.h>
    #include <linux/seccomp.h>
    #include <linux/filter.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    
    static const struct sock_filter  strict_filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof (struct seccomp_data, nr))),
    
        BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_rt_sigreturn, 5, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_read,         4, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_write,        3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit,         2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit_group,   1, 0),
    
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
    };
    
    static const struct sock_fprog  strict = {
        .len = (unsigned short)( sizeof strict_filter / sizeof strict_filter[0] ),
        .filter = (struct sock_filter *)strict_filter
    };
    
    int main(void)
    {
        /* To be able to set a custom filter, we need to set the "no new privs" flag.
           The Documentation/prctl/no_new_privs.txt file in the Linux kernel
           recommends this exact form: */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
            fprintf(stderr, "Cannot set no_new_privs: %m.\n");
            return EXIT_FAILURE;
        }
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &strict)) {
            fprintf(stderr, "Cannot install seccomp filter: %m.\n");
            return EXIT_FAILURE;
        }
    
        /* The seccomp filter is now active.
           It differs from SECCOMP_SET_MODE_STRICT in two ways:
             1. exit_group syscall is allowed; it just terminates the
                process
             2. Parent/reaper sees SIGSYS as the killing signal instead of
                SIGKILL, if the process tries to do a syscall not in the
                explicitly allowed list
        */
    
        return EXIT_SUCCESS;
    }
    

    Compile using e.g.

    gcc -Wall -O2 example.c -o example
    

    and run using

    ./example
    

    or under strace to see the syscalls it makes:

    strace ./example
    

    The strict_filter BPF program is really trivial. The first opcode loads the syscall number into the accumulator. The next five opcodes compare it to an acceptable syscall number, and if found, jump to the final opcode that allows the syscall. Otherwise the second-to-last opcode kills the process.

    Note that although the documentation refers to sigreturn being the allowed syscall, the actual name of the syscall in Linux is rt_sigreturn. (sigreturn was deprecated in favour of rt_sigreturn ages ago.)

    Furthermore, when the filter is installed, the opcodes are copied to kernel memory (see kernel/seccomp.c in the Linux kernel sources), so it does not affect the filter in any way if the data is modified later. Having the structures static const has zero security impact, in other words.

    I used static since there is no need for the symbols to be visible outside this compilation unit (or in a stripped binary), and const to put the data into the read-only data section of the ELF binary.

    The form of a BPF_JUMP(BPF_JMP | BPF_JEQ, nr, equals, differs) is simple: the accumulator (the syscall number) is compared to nr. If they are equal, then the next equals opcodes are skipped. Otherwise, the next differs opcodes are skipped.
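    The skip semantics can be checked in user space without installing anything. The following is only a sketch: a toy interpreter that understands just the three opcode classes used in strict_filter (absolute word load, jump-if-equal, return), not a reimplementation of the kernel's BPF engine. The names run_filter and verdict_for are invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>

/* Same program as in example.c. */
static const struct sock_filter strict_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof (struct seccomp_data, nr))),

    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_rt_sigreturn, 5, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_read,         4, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_write,        3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit,         2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit_group,   1, 0),

    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};

/* Toy interpreter (illustration only): walks the opcodes the way the
   text describes. A taken or untaken jump skips jt or jf opcodes, and
   the loop increment then moves on to the following one. */
static uint32_t run_filter(const struct sock_filter *prog, size_t len,
                           const struct seccomp_data *data)
{
    uint32_t acc = 0;

    assert(len > 0);
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *op = &prog[pc];

        switch (BPF_CLASS(op->code)) {
        case BPF_LD:   /* only BPF_W | BPF_ABS is handled here */
            memcpy(&acc, (const char *)data + op->k, sizeof acc);
            break;
        case BPF_JMP:  /* only BPF_JEQ against a constant is handled */
            pc += (acc == op->k) ? op->jt : op->jf;
            break;
        case BPF_RET:
            return op->k;
        default:       /* anything else: treat as a kill */
            return SECCOMP_RET_KILL;
        }
    }
    return SECCOMP_RET_KILL;
}

/* Verdict the filter would give for syscall number nr. */
static uint32_t verdict_for(int nr)
{
    struct seccomp_data data;

    memset(&data, 0, sizeof data);
    data.nr = nr;
    return run_filter(strict_filter,
                      sizeof strict_filter / sizeof strict_filter[0],
                      &data);
}
```

    For instance, verdict_for(SYS_write) lands on the final opcode and yields SECCOMP_RET_ALLOW, while a syscall that is not listed, such as SYS_getpid, falls through all five comparisons and yields SECCOMP_RET_KILL.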

    Since the equals cases jump to the very final opcode, you can add new opcodes at the top (that is, just after the initial opcode), incrementing the equals skip count for each one.
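    As a sketch of that recipe, the filter below adds one more allowed syscall at the top. SYS_getpid is an arbitrary choice made only for illustration, and filtered_child_status is a made-up helper that installs the filter in a forked child so the effect can be observed from the parent. Like example.c, it does not check the architecture field of seccomp_data.

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static const struct sock_filter extended_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof (struct seccomp_data, nr))),

    /* New entry, placed first: six opcodes now sit between it and the
       final ALLOW, so its "equals" skip count is 6. The existing
       entries keep their counts. */
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_getpid,       6, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_rt_sigreturn, 5, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_read,         4, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_write,        3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit,         2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit_group,   1, 0),

    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};

static const struct sock_fprog extended = {
    .len = (unsigned short)( sizeof extended_filter / sizeof extended_filter[0] ),
    .filter = (struct sock_filter *)extended_filter
};

/* Hypothetical helper: fork a child, install the filter there, issue
   one argument-less probe syscall, then exit normally.
   Returns the child's wait status, or -1 if fork/waitpid failed. */
static int filtered_child_status(long probe_syscall)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &extended))
            _exit(127);            /* filter could not be installed */

        syscall(probe_syscall);    /* killed here if not in the list */
        _exit(EXIT_SUCCESS);       /* exit_group is allowed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

    Calling filtered_child_status(SYS_getpid) should report a normal exit with status EXIT_SUCCESS, whereas a probe that is not in the list, for example SYS_getppid, should report the child as killed by SIGSYS.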

    Note that printf() will not work after the seccomp filter is installed, because internally the C library wants to do an fstat syscall (on standard output) and a brk syscall to allocate some memory for a buffer.
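    Since write is on the allowed list, one way to produce output after the filter is active is to call write() directly on a buffer you already have. The sketch below assumes that approach; write_all and say are hypothetical helper names, not part of any library.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write all len bytes of buf to fd using only the write syscall.
   Returns 0 on success, or -1 with errno set on failure. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            return -1;
        }
        buf += n;                  /* short write: continue with the rest */
        len -= (size_t)n;
    }
    return 0;
}

/* Convenience wrapper for NUL-terminated strings. strlen() runs
   entirely in user space, so it makes no syscalls of its own. */
static int say(int fd, const char *msg)
{
    return write_all(fd, msg, strlen(msg));
}
```

    For example, say(STDOUT_FILENO, "done\n") could replace a printf("done\n") call placed after the prctl() calls in main.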
