Why does realloc() deadlock after the clone() syscall?

Question

realloc() sometimes deadlocks after I call the clone() syscall.

My code is:

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/types.h>
#define CHILD_STACK_SIZE 4096*4
#define gettid() syscall(SYS_gettid)
#define log(str) fprintf(stderr, "[pid:%d tid:%d] "str, getpid(),gettid())

int clone_func(void *arg){
    int *ptr=(int*)malloc(10);
    int i;
    for (i=1; i<200000; i++)
        ptr = realloc(ptr, sizeof(int)*i);
    free(ptr);
    return 0;
}

int main(){
    int flags = 0;
    flags = CLONE_VM;
    log("Program started.\n");
    int *ptr=NULL;
    ptr = malloc(16);
    void *child_stack_start = malloc(CHILD_STACK_SIZE);
    int ret = clone(clone_func, child_stack_start +CHILD_STACK_SIZE, flags, NULL, NULL, NULL, NULL);
    int i;
    for (i=1; i<200000; i++)
        ptr = realloc(ptr, sizeof(int)*i);

    free(ptr);
    return 0;
}

The call stack in gdb is:

[pid:13268 tid:13268] Program started.
^Z[New LWP 13269]

Program received signal SIGTSTP, Stopped (user).
0x000000000040ba0e in __lll_lock_wait_private ()
(gdb) bt
#0  0x000000000040ba0e in __lll_lock_wait_private ()
#1  0x0000000000408630 in _L_lock_11249 ()
#2  0x000000000040797f in realloc ()
#3  0x0000000000400515 in main () at test-realloc.c:36
(gdb) i thr
  2 LWP 13269  0x000000000040ba0e in __lll_lock_wait_private ()
* 1 LWP 13268  0x000000000040ba0e in __lll_lock_wait_private ()
(gdb) thr 2
[Switching to thread 2 (LWP 13269)]#0  0x000000000040ba0e in __lll_lock_wait_private ()
(gdb) bt
#0  0x000000000040ba0e in __lll_lock_wait_private ()
#1  0x0000000000408630 in _L_lock_11249 ()
#2  0x000000000040797f in realloc ()
#3  0x0000000000400413 in clone_func (arg=0x7fffffffe53c) at test-realloc.c:20
#4  0x000000000040b889 in clone ()
#5  0x0000000000000000 in ?? ()

My OS is Debian (kernel linux-2.6.32-5-amd64) with the GNU C Library (Debian EGLIBC 2.11.3-4), stable release version 2.11.3. I strongly suspect that eglibc is the culprit for this bug. Is the clone() syscall alone not enough setup before using realloc()?

Answer 1:

You cannot use clone with CLONE_VM yourself -- or if you do, you must at least refrain from calling any standard library function after clone, in both the parent and the child. For multiple threads or processes to share the same memory, the implementation of any function that accesses a shared resource (like the heap) needs to:

1. be aware that multiple flows of control may be accessing that resource, so that it can perform the appropriate synchronization, and
2. be able to obtain information about its own identity via the thread pointer, which is usually stored in a special machine register.

This is completely implementation-internal, so you cannot arrange for a new "thread" that you create yourself via clone to have a properly set up thread pointer.

The proper solution is to use pthread_create, not clone.
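
As a rough illustration of that advice, here is a minimal sketch of the question's test rewritten around pthread_create. The names realloc_worker, run_realloc_threads, and ITERATIONS are made up for this example and do not come from the question; the sketch assumes a POSIX system and a build with something like gcc -pthread.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 200000   /* same loop bound as the question's code */

/* Thread body: repeatedly grow a heap block, like clone_func in the question.
 * Returns NULL on success and a non-NULL value if an allocation fails. */
static void *realloc_worker(void *arg)
{
    (void)arg;
    int *ptr = NULL;
    for (int i = 1; i < ITERATIONS; i++) {
        int *tmp = realloc(ptr, sizeof(int) * i);
        if (tmp == NULL) {
            free(ptr);          /* the old block is still ours on failure */
            return (void *)1;
        }
        ptr = tmp;
    }
    free(ptr);
    return NULL;
}

/* Hypothetical helper: run the loop in a new thread and in the calling
 * thread at the same time. pthread_create lets the C library set up the
 * new thread's stack and thread pointer itself. Returns 0 on success. */
int run_realloc_threads(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, realloc_worker, NULL) != 0)
        return -1;

    void *main_result = realloc_worker(NULL);

    void *child_result = NULL;
    if (pthread_join(tid, &child_result) != 0)
        return -1;

    return (main_result == NULL && child_result == NULL) ? 0 : -1;
}
```

Calling run_realloc_threads() from main() stands in for the clone() call plus the loop in the original program. Because both threads were created by the library, its heap locking knows about each of them.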

Answer 2:

You cannot do this:

for (i=0; i<200000; i++)
        ptr = realloc(ptr, sizeof(int)*i);
free(ptr);

The first time through this loop, i is zero. In glibc, realloc(ptr, 0) is equivalent to free(ptr), and you cannot free the same block twice.
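
One way to keep a zero size away from realloc() is to wrap the call in a small helper. This is only a sketch: resize_ints is a hypothetical name, not something from the question or from the C library.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: resize *block so that it can hold count ints.
 * Returns 0 on success. Returns -1 and leaves *block untouched if
 * count is zero or if the allocation fails. */
int resize_ints(int **block, size_t count)
{
    if (count == 0)
        return -1;      /* never hand a zero size to realloc() */

    int *tmp = realloc(*block, count * sizeof(int));
    if (tmp == NULL)
        return -1;      /* the old block is still valid and still owned by the caller */

    *block = tmp;
    return 0;
}
```

With a helper like this, the caller's pointer is only overwritten after a successful resize, so a single free() at the end of the loop stays correct.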

Answer 3:

I added the CLONE_SETTLS flag to the clone() call, and the deadlock went away. So I think eglibc's realloc() uses some TLS data. When the new thread is created without its own TLS, some locks (in TLS) are shared between this thread and its parent, and realloc() gets stuck on those locks. So if you want to use clone() directly, the best approach is to allocate a new TLS for the new thread.

The code snippet looks like this:

flags = CLONE_VM | CLONE_SETTLS;
struct user_desc* p_tls_desc = malloc(sizeof(struct user_desc));
clone(clone_func, child_stack_start +CHILD_STACK_SIZE, flags, NULL, NULL, p_tls_desc, NULL);

Source: https://stackoverflow.com/questions/13736088/why-realloc-deadlock-after-clone-syscall

Tags

clone

deadlock

realloc