How to use PTRACE to get a consistent view of multiple threads?

温柔的废话 2020-12-04 10:50

While I was working on this question, I've come across a possible idea that uses ptrace, but I'm unable to get a proper understanding of how ptrace

4 answers
  •  广开言路
    2020-12-04 10:54

    Can I attach to a specific thread?

    Yes, at least on current kernels.

    Does that mean that single-stepping only steps through that one thread's instructions? Does it stop all the process's threads?

    Yes to the first question: single-stepping advances only the attached thread. No to the second: attaching does not stop the other threads, only the attached one.
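    A minimal sketch of attaching to one thread and single-stepping it, assuming Linux and a TID taken from /proc/PID/task/. The helper names attach_thread() and step_thread() are made up for this illustration; they are not part of ptrace or of the test program further down.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: attach to one thread and wait until it has stopped.
   __WALL lets waitpid() also report threads that are not the group leader.
   Returns 0 on success, -1 on failure. */
int attach_thread(pid_t tid)
{
    int status;

    if (ptrace(PTRACE_ATTACH, tid, (void *)0, (void *)0) == -1)
        return -1;

    while (waitpid(tid, &status, __WALL) == -1)
        if (errno != EINTR)
            return -1;

    return WIFSTOPPED(status) ? 0 : -1;
}

/* Hypothetical helper: let an attached, stopped thread execute one
   instruction. Returns the signal reported for the next stop
   (normally SIGTRAP), or -1 on failure. */
int step_thread(pid_t tid)
{
    int status;

    if (ptrace(PTRACE_SINGLESTEP, tid, (void *)0, (void *)0) == -1)
        return -1;

    while (waitpid(tid, &status, __WALL) == -1)
        if (errno != EINTR)
            return -1;

    return WIFSTOPPED(status) ? WSTOPSIG(status) : -1;
}
```

    Only the thread named by tid is stopped and stepped here; the other threads of the process keep running unless they are stopped separately.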

    Is there a way to step forward only in one single thread but guarantee that the other threads remain stopped?

    Yes. Send SIGSTOP to the process (use waitpid(PID,,WUNTRACED) to wait for the process to be stopped), then PTRACE_ATTACH to every thread in the process. Send SIGCONT (using waitpid(PID,,WCONTINUED) to wait for the process to continue).

    Since all threads were stopped when you attached, and attaching stops the thread, all threads stay stopped after the SIGCONT signal is delivered. You can single-step the threads in any order you prefer.
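    The sequence above could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: stop_attach_continue() is a hypothetical name, the target must be a child of the caller (otherwise waitpid() cannot report the stop), and the TID list is assumed to have been read from /proc/PID/task/ beforehand. The sketch collects each thread's stop report after attaching and does not wait for the continue notification.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: stop child process `pid`, attach to every thread
   listed in tids[0..count-1], then send SIGCONT to the process.
   Returns 0 on success, -1 with errno set on failure. */
int stop_attach_continue(pid_t pid, const pid_t *tids, size_t count)
{
    int    status;
    size_t i;

    if (kill(pid, SIGSTOP) == -1)
        return -1;

    /* Wait until the process is reported as stopped. */
    while (waitpid(pid, &status, WUNTRACED) == -1)
        if (errno != EINTR)
            return -1;
    if (!WIFSTOPPED(status)) {
        errno = ECHILD;
        return -1;
    }

    for (i = 0; i < count; i++) {
        /* Sketch only: on failure, threads attached so far stay attached. */
        if (ptrace(PTRACE_ATTACH, tids[i], (void *)0, (void *)0) == -1)
            return -1;

        /* Collect the stop report for this thread. */
        while (waitpid(tids[i], &status, __WALL) == -1)
            if (errno != EINTR)
                return -1;
    }

    /* End the process-wide stop; the traced threads remain stopped. */
    return kill(pid, SIGCONT);
}
```

    After this returns successfully, each listed thread can be examined or single-stepped individually, and PTRACE_DETACH lets it run again.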


    I found this interesting enough to whip up a test case. (Okay, actually I suspect nobody will take my word for it anyway, so I decided it's better to show proof you can duplicate on your own instead.)

    My system seems to follow the man 2 ptrace as described in the Linux man-pages project, and Kerrisk seems to be pretty good at maintaining them in sync with kernel behaviour. In general, I much prefer kernel.org sources wrt. the Linux kernel to other sources.

    Summary:

    • Attaching to the process itself (TID==PID) stops only the original thread, not all threads.

    • Attaching to a specific thread (using TIDs from /proc/PID/task/) does stop that thread. (In other words, the thread with TID == PID is not special.)

    • Sending a SIGSTOP to the process will stop all threads, but ptrace() still works absolutely fine.

    • If you sent a SIGSTOP to the process, do not call ptrace(PTRACE_CONT, TID) before detaching. PTRACE_CONT seems to interfere with the SIGCONT signal.

      You can first send a SIGSTOP, then PTRACE_ATTACH, then send SIGCONT, without any issues; the thread will stay stopped (due to the ptrace). In other words, PTRACE_ATTACH and PTRACE_DETACH mix well with SIGSTOP and SIGCONT, without any side effects I could see.

    • SIGSTOP and SIGCONT affect the entire process, even if you try using tgkill() (or pthread_kill()) to send the signal to a specific thread.

    • To stop and continue a specific thread, PTRACE_ATTACH to it; to stop and continue all threads of a process, send SIGSTOP and SIGCONT signals to the process, respectively.
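    One way to check which threads are actually stopped at any point is to read the state letter from /proc/PID/task/TID/stat. A minimal sketch, assuming a Linux system with /proc mounted; task_state() is a hypothetical helper name used only for this illustration.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the state letter of thread `tid` of process
   `pid`, as listed in /proc/PID/task/TID/stat, or 0 on error.
   Per proc(5): 'R' running, 'S' sleeping, 'T' stopped by a signal,
   't' tracing stop. */
char task_state(pid_t pid, pid_t tid)
{
    char  path[64], line[512], *p;
    FILE *in;

    snprintf(path, sizeof path, "/proc/%d/task/%d/stat", (int)pid, (int)tid);

    in = fopen(path, "r");
    if (!in)
        return 0;
    p = fgets(line, sizeof line, in);
    fclose(in);
    if (!p)
        return 0;

    /* The command name is enclosed in parentheses and may itself contain
       spaces or ')', so find the last ')' and take the field after it. */
    p = strrchr(line, ')');
    if (!p || p[1] != ' ' || !p[2])
        return 0;

    return p[2];
}
```

    Calling this for every TID before and after sending a signal or attaching shows the per-thread effect without having to compare counters.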

    Personally, I believe this validates the approach I suggested in that other question.

    Here is the ugly test code you can compile and run to test it for yourself, traces.c:

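    #define  _GNU_SOURCE
    #include <unistd.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/ptrace.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <dirent.h>
    #include <pthread.h>
    #include <string.h>
    #include <signal.h>
    #include <errno.h>
    #include <stdio.h>
    
    #ifndef   THREADS
    #define   THREADS  3
    #endif
    
    /* glibc 2.30 and later declare tgkill() in <signal.h>. Use a private
       wrapper around the raw system call instead, so that this builds with
       older and newer C libraries alike. */
    static int raw_tgkill(int tgid, int tid, int sig)
    {
        /* syscall() already returns -1 with errno set on failure. */
        return (int)syscall(SYS_tgkill, tgid, tid, sig);
    }
    #define tgkill(tgid, tid, sig) raw_tgkill((tgid), (tid), (sig))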
    
    volatile unsigned long counter[THREADS + 1] = { 0UL };
    
    volatile sig_atomic_t run = 0;
    volatile sig_atomic_t done = 0;
    
    void handle_done(int signum)
    {
        done = signum;
    }
    
    int install_done(int signum)
    {
        struct sigaction act;
        sigemptyset(&act.sa_mask);
        act.sa_handler = handle_done;
        act.sa_flags = 0;
        if (sigaction(signum, &act, NULL))
            return errno;
        return 0;
    }
    
    void *worker(void *data)
    {
        volatile unsigned long *const counter = data;
    
        while (!run)
            ;
    
        while (!done)
            (*counter)++;
    
        return (void *)(*counter);
    }
    
    pid_t *gettids(const pid_t pid, size_t *const countptr)
    {
        char           dirbuf[128];
        DIR           *dir;
        struct dirent *ent;
    
        pid_t         *data = NULL, *temp;
        size_t         size = 0;
        size_t         used = 0;
    
        int            tid;
        char           dummy;
    
        if ((int)pid < 2) {
            errno = EINVAL;
            return NULL;
        }
    
        if (snprintf(dirbuf, sizeof dirbuf, "/proc/%d/task/", (int)pid) >= (int)sizeof dirbuf) {
            errno = ENAMETOOLONG;
            return NULL;
        }
    
        dir = opendir(dirbuf);
        if (!dir)
            return NULL;
    
        while (1) {
            errno = 0;
            ent = readdir(dir);
            if (!ent)
                break;
    
            if (sscanf(ent->d_name, "%d%c", &tid, &dummy) != 1)
                continue;
    
            if (tid < 2)
                continue;
    
            if (used >= size) {
                size = (used | 127) + 129;
                temp = realloc(data, size * sizeof data[0]);
                if (!temp) {
                    free(data);
                    closedir(dir);
                    errno = ENOMEM;
                    return NULL;
                }
                data = temp;
            }
    
            data[used++] = (pid_t)tid;
        }
        if (errno) {
            free(data);
            closedir(dir);
            errno = EIO;
            return NULL;
        }
        if (closedir(dir)) {
            free(data);
            errno = EIO;
            return NULL;
        }
    
        if (used < 1) {
            free(data);
            errno = ENOENT;
            return NULL;
        }
    
        size = used + 1;
        temp = realloc(data, size * sizeof data[0]);
        if (!temp) {
            free(data);
            errno = ENOMEM;
            return NULL;
        }
        data = temp;
    
        data[used] = (pid_t)0;
    
        if (countptr)
            *countptr = used;
    
        errno = 0;
        return data;
    }
    
    int child_main(void)
    {
        pthread_t   id[THREADS];
        int         i;
    
        if (install_done(SIGUSR1)) {
            fprintf(stderr, "Cannot set SIGUSR1 signal handler.\n");
            return 1;
        }
    
        for (i = 0; i < THREADS; i++) {
            /* pthread_create() returns the error number; it does not set errno. */
            const int err = pthread_create(&id[i], NULL, worker, (void *)&counter[i]);
            if (err) {
                fprintf(stderr, "Cannot create thread %d of %d: %s.\n", i + 1, THREADS, strerror(err));
                return 1;
            }
        }
    
        run = 1;
    
        kill(getppid(), SIGUSR1);
    
        while (!done)
            counter[THREADS]++;
    
        for (i = 0; i < THREADS; i++)
            pthread_join(id[i], NULL);
    
        printf("Final counters:\n");
        for (i = 0; i < THREADS; i++)
            printf("\tThread %d: %lu\n", i + 1, counter[i]);
        printf("\tMain thread: %lu\n", counter[THREADS]);
    
        return 0;
    }
    
    int main(void)
    {
        pid_t   *tid = NULL;
        size_t   tids = 0;
        int      i, k;
        pid_t    child, p;
    
        if (install_done(SIGUSR1)) {
            fprintf(stderr, "Cannot set SIGUSR1 signal handler.\n");
            return 1;
        }
    
        child = fork();
        if (!child)
            return child_main();
    
        if (child == (pid_t)-1) {
            fprintf(stderr, "Cannot fork.\n");
            return 1;
        }
    
        while (!done)
            usleep(1000);
    
        tid = gettids(child, &tids);
        if (!tid) {
            fprintf(stderr, "gettids(): %s.\n", strerror(errno));
            kill(child, SIGUSR1);
            return 1;
        }
    
        fprintf(stderr, "Child process %d has %d tasks.\n", (int)child, (int)tids);
        fflush(stderr);
    
        for (k = 0; k < (int)tids; k++) {
            const pid_t t = tid[k];
    
            if (ptrace(PTRACE_ATTACH, t, (void *)0L, (void *)0L)) {
                fprintf(stderr, "Cannot attach to TID %d: %s.\n", (int)t, strerror(errno));
                kill(child, SIGUSR1);
                return 1;
            }
    
            fprintf(stderr, "Attached to TID %d.\n\n", (int)t);
    
            fprintf(stderr, "Peeking the counters in the child process:\n");
            for (i = 0; i <= THREADS; i++) {
                long v;
                do {
                    errno = 0;
                    v = ptrace(PTRACE_PEEKDATA, t, &counter[i], NULL);
                } while (v == -1L && (errno == EIO || errno == EFAULT || errno == ESRCH));
                fprintf(stderr, "\tcounter[%d] = %lu\n", i, (unsigned long)v);
            }
            fprintf(stderr, "Waiting a short moment ... ");
            fflush(stderr);
    
            usleep(250000);
    
            fprintf(stderr, "and another peek:\n");
            for (i = 0; i <= THREADS; i++) {
                long v;
                do {
                    errno = 0;
                    v = ptrace(PTRACE_PEEKDATA, t, &counter[i], NULL);
                } while (v == -1L && (errno == EIO || errno == EFAULT || errno == ESRCH));
                fprintf(stderr, "\tcounter[%d] = %lu\n", i, (unsigned long)v);
            }
            fprintf(stderr, "\n");
            fflush(stderr);
    
            usleep(250000);
    
            ptrace(PTRACE_DETACH, t, (void *)0L, (void *)0L);
        }
    
        for (k = 0; k < 4; k++) {
            const pid_t t = tid[tids / 2];
    
            if (k == 0) {
                fprintf(stderr, "Sending SIGSTOP to child process ... ");
                fflush(stderr);
                kill(child, SIGSTOP);
            } else
            if (k == 1) {
                fprintf(stderr, "Sending SIGCONT to child process ... ");
                fflush(stderr);
                kill(child, SIGCONT);
            } else
            if (k == 2) {
                fprintf(stderr, "Sending SIGSTOP to TID %d ... ", (int)tid[0]);
                fflush(stderr);
                tgkill(child, tid[0], SIGSTOP);
            } else
            if (k == 3) {
                fprintf(stderr, "Sending SIGCONT to TID %d ... ", (int)tid[0]);
                fflush(stderr);
                tgkill(child, tid[0], SIGCONT);
            }
            usleep(250000);
            fprintf(stderr, "done.\n");
            fflush(stderr);
    
            if (ptrace(PTRACE_ATTACH, t, (void *)0L, (void *)0L)) {
                fprintf(stderr, "Cannot attach to TID %d: %s.\n", (int)t, strerror(errno));
                kill(child, SIGUSR1);
                return 1;
            }
    
            fprintf(stderr, "Attached to TID %d.\n\n", (int)t);
    
            fprintf(stderr, "Peeking the counters in the child process:\n");
            for (i = 0; i <= THREADS; i++) {
                long v;
                do {
                    errno = 0;
                    v = ptrace(PTRACE_PEEKDATA, t, &counter[i], NULL);
                } while (v == -1L && (errno == EIO || errno == EFAULT || errno == ESRCH));
                fprintf(stderr, "\tcounter[%d] = %lu\n", i, (unsigned long)v);
            }
            fprintf(stderr, "Waiting a short moment ... ");
            fflush(stderr);
    
            usleep(250000);
    
            fprintf(stderr, "and another peek:\n");
            for (i = 0; i <= THREADS; i++) {
                long v;
                do {
                    errno = 0;
                    v = ptrace(PTRACE_PEEKDATA, t, &counter[i], NULL);
                } while (v == -1L && (errno == EIO || errno == EFAULT || errno == ESRCH));
                fprintf(stderr, "\tcounter[%d] = %lu\n", i, (unsigned long)v);
            }
            fprintf(stderr, "\n");
            fflush(stderr);
    
            usleep(250000);
    
            ptrace(PTRACE_DETACH, t, (void *)0L, (void *)0L);
        }
    
        kill(child, SIGUSR1);
    
        do {
            p = waitpid(child, NULL, 0);
            if (p == -1 && errno != EINTR)
                break;
        } while (p != child);
    
        return 0;
    }
    

    Compile and run using e.g.

    gcc -DTHREADS=3 -W -Wall -O3 traces.c -pthread -o traces
    ./traces
    

    The output is a dump of the child process counters (each one incremented in a separate thread, including the original thread which uses the final counter). Compare the counters across the short wait. For example:

    Child process 18514 has 4 tasks.
    Attached to TID 18514.
    
    Peeking the counters in the child process:
        counter[0] = 0
        counter[1] = 0
        counter[2] = 0
        counter[3] = 0
    Waiting a short moment ... and another peek:
        counter[0] = 18771865
        counter[1] = 6435067
        counter[2] = 54247679
        counter[3] = 0
    

    As you can see above, only the initial thread (whose TID == PID), which uses the final counter, is stopped. The same happens for the other three threads, too, which use the first three counters in order:

    Attached to TID 18515.
    
    Peeking the counters in the child process:
        counter[0] = 25385151
        counter[1] = 13459822
        counter[2] = 103763861
        counter[3] = 560872
    Waiting a short moment ... and another peek:
        counter[0] = 25385151
        counter[1] = 69116275
        counter[2] = 120500164
        counter[3] = 9027691
    
    Attached to TID 18516.
    
    Peeking the counters in the child process:
        counter[0] = 25397582
        counter[1] = 105905400
        counter[2] = 155895025
        counter[3] = 17306682
    Waiting a short moment ... and another peek:
        counter[0] = 32358651
        counter[1] = 105905400
        counter[2] = 199601078
        counter[3] = 25023231
    
    Attached to TID 18517.
    
    Peeking the counters in the child process:
        counter[0] = 40600813
        counter[1] = 111675002
        counter[2] = 235428637
        counter[3] = 32298929
    Waiting a short moment ... and another peek:
        counter[0] = 48727731
        counter[1] = 143870702
        counter[2] = 235428637
        counter[3] = 39966259
    

    The next two cases examine SIGSTOP and SIGCONT sent to the entire process:

    Sending SIGSTOP to child process ... done.
    Attached to TID 18516.
    
    Peeking the counters in the child process:
        counter[0] = 56887263
        counter[1] = 170646440
        counter[2] = 235452621
        counter[3] = 48077803
    Waiting a short moment ... and another peek:
        counter[0] = 56887263
        counter[1] = 170646440
        counter[2] = 235452621
        counter[3] = 48077803
    
    Sending SIGCONT to child process ... done.
    Attached to TID 18516.
    
    Peeking the counters in the child process:
        counter[0] = 64536344
        counter[1] = 182359343
        counter[2] = 253660731
        counter[3] = 56422231
    Waiting a short moment ... and another peek:
        counter[0] = 72029244
        counter[1] = 182359343
        counter[2] = 288014365
        counter[3] = 63797618
    

    As you can see, sending SIGSTOP stops all threads, but does not interfere with ptrace(). Similarly, after SIGCONT, the threads continue running as normal.

    The final two cases examine the effects of using tgkill() to send SIGSTOP/SIGCONT to a specific thread (tid[0]; in the run shown here that is the initial thread, TID 18514, which uses the final counter), while attaching to another thread:

    Sending SIGSTOP to TID 18514 ... done.
    Attached to TID 18516.
    
    Peeking the counters in the child process:
        counter[0] = 77012930
        counter[1] = 183059526
        counter[2] = 344043770
        counter[3] = 71120227
    Waiting a short moment ... and another peek:
        counter[0] = 77012930
        counter[1] = 183059526
        counter[2] = 344043770
        counter[3] = 71120227
    
    Sending SIGCONT to TID 18514 ... done.
    Attached to TID 18516.
    
    Peeking the counters in the child process:
        counter[0] = 88082419
        counter[1] = 194059048
        counter[2] = 359342314
        counter[3] = 84887463
    Waiting a short moment ... and another peek:
        counter[0] = 100420161
        counter[1] = 194059048
        counter[2] = 392540525
        counter[3] = 111770366
    

    Unfortunately, but as expected, the disposition (stopped/running) is process-wide, not thread-specific, as you can see above. This means that to stop specific threads and let the other threads run normally, you need to separately PTRACE_ATTACH to each thread you wish to stop.

    To prove all my statements above, you may have to add test cases; I ended up having quite a few copies of the code, all slightly edited, to test it all, and I'm not sure I picked the most complete set. I'd be happy to expand the test program, if you find omissions.
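    For additional test cases, a small helper that reads one word from a stopped tracee keeps each check short. A minimal sketch, assuming the thread is already attached and stopped; peek_word() is a hypothetical name and not part of traces.c, which retries the peek in a loop instead.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: read one word at address `addr` in the attached,
   stopped thread `tid`. PTRACE_PEEKDATA returns the word itself, so -1 can
   be a valid value; errno is cleared first and checked afterwards.
   Returns 0 on success, -1 with errno set on failure. */
int peek_word(pid_t tid, void *addr, long *value)
{
    long v;

    errno = 0;
    v = ptrace(PTRACE_PEEKDATA, tid, addr, (void *)0);
    if (v == -1L && errno != 0)
        return -1;

    *value = v;
    return 0;
}
```

    Because the child is created with fork(), a global such as counter[i] has the same address in both processes, which is why the tracer can pass its own &counter[i] as the address to read.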

    Questions?
