Starting from fork(...)

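The process is the basic unit of a running program, and every program, however exotic or odd, traces back to a fork. After Adam and Eve, humanity multiplied and society appeared; after fork multiplied, we got Windows, or Linux, or the iPhone 5 in your hand, with dual SIM, a big screen, extra-long standby, and the flashy stock ringtone "Aiqing Maimai" (爱情买卖).

fork is not an ordinary C function but a system call. C is usually a user-level language, fine for things like simple arithmetic. Harder problems, such as requesting a block of memory or starting multiple processes, are not something C can do on its own, and you probably would not know how to write such a function anyway. Each operating system has its own standards and its own API, so forking a process cannot be one identical piece of code everywhere. When C cannot do something itself, it knows its limits and asks the system (the kernel) to handle it; the kernel does the work and returns the result to C. That is how the two cooperate.

Creating a process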

#include <unistd.h>

pid_t fork(void);

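The system call path

--> The application-level function, i.e. the pid_t fork(void) above

--> The wrapper routine in libc, which passes the system call number to the kernel

--> The system call handler, which receives the system call number and looks up the address of the matching service routine in sys_call_table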

/* 0 */     CALL(sys_restart_syscall)
                CALL(sys_exit)
                CALL(sys_fork_wrapper)    //-->
                CALL(sys_read)
                CALL(sys_write)

/*-------------------------------------------------------------*/


sys_fork_wrapper:
        add r0, sp, #S_OFF
        b   sys_fork            @ branch to sys_fork

--> The system call service routine, the function that does the real work. Here that is sys_fork().

asmlinkage int sys_fork(struct pt_regs *regs)
{
#ifdef CONFIG_MMU
    return do_fork(SIGCHLD, regs->ARM_sp, regs, 0, NULL, NULL);
#else
    /* can not support in nommu mode */
    return(-EINVAL);
#endif
}

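"Kernel mode" and "user mode"

Two concepts showed up along the system call path: "kernel mode" and "user mode". With the same CPU and the same memory, where does the split into two modes come from?
The answer lies in the processor hardware. On ARM specifically, the processor itself has several modes: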

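Six privileged modes

- abort mode
- interrupt request (IRQ) mode
- fast interrupt request (FIQ) mode
- supervisor mode
- system mode
- undefined mode

One unprivileged mode

- user mode

What the modes mean

- When a memory access fails, the processor enters abort mode;
- When the processor takes an interrupt, it enters interrupt request mode or fast interrupt request mode;
- On reset, and when a software interrupt instruction (swi) is executed, the processor enters supervisor mode; the kernel runs in this mode most of the time;
- Ordinary code, i.e. code outside the kernel, runs in user mode; system mode is privileged but shares user mode's registers;
- On an illegal or unrecognized instruction, the processor enters undefined mode.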

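Registers per mode

With this many modes, there naturally has to be a register that records the current one.
An ARM processor has 37 registers. In each mode some registers are active and some are hidden, so every mode has registers of its own. Some registers, of course, are shared.

The cpsr register deserves a proper introduction. Its full name is the Current Program Status Register. It is 32 bits wide, and the low five bits encode the mode. For example, 10011 means supervisor mode.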

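To see why the modes are separated, take user mode and supervisor mode as the example.
When the system is in user mode, i.e. not in the kernel, we may access our own memory space but are never allowed to touch kernel code. So what happens if we point a pointer into the 3G-4G range?

The processor receives the access request, checks the current mode, and finds that this mode has no right to access that address range. Kernel code is thus protected by a prohibition enforced in hardware, which keeps kernel space safe. However formidable the hacker, even one as mighty as Fengjie, trying to wreck the kernel fortress from user space is futile. All one can do is wait for the fortress to collapse from within.
For a user process to enter kernel mode, it has to go from user mode to supervisor mode. It cannot do that by writing the mode bits itself, because user mode is not allowed to change the low five bits of cpsr. Instead, the libc wrapper executes the swi instruction, which raises an exception: the hardware saves the old cpsr, sets the low five bits to 10011, and jumps to the kernel's exception vector. From then on the code runs in supervisor mode and may access the 3G-4G range. System mode, by contrast, is a privileged mode that shares its registers with user mode and may rewrite cpsr, but user code has no way to enter it on its own.

That is enough background to keep going. Having gone through the system call path, we are finally in kernel mode, that is, the low five bits of the ARM cpsr register are 10011, and do_fork starts executing.

At the processor level, the way into kernel mode is a system call.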

Creating the child process

As its comment says, do_fork is the function that lays out the details of process creation.


/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */

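Parameters

The parameters are as follows:

clone_flags:
The low eight bits hold the number of the signal sent to the parent when the child exits; the remaining bits carry the CLONE_* flags listed below.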


#define CSIGNAL              0x000000ff  /* signal mask to be sent at exit */
#define CLONE_VM             0x00000100  /* set if VM shared between processes */
#define CLONE_FS             0x00000200  /* set if fs info shared between processes */
#define CLONE_FILES          0x00000400  /* set if open files shared between processes */
#define CLONE_SIGHAND        0x00000800  /* set if signal handlers and blocked signals shared */
#define CLONE_PTRACE         0x00002000  /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK          0x00004000  /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT         0x00008000  /* set if we want to have the same parent as the cloner */
#define CLONE_THREAD         0x00010000  /* Same thread group? */
#define CLONE_NEWNS          0x00020000  /* New namespace group? */
#define CLONE_SYSVSEM        0x00040000  /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS         0x00080000  /* create a new TLS for the child */
#define CLONE_PARENT_SETTID  0x00100000  /* set the TID in the parent */
#define CLONE_CHILD_CLEARTID 0x00200000  /* clear the TID in the child */
#define CLONE_DETACHED       0x00400000  /* Unused, ignored */
#define CLONE_UNTRACED       0x00800000  /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID   0x01000000  /* set the TID in the child */
#define CLONE_NEWUTS         0x04000000  /* New utsname group? */
#define CLONE_NEWIPC         0x08000000  /* New ipcs */
#define CLONE_NEWUSER        0x10000000  /* New user namespace */
#define CLONE_NEWPID         0x20000000  /* New pid namespace */
#define CLONE_NEWNET         0x40000000  /* New network namespace */
#define CLONE_IO             0x80000000  /* Clone io context */

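stack_start:
The user-mode stack pointer to assign to the child.

stack_size:
Unused.

parent_tidptr:
Address of a user-mode variable in the parent.

child_tidptr:
Address of a user-mode variable in the child.

do_fork is not hard to understand: create the child, insert it into the scheduler's run queue, wait for it to be scheduled and given CPU time, and finally run it.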

long do_fork(unsigned long clone_flags,
             unsigned long stack_start,
             struct pt_regs *regs,
             unsigned long stack_size,
             int __user *parent_tidptr,
             int __user *child_tidptr)
{
    struct task_struct *p;
    int trace = 0;
    long nr;

    if (clone_flags & CLONE_NEWUSER) {
        if (clone_flags & CLONE_THREAD)
            return -EINVAL;
        if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
                !capable(CAP_SETGID))
            return -EPERM;
    }

    if (likely(user_mode(regs)))
        trace = tracehook_prepare_clone(clone_flags);

    p = copy_process(clone_flags, stack_start, regs, stack_size,
                     child_tidptr, NULL, trace);
    if (!IS_ERR(p))
    {
        ... ...
        wake_up_new_task(p);    //-->
        ... ...
    }
    ... ...
    return nr;
}

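There are two main steps:

1. copy_process gives the child its flesh and blood;
2. the child's address is then handed to wake_up_new_task, which gets it ready to run.

wake_up_new_task: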

void wake_up_new_task(struct task_struct *p)
{
    ... ...
    rq = __task_rq_lock(p);
    activate_task(rq, p, 0);    //-->
    p->on_rq = 1;
    ... ...
}

activate_task:

static void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
    if (task_contributes_to_load(p))
        rq->nr_uninterruptible--;

    enqueue_task(rq, p, flags); // add to the run queue
    inc_nr_running(rq);
}

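Managing child processes

There is no "family planning" in the world of Linux, which leads straight to countless child processes. They have to be managed, of course. How? By making them queue up.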

rq:


/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq rq;

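That is the rough outline of fork. Now let us dig a little deeper.

To create a process we need to know what a process is actually made of. Bones first, flesh second: once the skeleton of the process is up, the rest is a matter of filling in organs at the right places.

Here are the key parts of copy_process.

(1) Set up the process's key structures: the process descriptor and thread_info

    p = dup_task_struct(current);
    if (!p)
        goto fork_out;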


static struct task_struct *dup_task_struct(struct task_struct *orig)
{
    ... ...
    int node = tsk_fork_get_node(orig);
    int err;

    prepare_to_copy(orig);

    tsk = alloc_task_struct_node(node);     // struct task_struct
    if (!tsk)
        return NULL;

    ti = alloc_thread_info_node(tsk, node); // struct thread_info
    if (!ti) {
        free_task_struct(tsk);
        return NULL;
    }

    err = arch_dup_task_struct(tsk, orig);
    if (err)
        goto out;

    tsk->stack = ti;
    ... ...
    setup_thread_stack(tsk, orig);
    ... ...
}

(2) Initialize scheduling-related state.

    sched_fork(p);


void sched_fork(struct task_struct *p)
{
    unsigned long flags;
    int cpu = get_cpu();

    __sched_fork(p);    // initialize this task's scheduling entity, struct sched_entity
    /*
     * We mark the process as running here. This guarantees that
     * nobody will actually run it, and a signal or other external
     * event cannot wake it up and insert it on the runqueue either.
     */
    p->state = TASK_RUNNING;

    /*
     * Revert to default priority/policy on fork if requested.
     */
    if (unlikely(p->sched_reset_on_fork)) {
        if (p->policy == SCHED_FIFO || p->policy == SCHED_RR) {
            p->policy = SCHED_NORMAL;
            p->normal_prio = p->static_prio;
        }

        if (PRIO_TO_NICE(p->static_prio) < 0) {
            p->static_prio = NICE_TO_PRIO(0);
            p->normal_prio = p->static_prio;
            set_load_weight(p);
        }

        /*
         * We don't need the reset flag anymore after the fork. It has
         * fulfilled its duty:
         */
        p->sched_reset_on_fork = 0;
    }

    /*
     * Make sure we do not leak PI boosting priority to the child.
     */
    p->prio = current->normal_prio;

    if (!rt_prio(p->prio))
        p->sched_class = &fair_sched_class;    // pick the scheduling class: CFS, the Completely Fair Scheduler

    if (p->sched_class->task_fork)
        p->sched_class->task_fork(p);

    /*
     * The child is not yet in the pid-hash so no cgroup attach races,
     * and the cgroup is pinned to this child due to cgroup_fork()
     * is ran before sched_fork().
     *
     * Silence PROVE_RCU.
     */
    raw_spin_lock_irqsave(&p->pi_lock, flags);
    set_task_cpu(p, cpu);
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
    if (likely(sched_info_on()))
        memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
    p->on_cpu = 0;
#endif
#ifdef CONFIG_PREEMPT
    /* Want to start with kernel preemption disabled. */
    task_thread_info(p)->preempt_count = 1;
#endif
#ifdef CONFIG_SMP
    plist_node_init(&p->pushable_tasks, MAX_PRIO);
#endif

    put_cpu();
}

(3) Set the initial values of the child's registers, including the location of its kernel stack.

    retval = copy_thread(clone_flags, stack_start, stack_size, p, regs);
    if (retval)
        goto bad_fork_cleanup_io;


int copy_thread(unsigned long clone_flags, unsigned long stack_start,
                unsigned long stk_sz, struct task_struct *p,
                struct pt_regs *regs)
{
    struct thread_info *thread = task_thread_info(p);
    struct pt_regs *childregs = task_pt_regs(p);

/*
 struct pt_regs {
     unsigned long uregs[18];
 };

 #define ARM_cpsr    uregs[16]
 #define ARM_pc      uregs[15]
 #define ARM_lr      uregs[14]
 #define ARM_sp      uregs[13]
 #define ARM_ip      uregs[12]
 #define ARM_fp      uregs[11]
 #define ARM_r10     uregs[10]
 #define ARM_r9      uregs[9]
 #define ARM_r8      uregs[8]
 #define ARM_r7      uregs[7]
 #define ARM_r6      uregs[6]
 #define ARM_r5      uregs[5]
 #define ARM_r4      uregs[4]
 #define ARM_r3      uregs[3]
 #define ARM_r2      uregs[2]
 #define ARM_r1      uregs[1]
 #define ARM_r0      uregs[0]
 #define ARM_ORIG_r0 uregs[17]
*/
    *childregs = *regs;
    childregs->ARM_r0 = 0;
    childregs->ARM_sp = stack_start;

/*
 struct cpu_context_save {
     __u32   r4;
     __u32   r5;
     __u32   r6;
     __u32   r7;
     __u32   r8;
     __u32   r9;
     __u32   sl;
     __u32   fp;
     __u32   sp;
     __u32   pc;
     __u32   extra[2];       // Xscale 'acc' register, etc
 };
*/
    memset(&thread->cpu_context, 0, sizeof(struct cpu_context_save));
    thread->cpu_context.sp = (unsigned long)childregs;
    thread->cpu_context.pc = (unsigned long)ret_from_fork;

    clear_ptrace_hw_breakpoint(p);

    if (clone_flags & CLONE_SETTLS)
        thread->tp_value = regs->ARM_r3;

    thread_notify(THREAD_NOTIFY_COPY, thread);

    return 0;
}

(4) Insert the pid into the hlist.

    attach_pid(p, PIDTYPE_PID, pid);
    nr_threads++;


void attach_pid(struct task_struct *task, enum pid_type type,
                struct pid *pid)
{
    struct pid_link *link;

    link = &task->pids[type];
    link->pid = pid;
    hlist_add_head_rcu(&link->node, &pid->tasks[type]);
}

struct pid_link
{
    struct hlist_node node;
    struct pid *pid;
};

struct pid
{
    atomic_t count;
    unsigned int level;
    struct hlist_head tasks[PIDTYPE_MAX];
    struct rcu_head rcu;
    struct upid numbers[1];
};

That is the rough outline of creating a (user-mode) process. It also reflects the idea behind "copy-on-write": at initialization the child inherits most of its resources from the parent instead of copying them up front, which keeps creation lightweight.

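When I first studied operating systems, all I met were "process" and "thread", two simple, harmless words. Who knew that real operating systems would also throw up "kernel thread", "lightweight process", "user thread" and "LWP".

Everyone knows that first impressions of people matter; first impressions of concepts matter just as much. It is not that a process is the big one and a thread the small one, as if one big and one small were enough to classify everything in the world. Some things need a closer look before the reason they exist becomes clear.

I. Thread types

Kernel threads

First, on the question of "kernel processes", here is an answer quoted from the CSDN forum:

There is no such thing as a "kernel process". A "kernel thread" is itself a special kind of process that runs only in kernel space, so it has no "virtual address space" associated with it and is never switched to user space. Like ordinary processes, however, kernel threads are schedulable and preemptible, which sets them apart from interrupt handlers.
Linux typically uses kernel threads for special jobs, for example the pdflush kernel thread that writes back the page cache.
Also, in the Linux kernel every schedulable thing has a thread_info and a task_struct. The threads of one process differ from a process only in that they share some resources, such as the address space (their mm_struct members point to the same place). So if you insist that kernel threads should be called "kernel processes", that is not exactly wrong, but it turns into a word game. The official name is "kernel thread".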

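The user-level "lightweight process" (LWP)

A lightweight process (LWP) is a user thread supported by the kernel. It is a higher-level abstraction built on kernel threads, so LWPs can exist only where kernel threads are supported.

Each process has one or more LWPs, and each LWP is backed by one kernel thread. This is the one-to-one threading model described in the "dinosaur book". On operating systems implemented this way, the LWP is the user thread.
Because each LWP is tied to a particular kernel thread, each LWP is an independent scheduling unit. Even if one LWP blocks in a system call, the rest of the process keeps running.
Lightweight processes have limitations. First, most LWP operations, such as creation, destruction and synchronization, require system calls, and system calls are relatively expensive because they switch between user mode and kernel mode. Second, each LWP needs a kernel thread behind it, so LWPs consume kernel resources (the kernel thread's stack space), and a system therefore cannot support a very large number of them. Although an LWP is in essence a user thread, the LWP library is built on top of the kernel and many of its operations are system calls, so it is not very efficient.

User threads

The "threads" discussed here live entirely in user space and are implemented by a library. The kernel does not schedule user threads directly and does not even know they exist.
The drawback is that if one user thread blocks in a system call, the whole process blocks.

Bound mode: user threads + LWP

Combining the LWP's ability to interact with the kernel and the user thread's efficiency gives an "enhanced user thread": the user threads + LWP model.
The user thread library still lives entirely in user space, so user thread operations stay cheap and a program can create as many user threads as it needs. The operating system provides LWPs as the bridge between user threads and kernel threads.
As before, an LWP is backed by a kernel thread and is a scheduling unit of the kernel, and the system calls of user threads go through LWPs, so one blocked user thread does not stall the whole process. The library attaches the user threads it creates to LWPs; the number of LWPs need not match the number of user threads. When the kernel schedules an LWP, the user thread attached to it runs.

II. Creating a simple "kernel thread"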

二、创建一个简单的 “内核线程”

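/*
 * Create a kernel thread.
 */
pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
    struct pt_regs regs;

    memset(&regs, 0, sizeof(regs));

    regs.ARM_r4 = (unsigned long)arg;
    regs.ARM_r5 = (unsigned long)fn;
    regs.ARM_r6 = (unsigned long)kernel_thread_exit;
    regs.ARM_r7 = SVC_MODE | PSR_ENDSTATE | PSR_ISETSTATE;
    regs.ARM_pc = (unsigned long)kernel_thread_helper;
    regs.ARM_cpsr = regs.ARM_r7 | PSR_I_BIT;

    return do_fork(flags|CLONE_VM|CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
}

With the earlier look at the ARM registers behind us, the macros above should not feel unfamiliar.

We met the cpsr register among them. Not all of its 32 bits are used. The low bits hold the processor mode and the interrupt mask bits. The high bits hold the condition flags that instructions consult, such as whether a comparison found two values equal or whether an addition or subtraction produced zero. With a little ARM assembly under your belt, the following macros will look very familiar.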


/*
 * PSR bits
 */
#define USR26_MODE  0x00000000
#define FIQ26_MODE  0x00000001
#define IRQ26_MODE  0x00000002
#define SVC26_MODE  0x00000003
#define USR_MODE    0x00000010
#define FIQ_MODE    0x00000011
#define IRQ_MODE    0x00000012
#define SVC_MODE    0x00000013
#define ABT_MODE    0x00000017
#define UND_MODE    0x0000001b
#define SYSTEM_MODE 0x0000001f
#define MODE32_BIT  0x00000010
#define MODE_MASK   0x0000001f
#define PSR_T_BIT   0x00000020
#define PSR_F_BIT   0x00000040
#define PSR_I_BIT   0x00000080
#define PSR_A_BIT   0x00000100
#define PSR_E_BIT   0x00000200
#define PSR_J_BIT   0x01000000
#define PSR_Q_BIT   0x08000000
#define PSR_V_BIT   0x10000000
#define PSR_C_BIT   0x20000000
#define PSR_Z_BIT   0x40000000
#define PSR_N_BIT   0x80000000

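"A programmer who understands the hardware is a good programmer."

The ancestor processes: INIT_TASK(), kernel_init()

The best-known kernel threads are process 0 and process 1. People have an instinct to trace their roots, so who is the ancestor of all these processes? Process 0, of course.

"When the universe began, all was void." At the instant you power on there are no processes, let alone routines to serve you. Everything has to be done by hand, and the data structures have to be allocated statically.

I. Process 0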

/*
 * Initial task structure.
 *
 * All other task structs will be allocated on slabs in fork.c
 */
struct task_struct init_task = INIT_TASK(init_task);

/*
 *  INIT_TASK is used to set up the first task table, touch at
 * your own risk!. Base=0, limit=0x1fffff (=2MB)
 */
#define INIT_TASK(tsk)  \
{                                                               \
    .state          = 0,                                        \
    .stack          = &init_thread_info,                        \
    .usage          = ATOMIC_INIT(2),                           \
    .flags          = PF_KTHREAD,                               \
    .prio           = MAX_PRIO-20,                              \
    .static_prio    = MAX_PRIO-20,                              \
    .normal_prio    = MAX_PRIO-20,                              \
    .policy         = SCHED_NORMAL,                             \
    .cpus_allowed   = CPU_MASK_ALL,                             \
    .mm             = NULL,                                     \
    .active_mm      = &init_mm,                                 \
    .se             = {                                         \
        .group_node     = LIST_HEAD_INIT(tsk.se.group_node),    \
    },                                                          \
    .rt             = {                                         \
        .run_list       = LIST_HEAD_INIT(tsk.rt.run_list),      \
        .time_slice     = HZ,                                   \
        .nr_cpus_allowed = NR_CPUS,                             \
    },                                                          \
    .tasks          = LIST_HEAD_INIT(tsk.tasks),                \
    ... ...
}

Process 0 runs start_kernel, which initializes every data structure the kernel needs, enables interrupts, and then creates process 1 (the init process).

asmlinkage void __init start_kernel(void)
{
    ... ...
    /* Do the rest non-__init'ed, we're now alive */
    rest_init();
}

Finally it plays dead. If at some moment none of its children is runnable, it rises from the grave to take over.

static noinline void __init_refok rest_init(void)
{
    int pid;

    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);  // create init task...
    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();
    complete(&kthreadd_done); // signal kthreadd_done

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    init_idle_bootup_task(current);
    preempt_enable_no_resched();
    schedule();
    preempt_disable();

    /* Call into cpu_idle with preempt disabled */
    cpu_idle();  // process 0 plays dead; the scheduler picks it only when no other process is in TASK_RUNNING
}

II. Process 1

static int __init kernel_init(void * unused)
{
    /*
     * Wait until kthreadd is all set-up.
     */
    wait_for_completion(&kthreadd_done); // wait for kthreadd_done to be signalled
    /*
     * init can allocate pages on any node
     */
    set_mems_allowed(node_states[N_HIGH_MEMORY]);
    /*
     * init can run on any cpu.
     */
    set_cpus_allowed_ptr(current, cpu_all_mask);

    cad_pid = task_pid(current);

    smp_prepare_cpus(setup_max_cpus);

    do_pre_smp_initcalls();
    lockup_detector_init();

    smp_init();
    sched_init_smp();

    do_basic_setup();

    /* Open the /dev/console on the rootfs, this should never fail */
    if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
        printk(KERN_WARNING "Warning: unable to open an initial console.\n");

    (void) sys_dup(0);
    (void) sys_dup(0);
    /*
     * check if there is an early userspace init.  If yes, let it do all
     * the work
     */
    if (!ramdisk_execute_command)
        ramdisk_execute_command = "/init"; // usually NULL, so it is set to "/init"

    if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
        ramdisk_execute_command = NULL;
        prepare_namespace();
    }

    /*
     * Ok, we have completed the initial bootup, and
     * we're essentially up and running. Get rid of the
     * initmem segments and start the user-mode stuff..
     */
    init_post();    //-->
    return 0;
}

static noinline int init_post(void)
{
    /* need to finish all async __init code before freeing the memory */
    async_synchronize_full();
    free_initmem();
    mark_rodata_ro();
    system_state = SYSTEM_RUNNING;
    numa_default_policy();

    current->signal->flags |= SIGNAL_UNKILLABLE;

    if (ramdisk_execute_command) {
        run_init_process(ramdisk_execute_command);
        /*
         * Loads the executable init_filename; the init kernel thread
         * turns into an ordinary process.
         *
         * static void run_init_process(const char *init_filename)
         * {
         *     argv_init[0] = init_filename;
         *     kernel_execve(init_filename, argv_init, envp_init);
         * }
         */
        printk(KERN_WARNING "Failed to execute %s\n",
                ramdisk_execute_command);
    }

    /*
     * We try each of these until one succeeds.
     *
     * The Bourne shell can be used instead of init if we are
     * trying to recover a really broken machine.
     */
    if (execute_command) {
        run_init_process(execute_command);
        printk(KERN_WARNING "Failed to execute %s.  Attempting "
                    "defaults...\n", execute_command);
    }
    run_init_process("/sbin/init");
    run_init_process("/etc/init");
    run_init_process("/bin/init");
    run_init_process("/bin/sh");

    panic("No init found.  Try passing init= option to kernel. "
          "See Linux Documentation/init.txt for guidance.");
}

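Oh yeah, we have finally created a process. So how does the Linux scheduler schedule the children?

"Having children is easy; raising them is hard, all the more so in this age of high housing prices, high living costs and high blood pressure."

We have forked a pile of child processes. How do we manage them, and whose turn is it to run? This is a big problem!

Speaking of choices, operations research comes to mind: nobody is so important that the whole team can be ignored, and only a sensible allocation gets the most out of everyone.

As everyone knows, the Linux kernel is preemptive: "everybody gets a little while, and VIPs get a little longer". The question is how long "a little while" should be, and who counts as a VIP.

If "a little while" is too long, those waiting behind grow impatient and the system feels sluggish.
If "a little while" is too short, a process is swapped out before it has done anything.

Looking after everyone so that nobody has cause to complain is the scheduler's mission.

The Linux scheduler

For the ideas behind the Linux scheduler's algorithm, see: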

http://hi.baidu.com/kebey2004/blog/item/3f96250803662a3de8248841.html

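The scheduling algorithm the kernel currently uses is CFS, which blurs the traditional notions of time slice and priority.

Through the scheduler's class framework, other scheduling algorithms can be plugged in, and different processes can choose different algorithms.

The first question in scheduling is when to schedule. It is as if an alarm clock has been set: when it rings, the kernel considers whether to switch processes.

Time ticks away, so who controls the kernel's biological clock?

The timer interrupt is a kind of I/O interrupt; each interrupt is one tick. The tick is derived by dividing down pclk.

I. The timer interrupt on a uniprocessor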

-- arch/arm/plat-samsung/time.c --
/*
 * IRQ handler for the timer
 */
static irqreturn_t
s3c2410_timer_interrupt(int irq, void *dev_id)
{
    timer_tick();
    return IRQ_HANDLED;
}

static struct irqaction s3c2410_timer_irq = {
    .name       = "S3C2410 Timer Tick",
    .flags      = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
    .handler    = s3c2410_timer_interrupt,
};

-- arch/arm/kernel/time.c --
/*
 * Kernel system timer support.
 */
void timer_tick(void)
{
    profile_tick(CPU_PROFILING);
    do_leds();
    xtime_update(1); // update the wall clock --> b
#ifndef CONFIG_SMP
    update_process_times(user_mode(get_irq_regs())); // update some kernel statistics --> d
#endif
}

void xtime_update(unsigned long ticks)
{
    write_seqlock(&xtime_lock);
    do_timer(ticks); //-->bb
    write_sequnlock(&xtime_lock);
}

bb:


void do_timer(unsigned long ticks)
{
    jiffies_64 += ticks;
    update_wall_time();
    calc_global_load(ticks);
}

/* Update some kernel statistics */
void update_process_times(int user_tick)
{
    struct task_struct *p = current;
    int cpu = smp_processor_id();

    /* Note: this timer irq context must be accounted for as well. */
    account_process_tick(p, user_tick); // account the tick to the current process -->e
    run_local_timers();   // raise the local TIMER_SOFTIRQ -->f
    rcu_check_callbacks(cpu, user_tick);
    printk_tick();
#ifdef CONFIG_IRQ_WORK
    if (in_irq())
        irq_work_run();
#endif
    scheduler_tick(); //-->g
    run_posix_cpu_timers(p);
}


/*
 * Account a single tick of cpu time.
 * @p: the process that the cpu time gets accounted to
 * @user_tick: indicates if the tick is a user or a system tick
 */

The comment is clear, and together with the actual argument the picture is complete: when a timer interrupt fires, a different accounting function runs depending on whether the current process was in user mode or kernel mode. user_mode(get_irq_regs()) checks which mode that was; on ARM this means looking at the saved cpsr.

void account_process_tick(struct task_struct *p, int user_tick)
{
    cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
    struct rq *rq = this_rq();

    if (sched_clock_irqtime) {
        irqtime_account_process_tick(p, user_tick, rq);
        return;
    }

    if (user_tick)
        account_user_time(p, cputime_one_jiffy, one_jiffy_scaled); //-->ee
    else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
        account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
                            one_jiffy_scaled);
    else
        account_idle_time(cputime_one_jiffy);
}

ee:


/*
 * Account user cpu time to a process.
 * @p: the process that the cpu time gets accounted to
 * @cputime: the cpu time spent in user space since the last update
 * @cputime_scaled: cputime scaled by cpu frequency
 */
#define cputime_one_jiffy       jiffies_to_cputime(1)
#define jiffies_to_cputime(__hz)    (__hz)

/* Mainly updates the time accounting for the process and for the CPU */
void account_user_time(struct task_struct *p, cputime_t cputime,
                       cputime_t cputime_scaled)
{
    struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
    cputime64_t tmp;

    /* Add user time to process. */
    p->utime = cputime_add(p->utime, cputime);
    p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
    account_group_user_time(p, cputime);

    /* Add user time to cpustat. */
    tmp = cputime_to_cputime64(cputime);
    if (TASK_NICE(p) > 0)
        cpustat->nice = cputime64_add(cpustat->nice, tmp); // time used by tasks with a positive nice value is counted as "nice" time
    else
        cpustat->user = cputime64_add(cpustat->user, tmp);

    cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
    /* Account for user time used */
    acct_update_integrals(p);
}

II. Hard interrupts

Something new appears here: the softirq.

/*
 * Called by the local, per-CPU timer interrupt on SMP.
 */
void run_local_timers(void)
{
    hrtimer_run_queues();
    raise_softirq(TIMER_SOFTIRQ); // raise the softirq whose index is TIMER_SOFTIRQ
}

These are the softirqs that Linux 3.0 uses.


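enum
{
    HI_SOFTIRQ=0,       // high-priority tasklets
    TIMER_SOFTIRQ,      // work tied to the timer interrupt
    NET_TX_SOFTIRQ,     // send packets to the network card
    NET_RX_SOFTIRQ,     // receive packets from the network card
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,    // regular tasklets
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,    /* Preferable RCU should always be the last softirq */

    NR_SOFTIRQS
};

A passer-by asks: "If there are soft interrupts there must be hard ones too, so what exactly is the difference?"

That passer-by was once me, dizzy from all the soft and hard. Hard interrupts are the easy ones. Picture a taut rope between you and the CPU. You press a key, the rope (the signal level) is pulled low, and a pulse travels down the rope. The CPU feels a sudden tug on its hand: the pulse has arrived. The CPU knows you sent a signal, loses its train of thought, and its thinking is interrupted.

A hard interrupt of course needs a real piece of hardware behind it, the interrupt controller.

The register organization in ARM state (there is also Thumb state): in the register diagram, small triangles mark the registers private to each mode. R8~R14 are private to fast interrupt mode, seven in all and the most of any mode, so most fast interrupt handlers do not need to save many registers (pushing them to the stack in memory). That saves time and lets FIQ respond quickly.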

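stmdb   sp!, {r0-r12, lr}       @ save registers
... ...
ldmia   sp!, {r0-r12, pc}^      @ restore registers

How an interrupt is handled:

1. The controller collects the interrupt signals raised by the peripherals and notifies the CPU.
2. The CPU saves its registers (the current context) and calls the Interrupt Service Routine.
3. Identify the interrupt.
4. Clear the interrupt.
5. Restore the context.

What if many interrupts arrive at once? And why does a fast interrupt respond faster?
As processors evolve, the number of interrupt types keeps growing. If each bit stood for exactly one interrupt, the mere 32 bits of the interrupt register would have run out long ago.
The conclusion is that some bits stand for several interrupts, and those interrupts must have something in common.
Take the serial port: transmit complete is INT_TXD0 and data received is INT_RXD0. As soon as either occurs, the INT_UART0 bit in the SRCPND register is set to 1.
In other words, interrupts are handled through a tree of depth three, whose root is the interrupt the CPU is currently servicing.
Every interrupt has a priority. In hardware, an interrupt must pass the arbiter before it reaches the CPU, so interrupts are handled one after another, that is, they reach the root one after another.
When choosing a GPIO as an interrupt pin, we usually just grab a free one and treat the number after EINT0, EINT1 as a plain label. It is more than that: the number also reflects the line's priority in the arbiter. Clearly EINT0 is a high-priority GPIO.

Finally, if one of the interrupts is a fast interrupt, it must be handled promptly. How can it be so quick? It has an express lane: no arbitration, straight to the root.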

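III. Soft interrupts

Soft interrupts are raised by software, and a system call belongs to this family (on ARM it is triggered by the swi instruction), although the softirqs traced below are a separate, kernel-internal mechanism for deferred work. Everything runs in the same memory and eats from the same pot, yet the kernel guards its own 1G of space, sealed like a fortress. The user program shouts from outside the wall for the kernel to open the gate, and the busy kernel is interrupted by the racket.

raise_softirq(TIMER_SOFTIRQ);

/* Mark the softirq as pending */
void raise_softirq(unsigned int nr)
{
    unsigned long flags;

    local_irq_save(flags); // save the interrupt state and disable local interrupts
    raise_softirq_irqoff(nr); //-->i
    local_irq_restore(flags);
}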


ii:


iii:

Each CPU has a 32-bit bitmask describing its pending softirqs.


typedef struct {
    unsigned int __softirq_pending;
#ifdef CONFIG_LOCAL_TIMERS
    unsigned int local_timer_irqs;
#endif
#ifdef CONFIG_SMP
    unsigned int ipi_irqs[NR_IPI];
#endif
} ____cacheline_aligned irq_cpustat_t;

#define or_softirq_pending(x)  (local_softirq_pending() |= (x)) // set bits in the softirq bitmask
#define local_softirq_pending() \
    __IRQ_STAT(smp_processor_id(), __softirq_pending)
#define __IRQ_STAT(cpu, member) (irq_stat[cpu].member)

A softirq is marked pending by raise_softirq and is executed later at a suitable moment.
In other words, a timer-related softirq has been marked pending here. At certain special points the kernel checks the softirq bitmask and suddenly notices: "Oh? Something is pending!" It then calls do_softirq to deal with it.
This raises a question: which points are those "certain special points"?

asmlinkage void do_softirq(void)
{
    __u32 pending;
    unsigned long flags;

    if (in_interrupt()) //-->in
        return;

    local_irq_save(flags);

    pending = local_softirq_pending();

    if (pending)
        __do_softirq(); // something is pending, so run it -->pend

    local_irq_restore(flags);
}

in:

/* Check whether we are currently inside an interrupt */
#define in_interrupt()      (irq_count())
#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))
/*
 * PREEMPT_MASK: 0x000000ff
 * SOFTIRQ_MASK: 0x0000ff00
 * HARDIRQ_MASK: 0x03ff0000
 *     NMI_MASK: 0x04000000
 */
#define preempt_count() (current_thread_info()->preempt_count)

static inline struct thread_info *current_thread_info(void)
{
    register unsigned long sp asm ("sp");
    return (struct thread_info *)(sp & ~(THREAD_SIZE - 1)); // 8k-1: 0001 1111 1111 1111
}

pend:

Each array element holds the handler function of one softirq.

static struct softirq_action softirq_vec[NR_SOFTIRQS];

struct softirq_action
{
    void    (*action)(struct softirq_action *);
};

asmlinkage void __do_softirq(void)
{
    struct softirq_action *h;
    __u32 pending;
    int max_restart = MAX_SOFTIRQ_RESTART;
    int cpu;

    pending = local_softirq_pending(); // fetch the bitmask
    account_system_vtime(current);

    __local_bh_disable((unsigned long)__builtin_return_address(0),
                       SOFTIRQ_OFFSET);
    lockdep_softirq_enter();

    cpu = smp_processor_id();
restart:
    set_softirq_pending(0); // clear the softirq bitmap
    local_irq_enable();     // then re-enable local interrupts

    h = softirq_vec;

    do {
        if (pending & 1) { // run the handlers in priority order, as the mask dictates
            unsigned int vec_nr = h - softirq_vec;
            int prev_count = preempt_count();

            kstat_incr_softirqs_this_cpu(vec_nr);

            trace_softirq_entry(vec_nr);
            h->action(h); // run it -->action
            trace_softirq_exit(vec_nr);
            if (unlikely(prev_count != preempt_count())) {
                printk(KERN_ERR "huh, entered softirq %u %s %p"
                       "with preempt_count %08x,"
                       " exited with %08x?\n", vec_nr,
                       softirq_to_name[vec_nr], h->action,
                       prev_count, preempt_count());
                preempt_count() = prev_count;
            }

            rcu_bh_qs(cpu);
        }
        h++;
        pending >>= 1;
    } while (pending);

    local_irq_disable();

    pending = local_softirq_pending(); // fetch the bitmask again
    if (pending && --max_restart) // new softirqs may have been raised while a handler was running
        goto restart;

    if (pending)   // still some left: wake the ksoftirqd kernel thread to handle them
        wakeup_softirqd();  //-->softirqd

    lockdep_softirq_exit();

    account_system_vtime(current);
    __local_bh_enable(SOFTIRQ_OFFSET);
}

softirqd:

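Each CPU has its own ksoftirqd kernel thread.

It resolves an awkward problem: what to do when a softirq is raised again while softirqs are being processed. It is like a bus driver who sees the last passenger board and shuts the door. Just as his foot touches the accelerator, he spots someone sprinting toward the bus in the mirror. Does he open the door or not?

In the kernel, if a softirq that has already run is raised again, do_softirq() wakes the kernel thread and ends its own work. The rest is left to the kernel thread, which has a lower priority. That gives user programs a chance to run.

So the reason for this design is to keep user processes from "starving". The kernel has priority over user code, and if countless softirqs struck within a short time, user programs would freeze.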

static int run_ksoftirqd(void * __bind_cpu)
{
    set_current_state(TASK_INTERRUPTIBLE);

    while (!kthread_should_stop()) {
        preempt_disable();
        if (!local_softirq_pending()) {
            preempt_enable_no_resched();
            schedule();
            preempt_disable();
        }

        __set_current_state(TASK_RUNNING);

        while (local_softirq_pending()) {
            /* Preempt disable stops cpu going offline.
               If already offline, we'll be on wrong CPU:
               don't process */
            if (cpu_is_offline((long)__bind_cpu))
                goto wait_to_die;
            local_irq_disable();
            if (local_softirq_pending())
                __do_softirq();  // the woken kernel thread calls this when needed
            local_irq_enable();
            preempt_enable_no_resched();
            cond_resched();  // let other tasks run
            preempt_disable();
            rcu_note_context_switch((long)__bind_cpu);
        }
        preempt_enable();
        set_current_state(TASK_INTERRUPTIBLE); // no pending softirqs, so go to sleep
    }
    __set_current_state(TASK_RUNNING);
    return 0;

wait_to_die:
    preempt_enable();
    /* Wait for kthread_stop */
    set_current_state(TASK_INTERRUPTIBLE);
    while (!kthread_should_stop()) {
        schedule();
        set_current_state(TASK_INTERRUPTIBLE);
    }
    __set_current_state(TASK_RUNNING);
    return 0;
}

action:


So what does the handler for TIMER_SOFTIRQ actually look like?

-- kernel/softirq.c --
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}

Looking further, it turns out the handler was registered early on, in init_timers.

-- kernel/timer.c --
void __init init_timers(void)
{
    int err = timer_cpu_notify(&timers_nb, (unsigned long)CPU_UP_PREPARE,
                (void *)(long)smp_processor_id());

    init_timer_stats();

    BUG_ON(err != NOTIFY_OK);
    register_cpu_notifier(&timers_nb);
    open_softirq(TIMER_SOFTIRQ, run_timer_softirq);    // install the handler
}

The TIMER_SOFTIRQ handler:

static void run_timer_softirq(struct softirq_action *h)
{
    struct tvec_base *base = __this_cpu_read(tvec_bases);

    hrtimer_run_pending();

    if (time_after_eq(jiffies, base->timer_jiffies))
        __run_timers(base); //-->run
}

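At this point a brief word about timers is in order.

Everyone knows what a timer does. The one thing to keep in mind is that a timer is not necessarily punctual.

The kernel is a very busy fellow with plenty of tasks more important than timers, so do not be surprised when a timer expires and nothing happens for a while.

I. A timer is an alarm clock

struct timer_list {
    struct list_head entry;
    unsigned long expires;
    struct tvec_base *base;

    void (*function)(unsigned long);
    unsigned long data;

    int slack;

#ifdef CONFIG_TIMER_STATS
    int start_pid;
    void *start_site;
    char start_comm[16];
#endif
#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};

The alarm clocks all hang on chains:

struct tvec_base {
    spinlock_t lock;
    struct timer_list *running_timer;
    unsigned long timer_jiffies;
    unsigned long next_timer;
    struct tvec_root tv1;
    struct tvec tv2;
    struct tvec tv3;
    struct tvec tv4;
    struct tvec tv5;
} ____cacheline_aligned;

So there are five groups of chains (tv1~tv5), each holding the alarms whose remaining time falls in the same range.
Someone will ask: time never stops moving, so shouldn't the alarms keep changing lists?
Of course they do, and that job belongs to the cascade function.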

run:

static inline void __run_timers(struct tvec_base *base){    struct timer_list *timer;    spin_lock_irq(&base->lock);    while (time_after_eq(jiffies, base->timer_jiffies)) {        struct list_head work_list;        struct list_head *head = &work_list;        int index = base->timer_jiffies & TVR_MASK;        /*         * Cascade timers:         */        if (!index &&            (!cascade(base, &base->tv2, INDEX(0))) &&                (!cascade(base, &base->tv3, INDEX(1))) &&                    !cascade(base, &base->tv4, INDEX(2)))            cascade(base, &base->tv5, INDEX(3)); //过滤动态定时器        ++base->timer_jiffies;        list_replace_init(base->tv1.vec + index, &work_list);        while (!list_empty(head)) {            void (*fn)(unsigned long);            unsigned long data;            timer = list_first_entry(head, struct timer_list,entry);            fn = timer->function;            data = timer->data;            timer_stats_account_timer(timer);            base->running_timer = timer;            detach_timer(timer, 1);            spin_unlock_irq(&base->lock);            call_timer_fn(timer, fn, data); //执行定时器函数            spin_lock_irq(&base->lock);        }    }    base->running_timer = NULL;    spin_unlock_irq(&base->lock);}

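I used to wonder: if the CPU keeps updating jiffies, how can it have enough time left for anything else? On reflection, the worry was groundless. Within a single jiffy the CPU still goes through well over ten thousand clock cycles, which leaves room for thousands of instructions. Merely updating the time costs only a handful of them.

Of course, there will be times when there is more work than one jiffy can handle. What then? The work is pushed back and handled in the spare time of the next jiffy. This is the idea behind the so-called interrupt bottom half, and the softirqs and tasklets seen here work the same way.

2. The scheduling-algorithm module

Where were we... oh, right: the last function in update_process_times, scheduler_tick.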

void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;

    sched_clock_tick();

    raw_spin_lock(&rq->lock);
    update_rq_clock(rq);
    update_cpu_load_active(rq);
    curr->sched_class->task_tick(rq, curr, 0);  //-->
    raw_spin_unlock(&rq->lock);

    perf_event_task_tick();

#ifdef CONFIG_SMP
    rq->idle_at_tick = idle_cpu(cpu);
    trigger_load_balance(rq, cpu);
#endif
}

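The sched_class structure that appears here is how the kernel modularizes its scheduling algorithms: each policy supplies its own set of hooks, and the kernel manages scheduling through this structure.

On the Completely Fair Scheduler (CFS):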
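The modular idea can be sketched in a few lines of plain C: a table of function pointers per policy, and a caller that only follows the pointer. Everything below (`toy_sched_class`, `toy_task`, the two toy policies and `TOY_RR_SLICE`) is hypothetical and far simpler than the real structure; it only illustrates the dispatch pattern behind `curr->sched_class->task_tick(...)`.

```c
struct toy_task;

/* Hypothetical, trimmed-down counterpart of a scheduling class. */
struct toy_sched_class {
    const char *name;
    void (*task_tick)(struct toy_task *t);
};

struct toy_task {
    const struct toy_sched_class *sched_class;
    unsigned long ticks_run;    /* total ticks charged to this task */
    int need_resched;           /* set when the policy wants a switch */
};

/* Toy "fair" policy: only account the tick. */
static void fair_task_tick(struct toy_task *t)
{
    t->ticks_run++;
}

/* Toy round-robin policy: ask for a reschedule every TOY_RR_SLICE ticks. */
#define TOY_RR_SLICE 4
static void rr_task_tick(struct toy_task *t)
{
    t->ticks_run++;
    if (t->ticks_run % TOY_RR_SLICE == 0)
        t->need_resched = 1;
}

static const struct toy_sched_class toy_fair_class = { "fair", fair_task_tick };
static const struct toy_sched_class toy_rr_class = { "rr", rr_task_tick };

/* The tick handler never names a policy; it just follows the pointer. */
static void toy_scheduler_tick(struct toy_task *curr)
{
    curr->sched_class->task_tick(curr);
}
```

Adding a new policy then means adding one more table of hooks, without touching the tick handler.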

-->

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {  // macro: walks the sched_entity chain upward, from child to parent
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued); //-->
    }
}

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

    /*
     * Update share accounting for long-running entities.
     */
    update_entity_shares_tick(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
    /*
     * queued ticks are scheduled to match the slice, so don't bother
     * validating it and just reschedule.
     */
    if (queued) {
        resched_task(rq_of(cfs_rq)->curr);
        return;
    }
    /*
     * don't let the period tick interfere with the hrtick preemption
     */
    if (!sched_feat(DOUBLE_TICK) &&
            hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
        return;
#endif

    if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
        check_preempt_tick(cfs_rq, curr);
}

sched_entity

struct sched_entity {
    struct load_weight  load;       /* for load-balancing */
    struct rb_node      run_node;
    struct list_head    group_node;
    unsigned int        on_rq;

    u64         exec_start;
    u64         sum_exec_runtime;
    u64         vruntime;
    u64         prev_sum_exec_runtime;

    u64         nr_migrations;

#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq       *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq       *my_q;
#endif
};

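Process priority and the red-black tree

CFS introduces a red-black tree in which each node represents a runnable process. The tree is ordered by vruntime: the less virtual runtime a process has accumulated, the further left it sits, and since a higher-priority process accumulates vruntime more slowly, it tends to stay toward the left. When a new process joins, the tree has to be rebalanced. Once that is done, what remains is the heart of process scheduling, the function schedule.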
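How priority turns into a position in the tree can be sketched with a little arithmetic. The sketch below assumes the usual CFS rule that runtime is scaled by NICE_0_LOAD (1024) divided by the entity's weight. The `toy_` names are hypothetical, and a linear scan over an array stands in for the red-black tree's leftmost node.

```c
#include <stddef.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL     /* weight of a nice-0 task */

/* Hypothetical, minimal stand-in for a scheduling entity. */
struct toy_entity {
    uint64_t weight;    /* larger weight = higher priority */
    uint64_t vruntime;  /* weighted runtime, in nanoseconds */
};

/* Charge delta_exec_ns of real runtime, scaled down for heavy entities. */
static void toy_update_curr(struct toy_entity *se, uint64_t delta_exec_ns)
{
    se->vruntime += delta_exec_ns * NICE_0_LOAD / se->weight;
}

/*
 * Stand-in for "take the leftmost node": return the index of the entity
 * with the smallest vruntime. Requires n >= 1.
 */
static size_t toy_pick_leftmost(const struct toy_entity *v, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (v[i].vruntime < v[best].vruntime)
            best = i;
    return best;
}
```

If two entities each run for one millisecond, the one with twice the weight is charged half the vruntime, so it ends up further left and is picked again sooner.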

Goto: Linux kernel CFS process scheduling policy

1. Pick one

Find a process on the run queue and hand the CPU to it.

/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next; // the key is assigning next, i.e. finding the next process to run
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();
    cpu = smp_processor_id(); // ID of the CPU that current is running on
    rq = cpu_rq(cpu);  // the run queue that belongs to this CPU
    rcu_note_context_switch(cpu);
    prev = rq->curr;

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    raw_spin_lock_irq(&rq->lock); // local interrupts must be off before looking for a runnable process

    switch_count = &prev->nivcsw;
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev))) { // a signal is pending that may wake prev: give it one more chance to run
            prev->state = TASK_RUNNING;
        } else {
            deactivate_task(rq, prev, DEQUEUE_SLEEP); // put the current process to sleep
            prev->on_rq = 0;

            /*
             * If a worker went to sleep, notify and ask workqueue
             * whether it wants to wake up a task to maintain
             * concurrency.
             */
            if (prev->flags & PF_WQ_WORKER) {
                struct task_struct *to_wakeup;

                to_wakeup = wq_worker_sleeping(prev, cpu);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup);
            }

            /*
             * If we are going to sleep and we have plugged IO
             * queued, make sure to submit it to avoid deadlocks.
             */
            if (blk_needs_flush_plug(prev)) {
                raw_spin_unlock(&rq->lock);
                blk_schedule_flush_plug(prev);
                raw_spin_lock(&rq->lock);
            }
        }
        switch_count = &prev->nvcsw;
    }

    pre_schedule(rq, prev);

    if (unlikely(!rq->nr_running)) // what if this CPU's queue has no runnable process left?
        idle_balance(cpu, rq);  // borrow a few from another CPU; this is load balancing between CPUs

    put_prev_task(rq, prev);  // put prev back in its place
    next = pick_next_task(rq);  // find a new process
    clear_tsk_need_resched(prev);
    rq->skip_clock_update = 0;

    if (likely(prev != next)) {  // get ready to switch processes
        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;

        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * The context switch have flipped the stack from under us
         * and restored the local variables which were saved when
         * this task called schedule() in the past. prev == current
         * is still correct, but it can be moved to another cpu/rq.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);

    post_schedule(rq);

    preempt_enable_no_resched();
    if (need_resched()) // check whether something else has set TIF_NEED_RESCHED on the current process
        goto need_resched;
}

2. Finding a new process

static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq);  // under CFS the next task is simply the leftmost node of the red-black tree, which is convenient
        if (likely(p))
            return p;
    }

    for_each_class(class) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
    }

    BUG(); /* the idle class will always have a runnable task */
}
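The fallback loop in pick_next_task can be pictured as a walk down a priority-ordered chain of classes, where the first class that has something runnable wins and the idle class at the end always answers. Below is a minimal sketch of that walk; the `toy_` types, the three toy classes and the string task names are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical run queue: just counts of runnable tasks per policy. */
struct toy_rq {
    int nr_rt;
    int nr_fair;
};

/* Hypothetical scheduling class, chained from highest to lowest priority. */
struct toy_class {
    const struct toy_class *next;
    const char *(*pick_next)(struct toy_rq *rq);
};

static const char *toy_pick_rt(struct toy_rq *rq)
{
    return rq->nr_rt > 0 ? "rt-task" : NULL;
}

static const char *toy_pick_fair(struct toy_rq *rq)
{
    return rq->nr_fair > 0 ? "fair-task" : NULL;
}

static const char *toy_pick_idle(struct toy_rq *rq)
{
    (void)rq;
    return "idle";      /* the idle class always has something to run */
}

static const struct toy_class toy_idle = { NULL, toy_pick_idle };
static const struct toy_class toy_fair = { &toy_idle, toy_pick_fair };
static const struct toy_class toy_rt = { &toy_fair, toy_pick_rt };

/* Walk the chain; the first class that offers a task wins. */
static const char *toy_pick_next_task(struct toy_rq *rq)
{
    for (const struct toy_class *c = &toy_rt; c; c = c->next) {
        const char *p = c->pick_next(rq);
        if (p)
            return p;
    }
    return NULL;        /* not reached: the idle class never returns NULL */
}
```

The ordering of the chain is what gives real-time tasks precedence over fair ones, independent of anything the fair class does internally.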

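3. Letting the child process fly

Once the new process has been identified, memory and the other vital settings are put in place. When everything is ready, the child has grown up, its wings are strong, and it is left to fly on its own.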

/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    prepare_task_switch(rq, prev, next);

    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);

    if (!mm) { // next is a kernel thread: it borrows prev's address space
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);

    if (!prev->mm) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }
    /*
     * Since the runqueue lock will be released by the next
     * task (which is an invalid locking op but in the case
     * of the scheduler it's an obvious special-case), so we
     * do an early lockdep release here:
     */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);

    barrier(); // compiler barrier: memory accesses may not be reordered across this point
    /*
     * this_rq must be evaluated again because prev may have moved
     * CPUs since it called schedule(), thus the 'rq' on its stack
     * frame will be invalid.
     */
    finish_task_switch(this_rq(), prev);
}

It's this process's turn: let it fly

1. Takeoff

#define switch_to(prev,next,last)                   \
do {                                    \
    last = __switch_to(prev,task_thread_info(prev), task_thread_info(next));    \
} while (0)

2. The assembly details

-- arch/arm/kernel/entry-armv.S --

/*
 * Register switch for ARMv3 and ARMv4 processors
 * r0 = previous task_struct, r1 = previous thread_info, r2 = next thread_info
 * previous and next are guaranteed not to be the same.
 */
ENTRY(__switch_to)
 UNWIND(.fnstart    )
 UNWIND(.cantunwind )
    add ip, r1, #TI_CPU_SAVE
    ldr r3, [r2, #TI_TP_VALUE]
 ARM(   stmia   ip!, {r4 - sl, fp, sp, lr} )    @ Store most regs on stack
 THUMB( stmia   ip!, {r4 - sl, fp}     )    @ Store most regs on stack
 THUMB( str sp, [ip], #4           )
 THUMB( str lr, [ip], #4           )
#ifdef CONFIG_CPU_USE_DOMAINS
    ldr r6, [r2, #TI_CPU_DOMAIN]
#endif
    set_tls r3, r4, r5
#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP)
    ldr r7, [r2, #TI_TASK]
    ldr r8, =__stack_chk_guard
    ldr r7, [r7, #TSK_STACK_CANARY]
#endif
#ifdef CONFIG_CPU_USE_DOMAINS
    mcr p15, 0, r6, c3, c0, 0       @ Set domain register
#endif
    mov r5, r0
    add r4, r2, #TI_CPU_SAVE
    ldr r0, =thread_notify_head
    mov r1, #THREAD_NOTIFY_SWITCH
    bl  atomic_notifier_call_chain
#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP)
    str r7, [r8]
#endif
 THUMB( mov ip, r4             )
    mov r0, r5
 ARM(   ldmia   r4, {r4 - sl, fp, sp, pc}  )    @ Load all regs saved previously
 THUMB( ldmia   ip!, {r4 - sl, fp}     )    @ Load all regs saved previously
 THUMB( ldr sp, [ip], #4           )
 THUMB( ldr pc, [ip]           )
 UNWIND(.fnend      )
ENDPROC(__switch_to)

At this point, the process switch is complete.

Source: https://www.cnblogs.com/jesse123/archive/2011/10/13/2200163.html
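The save-registers, load-registers, swap-stacks dance can be tried out in user space as a loose analogy. The sketch below uses the `getcontext`/`makecontext`/`swapcontext` family from `<ucontext.h>` and assumes a Linux system with glibc. It is not how the kernel switches tasks, and all `toy_` names are hypothetical.

```c
#include <ucontext.h>

static ucontext_t toy_main_ctx, toy_task_ctx;
static char toy_task_stack[64 * 1024];  /* the "task" runs on its own stack */
static volatile int toy_trace[8];       /* records the order of execution */
static volatile int toy_trace_len;

static void toy_task_body(void)
{
    toy_trace[toy_trace_len++] = 2;
    /* Save this context and resume main, like being scheduled out. */
    swapcontext(&toy_task_ctx, &toy_main_ctx);
    /* Scheduled back in: execution continues right after the switch. */
    toy_trace[toy_trace_len++] = 4;
}

/* Runs the ping-pong once and returns how many steps were recorded. */
static int toy_run(void)
{
    toy_trace_len = 0;

    getcontext(&toy_task_ctx);
    toy_task_ctx.uc_stack.ss_sp = toy_task_stack;
    toy_task_ctx.uc_stack.ss_size = sizeof toy_task_stack;
    toy_task_ctx.uc_link = &toy_main_ctx;   /* resumed when the body returns */
    makecontext(&toy_task_ctx, toy_task_body, 0);

    toy_trace[toy_trace_len++] = 1;
    swapcontext(&toy_main_ctx, &toy_task_ctx);  /* first switch to the task */
    toy_trace[toy_trace_len++] = 3;
    swapcontext(&toy_main_ctx, &toy_task_ctx);  /* resume the task */
    return toy_trace_len;
}
```

Each `swapcontext` call stores the caller's registers and stack pointer and loads the other side's, which is why both functions pick up exactly where they left off, much as a task does when `__switch_to` reloads its saved registers.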
