Question
The OS is RHEL 6 (2.6.32). I have isolated a core and am running a compute intensive thread on it. /proc/{thread-id}/status shows one non-voluntary context switch every second.
The thread in question is a SCHED_NORMAL thread and I don't want to change this.
How can I reduce this number of non-voluntary context switches? Does this depend on any scheduling parameters in /proc/sys/kernel?
EDIT: Several responses suggest alternative approaches. Before going that route, I first want to understand why I am getting exactly one non-voluntary context switch per second even over hours of run. For example, is this caused by CFS? If so, which parameters and how?
EDIT2: Further clarification - first question I would like an answer to is the following: Why am I getting one non-voluntary context switch per second instead of, say, one switch every half or two seconds?
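For reference, here is roughly how I watch the counter (a minimal sketch; pass the thread id as argv[1], defaulting to the calling process):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64], line[256];
    snprintf(path, sizeof path, "/proc/%s/status",
             argc > 1 ? argv[1] : "self");
    for (;;) {
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        /* print only the nonvoluntary_ctxt_switches line */
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "nonvoluntary_ctxt_switches", 26) == 0)
                fputs(line, stdout);
        fclose(f);
        sleep(1);
    }
}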
Answer 1:
This is a guess, but an educated one: since you are using an isolated CPU, the scheduler does not schedule any task on it except your own, with one exception. The vmstat code in the kernel has a timer that queues a single work item on each CPU once per second to calculate memory usage statistics, and that is what you are seeing scheduled every second.
The work queue code is smart enough not to wake the work queue kernel thread when the core is 100% idle, but not when it is running a single task.
You can verify this using ftrace. If the sched_switch tracer shows that the entity you switch to roughly once per second (the value is rounded to the nearest jiffy, and the timer does not fire while the CPU is idle, so the timing may be slightly skewed) is the events/CPU_NUMBER task (or keventd on older kernels), then it is almost certain that the cause is the vmstat_update function arming its timer to queue a work item every second, which the events kernel thread then runs.
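If you want to script the check, a minimal sketch in C (assuming debugfs is mounted at /sys/kernel/debug, root privileges, and that the sched_switch tracer is available, as it is on 2.6.32-era kernels):

#include <stdio.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void)
{
    const char *base = "/sys/kernel/debug/tracing";
    char path[128], buf[4096];
    size_t n;

    snprintf(path, sizeof path, "%s/current_tracer", base);
    if (write_str(path, "sched_switch"))
        return 1;
    sleep(5);                      /* capture a few of the once-per-second switches */

    snprintf(path, sizeof path, "%s/trace", base);
    FILE *f = fopen(path, "r");    /* dump the trace; look for events/N or keventd */
    if (!f) { perror(path); return 1; }
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        fwrite(buf, 1, n, stdout);
    fclose(f);

    snprintf(path, sizeof path, "%s/current_tracer", base);
    write_str(path, "nop");        /* turn tracing back off */
    return 0;
}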
Note that the interval at which vmstat arms its timer is configurable; you can set it to another value via the vm.stat_interval sysctl knob. Increasing this value gives you a lower rate of such interruptions, at the cost of less up-to-date memory usage statistics.
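For example, a sketch that raises the interval to 10 seconds (equivalent to sysctl -w vm.stat_interval=10; needs root):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/stat_interval", "w");
    if (!f) {
        perror("/proc/sys/vm/stat_interval");
        return 1;
    }
    fprintf(f, "10\n");   /* vmstat work now runs every 10s on each CPU */
    return fclose(f) == 0 ? 0 : 1;
}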
I maintain a wiki with all the sources of interruption to isolated-CPU workloads here. I also have a patch in the works that makes vmstat skip queuing the work item when nothing has changed from one vmstat run to the next, as would happen if your single task on the CPU does not use any dynamic memory allocation. I am not sure it will benefit you, though; it depends on your workload.
Answer 2:
I strongly suggest you try to optimize the code itself so that when it's running on a CPU, you get the maximum out of it.
Anyhow, I am not sure this will work, but give it a try anyway and let us know:
The basic idea is to set the scheduling policy to SCHED_FIFO and then give the process the maximum priority possible:
#include <sched.h>
#include <stdio.h>   /* for perror */

int main(void)
{
    /* sched_get_priority_max() returns an int; it belongs in the
       sched_priority field, not assigned to the struct directly */
    struct sched_param sp = { .sched_priority = sched_get_priority_max(SCHED_FIFO) };
    int ret = sched_setscheduler(0, SCHED_FIFO, &sp); /* 0 = calling process */
    if (ret == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    return 0;
}
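Note that switching to SCHED_FIFO requires root privileges or the CAP_SYS_NICE capability; otherwise sched_setscheduler fails with EPERM.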
Please keep in mind that any blocking call your process makes is MOST LIKELY going to cause the scheduler to take it off the CPU.
Source
Man page
EDIT:
Sorry, just noticed the pthread tag; the concept still holds, so check out this man page:
http://www.kernel.org/doc/man-pages/online/pages/man3/pthread_setschedparam.3.html
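A minimal sketch of the same idea with the pthreads API, here applied to the calling thread via pthread_self() (link with -lpthread):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct sched_param sp = {
        .sched_priority = sched_get_priority_max(SCHED_FIFO)
    };
    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err) {   /* pthread functions return the error code, not -1/errno */
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
        return 1;
    }
    /* ... thread now runs at SCHED_FIFO max priority ... */
    return 0;
}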
Answer 3:
If one interrupt per second on your dedicated CPU is still too much, then you really need to bypass the normal scheduler entirely. May I suggest the real-time and isochronous priority levels, which can keep your process scheduled more reliably than the usual pre-emptive mechanisms?
Source: https://stackoverflow.com/questions/14037070/why-one-non-voluntary-context-switch-per-second