Why does Linux's scheduler put two threads onto the same physical core on processors with HyperThreading?


I think it's time to summarize some knowledge from comments.

The Linux scheduler is aware of HyperThreading -- information about it should be read from the ACPI SRAT/SLIT tables, which are provided by the BIOS/UEFI -- and Linux then builds scheduler domains from that.

Domains form a hierarchy -- e.g. on a 2-CPU server you will get three layers of domains: all-cpus, per-cpu-package, and per-cpu-core. You can check them in /proc/schedstat:

$ awk '/^domain/ { print $1, $2; } /^cpu/ { print $1; }' /proc/schedstat
cpu0
domain0 0000,00001001     <-- all cpus from core 0
domain1 0000,00555555     <-- all cpus from package 0
domain2 0000,00ffffff     <-- all cpus in the system
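
To see which logical CPUs those core-level masks group together, you can also read the CPU topology directly from sysfs; a minimal sketch (standard Linux paths, the output depends on your machine):

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # HT siblings of CPU 0, e.g. "0,6"
$ cat /sys/devices/system/cpu/cpu0/topology/core_id                 # physical core behind CPU 0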

Part of the CFS scheduler is the load balancer -- the beast that is supposed to steal tasks from your busy core and move them to another core. Here is its description from the kernel documentation:

While doing that, it checks to see if the current domain has exhausted its rebalance interval. If so, it runs load_balance() on that domain. It then checks the parent sched_domain (if it exists), and the parent of the parent and so forth.

Initially, load_balance() finds the busiest group in the current sched domain. If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in that group. If it manages to find such a runqueue, it locks both our initial CPU's runqueue and the newly found busiest one and starts moving tasks from it to our runqueue. The exact number of tasks amounts to an imbalance previously computed while iterating over this sched domain's groups.

From: https://www.kernel.org/doc/Documentation/scheduler/sched-domains.txt
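
If your kernel is built with CONFIG_SCHED_DEBUG (an assumption -- not every distribution kernel enables it), the same domain hierarchy is exposed per CPU under /proc/sys/kernel/sched_domain/, which is a quick way to confirm that an SMT-level domain actually exists on your box:

$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name    # e.g. SIBLING/SMT, MC, CPU/DIE depending on kernel version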

You can monitor the load balancer's activity by comparing the counters in /proc/schedstat over time. I wrote a script for doing that: schedstat.py

The alb_pushed counter shows that the load balancer successfully moved a task out:

Sun Apr 12 14:15:52 2015              cpu0    cpu1    ...    cpu6    cpu7    cpu8    cpu9    cpu10   ...
.domain1.alb_count                                    ...      1       1                       1  
.domain1.alb_pushed                                   ...      1       1                       1  
.domain2.alb_count                              1     ...                                         
.domain2.alb_pushed                             1     ...
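
If you don't want to run the script, a minimal sketch of the same idea is to snapshot /proc/schedstat twice and diff the snapshots; the field layout is described in Documentation/scheduler/sched-stats.txt:

$ cat /proc/schedstat > before; sleep 10; cat /proc/schedstat > after
$ diff before after | grep -E '^[<>] (cpu|domain)'    # lines whose counters changed during the interval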

However, the load balancer's logic is complex, so it is hard to determine what might stop it from doing its job well and how that shows up in the schedstat counters. Neither I nor @thatotherguy can reproduce your issue.

I see two possibilities for that behavior:

  • You have some aggressive power-saving policy that tries to keep one core idle to reduce the CPU's power consumption (see the sketch after this list for a quick way to check this).
  • You really have hit a bug in the scheduling subsystem, in which case you should go to LKML and carefully share your findings (including mpstat and schedstat data).
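
A quick way to check the first possibility is to look at the cpufreq governors and whether any CPUs have been taken offline (standard sysfs paths; the governors available depend on your cpufreq driver):

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
$ cat /sys/devices/system/cpu/online    # should list every logical CPU, e.g. 0-11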

I'm unable to reproduce this on 3.13.0-48 with my Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz.

I have 6 cores with hyperthreading, where logical core N maps to physical core N mod 6.
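
You can verify that mapping yourself with lscpu; a small sketch (the CORE column shows the physical core behind each logical CPU):

$ lscpu --extended=CPU,CORE,SOCKET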

Here's a typical output of top with stress -c 4 in two columns, so that each row is one physical core (I left out a few cores because my system is not idle):

%Cpu0  :100.0 us,   %Cpu6  :  0.0 us, 
%Cpu1  :100.0 us,   %Cpu7  :  0.0 us, 
%Cpu2  :  5.9 us,   %Cpu8  :  2.0 us, 
%Cpu3  :100.0 us,   %Cpu9  :  5.7 us, 
%Cpu4  :  3.9 us,   %Cpu10 :  3.8 us, 
%Cpu5  :  0.0 us,   %Cpu11 :100.0 us, 

Here it is after killing and restarting stress:

%Cpu0  :100.0 us,   %Cpu6  :  2.6 us, 
%Cpu1  :100.0 us,   %Cpu7  :  0.0 us, 
%Cpu2  :  0.0 us,   %Cpu8  :  0.0 us, 
%Cpu3  :  2.6 us,   %Cpu9  :  0.0 us, 
%Cpu4  :  0.0 us,   %Cpu10 :100.0 us, 
%Cpu5  :  2.6 us,   %Cpu11 :100.0 us, 

I did this several times, and did not see any instances where 4 threads across 12 logical cores would schedule on the same physical core.
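
Rather than eyeballing top, you can also ask ps which logical CPU each stress worker last ran on (the PSR column); a minimal sketch, assuming the workers show up under the name "stress":

$ ps -C stress -o pid,psr,comm

On this box, two workers whose PSR values are equal modulo 6 would be sharing a physical core.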

With -c 6 I tend to get results like this, where Linux appears to be helpfully scheduling other processes on their own physical cores. Even so, they're distributed way better than chance:

%Cpu0  : 18.2 us,   %Cpu6  :  4.5 us, 
%Cpu1  :  0.0 us,   %Cpu7  :100.0 us, 
%Cpu2  :100.0 us,   %Cpu8  :100.0 us, 
%Cpu3  :100.0 us,   %Cpu9  :  0.0 us, 
%Cpu4  :100.0 us,   %Cpu10 :  0.0 us, 
%Cpu5  :100.0 us,   %Cpu11 :  0.0 us, 

Given your experience with two other processors that seemed to work correctly, the i7-2600 and Xeon E5-1620: this could be a long shot, but how about a CPU microcode update? If the problem is internal CPU behaviour, an update might include a fix for it.

Intel CPU Microcode Downloads: http://intel.ly/1aku6ak

Also see here: https://wiki.archlinux.org/index.php/Microcode
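
Before and after updating, you can check which microcode revision is actually loaded; a small sketch (x86-specific fields and kernel messages):

$ grep -m1 microcode /proc/cpuinfo
$ dmesg | grep -i microcode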
