There seems to be some controversy on whether the number of jobs in GNU make is supposed to be equal to the number of cores, or if you can optimize the build time by adding
From my experience, there must be some performance benefits when adding extra jobs. It is simply because disk I/O is one of the bottle necks besides CPU. However it is not easy to decide on the number of extra jobs as it is highly inter-connected with the number of cores and types of the disk being used.
I've run my home project on my 4-core with hyperthreading laptop and recorded the results. This is a fairly compiler-heavy project but it includes a unit test of 17.7 seconds at the end. The compiles are not very IO intensive; there is very much memory available and if not the rest is on a fast SSD.
1 job real 2m27.929s user 2m11.352s sys 0m11.964s
2 jobs real 1m22.901s user 2m13.800s sys 0m9.532s
3 jobs real 1m6.434s user 2m29.024s sys 0m10.532s
4 jobs real 0m59.847s user 2m50.336s sys 0m12.656s
5 jobs real 0m58.657s user 3m24.384s sys 0m14.112s
6 jobs real 0m57.100s user 3m51.776s sys 0m16.128s
7 jobs real 0m56.304s user 4m15.500s sys 0m16.992s
8 jobs real 0m53.513s user 4m38.456s sys 0m17.724s
9 jobs real 0m53.371s user 4m37.344s sys 0m17.676s
10 jobs real 0m53.350s user 4m37.384s sys 0m17.752s
11 jobs real 0m53.834s user 4m43.644s sys 0m18.568s
12 jobs real 0m52.187s user 4m32.400s sys 0m17.476s
13 jobs real 0m53.834s user 4m40.900s sys 0m17.660s
14 jobs real 0m53.901s user 4m37.076s sys 0m17.408s
15 jobs real 0m55.975s user 4m43.588s sys 0m18.504s
16 jobs real 0m53.764s user 4m40.856s sys 0m18.244s
inf jobs real 0m51.812s user 4m21.200s sys 0m16.812s
Basic results:
My guess right now: If you do something else on your computer, use the core count. If you do not, use the thread count. Exceeding it shows no benefit. At some point they will become memory limited and collapse due to that, making the compiling much slower. The "inf" line was added at a much later date, giving me the suspicion that there was some thermal throttling for the 8+ jobs. This does show that for this project size there's no memory or throughput limit in effect. It's a small project though, given 8GB of memory to compile in.
I just got an Athlon II X2 Regor proc with a Foxconn M/B and 4GB of G-Skill memory.
I put my 'cat /proc/cpuinfo' and 'free' at the end of this so others can see my specs. It's a dual core Athlon II x2 with 4GB of RAM.
uname -a on default slackware 14.0 kernel is 3.2.45.
I downloaded the next step kernel source (linux-3.2.46) to /archive4;
extracted it (tar -xjvf linux-3.2.46.tar.bz2
);
cd'd into the directory (cd linux-3.2.46
);
and copied the default kernel's config over (cp /usr/src/linux/.config .
);
used make oldconfig
to prepare the 3.2.46 kernel config;
then ran make with various incantations of -jX.
I tested the timings of each run by issuing make after the time command, e.g., 'time make -j2'. Between each run I 'rm -rf' the linux-3.2.46 tree and reextracted it, copied the default /usr/src/linux/.config into the directory, ran make oldconfig and then did my 'make -jX' test again.
plain "make":
real 51m47.510s
user 47m52.228s
sys 3m44.985s
bob@Moses:/archive4/linux-3.2.46$
as above but with make -j2
real 27m3.194s
user 48m5.135s
sys 3m39.431s
bob@Moses:/archive4/linux-3.2.46$
as above but with make -j3
real 27m30.203s
user 48m43.821s
sys 3m42.309s
bob@Moses:/archive4/linux-3.2.46$
as above but with make -j4
real 27m32.023s
user 49m18.328s
sys 3m43.765s
bob@Moses:/archive4/linux-3.2.46$
as above but with make -j8
real 28m28.112s
user 50m34.445s
sys 3m49.877s
bob@Moses:/archive4/linux-3.2.46$
'cat /proc/cpuinfo' yields:
bob@Moses:/archive4$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 6
model name : AMD Athlon(tm) II X2 270 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 3399.957
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmo
v pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rd
tscp lm 3dnowext 3dnow constant_tsc nonstop_tsc extd_apicid pni monitor cx16 p
opcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowpre
fetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6799.91
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 6
model name : AMD Athlon(tm) II X2 270 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 3399.957
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmo
v pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rd
tscp lm 3dnowext 3dnow constant_tsc nonstop_tsc extd_apicid pni monitor cx16 p
opcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowpre
fetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6799.94
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
'free' yields:
bob@Moses:/archive4$ free
total used free shared buffers cached
Mem: 3991304 3834564 156740 0 519220 2515308
Just as a ref:
From Spawning Multiple Build Jobs
section in LKD:
where n is the number of jobs to spawn. Usual practice is to spawn one or two jobs per processor. For example, on a dual processor machine, one might do
$ make j4
Ultimately, you'll have to do some benchmarks to determine the best number to use for your build, but remember that the CPU isn't the only resource that matters!
If you've got a build that relies heavily on the disk, for example, then spawning lots of jobs on a multicore system might actually be slower, as the disk will have to do extra work moving the disk head back and forth to serve all the different jobs (depending on lots of factors, like how well the OS handles the disk-cache, native command queuing support by the disk, etc.).
And then you've got "real" cores versus hyper-threading. You may or may not benefit from spawning jobs for each hyper-thread. Again, you'll have to benchmark to find out.
I can't say I've specifically tried #cores + 1, but on our systems (Intel i7 940, 4 hyperthreaded cores, lots of RAM, and VelociRaptor drives) and our build (large-scale C++ build that's alternately CPU and I/O bound) there is very little difference between -j4 and -j8. (It's maybe 15% better... but nowhere near twice as good.)
If I'm going away for lunch, I'll use -j8, but if I want to use my system for anything else while it's building, I'll use a lower number. :)
I, personally, use make -j n
where n is "number of cores" + 1.
I can't, however, give a scientific explanation: I've seen a lot of people using the same settings and they gave me pretty good results so far.
Anyway, you have to be careful because some make-chains simply aren't compatible with the --jobs
option, and can lead to unexpected results. If you're experiencing strange dependency errors, just try to make
without --jobs
.