Update: I repeated this test by compiling kernel 3.7.2 using different settings (http://verahill.blogspot.com.au/2013/01/321-compiling-kernel-372-on-debian.html) -- given the length of that compile and its reliance on CPU grunt, it is probably a better test case. It came out showing that N -- or even N-1 -- was better than over-committing.
The original commentator also offered this explanation:
Historically, N+1 or even N*1.5 was used and worked better on memory- or I/O-constrained systems, where the available cache acted as a short-lived buffer feeding the extra committed threads/processes while I/O was in progress.
As you've correctly observed, this is not the case on machines with an abundance of RAM, where the RAM acts as a long-lived cache and no data written to disk needs to be read back; spawning additional threads/processes therefore has a detrimental effect on efficiency due to (much) more rescheduling and TLB-shootdown interrupts.
In short, when the available RAM is larger than the total disk space needed for the build:
N = number of logical CPUs
if not:
N = number of logical CPUs + 1
For automated builds, setting the global environment variable CONCURRENCY_LEVEL via the previously mentioned
export CONCURRENCY_LEVEL=`getconf _NPROCESSORS_ONLN`
instead of a fixed value for -j is always the safest bet, especially when using server-grade machines and high-speed, zero-seek-time solid state disks ...
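For what it's worth, here's roughly what that suggestion looks like in practice. As far as I understand it, CONCURRENCY_LEVEL is honoured by Debian's make-kpkg rather than by plain make, so for an ordinary build you'd pass the same value to -j yourself:
export CONCURRENCY_LEVEL=`getconf _NPROCESSORS_ONLN`    # picked up by e.g. make-kpkg
make -j`getconf _NPROCESSORS_ONLN`                      # the equivalent for a plain make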
I think the conclusion is the one offered above -- stick to N for optimal performance, unless you have a compelling reason not to. I should also emphasize that I don't have a background in computing of any sort, whereas the poster is a professional in the HPC field.
So, if I'm allowed to paraphrase and draw conclusions:
For a very short compile, like the one in this post, you may find that N+1 seemingly gives a better result, since disk I/O plays a big part relative to the code generation (and whatever else a compiler does). For a longer, more 'normal' compilation, disk I/O plays a smaller part.
If your RAM is too small and data has to be written to disk and read back repeatedly, then that obviously increases the disk I/O as well.
In the end, the penalty for over-committing (http://verahill.blogspot.com.au/2013/01/321-compiling-kernel-372-on-debian.html) is large enough that it's a better bet to just go for N threads.
I really shouldn't be surprised -- it's the same effect you see when launching a computational job: you do NOT want to launch more threads than cores.
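A concrete (if somewhat tangential) illustration for a Gromacs job on this six-core box, assuming the threaded mdrun from the 4.5 series and the default topol.tpr input:
mdrun -nt 6 -s topol.tpr    # six threads on six cores; -nt 12 would oversubscribe the box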
Original post:
I got a comment recently regarding the number of threads that should be used for make:
make -j7 is the number of cores +1
Stop copy paste nonsense.... sigh...
make -j1 will spawn 1 worker process
-j7 will spawn 7.
#export CONCURRENCY_LEVEL=`getconf _NPROCESSORS_ONLN`
makes adding -jjob unnecessary
on an i7 this is the same as -j8
When in doubt check top.....
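The 'check top' advice is easy to follow, by the way; while make -jN is running you can count the compiler back-ends (cc1/cc1plus for gcc) from another terminal with something like:
ps -C cc1,cc1plus --no-headers | wc -l    # number of compiler processes currently running
top -d 1                                  # or just watch the load interactively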
So the question is: for N cores, should you spawn N threads or N+1? The poster has a valid point -- there's not that much data on what the best configuration really is, and while most people keep repeating the (mostly) accepted N+1 (or 1.5*N) wisdom, we really need more hard numbers.
So here's my real-world, unscientific benchmark for compiling Gromacs 4.5.5 on a six-core AMD Phenom II 1055T with 8 GB RAM and a slow 5400 rpm hard drive (so disk I/O plays into things as well). I'm using gcc 4.7.2-4 and Debian Wheezy/Testing.
To get the data I used this script, maketest.sh:
# maketest.sh -- time a clean Gromacs 4.5.5 build with $1 make jobs
make distclean
# point the build at the local openmpi, openblas and fftw installations
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/openblas/lib
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/single/lib -L/opt/openblas/lib -lopenblas"
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/single/include -I/opt/openblas/include"
./configure --disable-mpi --enable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_sp --prefix=/opt/gromacs/gromacs-4.5.5
# time only the actual build; $1 is the number of make jobs
time make -j$1
which I called with e.g.
sh maketest.sh 6
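I ran the script by hand for each value of N; if you'd rather automate the whole sweep, a loop along these lines should do (the log file names here are just something I made up):
for n in `seq 1 12`
do
    sh maketest.sh $n 2>&1 | tee make_${n}threads.log    # 'time' writes to stderr, hence 2>&1
done
grep real make_*threads.log                              # collect the wall-clock times afterwards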
Admittedly, this is a fairly short build but it is a 'real' one.
Results:
N Time (real)
1 9 m 52 s
2 5 m 18 s
3 3 m 48 s
4 3 m 02 s
5 2 m 24 s
6 2 m 16 s
7 2 m 05 s
8 2 m 06 s
9 2 m 07 s
10 2 m 07 s
11 2 m 08 s
12 2 m 09 s
Or as a plot:
[Plot: build time (real) vs. number of make threads. The build time decreases roughly exponentially with the number of threads; the blue line is at 125 seconds, i.e. the asymptote (dx/dy = 0).]
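The plot itself is nothing fancy; a minimal gnuplot sketch would be something along these lines, assuming the timings are in a two-column (N, seconds) file which I'm calling times.dat here:
gnuplot -persist <<EOF
set xlabel 'make -j N'
set ylabel 'build time (s)'
plot 'times.dat' using 1:2 with linespoints title 'real time', 125 with lines title '125 s asymptote'
EOF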
I'm actually quite surprised that N+1 turned out to be the best configuration, although in general it seems that you don't suffer any penalty for using more threads, so 1.5*N works just as well.
I also ran sar (sysstat; sar -u 1 180 | gawk '{print $3,$5,$8}' | tee n7.dat) for -j7 to see how the load varies with time during make (I collected a little bit of data before and after make, hence the flat line at the end):
[Plot: CPU load over time during the -j7 build; the black/blue (user/idle) lines are what's interesting here.]
The build is very evidently not perfectly parallel at all stages, and that will also affect the optimal number of threads/core.
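One caveat if you want to reproduce the sar trace: the field numbers in the gawk statement depend on sar's output format (locale, AM/PM timestamps and so on). On my system $3, $5 and $8 correspond to %user, %system and %idle, but it's worth checking the header first:
sar -u 1 1                                           # print the header and one sample to check the columns
sar -u 1 180 | gawk '{print $3,$5,$8}' | tee n7.dat  # here: %user %system %idle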
Raw results
N=1
real 9m51.519s
user 6m43.316s
sys 0m44.092s
N=2
real 5m18.359s
user 7m3.548s
sys 0m46.112s
N=3
real 3m47.850s
user 7m22.732s
sys 0m47.064s
N=4
real 3m2.131s
user 7m56.068s
sys 0m41.744s
N=5
real 2m24.258s
user 7m53.140s
sys 0m34.928s
N=6
real 2m16.429s
user 8m15.088s
sys 0m27.160s
N=7
real 2m5.361s
user 7m50.200s
sys 0m28.280s
N=8
real 2m5.820s
user 7m52.380s
sys 0m27.548s
N=9
real 2m7.266s
user 7m54.344s
sys 0m28.340s
N=10
real 2m7.057s
user 7m56.628s
sys 0m27.872s
N=11
real 2m7.728s
user 7m58.276s
sys 0m27.332s
N=12
real 2m8.819s
user 8m0.600s
sys 0m27.544s