The original commentator also offered this explanation:
Historically, N+1 or even N*1.5 was used and worked better on memory/I/O-constrained systems, where the available cache acted as a short-lived buffer feeding the extra committed threads/processes while I/O was in progress. As you've correctly observed, this is not the case on machines with an abundance of RAM, where the cache is long-lived and no data written to disk will be read back; spawning additional threads/processes therefore has a detrimental effect on efficiency due to (much) more rescheduling and TLB shootdown interrupts. In short: when the available RAM is larger than the total disk space needed for the build, use N = number of logical CPUs; if not, use N = logical CPUs + 1. For automated builds, setting the global environment variable CONCURRENCY_LEVEL instead of fixed values for -j, using the previously mentioned #export CONCURRENCY_LEVEL=`getconf _NPROCESSORS_ONLN`, is always the safest bet, especially when using server-grade machines and high-speed, zero-seek-time solid state disks...

I think the conclusion is the one offered above -- stick to N for optimal performance, unless you have a compelling reason not to. I should also emphasize that I don't have a background in computing of any sort, whereas the poster is a professional in the HPC field.
So if I'm allowed to paraphrase and make conclusions:
for a very short compile, like the one in this post, you may find that N+1 seemingly gives a better result since disk I/O plays a big part relative to the code generation (and whatever else a compiler does). For a longer, more 'normal' compilation, disk I/O plays a smaller part.
If your RAM is too small and you have to swap to disk repeatedly, then that obviously increases the disk I/O as well.
In the end, the penalty for over-committing (http://verahill.blogspot.com.au/2013/01/321-compiling-kernel-372-on-debian.html) is large enough that it's a better bet to just go for N threads.
I really shouldn't be surprised -- it's the same effect you see when launching a computational job: you do NOT want to launch more threads than cores.
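In practice, getting N automatically is a one-liner. Something like the following should be equivalent to -j<number of logical CPUs> on any Linux box (a minimal sketch using getconf, as the commentator suggests; nproc from coreutils would also work):

# pick the job count from the number of logical CPUs
JOBS=`getconf _NPROCESSORS_ONLN`
# plain make needs -j passed explicitly; CONCURRENCY_LEVEL is only honoured
# by wrappers such as make-kpkg
export CONCURRENCY_LEVEL=$JOBS
make -j$JOBS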
Original post:
I got a comment recently regarding the number of threads that should be used for make:
make -j7 is the number of cores +1. Stop copy-paste nonsense.... sigh... make -j1 will spawn 1 worker process, -j7 will spawn 7. #export CONCURRENCY_LEVEL=`getconf _NPROCESSORS_ONLN` makes adding -j<jobs> unnecessary; on an i7 this is the same as -j8. When in doubt, check top.....
So the question is: for N cores, should you spawn N threads or N+1? The poster has a valid point -- there's not that much data on what really is the best configuration, and while most people keep repeating the (mostly) accepted N+1 (or 1.5*N) wisdom, we really need more hard numbers.
So here's my real-world, unscientific benchmark for compiling Gromacs 4.5.5 on a six-core AMD Phenom II 1055T with 8 GB RAM and a slow 5400 rpm hard drive (disk I/O plays into things as well). I'm using gcc 4.7.2-4 and Debian Wheezy/Testing.
To get the data I used this script, maketest.sh:
make distclean
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/openblas/lib
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/single/lib -L/opt/openblas/lib -lopenblas"
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/single/include -I/opt/openblas/include"
./configure --disable-mpi --enable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_sp --prefix=/opt/gromacs/gromacs-4.5.5
time make -j$1
which I called with e.g.
sh maketest.sh 6
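To collect all the data points in one go, a simple driver loop along these lines would do it (just a sketch -- the log file names are my own and not part of the original setup):

# run the benchmark for 1 to 12 make jobs; time writes to stderr,
# so redirect that into the per-run log as well
for n in `seq 1 12`; do
    sh maketest.sh $n > build_j$n.log 2>&1
done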
Admittedly, this is a fairly short build but it is a 'real' one.
Results:
N	Time (real)
1	9 m 52 s
2	5 m 18 s
3	3 m 48 s
4	3 m 02 s
5	2 m 24 s
6	2 m 16 s
7	2 m 05 s
8	2 m 06 s
9	2 m 07 s
10	2 m 07 s
11	2 m 08 s
12	2 m 09 s

Or as a plot:
The build time decreases roughly exponentially with the number of threads. The blue line is at 125 seconds, i.e. where dx/dy=0.
I also ran sar (sysstat; sar -u 1 180 |gawk '{print $3,$5,$8}' |tee n7.dat ) for -j7 to see how the load varies with time during make (I collected a little bit of data before and after make, hence the flat line at the end):
The black/blue (user/idle) lines are the interesting ones here.
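For anyone wanting to reproduce that plot from n7.dat, something along these lines should work (a sketch only -- it assumes gnuplot is installed, that the sar header lines have been stripped from n7.dat, and that the three columns are %user, %system and %idle, i.e. the awk fields $3, $5 and $8 above):

gnuplot -persist <<'EOF'
# column 0 is the sample index; with 1 s sampling this is roughly
# seconds since sar was started
set xlabel "time (s)"
set ylabel "CPU %"
plot 'n7.dat' using 0:1 with lines title '%user', \
     'n7.dat' using 0:2 with lines title '%system', \
     'n7.dat' using 0:3 with lines title '%idle'
EOF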
Raw results
N=1
real 9m51.519s
user 6m43.316s
sys 0m44.092s

N=2
real 5m18.359s
user 7m3.548s
sys 0m46.112s

N=3
real 3m47.850s
user 7m22.732s
sys 0m47.064s

N=4
real 3m2.131s
user 7m56.068s
sys 0m41.744s

N=5
real 2m24.258s
user 7m53.140s
sys 0m34.928s

N=6
real 2m16.429s
user 8m15.088s
sys 0m27.160s

N=7
real 2m5.361s
user 7m50.200s
sys 0m28.280s

N=8
real 2m5.820s
user 7m52.380s
sys 0m27.548s

N=9
real 2m7.266s
user 7m54.344s
sys 0m28.340s

N=10
real 2m7.057s
user 7m56.628s
sys 0m27.872s

N=11
real 2m7.728s
user 7m58.276s
sys 0m27.332s

N=12
real 2m8.819s
user 8m0.600s
sys 0m27.544s