07 January 2013

305. make -jN -- should N equal number of cores or N+1 cores? Optimal number of threads per core

Update: I repeated this test by compiling kernel 3.7.2 using different settings (http://verahill.blogspot.com.au/2013/01/321-compiling-kernel-372-on-debian.html) -- given the length of that compile and its reliance on CPU grunt, it is probably a better test case. It came out showing that N -- or even N-1 -- was better than over-committing.

The original commentator also offered this explanation:
Historically, N+1 or even N*1.5 was used and worked better on memory- and I/O-constrained systems, where the available cache acted as a short-lived buffer to feed the extra committed threads/processes while I/O was in progress. As you've observed correctly, this is not the case on machines with an abundance of RAM, where the RAM acts as a long-lasting cache: no data that got written to disk will be read back, so spawning additional threads/processes has a detrimental effect on efficiency due to (much) more rescheduling / TLB shootdown interrupts. In short: when the available RAM is larger than the total disk space needed for the build, use N = number of logical CPUs; if not, use N = logical CPUs + 1. Setting the global environment variable (CONCURRENCY_LEVEL) instead of fixed values for -j for automated builds, using the previously mentioned #export CONCURRENCY_LEVEL=`getconf _NPROCESSORS_ONLN`, is always the safest bet, especially when using server-grade machines and high-speed, zero-seek-time solid state disks ...
I think the conclusion is the one offered above -- stick to N for optimal performance, unless you have a compelling reason not to. I should also emphasize that I don't have a background in computing of any sort, whereas the poster is a professional in the HPC field.
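For reference, getting N at invocation time is a one-liner on a typical GNU/Linux system -- just a sketch; both commands below report the number of online logical CPUs:

make -j"$(getconf _NPROCESSORS_ONLN)"
# or, if GNU coreutils is installed:
make -j"$(nproc)"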

So if I'm allowed to paraphrase and make conclusions:
for a very short compile, like the one in this post, you may find that N+1 seemingly gives a better result, since disk I/O plays a big part relative to the code generation (and whatever else a compiler does). For a longer, more 'normal' compilation, disk I/O plays a smaller part.

If your RAM is too small and the system has to swap to disk repeatedly, then that obviously increases the disk I/O as well.

In the end, the penalty for over-committing (http://verahill.blogspot.com.au/2013/01/321-compiling-kernel-372-on-debian.html) is large enough that it's a better bet to just go for N threads.

I really shouldn't be surprised -- it's the same effect you see when launching a computational job: you do NOT want to launch more threads than cores.

Original post:
I got a comment recently regarding the number of threads that should be used for make:
make -j7 is the number of cores +1 

Stop copy paste nonsense.... sigh...

make -j1 will spawn 1 worker process
-j7 will spawn 7. 

#export CONCURRENCY_LEVEL=`getconf _NPROCESSORS_ONLN`

makes adding -jjob unnecessary 
on an i7 this is the same as -j8

When in doubt check top.....

So the question is: for N cores, should you spawn N threads or N+1? The poster has a valid point -- there's not that much data on what the best configuration really is, and while most people keep repeating the (mostly) accepted N+1 (or 1.5*N) wisdom, we really need more hard numbers.

So here's my real-world unscientific benchmark for compiling Gromacs 4.5.5 on a six-core AMD Phenom II X6 1055T with 8 GB RAM and a slow 5400 rpm hard drive (so disk I/O plays into things as well). I'm using gcc 4.7.2-4 and Debian Wheezy/Testing.

To get the data I used this script, maketest.sh:

# start from a clean tree
make distclean
# pick up OpenMPI and OpenBLAS at run time
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/openblas/lib
# point the build at single-precision FFTW 3.3.2 and OpenBLAS
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/single/lib -L/opt/openblas/lib -lopenblas"
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/single/include -I/opt/openblas/include"
# single-precision, non-MPI build with external BLAS/LAPACK
./configure --disable-mpi --enable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_sp --prefix=/opt/gromacs/gromacs-4.5.5
# build with the number of jobs given as the first argument to the script
time make -j$1

which I called with e.g.
sh maketest.sh 6

Admittedly, this is a fairly short build but it is a 'real' one.
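A small wrapper along the following lines could be used to sweep over all the thread counts in one go -- just a sketch, assuming the maketest.sh above; timings.log is an arbitrary file name:

#!/bin/sh
# run maketest.sh for -j1 through -j12 and append each run's output
# (including the 'time' figures, which go to stderr) to timings.log
for n in $(seq 1 12)
do
    echo "=== make -j$n ===" >> timings.log
    sh maketest.sh "$n" >> timings.log 2>&1
done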

Results:
N    Time (real)
1    9 m 52 s
2    5 m 18 s
3    3 m 48 s
4    3 m 02 s
5    2 m 24 s
6    2 m 16 s
7    2 m 05 s
8    2 m 06 s
9    2 m 07 s
10   2 m 07 s
11   2 m 08 s
12   2 m 09 s
Or as a plot:
The build time decreases roughly exponentially with the number of threads before levelling off. The blue line marks the plateau at about 125 seconds, i.e. where the slope of the curve is effectively zero.
I'm actually quite surprised that N+1 turned out to be the best configuration, although in general it seems that you don't suffer any penalty for using more threads, so 1.5*N works just as well.

I also ran sar (sysstat; sar -u 1 180 | gawk '{print $3,$5,$8}' | tee n7.dat) for -j7 to see how the load varies with time during make (I collected a little bit of data before and after make, hence the flat line at the end):
The black (user) and blue (idle) lines are the interesting ones here.
The build is very evidently not perfectly parallel at all stages, and that will also affect the optimal number of threads/core.
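For the record, a minimal way of capturing such a trace alongside a build could look like this -- just a sketch adapted from the sar/gawk line above; n7.dat is an arbitrary file name:

# sample CPU usage once per second for 180 s in the background,
# keeping the %user, %system and %idle columns, then start the build
sar -u 1 180 | gawk '{print $3,$5,$8}' > n7.dat &
time make -j7
wait   # let sar finish its sampling window before looking at n7.dat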


Raw results:
N    real        user        sys
1    9m51.519s   6m43.316s   0m44.092s
2    5m18.359s   7m3.548s    0m46.112s
3    3m47.850s   7m22.732s   0m47.064s
4    3m2.131s    7m56.068s   0m41.744s
5    2m24.258s   7m53.140s   0m34.928s
6    2m16.429s   8m15.088s   0m27.160s
7    2m5.361s    7m50.200s   0m28.280s
8    2m5.820s    7m52.380s   0m27.548s
9    2m7.266s    7m54.344s   0m28.340s
10   2m7.057s    7m56.628s   0m27.872s
11   2m7.728s    7m58.276s   0m27.332s
12   2m8.819s    8m0.600s    0m27.544s
