20 September 2012

243. My own personal benchmarks for NWChem, gromacs with atlas, openblas, acml on AMD and intel

Update: you can compile against acml on intel as well, and against mkl on amd. I still need to do some performance testing to see how well that works. The artificial penalty for running mkl on AMD hardware is well-publicised and led to a lawsuit, but I don't know how acml fares on intel.
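Whichever way you go, it's worth double-checking which maths library a binary actually picked up at link time. ldd does the job for dynamically linked builds (the path below is just an example -- point it at wherever your binary lives); if the library was linked in statically nothing will show up, and you'll have to dig through the build log instead:

ldd /opt/nwchem/nwchem-6.1.1/bin/LINUX64/nwchem | grep -iE 'blas|lapack|acml|mkl|atlas'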


The title says it all, really. Since I'm back to exploring ways of improving the performance of my little cluster, I figured I'd break this out into a separate post. Most of this data has appeared here before: http://verahill.blogspot.com.au/2012/09/new-compute-node-using-amd-fx-8150.html

All units are running up-to-date debian testing (wheezy).

Configuration:
Boron (B): Phenom II X6, 2.8 GHz, 8 GB RAM (2.8*6 = 16.8 GFLOPS predicted)
Neon (Ne): FX-8150 X8, 3.6 GHz, 16 GB RAM (3.6*8 = 28.8 GFLOPS predicted (int); 3.6*4 = 14.4 GFLOPS (fpu))
Tantalum (Ta): quad-core i5-2400, 3.1 GHz, 8 GB RAM (3.1*4 = 12.4 GFLOPS predicted)
Vanadium (V): dual-socket 2x quad-core Xeon X3480, 3.06 GHz, 8 GB RAM. CentOS (ROCKS 5.4.3)/openblas.
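(The clock speeds, core counts and RAM above were simply read off each node with the standard tools, along these lines:)

grep 'model name' /proc/cpuinfo | sort | uniq -c
nproc
free -g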

Results

Gromacs --double (1 ns 6x6x6 nm tip4p water box; dynamic load balancing, double precision, 500k steps)
B  :  10.662 ns/day (11.8 GFLOPS, runtime 8104 seconds)***
B  :   9.921 ns/day (10.9 GFLOPS, runtime 8709 seconds)**
Ne :  10.606 ns/day (11.7 GFLOPS, runtime 8146 seconds)*
Ne :  12.375 ns/day (13.7 GFLOPS, runtime 6982 seconds)**
Ne :  12.385 ns/day (13.7 GFLOPS, runtime 6976 seconds)****
Ta :  10.825 ns/day (11.9 GFLOPS, runtime 7981 seconds)***
V  :  10.560 ns/day (11.7 GFLOPS, runtime 8182 seconds)***
* no external blas/lapack
** using ACML libs
*** using openblas
**** using ATLAS

Gromacs --single (1 ns 6x6x6 nm tip4p water box; dynamic load balancing, single precision, 500k steps)
B  :  17.251 ns/day (19.0 GFLOPS, runtime 5008 seconds)***
Ne :  21.874 ns/day (24.2 GFLOPS, runtime 3950 seconds)**
Ne :  21.804 ns/day (24.1 GFLOPS, runtime 3963 seconds)****
Ta :  17.345 ns/day (19.2 GFLOPS, runtime 4982 seconds)***
V  :  17.297 ns/day (19.1 GFLOPS, runtime 4995 seconds)***
* no external blas/lapack
** using ACML libs
*** using openblas
**** using ATLAS
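A run like the ones above is set up in the usual gromacs 4.x fashion, roughly as sketched below for the double-precision case. The file names are placeholders (the real runs used the 6x6x6 nm tip4p box and 500k steps given above, with the run parameters in the .mdp file), and the _d suffix denotes the double-precision binaries:

grompp_d -f md.mdp -c waterbox.gro -p topol.top -o water_md.tpr
mdrun_d -v -dlb yes -deffnm water_md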

NWChem (opt biphenyl cation, cp-md/pspw):
B  :  5951 seconds**
B  :  4084 seconds***
B  :  5782 seconds***xy
Ne :  3689 seconds**
Ta :  4102 seconds***
Ta :  4230 seconds***xy
V  :  5396 seconds***

* no external blas/lapack
** using ACML libs
*** using openblas
x Reconfigured using getmem.nwchem
y NWChem 6.1.1

NWChem (opt biphenyl cation, geovib, 6-31G**/ub3lyp):
B  :  2841 seconds**
B  :  2410 seconds***
B  :  2101 seconds***x
B  :  2196 seconds***xy
Ne :  1665 seconds**
Ta :  1785 seconds***
Ta :  1789 seconds***xy
V  :  2600 seconds***

* no external blas/lapack
** using ACML libs
*** using openblas
x Reconfigured using getmem.nwchem
y NWChem 6.1.1
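A note on the getmem.nwchem footnote: as far as I understand it, that contrib script works out how much RAM the machine has and relinks nwchem with matching default memory settings. The same thing can be controlled per job with a memory directive at the top of the input file, which is a quick way of checking whether memory limits are holding a node back -- the value here is only an example:

memory 2000 mb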

A Certain Commercial Ab Initio Package (Freq calc of pre-optimised H14C19O3 at 6-31+G*/rb3lyp):
B  :  2h 00 min (CPU time: 10h 37 min)
Ne :  1h 37 min (CPU time: 11h 13 min)
Ta :  1h 26 min (CPU time: 5h 27 min)
V  :  2h 15 min (CPU time: 15h 50 min)
Using precompiled binaries.


Gamess:
(I'm still working out how to run gamess efficiently, so take these values with a huge saucer of salt for now.) bn.inp does a geometry optimisation of the biphenyl cation (mult 2) at ub3lyp/6-31G**. bn.inp has no $STATPT card while bn3.inp does, and that makes a huge difference -- but is that just because the $STATPT card caps the optimisation at 20 steps (nstep=20) before killing the run? The default is 50 steps, and all the runs do seem to take the maximum number of steps and then exit.

Again, still learning -- see below for the input files, and I'll fix this post as I work out what I'm doing. The relative run times between nodes should still be comparable, but don't use these numbers to compare the speed of e.g. nwchem vs gamess.
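(For reference, jobs like these can be launched with the stock rungms script along the lines below -- 00 is the VERNO of a default build and the last argument is the number of cores, so adjust both to your installation:)

rungms bn 00 6 > bn.out 2>&1
rungms bn3 00 6 > bn3.out 2>&1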

Gamess using bn.inp with atlas
B:    9079 seconds
Ne: 7252 seconds
Ta:  9283 seconds

Gamess using bn.inp with openblas
B:   9071 seconds
Ta: 9297 seconds

Gamess using bn.inp with acml
Ne: 7062 seconds

Gamess using bn3.inp with atlas
B: 4016 seconds
Ne: 3162 seconds
Ta: 4114 seconds

MPQC:
Here I've used the version in the debian repos. I created a hostfile:
neon slots=8 max_slots=8
tantalum slots=4 max_slots=4
boron slots=6 max_slots=6

and then looked at changing the host order and slot assignments, as well as the total number of cores assigned, using mpirun (see the example below).
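A typical invocation then looks something like this (openmpi syntax; the input file name is just a placeholder):

mpirun --hostfile ~/hostfile -np 8 mpqc benzene.in > benzene.out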

Simple test case looking at the number of cores and their distribution:
n cores : seconds : configuration (cores, exec nodes)
 4 :  11 : 4(Ta)
 4 :  17 : 4(Ne)
 4 :  17 : 4(B)
 4 :  42 : 2(Ta)+2(B)
 6 :  12 : 6(B)
 6 :  13 : 6(Ne)
 6 :  74 : 2(Ta)+2(B)+2(Ne)
 8 :  12 : 8(Ne)
10 :  43 : 4(Ta)+6(B)
12 :  47 : 4(Ta)+8(Ne)
14 :  55 : 6(B)+8(Ne)
18 : 170 : 4(Ta)+6(B)+8(Ne)

My beowulf cluster doesn't seem to be much of a supercomputer. All in all, this looks like a pretty good argument in favour of upgrading to infiniband...
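The multi-node timings above are about what you'd expect from plain gigabit ethernet, where latency rather than bandwidth is what hurts tightly coupled jobs. A quick sanity check of the interconnect between two nodes (host names as in the hostfile above) could look like this:

ping -c 10 neon
iperf -s              # on neon
iperf -c neon -t 10   # on boron

Gigabit round-trip times are typically a tenth of a millisecond or more, whereas infiniband sits in the low-microsecond range -- which is the gap the 10+ core runs are paying for.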


bn.inp:
 $CONTRL 
COORD=CART UNITS=ANGS scftyp=uhf dfttyp=b3lyp runtyp=optimize 
ICHARG=1 MULT=2 maxit=100
$END
 $system mwords=2000 $end
 $BASIS gbasis=n31 ngauss=6 ndfunc=1 npfunc=1 $END
 $guess guess=huckel $end

 $DATA
biphenyl
C1
C      6.0      0.0000000000   -3.5630100000    0.0000000000 
C      6.0     -1.1392700000   -2.8592800000   -0.3938400000 
C      6.0     -1.1387900000   -1.4654500000   -0.3941500000 
C      6.0      0.0000000000   -0.7428100000    0.0000000000 
C      6.0      1.1387900000   -1.4654500000    0.3941500000 
C      6.0      1.1392700000   -2.8592800000    0.3938400000 
C      6.0      0.0000000000    0.7428100000    0.0000000000 
C      6.0      1.1387900000    1.4654500000   -0.3941500000 
C      6.0      1.1392700000    2.8592800000   -0.3938400000 
C      6.0     -1.1387900000    1.4654500000    0.3941500000 
C      6.0      0.0000000000    3.5630100000    0.0000000000 
C      6.0     -1.1392700000    2.8592800000    0.3938400000 
H      1.0      0.0000000000   -4.6489600000    0.0000000000 
H      1.0     -2.0282700000   -3.3966200000   -0.7116100000 
H      1.0     -2.0214800000   -0.9282700000   -0.7279300000 
H      1.0      2.0282700000   -3.3966200000    0.7116100000 
H      1.0      2.0282700000    3.3966200000   -0.7116100000 
H      1.0     -2.0214800000    0.9282700000    0.7279300000 
H      1.0      0.0000000000    4.6489600000    0.0000000000 
H      1.0     -2.0282700000    3.3966200000    0.7116100000 
H      1.0      2.0214800000    0.9282700000   -0.7279300000 
H      1.0      2.0214800000   -0.9282700000    0.7279300000 
 $END


bn3.inp:
 $CONTRL 
COORD=CART UNITS=ANGS scftyp=uhf dfttyp=b3lyp runtyp=optimize 
ICHARG=1 MULT=2 maxit=100
$END
 $system mwords=2000 $end
 $BASIS gbasis=n31 ngauss=6 ndfunc=1 npfunc=1 $END
 $STATPT OPTTOL=0.0001 NSTEP=20 HSSEND=.TRUE. $END
 $guess guess=huckel $end

 $DATA
biphenyl
C1
C      6.0      0.0000000000   -3.5630100000    0.0000000000 
C      6.0     -1.1392700000   -2.8592800000   -0.3938400000 
C      6.0     -1.1387900000   -1.4654500000   -0.3941500000 
C      6.0      0.0000000000   -0.7428100000    0.0000000000 
C      6.0      1.1387900000   -1.4654500000    0.3941500000 
C      6.0      1.1392700000   -2.8592800000    0.3938400000 
C      6.0      0.0000000000    0.7428100000    0.0000000000 
C      6.0      1.1387900000    1.4654500000   -0.3941500000 
C      6.0      1.1392700000    2.8592800000   -0.3938400000 
C      6.0     -1.1387900000    1.4654500000    0.3941500000 
C      6.0      0.0000000000    3.5630100000    0.0000000000 
C      6.0     -1.1392700000    2.8592800000    0.3938400000 
H      1.0      0.0000000000   -4.6489600000    0.0000000000 
H      1.0     -2.0282700000   -3.3966200000   -0.7116100000 
H      1.0     -2.0214800000   -0.9282700000   -0.7279300000 
H      1.0      2.0282700000   -3.3966200000    0.7116100000 
H      1.0      2.0282700000    3.3966200000   -0.7116100000 
H      1.0     -2.0214800000    0.9282700000    0.7279300000 
H      1.0      0.0000000000    4.6489600000    0.0000000000 
H      1.0     -2.0282700000    3.3966200000    0.7116100000 
H      1.0      2.0214800000    0.9282700000   -0.7279300000 
H      1.0      2.0214800000   -0.9282700000    0.7279300000 
 $END

6 comments:

  1. Hi Lindqvist,

    I've started testing NWChem and ECCE, and I'm not sure whether I've hit the limit of what my machine can do with NWChem, so I was hoping you could give me some hints.

    We currently have a Cray XE6m running, with two types of AMD cores (256 Abu Dhabi cores and 192 Istanbul cores). I am planning to do some meaningful MD, AIMD and QM calculations, but I am struggling with benchmarking the software. So far I have compiled GROMACS myself, and a Cray engineer helped compile NWChem on the machine.

    Running ldd on the nwchem binary gives the following:

    /packages/nwchem-6.1.1/bin/LINUX64/nwchem: /usr/lib64/libgfortran.so.3: version `GFORTRAN_1.4' not found (required by /packages/nwchem-6.1.1/bin/LINUX64/nwchem)
    /packages/nwchem-6.1.1/bin/LINUX64/nwchem: /usr/lib64/libgfortran.so.3: version `GFORTRAN_1.4' not found (required by /opt/cray/lib64/libga_gnu_47.so.0)
    linux-vdso.so.1 => (0x00007fffeb650000)
    libsci_gnu.so.2 => /opt/cray/lib64/libsci_gnu.so.2 (0x00007f0a5b93e000)
    libonesided.so.1 => /opt/cray/lib64/libonesided.so.1 (0x00007f0a5b732000)
    libnumatoolkit.so.1 => /opt/cray/lib64/libnumatoolkit.so.1 (0x00007f0a5b52c000)
    libga_gnu_47.so.0 => /opt/cray/lib64/libga_gnu_47.so.0 (0x00007f0a5b054000)
    libarmci_gnu_47.so.0 => /opt/cray/lib64/libarmci_gnu_47.so.0 (0x00007f0a5ae1a000)
    libdmapp.so.1 => /opt/cray/dmapp/default/lib64/libdmapp.so.1 (0x00007f0a5abe0000)
    libmpich.so.1 => /opt/cray/lib64/libmpich.so.1 (0x00007f0a5a745000)
    libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00007f0a5a46b000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f0a5a1f1000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f0a59fda000)
    libquadmath.so.0 => /opt/cray/lib64/cce/libquadmath.so.0 (0x00007f0a59da4000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f0a59a2f000)
    libmpich_gnu_47.so.1 => /opt/cray/lib64/libmpich_gnu_47.so.1 (0x00007f0a59594000)
    libfftw3.so.3 => /opt/cray/lib64/libfftw3.so.3 (0x00007f0a59199000)
    libfftw3f.so.3 => /opt/cray/lib64/libfftw3f.so.3 (0x00007f0a58d8d000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0a58b70000)
    libhugetlbfs.so => /usr/lib64/libhugetlbfs.so (0x00007f0a58951000)
    librca.so.0 => /opt/cray/rca/default/lib64/librca.so.0 (0x00007f0a5874d000)
    libAtpSigHandler.so.0 => /opt/cray/lib64/libAtpSigHandler.so.0 (0x00007f0a58547000)
    libsci_gnu_mp.so.2 => /opt/cray/lib64/libsci_gnu_mp.so.2 (0x00007f0a5703f000)
    libgomp.so.1 => /opt/gcc/default/snos/lib64/libgomp.so.1 (0x00007f0a56e30000)
    libugni.so.0 => /opt/cray/ugni/default/lib64/libugni.so.0 (0x00007f0a56c03000)
    libpmi.so.0 => /opt/cray/pmi/default/lib64/libpmi.so.0 (0x00007f0a569e2000)
    libudreg.so.0 => /opt/cray/udreg/default/lib64/libudreg.so.0 (0x00007f0a567da000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f0a5ccc6000)
    libmpl.so.0 => /opt/cray/lib64/libmpl.so.0 (0x00007f0a565d4000)
    libopa.so.1 => /opt/cray/lib64/libopa.so.1 (0x00007f0a563d2000)
    libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00007f0a561d0000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f0a55fc6000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f0a55dc2000)

    I am also running a simulation similar to your

    Gromacs --double (1 ns 6x6x6 nm tip4p water box; dynamic load balancing, double precision, 500k steps)

    but I am getting very slow step times. Could you comment on that?

    Many thanks!

    Replies
    1. Not sure I understand -- what happens when you try to run a simple nwchem job? Does it not work at all, or does it just run very slowly?

      The libgfortran.so.3 issue doesn't look good -- it should point to something like /usr/lib/x86_64-linux-gnu/libgfortran.so.3

      The only thing that I've seen affect performance a lot was the absence of libpthread, but I see that you've linked against that.

      I think the nwchem forum might be a good place to ask questions -- they have a better understanding of different architectures than I do, and they are familiar with different cray architectures: http://www.nwchem-sw.org/index.php/Special:AWCforum

      How bad is the gromacs performance? I'm even less confident about gromacs than about nwchem. I also have no experience with non-x86 architectures.



  2. The NWChem calculation on 64 Abu Dhabi cores took 7 hours of walltime for a GROMACS-type system: a 7.2 nm x 7.2 nm x 7.2 nm water box (12467 water molecules) with a three-amino-acid peptide inside it. That works out to roughly 3.4 ns a day, which is much slower than your runs on your cluster. I don't know the exact benchmark figures for the machine, but 64 cores should be around 600 GFLOPS on the Abu Dhabi platform I am running on. I guess that is really, really slow compared with your tests.

    Replies
    1. Ah. The gromacs systems above were run in gromacs 4.6 -- not nwchem.

    2. Yes, but... there is still a big gap between the two applications. I have posted the question to the NWChem user community:
      http://www.nwchem-sw.org/index.php/Special:AWCforum/st/id885/NWChem_MD_efficiency.html

    3. Fair enough re similar performance. Hopefully the nwchem devs can address it.
