24 May 2013

432. NWChem 6.3 -- COSMO is now fast(er)!

I probably would've been more excited about this about a year ago when I 'believed' in implicit solvation models (nothing's perfect, and we'll use what is practical so they do fill a strong need. They just aren't very informative for a lot of systems) but it's still a Good Thing.

COSMO has been done using numerical gradients in nwchem 6.1.1 and earlier versions, which has meant that it's been horrendously slow in many cases, in particular if you need to optimise a structure using implicit solvation. COSMO has been -- and still is -- the only implicit solvation model implemented in NWChem, so slow COSMO puts a bit of a spanner in the solvation energy works. Sometimes the calculation even refuses to converge at all.

In contrast, Gaussian has had a number of implicit solvation models implemented, ranging from the quick and dirty PCM, to slower (and better?) C-PCM and I-PCM.

So this is great news.

A quick example:


The test:
Here's a test job (the default cosmo parameters aren't realistic, but this is for testing purposes):
scratch_dir /scratch start benzene geometry units angstroms C 0.100 1.396 0.000 C 1.209 0.698 0.000 C 1.209 -0.698 0.000 C 0.000 -1.396 0.000 C -1.209 -0.698 0.000 C -1.209 0.698 0.000 H 0.000 2.479 0.000 H 2.147 1.240 0.000 H 2.147 -1.240 0.000 H 0.000 -2.479 0.000 H -2.147 -1.240 0.000 H -2.147 1.240 0.000 end basis H library "6-31+g*" c library "6-31+g*" end dft direct end cosmo end scf maxiter 999 end task dft
Note that this is the same test job (plus cosmo, minus optimize) as shown here: http://verahill.blogspot.com.au/2013/05/430-briefly-crude-comparison-of.html

The results:
And here is what I see using nwchem 6.3. (w/ acml 5.3.1, AMD FX 8150/32 gb ram):
6.1.1 19.4 seconds
6.3   14.3 seconds

The difference isn't significant (in the sense that times are too variable so we can't really tell which is faster for such a short job).

But when we change task dft to task dft optimize we get
6.1.1 Fails after 2600 seconds
6.3   128.3 seconds

6.3 churns through the steps pretty efficiently:
@ Step Energy Delta E Gmax Grms Xrms Xmax Walltime @ ---- ---------------- -------- -------- -------- -------- -------- -------- @ 0 -230.09337488 0.0D+00 0.07376 0.01302 0.00000 0.00000 18.1 @ 1 -230.10523734 -1.2D-02 0.00903 0.00231 0.03627 0.10509 45.7 @ 2 -230.10619442 -9.6D-04 0.00491 0.00084 0.01898 0.06082 69.1 @ 3 -230.10628696 -9.3D-05 0.00176 0.00030 0.00737 0.02428 93.3 @ 4 -230.10629787 -1.1D-05 0.00023 0.00005 0.00219 0.00682 115.8 @ 5 -230.10629827 -4.0D-07 0.00004 0.00001 0.00047 0.00136 128.2 @ 5 -230.10629827 -4.0D-07 0.00004 0.00001 0.00047 0.00136 128.2
while 6.1.1 drags itself along for almost an hour:
@ Step Energy Delta E Gmax Grms Xrms Xmax Walltime @ ---- ---------------- -------- -------- -------- -------- -------- -------- @ 0 -230.09389924 0.0D+00 0.07389 0.01306 0.00000 0.00000 691.4 @ 1 -230.10680306 -1.3D-02 0.01081 0.00197 0.03065 0.10438 1378.3 @ 2 -230.10690186 -9.9D-05 0.01000 0.00167 0.00231 0.00803 2092.2
before failing with
6:6:driver: task_gradient failed:: 0 (rank:6 hostname:neon pid:4536):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0 ------------------------------------------------------------------------ There is an error related to the specified geometry ------------------------------------------------------------------------

Sure, the optimization takes 128 seconds instead of ca 44 seconds, but for anyone who's used NWCHEM with COSMO in the past, that's actually not too bad.

I ran another job to get a better feeling for how much longer COSMO vs no COSMO takes for optimization. Optimization of Arecoline (available in ECCE as a fragment) at rb3lyp/6-31+G* takes 2h 5 min with COSMO (33 optimization steps). Without COSMO it takes 37 minutes and uses 14 steps.

431. Briefly: a crude comparison of performance of NWChem 6.1, 6.1.1 and 6.3.

Just a simple comparison of different versions of nwchem on different hardware. It's mostly interesting to myself as a general guide to how slow my nodes are in relative terms.

I built nwchem as shown here: http://verahill.blogspot.com.au/2013/05/424-nwchem-63-on-debian-wheezy.html
and openblas as shown here: http://verahill.blogspot.com.au/2013/05/423-openblas-on-debian-wheezy.html
and installed acml as shown here: http://verahill.blogspot.com.au/2013/05/422-set-up-acml-on-linux.html

I'm using ECCE: http://verahill.blogspot.com.au/2013/01/325-compiling-ecce-64-on-debian-testing.html
and SGE: http://verahill.blogspot.com.au/2012/06/setting-up-sun-grid-engine-with-three.html
I've set up ECCE similarly to what is shown here: http://verahill.blogspot.com.au/2012/06/ecce-in-virtual-machine-step-by-step.html

Test job:
scratch_dir /scratch
Title "opt freq"

Start  biphenyl_cation_twisted

echo

charge 1

geometry autosym units angstrom
 C     0.00000     -3.56301     0.00000
 C     -1.13927     -2.85928     -0.393841
 C     -1.13879     -1.46545     -0.394153
 C     0.00000     -0.742814     0.00000
 C     1.13879     -1.46545     0.394153
 C     1.13927     -2.85928     0.393841
 C     0.00000     0.742814     0.00000
 C     1.13879     1.46545     -0.394153
 C     1.13927     2.85928     -0.393841
 C     -1.13879     1.46545     0.394153
 C     0.00000     3.56301     0.00000
 C     -1.13927     2.85928     0.393841
 H     0.00000     -4.64896     0.00000
 H     -2.02827     -3.39662     -0.711607
 H     -2.02148     -0.928265     -0.727933
 H     2.02827     -3.39662     0.711607
 H     2.02827     3.39662     -0.711607
 H     -2.02148     0.928265     0.727933
 H     0.00000     4.64896     0.00000
 H     -2.02827     3.39662     0.711607
 H     2.02148     0.928265     -0.727933
 H     2.02148     -0.928265     0.727933
end

ecce_print ecce.out

basis "ao basis" cartesian print
  H library "6-31G**"
  C library "6-31G**"
END

dft
  mult 2
  XC b3lyp
  mulliken
end

driver
end

task dft optimize
task dft freq numerical

Results:
The jobs were run using all cores available.

AMD Phenom II X6 1055T, 8 Gb RAM, Openblas. Six cores.
6.1    2461
6.1.1  2114
6.3    2044
6.3    2048**

**using MKL compiled with ifort (http://verahill.blogspot.com.au/2013/07/469-intel-compiler-on-debian.html).

AMD FX 8150, 32 Gb RAM, acml 5.3.1 (gfortran, int64, fma4) -- earlier versions of nwchem were compiled against different versions of acml. Eight cores.
6.1    1619s
6.1.1  1588s
6.3    1611s
6.3    1507s**

**using MKL compiled with ifort (http://verahill.blogspot.com.au/2013/07/469-intel-compiler-on-debian.html).

Intel i5-2400, 16 Gb RAM, openblas. Four cores.
6.1    1689s
6.1.1  1696s
6.3    1652s
6.3    1550s*
6.3    1498s**

*using Intel MKL (see http://verahill.blogspot.com.au/2013/06/465-intel-mkl-math-kernel-library-on.html)
**using MKL compiled with ifort (http://verahill.blogspot.com.au/2013/07/469-intel-compiler-on-debian.html).

AMD Athlon II X3, 4 gb RAM, acm 5.3.1. Three cores.
6.3    4818s 
6.3    4058s**
**using MKL compiled with ifort (http://verahill.blogspot.com.au/2013/07/469-intel-compiler-on-debian.html).

23 May 2013

430. Strange issue with NWChem, openmpi, SGE and ECCE

This one's a bit odd.

Odd in the sense that

  • the math libs (acml) I'm using should be suitable for the processors that I'm using them for.
  • it only happens when I submit with ECCE + SGE. Calcs on the input files are fine if I launch the by hand



The problem:
I'm having issues launching jobs on two nodes where the nwchem 6.3. binaries were compiled against acml 5.3.1 (gfortran, int64). I'm launching the jobs from ECCE and I've got SGE set up and working since a long time. My two other nodes, one i5-2400 linked against openblas, and one AMD FX 8150 linked against acml 5.3.1 (gfortran, fma4, int64) work absolutely fine.

Both binaries were linked with acml using
export BLASOPT="-L/opt/acml/acml5.3.1/gfortran64_int64/lib -lacml"
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/acml/acml5.3.1/gfortran64_int64/lib"

The first node is an AMD phenom II X6 1055, while the second one is an ancient, recently-revived AMD Athlon X2 3800+. The acml util cpuid.exe gives
Chip manufacturer: AuthenticAMD AuthenticAMD family 15 extended family 1 model 10 Model Name: AMD Phenom(tm) II X6 1055T Processor Chip supports SSE Chip supports SSE2 Chip supports SSE3 Chip does not support AVX Chip does not support FMA3 Chip does not support FMA4
and
Model Name: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ Chip supports SSE Chip supports SSE2 Chip supports SSE3 Chip does not support AVX Chip does not support FMA3 Chip does not support FMA4
respectively. On the AMD Phenom II X6 1055T I kept getting
Scaling coordinates for geometry "geometry" by 1.889725989 (inverse scale = 0.529177249) 0:Illegal Instruction error, status=: 4 (rank:0 hostname:boron pid:12386):ARMCI DASSERT fail. ../../ga-5-2/armci/src/ common/signaltrap.c:SigIllHandler():276 cond:0
. On the Athlon 64 X2 3800+ the job would just exit at
Directory information --------------------- 0 permanent = . 0 scratch = /home/me/scratch
There would be no other errors (in e.g. .po or .o files).

If I launch the job by hand, e.g.
mpirun -n 6 nwchem nwch.nw
it works fine.



The Partial solution
The errors for the AMD Phenom II X6 1055T went away when I instead of acml used openblas:
export BLASOPT="-L/opt/openblas/lib -lopenblas"
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/openblas/lib"

See e.g. http://verahill.blogspot.com.au/2013/05/424-nwchem-63-on-debian-wheezy.html for general compilation instructions.

The odd thing:
With openblas the AMD Athlon X2 3800+ suddenly gives
Scaling coordinates for geometry "geometry" by 1.889725989 (inverse scale = 0.529177249) 0:Illegal Instruction error, status=: 4 (rank:0 hostname:beryllium pid:9267):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigIllHandler():276 cond:0