Odd in the sense that
- the math libs (acml) I'm using should be suitable for the processors that I'm using them on.
- it only happens when I submit with ECCE + SGE. Calculations on the same input files run fine if I launch them by hand.
The problem:
I'm having issues launching jobs on two nodes where the nwchem 6.3 binaries were compiled against acml 5.3.1 (gfortran, int64). I'm launching the jobs from ECCE, and I've had SGE set up and working for a long time. My two other nodes, one i5-2400 linked against openblas and one AMD FX 8150 linked against acml 5.3.1 (gfortran, fma4, int64), work absolutely fine.
Both binaries were linked with acml using
export BLASOPT="-L/opt/acml/acml5.3.1/gfortran64_int64/lib -lacml"
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/acml/acml5.3.1/gfortran64_int64/lib"
The first node is an AMD Phenom II X6 1055T, while the second one is an ancient, recently-revived AMD Athlon X2 3800+. The acml util cpuid.exe gives
Chip manufacturer: AuthenticAMD
AuthenticAMD family 15 extended family 1 model 10
Model Name: AMD Phenom(tm) II X6 1055T Processor
Chip supports SSE
Chip supports SSE2
Chip supports SSE3
Chip does not support AVX
Chip does not support FMA3
Chip does not support FMA4
and
Model Name: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
Chip supports SSE
Chip supports SSE2
Chip supports SSE3
Chip does not support AVX
Chip does not support FMA3
Chip does not support FMA4
respectively. On the AMD Phenom II X6 1055T I kept getting
Scaling coordinates for geometry "geometry" by 1.889725989
(inverse scale = 0.529177249)
0:Illegal Instruction error, status=: 4
(rank:0 hostname:boron pid:12386):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigIllHandler():276 cond:0
On the Athlon 64 X2 3800+ the job would just exit at
Directory information
---------------------
0 permanent = .
0 scratch = /home/me/scratch
There would be no other errors (in e.g. .po or .o files).
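Since the failure is an illegal instruction, one cross-check on the cpuid.exe output is to read the kernel's own flag list from /proc/cpuinfo on each node. The following is only a sketch: `has_cpu_flag` is a hypothetical helper name, and it assumes a Linux x86 node, where SSE3 is reported as `pni` and FMA3 as `fma`.

```shell
# Hypothetical helper: succeed if a flag appears in a space-separated flag list.
# $1 = flag name (e.g. fma4), $2 = flag list
has_cpu_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On Linux x86, the first "flags" line lists the supported instruction set
# extensions. The variable stays empty if the file or the line is missing.
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2 || true)

# pni = SSE3, fma = FMA3 in /proc/cpuinfo naming.
for f in sse2 pni avx fma fma4; do
    if has_cpu_flag "$f" "$flags"; then
        echo "$f: supported"
    else
        echo "$f: not supported"
    fi
done
```

If a library build expects an extension that shows up as "not supported" here, that would be one possible explanation for the SIGILL, though it does not by itself explain why runs launched by hand succeed.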
If I launch the job by hand, e.g.
mpirun -n 6 nwchem nwch.nw
it works fine.
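Because the same input runs by hand but fails under ECCE + SGE, it may be worth comparing the environment the two launch paths see. This is a guess at a useful diagnostic, not a confirmed cause. `dump_runtime_env` is a hypothetical helper that prints the settings that decide which nwchem binary and which shared libraries get used.

```shell
# Hypothetical helper: print the settings that decide which binary and
# which shared libraries a job ends up using.
dump_runtime_env() {
    echo "host=$(uname -n)"
    echo "PATH=${PATH:-}"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}"
    echo "nwchem=$(command -v nwchem || echo 'not found')"
}

# Print a snapshot for the current shell.
dump_runtime_env
```

Run it once in the interactive shell used for the mpirun test and once from inside the job script that SGE executes on the same node, save both outputs, and diff them. A different `nwchem=` path or a missing library directory would point at the submission environment rather than at the math library.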
The partial solution:
The errors on the AMD Phenom II X6 1055T went away when I used openblas instead of acml:
export BLASOPT="-L/opt/openblas/lib -lopenblas"
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/openblas/lib"
See e.g. http://verahill.blogspot.com.au/2013/05/424-nwchem-63-on-debian-wheezy.html for general compilation instructions.
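To confirm which BLAS a given nwchem binary actually resolves at run time, something like the following could be run on each node. It is a sketch under assumptions: `filter_blas_libs` is a hypothetical helper, and the check only says something if the BLAS was linked as a shared library (a statically linked BLAS will not appear in ldd output).

```shell
# Hypothetical helper: keep only BLAS-looking lines from ldd output on stdin.
filter_blas_libs() {
    grep -i -E 'acml|openblas|libblas' || true
}

# Resolve the nwchem found on PATH and list its dynamically linked BLAS, if any.
nwchem_bin=$(command -v nwchem || true)
if [ -n "$nwchem_bin" ]; then
    ldd "$nwchem_bin" 2>/dev/null | filter_blas_libs || true
else
    echo "nwchem not found on PATH"
fi
```

An empty result for an existing binary suggests static linking, in which case the BLASOPT value used at build time is the only record of which library went in.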
The odd thing:
With openblas the AMD Athlon X2 3800+ suddenly gives
Scaling coordinates for geometry "geometry" by 1.889725989
(inverse scale = 0.529177249)
0:Illegal Instruction error, status=: 4
(rank:0 hostname:beryllium pid:9267):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigIllHandler():276 cond:0