10 June 2013

443. Briefly: Running the QA tests in NWChem

To make sure that everything is working properly and that you get the expected results from your nwchem binaries, you should run the QA tests that come with nwchem.

Here's how to do it with the nwchem 6.3 QA tests.

I'm presuming that you built nwchem with mpi as shown e.g. here: http://verahill.blogspot.com.au/2013/05/424-nwchem-63-on-debian-wheezy.html

In this particular case I'm using nwchem linked with openblas on an AMD Phenom II 1055T with 8 GB RAM, since it was the only node that was free.

0. Go to the QA directories.
In my case everything is housed in /opt/nwchem/nwchem-6.3-src.patched and the QA tests are in /opt/nwchem/nwchem-6.3-src.patched/QA

1. Run the tests
First set the environment variables, then start the tests. The 6 in './doqmtests.mpi 6' is the number of processes, i.e. processor cores, to use in parallel.

export NWCHEM_TOP=/opt/nwchem/nwchem-6.3-src.patched
export NWCHEM_TARGET=LINUX64
./doqmtests.mpi 6 |tee doqmtests.mpi.log
======================================================
QM: Running all tests (including some really big ones)
======================================================

Running tests/h2o_opt/h2o_opt

     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)

26.3u 8.8s 0:07.13 492.8% (0t+0ds+0avg+49046max)k 0i+6199464o 18pf 0swaps
     verifying output ... OK

Running tests/c2h4/c2h4

     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)

55.9u 2.1s 0:10.92 532.3% (0t+0ds+0avg+59834max)k 0i+808848o 19pf 0swaps
     verifying output ... OK
[..]
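Since a full QA run takes a while, it can be worth checking the setup before launching it. Here is a minimal sketch; check_qa_env is a made-up helper name (not part of nwchem), and it assumes that the binary sits in $NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem (the path printed in the log) and that doqmtests.mpi sits in $NWCHEM_TOP/QA, as in my setup.

```shell
# check_qa_env: hypothetical helper, not part of nwchem.
# Returns 0 if both variables are set and the expected files exist.
check_qa_env() {
    if [ -z "${NWCHEM_TOP:-}" ] || [ -z "${NWCHEM_TARGET:-}" ]; then
        echo "NWCHEM_TOP and NWCHEM_TARGET must both be set" >&2
        return 1
    fi
    # Binary location as printed by the QA script ("running nwchem (...)").
    if [ ! -x "$NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem" ]; then
        echo "no executable at $NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem" >&2
        return 1
    fi
    # Assumption: the QA scripts live in $NWCHEM_TOP/QA.
    if [ ! -f "$NWCHEM_TOP/QA/doqmtests.mpi" ]; then
        echo "no doqmtests.mpi in $NWCHEM_TOP/QA" >&2
        return 1
    fi
    echo "environment looks sane"
}
```

It could be used as: check_qa_env && ./doqmtests.mpi 6 | tee doqmtests.mpi.log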

2. Verify

Once the runs are done, go through the log to find out which, if any, failed. In my case, I had
Running tests/autosym/autosym 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
15.3u 2.5s 0:04.20 426.4% (0t+0ds+0avg+48842max)k 0i+1325784o 17pf 0swaps
     verifying output ... failed

Running tests/dft_s12gh/dft_s12gh 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
619.4u 4.3s 1:45.33 592.1% (0t+0ds+0avg+76724max)k 2472i+1938952o 26pf 0swaps
     verifying output ... failed
 
Failed
 
Running tests/cosmo_trichloroethene/cosmo_trichloroethene 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
113.6u 2.3s 0:19.57 592.8% (0t+0ds+0avg+63700max)k 64i+1149456o 20pf 0swaps
     verifying output ... failed

 Running tests/bsse_dft_trimer/bsse_dft_trimer 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
228.2u 4.8s 0:40.16 580.5% (0t+0ds+0avg+53806max)k 0i+2716224o 21pf 0swaps
     verifying output ... failed
 
Failed

and so on. Not a good start.
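Reading through a long log by eye gets tedious. As a rough sketch, the failed test names can be pulled out of the log with awk; list_failed is a made-up helper name, and it assumes the log layout shown above (a 'Running tests/...' line followed later by a 'verifying output ...' line).

```shell
# list_failed: hypothetical helper that prints the name of every test
# whose "verifying output" line reports a failure.
list_failed() {
    awk '
        /Running tests\// {
            # Remember the most recent test; keep the last path component.
            n = split($NF, parts, "/")
            name = parts[n]
        }
        # Matches "verifying output ... failed" as well as the
        # "nwparse.pl failed on ..." messages printed on the same line.
        /verifying output/ && /failed/ { print name }
    ' "$1"
}
```

For example: list_failed doqmtests.mpi.log > fails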

3. Troubleshoot the failed tests
To find out whether the failures are significant, we first need to understand how the script is doing the testing.

In runtests.mpi.unix:

# Now verify the output

echo -n "     verifying output ... "

perl $NWPARSE $STUB.out >& /dev/null
if ($status) then
  echo nwparse.pl failed on test output $STUB.out
  set overall_status = 1
  continue
endif
perl $NWPARSE $STUB.ok.out >& /dev/null
if ($status) then
  echo nwparse.pl failed on verified output $STUB.ok.out
  set overall_status = 1
  continue
endif

diff -w $STUB.ok.out.nwparse $STUB.out.nwparse >& /dev/null
@ diff1status = $status
#
endif
#

if ($diff1status) then
  echo "failed"
  set overall_status = 1
  continue
else

In my case autosym failed:

cd testoutputs
diff autosym.ok.out.nwparse autosym.out.nwparse 
45c45
< Effective nuclear repulsion energy (a.u.) 4265.6221
---
> Effective nuclear repulsion energy (a.u.) 4265.6222
It seems to be a rounding error. As far as I know, the data is stored at a significantly higher precision than it is reported at, so this isn't necessarily a problem (it's still not a good thing, though). Note that everything else, such as the thermochemical parameters, is identical.
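To illustrate how a tiny internal difference can flip the last reported digit, here is a small sketch. The two values are invented for the illustration (they are not taken from the nwchem output); they differ by only 2e-8, but land on opposite sides of the rounding boundary when printed with four decimals.

```shell
# show_rounding: hypothetical demo with made-up values that straddle
# the rounding boundary at 4265.62215.
show_rounding() {
    awk 'BEGIN {
        a = 4265.62214999
        b = 4265.62215001
        # Print both with four decimals, the precision seen in the diff.
        printf "%.4f\n%.4f\n", a, b
    }'
}
show_rounding
# prints:
# 4265.6221
# 4265.6222
```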

Continuing:
diff dft_s12gh.ok.out.nwparse dft_s12gh.out.nwparse 
52c52
< The Zero-Point Energy (Kcal/mol) = 21.82496
---
> The Zero-Point Energy (Kcal/mol) = 21.82497
128c128
< H 0.0123 0.0030 0.0000
---
> H 0.0122 0.0030 0.0000
Same thing.
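Since the QA script uses a plain 'diff -w', any change in the last digit counts as a failure. One way to separate rounding noise from real differences is to compare the numbers with a tolerance instead. The following is only a sketch: diff_tol is a made-up helper (nothing like it ships with the QA scripts as far as I know), and the default absolute tolerance of 0.00015 is an arbitrary choice that happens to cover a one-unit change in the fourth decimal.

```shell
# diff_tol OKFILE OUTFILE [TOL]: hypothetical tolerance-aware comparison.
# Fields are split on whitespace (similar in spirit to diff -w); numeric
# fields may differ by up to TOL, everything else must match exactly.
# Returns 0 if the files agree within TOL, 1 otherwise.
diff_tol() {
    awk -v tol="${3:-0.00015}" '
        function isnum(s) {
            return s ~ /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/
        }
        BEGIN { status = 0; tol = tol + 0 }
        # First file: store every line of the verified output.
        NR == FNR { ref[FNR] = $0; nref = FNR; next }
        # Second file: compare field by field with the stored line.
        {
            nout = FNR
            n = split(ref[FNR], a)
            bad = (n != NF)
            for (i = 1; i <= NF && !bad; i++) {
                if (a[i] == $i) continue
                if (isnum(a[i]) && isnum($i)) {
                    d = a[i] - $i
                    if (d < 0) d = -d
                    if (d > tol) bad = 1
                } else {
                    bad = 1
                }
            }
            if (bad) {
                printf "line %d:\n< %s\n> %s\n", FNR, ref[FNR], $0
                status = 1
            }
        }
        END {
            if (nout != nref) { print "files differ in length"; status = 1 }
            exit status
        }
    ' "$1" "$2"
}
```

For example, 'diff_tol autosym.ok.out.nwparse autosym.out.nwparse' would stay silent for the diff above, while passing a tighter tolerance such as 0.00001 as the third argument would report it again.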

Here's a list of the tests that failed for me ('->' indicates that the execution itself failed, with more details below; '*' indicates that the test is expected to fail):

autosym
dft_s12gh
cosmo_trichloroethene
bsse_dft_trimer
cosmo_h3co
cosmo_h3co_gp
h2o_diag_to_cg_ub3lyp
* oh2
dft_cr2
dft_x
dft_ozone
hess_nh3_ub3lyp
pspw_SiC
paw
-> tddft_h2o_mxvc20
-> tddft_h2o_uhf_mxvc20
tce_cr_eom_t_ch_rohf
hi_zora_sf
o2_zora_so
lys_qmmm
ethane_qmmm
qmmm_opt0
prop_ch3f
ch3f-lc-wpbe
ch3f-lc-wpbeh
ch3radical_rot
ch3radical_unrot
cho_bp_props
-> prop_cg_nh3_b3lyp
acr-camb3lyp-cdfit
acr-camb3lyp-direct
acr_lcblyp
o2_bnl
fh_m06 ???
disp_dimer_ch4
disp_dimer_ch4_cgmin
mep-test
sif_sodft
h2o_raman_3
h2o_raman_4
tropt-ch3nh2
h3_dirdyvtst
h2o_hcons
etf_hcons
cho_bp_props
-> dntmc_h2o_nh3
5h2o_core
co_core
talc
neb-fch3cl
neb-isobutene
nwxc_pspw_1he
nwxc_pspw_1ne
nwxc_pspw_4n
nwxc_pspw_4p
nwxc_pspw_new_1he
nwxc_pspw_new_3he
nwxc_pspw_new_1ne
nwxc_pspw_new_4n
nwxc_pspw_new_1ar
nwxc_pspw_new_4p
nwxc_pspw_new_1kr
nwxc_pspw_new_4as
nwxc_pspw_new_1xe
nwxc_pspw_new_4sb
hess_nh3_dimer
pbo_nesc1e
h2o_selci
hess_biph
-> ch4_zts
-> ch4cl_zts

All the jobs without a '->' or '*' failed due to rounding errors. To go through them quickly, I put the list of failed jobs in a file called fails, and then ran
cat fails |xargs -I {} diff testoutputs/{}.ok.out.nwparse testoutputs/{}.out.nwparse|less
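The xargs one-liner prints the diffs back to back, so it can be hard to tell which hunk belongs to which test. A slightly longer sketch that prints the test name before each diff is shown below; show_fail_diffs is a made-up helper name, and it assumes the same file layout as the command above.

```shell
# show_fail_diffs LIST DIR: hypothetical helper.
# LIST holds one test name per line; DIR holds the .nwparse files.
show_fail_diffs() {
    while IFS= read -r name; do
        # Skip empty lines in the list.
        [ -n "$name" ] || continue
        echo "=== $name ==="
        # diff exits non-zero when the files differ; that is expected here.
        diff -w "$2/$name.ok.out.nwparse" "$2/$name.out.nwparse" || true
    done < "$1"
}
```

Usage: show_fail_diffs fails testoutputs | less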


The jobs that failed outright are listed below:

-> tddft_h2o_mxvc20
tddft_diagon: negative excitation energy 0
------------------------------------------------------------------------
This type of error is most commonly associated with calculations not reaching convergence criteria

-> tddft_h2o_uhf_mxvc20
Last System Error Message from Task 5:: Numerical result out of range
tddft_diagon: negative excitation energy 0

-> prop_cg_nh3_b3lyp
task hessian incompatible with cgmin 0
------------------------------------------------------------------------
A feature requested has not yet been implemented

-> dntmc_h2o_nh3
********** Destroying SubGroups ***********
*******************************************
deleting cloned rtdb
deleting cloned rtdb
Closing subgroup
Closing subgroup
1:1:ga_pgroup_destroy_:Attempt to destroy process group with attached GAs:: 2
(rank:1 hostname:boron pid:1941):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
2:2:ga_pgroup_destroy_:Attempt to destroy process group with attached GAs:: 2
(rank:2 hostname:boron pid:1942):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
Last System Error Message from Task 1:: No such file or directory
Last System Error Message from Task 2:: No such file or directory

-> ch4_zts
scf string failed 0
------------------------------------------------------------------------
This type of error is most commonly associated with calculations not reaching convergence criteria

-> ch4cl_zts
scf string failed 0
------------------------------------------------------------------------
This type of error is most commonly associated with calculations not reaching convergence criteria

It's time to go back and compare with 1. nwchem-6.3/acml and 2. nwchem-6.1.1/openblas and 3. a different processor architecture...

The question is how serious this is. In most cases I think the rounding errors are ok, but errors do accumulate, and especially when large and small numbers are multiplied they can become significant.
