Lindqvist -- a blog about Linux and Science. Mostly.: 443. Briefly: Running the QA tests in NWChem

To make sure that everything is working properly and that you get the expected results from your nwchem binaries, you should run the QA tests that come with nwchem.

Here's how to do it with the nwchem 6.3 QA tests.

I'm presuming that you built nwchem with mpi as shown e.g. here: http://verahill.blogspot.com.au/2013/05/424-nwchem-63-on-debian-wheezy.html

In this particular case I'm using nwchem linked against openblas on an AMD Phenom II 1055T with 8 GB RAM, since it was the only node that was free.

0. Go to the QA directory
In my case everything is housed in /opt/nwchem/nwchem-6.3-src.patched and the QA tests are in /opt/nwchem/nwchem-6.3-src.patched/QA

1. Run the tests
First set the environment variables, then start the tests. The 6 in './doqmtests.mpi 6' is the number of MPI processes, i.e. processor cores, to use in parallel.

export NWCHEM_TOP=/opt/nwchem/nwchem-6.3-src.patched
export NWCHEM_TARGET=LINUX64
./doqmtests.mpi 6 |tee doqmtests.mpi.log

======================================================
 QM: Running all tests (including some really big ones)
 ======================================================

 
 Running tests/h2o_opt/h2o_opt 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
26.3u 8.8s 0:07.13 492.8% (0t+0ds+0avg+49046max)k 0i+6199464o 18pf 0swaps
     verifying output ... OK
 
 Running tests/c2h4/c2h4 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
55.9u 2.1s 0:10.92 532.3% (0t+0ds+0avg+59834max)k 0i+808848o 19pf 0swaps
     verifying output ... OK
[..]
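
The three commands above can be wrapped in a small function so that a run is easy to repeat. This is only a sketch: the name run_qa and its arguments are made up for illustration, and it assumes the QA directory sits directly under NWCHEM_TOP as in this post. The body runs in a subshell, so neither the cd nor the exported variables leak into your session.

```shell
# Hypothetical wrapper around the commands shown above.
# Usage: run_qa /path/to/nwchem-source [number_of_processes]
run_qa() (
    top=${1:?usage: run_qa NWCHEM_TOP [NPROC]}
    nproc=${2:-6}                      # default of 6 matches the run in this post

    export NWCHEM_TOP="$top"
    export NWCHEM_TARGET=LINUX64

    cd "$NWCHEM_TOP/QA" || exit 1      # the test script is started from the QA directory
    ./doqmtests.mpi "$nproc" | tee doqmtests.mpi.log
)
```

For the setup described here that would be: run_qa /opt/nwchem/nwchem-6.3-src.patched 6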

2. Verify

Once the runs are done, go through the log to find out which, if any, failed. In my case, I had

Running tests/autosym/autosym 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
15.3u 2.5s 0:04.20 426.4% (0t+0ds+0avg+48842max)k 0i+1325784o 17pf 0swaps
     verifying output ... failed

Running tests/dft_s12gh/dft_s12gh 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
619.4u 4.3s 1:45.33 592.1% (0t+0ds+0avg+76724max)k 2472i+1938952o 26pf 0swaps
     verifying output ... failed
 
Failed
 
Running tests/cosmo_trichloroethene/cosmo_trichloroethene 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
113.6u 2.3s 0:19.57 592.8% (0t+0ds+0avg+63700max)k 64i+1149456o 20pf 0swaps
     verifying output ... failed

 Running tests/bsse_dft_trimer/bsse_dft_trimer 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
 
228.2u 4.8s 0:40.16 580.5% (0t+0ds+0avg+53806max)k 0i+2716224o 21pf 0swaps
     verifying output ... failed
 
Failed

and so on. Not a good start.
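
Rather than reading the whole log by eye, the failed tests can be pulled out with awk. A minimal sketch, assuming the log has exactly the layout shown above (a 'Running tests/NAME/NAME' line followed later by a 'verifying output ...' line); the function name failed_tests is made up, and any other kind of failure message is not handled.

```shell
# Hypothetical helper: print the name of every test whose
# "verifying output ..." line does not end in OK.
# Usage: failed_tests [logfile]   (defaults to doqmtests.mpi.log)
failed_tests() {
    awk '
        # Remember the most recent test name, e.g. tests/autosym/autosym -> autosym
        $1 == "Running" && $2 ~ /^tests\// { n = split($2, p, "/"); name = p[n] }
        # Anything other than OK at the end of the verify line counts as a failure
        /verifying output/ && $NF != "OK"  { print name }
    ' "${1:-doqmtests.mpi.log}"
}
```

Redirecting the output to a file, e.g. failed_tests doqmtests.mpi.log > fails, gives one test name per line.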

3. Troubleshoot the failed tests
To find out whether the failures are significant, we first need to understand how the script is doing the testing.

The relevant part of runtests.mpi.unix is:


# Now verify the output

    echo -n "     verifying output ... "

    perl $NWPARSE $STUB.out >& /dev/null
    if ($status) then
      echo nwparse.pl failed on test output $STUB.out
      set overall_status = 1
      continue
    endif
    perl $NWPARSE $STUB.ok.out >& /dev/null
    if ($status) then
      echo nwparse.pl failed on verified output $STUB.ok.out
      set overall_status = 1
      continue
    endif

    diff -w $STUB.ok.out.nwparse $STUB.out.nwparse >& /dev/null
    @ diff1status = $status
#
  endif
#

  if ($diff1status) then
    echo "failed"
    set overall_status = 1
    continue
  else
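
The decisive step is the diff -w call. As a sketch, here is the same check as a plain sh function (the name same_after_parse is made up): it succeeds only if the two parsed files are identical apart from whitespace, so a change in the last printed digit of a single number is enough to fail.

```shell
# Hypothetical sh equivalent of the diff step in the csh script above.
# Exit status 0: files identical apart from whitespace; 1: they differ.
same_after_parse() {
    diff -w "$1" "$2" > /dev/null 2>&1
}
```

For example: same_after_parse testoutputs/autosym.ok.out.nwparse testoutputs/autosym.out.nwparse && echo OK || echo failed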

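A test is therefore marked as failed whenever diff -w finds any difference between the parsed reference output (.ok.out.nwparse) and the parsed test output (.out.nwparse), even if it is only in the last reported digit.

In my case autosym failed: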

cd testoutputs
diff autosym.ok.out.nwparse autosym.out.nwparse 

45c45
< Effective nuclear repulsion energy (a.u.) 4265.6221
---
> Effective nuclear repulsion energy (a.u.) 4265.6222

It seems to be a rounding error. As far as I know the precision at which the data is stored is significantly higher than that at which it is reported, so this isn't necessarily a problem (it's still not a good thing though). Note that everything else, such as the thermochemical parameters, is identical.

Continuing:

diff dft_s12gh.ok.out.nwparse dft_s12gh.out.nwparse 

52c52
< The Zero-Point Energy (Kcal/mol) = 21.82496
---
> The Zero-Point Energy (Kcal/mol) = 21.82497
128c128
<           H     0.0123     0.0030     0.0000
---
>           H     0.0122     0.0030     0.0000

Same thing.
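
To separate this kind of last-digit difference from real disagreements, the two parsed files can be compared field by field with a tolerance. This is a rough sketch and not part of the NWChem QA scripts: the function name nwparse_close and the default absolute tolerance of 2e-4 are arbitrary assumptions (just large enough to accept a change of one in the fourth decimal), and what counts as acceptable depends on the quantity being compared.

```shell
# Hypothetical helper: exit 0 if two files have the same lines and fields,
# with numeric fields allowed to differ by at most an absolute tolerance.
# Usage: nwparse_close reference_file test_file [tolerance]
nwparse_close() {
    awk -v tol="${3:-2e-4}" '
        function isnum(s) {
            return s ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/
        }
        BEGIN { nref = 0; ntest = 0; bad = 0 }
        # First file: store every line of the reference output
        NR == FNR { ref[FNR] = $0; nref = FNR; next }
        # Second file: compare each field with the stored reference line
        {
            ntest = FNR
            n = split(ref[FNR], a, " ")
            if (n != NF) { bad = 1; exit }
            for (i = 1; i <= NF; i++) {
                if (a[i] == $i) continue
                if (isnum(a[i]) && isnum($i)) {
                    d = a[i] - $i
                    if (d < 0) d = -d
                    if (d <= tol) continue
                }
                bad = 1; exit
            }
        }
        # Fail on any mismatch or if the files have different line counts
        END { exit (bad || nref != ntest) }
    ' "$1" "$2"
}
```

For example: nwparse_close testoutputs/dft_s12gh.ok.out.nwparse testoutputs/dft_s12gh.out.nwparse && echo "rounding only"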

Here's a list of the tests that failed for me (-> indicates that the execution failed, with more details below; * indicates that the test is expected to fail):

autosym
dft_s12gh
cosmo_trichloroethene
bsse_dft_trimer
cosmo_h3co
cosmo_h3co_gp
h2o_diag_to_cg_ub3lyp
* oh2
dft_cr2
dft_x
dft_ozone
hess_nh3_ub3lyp
pspw_SiC
paw
-> tddft_h2o_mxvc20
-> tddft_h2o_uhf_mxvc20
tce_cr_eom_t_ch_rohf
hi_zora_sf
o2_zora_so
lys_qmmm
ethane_qmmm
qmmm_opt0
prop_ch3f
ch3f-lc-wpbe
ch3f-lc-wpbeh
ch3radical_rot
ch3radical_unrot
cho_bp_props
-> prop_cg_nh3_b3lyp
acr-camb3lyp-cdfit
acr-camb3lyp-direct
acr_lcblyp
o2_bnl
fh_m06 ???
disp_dimer_ch4
disp_dimer_ch4_cgmin
mep-test
sif_sodft
h2o_raman_3
h2o_raman_4
tropt-ch3nh2
h3_dirdyvtst
h2o_hcons
etf_hcons
cho_bp_props
-> dntmc_h2o_nh3
5h2o_core
co_core
talc
neb-fch3cl
neb-isobutene
nwxc_pspw_1he
nwxc_pspw_1ne
nwxc_pspw_4n
nwxc_pspw_4p
nwxc_pspw_new_1he
nwxc_pspw_new_3he
nwxc_pspw_new_1ne
nwxc_pspw_new_4n
nwxc_pspw_new_1ar
nwxc_pspw_new_4p
nwxc_pspw_new_1kr
nwxc_pspw_new_4as
nwxc_pspw_new_1xe
nwxc_pspw_new_4sb
hess_nh3_dimer
pbo_nesc1e
h2o_selci
hess_biph
-> ch4_zts
-> ch4cl_zts

All the jobs without a '->' or '*' failed due to rounding errors. To quickly go through them I put the names of the failed jobs, one per line, in a file called fails, and then did

cat fails |xargs -I {} diff testoutputs/{}.ok.out.nwparse testoutputs/{}.out.nwparse|less
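
With this many failures it can help to get a one-line summary per test before paging through the full diffs. A sketch, assuming the same fails file (one test name per line) and the testoutputs directory used above; the function name count_diffs is made up.

```shell
# Hypothetical helper: for each test in the list, print the test name and
# the number of reference lines that differ in the test output.
# Usage: count_diffs list_file [output_directory]
count_diffs() (
    list=$1
    dir=${2:-testoutputs}
    while IFS= read -r name; do
        [ -n "$name" ] || continue                 # skip empty lines
        ok="$dir/$name.ok.out.nwparse"
        out="$dir/$name.out.nwparse"
        if [ -f "$ok" ] && [ -f "$out" ]; then
            # Lines starting with "<" are reference lines that changed or vanished
            n=$(diff "$ok" "$out" | grep -c '^<' || true)
            printf '%s\t%s\n' "$name" "$n"
        else
            printf '%s\tmissing\n' "$name"         # one of the parsed files is absent
        fi
    done < "$list"
)
```

Running count_diffs fails | sort -k2 -n puts the tests with the fewest differing lines first.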

The jobs that failed outright are listed below:

-> tddft_h2o_mxvc20

tddft_diagon: negative excitation energy        0
 ------------------------------------------------------------------------
 This type of error is most commonly associated with calculations not reaching convergence criteria

-> tddft_h2o_uhf_mxvc20

Last System Error Message from Task 5:: Numerical result out of range
 tddft_diagon: negative excitation energy        0

-> prop_cg_nh3_b3lyp

task hessian incompatible with cgmin        0
 ------------------------------------------------------------------------
 A feature requested has not yet been implemented

-> dntmc_h2o_nh3

********** Destroying SubGroups ***********
 *******************************************
  deleting cloned rtdb 
  deleting cloned rtdb 
 Closing subgroup 
 Closing subgroup 
1:1:ga_pgroup_destroy_:Attempt to destroy process group with attached GAs:: 2
(rank:1 hostname:boron pid:1941):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
2:2:ga_pgroup_destroy_:Attempt to destroy process group with attached GAs:: 2
(rank:2 hostname:boron pid:1942):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
Last System Error Message from Task 1:: No such file or directory
Last System Error Message from Task 2:: No such file or directory

-> ch4_zts

scf string failed                                                                       0
 ------------------------------------------------------------------------
 This type of error is most commonly associated with calculations not reaching convergence criteria

-> ch4cl_zts

scf string failed                                                                       0
 ------------------------------------------------------------------------
 This type of error is most commonly associated with calculations not reaching convergence criteria

It's time to go back and compare with (1) nwchem-6.3/acml, (2) nwchem-6.1.1/openblas and (3) a different processor architecture...

The question is how serious this is. In most cases I think the rounding errors are ok, but errors do accumulate, and they can become significant, especially when numbers of very different magnitude are added or nearly equal numbers are subtracted.
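
As a generic illustration of how a rounding error can matter once large and small numbers are combined (this is plain double precision arithmetic in awk, not anything taken from NWChem): adding 1 to 1e17 and subtracting 1e17 again loses the 1 completely. The function name lost_term is made up.

```shell
# Illustration only: in IEEE double precision, 1e17 + 1 rounds back to 1e17,
# so the small term disappears and the result is 0 instead of 1.
lost_term() {
    awk 'BEGIN { printf "%d\n", (1e17 + 1) - 1e17 }'
}
```

Whether differences of the size seen in the QA output ever get amplified like this depends on the calculation, which is why comparing against other builds is worthwhile.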

10 June 2013
