Here's how to do it with the nwchem 6.3 QA tests.
I'm presuming that you built nwchem with mpi as shown e.g. here: http://verahill.blogspot.com.au/2013/05/424-nwchem-63-on-debian-wheezy.html
In this particular case I'm using nwchem linked with openblas on an AMD Phenom II 1055T with 8 GB RAM since it was the only node that was free.
0. Go to the QA directories.
In my case everything is housed in /opt/nwchem/nwchem-6.3-src.patched and the QA tests are in /opt/nwchem/nwchem-6.3-src.patched/QA
1. Run the tests
First set the environment variables, then start the tests. The 6 in './doqmtests.mpi 6' is the number of MPI processes (i.e. processor cores) to use in parallel.
export NWCHEM_TOP=/opt/nwchem/nwchem-6.3-src.patched
export NWCHEM_TARGET=LINUX64
./doqmtests.mpi 6 |tee doqmtests.mpi.log

======================================================
QM: Running all tests (including some really big ones)
======================================================
Running tests/h2o_opt/h2o_opt
cleaning scratch
copying input and verified output files
running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
26.3u 8.8s 0:07.13 492.8% (0t+0ds+0avg+49046max)k 0i+6199464o 18pf 0swaps
verifying output ... OK
Running tests/c2h4/c2h4
cleaning scratch
copying input and verified output files
running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
55.9u 2.1s 0:10.92 532.3% (0t+0ds+0avg+59834max)k 0i+808848o 19pf 0swaps
verifying output ... OK
[..]
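Since the full test run takes a long time, it may be worth checking the environment before launching it. The following is a minimal sketch, not part of the nwchem QA scripts: check_nwchem_env is a hypothetical helper name, and the binary location is assumed to follow the $NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem pattern seen in the log above.

```shell
# check_nwchem_env (hypothetical helper): verify that both variables are set
# and that an executable exists where the QA log shows it being run from.
# Prints the path of the binary on success, returns 1 otherwise.
check_nwchem_env() {
    if [ -z "${NWCHEM_TOP:-}" ] || [ -z "${NWCHEM_TARGET:-}" ]; then
        echo "NWCHEM_TOP and NWCHEM_TARGET must both be set" >&2
        return 1
    fi
    # Assumed layout, taken from the "running nwchem (...)" lines in the log
    bin="$NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem"
    if [ ! -x "$bin" ]; then
        echo "no executable found at $bin" >&2
        return 1
    fi
    echo "$bin"
}
```

After the two export lines, something like 'check_nwchem_env && ./doqmtests.mpi 6 |tee doqmtests.mpi.log' would then only start the tests if the check passes.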
2. Verify
Once the runs are done, go through the log to find out which, if any, failed. In my case, I had
Running tests/autosym/autosym
cleaning scratch
copying input and verified output files
running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
15.3u 2.5s 0:04.20 426.4% (0t+0ds+0avg+48842max)k 0i+1325784o 17pf 0swaps
verifying output ... failed
Running tests/dft_s12gh/dft_s12gh
cleaning scratch
copying input and verified output files
running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
619.4u 4.3s 1:45.33 592.1% (0t+0ds+0avg+76724max)k 2472i+1938952o 26pf 0swaps
verifying output ... failed
Failed
Running tests/cosmo_trichloroethene/cosmo_trichloroethene
cleaning scratch
copying input and verified output files
running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
113.6u 2.3s 0:19.57 592.8% (0t+0ds+0avg+63700max)k 64i+1149456o 20pf 0swaps
verifying output ... failed
Running tests/bsse_dft_trimer/bsse_dft_trimer
cleaning scratch
copying input and verified output files
running nwchem (/opt/nwchem/nwchem-6.3-src.patched/bin/LINUX64/nwchem)
228.2u 4.8s 0:40.16 580.5% (0t+0ds+0avg+53806max)k 0i+2716224o 21pf 0swaps
verifying output ... failed
Failed
and so on. Not a good start.
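Going through a long log by eye is tedious, so here is a rough sketch of how the names of the failed tests could be pulled out of it. list_failed_tests is a hypothetical helper name, and the sketch assumes the log has exactly the 'Running tests/...' and 'verifying output ... failed' lines shown above. Jobs that never reach the verification step may be reported differently and would not be matched.

```shell
# list_failed_tests (hypothetical helper): read a doqmtests.mpi log on stdin
# and print the name of every test whose output verification failed.
list_failed_tests() {
    awk '
        # Remember the most recent test name: last path component of the last field
        /Running tests\// { n = split($NF, parts, "/"); name = parts[n] }
        # Print that name when its verification line reports a failure
        /verifying output \.\.\. failed/ && name != "" { print name }
    '
}
```

For example, 'list_failed_tests < doqmtests.mpi.log' prints one test name per line.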
3. Troubleshoot the failed tests
To find out whether the failures are significant, we first need to understand how the script is doing the testing.
In runtests.mpi.unix
# Now verify the output

echo -n " verifying output ... "

perl $NWPARSE $STUB.out >& /dev/null
if ($status) then
  echo nwparse.pl failed on test output $STUB.out
  set overall_status = 1
  continue
endif
perl $NWPARSE $STUB.ok.out >& /dev/null
if ($status) then
  echo nwparse.pl failed on verified output $STUB.ok.out
  set overall_status = 1
  continue
endif

diff -w $STUB.ok.out.nwparse $STUB.out.nwparse >& /dev/null
@ diff1status = $status
#
endif
#

if ($diff1status) then
  echo "failed"
  set overall_status = 1
  continue
else
In my case autosym failed:
cd testoutputs
diff autosym.ok.out.nwparse autosym.out.nwparse

45c45
< Effective nuclear repulsion energy (a.u.) 4265.6221
---
> Effective nuclear repulsion energy (a.u.) 4265.6222

It seems to be a rounding error. As far as I know, the precision at which the data is stored is significantly higher than that at which it is reported, so this isn't necessarily a problem (it's still not a good thing, though). Note that everything else, such as the thermochemical parameters, is identical.
Continuing:
diff dft_s12gh.ok.out.nwparse dft_s12gh.out.nwparse

52c52
< The Zero-Point Energy (Kcal/mol) = 21.82496
---
> The Zero-Point Energy (Kcal/mol) = 21.82497
128c128
< H 0.0123 0.0030 0.0000
---
> H 0.0122 0.0030 0.0000

Same thing.
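Because the QA script uses a plain diff, a change in the last printed digit counts as a failure. To separate that kind of noise from real discrepancies, one option is to compare the two files field by field with a numerical tolerance. The sketch below is not part of the nwchem QA suite: numdiff_tol is a hypothetical helper name, the default tolerance of 0.00015 is an arbitrary illustrative value, and the sketch assumes both files have the same number of lines and fields (as in the diffs above).

```shell
# numdiff_tol (hypothetical helper): compare two files field by field.
# Numeric fields count as equal if they differ by at most tol (absolute);
# all other fields must match exactly. Prints the offending line pairs and
# returns 1 if any were found, 0 otherwise.
# Usage: numdiff_tol reference.nwparse test.nwparse [tol]
numdiff_tol() {
    awk -v tol="${3:-0.00015}" '
        function isnum(s) { return s ~ /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ }
        function abs(x) { return x < 0 ? -x : x }
        # First file: store every line of the reference
        NR == FNR { ref[FNR] = $0; nref = FNR; next }
        # Second file: compare against the stored reference line
        {
            n1 = split(ref[FNR], a, " "); n2 = split($0, b, " ")
            bad = (n1 != n2)
            for (i = 1; i <= n1 && !bad; i++) {
                if (isnum(a[i]) && isnum(b[i])) {
                    if (abs(a[i] - b[i]) > tol) bad = 1
                } else if (a[i] != b[i]) bad = 1
            }
            if (bad) { printf "%d: < %s\n%d: > %s\n", FNR, ref[FNR], FNR, $0; nbad++ }
        }
        END {
            if (FNR != nref) { print "line count differs"; nbad++ }
            exit (nbad > 0)
        }
    ' "$1" "$2"
}
```

With this, 'numdiff_tol autosym.ok.out.nwparse autosym.out.nwparse' should stay silent for a difference like 4265.6221 vs 4265.6222, while a tighter tolerance such as 0.00001 would report it.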
Here's a list of the tests that failed for me ('->' indicates that the execution failed, with more details below; '*' indicates that the test is expected to fail):
autosym dft_s12gh cosmo_trichloroethene bsse_dft_trimer cosmo_h3co cosmo_h3co_gp
h2o_diag_to_cg_ub3lyp * oh2 dft_cr2 dft_x dft_ozone hess_nh3_ub3lyp pspw_SiC
paw -> tddft_h2o_mxvc20 -> tddft_h2o_uhf_mxvc20 tce_cr_eom_t_ch_rohf hi_zora_sf
o2_zora_so lys_qmmm ethane_qmmm qmmm_opt0 prop_ch3f ch3f-lc-wpbe ch3f-lc-wpbeh
ch3radical_rot ch3radical_unrot cho_bp_props -> prop_cg_nh3_b3lyp acr-camb3lyp-cdfit
acr-camb3lyp-direct acr_lcblyp o2_bnl fh_m06 ??? disp_dimer_ch4 disp_dimer_ch4_cgmin
mep-test sif_sodft h2o_raman_3 h2o_raman_4 tropt-ch3nh2 h3_dirdyvtst h2o_hcons
etf_hcons cho_bp_props -> dntmc_h2o_nh3 5h2o_core co_core talc neb-fch3cl
neb-isobutene nwxc_pspw_1he nwxc_pspw_1ne nwxc_pspw_4n nwxc_pspw_4p
nwxc_pspw_new_1he nwxc_pspw_new_3he nwxc_pspw_new_1ne nwxc_pspw_new_4n
nwxc_pspw_new_1ar nwxc_pspw_new_4p nwxc_pspw_new_1kr nwxc_pspw_new_4as
nwxc_pspw_new_1xe nwxc_pspw_new_4sb hess_nh3_dimer pbo_nesc1e h2o_selci
hess_biph -> ch4_zts -> ch4cl_zts
All the jobs without a '->' or '*' failed due to rounding errors. To quickly go through them I put the list of failed jobs in a file, and then did
cat fails |xargs -I {} diff testoutputs/{}.ok.out.nwparse testoutputs/{}.out.nwparse|less
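One drawback of the xargs one-liner is that the diffs run together with no indication of which test each one belongs to. A small loop that prints a header per test might be easier to read. This is only a sketch: show_diffs is a hypothetical helper name, and it assumes the file holds one bare test name per line (no '->' or '*' markers).

```shell
# show_diffs (hypothetical helper): for each test name listed in a file,
# print a header followed by the diff between the verified and the freshly
# generated nwparse summaries.
# $1 = file with one test name per line, $2 = directory holding the outputs
show_diffs() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue          # skip blank lines
        echo "=== $name ==="
        # diff exits with 1 when the files differ; that is expected here
        diff -w "$2/$name.ok.out.nwparse" "$2/$name.out.nwparse" || true
    done < "$1"
}
```

From the QA directory this could be used as 'show_diffs fails testoutputs |less'.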
The jobs that failed outright are listed below:
-> tddft_h2o_mxvc20
tddft_diagon: negative excitation energy 0
------------------------------------------------------------------------
This type of error is most commonly associated with calculations not reaching convergence criteria

-> tddft_h2o_uhf_mxvc20
Last System Error Message from Task 5:: Numerical result out of range
tddft_diagon: negative excitation energy 0

-> prop_cg_nh3_b3lyp
task hessian incompatible with cgmin 0
------------------------------------------------------------------------
A feature requested has not yet been implemented

-> dntmc_h2o_nh3
********** Destroying SubGroups ***********
*******************************************
deleting cloned rtdb
deleting cloned rtdb
Closing subgroup
Closing subgroup
1:1:ga_pgroup_destroy_:Attempt to destroy process group with attached GAs:: 2
(rank:1 hostname:boron pid:1941):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
2:2:ga_pgroup_destroy_:Attempt to destroy process group with attached GAs:: 2
(rank:2 hostname:boron pid:1942):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
Last System Error Message from Task 1:: No such file or directory
Last System Error Message from Task 2:: No such file or directory

-> ch4_zts
scf string failed 0
------------------------------------------------------------------------
This type of error is most commonly associated with calculations not reaching convergence criteria

-> ch4cl_zts
scf string failed 0
------------------------------------------------------------------------
This type of error is most commonly associated with calculations not reaching convergence criteria
It's time to go back and compare with 1. nwchem-6.3/acml and 2. nwchem-6.1.1/openblas and 3. a different processor architecture...
The question is how serious this is. In most cases I think the rounding errors are ok, but errors do accumulate, and especially when large and small numbers are multiplied they can become significant.