14 June 2012

191. Thinking about Molecular volume -- and is cosmo/nwchem yielding the right ones?

Disclaimer:
I'm an neither a theoretical nor computational chemist. I'm an analytical/inorganic chemist who likes computers. Don't trust my conclusions. This is more like thinking aloud.

The problem:
The underlying impetus is that of molecular volume: if we have a set of scatter points in space which define the surface of a molecule, how can we extract the volume? In particular as we're actually given the surface points by in the form of a cosmo.xyz file by COSMO (and yes, nwchem also outputs a volume - more about that later) there's no reason why we won't do the calculations ourselves. Also, there's at least one example of comparing results from a few major software packages, where nwchem was the odd one out.

Because it's good to know how to use Octave and bash, I'll show the commands as well.

The COSMO parameters used were
cosmo
    rsolv 0
end

[come to think of it: why bother with
and nwchem returned

 atomic radii =
 --------------
    1  6.000  2.000
    2  6.000  2.000
    3  6.000  2.000
    4  6.000  2.000
    5  6.000  2.000
    6  6.000  2.000
    7  1.000  1.300
    8  1.000  1.300
    9  1.000  1.300
   10  1.000  1.300
   11  1.000  1.300
   12  1.000  1.300
and a volume of ca 74.5 Å3

Processing:
me@Be:~$ head cosmo.xyz 
                  325
 cosmo charges
 Bq   2.1848085582473193      -0.38055253987610238        1.5251498369435705       -9.3089382062078174E-004
 Bq   1.6134835908159706      -0.59877925881345084        1.8782480854375714       -3.3706153046646758E-003
 Bq  0.43449121346899733      -0.59877925881345084        1.8782480854375714       -3.9739778624157118E-003
 Bq   1.0239874021424840      -0.23823332776127137        1.8683447179254316       -1.6433149723942275E-003


OK, we need to remove the first two lines, and the first column.
me@Be:~$ tail -n +3 cosmo.xyz|gawk '{print $2,$3,$4,$5}'> cos2.xyz
Start octave.
octave:1> bz=load('cos2.xyz');
octave:2> x=bz(:,1);y=bz(:,2);z=bz(:,3);c=bz(:,4);
octave:3> plot3(x,y,z)

Paradoxically, this would be fairly easy to do with a 'normal-size' physical object (e.g. water displacement, or area on a 2D project: draw it, cut it out, weigh it and use the density of the paper)

 Computationally, we need to think about it though. The most logical approach seems to be to take all x,y data points with a small range of values of z=zi±dz, project them on a 2D surface, calculate the area within, and multiply it by dz. Do this for all values of z.
octave:4> plot(y,z,'*')


But how to calculate the area inside an arbitrary two-dimensional figure then? If we can pick a point in the 'centre' of the figure, we can draw repeated triangles with this point as one of the corners. It's easy to calculate the area of a triangle, so we just need to sum the areas of the triangles. All we need to know is how to find such a central point to use as a corner. Also, there are problems when dz is too large and the 'border' becomes fuzzy.
octave:5> plot(y(1:25),z(1:25),'*')

In fact, at this stage there may well be pre-canned algorithms to help us.
octave:6>H=convhull(y(1:25),z(1:25));
octave:7>plot(y(H),z(H))
octave:8>hold
octave:9>plot(y(1:25),z(1:25),'*')

That way we can reduce the number of points to the ones defining the encircling figure.
octave:10>area(y(H),z(H))


That still doesn't give us the area (I think matlab does though). Since it's centred around the x axis we could probably use cumsum(abs(z(H))), but that's not general enough. In fact, there'd be so much  quality analysis required in order to make sure that we include enough, but not too many, points in our slices that it quickly becomes a chore.

So we'll take a step back. Turns out it's even easier:
octave:11>[H V]=convhulln([x y z]);
This probably isn't how you're supposed to plot it, but it works:
octave:13>trisurf(H,x,y,z)

trisurf plot
octave:12>V
gives V=104.07  Å3 (c.f. Nwchem/cosmo ca 74.5 Å3 for rsolv=0.)

Now that doesn't look good, but it has been noted nwchem/cosmo gives volumes which are about half of what every other program gives. See here and here:

">Cosmo produced volumes, which were twice as small
> as those obtained by PCM, while surfaces where by about 15% bigger in
> Cosmo."

I think nwchem actually isn't returning values of the wrong magnitude -- I think the value returned by nwchem is the molecular volume, while the other programmes return the solvent accessible surface-based volume. But what is in cosmo.xyz?

It appears to be a little bit more complex than that though.


We can open the cosmo.xyz file in jmol, but calculating the volume from these would be meaningless due to the way jmol works.

Instead we'll have to use the VdW radii of the xyz coordinates of the (unoptimised) molecule:


$ isosurface sasurface 0.5 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = 141.06999
$ isosurface sasurface 0.225 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume =104.452415
$ isosurface solvent 0 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = 79.09731
$ isosurface solvent volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = [80.26721490808025]
$ isosurface molecular volume
isosurface2 created with cutoff=0.0; isosurface count: 2
isosurfaceVolume = [80.58888982478977]
$ isosurface sasurface 0.2 area
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceArea = 118.730934
Making sense?

sasurface generates a solvent accessible surface. We can generate a value similar to what we saw from the cosmo.xyz points by forcing the sasurface probe radius..

The vdw radii of H and C are 1.2 and 1.7 Å, but COSMO uses 1.3 and 2.0.

Look at this plot again:


The height goes from -2 to 2, which agrees with the large 2.0 Å VDW radius for C that COSMO uses. The volume outputted by Nwchem is the molecular volume (as actually is stated). 
 number of -cosmo- surface points =      176
 molecular surface =    125.008 angstrom**2
 molecular volume  =     74.512 angstrom**3
(electrostatic) solvation energy =         0.0052128678 (    3.27 kcal/mol)
The molecular volume for rsolv=0 is 74.5 Å3 which is fairly close to isosurface sasurface 0 volume. Area is trickier, and requires isosurface sasurface 0.23 volume to yield anything close.

I don't think it's a coincidence that isosurface sasurface 0.225 volume gives a reasonable agreement with the cosmo.xyz, since 1.7+0.225=1.925 which is ca 2 (we only add 0.1 for H).

I'm sure all this is in the manual somewhere. But there's nothing like learning through doing.

The conclusions:
* NWchem returns a volume based on the vdw radii, not the solvent cavity
* cosmo.xyz contains points defining the surface according to the vdw radii that cosmo uses
* These are two different sets of vdw radii
* You can't compare the output of different software packages if they aren't outputting the same data
* The reported NWChem volume does depend on rsolv, the cosmo vol doesn't
* The cosmo.xyz volume is insensitive to rsolv, but sensitive to radius as expected. As far as I understand, the cosmo volumes are based solely on the vdw radii (as supplied to cosmo)
* I haven't quite figured out how, but looking at the dependency of rsolve vs defining vdw radii for cosmo, the radii used to calculate the nwchem volume is is certainly affected.

Increase rsolv=0.0, increase vdw +0.0: 74.51/104.07/3.27
Increase rsolv=0.5, increase vdw +0.0: 58.0/103.96/3.01
Increase rsolv=1.0, increase vdw +0.0: 54 /103.87/2.95
Increase rsolv=0.0, increase vdw +0.1: 84.79/115.10/2.72
Increase rsolv=0.1, increase vdw +0.1: 82.68/115.10/2.63
Increase rsolv=0.27, increase vdw +0.1: 71.84/114.97/2.56
Increase rsolv=0.0, increase vdw +0.2: 96.59/126.83/2.22
Increase rsolv=0.1, increase vdw +0.2: 85.70/126.68/2.09
increase rsolv=0.70, increase vdw +0.2: 74.68/126.56/2.01

My only real conclusion at this point is that you have to be extremely careful about what you do. This is not easy.


A Certain Commercial Programme (ACCP):
Using pcm:

scrf=(pcm,solvent=water) -- this uses vdw radii.
GePol: Cavity volume                                =      134.665 Ang**3
GePol: Cavity surface area                          =    143.132 Ang**2
Let's see if we can do this in jmol:
$ isosurface sasurface 0.5 area
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceArea = 144.25595
$ isosurface sasurface 0.46 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = 135.33589
PCM is less of a mystery now.

ACCP has a few more options though.
Using IPCM with 50 points. This uses the isodensity volume.
Volume of Solute Cavity = 8.026500E+02
Total "Solvent Accessible Surface Area" of Solute = 4.485628E+02
I've been told that the units are in Bohr3 and Bohr2. That translates to 118.94 Å3 and 125.61Å3, respectively, which sounds about right. 

13 June 2012

190. In deep water: NWChem and COSMO

This post is based entirely on empirical experience. I don't claim to know what I'm doing. Right now I'm just looking at performance.

To actually learn more about COSMO (implemented) and COSMO-RS (not implemented), read the following article by the creator of the methods: http://onlinelibrary.wiley.com/doi/10.1002/wcms.56/abstract

Anyway.


As always, the test job (benzene at b3lyp/g-31+g*) is very short, so the error margin is large. A major impetus for this is the execeptional performance of PCM in gaussian, and seemingly poor performance of nwchem using standard settings. When several numbers are quoted they come from multiple runs

Task energy - empty COSMO block:
0. Gas phase - ca 40 s
1. From scratch. Empty cosmo block - 79 s
2. Loaded movecs from gas phase, empty cosmo block - 48 s, 65 s, 65 s

The default is water and rsolv=0

COSMO parameters
Movecs loaded in all cases. Solvation energies in []

Task energy -- rsolv
0. rsolv=0 - [3.27 kcal/mol] - 48 s, 65 s, 65 s, 65 s
1. rsolv=0.5 - [3.01 kcal/mol] - 58 s, 58 s
2. rsolv=1 - [2.95 kcal/mol] - 57 s, 57 s, 58 s
3. rsolv=2 - [2.62 kcal/mol] - 55 s

The molecular volumes obtained are 74.5, 58.0 and 54.0 Å3, respectively, for r=0..1. My next post will talk about what this actually means, but in short, this has nothing to do with the solvent/solute cavity.

Task energy -- lineq
0. rsolv=0.5; lineq 0 - [3.01 kcal/mol] - 58 s, 58 s, 56 s
1. rsolv=0.5; lineq 1 - [3.01 kcal/mol] -  58 s

Task energy -- ificos
0. rsolv=0.5; lineq 0, ificos=0 - [3.01 kcal/mol] - 58 s, 58 s, 56 s
1. rsolv=0.5; lineq 0, ificos=1 - [3.01 kcal/mol] - 62 s

1 (one) kcal./mol = 4.184 kJ/mol -- there's thus a fairly wide range of values obtained above depending on the absolute settings.

Rsolve defines the probe used to find the solvent accessible surface -- the smaller the value, the more fine-grained and larger the apparent accessible surface. We would expect that a fairly small number is preferable for rsolv.

Ultimately, I don't see any obvious way of improving performance, other than using large values for rsolv.

An interesting feature is that the surface used by COSMO is save in a cosmo.xyz file in the runtime directory -- all that remains is working out a way of calculating the volume from this (I know it's reported in the nwchem output, but it never hurts being paranoid)

189. Thoughts on restarting NWChem jobs in ECCE

UPDATE: Because all my nodes are working hard to keep my office warm in the Australian winter, I haven't tested this very extensively, but it seems like freq jobs can be restarted using
freq
           reuse oldhessian.hess
end
Original post
As often is the case, this is as much a note to myself as a blog post.

Or that's how it started out. I've since spent a bit of time testing different restart options, as I found that some paradoxically were actually seen to lead to longer calculations...

The jobs I've experimented with are very short, so the error margin is probably huge.

What I tried:

A. Task dft geometry:
1. Original job - 215 s
2. Substituting Start with restart in the same directory as A.1 (i.e. loaded db)  - 204 s
3. Same as A.2, but also deleted geometry section - 43 s
4. Same as A.2 and A.3, but loaded movecs explicitly - 43 s
5. Used start, but loaded movecs from job B.3. - 267 s

Comment: not sure whether A.5 is consistently slower than A.1, but I've never seen it go faster. A.3 looks like a good bet when resuming a calculation.

B. Task dft energy:
1. Original job - 41 s
2. Deleted geometry, loaded movecs, db from B.1 - 8.7 s
3. Used start directive, kept geometry, loaded movecs from B.1, no db - 8.7 s

Comment: loading movecs (B.3) seems like a winner

C. Task dft frequency
0. Original job - 952 s
1. Deleted geom, loaded movecs, db from task energy (B.2 above) - 853 s
2. Delete geometry, loaded movecs from opt, put drv.hess, .hess, fd_ddipole in same directory (from A.1 above) - 842 s
3. Same as C2, but deleted basis block as well, and removed everything from the dft block except direct and vectors - 849 s
4. Copied hessian from 0, and put 'reuse' inside the freq block - 0.1 s (!)
5. Copied hessian from A.1 and put 'reuse' inside the freq block - 0.1 s (but data wrong)


Comment: is C0 really slower than the other jobs?
The problem with approach C.5 is that while C.4 gives

65.943 kcal/mol, 69.553 cal/mol-K
C.5 gives
288.058 kcal/mol, 64.706 cal/mol-K


Here's a comment from one of the developers, Bert:

"If you just want to redo an energy calculation followed by and ESP calculation, I would never use restart, but just use start and define the geometry in the geometry block. [cut] The restart is purely to continue the calculation that got interrupted, and the runtime database is probably not in a clean enough state to do something completely different with it. You can use the movecs that have been generated as the starting vectors though. "

A.5 (no effect) and B.3 (speed-up) would be in line with that approach.

With that in hand, time to work on ECCE.

ECCE is a nice tool, but as any point-and-click program it has it's limitations -- it's impossible to predict every single type of usage, and this is particularly true for computational wizardry. To a large extent this is compensated for by the ability to do a 'final edit' using vim before submitting a job -- there is obviously nothing whatsoever that you're prevented from doing at this point, so it offer ultimate flexibility.

There is a major weakness using ECCE though -- using files from old jobs.

In particular this is a weakness when it comes to restarting jobs. In terms of structure, this isn't a problem -- the last structure is provided by ecce via ecce.out.  It would be nice to able to carry over the .movecs files though (I'm still learning, but loading movecs and using fragment guess seems to be the neatest thing). This is high on the wish list.

Anyway, there are two major use cases:

Restarting an interrupted job
Assuming that you resubmit it without changing the name and to the same cluster so that it'll run remotely in the same directory:

Replace
Start
with
restart 
which should tell nwchem to look for the .db file and
either edit the scf or dft block and add
vectors input jobname.movecs
Obviously, this isn't example of great insight, but rather a product of reading the manual.

Also, the manual does state (but does not suggest) that leaving the start/restart directive out would cause nwchem to look for evidence of whether it's a restarted job or not. The problem is that ECCE automatically names all files nwch.nw, which would cause nwchem to look for nwch.db and fail.


Launching a new calculation based on an old job
Now, if you are duplicating a job, or if you've since renamed the job, you're in a spot of trouble since ecce doesn't concern itself with the .db and .movecs files. Maybe there's a good reason for this? But if I understand everything correctly, this means that you are loosing a lot of time on scf cycles which you could avoid by loading the .movecs and .db file.

I think that in the case of: same cluster and similar directory structure (i.e. the previous job is also a subdirectory of the same parent directory as the new job) you can put this at the beginning of your job
task shell "cp ../oldjob/*.movecs ."

and either edit the scf or dft block and add
vectors input jobname.movecs

and it actually works.

But I had no luck doing this
task shell "cp ../oldjob/*.db ."
And combining it with restart -- it wants the .db file to be there already if you use restart. This, at least in terms of functionality, agrees well with the comment by Bert above - same directory is ok, but with a different directory only movecs are reasonably easy.

Now, all we need is a tick box to copy the old movecs files between jobs...and the underlying structure. At the moment the movecs files don't get imported, so it would take a bit of editing to get to that point.