14 June 2012

191. Thinking about Molecular volume -- and is cosmo/nwchem yielding the right ones?

Disclaimer:
I'm neither a theoretical nor a computational chemist. I'm an analytical/inorganic chemist who likes computers. Don't trust my conclusions. This is more like thinking aloud.

The problem:
The underlying impetus is that of molecular volume: if we have a set of scatter points in space which define the surface of a molecule, how can we extract the volume? In particular, since COSMO actually gives us the surface points in the form of a cosmo.xyz file (and yes, nwchem also outputs a volume -- more about that later), there's no reason why we shouldn't do the calculations ourselves. Also, there's at least one example of comparing results from a few major software packages, where nwchem was the odd one out.

Because it's good to know how to use Octave and bash, I'll show the commands as well.

The COSMO parameters used were
cosmo
    rsolv 0
end

[come to think of it: why bother with rsolv at all?]
and nwchem returned

 atomic radii =
 --------------
    1  6.000  2.000
    2  6.000  2.000
    3  6.000  2.000
    4  6.000  2.000
    5  6.000  2.000
    6  6.000  2.000
    7  1.000  1.300
    8  1.000  1.300
    9  1.000  1.300
   10  1.000  1.300
   11  1.000  1.300
   12  1.000  1.300
and a volume of ca 74.5 Å3

Processing:
me@Be:~$ head cosmo.xyz 
                  325
 cosmo charges
 Bq   2.1848085582473193      -0.38055253987610238        1.5251498369435705       -9.3089382062078174E-004
 Bq   1.6134835908159706      -0.59877925881345084        1.8782480854375714       -3.3706153046646758E-003
 Bq  0.43449121346899733      -0.59877925881345084        1.8782480854375714       -3.9739778624157118E-003
 Bq   1.0239874021424840      -0.23823332776127137        1.8683447179254316       -1.6433149723942275E-003


OK, we need to remove the first two lines, and the first column.
me@Be:~$ tail -n +3 cosmo.xyz|gawk '{print $2,$3,$4,$5}'> cos2.xyz
Start octave.
octave:1> bz=load('cos2.xyz');
octave:2> x=bz(:,1);y=bz(:,2);z=bz(:,3);c=bz(:,4);
octave:3> plot3(x,y,z)

Paradoxically, this would be fairly easy to do with a 'normal-size' physical object (e.g. water displacement, or area on a 2D projection: draw it, cut it out, weigh it and use the density of the paper).

Computationally, we need to think about it, though. The most logical approach seems to be to take all x,y data points within a small range of values z=zi±dz, project them onto a 2D surface, calculate the area within, and multiply it by dz. Do this for all values of z.
octave:4> plot(y,z,'*')


But how to calculate the area inside an arbitrary two-dimensional figure then? If we can pick a point in the 'centre' of the figure, we can draw repeated triangles with this point as one of the corners. It's easy to calculate the area of a triangle, so we just need to sum the areas of the triangles. All we need to know is how to find such a central point to use as a corner. Also, there are problems when dz is too large and the 'border' becomes fuzzy.
octave:5> plot(y(1:25),z(1:25),'*')
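To illustrate the triangle-fan idea on a figure where the answer is known (a made-up example -- a regular pentagon with its corners already in order, which is the situation we'd need to engineer for our scatter points):

t=linspace(0,2*pi,6)';     % five corners, with the first repeated at the end
py=cos(t); pz=sin(t);
cy=mean(py(1:5)); cz=mean(pz(1:5));   % the 'central' point
A=0;
for k=1:5
  % half the absolute cross product = area of triangle (centre, k, k+1)
  A=A+0.5*abs((py(k)-cy)*(pz(k+1)-cz)-(py(k+1)-cy)*(pz(k)-cz));
end
A

This returns A = 2.3776, i.e. (5/2)*sin(72 degrees), as it should for a pentagon with unit circumradius.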

In fact, at this stage there may well be pre-canned algorithms to help us.
octave:6> H=convhull(y(1:25),z(1:25));
octave:7> plot(y(H),z(H))
octave:8> hold
octave:9> plot(y(1:25),z(1:25),'*')

That way we can reduce the number of points to the ones defining the encircling figure.
octave:10> area(y(H),z(H))


That still doesn't give us the area (I think matlab does, though). Since the figure is centred around the x axis we could probably use cumsum(abs(z(H))), but that's not general enough. In fact, there'd be so much quality analysis required to make sure that we include enough, but not too many, points in our slices that it quickly becomes a chore.
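For what it's worth, the pieces are now all there to automate the slices: convhull puts the border points in order, and octave's built-in polyarea does the triangle/shoelace arithmetic for us. A sketch only (dz is arbitrary, and it does nothing about the fuzzy-border problem -- which is exactly the chore referred to above):

dz=0.4;
Vslices=0;
for zi=min(z):dz:max(z)
  idx=find(z>=zi & z<zi+dz);   % points in the current slab
  if (length(idx)>2)
    xi=x(idx); yi=y(idx);
    H=convhull(xi,yi);         % ordered border of the projected slab
    Vslices=Vslices+polyarea(xi(H),yi(H))*dz;
  end
end
Vslices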

So we'll take a step back. Turns out it's even easier:
octave:11> [H V]=convhulln([x y z]);
This probably isn't how you're supposed to plot it, but it works:
octave:12> trisurf(H,x,y,z)

[trisurf plot]

octave:13> V
gives V=104.07 Å3 (cf. nwchem/cosmo ca 74.5 Å3 for rsolv=0).

Now that doesn't look good, but it has been noted that nwchem/cosmo gives volumes which are about half of what every other program gives. See here and here:

">Cosmo produced volumes, which were twice as small
> as those obtained by PCM, while surfaces where by about 15% bigger in
> Cosmo."

I think nwchem actually isn't returning values of the wrong magnitude -- I think the value returned by nwchem is the molecular volume, while the other programmes return the solvent accessible surface-based volume. But what is in cosmo.xyz?

It appears to be a little bit more complex than that though.


We can open the cosmo.xyz file in jmol, but calculating the volume from these would be meaningless due to the way jmol works.

Instead we'll have to use the vdw radii together with the xyz coordinates of the (unoptimised) molecule:


$ isosurface sasurface 0.5 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = 141.06999
$ isosurface sasurface 0.225 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume =104.452415
$ isosurface solvent 0 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = 79.09731
$ isosurface solvent volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = [80.26721490808025]
$ isosurface molecular volume
isosurface2 created with cutoff=0.0; isosurface count: 2
isosurfaceVolume = [80.58888982478977]
$ isosurface sasurface 0.2 area
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceArea = 118.730934
Making sense?

sasurface generates a solvent accessible surface. We can generate a value similar to what we saw from the cosmo.xyz points by forcing the sasurface probe radius.

The vdw radii of H and C are 1.2 and 1.7 Å, but COSMO uses 1.3 and 2.0.

Look at the y,z scatter plot from above again:


The height goes from -2 to 2 Å, which agrees with the large 2.0 Å vdw radius for C that COSMO uses. The volume output by nwchem is the molecular volume (as is actually stated):
 number of -cosmo- surface points =      176
 molecular surface =    125.008 angstrom**2
 molecular volume  =     74.512 angstrom**3
(electrostatic) solvation energy =         0.0052128678 (    3.27 kcal/mol)
The molecular volume for rsolv=0 is 74.5 Å3, which is fairly close to isosurface sasurface 0 volume. Area is trickier, and requires isosurface sasurface 0.23 area to yield anything close.

I don't think it's a coincidence that isosurface sasurface 0.225 volume gives reasonable agreement with the cosmo.xyz volume, since 1.7+0.225=1.925, which is ca 2 (for H the cosmo radius is only 0.1 larger than the vdw radius).

I'm sure all this is in the manual somewhere. But there's nothing like learning through doing.

The conclusions:
* NWchem returns a volume based on the vdw radii, not the solvent cavity
* cosmo.xyz contains points defining the surface according to the vdw radii that cosmo uses
* These are two different sets of vdw radii
* You can't compare the output of different software packages if they aren't outputting the same data
* The reported NWChem volume depends on rsolv; the cosmo.xyz volume doesn't
* The cosmo.xyz volume is insensitive to rsolv, but sensitive to the radii, as expected. As far as I understand, the cosmo volumes are based solely on the vdw radii (as supplied to cosmo)
* I haven't quite figured out how, but looking at the dependence on rsolv vs the vdw radii defined for cosmo, the radii used to calculate the nwchem volume are certainly affected.

The numbers are NWChem molecular volume (Å3) / cosmo.xyz convhulln volume (Å3) / solvation energy (kcal/mol):

rsolv=0.0, vdw +0.0: 74.51/104.07/3.27
rsolv=0.5, vdw +0.0: 58.0/103.96/3.01
rsolv=1.0, vdw +0.0: 54/103.87/2.95
rsolv=0.0, vdw +0.1: 84.79/115.10/2.72
rsolv=0.1, vdw +0.1: 82.68/115.10/2.63
rsolv=0.27, vdw +0.1: 71.84/114.97/2.56
rsolv=0.0, vdw +0.2: 96.59/126.83/2.22
rsolv=0.1, vdw +0.2: 85.70/126.68/2.09
rsolv=0.70, vdw +0.2: 74.68/126.56/2.01

My only real conclusion at this point is that you have to be extremely careful about what you do. This is not easy.


A Certain Commercial Programme (ACCP):
Using pcm:

scrf=(pcm,solvent=water) -- this uses vdw radii.
GePol: Cavity volume                                =      134.665 Ang**3
GePol: Cavity surface area                          =    143.132 Ang**2
Let's see if we can do this in jmol:
$ isosurface sasurface 0.5 area
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceArea = 144.25595
$ isosurface sasurface 0.46 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = 135.33589
PCM is less of a mystery now.

ACCP has a few more options though.
Using IPCM with 50 points. This uses the isodensity volume.
Volume of Solute Cavity = 8.026500E+02
Total "Solvent Accessible Surface Area" of Solute = 4.485628E+02
I've been told that the units are Bohr3 and Bohr2. That translates to 118.94 Å3 and 125.61 Å2, respectively, which sounds about right.
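For reference, the conversion (1 Bohr = 0.529177 Å):

octave:1> 802.65*0.529177^3
ans = 118.94
octave:2> 448.5628*0.529177^2
ans = 125.61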

13 June 2012

190. In deep water: NWChem and COSMO

This post is based entirely on empirical experience. I don't claim to know what I'm doing. Right now I'm just looking at performance.

To actually learn more about COSMO (implemented) and COSMO-RS (not implemented), read the following article by the creator of the methods: http://onlinelibrary.wiley.com/doi/10.1002/wcms.56/abstract

Anyway.


As always, the test job (benzene at b3lyp/6-31+g*) is very short, so the error margin is large. A major impetus for this is the exceptional performance of PCM in gaussian, and the seemingly poor performance of nwchem using standard settings. When several numbers are quoted they come from multiple runs.

Task energy - empty COSMO block:
0. Gas phase - ca 40 s
1. From scratch. Empty cosmo block - 79 s
2. Loaded movecs from gas phase, empty cosmo block - 48 s, 65 s, 65 s

The default is water and rsolv=0.
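In other words, the only thing varied between the runs below is the rsolv line in the cosmo block:

cosmo
    rsolv 0.5
end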

COSMO parameters
Movecs loaded in all cases. Solvation energies in []

Task energy -- rsolv
0. rsolv=0 - [3.27 kcal/mol] - 48 s, 65 s, 65 s, 65 s
1. rsolv=0.5 - [3.01 kcal/mol] - 58 s, 58 s
2. rsolv=1 - [2.95 kcal/mol] - 57 s, 57 s, 58 s
3. rsolv=2 - [2.62 kcal/mol] - 55 s

The molecular volumes obtained are 74.5, 58.0 and 54.0 Å3, respectively, for rsolv=0, 0.5 and 1. My next post will talk about what this actually means, but in short, this has nothing to do with the solvent/solute cavity.

Task energy -- lineq
0. rsolv=0.5; lineq 0 - [3.01 kcal/mol] - 58 s, 58 s, 56 s
1. rsolv=0.5; lineq 1 - [3.01 kcal/mol] -  58 s

Task energy -- ificos
0. rsolv=0.5; lineq 0, ificos=0 - [3.01 kcal/mol] - 58 s, 58 s, 56 s
1. rsolv=0.5; lineq 0, ificos=1 - [3.01 kcal/mol] - 62 s

1 (one) kcal/mol = 4.184 kJ/mol -- the spread in the values above (3.27-2.62 = 0.65 kcal/mol, i.e. ca 2.7 kJ/mol) is thus fairly wide, depending on the settings.

rsolv defines the probe used to find the solvent accessible surface -- the smaller the value, the more fine-grained and the larger the apparent accessible surface. We would expect that a fairly small value is preferable for rsolv.

Ultimately, I don't see any obvious way of improving performance, other than using large values for rsolv.

An interesting feature is that the surface used by COSMO is saved in a cosmo.xyz file in the runtime directory -- all that remains is working out a way of calculating the volume from it (I know it's reported in the nwchem output, but it never hurts being paranoid).

189. Thoughts on restarting NWChem jobs in ECCE

UPDATE: Because all my nodes are working hard to keep my office warm in the Australian winter, I haven't tested this very extensively, but it seems like freq jobs can be restarted using
freq
  reuse oldhessian.hess
end
Original post
As often is the case, this is as much a note to myself as a blog post.

Or that's how it started out. I've since spent a bit of time testing different restart options, as I found that some, paradoxically, actually seemed to lead to longer calculations...

The jobs I've experimented with are very short, so the error margin is probably huge.

What I tried:

A. Task dft geometry:
1. Original job - 215 s
2. Substituting Start with restart in the same directory as A.1 (i.e. loaded db)  - 204 s
3. Same as A.2, but also deleted geometry section - 43 s
4. Same as A.2 and A.3, but loaded movecs explicitly - 43 s
5. Used start, but loaded movecs from job B.3. - 267 s

Comment: not sure whether A.5 is consistently slower than A.1, but I've never seen it go faster. A.3 looks like a good bet when resuming a calculation.

B. Task dft energy:
1. Original job - 41 s
2. Deleted geometry, loaded movecs, db from B.1 - 8.7 s
3. Used start directive, kept geometry, loaded movecs from B.1, no db - 8.7 s

Comment: loading movecs (B.3) seems like a winner

C. Task dft frequency
0. Original job - 952 s
1. Deleted geom, loaded movecs, db from task energy (B.2 above) - 853 s
2. Delete geometry, loaded movecs from opt, put drv.hess, .hess, fd_ddipole in same directory (from A.1 above) - 842 s
3. Same as C2, but deleted basis block as well, and removed everything from the dft block except direct and vectors - 849 s
4. Copied hessian from 0, and put 'reuse' inside the freq block - 0.1 s (!)
5. Copied hessian from A.1 and put 'reuse' inside the freq block - 0.1 s (but data wrong)


Comment: is C0 really slower than the other jobs?
The problem with approach C.5 is that while C.4 gives

65.943 kcal/mol, 69.553 cal/mol-K
C.5 gives
288.058 kcal/mol, 64.706 cal/mol-K


Here's a comment from one of the developers, Bert:

"If you just want to redo an energy calculation followed by and ESP calculation, I would never use restart, but just use start and define the geometry in the geometry block. [cut] The restart is purely to continue the calculation that got interrupted, and the runtime database is probably not in a clean enough state to do something completely different with it. You can use the movecs that have been generated as the starting vectors though. "

A.5 (no effect) and B.3 (speed-up) would be in line with that approach.

With that in hand, time to work on ECCE.

ECCE is a nice tool, but like any point-and-click program it has its limitations -- it's impossible to predict every single type of usage, and this is particularly true for computational wizardry. To a large extent this is compensated for by the ability to do a 'final edit' using vim before submitting a job -- there is obviously nothing whatsoever that you're prevented from doing at this point, so it offers ultimate flexibility.

There is a major weakness in using ECCE though -- reusing files from old jobs.

In particular this is a weakness when it comes to restarting jobs. In terms of structure, this isn't a problem -- the last structure is provided by ecce via ecce.out. It would be nice to be able to carry over the .movecs files though (I'm still learning, but loading movecs and using fragment guess seems to be the neatest thing). This is high on the wish list.

Anyway, there are two major use cases:

Restarting an interrupted job
Assuming that you resubmit it without changing the name and to the same cluster so that it'll run remotely in the same directory:

Replace
start
with
restart
which should tell nwchem to look for the .db file, and
edit either the scf or dft block and add
vectors input jobname.movecs
Obviously, this isn't an example of great insight, but rather a product of reading the manual.
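Put together, the top of the edited input might look something like this (only a sketch -- nwch is what ECCE calls the job, and the rest of the input is left as generated):

restart nwch

dft
  vectors input nwch.movecs
end

task dft optimize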

Also, the manual does state (though it does not recommend it) that leaving the start/restart directive out will cause nwchem to look for evidence of whether it's a restarted job or not. The problem is that ECCE automatically names all files nwch.nw, which would cause nwchem to look for nwch.db and fail.


Launching a new calculation based on an old job
Now, if you are duplicating a job, or if you've since renamed the job, you're in a spot of trouble since ecce doesn't concern itself with the .db and .movecs files. Maybe there's a good reason for this? But if I understand everything correctly, this means that you are losing a lot of time on scf cycles which you could avoid by loading the .movecs and .db file.

I think that in the case of the same cluster and a similar directory structure (i.e. the previous job is also a subdirectory of the same parent directory as the new job) you can put this at the beginning of your job:
task shell "cp ../oldjob/*.movecs ."

and edit either the scf or dft block and add
vectors input jobname.movecs

and it actually works.
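Putting the two edits together, a sketch (the job names are made up, and note that the copied movecs file keeps the old job's name):

start newjob
# geometry and basis blocks as usual
task shell "cp ../oldjob/oldjob.movecs ."
dft
  vectors input oldjob.movecs
end
task dft energy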

But I had no luck doing this
task shell "cp ../oldjob/*.db ."
And combining it with restart didn't help: nwchem wants the .db file to be there already if you use restart. This agrees, at least in terms of functionality, with the comment by Bert above -- same directory is ok, but from a different directory only the movecs are reasonably easy to reuse.

Now, all we need is a tick box to copy the old movecs files between jobs...and the underlying structure. At the moment the movecs files don't get imported, so it would take a bit of editing to get to that point.


188. Notes: virtualbox and /etc/init.d/vboxdrv

If you install virtualbox and get ready to fire up the installation of a new virtual machine, but immediately get an error about having to do '/etc/init.d/vboxdrv setup' -- and find that there's no such executable in /etc/init.d in spite of having installed the virtualbox-dkms package -- then

sudo apt-get install linux-headers-`uname -r`
sudo dpkg-reconfigure virtualbox-dkms

12 June 2012

187. Thunderbird 13.0 from source on debian wheezy

First look here for dependencies:
http://verahill.blogspot.com.au/2012/05/thunderbird-1201-on-debian.html

In terms of building it's almost exactly the same as for the 12-series: the only difference is that you have to build outside the source tree.

cd ~/tmp
rm comm-release -rf
wget ftp://ftp.mozilla.org/pub/mozilla.org/thunderbird/releases/13.0/source/thunderbird-13.0.source.tar.bz2
tar xvf thunderbird-13.0.source.tar.bz2
mkdir thunderbird13
cd thunderbird13
../comm-release/configure --disable-necko-wifi

The next step takes a while (30-60 minutes)
make 

sudo make install

Done.

What's new: http://www.ghacks.net/2012/06/06/whats-new-in-thunderbird-13/

Errors
No rule to make target ../../../xpcom/idl-parser/xpidllex.py
Solution:
Build outside the source tree as shown above.

11 June 2012

186. Installing gnome shell extensions in gnome 3.4 on debian wheezy-- frippery panel, menu etc.

Gnome 3.4 frippery extensions in Debian Wheezy: bottom panel, favourites etc.
Upgrading to gnome 3.4 disabled all my extensions. It also removed all my keyboard shortcuts.

Update: Interesting take on the GNOME 3/KDE 4 releases: http://www.datamation.com/open-source/the-gnome-exodus-and-kde-2.html I think the idea of a lack of trust is a valid one: I might be able to get GNOME to do what I want today, but what about tomorrow? How much longer can I manually patch my screenshot app?

So, we need to get:
* move clock
* favourites
* application menu
* bottom panel
* static workspaces

Btw, extensions.gnome.org doesn't do International English. Try searching for favourites. And that's just the beginning of the headaches. I had problems finding any extensions compatible with gnome 3.4.

Anyway, as usual frippery (http://intgat.tigress.co.uk/rmy/extensions/index.html) comes to the rescue of the users (and by extension to the rescue of Gnome -- I'd already be long gone if I couldn't revert some of the more insane behaviour of gnome-shell...)

In your ~ folder (in order that the files get untarred to the correct location)
wget http://intgat.tigress.co.uk/rmy/extensions/gnome-shell-frippery-0.4.1.tgz
tar xvf gnome-shell-frippery-0.4.1.tgz 

Hit alt+f2 to bring up the launcher thingy, type 'r' and hit enter. You're done!


To make life worth living again, also do
sudo apt-get install gnome-tweak-tool
if you haven't already

That way you can get the Minimize/Maximize/Close buttons back on your window border.

Another noticeable change is that it's become very difficult to resize windows using the mouse -- expand horizontally or vertically is like before, but dragging a corner is tough -- it takes a lot of fiddling to be able to grab the corner in the first place.

Finally, ctrl+b is mapped to some bookmark function in epiphany/web which is annoying, since it's universally used to make things bold. The gnome developer instructions even say not to do this:
http://developer.gnome.org/hig-book/3.4/input-keyboard.html.en (see table 10.8)


Interesting side-effect:
my fancy gnome-screenshot.debugged isn't called anymore -- and the metacity/keybinding_commands list is depopulated in addition to the gnome system settings/keyboard/shortcuts/Custom. Gnome shell 3.4 seems to mark the point where gconf-editor is deprecated. See the gnome-screenshot compilation post for more info.


At any rate, the keyboard shortcuts related to Screenshots now contain five different key combinations. Seriously -- they 'simplify' gnome-screenshot, then they want users to learn four different key combinations in addition to vanilla prtscr? And none of them does what I really need -- i.e. a quick and simple way to save a screenshot with the name I want in the location I want.


Links to this post:
https://www.linuxquestions.org/questions/debian-26/how-to-add-panel-in-gnome-debian-wheezy-4175463451/

185. Troubleshooting: ECCE


10 June 2012

184. Fixing Gnome screenshot (3.4.1) in Debian Wheezy by patching and compiling

Approach
Putting a hold on gnome-screenshot forever will likely prevent gnome from upgrading properly since I'd suspect it's a required dependency.

Clarification: this fix restores the original behaviour. gnome-screenshot --interactive is NOT an acceptable solution. This guide restores gnome-screenshot to its good old functional state.

So, time to build our own gnome-screenshot -- but one which actually works in a reasonable way. The gnome-screenshot cockup is just another sign that something is clearly amiss with the way gnome is being developed. And this, if true, is another truly idiotic 'feature' -- turn gnome into windows? Most of us left for a reason...

Anyway, linux is still sane though -- if we don't like something we're not entirely up a creek, which will buy us a bit more time while we're getting ready to move to xmonad -- or for debian to start making downstream changes to gnome.

We have two options:
Either look here: http://git.gnome.org/browse/gnome-screenshot/commit/?id=3bbc1e158fd58ec7f4f984f6d3c15ec95e65a035&ignorews=1 and try to come up with your own way of reverting the crippling.

Or use the ubuntu patches as a guide: http://packages.ubuntu.com/precise/gnome-screenshot

Normally you shouldn't mix ubuntu and debian packages, and we won't: we'll be compiling our own package, but using the work done by the ubuntu maintainers.

In particular, look at this: http://archive.ubuntu.com/ubuntu/pool/main/g/gnome-screenshot/gnome-screenshot_3.4.1-0ubuntu1.debian.tar.gz

Look in the debian/patches directory and you'll find the ubuntu_interactive_screenshots.patch

Building:
sudo apt-get install libgtk-3-dev libcanberra-gtk3-dev intltool
wget http://ftp.de.debian.org/debian/pool/main/g/gnome-screenshot/gnome-screenshot_3.4.1.orig.tar.xz
tar xvf gnome-screenshot_3.4.1.orig.tar.xz
cd gnome-screenshot-3.4.1/

You can wget http://archive.ubuntu.com/ubuntu/pool/main/g/gnome-screenshot/gnome-screenshot_3.4.1-0ubuntu1.debian.tar.gz and untar it to look at debian/patches/ubuntu_interactive_screenshots.patch, which is what the changes below are based on:

In the ubuntu patch there's a test to see whether unity is used. We'll do it a bit more crudely -- we'll just make sure the condition is always true by testing for 0<1.


Edit src/screenshot-application.c and change the if condition on line 134 in the excerpt below

130 static void
131 save_pixbuf_handle_error (ScreenshotApplication *self,
132                           GError *error)
133 {
134   if (screenshot_config->interactive)
135     {
136       ScreenshotDialog *dialog = self->priv->dialog;
137       GtkWidget *toplevel = screenshot_dialog_get_toplevel (dialog);
138 
139       screenshot_dialog_set_busy (dialog, FALSE);
to

134   if (0 < 1)

Also, change

348   screenshot_play_sound_effect ("screen-capture", _("Screenshot taken"));
349 
350   if (screenshot_config->interactive)
351     {
352       self->priv->dialog = screenshot_dialog_new (self->priv->screenshot, self->priv->save_uri);
353       toplevel = screenshot_dialog_get_toplevel (self->priv->dialog);
354       gtk_widget_show (toplevel);
to


350   if (0 < 1)

Time to build!
./configure --prefix=${HOME}/.gsc --program-suffix=.debugged
make
make install

Note: the install prefix here works fine for a single-user desktop. If you want everyone to be able to use our shiny new gnome-screenshot, put everything in /usr/bin instead.

We now have a working gnome screenshot in ~/.gsc that behaves as intended.
tree -L 2 -d
.
|-- bin
`-- share
    |-- applications
    |-- GConf
    |-- glib-2.0
    |-- gnome-screenshot
    |-- locale
    `-- man


 However, we need to make sure our fixed gnome-screenshot gets invoked.

In Gnome Shell 3.2.X
sudo apt-get install gconf-editor
Start gconf-editor
go to /apps/metacity/keybinding_commands/command_screenshot
change to e.g.  /home/verahill/.gsc/bin/gnome-screenshot.debugged
Also, change command_window_screenshot to
/home/verahill/.gsc/bin/gnome-screenshot.debugged --window

Note: defining Print/Alt+print keyboard shortcuts the 'gnome-shell' way (i.e. via system-settings) doesn't seem to work in gnome 3.2. Conversely, doing it the gconf-editor way in gnome 3.4 doesn't work.


In Gnome Shell 3.4.X
Go to System Settings, Keyboard, Shortcuts
Disable the automatically defined shortcuts for gnome-screenshot

And add your own under custom shortcuts:




Done! 
Unless you want to add to PATH in which case you can put this in your ~/.bashrc:
export PATH=$PATH:${HOME}/.gsc/bin




Note: If it's still not working, try to launch from the terminal. If you get
(gnome-screenshot.debugged:7493): GLib-GIO-ERROR **: Settings schema 'org.gnome.gnome-screenshot' does not contain a key named 'auto-save-directory'
Trace/breakpoint trap
it's because you still have the old (good) gnome-screenshot package, whose settings schema lacks that key. Un-hold and reinstall it:
sudo su
echo "gnome-screenshot install"|dpkg --set-selections
exit
sudo apt-get install gnome-screenshot

Now try
gsettings get org.gnome.gnome-screenshot auto-save-directory
which should be empty.

gsettings set org.gnome.gnome-screenshot auto-save-directory '/home/verahill/Pictures'

Finally, make sure to re-set your keybindings.



Links to this post:
http://qfox.nl/notes/153

08 June 2012

183. Compiling OpenMM 4.1 on debian testing

OpenMM 4.0 is still somewhat of a traumatic memory. However, having gotten a question about the compilation of v4.1 I can't really resist giving the new version a go.

Having said that, I never ended up using the GPU-enabled gromacs for which I built openmm, so it was all an enormous waste of time. For those of you thinking about GPU/Gromacs, know this:
* not all graphics cards are supported or worth supporting
* there's no speed-up for explicit solvent molecules, and what else would you use gromacs or MD for?
* consumer-grade graphics cards get very hot

I make no attempt at ferreting out what packages are needed other than what I'm explicitly prompted for. Look at http://verahill.blogspot.com.au/2012/01/debian-testing-64-wheezy_20.html for an indication of what you might need.

Also, I already have openmm 4.0 installed, so e.g. paths and other things defined in the post above are still active.


Start here
Register with simtk.org and download the source file.
sudo apt-get install cmake-curses-gui libgccxml-dev gccxml nvidia-cuda-toolkit
unzip -x OpenMM4.1-Source.zip
mkdir openmm_build
cd openmm_build/
ccmake -i ../OpenMM4.1-Source/

It'll say Empty Cache. Hit c which will populate the list.

I think we can ignore the EMU libs since they do device emulation. I never figured out what the CUT program was and it's not mentioned in the manual from what I can see.


These are the settings I chose -- I had problems before setting the OPENCL entries (OPENMM_BUILD_OPENCL_LIB, OPENMM_BUILD_OPENCL_TESTS and OPENMM_BUILD_RPMD_OPENCL_LIB below) to OFF.

BUILD_TESTING:BOOL=ON
CMAKE_BUILD_TYPE:STRING=Release
CMAKE_INSTALL_PREFIX:PATH=/home/verahill/.openmm
CUDA_BUILD_TYPE:STRING=Device
CUDA_INSTALL_PREFIX:PATH=/usr/bin
CUDA_NVCC:FILEPATH=/usr/bin/nvcc
DL_LIBRARY:FILEPATH=/usr/lib/x86_64-linux-gnu/libdl.so
FOUND_CUBLAS:FILEPATH=/usr/lib/x86_64-linux-gnu/libcublas.so
FOUND_CUBLASEMU:FILEPATH=FOUND_CUBLASEMU-NOTFOUND
FOUND_CUFFT:FILEPATH=/usr/lib/x86_64-linux-gnu/libcufft.so
FOUND_CUFFTEMU:FILEPATH=FOUND_CUFFTEMU-NOTFOUND
FOUND_CUT:FILEPATH=FOUND_CUT-NOTFOUND
FOUND_CUT_INCLUDE:PATH=FOUND_CUT_INCLUDE-NOTFOUND

GCCXML_EXTRA_ARGS:STRING=
GCCXML_PATH:FILEPATH=/usr/bin/gccxml
OPENMM_BUILD_AMOEBA_CUDA_LIB:BOOL=ON
OPENMM_BUILD_AMOEBA_PLUGIN:BOOL=ON
OPENMM_BUILD_CUDA_LIB:BOOL=ON
OPENMM_BUILD_CUDA_TESTS:BOOL=TRUE
OPENMM_BUILD_C_AND_FORTRAN_WRAPPERS:BOOL=ON
OPENMM_BUILD_FREE_ENERGY_CUDA_LIB:BOOL=ON
OPENMM_BUILD_FREE_ENERGY_PLUGIN:BOOL=ON
OPENMM_BUILD_OPENCL_LIB:BOOL=OFF
OPENMM_BUILD_OPENCL_TESTS:BOOL=OFF
OPENMM_BUILD_PYTHON_WRAPPERS:BOOL=ON
OPENMM_BUILD_RPMD_OPENCL_LIB:BOOL=OFF
OPENMM_BUILD_RPMD_PLUGIN:BOOL=ON
OPENMM_BUILD_SERIALIZATION_SUPPORT:BOOL=ON
OPENMM_BUILD_STATIC_LIB:BOOL=ON
OPENMM_GENERATE_API_DOCS:BOOL=OFF
OPENMM_SVN_REVISION:STRING=exported
PYTHON_EXECUTABLE:FILEPATH=/usr/bin/python
SVNVERSION_PROGRAM:FILEPATH=/usr/bin/svnversion
SWIG_EXECUTABLE:FILEPATH=/usr/bin/swig
SWIG_VERSION:STRING=2.0.7

Make your changes and hit c again, then hit g which brings you back to the terminal.



make -d|tee make.log
make test


If all goes well you'll see
126/126 Test #126: TestParser ......................................   Passed    0.02 sec
100% tests passed, 0 tests failed out of 126
Total Test time (real) = 345.83 sec
make install


[..]
-- Installing: /home/verahill/.openmm/examples/Makefile
-- Installing: /home/verahill/.openmm/examples/NMakefile
-- Installing: /home/verahill/.openmm/examples/MakefileNotes.txt
-- Installing: /home/verahill/.openmm/examples/Empty.cpp

And you are done!

tree ~/.openmm/ -L 4 -d
.openmm/
|-- bin
|-- docs
|   |-- api-c++
|   `-- api-python
|-- examples
|   `-- VisualStudio
|-- include
|   `-- openmm
|       |-- internal
|       `-- serialization
|-- lib
|   `-- plugins
`-- licenses


182. Oracle Java JDK (java, javac and javaws) in debian testing/wheezy

With ECCE I was having problems getting matching versions of java and javac on a computer where I was using sun java 6.0.

Since I'm using SGE I (think I) need the closed source SUN java version.
Download (and click on the license agreement) here:
http://www.oracle.com/technetwork/java/javase/downloads/jdk-6u32-downloads-1594644.html


(v7u4 is available here:
http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u4-downloads-1591156.html
)


Then follow this: http://verahill.blogspot.com.au/2012/04/installing-sunoracle-java-in-debian.html
sudo apt-get install java-package
make-jpkg jdk-6u32-linux-x64.bin

and follow the instructions. Once the package is built, install:
sudo dpkg -i oracle-j2sdk1.6_1.6.0+update32_amd64.deb

Unpacking oracle-j2sdk1.6 (from oracle-j2sdk1.6_1.6.0+update32_amd64.deb) ...
Setting up oracle-j2sdk1.6 (1.6.0+update32) ...
update-alternatives: using /usr/lib/jvm/j2sdk1.6-oracle/jre/bin/ControlPanel to provide /usr/bin/ControlPanel (ControlPanel) in auto mode.
update-alternatives: using /usr/lib/jvm/j2sdk1.6-oracle/jre/lib/amd64/libnpjp2.so to provide /usr/lib/iceweasel/plugins/libjavaplugin.so (iceweasel-javaplugin.so) in auto mode.
update-alternatives: using /usr/lib/jvm/j2sdk1.6-oracle/jre/lib/amd64/libnpjp2.so to provide /usr/lib/chromium/plugins/libjavaplugin.so (chromium-javaplugin.so) in auto mode.
sudo update-alternatives --config java
There are 6 choices for the alternative java (providing /usr/bin/java).
  Selection    Path                                            Priority   Status
------------------------------------------------------------
  0            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java   1061      auto mode
  1            /usr/bin/gij-4.4                                 1044      manual mode
  2            /usr/bin/gij-4.6                                 1046      manual mode
* 3            /usr/lib/jvm/j2re1.6-oracle/bin/java             314       manual mode
  4            /usr/lib/jvm/j2sdk1.6-oracle/jre/bin/java        315       manual mode
  5            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java   1061      manual mode
  6            /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java   1051      manual mode
sudo update-alternatives --config javac
There are 2 choices for the alternative javac (providing /usr/bin/javac).
  Selection    Path                                         Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-7-openjdk-amd64/bin/javac   1051      auto mode
  1            /usr/lib/jvm/j2sdk1.6-oracle/bin/javac        315       manual mode
  2            /usr/lib/jvm/java-7-openjdk-amd64/bin/javac   1051      manual mode
sudo update-alternatives --config javaws
There are 3 choices for the alternative javaws (providing /usr/bin/javaws).

  Selection    Path                                              Priority   Status
------------------------------------------------------------
  0            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/javaws   1061      auto mode
* 1            /usr/lib/jvm/j2re1.6-oracle/bin/javaws             314       manual mode
  2            /usr/lib/jvm/j2sdk1.6-oracle/jre/bin/javaws        315       manual mode
  3            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/javaws   1061      manual mode




While I was making the package this little guy popped up. Don't fret. I think it was meant to take me to the java.com website or something similar. I don't like the sight of that /root/ thingy though -- what's oracle thinking of us punters?





07 June 2012

181. Compiling openmpi on debian wheezy

There's nothing complicated about this compilation. It's not a terribly quick build though, and I'm not yet sure exactly what packages are necessary.

sudo apt-get install build-essential gfortran
wget http://www.open-mpi.org/software/ompi/v1.6/downloads/openmpi-1.6.tar.bz2
tar xvf openmpi-1.6.tar.bz2
cd openmpi-1.6/

sudo mkdir /opt/openmpi/
sudo chown ${USER} /opt/openmpi/
./configure --prefix=/opt/openmpi/1.6/ --with-sge

make
make install

And you're done.

tree -L 2 -d /opt/openmpi
.
└── 1.6
    ├── bin
    ├── etc
    ├── include
    │   ├── openmpi
    │   └── vampirtrace
    ├── lib
    │   ├── openmpi
    │   └── pkgconfig
    └── share
        ├── man
        ├── openmpi
        └── vampirtrace

Linking to the libs is done as before, although the path to e.g. libmpi.so is /opt/openmpi/1.6/lib/ and not /opt/openmpi/1.6/lib/openmpi/ like in the regular debian package.


You might also want to update the /etc/alternatives/libmpi.so symlink.
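E.g. something along these lines (a sketch -- the symlink line is the blunt approach; adjust paths to taste):

export PATH=/opt/openmpi/1.6/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/1.6/lib:$LD_LIBRARY_PATH
sudo ln -sf /opt/openmpi/1.6/lib/libmpi.so /etc/alternatives/libmpi.so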

This is definitely one of those packages where it's worth doing ./configure --help to see what options are available.

Also, I imagine that on ROCKS there may well be a few packages which will have to be compiled first and specified using --with-<> switches.

A sample:

  --with-blcr(=DIR)       Path to BLCR Installation
  --with-blcr-libdir=DIR  Search for BLCR libraries in DIR
  --with-hwloc(=DIR)      Build hwloc support. DIR can take one of three
  --with-hwloc-libdir=DIR Search for hwloc libraries in DIR. Should only be
  --with-valgrind(=DIR)   Directory where the valgrind software is installed
  --with-memory-manager=TYPE
  --with-libpicl(=DIR)    Build libpicl support, optionally adding
  --with-libpicl-libdir=DIR
  --with-timer=TYPE       Build high resolution timer component TYPE
  --with-portals=DIR      Specify the installation directory of PORTALS
  --with-portals-libs=LIBS
                          Libraries to link with for portals
  --with-alps             Build ALPS scheduler component (default: no)
  --with-lsf(=DIR)        Build LSF support
  --with-lsf-libdir=DIR   Search for LSF libraries in DIR
  --with-pmi              Build PMI support (default: no)
  --with-cray-pmi-ext     Include Cray PMI2 extensions (default: no)
  --with-slurm            Build SLURM scheduler component (default: yes)
  --with-tm(=DIR)         Build TM (Torque, PBSPro, and compatible) support,
  --with-ftb(=DIR)        Build FTB (Fault Tolerance Backplane) support,
  --with-ftb-libdir=DIR   Search for FTB (Fault Tolerance Backplane) libraries
  --with-esmtp(=DIR)      Build esmtp support, optionally adding DIR/include,
  --with-esmtp-libdir=DIR Search for the esmtp libraries in DIR
  --with-sge              Build SGE or Grid Engine support (default: no)
  --with-loadleveler      Build LoadLeveler scheduler component (default: yes)
  --with-elan(=DIR)       Build Elan (QsNet2) support, searching for libraries
  --with-elan-libdir=DIR  Search for Elan (QsNet2) libraries in DIR
  --with-mx(=DIR)         Build MX (Myrinet Express) support, optionally
  --with-mx-libdir=DIR    Search for MX (Myrinet Express) libraries in DIR
  --with-openib(=DIR)     Build OpenFabrics support, optionally adding
  --with-openib-libdir=DIR
  --with-portals(=DIR)    Build Portals support, optionally adding
  --with-portals-config   configuration to use for Portals support. One of
  --with-portals-libs=LIBS
                          Libraries to link with for portals
  --with-sctp(=DIR)       Build SCTP support, searching for libraries in DIR
  --with-sctp-libdir=DIR  Search for SCTP libraries in DIR
  --with-knem(=DIR)       Build knem Linux kernel module support, searching
  --with-udapl(=DIR)      Build uDAPL support, optionally adding DIR/include,
  --with-udapl-libdir=DIR Search for uDAPL libraries in DIR
  --with-fca(=DIR)        Build fca (Mellanox Fabric Collective Accelerator)
  --with-io-romio-flags=FLAGS
  --with-mxm(=DIR)        Build Mellanox Messaging support
  --with-mxm-libdir=DIR   Search for Mellanox Messaging libraries in DIR
  --with-psm(=DIR)        Build PSM (Qlogic InfiniPath) support, optionally
  --with-psm-libdir=DIR   Search for PSM (QLogic InfiniPath PSM) libraries in
  --with-contrib-vt-flags=FLAGS
  --with-event-rtsig      compile with support for real time signals
  --with-pic[=PKGS]       try to use only PIC/non-PIC objects [default=use
  --with-gnu-ld           assume the C compiler uses GNU ld [default=no]
  --with-sysroot=DIR Search for dependent libraries within DIR

180. Temporary fix for supertuxkart

I don't often play games, but I noticed that supertuxkart had been updated in debian wheezy and having a little bit of free time I figured I'd give it a whirl.

 supertuxkart 
supertuxkart: error while loading shared libraries: libIrrlicht.so.1.7a.3: cannot open shared object file: No such file or directory

Make sure that libirrlicht1.7a is installed.
sudo apt-get install libirrlicht1.7a

Then
cd /usr/lib
sudo ln -s libIrrlicht.so.1.7a.2 libIrrlicht.so.1.7a.3

It's obviously not a permanent fix, but I haven't had any problems playing.

179. Building ECCE on Debian Testing/Wheezy

UPDATE: Build went fine. Upgrade went fine. But the organizer doesn't show my jobs properly i.e. the files are there but they aren't recognised as jobs. I haven't had a solid look at this yet, and it might just be because I need to restart more services than just the http server. It's been a long day...

UPDATE 2: An update on a different computer went without a hitch, with all the old job files being imported properly.

UPDATE 3: I started ECCE and let the data manager chew on it for four hours. No luck on the troublesome computer. The only difference is the java -- openjdk 7 worked fine, oracle jre/jdk didn't. Dunno if this is the reason, but currently in the process of installing the binaries to see whether that works better. Updates will come...

UPDATE 4: Installing the prebuilt ECCE binaries did the trick. In summary: as far as I know you MUST use openjdk 7. SUN/Oracle Java does not appear to work. It's exhibited in a lack of ability to recognise old jobs as being...jobs rather than just folders and files.


POST BEGINS HERE:
I'm trying to document everything I'm doing these days, no matter how simple or (at least in retrospect) obvious it is.

Here's how to build the 4th of June 2012 version of ecce v6.3.

You need to be registered with EMSL/PNNL to download ecce. There are plans to open-source the software properly (i.e. no need to register) sometime this coming northern summer. But for now you need to be an academic group leader to have access.

(I originally posted a somewhat different post where I recommended making some changes to the build scripts re ECCE_HOME. Eventually I saw the light and realised the error of my ways)

Download ecce-v6.3.src.tar.bz2, and put it in a suitable folder, e.g. ~/tmp
cd ~/tmp
tar -xvf ecce-v6.3-src.tar.bz2
cd ecce-v6.3/
export ECCE_HOME=`pwd`
cd build/
./build_ecce

The first time you run ./build_ecce you'll be asked a series of questions relating to installed packages. If it's all good, answer
Do you want to skip these checks for future build_ecce invocations (y/n)? y
If anything came up, then read the message carefully and install the missing package.

NOTE: on one box I noticed different versions of java and javac being found, as I had both openjdk 6 and 7 installed. I couldn't set javac to 6 but I could do
sudo update-alternatives --config java
and set it to openjdk 7.

[From my small, statistically unsound sample set Oracle/SUN java will NOT work.]

Then do ./build_ecce again. And again. And again. In all, I think you do it six or seven times - each time a new package is built.
I always get a
lib: No such file or directory.
at the end of the httpd build. Not sure why, but everything seems to be ok in spite of that.
Anyway, you know that you're done running ./build_ecce when you get
ECCE built and distribution created in /home/verahill/tmp/ecce-v6.3
At this point, you are ready to install
DO NOT USE ./install_ecce

GO UP ONE LEVEL AND DO
./install_ecce.v6.3.csh

But that's a different story. install_ecce will give you weird error messages about missing tar files. install_ecce.v6.3.csh on the other hand will work fine.

178. Gridengine queues on heterogeneous systems

I don't want to risk three-slot jobs being submitted to quad-core nodes, so I figured I'd try setting up different queues based on the job's parameters.

Some reading:
http://wiki.gridengine.info/wiki/index.php/StephansBlog
https://www.clumeq.ca/wiki/index.php/Using_SGE#Queues_List
http://ait.web.psi.ch/services/linux/hpc/merlin3/sge/admin/sge_queues.html

qconf -ahgrp @quads
group_name @quads
hostlist tantalum
qconf -aq four.q
qname                 four.q
hostlist              @quads
seq_no             1
slots                 4,[tantalum=4]
pe_list              make mpi4
qconf -ahgrp @thrice
group_name @thrice
hostlist boron beryllium
qconf -aq three.q
qname                 three.q
hostlist              @thrice
seq_no                1
pe_list               make mpi3
slots                 3,[boron=3],[beryllium=3]
Finally, to avoid jobs being submitted to main.q (without having to delete it), we change the seq_no to 9 for that particular q.
Also, we'll change the pe_list on main.q to remove mpi3 and mpi4 -- that way main.q is only used if I request only one core.
pe_list       make mpi1
And now jobs get sent to the right queue (and node) depending on the number of cores I request.
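A quick way of checking: a job asking for four slots can now only be satisfied by four.q, so it should end up on tantalum. E.g. a minimal test.qsub:

#$ -S /bin/csh
#$ -cwd
#$ -pe mpi4 4
hostname

submitted with qsub test.qsub as usual.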

06 June 2012

177. Jerry-rigging g09 UV/VIS spectra in gnuplot and/or octave

EDIT: I had a nicer post with lots of figures before. Because I realised that the data is good enough to be included in a future paper we're working on, I had to take everything down again. All the data in the plots now is made up (hence 'fakeuv.dat'), and I haven't made the plots look nice.

I don't like proprietary formats for anything. They never, ever benefit anyone other than the software vendor.

Almost as bad as using binary proprietary formats is not providing export facilities to ascii formats.

I may have missed it, but I was using gaussview to look at td-dft calculated uv/vis spectra -- and couldn't find a way of exporting the data. Sure, I could export the graph as a png, svg etc. file. But not as a double-column tab-separated ascii file.

There's a bit of fudging in what I'm doing -- I'll be the first one to admit that.

So here's single line to export the wavelengths and intensities:
cat g03.g03out|grep Excited|grep -v singles|sed 's/=/\t/g'|gawk '{print $7,$10}'>uvvis.dat

You can plot them in gnuplot using
plot 'uvvis.dat' u 1:2 w impulse

The problem is that these are just spikes -- not the smooth uv/vis-like spectra we're used to. On the other hand, if I understand things correctly, this is the REAL data, while the smoothed uv/vis spectrum is more for presentation purposes. I might obviously be wrong, and I am by no stretch a computational or theoretical chemist - I just like their tools.

We've got an immensely powerful tool at our hands: Octave!
data=load('fakeuv.dat');
gauss= @(x,c,r,s) r.*1./(s.*sqrt(2*pi)).*exp(-0.5*((x-c)./s).^2)
x=linspace(250,850,600);
plot(x,cumsum(gauss(x,data(:,1),data(:,2),20)))

where 20 is an arbitrary value. Anyway, this is how it looks:
We can try s=30 instead:

We export it
outdata=cumsum(gauss(x,data(:,1),data(:,2),30));
exportdata=[x' outdata'];
save 'uvvis2.sim' exportdata
and plot it in gnuplot
plot 'uvvis2.sim' u 1:48 w lines
It might not look like the UV/VIS spectrum you're used to, but as I said in the beginning, the data's all made up -- using 'real' calculated data I got a beautiful spectrum.
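As an aside, the u 1:48 above picks out one of the cumulative-sum columns; the last column of the saved file is the sum over all transitions. If only the final summed spectrum is wanted, sum() along the first dimension avoids the column picking -- a sketch, with gauss, data and x as defined above:

outdata=sum(gauss(x,data(:,1),data(:,2),30),1);
exportdata=[x' outdata'];
save 'uvvis3.sim' exportdata

which can then be plotted with u 1:2 in gnuplot.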

176. Weaning people onto SGE, one script at a time

On a five node (1 front + 4 exec; each node has 8 cores and 8 GB RAM) cluster that I know and hang out with, people have been submitting jobs one by one. As in, doing it manually, without a queue manager.

I got one of the users to start using my Very Simple Python Queue Manager to prevent too much idle time, but not everyone is using it yet.

Another downside when people aren't using queue managers is that they use top and kill to manage jobs, and that has a way of screwing things up for everyone. SGE is a much better solution in every possible sense.

To make it easier for the users to switch to using qsub, i.e. to make the change as undisruptive as possible, I wrote a little bash function and set up some standard qsub files.

The user navigates to the directory where their .in file is (e.g. test.in) and runs
presub test
which opens test.in and creates test.qsub

The user then submits by doing
qsub test.qsub


It's easy enough to customize the function and the output files (e.g. using .com, .g03in etc.). This script obviously only does g09, but I'll post a more general script later.




The .bashrc function:
presub () {

    paste -s -d "\n" ~/.qsub/qsub.head $1.in ~/.qsub/qsub.tail > $1.qsub
    return 0
}



The files:
I put the following files in ~/.qsub/

qsub.head:

#$ -S /bin/sh
#$ -cwd
#$ -l h_rt=99:30:00
#$ -l h_vmem=8G
#$ -j y
#$ -pe orte 8



export GAUSS_SCRDIR=/tmp
export GAUSS_EXEDIR=/share/apps/gaussian/g09/bsd:/share/apps/gaussian/g09/local:/share/apps/gaussian/g09/extras:/share/apps/gaussian/g09
/share/apps/gaussian/g09/g09 << END >> g09.log

qsub.tail:





END

The empty lines above are on purpose since gaussian can be annoying in that sense.


175. Track Changes in Libreoffice

Since I collaborate I occasionally need to whip out libreoffice. I can never find the track changes function, so I'll make this a brief post:

In Libre Office, go to Edit, Changes, and tick Record.

Other than that it works exactly like the Track Changes function we've come to hate/love in Word.

05 June 2012

174. Setting up Sun Grid Engine with three nodes on Debian

Firstly, I must acknowledge this guide: http://helms-deep.cable.nu/~rwh/blog/?p=159

I FOLLOW THAT POST ALMOST VERBATIM

This post will be more of a "I followed this guide and it actually works on debian testing/wheezy too and here's how" post, since it doesn't add anything significant to the post above, other than detail.

Since I ran into problems over and over again, I'm posting as much as I can here. Hopefully you can ignore most of the post for this reason.

Some reading before you start:
Having toyed with this for a while I've noticed one important factor in getting this to work:
the hostnames you use when you configure SGE MUST match those returned by hostname. It doesn't matter what you've defined in your /etc/host file. This can obviously cause a little bit of trouble when you've got multiple subnets set up (my computers communicate via a 10/100 net for WAN and a 10/100/1000 net for computations). My front node is called beryllium (i.e. this is what is returned when hostname is executed) but it's known as corella on the gigabit LAN. Same goes for one of my sub nodes: it's called borax on the giganet and boron on the slow LAN. hostname here returns boron. I should obviously go back and redo this for the gigabit subnet later -- I'm just posting  what worked.

While setting it up on the front node takes a little while, the good news is that very little work needs to be done on each node. This would become important when you are working with a large number of nodes -- with the power of xargs and a name list, setting them up on the front node should be a breeze.

My front node is beryllium, and one of my subnodes is boron. I've got key-based, password-less ssh login set up.

Set up your front node before you touch your subnodes. Add all the node names to your front node before even installing gridengine-exec on the subnodes.

I've spent a day struggling with this. The order of events listed here is the first thing that worked. You make modifications at your own peril (and frustration). I tried openjdk with little luck, hence the sun java.

NFS
Finally, I've got nfs set up to share a folder from the front node (~/jobs) to all my subnodes. See here for instructions on how to set it up: http://verahill.blogspot.com.au/2012/02/debian-testing-wheezy-64-sharing-folder.html

When you use ecce, you can and SHOULD use local scratch folders i.e. use your nfs shared folder as the runtime folder, but set scratch to e.g. /tmp which isn't an nfs exported folder.


Before you start, stop and purge
If you've tried installing and configuring gridengine in the past, there may be processes and files which will interfere. On each computer do
ps aux|grep sge
use sudo kill to kill any sge processes
Then
sudo apt-get purge gridengine-*


First install sun/oracle java on all nodes.

[UPDATE 24 Aug 2013: openjdk-6-jre or openjdk-7-jre work fine, so you can skip this]

There's no sun/oracle java in the debian testing repos anymore, so we'll follow this: http://verahill.blogspot.com.au/2012/04/installing-sunoracle-java-in-debian.html

sudo apt-get install java-package
Download the jre-6u31-linux-x64.bin from here: http://java.com/en/download/manual.jsp?locale=en
make-jpkg jre-6u31-linux-x64.bin
sudo dpkg -i oracle-j2re1.6_1.6.0+update31_amd64.deb 

Then select your shiny oracle java by doing:
sudo update-alternatives --config java
sudo update-alternatives --config javaws

Do that on every node, front and subnodes. You don't have to do all the steps though: you just built oracle-j2re1.6_1.6.0+update31_amd64.deb, so copy that to your nodes, do sudo dpkg -i oracle-j2re1.6_1.6.0+update31_amd64.deb and then do the sudo update-alternatives dance.



Front node:
sudo apt-get install gridengine-client gridengine-qmon gridengine-exec gridengine-master
(at the moment this installs v 6.2u5-7)

I used the following:
Configure automatically: yes
Cell name: rupert
Master hostname: beryllium
 => SGE_ROOT: /var/lib/gridengine
 => SGE_CELL: rupert
 => Spool directory: /var/spool/gridengine/spooldb
 => Initial manager user: sgeadmin

Once it was installed, I added myself as an sgeadmin:
sudo -u sgeadmin qconf -am ${USER}
sgeadmin@beryllium added "verahill" to manager list
and to the user list:
qconf -au ${USER} users
added "verahill" to access list "users"
We add beryllium as a submit host
qconf -as beryllium
beryllium added to submit host list
Create the group allhosts
qconf -ahgrp @allhosts
group_name @allhosts
hostlist NONE
I made no changes

Add beryllium to the hostlist
qconf -aattr hostgroup hostlist beryllium @allhosts
verahill@beryllium modified "@allhosts" in host group list
qconf -aq main.q
This opens another text file. I made no changes.
verahill@beryllium added "main.q" to cluster queue list
Add the host group to the queue:
qconf -aattr queue hostlist @allhosts main.q
verahill@beryllium modified "main.q" in cluster queue list
1 core on beryllium is added to SGE:


qconf -aattr queue slots "[beryllium=1]" main.q
verahill@beryllium modified "main.q" in cluster queue list
Add execution host
qconf -ae 
which opens a text file in vim

I edited hostname (boron) but nothing else. Saving returns
added host boron to exec host list
Add boron as a submit host
qconf -as boron
boron added to submit host list
Add 3 cores for boron:
qconf -aattr queue slots "[boron=3]" main.q

Add boron to the queue
qconf -aattr hostgroup hostlist boron @allhosts

Here's my history list in case you can't be bother reading everything in detail above.
 2015  sudo apt-get install gridengine-client gridengine-qmon gridengine-exec gridengine-master
 2016  sudo -u sgeadmin qconf -am ${USER}
 2017  qconf -help
 2018  qconf user_list
 2019  qconf -au ${USER} users
 2020  qconf -as beryllium
 2021  qconf -ahgrp @allhosts
 2022  qconf -aattr hostgroup hostlist beryllium @allhosts
 2023  qconf -aq main.q
 2024  qconf -aattr queue hostlist @allhosts main.q
 2025  qconf -aattr queue slots "[beryllium=1]" main.q
 2026  qconf -as boron
 2027  qconf -ae
 2028  qconf -aattr hostgroup hostlist beryllium @allhosts
 2029  qconf -aattr queue slots "[boron=3]" main.q
 2030  qconf -aattr hostgroup hostlist boron @allhosts

 Next, set up your subnodes:

My example here is a subnode called boron.

On the subnode:
sudo apt-get install gridengine-exec gridengine-client
Configure automatically: yes
Cell name: rupert
Master hostname: beryllium
This node is called boron.

Check whether sge_execd got started after the install
ps aux|grep sge
sgeadmin 25091  0.0  0.0  31712  1968 ?        Sl   13:54   0:00 /usr/lib/gridengine/sge_execd
If not, and only if not, do

/etc/init.d/gridengine-exec start

cat /tmp/execd_messages.*
If there's no message corresponding to the current iteration of sge (i.e. you may have old error messages from earlier attempts) then you're probably in a good place.

Back to the front node:
 qhost 
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      6  0.57    7.8G    3.9G   14.9G  597.7M
boron                   lx26-amd64      3  0.62    3.8G  255.6M   14.9G     0.0
If the exec node isn't recognised (i.e. it's listed but no cpu info or anything else) then you're in a dark place. Probably you'll find a message about "request for user soandso does not match credentials" in your /tmp/execd_messages.* files on the exec node. The only way I got that solved was stopping all sge processes everywhere, purging all gridengine-* packages on all nodes and starting from the beginning -- hence why I posted the history output above.

qstat -f

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.64     lx26-amd64  
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/3          0.72     lx26-amd64  


Time to see how far we've got:
Create a file called test.qsub on your front node:
#$ -S /bin/csh
#$ -cwd
tree -L 1 -d
hostname
qsub test.qsub 
Your job 2 ("test.qsub") has been submitted
qstat -u ${USER}
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
      2 0.00000 test.qsub  verahill         qw    06/05/2012 14:03:10                                    1        
ls
test.qsub  test.qsub.e2  test.qsub.o2
cat test.qsub.[oe]*
.
0 directories
beryllium
Tree could have had more exciting output I s'pose, but I didn't have any subfolders in my run directory.

So far, so good. We still need to set up parallel environments (e.g. orte, mpi).


Before that, we'll add another node, which is called tantalum and has a quadcore cpu.
On the front node:

qconf -as tantalum
qconf -ae
replace template with tantalum 
qconf -aattr queue slots "[tantalum=4]" main.q
qconf -aattr hostgroup hostlist tantalum @allhosts

 qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      6  0.67    7.8G    3.7G   14.9G  597.7M
boron                   lx26-amd64      3  0.14    3.8G  248.0M   14.9G     0.0
tantalum                -               -     -       -       -       -       -
On tantalum:
Install java by copying over the oracle-j2re1.6_1.6.0+update31_amd64.deb that was created the first time you set up java:
sudo dpkg -i  oracle-j2re1.6_1.6.0+update31_amd64.deb
sudo update-alternatives --config java
sudo update-alternatives --config javaws

Install gridengine:
sudo apt-get install gridengine-exec gridengine-client

On the front node:

 qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      6  0.62    7.8G    3.7G   14.9G  601.0M
boron                   lx26-amd64      3  0.15    3.8G  248.6M   14.9G     0.0
tantalum                lx26-amd64      4  4.02    7.7G  977.0M   14.9G   24.1M

 qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.71     lx26-amd64  
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/3          0.72     lx26-amd64  
---------------------------------------------------------------------------------
main.q@tantalum                BIP   0/0/4          4.01     lx26-amd64    

It's a beautiful thing when everything suddenly works. 


Parallel environments:
In order to use all the cores on each node we need to set up parallel environments.

qconf -ap orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     FALSE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
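
You can check that the environment was registered (these are the standard show flags):

qconf -spl      # list all parallel environments
qconf -sp orte  # show the orte definition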

To use a parallel environment, include #$ -pe orte 3 (for three slots) in your test.qsub:

#$ -S /bin/csh
#$ -cwd
#$ -pe orte 3
tree -L 1 -d
hostname

Submit it:
qsub test.qsub 

Your job 14 ("test.qsub") has been submitted
qstat 
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
     14 0.00000 test.qsub  verahill         qw    06/05/2012 15:43:25                                    3        


verahill@beryllium:~/mine/qsubtest$ cat test.qsub.*
.
0 directories
boron
It got executed on boron.



That's the basic setup done. To read more, use google; some additional info that might be helpful is here: http://wiki.gridengine.info/wiki/index.php/StephansBlog

We're going to set up a few more parallel environments:


qconf -ap mpi1

pe_name            mpi1
slots              9
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qconf -ap mpi2



pe_name            mpi2
slots              9
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    2
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qconf -ap mpi3


pe_name            mpi3
slots              9
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    3
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qconf -ap mpi4


pe_name            mpi4
slots              9
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    4
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
And we'll call these using the #$ -pe mpi$totalprocs $totalprocs directive below.
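
The only real difference between the four is the allocation_rule: $fill_up lets SGE stack slots on hosts however they fit, while a fixed integer forces exactly that many slots per host. So, as I understand it, a job like this should end up as two slots on each of two hosts:

#$ -S /bin/csh
#$ -cwd
#$ -pe mpi2 4
hostname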

We need to add them to a queue (which queue is irrelevant, as long as the environment and queue parameters are consistent) -- in our case main.q:
qconf -mq main.q
pe_list               make orte mpi1 mpi2 mpi3 mpi4
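
A quick way of verifying that main.q picked them up:

qconf -sq main.q | grep pe_list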

This obviously isn't the end of my travails -- now I need to get nwchem and gaussian happy.
I've got this in my CONFIG.Dynamic (inside joke) file:

NWChem: /opt/nwchem/nwchem-6.1/bin/LINUX64/nwchem
Gaussian-03: /opt/gaussian/g09/g09
perlPath: /usr/bin/perl
qmgrPath: /usr/bin/

SGE {
#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=$wallTime
#$ -l h_vmem=4G
#$ -j y
#$ -pe mpi$totalprocs $totalprocs
}

NWChemCommand {
setenv LD_LIBRARY_PATH "/usr/lib/openmpi/lib:/opt/openblas/lib"
setenv PATH "/bin:/usr/bin:/sbin:/usr/sbin"
mpirun -n $totalprocs /opt/nwchem/nwchem-6.1/bin/LINUX64/nwchem $infile > $outfile
}

Gaussian-03Command{
setenv GAUSS_SCRDIR /scratch
setenv GAUSS_EXEDIR /opt/gaussian/g09/bsd:/opt/gaussian/g09/local:/opt/gaussian/g09/extras:/opt/gaussian/g09
/opt/gaussian/g09/g09 $infile $outfile >g09.log
}
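
For a four-core nwchem job I'd expect the SGE and NWChemCommand blocks above to expand into a qsub script along these lines (my guess at the expansion -- $wallTime and $totalprocs get filled in per job, and the 24:00:00 wall time and test.nw/test.nwout file names are stand-ins):

#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=24:00:00
#$ -l h_vmem=4G
#$ -j y
#$ -pe mpi4 4
setenv LD_LIBRARY_PATH "/usr/lib/openmpi/lib:/opt/openblas/lib"
setenv PATH "/bin:/usr/bin:/sbin:/usr/sbin"
mpirun -n 4 /opt/nwchem/nwchem-6.1/bin/LINUX64/nwchem test.nw > test.nwout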

And now everything works!


See below for a few of the annoying errors I encountered during my adventures:


Error -- missing gridengine-client
The gaussian set-up worked fine. The nwchem setup worked on one node but not at all on another -- my problem sounded identical to that described here (two nodes, same binaries, still one works and one doesn't):
http://www.open-mpi.org/community/lists/users/2010/07/13503.php
It's the same as this one, too: http://www.digipedia.pl/usenet/thread/11269/867/

[boron:18333] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 161
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[boron:18333] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../orte/runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[boron:18333] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../orte/tools/orterun/orterun.c at line 543

It took a while to troubleshoot this one. As always, when you're troubleshooting you discover the odd thing or two. On my front node:
/usr/bin/rsh -> /etc/alternatives/rsh
which is normal, but
/etc/alternatives/rsh -> /usr/bin/krb5-rsh
There were some krb packages on tantalum, but none on boron:
boron:
locate rsh|grep "usr/bin"
/usr/bin/rsh

tantalum:
locate rsh|grep "usr/bin"
/usr/bin/glib-genmarshal
/usr/bin/qrsh
/usr/bin/rsh

sudo apt-get autoremove krb5-clients

Of course, that did not get it working...
The annoying thing is that nwchem/mpirun work perfectly together on boron, also when submitting jobs directly via ECCE. It's just with qsub that I am having trouble. The search continues:
On the troublesome node:
aptitude search mpi|grep ^i
i   libblacs-mpi-dev                - Basic Linear Algebra Comm. Subprograms - D
i A libblacs-mpi1                   - Basic Linear Algebra Comm. Subprograms - S
i A libexempi3                      - library to parse XMP metadata (Library)
i   libopenmpi-dev                  - high performance message passing library -
i A libopenmpi1.3                   - high performance message passing library -
i   libscalapack-mpi-dev            - Scalable Linear Algebra Package - Dev. fil
i A libscalapack-mpi1               - Scalable Linear Algebra Package - Shared l
i A mpi-default-bin                 - Standard MPI runtime programs (metapackage
i A mpi-default-dev                 - Standard MPI development files (metapackag
i   openmpi-bin                     - high performance message passing library -
i A openmpi-checkpoint              - high performance message passing library -
i A openmpi-common                  - high performance message passing library -

Library conflict?
sudo apt-get autoremove mpi-default-*

Then I recompiled nwchem. Still no change.

Finally I found the real problem:
gridengine-client was missing on the troublesome node. Once I had installed that, everything worked!
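
In other words, on the misbehaving node:

sudo apt-get install gridengine-client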



Errors:
If your parallel job won't start (it sits in state qw forever), and qstat -j jobid gives you
scheduling info: cannot run in PE "orte" because it only offers 0 slots
make sure that qstat -f lists all your nodes.

This is good:

 qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.71     lx26-amd64
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/3          0.72     lx26-amd64
---------------------------------------------------------------------------------
main.q@tantalum                BIP   0/0/4          4.01     lx26-amd64    
This is bad:
qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.64     lx26-amd64    



To fix it, do 
qconf -aattr hostgroup hostlist tantalum @allhosts

on the front node for all your node names (change tantalum to the correct name).
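
If you have several nodes to add, a small loop on the front node saves some typing (the node names here are just the ones from this post):

for node in boron tantalum; do qconf -aattr hostgroup hostlist $node @allhosts; done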

An unhelpful error message:
qstat -u verahill
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
 3 0.50000 test.qsub  verahill         Eqw   06/05/2012 11:45:18                                    1        
cat test.qsub.[eo]*

/builder/src-buildserver/Platform-7.0/src/linux/lwmsg/src/connection-wire.c:325: Should not be here

This came from a faulty qsub directive: I had used
#$ -S csh
instead of
#$ -S /bin/csh
i.e. you need to give the full path to the shell.

I think it's a common enough mistake to be worth posting here. See http://helms-deep.cable.nu/~rwh/blog/?p=159 for more errors.
