14 June 2012

191. Thinking about Molecular volume -- and is cosmo/nwchem yielding the right ones?

Disclaimer:
I'm neither a theoretical nor a computational chemist. I'm an analytical/inorganic chemist who likes computers. Don't trust my conclusions. This is more like thinking aloud.

The problem:
The underlying impetus is that of molecular volume: if we have a set of scatter points in space which define the surface of a molecule, how can we extract the volume? In particular, since COSMO actually gives us the surface points in the form of a cosmo.xyz file (and yes, nwchem also outputs a volume -- more about that later), there's no reason why we shouldn't do the calculations ourselves. Also, there's at least one example of comparing results from a few major software packages, where nwchem was the odd one out.

Because it's good to know how to use Octave and bash, I'll show the commands as well.

The COSMO parameters used were
cosmo
    rsolv 0
end

[come to think of it: why bother with rsolv at all?]
and nwchem returned

 atomic radii =
 --------------
    1  6.000  2.000
    2  6.000  2.000
    3  6.000  2.000
    4  6.000  2.000
    5  6.000  2.000
    6  6.000  2.000
    7  1.000  1.300
    8  1.000  1.300
    9  1.000  1.300
   10  1.000  1.300
   11  1.000  1.300
   12  1.000  1.300
and a volume of ca 74.5 Å3

Processing:
me@Be:~$ head cosmo.xyz 
                  325
 cosmo charges
 Bq   2.1848085582473193      -0.38055253987610238        1.5251498369435705       -9.3089382062078174E-004
 Bq   1.6134835908159706      -0.59877925881345084        1.8782480854375714       -3.3706153046646758E-003
 Bq  0.43449121346899733      -0.59877925881345084        1.8782480854375714       -3.9739778624157118E-003
 Bq   1.0239874021424840      -0.23823332776127137        1.8683447179254316       -1.6433149723942275E-003


OK, we need to remove the first two lines, and the first column.
me@Be:~$ tail -n +3 cosmo.xyz|gawk '{print $2,$3,$4,$5}'> cos2.xyz
Start octave.
octave:1> bz=load('cos2.xyz');
octave:2> x=bz(:,1);y=bz(:,2);z=bz(:,3);c=bz(:,4);
octave:3> plot3(x,y,z)

Paradoxically, this would be fairly easy to do with a 'normal-size' physical object (e.g. water displacement, or area on a 2D projection: draw it, cut it out, weigh it and use the density of the paper).

Computationally, we need to think about it, though. The most logical approach seems to be to take all x,y data points within a small range of values z=zi±dz, project them onto a 2D surface, calculate the area within, and multiply it by dz. Do this for all values of z.
octave:4> plot(y,z,'*')


But how to calculate the area inside an arbitrary two-dimensional figure then? If we can pick a point in the 'centre' of the figure, we can draw repeated triangles with this point as one of the corners. It's easy to calculate the area of a triangle, so we just need to sum the areas of the triangles. All we need to know is how to find such a central point to use as a corner. Also, there are problems when dz is too large and the 'border' becomes fuzzy.
octave:5> plot(y(1:25),z(1:25),'*')
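To illustrate the triangle-fan idea on a figure where the answer is known (a made-up example -- a regular pentagon with its corners already in order, which is the situation we'd need to engineer for our scatter points):

t=linspace(0,2*pi,6)';     % five corners, with the first repeated at the end
py=cos(t); pz=sin(t);
cy=mean(py(1:5)); cz=mean(pz(1:5));   % the 'central' point
A=0;
for k=1:5
  % half the absolute cross product = area of triangle (centre, k, k+1)
  A=A+0.5*abs((py(k)-cy)*(pz(k+1)-cz)-(py(k+1)-cy)*(pz(k)-cz));
end
A

This returns A = 2.3776, i.e. (5/2)*sin(72 degrees), as it should for a pentagon with unit circumradius.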

In fact, at this stage there may well be pre-canned algorithms to help us.
octave:6> H=convhull(y(1:25),z(1:25));
octave:7> plot(y(H),z(H))
octave:8> hold
octave:9> plot(y(1:25),z(1:25),'*')

That way we can reduce the number of points to the ones defining the encircling figure.
octave:10> area(y(H),z(H))


That still doesn't give us the area (I think matlab does, though). Since the figure is centred around the x axis we could probably use cumsum(abs(z(H))), but that's not general enough. In fact, there'd be so much quality analysis required to make sure that we include enough, but not too many, points in our slices that it quickly becomes a chore.
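For what it's worth, the pieces are now all there to automate the slices: convhull puts the border points in order, and octave's built-in polyarea does the triangle/shoelace arithmetic for us. A sketch only (dz is arbitrary, and it does nothing about the fuzzy-border problem -- which is exactly the chore referred to above):

dz=0.4;
Vslices=0;
for zi=min(z):dz:max(z)
  idx=find(z>=zi & z<zi+dz);   % points in the current slab
  if (length(idx)>2)
    xi=x(idx); yi=y(idx);
    H=convhull(xi,yi);         % ordered border of the projected slab
    Vslices=Vslices+polyarea(xi(H),yi(H))*dz;
  end
end
Vslices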

So we'll take a step back. Turns out it's even easier:
octave:11> [H V]=convhulln([x y z]);
This probably isn't how you're supposed to plot it, but it works:
octave:12> trisurf(H,x,y,z)

[trisurf plot]

octave:13> V
gives V=104.07 Å3 (cf. nwchem/cosmo ca 74.5 Å3 for rsolv=0).

Now that doesn't look good, but it has been noted that nwchem/cosmo gives volumes which are about half of what every other program gives. See here and here:

">Cosmo produced volumes, which were twice as small
> as those obtained by PCM, while surfaces where by about 15% bigger in
> Cosmo."

I think nwchem actually isn't returning values of the wrong magnitude -- I think the value returned by nwchem is the molecular volume, while the other programmes return the solvent accessible surface-based volume. But what is in cosmo.xyz?

It appears to be a little bit more complex than that though.


We can open the cosmo.xyz file in jmol, but calculating the volume from these would be meaningless due to the way jmol works.

Instead we'll have to use the vdw radii together with the xyz coordinates of the (unoptimised) molecule:


$ isosurface sasurface 0.5 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = 141.06999
$ isosurface sasurface 0.225 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume =104.452415
$ isosurface solvent 0 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = 79.09731
$ isosurface solvent volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = [80.26721490808025]
$ isosurface molecular volume
isosurface2 created with cutoff=0.0; isosurface count: 2
isosurfaceVolume = [80.58888982478977]
$ isosurface sasurface 0.2 area
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceArea = 118.730934
Making sense?

sasurface generates a solvent accessible surface. We can generate a value similar to what we saw from the cosmo.xyz points by forcing the sasurface probe radius.

The vdw radii of H and C are 1.2 and 1.7 Å, but COSMO uses 1.3 and 2.0.

Look at the y,z scatter plot from above again:


The height goes from -2 to 2 Å, which agrees with the large 2.0 Å vdw radius for C that COSMO uses. The volume output by nwchem is the molecular volume (as is actually stated):
 number of -cosmo- surface points =      176
 molecular surface =    125.008 angstrom**2
 molecular volume  =     74.512 angstrom**3
(electrostatic) solvation energy =         0.0052128678 (    3.27 kcal/mol)
The molecular volume for rsolv=0 is 74.5 Å3, which is fairly close to isosurface sasurface 0 volume. Area is trickier, and requires isosurface sasurface 0.23 area to yield anything close.

I don't think it's a coincidence that isosurface sasurface 0.225 volume gives reasonable agreement with the cosmo.xyz volume, since 1.7+0.225=1.925, which is ca 2 (for H the cosmo radius is only 0.1 larger than the vdw radius).

I'm sure all this is in the manual somewhere. But there's nothing like learning through doing.

The conclusions:
* NWchem returns a volume based on the vdw radii, not the solvent cavity
* cosmo.xyz contains points defining the surface according to the vdw radii that cosmo uses
* These are two different sets of vdw radii
* You can't compare the output of different software packages if they aren't outputting the same data
* The reported NWChem volume depends on rsolv; the cosmo.xyz volume doesn't
* The cosmo.xyz volume is insensitive to rsolv, but sensitive to the radii, as expected. As far as I understand, the cosmo volumes are based solely on the vdw radii (as supplied to cosmo)
* I haven't quite figured out how, but looking at the dependence on rsolv vs the vdw radii defined for cosmo, the radii used to calculate the nwchem volume are certainly affected.

The numbers are NWChem molecular volume (Å3) / cosmo.xyz convhulln volume (Å3) / solvation energy (kcal/mol):

rsolv=0.0, vdw +0.0: 74.51/104.07/3.27
rsolv=0.5, vdw +0.0: 58.0/103.96/3.01
rsolv=1.0, vdw +0.0: 54/103.87/2.95
rsolv=0.0, vdw +0.1: 84.79/115.10/2.72
rsolv=0.1, vdw +0.1: 82.68/115.10/2.63
rsolv=0.27, vdw +0.1: 71.84/114.97/2.56
rsolv=0.0, vdw +0.2: 96.59/126.83/2.22
rsolv=0.1, vdw +0.2: 85.70/126.68/2.09
rsolv=0.70, vdw +0.2: 74.68/126.56/2.01

My only real conclusion at this point is that you have to be extremely careful about what you do. This is not easy.


A Certain Commercial Programme (ACCP):
Using pcm:

scrf=(pcm,solvent=water) -- this uses vdw radii.
GePol: Cavity volume                                =      134.665 Ang**3
GePol: Cavity surface area                          =    143.132 Ang**2
Let's see if we can do this in jmol:
$ isosurface sasurface 0.5 area
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceArea = 144.25595
$ isosurface sasurface 0.46 volume
isosurface1 created with cutoff=0.0; isosurface count: 1
isosurfaceVolume = 135.33589
PCM is less of a mystery now.

ACCP has a few more options though.
Using IPCM with 50 points. This uses the isodensity volume.
Volume of Solute Cavity = 8.026500E+02
Total "Solvent Accessible Surface Area" of Solute = 4.485628E+02
I've been told that the units are Bohr3 and Bohr2. That translates to 118.94 Å3 and 125.61 Å2, respectively, which sounds about right.
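For reference, the conversion (1 Bohr = 0.529177 Å):

octave:1> 802.65*0.529177^3
ans = 118.94
octave:2> 448.5628*0.529177^2
ans = 125.61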

13 June 2012

190. In deep water: NWChem and COSMO

This post is based entirely on empirical experience. I don't claim to know what I'm doing. Right now I'm just looking at performance.

To actually learn more about COSMO (implemented) and COSMO-RS (not implemented), read the following article by the creator of the methods: http://onlinelibrary.wiley.com/doi/10.1002/wcms.56/abstract

Anyway.


As always, the test job (benzene at b3lyp/6-31+g*) is very short, so the error margin is large. A major impetus for this is the exceptional performance of PCM in gaussian, and the seemingly poor performance of nwchem using standard settings. When several numbers are quoted they come from multiple runs.

Task energy - empty COSMO block:
0. Gas phase - ca 40 s
1. From scratch. Empty cosmo block - 79 s
2. Loaded movecs from gas phase, empty cosmo block - 48 s, 65 s, 65 s

The default is water and rsolv=0.
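In other words, the only thing varied between the runs below is the rsolv line in the cosmo block:

cosmo
    rsolv 0.5
end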

COSMO parameters
Movecs loaded in all cases. Solvation energies in []

Task energy -- rsolv
0. rsolv=0 - [3.27 kcal/mol] - 48 s, 65 s, 65 s, 65 s
1. rsolv=0.5 - [3.01 kcal/mol] - 58 s, 58 s
2. rsolv=1 - [2.95 kcal/mol] - 57 s, 57 s, 58 s
3. rsolv=2 - [2.62 kcal/mol] - 55 s

The molecular volumes obtained are 74.5, 58.0 and 54.0 Å3, respectively, for rsolv=0, 0.5 and 1. My next post will talk about what this actually means, but in short, this has nothing to do with the solvent/solute cavity.

Task energy -- lineq
0. rsolv=0.5; lineq 0 - [3.01 kcal/mol] - 58 s, 58 s, 56 s
1. rsolv=0.5; lineq 1 - [3.01 kcal/mol] -  58 s

Task energy -- ificos
0. rsolv=0.5; lineq 0, ificos=0 - [3.01 kcal/mol] - 58 s, 58 s, 56 s
1. rsolv=0.5; lineq 0, ificos=1 - [3.01 kcal/mol] - 62 s

1 (one) kcal/mol = 4.184 kJ/mol -- the spread in the values above (3.27-2.62 = 0.65 kcal/mol, i.e. ca 2.7 kJ/mol) is thus fairly wide, depending on the settings.

rsolv defines the probe used to find the solvent accessible surface -- the smaller the value, the more fine-grained and the larger the apparent accessible surface. We would expect that a fairly small value is preferable for rsolv.

Ultimately, I don't see any obvious way of improving performance, other than using large values for rsolv.

An interesting feature is that the surface used by COSMO is saved in a cosmo.xyz file in the runtime directory -- all that remains is working out a way of calculating the volume from it (I know it's reported in the nwchem output, but it never hurts being paranoid).

189. Thoughts on restarting NWChem jobs in ECCE

UPDATE: Because all my nodes are working hard to keep my office warm in the Australian winter, I haven't tested this very extensively, but it seems like freq jobs can be restarted using
freq
  reuse oldhessian.hess
end
Original post
As often is the case, this is as much a note to myself as a blog post.

Or that's how it started out. I've since spent a bit of time testing different restart options, as I found that some, paradoxically, actually seemed to lead to longer calculations...

The jobs I've experimented with are very short, so the error margin is probably huge.

What I tried:

A. Task dft geometry:
1. Original job - 215 s
2. Substituting Start with restart in the same directory as A.1 (i.e. loaded db)  - 204 s
3. Same as A.2, but also deleted geometry section - 43 s
4. Same as A.2 and A.3, but loaded movecs explicitly - 43 s
5. Used start, but loaded movecs from job B.3. - 267 s

Comment: not sure whether A.5 is consistently slower than A.1, but I've never seen it go faster. A.3 looks like a good bet when resuming a calculation.

B. Task dft energy:
1. Original job - 41 s
2. Deleted geometry, loaded movecs, db from B.1 - 8.7 s
3. Used start directive, kept geometry, loaded movecs from B.1, no db - 8.7 s

Comment: loading movecs (B.3) seems like a winner

C. Task dft frequency
0. Original job - 952 s
1. Deleted geom, loaded movecs, db from task energy (B.2 above) - 853 s
2. Delete geometry, loaded movecs from opt, put drv.hess, .hess, fd_ddipole in same directory (from A.1 above) - 842 s
3. Same as C2, but deleted basis block as well, and removed everything from the dft block except direct and vectors - 849 s
4. Copied hessian from 0, and put 'reuse' inside the freq block - 0.1 s (!)
5. Copied hessian from A.1 and put 'reuse' inside the freq block - 0.1 s (but data wrong)


Comment: is C0 really slower than the other jobs?
The problem with approach C.5 is that while C.4 gives

65.943 kcal/mol, 69.553 cal/mol-K
C.5 gives
288.058 kcal/mol, 64.706 cal/mol-K


Here's a comment from one of the developers, Bert:

"If you just want to redo an energy calculation followed by and ESP calculation, I would never use restart, but just use start and define the geometry in the geometry block. [cut] The restart is purely to continue the calculation that got interrupted, and the runtime database is probably not in a clean enough state to do something completely different with it. You can use the movecs that have been generated as the starting vectors though. "

A.5 (no effect) and B.3 (speed-up) would be in line with that approach.

With that in hand, time to work on ECCE.

ECCE is a nice tool, but like any point-and-click program it has its limitations -- it's impossible to predict every single type of usage, and this is particularly true for computational wizardry. To a large extent this is compensated for by the ability to do a 'final edit' using vim before submitting a job -- there is obviously nothing whatsoever that you're prevented from doing at this point, so it offers ultimate flexibility.

There is a major weakness in using ECCE though -- reusing files from old jobs.

In particular this is a weakness when it comes to restarting jobs. In terms of structure, this isn't a problem -- the last structure is provided by ecce via ecce.out. It would be nice to be able to carry over the .movecs files though (I'm still learning, but loading movecs and using fragment guess seems to be the neatest thing). This is high on the wish list.

Anyway, there are two major use cases:

Restarting an interrupted job
Assuming that you resubmit it without changing the name and to the same cluster so that it'll run remotely in the same directory:

Replace
start
with
restart
which should tell nwchem to look for the .db file, and
edit either the scf or dft block and add
vectors input jobname.movecs
Obviously, this isn't an example of great insight, but rather a product of reading the manual.
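Put together, the top of the edited input might look something like this (only a sketch -- nwch is what ECCE calls the job, and the rest of the input is left as generated):

restart nwch

dft
  vectors input nwch.movecs
end

task dft optimize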

Also, the manual does state (though it does not recommend it) that leaving the start/restart directive out will cause nwchem to look for evidence of whether it's a restarted job or not. The problem is that ECCE automatically names all files nwch.nw, which would cause nwchem to look for nwch.db and fail.


Launching a new calculation based on an old job
Now, if you are duplicating a job, or if you've since renamed the job, you're in a spot of trouble since ecce doesn't concern itself with the .db and .movecs files. Maybe there's a good reason for this? But if I understand everything correctly, this means that you are losing a lot of time on scf cycles which you could avoid by loading the .movecs and .db file.

I think that in the case of the same cluster and a similar directory structure (i.e. the previous job is also a subdirectory of the same parent directory as the new job) you can put this at the beginning of your job:
task shell "cp ../oldjob/*.movecs ."

and edit either the scf or dft block and add
vectors input jobname.movecs

and it actually works.
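Putting the two edits together, a sketch (the job names are made up, and note that the copied movecs file keeps the old job's name):

start newjob
# geometry and basis blocks as usual
task shell "cp ../oldjob/oldjob.movecs ."
dft
  vectors input oldjob.movecs
end
task dft energy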

But I had no luck doing this
task shell "cp ../oldjob/*.db ."
And combining it with restart didn't help: nwchem wants the .db file to be there already if you use restart. This agrees, at least in terms of functionality, with the comment by Bert above -- same directory is ok, but from a different directory only the movecs are reasonably easy to reuse.

Now, all we need is a tick box to copy the old movecs files between jobs...and the underlying structure. At the moment the movecs files don't get imported, so it would take a bit of editing to get to that point.


188. Notes: virtualbox and /etc/init.d/vboxdrv

If you install virtualbox and get ready to fire up the installation of a new virtual machine, but immediately get an error about having to do '/etc/init.d/vboxdrv setup' -- and find that there's no such executable in /etc/init.d in spite of having installed the virtualbox-dkms package -- then

sudo apt-get install linux-headers-`uname -r`
sudo dpkg-reconfigure virtualbox-dkms

12 June 2012

187. Thunderbird 13.0 from source on debian wheezy

First look here for dependencies:
http://verahill.blogspot.com.au/2012/05/thunderbird-1201-on-debian.html

In terms of building it's almost exactly the same as for the 12-series: the only difference is that you have to build outside the source tree.

cd ~/tmp
rm comm-release -rf
wget ftp://ftp.mozilla.org/pub/mozilla.org/thunderbird/releases/13.0/source/thunderbird-13.0.source.tar.bz2
tar xvf thunderbird-13.0.source.tar.bz2
mkdir thunderbird13
cd thunderbird13
../comm-release/configure --disable-necko-wifi

The next step takes a while (30-60 minutes)
make 

sudo make install

Done.

What's new: http://www.ghacks.net/2012/06/06/whats-new-in-thunderbird-13/

Errors
No rule to make target ../../../xpcom/idl-parser/xpidllex.py
Solution:
Build outside the source tree as shown above.

11 June 2012

186. Installing gnome shell extensions in gnome 3.4 on debian wheezy-- frippery panel, menu etc.

Gnome 3.4 frippery extensions in Debian Wheezy: bottom panel, favourites etc.
Upgrading to gnome 3.4 disabled all my extensions. It also removed all my keyboard shortcuts.

Update: Interesting take on the GNOME 3/KDE 4 releases: http://www.datamation.com/open-source/the-gnome-exodus-and-kde-2.html I think the idea of a lack of trust is a valid one: I might be able to get GNOME to do what I want today, but what about tomorrow? How much longer can I manually patch my screenshot app?

So, we need to get:
* move clock
* favourites
* application menu
* bottom panel
* static workspaces

Btw, extensions.gnome.org doesn't do International English. Try searching for favourites. And that's just the beginning of the headaches. I had problems finding any extensions compatible with gnome 3.4.

Anyway, as usual frippery (http://intgat.tigress.co.uk/rmy/extensions/index.html) comes to the rescue of the users (and by extension to the rescue of Gnome -- I'd already be long gone if I couldn't revert some of the more insane behaviour of gnome-shell...)

In your ~ folder (in order that the files get untarred to the correct location)
wget http://intgat.tigress.co.uk/rmy/extensions/gnome-shell-frippery-0.4.1.tgz
tar xvf gnome-shell-frippery-0.4.1.tgz 

Hit alt+f2 to bring up the launcher thingy, type 'r' and hit enter. You're done!


To make life worth living again, also do
sudo apt-get install gnome-tweak-tool
if you haven't already

That way you can get the Minimize/Maximize/Close buttons back on your window border.

Another noticeable change is that it's become very difficult to resize windows using the mouse -- expand horizontally or vertically is like before, but dragging a corner is tough -- it takes a lot of fiddling to be able to grab the corner in the first place.

Finally, ctrl+b is mapped to some bookmark function in epiphany/web which is annoying, since it's universally used to make things bold. The gnome developer instructions even say not to do this:
http://developer.gnome.org/hig-book/3.4/input-keyboard.html.en (see table 10.8)


Interesting side-effect:
my fancy gnome-screenshot.debugged isn't called anymore -- and the metacity/keybinding_commands list is depopulated in addition to the gnome system settings/keyboard/shortcuts/Custom. Gnome shell 3.4 seems to mark the point where gconf-editor is deprecated. See the gnome-screenshot compilation post for more info.


At any rate, the keyboard shortcuts related to Screenshots now contain five different key combinations. Seriously -- they 'simplify' gnome-screenshot, then they want users to learn four different key combinations in addition to vanilla prtscr? And none of them does what I really need -- i.e. a quick and simple way to save a screenshot with the name I want in the location I want.


Links to this post:
https://www.linuxquestions.org/questions/debian-26/how-to-add-panel-in-gnome-debian-wheezy-4175463451/

185. Troubleshooting: ECCE


10 June 2012

184. Fixing Gnome screenshot (3.4.1) in Debian Wheezy by patching and compiling

Approach
Putting a hold on gnome-screenshot forever will likely prevent gnome from upgrading properly since I'd suspect it's a required dependency.

Clarification: this fix restores the original behaviour. gnome-screenshot --interactive is NOT an acceptable solution. This guide restores gnome-screenshot to its good old functional state.

So, time to build our own gnome-screenshot -- but one which actually works in a reasonable way. The gnome-screenshot cockup is just another sign that something is clearly amiss with the way gnome is being developed. And this, if true, is another truly idiotic 'feature' -- turn gnome into windows? Most of us left for a reason...

Anyway, linux is still sane though -- if we don't like something we're not entirely up a creek, which will buy us a bit more time while we're getting ready to move to xmonad -- or for debian to start making downstream changes to gnome.

We have two options:
Either look here: http://git.gnome.org/browse/gnome-screenshot/commit/?id=3bbc1e158fd58ec7f4f984f6d3c15ec95e65a035&ignorews=1 and try to come up with your own way of reverting the crippling.

Or use the ubuntu patches as a guide: http://packages.ubuntu.com/precise/gnome-screenshot

Normally you shouldn't mix ubuntu and debian packages, and we won't: we'll be compiling our own package, but using the work done by the ubuntu maintainers.

In particular, look at this: http://archive.ubuntu.com/ubuntu/pool/main/g/gnome-screenshot/gnome-screenshot_3.4.1-0ubuntu1.debian.tar.gz

Look in the debian/patches directory and you'll find the ubuntu_interactive_screenshots.patch

Building:
sudo apt-get install libgtk-3-dev libcanberra-gtk3-dev intltool
wget http://ftp.de.debian.org/debian/pool/main/g/gnome-screenshot/gnome-screenshot_3.4.1.orig.tar.xz
tar xvf gnome-screenshot_3.4.1.orig.tar.xz
cd gnome-screenshot-3.4.1/

You can wget http://archive.ubuntu.com/ubuntu/pool/main/g/gnome-screenshot/gnome-screenshot_3.4.1-0ubuntu1.debian.tar.gz and untar it to look at debian/patches/ubuntu_interactive_screenshots.patch, which is what the changes below are based on:

In the ubuntu patch there's a test to see whether unity is used. We'll do it a bit more crudely -- we'll just make sure the condition is always true by testing for 0<1.


Edit src/screenshot-application.c and change the if condition on line 134 in the excerpt below

130 static void
131 save_pixbuf_handle_error (ScreenshotApplication *self,
132                           GError *error)
133 {
134   if (screenshot_config->interactive)
135     {
136       ScreenshotDialog *dialog = self->priv->dialog;
137       GtkWidget *toplevel = screenshot_dialog_get_toplevel (dialog);
138 
139       screenshot_dialog_set_busy (dialog, FALSE);
to

134   if (0 < 1)

Also, change

348   screenshot_play_sound_effect ("screen-capture", _("Screenshot taken"));
349 
350   if (screenshot_config->interactive)
351     {
352       self->priv->dialog = screenshot_dialog_new (self->priv->screenshot, self->priv->save_uri);
353       toplevel = screenshot_dialog_get_toplevel (self->priv->dialog);
354       gtk_widget_show (toplevel);
to


350   if (0 < 1)

Time to build!
./configure --prefix=${HOME}/.gsc --program-suffix=.debugged
make
make install

Note: the install prefix here works fine for a single-user desktop. If you want everyone to be able to use our shiny new gnome-screenshot, put everything in /usr/bin instead.

We now have a working gnome screenshot in ~/.gsc that behaves as intended.
tree -L 2 -d
.
|-- bin
`-- share
    |-- applications
    |-- GConf
    |-- glib-2.0
    |-- gnome-screenshot
    |-- locale
    `-- man


 However, we need to make sure our fixed gnome-screenshot gets invoked.

In Gnome Shell 3.2.X
sudo apt-get install gconf-editor
Start gconf-editor
go to /apps/metacity/keybinding_commands/command_screenshot
change to e.g.  /home/verahill/.gsc/bin/gnome-screenshot.debugged
Also, change command_window_screenshot to
/home/verahill/.gsc/bin/gnome-screenshot.debugged --window

Note: defining Print/Alt+print keyboard shortcuts the 'gnome-shell' way (i.e. via system-settings) doesn't seem to work in gnome 3.2. Conversely, doing it the gconf-editor way in gnome 3.4 doesn't work.


In Gnome Shell 3.4.X
Go to System Settings, Keyboard, Shortcuts
Disable the automatically defined shortcuts for gnome-screenshot

And add your own under custom shortcuts:




Done! 
Unless you want to add to PATH in which case you can put this in your ~/.bashrc:
export PATH=$PATH:${HOME}/.gsc/bin




Note: If it's still not working, try to launch from the terminal. If you get
(gnome-screenshot.debugged:7493): GLib-GIO-ERROR **: Settings schema 'org.gnome.gnome-screenshot' does not contain a key named 'auto-save-directory'
Trace/breakpoint trap
it's because you still have the old (good) gnome-screenshot package, whose settings schema lacks that key. Un-hold and reinstall it:
sudo su
echo "gnome-screenshot install"|dpkg --set-selections
exit
sudo apt-get install gnome-screenshot

Now try
gsettings get org.gnome.gnome-screenshot auto-save-directory
which should be empty.

gsettings set org.gnome.gnome-screenshot auto-save-directory '/home/verahill/Pictures'

Finally, make sure to re-set your keybindings.



Links to this post:
http://qfox.nl/notes/153

08 June 2012

183. Compiling OpenMM 4.1 on debian testing

OpenMM 4.0 is still somewhat of a traumatic memory. However, having gotten a question about the compilation of v4.1 I can't really resist giving the new version a go.

Having said that, I never ended up using the GPU-enabled gromacs for which I built openmm, so it was all an enormous waste of time. For those of you thinking about GPU/Gromacs, know this:
* not all graphics cards are supported or worth supporting
* there's no speed-up for explicit solvent molecules, and what else would you use gromacs or MD for?
* consumer-grade graphics cards get very hot

I make no attempt at ferreting out what packages are needed other than what I'm explicitly prompted for. Look at http://verahill.blogspot.com.au/2012/01/debian-testing-64-wheezy_20.html for an indication of what you might need.

Also, I already have openmm 4.0 installed, so e.g. paths and other things defined in the post above are still active.


Start here
Register with simtk.org and download the source file.
sudo apt-get install cmake-curses-gui libgccxml-dev gccxml nvidia-cuda-toolkit
unzip -x OpenMM4.1-Source.zip
mkdir openmm_build
cd openmm_build/
ccmake -i ../OpenMM4.1-Source/

It'll say Empty Cache. Hit c which will populate the list.

I think we can ignore the EMU libs since they do device emulation. I never figured out what the CUT program was and it's not mentioned in the manual from what I can see.


These are the settings I chose -- I had problems before setting the OPENCL entries (OPENMM_BUILD_OPENCL_LIB, OPENMM_BUILD_OPENCL_TESTS and OPENMM_BUILD_RPMD_OPENCL_LIB below) to OFF.

BUILD_TESTING:BOOL=ON
CMAKE_BUILD_TYPE:STRING=Release
CMAKE_INSTALL_PREFIX:PATH=/home/verahill/.openmm
CUDA_BUILD_TYPE:STRING=Device
CUDA_INSTALL_PREFIX:PATH=/usr/bin
CUDA_NVCC:FILEPATH=/usr/bin/nvcc
DL_LIBRARY:FILEPATH=/usr/lib/x86_64-linux-gnu/libdl.so
FOUND_CUBLAS:FILEPATH=/usr/lib/x86_64-linux-gnu/libcublas.so
FOUND_CUBLASEMU:FILEPATH=FOUND_CUBLASEMU-NOTFOUND
FOUND_CUFFT:FILEPATH=/usr/lib/x86_64-linux-gnu/libcufft.so
FOUND_CUFFTEMU:FILEPATH=FOUND_CUFFTEMU-NOTFOUND
FOUND_CUT:FILEPATH=FOUND_CUT-NOTFOUND
FOUND_CUT_INCLUDE:PATH=FOUND_CUT_INCLUDE-NOTFOUND

GCCXML_EXTRA_ARGS:STRING=
GCCXML_PATH:FILEPATH=/usr/bin/gccxml
OPENMM_BUILD_AMOEBA_CUDA_LIB:BOOL=ON
OPENMM_BUILD_AMOEBA_PLUGIN:BOOL=ON
OPENMM_BUILD_CUDA_LIB:BOOL=ON
OPENMM_BUILD_CUDA_TESTS:BOOL=TRUE
OPENMM_BUILD_C_AND_FORTRAN_WRAPPERS:BOOL=ON
OPENMM_BUILD_FREE_ENERGY_CUDA_LIB:BOOL=ON
OPENMM_BUILD_FREE_ENERGY_PLUGIN:BOOL=ON
OPENMM_BUILD_OPENCL_LIB:BOOL=OFF
OPENMM_BUILD_OPENCL_TESTS:BOOL=OFF
OPENMM_BUILD_PYTHON_WRAPPERS:BOOL=ON
OPENMM_BUILD_RPMD_OPENCL_LIB:BOOL=OFF
OPENMM_BUILD_RPMD_PLUGIN:BOOL=ON
OPENMM_BUILD_SERIALIZATION_SUPPORT:BOOL=ON
OPENMM_BUILD_STATIC_LIB:BOOL=ON
OPENMM_GENERATE_API_DOCS:BOOL=OFF
OPENMM_SVN_REVISION:STRING=exported
PYTHON_EXECUTABLE:FILEPATH=/usr/bin/python
SVNVERSION_PROGRAM:FILEPATH=/usr/bin/svnversion
SWIG_EXECUTABLE:FILEPATH=/usr/bin/swig
SWIG_VERSION:STRING=2.0.7

Make your changes and hit c again, then hit g which brings you back to the terminal.



make -d|tee make.log
make test


If all goes well you'll see
126/126 Test #126: TestParser ......................................   Passed    0.02 sec
100% tests passed, 0 tests failed out of 126
Total Test time (real) = 345.83 sec
make install


[..]
-- Installing: /home/verahill/.openmm/examples/Makefile
-- Installing: /home/verahill/.openmm/examples/NMakefile
-- Installing: /home/verahill/.openmm/examples/MakefileNotes.txt
-- Installing: /home/verahill/.openmm/examples/Empty.cpp

And you are done!

tree ~/.openmm/ -L 4 -d
.openmm/
|-- bin
|-- docs
|   |-- api-c++
|   `-- api-python
|-- examples
|   `-- VisualStudio
|-- include
|   `-- openmm
|       |-- internal
|       `-- serialization
|-- lib
|   `-- plugins
`-- licenses


182. Oracle Java JDK (java, javac and javaws) in debian testing/wheezy

With ECCE I was having problems getting matching versions of java and javac on a computer where I was using sun java 6.0.

Since I'm using SGE I (think I) need the closed source SUN java version.
Download (and click on the license agreement) here:
http://www.oracle.com/technetwork/java/javase/downloads/jdk-6u32-downloads-1594644.html


(v7u4 is available here:
http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u4-downloads-1591156.html
)


Then follow this: http://verahill.blogspot.com.au/2012/04/installing-sunoracle-java-in-debian.html
sudo apt-get install java-package
make-jpkg jdk-6u32-linux-x64.bin

and follow the instructions. Once the package is built, install:
sudo dpkg -i oracle-j2sdk1.6_1.6.0+update32_amd64.deb

Unpacking oracle-j2sdk1.6 (from oracle-j2sdk1.6_1.6.0+update32_amd64.deb) ...
Setting up oracle-j2sdk1.6 (1.6.0+update32) ...
update-alternatives: using /usr/lib/jvm/j2sdk1.6-oracle/jre/bin/ControlPanel to provide /usr/bin/ControlPanel (ControlPanel) in auto mode.
update-alternatives: using /usr/lib/jvm/j2sdk1.6-oracle/jre/lib/amd64/libnpjp2.so to provide /usr/lib/iceweasel/plugins/libjavaplugin.so (iceweasel-javaplugin.so) in auto mode.
update-alternatives: using /usr/lib/jvm/j2sdk1.6-oracle/jre/lib/amd64/libnpjp2.so to provide /usr/lib/chromium/plugins/libjavaplugin.so (chromium-javaplugin.so) in auto mode.
sudo update-alternatives --config java
There are 6 choices for the alternative java (providing /usr/bin/java).
  Selection    Path                                            Priority   Status
------------------------------------------------------------
  0            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java   1061      auto mode
  1            /usr/bin/gij-4.4                                 1044      manual mode
  2            /usr/bin/gij-4.6                                 1046      manual mode
* 3            /usr/lib/jvm/j2re1.6-oracle/bin/java             314       manual mode
  4            /usr/lib/jvm/j2sdk1.6-oracle/jre/bin/java        315       manual mode
  5            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java   1061      manual mode
  6            /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java   1051      manual mode
sudo update-alternatives --config javac
There are 2 choices for the alternative javac (providing /usr/bin/javac).
  Selection    Path                                         Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-7-openjdk-amd64/bin/javac   1051      auto mode
  1            /usr/lib/jvm/j2sdk1.6-oracle/bin/javac        315       manual mode
  2            /usr/lib/jvm/java-7-openjdk-amd64/bin/javac   1051      manual mode
sudo update-alternatives --config javaws
There are 3 choices for the alternative javaws (providing /usr/bin/javaws).

  Selection    Path                                              Priority   Status
------------------------------------------------------------
  0            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/javaws   1061      auto mode
* 1            /usr/lib/jvm/j2re1.6-oracle/bin/javaws             314       manual mode
  2            /usr/lib/jvm/j2sdk1.6-oracle/jre/bin/javaws        315       manual mode
  3            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/javaws   1061      manual mode




While I was making the package this little guy popped up. Don't fret. I think it was meant to take me to the java.com website or something similar. I don't like the sight of that /root/ thingy though -- what's oracle thinking of us punters?





07 June 2012

181. Compiling openmpi on debian wheezy

There's nothing complicated about this compilation. It's not a terribly quick build though, and I'm not yet sure exactly what packages are necessary.

sudo apt-get install build-essential gfortran
wget http://www.open-mpi.org/software/ompi/v1.6/downloads/openmpi-1.6.tar.bz2
tar xvf openmpi-1.6.tar.bz2
cd openmpi-1.6/

sudo mkdir /opt/openmpi/
sudo chown ${USER} /opt/openmpi/
./configure --prefix=/opt/openmpi/1.6/ --with-sge

make
make install

And you're done.

tree -L 2 -d /opt/openmpi
.
└── 1.6
    ├── bin
    ├── etc
    ├── include
    │   ├── openmpi
    │   └── vampirtrace
    ├── lib
    │   ├── openmpi
    │   └── pkgconfig
    └── share
        ├── man
        ├── openmpi
        └── vampirtrace

Linking to the libs is done as before, although the path to e.g. libmpi.so is /opt/openmpi/1.6/lib/ and not /opt/openmpi/1.6/lib/openmpi/ like in the regular debian package.


You might also want to update the /etc/alternatives/libmpi.so symlink.
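E.g. something along these lines (a sketch -- the symlink line is the blunt approach; adjust paths to taste):

export PATH=/opt/openmpi/1.6/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/1.6/lib:$LD_LIBRARY_PATH
sudo ln -sf /opt/openmpi/1.6/lib/libmpi.so /etc/alternatives/libmpi.so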

This is definitely one of those packages where it's worth doing ./configure --help to see what options are available.

Also, I imagine that on ROCKS there may well be a few packages which will have to be compiled first and specified using --with-<> switches.

A sample:

  --with-blcr(=DIR)       Path to BLCR Installation
  --with-blcr-libdir=DIR  Search for BLCR libraries in DIR
  --with-hwloc(=DIR)      Build hwloc support. DIR can take one of three
  --with-hwloc-libdir=DIR Search for hwloc libraries in DIR. Should only be
  --with-valgrind(=DIR)   Directory where the valgrind software is installed
  --with-memory-manager=TYPE
  --with-libpicl(=DIR)    Build libpicl support, optionally adding
  --with-libpicl-libdir=DIR
  --with-timer=TYPE       Build high resolution timer component TYPE
  --with-portals=DIR      Specify the installation directory of PORTALS
  --with-portals-libs=LIBS
                          Libraries to link with for portals
  --with-alps             Build ALPS scheduler component (default: no)
  --with-lsf(=DIR)        Build LSF support
  --with-lsf-libdir=DIR   Search for LSF libraries in DIR
  --with-pmi              Build PMI support (default: no)
  --with-cray-pmi-ext     Include Cray PMI2 extensions (default: no)
  --with-slurm            Build SLURM scheduler component (default: yes)
  --with-tm(=DIR)         Build TM (Torque, PBSPro, and compatible) support,
  --with-ftb(=DIR)        Build FTB (Fault Tolerance Backplane) support,
  --with-ftb-libdir=DIR   Search for FTB (Fault Tolerance Backplane) libraries
  --with-esmtp(=DIR)      Build esmtp support, optionally adding DIR/include,
  --with-esmtp-libdir=DIR Search for the esmtp libraries in DIR
  --with-sge              Build SGE or Grid Engine support (default: no)
  --with-loadleveler      Build LoadLeveler scheduler component (default: yes)
  --with-elan(=DIR)       Build Elan (QsNet2) support, searching for libraries
  --with-elan-libdir=DIR  Search for Elan (QsNet2) libraries in DIR
  --with-mx(=DIR)         Build MX (Myrinet Express) support, optionally
  --with-mx-libdir=DIR    Search for MX (Myrinet Express) libraries in DIR
  --with-openib(=DIR)     Build OpenFabrics support, optionally adding
  --with-openib-libdir=DIR
  --with-portals(=DIR)    Build Portals support, optionally adding
  --with-portals-config   configuration to use for Portals support. One of
  --with-portals-libs=LIBS
                          Libraries to link with for portals
  --with-sctp(=DIR)       Build SCTP support, searching for libraries in DIR
  --with-sctp-libdir=DIR  Search for SCTP libraries in DIR
  --with-knem(=DIR)       Build knem Linux kernel module support, searching
  --with-udapl(=DIR)      Build uDAPL support, optionally adding DIR/include,
  --with-udapl-libdir=DIR Search for uDAPL libraries in DIR
  --with-fca(=DIR)        Build fca (Mellanox Fabric Collective Accelerator)
  --with-io-romio-flags=FLAGS
  --with-mxm(=DIR)        Build Mellanox Messaging support
  --with-mxm-libdir=DIR   Search for Mellanox Messaging libraries in DIR
  --with-psm(=DIR)        Build PSM (Qlogic InfiniPath) support, optionally
  --with-psm-libdir=DIR   Search for PSM (QLogic InfiniPath PSM) libraries in
  --with-contrib-vt-flags=FLAGS
  --with-event-rtsig      compile with support for real time signals
  --with-pic[=PKGS]       try to use only PIC/non-PIC objects [default=use
  --with-gnu-ld           assume the C compiler uses GNU ld [default=no]
  --with-sysroot=DIR Search for dependent libraries within DIR

180. Temporary fix for supertuxkart

I don't often play games, but I noticed that supertuxkart had been updated in debian wheezy and having a little bit of free time I figured I'd give it a whirl.

 supertuxkart 
supertuxkart: error while loading shared libraries: libIrrlicht.so.1.7a.3: cannot open shared object file: No such file or directory

Make sure that libirrlicht1.7a is installed.
sudo apt-get install libirrlicht1.7a

Then
cd /usr/lib
sudo ln -s libIrrlicht.so.1.7a.2 libIrrlicht.so.1.7a.3

It's obviously not a permanent fix, but I haven't had any problems playing.

179. Building ECCE on Debian Testing/Wheezy

UPDATE: Build went fine. Upgrade went fine. But the organizer doesn't show my jobs properly i.e. the files are there but they aren't recognised as jobs. I haven't had a solid look at this yet, and it might just be because I need to restart more services than just the http server. It's been a long day...

UPDATE 2: An update on a different computer went without a hitch, with all the old job files being imported properly.

UPDATE 3: I started ECCE and let the data manager chew on it for four hours. No luck on the troublesome computer. The only difference is the java -- openjdk 7 worked fine, oracle jre/jdk didn't. Dunno if this is the reason, but currently in the process of installing the binaries to see whether that works better. Updates will come...

UPDATE 4: Installing the prebuilt ECCE binaries did the trick. In summary: as far as I know you MUST use openjdk 7. SUN/Oracle Java does not appear to work. It's exhibited in a lack of ability to recognise old jobs as being...jobs rather than just folders and files.


POST BEGINS HERE:
I'm trying to document everything I'm doing these days, no matter how simple or (at least in retrospect) obvious it is.

Here's how to build the 4th of June 2012 version of ecce v6.3.

You need to be registered with EMSL/PNNL to download ecce. There are plans to open-source the software properly (i.e. no need to register) sometime this coming northern summer. But for now you need to be an academic group leader to have access.

(I originally posted a somewhat different post where I recommended making some changes to the build scripts re ECCE_HOME. Eventually I saw the light and realised the error of my ways)

Download ecce-v6.3.src.tar.bz2, and put it in a suitable folder, e.g. ~/tmp
cd ~/tmp
tar -xvf ecce-v6.3-src.tar.bz2
cd ecce-v6.3/
export ECCE_HOME=`pwd`
cd build/
./build_ecce

The first time you run ./build_ecce you'll be asked a series of questions relating to installed packages. If it's all good, answer
Do you want to skip these checks for future build_ecce invocations (y/n)? y
If anything came up, then read the message carefully and install the missing package.

NOTE: on one box I noticed different versions of java and javac being found, as I had both openjdk 6 and 7 installed. I couldn't set javac to 6 but I could do
sudo update-alternatives --config java
and set it to openjdk 7.

[From my small, statistically unsound sample set Oracle/SUN java will NOT work.]

Then do ./build_ecce again. And again. And again. In all, I think you do it six or seven times - each time a new package is built.
I always get a
lib: No such file or directory.
at the end of the httpd build. Not sure why, but everything seems to be ok in spite of that.
Anyway, you know that you're done running ./build_ecce when you get
ECCE built and distribution created in /home/verahill/tmp/ecce-v6.3
At this point, you are ready to install
DO NOT USE ./install_ecce

GO UP ONE LEVEL AND DO
./install_ecce.v6.3.csh

But that's a different story. install_ecce will give you weird error messages about missing tar files. install_ecce.v6.3.csh on the other hand will work fine.

178. Gridengine queues on heterogeneous systems

I don't want to risk three-slot jobs being submitted to quad-core nodes, so I figured I'd try setting up different queues based on the job's parameters.

Some reading:
http://wiki.gridengine.info/wiki/index.php/StephansBlog
https://www.clumeq.ca/wiki/index.php/Using_SGE#Queues_List
http://ait.web.psi.ch/services/linux/hpc/merlin3/sge/admin/sge_queues.html

qconf -ahgrp @quads
group_name @quads
hostlist tantalum
qconf -aq four.q
qname                 four.q
hostlist              @quads
seq_no             1
slots                 4,[tantalum=4]
pe_list              make mpi4
qconf -ahgrp @thrice
group_name @thrice
hostlist boron beryllium
qconf -aq three.q
qname                 three.q
hostlist              @thrice
seq_no                1
pe_list               make mpi3
slots                 3,[boron=3],[beryllium=3]
Finally, to avoid jobs being submitted to main.q (without having to delete it), we change the seq_no to 9 for that particular q.
Also, we'll change the pe_list on main.q to remove mpi3 and mpi4 -- that way main.q is only used if I request only one core.
pe_list       make mpi1
And now jobs get sent to the right queue (and node) depending on the number of cores I request.
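A quick way of checking: a job asking for four slots can now only be satisfied by four.q, so it should end up on tantalum. E.g. a minimal test.qsub:

#$ -S /bin/csh
#$ -cwd
#$ -pe mpi4 4
hostname

submitted with qsub test.qsub as usual.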

06 June 2012

177. Jerry-rigging g09 UV/VIS spectra in gnuplot and/or octave

EDIT: I had a nicer post with lots of figures before. Because I realised that the data is good enough to be included in a future paper we're working on, I had to take everything down again. All the data in the plots now is made up (hence 'fakeuv.dat'), and I haven't made the plots look nice.

I don't like proprietary formats for anything. They never, ever benefit anyone other than the software vendor.

Almost as bad as using binary proprietary formats is not providing export facilities to ascii formats.

I may have missed it, but I was using gaussview to look at td-dft calculated uv/vis spectra -- and couldn't find a way of exporting the data. Sure, I could export the graph as a png, svg etc. file. But not as a double-column tab-separated ascii file.

There's a bit of fudging in what I'm doing -- I'll be the first one to admit that.

So here's single line to export the wavelengths and intensities:
cat g03.g03out|grep Excited|grep -v singles|sed 's/=/\t/g'|gawk '{print $7,$10}'>uvvis.dat

You can plot them in gnuplot using
plot 'uvvis.dat' u 1:2 w impulse

The problem is that these are just spikes -- not the smooth uv/vis-like spectra we're used to. On the other hand, if I understand things correctly, this is the REAL data, while the smoothed uv/vis spectrum is more for presentation purposes. I might obviously be wrong, and I am by no stretch a computational or theoretical chemist - I just like their tools.

We've got an immensely powerful tool at our hands: Octave!
data=load('fakeuv.dat');
gauss= @(x,c,r,s) r.*1./(s.*sqrt(2*pi)).*exp(-0.5*((x-c)./s).^2)
x=linspace(250,850,600);
plot(x,cumsum(gauss(x,data(:,1),data(:,2),20)))

where 20 is an arbitrary value. Anyway, this is how it looks:
We can try s=30 instead:

We export it
outdata=cumsum(gauss(x,data(:,1),data(:,2),30));
exportdata=[x' outdata'];
save 'uvvis2.sim' exportdata
and plot it in gnuplot
plot 'uvvis2.sim' u 1:48 w lines
It might not look like the UV/VIS spectrum you're used to, but as I said in the beginning, the data's all made up -- using 'real' calculated data I got a beautiful spectrum.
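As an aside, the u 1:48 above picks out one of the cumulative-sum columns; the last column of the saved file is the sum over all transitions. If only the final summed spectrum is wanted, sum() along the first dimension avoids the column picking -- a sketch, with gauss, data and x as defined above:

outdata=sum(gauss(x,data(:,1),data(:,2),30),1);
exportdata=[x' outdata'];
save 'uvvis3.sim' exportdata

which can then be plotted with u 1:2 in gnuplot.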

176. Weaning people onto SGE, one script at a time

On a five node (1 front + 4 exec; each node has 8 cores and 8 GB RAM) cluster that I know and hang out with, people have been submitting jobs one by one. As in, doing it manually, without a queue manager.

I got one of the users to start using my Very Simple Python Queue Manager to prevent too much idle time, but not everyone is using it yet.

Another downside when people aren't using queue managers is that they use top and kill to manage jobs, and that has a way of screwing things up for everyone. SGE is a much better solution in every possible sense.

To make it easier for the users to switch to using qsub, i.e. to make the change as undisruptive as possible, I wrote a little bash function and set up some standard qsub files.

The user navigates to the directory where their .in file is (e.g. test.in) and runs
presub test
which opens test.in and creates test.qsub

The user then submits by doing
qsub test.qsub


It's easy enough to customize the function and the output files (e.g. using .com, .g03in etc.). This script obviously only does g09, but I'll post a more general script later.




The .bashrc function:
presub () {

    paste -s -d "\n" ~/.qsub/qsub.head $1.in ~/.qsub/qsub.tail > $1.qsub
    return 0
}



The files:
I put the following files in ~/.qsub/

qsub.head:

#$ -S /bin/sh
#$ -cwd
#$ -l h_rt=99:30:00
#$ -l h_vmem=8G
#$ -j y
#$ -pe orte 8



export GAUSS_SCRDIR=/tmp
export GAUSS_EXEDIR=/share/apps/gaussian/g09/bsd:/share/apps/gaussian/g09/local:/share/apps/gaussian/g09/extras:/share/apps/gaussian/g09
/share/apps/gaussian/g09/g09 << END >> g09.log

qsub.tail:





END

The empty lines above are on purpose since gaussian can be annoying in that sense.


175. Track Changes in Libreoffice

Since I collaborate I occasionally need to whip out libreoffice. I can never find the track changes function, so I'll make this a brief post:

In Libre Office, go to Edit, Changes, and tick Record.

Other than that it works exactly like the Track Changes function we've come to hate/love in Word.

05 June 2012

174. Setting up Sun Grid Engine with three nodes on Debian

Firstly, I must acknowledge this guide: http://helms-deep.cable.nu/~rwh/blog/?p=159

I FOLLOW THAT POST ALMOST VERBATIM

This post will be more of a "I followed this guide and it actually works on debian testing/wheezy too and here's how" post, since it doesn't add anything significant to the post above, other than detail.

Since I ran into problems over and over again, I'm posting as much as I can here. Hopefully you can ignore most of the post for this reason.

Some reading before you start:
Having toyed with this for a while I've noticed one important factor in getting this to work:
the hostnames you use when you configure SGE MUST match those returned by hostname. It doesn't matter what you've defined in your /etc/host file. This can obviously cause a little bit of trouble when you've got multiple subnets set up (my computers communicate via a 10/100 net for WAN and a 10/100/1000 net for computations). My front node is called beryllium (i.e. this is what is returned when hostname is executed) but it's known as corella on the gigabit LAN. Same goes for one of my sub nodes: it's called borax on the giganet and boron on the slow LAN. hostname here returns boron. I should obviously go back and redo this for the gigabit subnet later -- I'm just posting  what worked.

While setting it up on the front node takes a little while, the good news is that very little work needs to be done on each node. This would become important when you are working with a large number of nodes -- with the power of xargs and a name list, setting them up on the front node should be a breeze.

My front node is beryllium, and one of my subnodes is boron. I've got key-based, password-less ssh login set up.

Set up your front node before you touch your subnodes. Add all the node names to your front node before even installing gridengine-exec on the subnodes.

I've spent a day struggling with this. The order of events listed here is the first thing that worked. You make modifications at your own peril (and frustration). I tried openjdk with little luck, hence the sun java.

NFS
Finally, I've got nfs set up to share a folder from the front node (~/jobs) to all my subnodes. See here for instructions on how to set it up: http://verahill.blogspot.com.au/2012/02/debian-testing-wheezy-64-sharing-folder.html

When you use ecce, you can and SHOULD use local scratch folders i.e. use your nfs shared folder as the runtime folder, but set scratch to e.g. /tmp which isn't an nfs exported folder.


Before you start, stop and purge
If you've tried installing and configuring gridengine in the past, there may be processes and files which will interfere. On each computer do
ps aux|grep sge
use sudo kill to kill any sge processes
Then
sudo apt-get purge gridengine-*


First install sun/oracle java on all nodes.

[UPDATE 24 Aug 2013: openjdk-6-jre or openjdk-7-jre work fine, so you can skip this]

There's no sun/oracle java in the debian testing repos anymore, so we'll follow this: http://verahill.blogspot.com.au/2012/04/installing-sunoracle-java-in-debian.html

sudo apt-get install java-package
Download the jre-6u31-linux-x64.bin from here: http://java.com/en/download/manual.jsp?locale=en
make-jpkg jre-6u31-linux-x64.bin
sudo dpkg -i oracle-j2re1.6_1.6.0+update31_amd64.deb 

Then select your shiny oracle java by doing:
sudo update-alternatives --config java
sudo update-alternatives --config javaws

Do that on every node, front and subnodes. You don't have to do all the steps though: you just built oracle-j2re1.6_1.6.0+update31_amd64.deb, so copy that to your nodes, do sudo dpkg -i oracle-j2re1.6_1.6.0+update31_amd64.deb and then do the sudo update-alternatives dance.



Front node:
sudo apt-get install gridengine-client gridengine-qmon gridengine-exec gridengine-master
(at the moment this installs v 6.2u5-7)

I used the following:
Configure automatically: yes
Cell name: rupert
Master hostname: beryllium
 => SGE_ROOT: /var/lib/gridengine
 => SGE_CELL: rupert
 => Spool directory: /var/spool/gridengine/spooldb
 => Initial manager user: sgeadmin

Once it was installed, I added myself as an sgeadmin:
sudo -u sgeadmin qconf -am ${USER}
sgeadmin@beryllium added "verahill" to manager list
and to the user list:
qconf -au ${USER} users
added "verahill" to access list "users"
We add beryllium as a submit host
qconf -as beryllium
beryllium added to submit host list
Create the group allhosts
qconf -ahgrp @allhosts
group_name @allhosts
hostlist NONE
I made no changes

Add beryllium to the hostlist
qconf -aattr hostgroup hostlist beryllium @allhosts
verahill@beryllium modified "@allhosts" in host group list
qconf -aq main.q
This opens another text file. I made no changes.
verahill@beryllium added "main.q" to cluster queue list
Add the host group to the queue:
qconf -aattr queue hostlist @allhosts main.q
verahill@beryllium modified "main.q" in cluster queue list
1 core on beryllium is added to SGE:


qconf -aattr queue slots "[beryllium=1]" main.q
verahill@beryllium modified "main.q" in cluster queue list
Add execution host
qconf -ae 
which opens a text file in vim

I edited hostname (boron) but nothing else. Saving returns
added host boron to exec host list
Add boron as a submit host
qconf -as boron
boron added to submit host list
Add 3 cores for boron:
qconf -aattr queue slots "[boron=3]" main.q

Add boron to the queue
qconf -aattr hostgroup hostlist boron @allhosts

Here's my history list in case you can't be bother reading everything in detail above.
 2015  sudo apt-get install gridengine-client gridengine-qmon gridengine-exec gridengine-master
 2016  sudo -u sgeadmin qconf -am ${USER}
 2017  qconf -help
 2018  qconf user_list
 2019  qconf -au ${USER} users
 2020  qconf -as beryllium
 2021  qconf -ahgrp @allhosts
 2022  qconf -aattr hostgroup hostlist beryllium @allhosts
 2023  qconf -aq main.q
 2024  qconf -aattr queue hostlist @allhosts main.q
 2025  qconf -aattr queue slots "[beryllium=1]" main.q
 2026  qconf -as boron
 2027  qconf -ae
 2028  qconf -aattr hostgroup hostlist beryllium @allhosts
 2029  qconf -aattr queue slots "[boron=3]" main.q
 2030  qconf -aattr hostgroup hostlist boron @allhosts

 Next, set up your subnodes:

My example here is a subnode called boron.

On the subnode:
sudo apt-get install gridengine-exec gridengine-client
Configure automatically: yes
Cell name: rupert
Master hostname: beryllium
This node is called boron.

Check whether sge_execd got started after the install
ps aux|grep sge
sgeadmin 25091  0.0  0.0  31712  1968 ?        Sl   13:54   0:00 /usr/lib/gridengine/sge_execd
If not, and only if not, do

/etc/init.d/gridengine-exec start

cat /tmp/execd_messages.*
If there's no message corresponding to the current iteration of sge (i.e. you may have old error messages from earlier attempts) then you're probably in a good place.

Back to the front node:
 qhost 
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      6  0.57    7.8G    3.9G   14.9G  597.7M
boron                   lx26-amd64      3  0.62    3.8G  255.6M   14.9G     0.0
If the exec node isn't recognised (i.e. it's listed but no cpu info or anything else) then you're in a dark place. Probably you'll find a message about "request for user soandso does not match credentials" in your /tmp/execd_messages.* files on the exec node. The only way I got that solved was stopping all sge processes everywhere, purging all gridengine-* packages on all nodes and starting from the beginning -- hence why I posted the history output above.

qstat -f

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.64     lx26-amd64  
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/3          0.72     lx26-amd64  


Time to see how far we've got:
Create a file called test.qsub on your front node:
#$ -S /bin/csh
#$ -cwd
tree -L 1 -d
hostname
qsub test.qsub 
Your job 2 ("test.qsub") has been submitted
qstat -u ${USER}
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
      2 0.00000 test.qsub  verahill         qw    06/05/2012 14:03:10                                    1        
ls
test.qsub  test.qsub.e2  test.qsub.o2
cat test.qsub.[oe]*
.
0 directories
beryllium
Tree could have had more exciting output I s'pose, but I didn't have any subfolders in my run directory.

So far, so good. We still need to set up parallel environments (e.g. orte, mpi).


Before that, we'll add another node, which is called tantalum and has a quadcore cpu.
On the front node:

qconf -as tantalum
qconf -ae
replace template with tantalum 
qconf -aattr queue slots "[tantalum=4]" main.q
qconf -aattr hostgroup hostlist tantalum @allhosts

 qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      6  0.67    7.8G    3.7G   14.9G  597.7M
boron                   lx26-amd64      3  0.14    3.8G  248.0M   14.9G     0.0
tantalum                -               -     -       -       -       -       -
On tantalum:
Install java by copying over the oracle-j2re1.6_1.6.0+update31_amd64.deb that was created the first time you set up java:
sudo dpkg -i  oracle-j2re1.6_1.6.0+update31_amd64.deb
sudo update-alternatives --config java
sudo update-alternatives --config javaws

Install gridengine:
sudo apt-get install gridengine-exec gridengine-client

On the front node:

 qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      6  0.62    7.8G    3.7G   14.9G  601.0M
boron                   lx26-amd64      3  0.15    3.8G  248.6M   14.9G     0.0
tantalum                lx26-amd64      4  4.02    7.7G  977.0M   14.9G   24.1M

 qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.71     lx26-amd64  
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/3          0.72     lx26-amd64  
---------------------------------------------------------------------------------
main.q@tantalum                BIP   0/0/4          4.01     lx26-amd64    

It's a beautiful thing when everything suddenly works. 


Parallel environments:
In order to use all the cores on each node we need to set up parallel environments.

qconf -ap orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     FALSE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
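
You can check that the environment was registered (these are the standard show flags):

qconf -spl      # list all parallel environments
qconf -sp orte  # show the orte definition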

To use a parallel environment, include #$ -pe orte 3 (for three slots) in your test.qsub:

#$ -S /bin/csh
#$ -cwd
#$ -pe orte 3
tree -L 1 -d
hostname

Submit it:
qsub test.qsub 

Your job 14 ("test.qsub") has been submitted
qstat 
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
     14 0.00000 test.qsub  verahill         qw    06/05/2012 15:43:25                                    3        


verahill@beryllium:~/mine/qsubtest$ cat test.qsub.*
.
0 directories
boron
It got executed on boron.



That's the basic setup done. To read more, use google; some additional info that might be helpful is here: http://wiki.gridengine.info/wiki/index.php/StephansBlog

We're going to set up a few more parallel environments:


qconf -ap mpi1

pe_name            mpi1
slots              9
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qconf -ap mpi2



pe_name            mpi2
slots              9
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    2
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qconf -ap mpi3


pe_name            mpi3
slots              9
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    3
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qconf -ap mpi4


pe_name            mpi4
slots              9
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    4
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
And we'll call these using the #$ -pe mpi$totalprocs $totalprocs directive below.
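
The only real difference between the four is the allocation_rule: $fill_up lets SGE stack slots on hosts however they fit, while a fixed integer forces exactly that many slots per host. So, as I understand it, a job like this should end up as two slots on each of two hosts:

#$ -S /bin/csh
#$ -cwd
#$ -pe mpi2 4
hostname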

We need to add them to a queue (which queue is irrelevant, as long as the environment and queue parameters are consistent) -- in our case main.q:
qconf -mq main.q
pe_list               make orte mpi1 mpi2 mpi3 mpi4
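
A quick way of verifying that main.q picked them up:

qconf -sq main.q | grep pe_list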

This obviously isn't the end of my travails -- now I need to get nwchem and gaussian happy.
I've got this in my CONFIG.Dynamic (inside joke) file:

NWChem: /opt/nwchem/nwchem-6.1/bin/LINUX64/nwchem
Gaussian-03: /opt/gaussian/g09/g09
perlPath: /usr/bin/perl
qmgrPath: /usr/bin/

SGE {
#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=$wallTime
#$ -l h_vmem=4G
#$ -j y
#$ -pe mpi$totalprocs $totalprocs
}

NWChemCommand {
setenv LD_LIBRARY_PATH "/usr/lib/openmpi/lib:/opt/openblas/lib"
setenv PATH "/bin:/usr/bin:/sbin:/usr/sbin"
mpirun -n $totalprocs /opt/nwchem/nwchem-6.1/bin/LINUX64/nwchem $infile > $outfile
}

Gaussian-03Command{
setenv GAUSS_SCRDIR /scratch
setenv GAUSS_EXEDIR /opt/gaussian/g09/bsd:/opt/gaussian/g09/local:/opt/gaussian/g09/extras:/opt/gaussian/g09
/opt/gaussian/g09/g09 $infile $outfile >g09.log
}
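
For a four-core nwchem job I'd expect the SGE and NWChemCommand blocks above to expand into a qsub script along these lines (my guess at the expansion -- $wallTime and $totalprocs get filled in per job, and the 24:00:00 wall time and test.nw/test.nwout file names are stand-ins):

#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=24:00:00
#$ -l h_vmem=4G
#$ -j y
#$ -pe mpi4 4
setenv LD_LIBRARY_PATH "/usr/lib/openmpi/lib:/opt/openblas/lib"
setenv PATH "/bin:/usr/bin:/sbin:/usr/sbin"
mpirun -n 4 /opt/nwchem/nwchem-6.1/bin/LINUX64/nwchem test.nw > test.nwout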

And now everything works!


See below for a few of the annoying errors I encountered during my adventures:


Error -- missing gridengine-client
The gaussian set-up worked fine. The nwchem setup worked on one node but not at all on another -- my problem sounded identical to that described here (two nodes, same binaries, still one works and one doesn't):
http://www.open-mpi.org/community/lists/users/2010/07/13503.php
It's the same as this one, too: http://www.digipedia.pl/usenet/thread/11269/867/

[boron:18333] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 161
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[boron:18333] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../orte/runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[boron:18333] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../orte/tools/orterun/orterun.c at line 543

It took a while to troubleshoot this one. As always, when you're troubleshooting you discover the odd thing or two. On my front node:
/usr/bin/rsh -> /etc/alternatives/rsh
which is normal, but
/etc/alternatives/rsh -> /usr/bin/krb5-rsh
There were some krb packages on tantalum, but none on boron:
boron:
locate rsh|grep "usr/bin"
/usr/bin/rsh

tantalum:
locate rsh|grep "usr/bin"
/usr/bin/glib-genmarshal
/usr/bin/qrsh
/usr/bin/rsh

sudo apt-get autoremove krb5-clients

Of course, that did not get it working...
The annoying thing is that nwchem/mpirun work perfectly together on boron, also when submitting jobs directly via ECCE. It's just with qsub that I am having trouble. The search continues:
On the troublesome node:
aptitude search mpi|grep ^i
i   libblacs-mpi-dev                - Basic Linear Algebra Comm. Subprograms - D
i A libblacs-mpi1                   - Basic Linear Algebra Comm. Subprograms - S
i A libexempi3                      - library to parse XMP metadata (Library)
i   libopenmpi-dev                  - high performance message passing library -
i A libopenmpi1.3                   - high performance message passing library -
i   libscalapack-mpi-dev            - Scalable Linear Algebra Package - Dev. fil
i A libscalapack-mpi1               - Scalable Linear Algebra Package - Shared l
i A mpi-default-bin                 - Standard MPI runtime programs (metapackage
i A mpi-default-dev                 - Standard MPI development files (metapackag
i   openmpi-bin                     - high performance message passing library -
i A openmpi-checkpoint              - high performance message passing library -
i A openmpi-common                  - high performance message passing library -

Library conflict?
sudo apt-get autoremove mpi-default-*

Then I recompiled nwchem. Still no change.

Finally I found the real problem:
gridengine-client was missing on the troublesome node. Once I had installed that, everything worked!
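
In other words, on the misbehaving node:

sudo apt-get install gridengine-client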



Errors:
If your parallel job won't start (it sits in state qw forever), and qstat -j jobid gives you
scheduling info: cannot run in PE "orte" because it only offers 0 slots
make sure that qstat -f lists all your nodes.

This is good:

 qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.71     lx26-amd64
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/3          0.72     lx26-amd64
---------------------------------------------------------------------------------
main.q@tantalum                BIP   0/0/4          4.01     lx26-amd64    
This is bad:
qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.64     lx26-amd64    



To fix it, do 
qconf -aattr hostgroup hostlist tantalum @allhosts

on the front node for all your node names (change tantalum to the correct name).
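
If you have several nodes to add, a small loop on the front node saves some typing (the node names here are just the ones from this post):

for node in boron tantalum; do qconf -aattr hostgroup hostlist $node @allhosts; done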

An unhelpful error message:
qstat -u verahill
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
 3 0.50000 test.qsub  verahill         Eqw   06/05/2012 11:45:18                                    1        
cat test.qsub.[eo]*

/builder/src-buildserver/Platform-7.0/src/linux/lwmsg/src/connection-wire.c:325: Should not be here

This came from a faulty qsub directive: I had used
#$ -S csh
instead of
#$ -S /bin/csh
i.e. you need to give the full path to the shell.

I think it's a common enough mistake to be worth posting here. See http://helms-deep.cable.nu/~rwh/blog/?p=159 for more errors.
