01 June 2012

170. Compiling PVM and XPVM on ROCKS 5.4.3

And we're back to ROCKS again.

NOTE: I haven't actually tested the binaries and libraries compiled here. They should work, but I haven't verified it.

PQS works with openmpi, mpich and PVM. Our vanilla ROCKS install already has openmpi and mpich. There's a package called rocks-pvm, but at 50 kb it didn't seem to actually install anything precompiled, so I removed it and decided to compile PVM myself instead.
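Removing it should just be a matter of something like the following (assuming the package name as it appears on this install, and that yum knows about it):

sudo yum remove rocks-pvm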

The paths here are specific to the cluster I did this on, so customise as needed.

sudo mkdir /share/apps/pvm
sudo chown ${USER} /share/apps/pvm
cd /share/apps/pvm
wget http://www.netlib.org/pvm3/pvm3.4.6.tgz
tar xvf pvm3.4.6.tgz
cd pvm3/
export PVM_ROOT=`pwd`
make
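If the build went through, the daemon and libraries should have ended up in the architecture-specific directory (LINUX64 on a 64-bit linux box):

ls $PVM_ROOT/lib/LINUX64/

which should list pvmd3 (the daemon), the pvm console binary and libpvm3.a, among others -- I'm going by the standard pvm3 layout here.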

Time to set up environment variables. Edit either /etc/profile or ~/.bashrc, depending on your powers and reach, and add
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/share/apps/pvm/pvm3/lib/LINUX64
export PATH=$PATH:/share/apps/pvm/pvm3/bin/LINUX64
export PVM_ROOT=/share/apps/pvm/pvm3

Changes won't take effect until you source the file or open a new terminal:
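source ~/.bashrc
echo $PVM_ROOT

The echo is just a quick sanity check that the variables took.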


I profess to be ignorant about how to actually use pvm, so no testing just yet.


So I also stumbled across xpvm, which sounds (and looks) neat.

cd /share/apps/pvm
wget http://www.netlib.org/pvm3/xpvm/XPVM.src.1.2.5.tgz
tar xvf XPVM.src.1.2.5.tgz
cd xpvm/

Time to do some housekeeping before compiling. XPVM requires:
1. PVM 3.3.0 or later.
2. TCL 7.3 or later.
3. TK 3.6.1 or later.

I find
/usr/share/tk8.4
/usr/share/tcl8.4
so tcl/tk 8.4 should satisfy the version requirements, and we just compiled pvm 3.4.6, so that's covered too.

First figure out where stuff is:

locate libtk|grep so
/usr/lib/libtk.so
/usr/lib/libtk8.4.so
/usr/lib64/libtk.so
/usr/lib64/libtk8.4.so
locate libtcl|grep so
/usr/lib/libtcl.so
/usr/lib/libtcl8.4.so
/usr/lib64/libtcl.so
/usr/lib64/libtcl8.4.so
/usr/lib64/tclx8.4/libtclx8.4.so
These are fairly standard locations, so they should already be searched by ld -- no need to specify them explicitly.
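If you want to double-check that the runtime linker already knows about them, you can query the ldconfig cache:

ldconfig -p | grep -E 'libtcl|libtk'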

The include files are potentially trickier, since they typically require the development packages to be installed.
locate tcl | grep "\.h"
[..]
/usr/include/tcl-private/generic/tcl.h
[..]
locate tk|grep "\.h"
[..]
/usr/include/tk-private/generic/tk.h
[..]
So we /should/ be fine.
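As a quick smoke test -- just an illustration, using the include paths we'll feed the Makefile below -- you can check that the headers preprocess cleanly:

echo '#include <tcl.h>' | gcc -x c -E -I/usr/include/tcl-private/generic - > /dev/null && echo tcl.h ok
echo '#include <tk.h>' | gcc -x c -E -I/usr/include/tk-private/generic -I/usr/include/tcl-private/generic - > /dev/null && echo tk.h ok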

We also need the X11 libs and headers:
locate libX11

/usr/lib/libX11.so
/usr/lib/libX11.so.6
/usr/lib/libX11.so.6.2.0
/usr/lib64/libX11.so
/usr/lib64/libX11.so.6
/usr/lib64/libX11.so.6.2.0
locate X11|grep include

[..]
/usr/include/X11
[..]
Finally,

locate libdl
/lib/libdl-2.5.so
/lib/libdl.so.2
/lib64/libdl-2.5.so
/lib64/libdl.so.2
/usr/lib/libdl.a
/usr/lib/libdl.so
/usr/lib64/libdl.a
/usr/lib64/libdl.so
I'll specify the lib locations even though in some of these particular cases it isn't necessary:


Edit xpvm/src/Makefile.aimk and set (line numbers added by me):
19  PVMVERSION = -DUSE_PVM_34
Comment out line 42:
 42 #TCLTKHOME  =  $(HOME)/TCL
and
 44 TCLTKHOME  =   /usr/include
Change

 47 TCLINCL     =   -I$(TCLTKHOME)/tcl-private/generic
 48 TKINCL      =   -I$(TCLTKHOME)/tk-private/generic
and

 57 TCLLIBDIR   =   -L/usr/lib64/tclx8.4
 58 TKLIBDIR    =   -L/usr/lib64
and
 70 TCLLIB      =   -ltcl8.4
 71 TKLIB       =   -ltk8.4
and
83 XINCL       = -I/usr/include/X11
84 XLIBDIR     = -L/usr/lib64
and finally,
 96 SHLIB       = -ldl



Fell asleep? Time to get compiling.
export XPVM_ROOT=/share/apps/pvm/xpvm

export TCL_LIBRARY=/usr/share/tcl8.4

export TK_LIBRARY=/usr/share/tk8.4

cd ${XPVM_ROOT}
make
[..]
Installing xpvm.tcl
Installing globs.tcl
Installing procs.tcl
Installing util.tcl
make[1]: Leaving directory `/share/apps/pvm/xpvm/src/LINUX64'

The beautiful thing is that the xpvm binary automagically ends up in the pvm3/bin/LINUX64 directory, so there's no need to fiddle with the PATH.
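Assuming the PATH export from earlier has been sourced, a quick sanity check:

which xpvm

should return /share/apps/pvm/pvm3/bin/LINUX64/xpvm.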



In theory everything should work now if you log in with ssh -XC. However, I get
xpvm
libpvm [pid2607] /tmp/pvmd.502: No such file or directory
libpvm [pid2607]: Can't Start PVM: Can't start pvmd
I'm not actually running -- nor have I ever run -- anything with pvm.

touch /tmp/pvmd.502
xpvm
libpvm [pid4219]: mksocs() read addr file: wrong length read
Connecting to PVMD already running... libpvm [pid4219]: mksocs() read addr file: wrong length read
libpvm [pid4219]: mksocs() read addr file: wrong length read
libpvm [pid4219]: mksocs() read addr file: wrong length read
libpvm [pid4219]: pvm_mytid(): Can't contact local daemon
libpvm [pid4219]: Error Joining PVM: Can't contact local daemon
The empty file I created with touch obviously isn't a valid pvmd address file, hence the 'wrong length read' errors. It looks like it should work once a real pvmd is running.
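A likely fix -- untested here, since I've never actually run pvm -- is to clear out the bogus file and start a real pvmd via the pvm console, which xpvm can then attach to:

rm /tmp/pvmd.502
pvm
pvm> quit
xpvm

Typing quit at the pvm console prompt exits the console but leaves the daemon running, which is what xpvm needs.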

169. ECCE (PNNL/EMSL) and documentation

The title makes it sound like this is going to be a very dry and boring post, but that's not the intention.

Any current or fledgling user of ECCE will probably agree with me when I say that ECCE is somewhat under-documented (there are practical and understandable reasons for this; besides, no-one likes writing documentation). Most of the documentation which does exist was written for earlier versions of ECCE, which used a different work flow. Other features are completely undocumented, or the information is difficult to find.

As it turns out, a lot of the technical documentation is to be found in the ecce-6.3/apps/siteconfig/ directory, as many of the settings files are self-documenting. It has also been pointed out to me that the release notes deserve extra attention because of the lack of a large, coherent body of documentation.

In particular look at:
site_runtime
Some minor settings you mostly won't have to bother with, though you may want to look at the OPENGL settings. Also, if you're troubleshooting you can uncomment the ECCE_RCOM_LOGMODE line:

# send remote communications output to the console for diagnosing problems
#ECCE_RCOM_LOGMODE true

This allows ECCE to echo every remote command it executes.
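Uncommented, the line should simply read:

ECCE_RCOM_LOGMODE true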

submit.site
A treasure trove of ideas and information. Anything you set here will be the default. Defaults can be overridden in the site-specific CONFIG.<> files; it's a judgement call whether to put a specific setting here or in the CONFIG.<> files. Personally, I always change to

NWChemCommand {
  mpirun -n $totalprocs  $nwchem $infile > $outfile
}
as a default setting here.


remote_shells.site
Editing this can allow you to adapt to restrictions in what kind of shells you can run on the remote site or node. The documentation here is a bit iffier.

CONFIG-Examples/ 
Examples of configurations, as the name would suggest. It could do with an overview of the different sites and situations in the README file though.


31 May 2012

168. Port redirection with ECCE/nwchem

Update 1/6/2012: There may be an alternative, better way of doing this: "Another feature that may or may not be useful to you with this special node that is setup with a higher ulimit for submitting your ECCE jobs is that ECCE has a "hop" feature that lets it go from a main login node on a machine to other nodes before actually running commands (e.g. submitting jobs). If you look at the $ECCE_HOME/siteconfig/CONFIG-Examples/CONFIG.mpp2 file, you'll see this "frontendMachine" directive that is what is used to do this. I'm thinking this might allow you to skip the port redirect options with ssh and just "hop" to your special node from regular login node on the compute host. But, I don't think I'd worry about it if what you have now is working fine."
What I describe below works fine, but it does require that users are able to set up the port redirect themselves. The solution hinted at above would be a better technical solution for a larger group of users with varying technical abilities since it should be enough to copy CONFIG.<> and <>.Q files.
Here's how to use main-subnode hopping: http://verahill.blogspot.com.au/2012/06/ecce-and-inaccessible-cluster-nodes.html

Original post:

If you for some reason can't connect to your computational cluster on port 22, here's how to sort it out.


Setting it up the first time

Adding a node:
I have access to an idle computer off-campus, which sits behind a trusty old linksys router. To access it directly I need to use port forwarding.

Edit  ecce-6.3/apps/siteconfig/remote_shells.site and add

ssh_p9999: ssh -p 9999|scp -P 9999
The lower-case p for ssh and capital P for scp are important. The two flags do the same thing, but the two programmes expect different cases.

In the ECCE gateway (the small floating window with icons that start when you start ECCE) select Machine Browser. Under Machine, select Register Machine:


The localhost bit is the key here -- since you'll be doing port redirect you'll formally connect to a port on localhost. Hit Add/Change.
In the main Machine Browser window, highlight oxygen and click on Setup Remote Access. Your ssh_p9999 thingy should show up there. Don't bother testing anything just yet (e.g. Machine status etc.). Since I'm writing this from memory I don't know whether you need to have the port-redirect active at this point or not. If you do, see below under Running.


It really is that simple. See below for how to actually use it.

Adding a remote site:
I work at an Australian university where I appear to be the only person using ECCE. Thus, while the process limit (ulimit) on the local SGE-managed cluster is a meagre 32 procs, this hasn't been a problem up till now. However, ECCE launches five procs per job, so by using up my proc allocation I've been locked out of the cluster on a regular basis.

As a solution, I've been offered my very own shiny submit node with a heftier 37k procs allowed. The downside is that it's only accessible from the standard submit node. Luckily, it's not much more difficult than doing port redirects for a remote node.


Edit  ecce-6.3/apps/siteconfig/remote_shells.site and add

ssh_p5454: ssh -p 5454|scp -P 5454
The lower-case p for ssh and capital P for scp are important.


In the terminal, run
ecce -admin

Most of the fields above should be fairly self-explanatory. A few things to watch out for:

  • ECCE actually looks at the proc-to-node ratio and will impose strict limitations on the number of cores you can use per node. A ratio of 50 procs to 20 nodes means that if you want to use four cores, ECCE forces the use of two nodes. Depending on how you run your jobs (configuration) this may or may not have any real impact. To be safe, pick something like 700 cores and 20 nodes.
  • Path means path. I think ECCE defaults to giving the perl path as /usr/bin/perl, but it should be /usr/bin. Same goes for the qsub path.
  • You need to create a queue. The queue name isn't used anywhere other than in ECCE, so it can be a smart way of setting up defaults. What I'm saying is: it's not that important to get it 'right' since it bears no relation to anything on your cluster.
Click on close.

In 'regular' ecce (i.e. started without -admin) go to the machine browser window, highlight the added site, hit Set up Remote Access, and pick ssh_p5454 as shown below. Don't bother testing anything just yet (e.g. Machine status etc.). Since I'm writing this from memory I don't know whether you need to have the port-redirect active at this point or not. If you do, see below under Running.



As always, setting up a site takes a bit of customisation. Here's my ecce-6.3/apps/siteconfig/CONFIG.gn54 on my ecce workstation.
NWChem: /opt/sw/nwchem-6.1/bin/nwchem
Gaussian-03: /usr/local/bin/G09
perlPath: /usr/bin
qmgrPath: /opt/n1ge62/bin/lx24-amd64
SGE {
#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=$wallTime
#$ -l h_vmem=4G
#$ -j y
}

NWChemEnvironment{
    LD_LIBRARY_PATH /usr/lib/openmpi/1.3.2-gcc/lib/
    PATH /opt/n1ge62/bin/lx24-amd64/
}
NWChemCommand {
#$ -pe mpi_smp$totalprocs  $totalprocs
module load nwchem/6.1
mpirun -n $totalprocs $nwchem $infile > $outfile
}
Gaussian-03Command {
#$ -pe g03_smp4 4
module load gaussian/g09
time G09 < $infile > $outfile
}
and here's my ecce-6.3/apps/siteconfig/gn54.Q  on my ecce workstation.

# Queue details for gn54
Queues:    nwchem squ8
nwchem|minProcessors:       1
nwchem|maxProcessors:       8
nwchem|runLimit:       4320
nwchem|memLimit:       4000
nwchem|scratchLimit:       0
squ8|minProcessors:       1
squ8|maxProcessors:       6
squ8|runLimit:       4320
squ8|memLimit:       4000
squ8|scratchLimit:       0 
Finally, you need to make sure that nwchem can find everything: put a file called .nwchemrc in your home folder on the remote node with the correct paths in it, e.g.

nwchem_basis_library /opt/sw/nwchem-6.1/data/libraries/
nwchem_nwpw_library /opt/sw/nwchem-6.1/data/libraryps/
ffield amber
amber_1 /opt/sw/nwchem-6.1/data/amber_s/
amber_2 /opt/sw/nwchem-6.1/data/amber_q/
amber_3 /opt/sw/nwchem-6.1/data/amber_x/
amber_4 /opt/sw/nwchem-6.1/data/amber_u/
spce /opt/sw/nwchem-6.1/data/solvents/spce.rst
charmm_s /opt/sw/nwchem-6.1/data/charmm_s/
charmm_x /opt/sw/nwchem-6.1/data/charmm_x/


That's it.

Running

Starting port redirect:
Before you can pipe anything through to your supersecure remote node/remote site, you need to open a terminal window and do
ssh -C username@remoteserver -L 9999:hiddenserver:22
for the remote node above, or
ssh -C username@remoteserver -L 5454:hiddenserver:22
for the remote site above.

Just to make the syntax clear:
Remote node:
My linksys router is at IP address 110.99.99.99, and the remote node behind it is at 192.168.1.106. My username is verahill.
ssh -C verahill@110.99.99.99 -L 9999:192.168.1.106:22

Remote site:
The standard submit node is called msgln4.university.au, and the hidden node is called gn54.university.au. My username is lindqvist
ssh -C lindqvist@msgln4.university.au -L 5454:gn54.university.au:22
You may need to install and use autossh if you keep on being booted off due to inactivity! The syntax is identical.
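For example, for the remote node above (using -M 0, which disables autossh's monitoring port and relies on ssh's own keep-alives instead -- hence the ServerAliveInterval option):

autossh -M 0 -o ServerAliveInterval=30 -C verahill@110.99.99.99 -L 9999:192.168.1.106:22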

In ECCE:
Everything in ECCE now works exactly like before - just select the target computer/site/node and go.





What doesn't work:
So far I haven't been able to sort out the whole 'open' remote terminal, which means that tail -f output doesn't work either. I'm leaving that for a rainy day with too much time on my hands.