02 June 2012

170. tcsh in ROCKS/CentOS with hardcoded csh.cshrc path

WHAT THIS POST DOES: It shows you how to compile your own tcsh which won't be looking at /etc/csh.cshrc. It doesn't show you how to set up the correct .cshrc files. But it certainly allows you to experiment.

Also, keep in mind that since each node's local hard drive has its own /bin directory (not exported), you need to make the same change on each node (i.e. change the /bin/csh symlink -- see below).

(The) csh (startup files) is(are) horribly broken on ROCKS 5.4.3.

For now I've solved it by just moving /etc/csh.cshrc out of the way, but what we do here is symlink /bin/csh to our own tcsh, which has been hardcoded to use a non-standard configuration file. That way you can use the standard ROCKS tcsh with /etc/csh.cshrc and your own csh (tcsh) with your own config files.

To be clear: it's not the csh binary which is borked on ROCKS 5.4.3, but the configuration files.

There's a patch for the broken csh -- but when I applied it to a test computer it only got broken-er and prevented the csh from opening and staying open. Good way of getting locked out. So I'm not keen on doing the same thing on someone else's production cluster. Also, I've opted for tcsh since the csh sources come with a bsd style makefile, and I really can't deal with that right now.

What we'll do is hardcode the location of the csh.cshrc file and change it from /etc/csh.cshrc to /share/apps/utils/custom.tcshrc

sudo mkdir /share/apps/utils
sudo chown ${USER} /share/apps/utils
cd /share/apps/utils
wget http://ftp.de.debian.org/debian/pool/main/t/tcsh/tcsh_6.18.01.orig.tar.gz
tar xvf tcsh_6.18.01.orig.tar.gz
cd tcsh-6.18.01/

Time to find out what to change:
tail -n 9999 *|strings|egrep "/etc/csh.cshrc|<=="


tells us we need to have a look at pathnames.h
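A plainer way to run the same search, as a minimal sketch: let grep walk the source tree and print the file name and line number of each hit. The helper name find_string is made up for this example, and -r is assumed to be supported by your grep (it is in GNU grep).

```shell
# Hypothetical helper: list every file and line that contains a fixed string.
# $1 = string to look for, $2 = directory to search (defaults to the current one).
find_string() {
    # -r: recurse, -n: show line numbers, -F: treat $1 as a literal string
    grep -rnF -- "$1" "${2:-.}"
}

# From inside tcsh-6.18.01/ you would run:
#   find_string /etc/csh.cshrc
```

Using -F avoids the dot in csh.cshrc being read as a regular expression wildcard.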
Change line 124 from
# define _PATH_DOTCSHRC     "/etc/csh.cshrc"
to
# define _PATH_DOTCSHRC     "/share/apps/utils/custom.tcshrc"
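If you'd rather not edit the header by hand, the same change can be scripted. This is only a sketch: the function name patch_dotcshrc is made up, and it assumes a sed that accepts -i with a backup suffix (GNU sed does). It rewrites every define of _PATH_DOTCSHRC that points at /etc/csh.cshrc and keeps the original as pathnames.h.orig.

```shell
# Hypothetical helper: rewrite the _PATH_DOTCSHRC define in a pathnames.h file.
# $1 = path to pathnames.h, $2 = new location of the startup file.
patch_dotcshrc() {
    # -i.orig edits in place and leaves a backup copy next to the file.
    sed -i.orig \
        "s|\(define[[:space:]]*_PATH_DOTCSHRC[[:space:]]*\)\"/etc/csh.cshrc\"|\1\"$2\"|" \
        "$1"
}

# Run from inside tcsh-6.18.01/; nothing happens if the file isn't there.
if [ -f pathnames.h ]; then
    patch_dotcshrc pathnames.h /share/apps/utils/custom.tcshrc
    grep -n '_PATH_DOTCSHRC' pathnames.h
fi
```

To undo the change, copy pathnames.h.orig back over pathnames.h before running make.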

./configure --prefix=/share/apps/utils/tcsh
make
make install

If all went well:
cat tcsh|strings|grep custom.tcsh
/share/apps/utils/custom.tcshrc
and
tree /share/apps/utils/tcsh -L 1
/share/apps/utils/tcsh
|-- bin
`-- share

Obviously, this doesn't really make much of a difference just yet. Now comes the scary part -- and you need root access for this:
 which csh
/bin/csh

ls /bin/csh -lah
lrwxrwxrwx 1 root root 4 Feb 23 16:54 /bin/csh -> tcsh
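and here's the 'dangerous' stuff:
sudo rm /bin/csh
sudo ln -s /share/apps/utils/tcsh/bin/tcsh /bin/csh
sudo chown -h root:root /bin/csh
Don't chmod the link: chmod follows symlinks, so it would change the permissions of the tcsh binary itself rather than those of /bin/csh.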

Since /bin/csh isn't a binary but a symlink to tcsh in the /bin directory, we just delete the symlink and create a new one.
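Since a wrong link here can lock you out, a slightly more defensive version of the swap may be worth having. The sketch below is not what I ran on the cluster; swap_link is a made-up helper, and the example paths assume tcsh was installed with --prefix=/share/apps/utils/tcsh, so that the binary lives in /share/apps/utils/tcsh/bin. It refuses to touch the link unless the new target is executable, and prints the old target so you can restore it.

```shell
# Hypothetical helper: point a symlink at a new target, but only if the
# target is an executable file. Prints the old target for later restoring.
# $1 = symlink to change, $2 = new target
swap_link() {
    link=$1
    target=$2
    if [ ! -x "$target" ]; then
        echo "refusing: $target is not executable" >&2
        return 1
    fi
    if [ -L "$link" ]; then
        echo "old target of $link: $(readlink "$link")"
    fi
    # -f replaces an existing link, -n stops ln from following it
    ln -sfn "$target" "$link"
}

# On a node, as root:
#   swap_link /bin/csh /share/apps/utils/tcsh/bin/tcsh
# To go back to the stock setup:
#   swap_link /bin/csh /bin/tcsh
```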

We can now make whatever changes we want to our custom.tcshrc while still being able to easily change back to the old setup. I do recognise that we could just have edited /etc/csh.cshrc and /etc/csh.login, but for some reason I feel a lot more comfortable using this method.



01 June 2012

169. ECCE, node hopping and cluster nodes without direct access to WAN

NOTE: I'm still working on this. But this sort of works for now. More details coming soon.
Update:  I've posted better solutions more recently for use with SGE. See aspects of e.g. http://verahill.blogspot.com.au/2012/06/ecce-in-virtual-machine-step-by-step.html and http://verahill.blogspot.com.au/2012/06/ecce-and-inaccessible-cluster-nodes.html

It's not easy choosing a good title for this post which describes the purpose of it succinctly yet clearly (sort of like this sentence), so here's what we're dealing with:

The Problem:
You can access the submit node from off-site. You can't access the subnodes directly from off-site. This post shows you how you can submit to each subnode directly. A better technical solution is obviously to use qsub on the main node. Having said that, with very little modification this method can also be adapted to the situation described here: http://verahill.blogspot.com.au/2012/05/port-redirection-with-eccenwchem.html

A low-tech, home-built example is the following:
-eth0 - Main  node - eth1 - (subnodes: node0, node1, node2, node3)/

eth0 is WAN, and eth1 is LAN. You can ssh to the main node from the 'internet'. There's no queue manager like SGE installed -- submission of jobs is done by logging onto each node and executing the commands there.

The example:
Main node:
    my.cluster.edu
Nodes:
   compute-0-0
   compute-0-1
   compute-0-2
   compute-0-3


The most obvious solution:
Do port redirection. The downside is that it requires some technical skill from the users, and anything involving networking and ssh is a PITA if they insist on using Windows.
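For reference, a minimal sketch of what the port redirection looks like with plain OpenSSH. The local port 2200 and the account name user are placeholders, not values from this cluster. The function only prints the command so you can check it before running it.

```shell
# Print the ssh command that forwards a local port to the ssh port (22)
# of a subnode, going through the main node.
# $1 = local port, $2 = subnode, $3 = login on the main node
tunnel_cmd() {
    printf 'ssh -N -L %s:%s:22 %s\n' "$1" "$2" "$3"
}

tunnel_cmd 2200 compute-0-0 user@my.cluster.edu
# With the tunnel open, the subnode is reached from the same machine via:
#   ssh -p 2200 user@localhost
```

Each subnode needs its own local port, which is exactly the kind of bookkeeping that makes this tedious for users.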


The smarter solution:
I was informed of this solution here.

ROCKS-specific stuff
If your cluster is running ROCKS 5.4.3 and you're having issues opening csh shells, just move /etc/csh.cshrc out of the way as a crude fix.
sudo mv /etc/csh.cshrc ~/
Don't forget to do this on all the nodes if they have local /etc folders! And it's not that easy -- the passwords aren't the same on the nodes as on the main node.
So, on the main node:
rocks set host sec_attr compute-0-1 attr=root_pw
sudo su
# cat /etc/hosts
# ssh 10.1.255.253
# mv /etc/csh.cshrc .
# exit

As the user you'll be running as, edit/create your ~/.cshrc

setenv GAUSS_SCRDIR /tmp
setenv ECCE_HOME /share/apps/ecce/apps
set path = (/share/apps/nwchem/nwchem-6.1/bin/LINUX64 $path)
setenv LD_LIBRARY_PATH /opt/openmpi/lib:/share/apps/openblas/lib
Repeat on all nodes.
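Repeating this by hand gets old quickly. A minimal sketch of a loop over the example nodes follows; it assumes you can scp from the main node to each compute node as your own user, and it only prints the commands (remove the echo to actually run them). The function name push_cshrc is made up.

```shell
# Node names from the example above; adjust to your cluster.
NODES="compute-0-0 compute-0-1 compute-0-2 compute-0-3"

# Print one scp command per node that would copy ~/.cshrc to the
# home directory on that node. This is a dry run: nothing is copied.
push_cshrc() {
    for node in $NODES; do
        echo scp "$HOME/.cshrc" "$node:.cshrc"
    done
}

push_cshrc
```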

SIDE NOTE: What I'm ultimately looking to achieve on the ROCKS cluster is front-node managed SGE submissions. Easy, you say? Well, ECCE submits a
setenv PATH /opt/gridengine/bin/lx26-amd64/:/usr/bin/:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:${PATH}
and for some reason the ROCKS/CentOS (t)csh usually fails when ${PATH} is added to itself like this, apparently because the result is too long, unless the value is wrapped in double quotes (it seems), i.e. this works consistently:
setenv PATH "/opt/gridengine/bin/lx26-amd64/:/usr/bin/:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:${PATH}"
On my Debian system (bsd-csh) either form works fine.

ANYWAY -- continue

Ecce:
Start setting it up using
ecce -admin



After that, do some editing in ecce-6.3/apps/siteconfig/

CONFIG.voldemort:
xappsPath: /usr/X11R6/bin
NWChem: /share/apps/nwchem/nwchem-6.1/bin/LINUX64/nwchem
Gaussian-03: /share/apps/gaussian/g09/g09
sourceFile:  /etc/csh.login
frontendMachine: my.cluster.edu

Machines:
compute-0-0 compute-0-0.local Dell HPCS3 Phase 1 Cluster AMD 2.6 GHz Opteron 40:5 ssh :NWChem:Gaussian-03MN:RD:SD:UN:PW




170. Compiling PVM and XPVM on ROCKS 5.4.3

And we're back to ROCKS again.

NOTE: I haven't actually tested the binaries and libs compiled here. I think they should work. But I don't know for sure.

PQS works with openmpi, mpich and PVM. Our vanilla ROCKS install already has openmpi and mpich. There's a package called rocks-pvm, but the size is 50 kb and didn't seem to actually install anything precompiled, so I removed it and decided to go the compilation way instead.

The paths here are specific to the cluster I did this on, so customise as needed.

sudo mkdir /share/apps/pvm
sudo chown ${USER} /share/apps/pvm
cd /share/apps/pvm
wget http://www.netlib.org/pvm3/pvm3.4.6.tgz
tar xvf pvm3.4.6.tgz
cd pvm3/
export PVM_ROOT=`pwd`
make

Time to set up environment variables. Edit either /etc/profile or ~/.bashrc, depending on your powers and reach, and add
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/share/apps/pvm/pvm3/lib/LINUX64
export PATH=$PATH:/share/apps/pvm/pvm3/bin/LINUX64
export PVM_ROOT=/share/apps/pvm/pvm3

Changes won't take effect until you source the file, or open a new terminal.
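One caveat with appending like this: if LD_LIBRARY_PATH starts out empty you end up with a leading colon, which the loader treats as the current directory. A minimal sketch of a helper that avoids that, and also skips directories that are already listed. append_path is a made-up name; the PVM paths are the ones used in this post.

```shell
PVM_ROOT=/share/apps/pvm/pvm3

# append_path VALUE DIR: print VALUE with DIR appended, without a stray
# colon when VALUE is empty and without adding DIR a second time.
append_path() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;          # already present
        *)        printf '%s\n' "${1:+$1:}$2" ;; # append, colon only if needed
    esac
}

export PVM_ROOT
export PATH="$(append_path "$PATH" "$PVM_ROOT/bin/LINUX64")"
export LD_LIBRARY_PATH="$(append_path "${LD_LIBRARY_PATH:-}" "$PVM_ROOT/lib/LINUX64")"
```

Because the helper checks for duplicates, sourcing the file more than once doesn't keep growing the variables.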


I profess to be ignorant about how to actually use pvm, so no testing just yet.


So I also stumbled across xpvm, which sounds (and looks) neat.

cd /share/apps/pvm
wget http://www.netlib.org/pvm3/xpvm/XPVM.src.1.2.5.tgz
tar xvf XPVM.src.1.2.5.tgz
cd xpvm/

Time to do some housekeeping before compiling:
It requires:
1. PVM 3.3.0 or later.
2. TCL 7.3 or later.
3. TK 3.6.1 or later.

I find
/usr/share/tk8.4
/usr/share/tcl8.4
so I might be ok. We just compiled pvm 3.4.6, so it should be alright.

First figure out where stuff is:

locate libtk|grep so
/usr/lib/libtk.so
/usr/lib/libtk8.4.so
/usr/lib64/libtk.so
/usr/lib64/libtk8.4.so
locate libtcl|grep so
/usr/lib/libtcl.so
/usr/lib/libtcl8.4.so
/usr/lib64/libtcl.so
/usr/lib64/libtcl8.4.so
/usr/lib64/tclx8.4/libtclx8.4.so
These are fairly standard locations, so ld should already search them and there's no need to specify them.

The include files are potentially a bigger problem, since they typically require the development packages.
locate tcl | grep "\.h"
[..]
/usr/include/tcl-private/generic/tcl.h
[..]
locate tk|grep "\.h"
[..]
/usr/include/tk-private/generic/tk.h
[..]
So we /should/ be fine.

We also need the X11 libs and headers:
locate libX11

/usr/lib/libX11.so
/usr/lib/libX11.so.6
/usr/lib/libX11.so.6.2.0
/usr/lib64/libX11.so
/usr/lib64/libX11.so.6
/usr/lib64/libX11.so.6.2.0
locate X11|grep include

[..]
/usr/include/X11
[..]
Finally,

locate libdl
/lib/libdl-2.5.so
/lib/libdl.so.2
/lib64/libdl-2.5.so
/lib64/libdl.so.2
/usr/lib/libdl.a
/usr/lib/libdl.so
/usr/lib64/libdl.a
/usr/lib64/libdl.so
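locate depends on its database being up to date, so as a cross-check here is a minimal sketch that tests for the files directly. The list is just the 64-bit paths from the locate output in this post; check_files is a made-up helper, and the paths will differ on other systems.

```shell
# Hypothetical helper: report which of the given files exist.
# Returns non-zero if at least one is missing.
check_files() {
    missing=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            echo "found   $f"
        else
            echo "MISSING $f"
            missing=1
        fi
    done
    return $missing
}

check_files \
    /usr/include/tcl-private/generic/tcl.h \
    /usr/include/tk-private/generic/tk.h \
    /usr/include/X11 \
    /usr/lib64/libtcl8.4.so \
    /usr/lib64/libtk8.4.so \
    /usr/lib64/libX11.so \
    /usr/lib64/libdl.so \
    || echo "something is missing: look for the matching development packages"
```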
I'll specify the lib locations even though in some of these particular cases it isn't necessary:


Edit xpvm/src/Makefile.aimk and set (line numbers added by me):
19  PVMVERSION = -DUSE_PVM_34
Comment out line 42:
 42 #TCLTKHOME  =  $(HOME)/TCL
and
 44 TCLTKHOME  =   /usr/include
Change

 47 TCLINCL     =   -I$(TCLTKHOME)/tcl-private/generic
 48 TKINCL      =   -I$(TCLTKHOME)/tk-private/generic
and

 57 TCLLIBDIR   =   -L/usr/lib64/tclx8.4
 58 TKLIBDIR    =   -L/usr/lib64
and
 70 TCLLIB      =   -ltcl8.4
 71 TKLIB       =   -ltk8.4
and
 83 XINCL       = -I/usr/include/X11
 84 XLIBDIR     = -L/usr/lib64
and finally,
 96 SHLIB       = -ldl



Fell asleep? Time to get compiling.
export XPVM_ROOT=/share/apps/pvm/xpvm

export TCL_LIBRARY=/usr/share/tcl8.4

export TK_LIBRARY=/usr/share/tk8.4

cd ${XPVM_ROOT}
make
[..]
Installing xpvm.tcl
Installing globs.tcl
Installing procs.tcl
Installing util.tcl
make[1]: Leaving directory `/share/apps/pvm/xpvm/src/LINUX64'

The beautiful thing is that the xpvm binary automagically ends up in the pvm3/bin/LINUX64 directory, so there's no need to fiddle with PATH.



In theory everything should work now if you log in with ssh -XC. However I get
xpvm
libpvm [pid2607] /tmp/pvmd.502: No such file or directory
libpvm [pid2607]: Can't Start PVM: Can't start pvmd
I'm not actually running -- nor have I ever run -- anything with pvm.

touch /tmp/pvmd.502
xpvm
libpvm [pid4219]: mksocs() read addr file: wrong length read
Connecting to PVMD already running... libpvm [pid4219]: mksocs() read addr file: wrong length read
libpvm [pid4219]: mksocs() read addr file: wrong length read
libpvm [pid4219]: mksocs() read addr file: wrong length read
libpvm [pid4219]: pvm_mytid(): Can't contact local daemon
libpvm [pid4219]: Error Joining PVM: Can't contact local daemon
I mean, it looks like it should work, once pvm is being used.
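The 'wrong length read' messages are consistent with libpvm trying to read a daemon address out of the empty file that touch created. Below is a minimal sketch, assuming the numeric suffix in /tmp/pvmd.502 is the user id, that only reports such a leftover file and does not delete anything. Starting the pvm console before xpvm should bring up the daemon, but that is an untested assumption here.

```shell
# Assumption: libpvm's address file is /tmp/pvmd.<numeric user id>.
addr_file="/tmp/pvmd.$(id -u)"

# True if the file exists but is empty (e.g. created with touch).
is_stale_addr_file() {
    [ -e "$1" ] && [ ! -s "$1" ]
}

if is_stale_addr_file "$addr_file"; then
    echo "$addr_file is empty; remove it before starting pvmd"
fi
```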