03 June 2012

172. ECCE and a ROCKS cluster: step by step

This is quite similar to a recent post, but here's a step-by-step, detailed account of how to set up ECCE for remote job submission to a ROCKS 5.4.3 cluster (one front node, four subnodes).

Coming soon (give it a week): Setting up a virtualbox machine with ecce for (stubborn) windows and ROCKS/CentOS users.

What isn't shown are all the failed attempts and dead ends I went through before I had a working system. I compiled ECCE. I compiled tcsh. I tried compiling BSD csh, which required me to compile bmake, etc. This stuff looks simple, and it is simple -- but not obvious.

NOTE: From the outside we connect to rocks.university.edu. From inside the cluster the submit node is called rocks.local, and the subnodes are called node0, node1, etc. Refer back to this naming if you get confused later.
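To make the naming concrete, here's roughly how you'd hop between them by hand (rocksuser is just the example account that shows up in the CONFIG file further down):

# from your workstation you connect to the public name:
ssh rocksuser@rocks.university.edu
# once you're on the front end (rocks.local), the subnodes are reached by their local names:
ssh node0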

Step 1. Create the site in ecce
From the terminal, do
ecce -admin
and add a new machine

Don't forget to hit Add/Change queue to make the changes to the queue section take effect, then hit Add/Change. Also, pay attention to the Allocation Account tick box -- if it's ticked you can't submit anything unless you add an account. Important: the machine name you add here is the local name (or local IP) of the submit node -- it's not the 'public' name or URL; we'll add that elsewhere later. And don't forget to select the queue manager (I forgot to in the screenshot).

Close.

Step 2. Editing your CONFIG file
Since you're already in the terminal, go to ecce-v6.3/apps/siteconfig

Take a quick peek at your Machines file (no editing needed); the relevant line looks like this:

rocks rocks.local Dell beo Intel 40:5 ssh :NWChem:Gaussian-03 MN:RD:SD:UN:PW:Q:TL

Here 40:5 is cores:nodes, and the last field controls which options show up in the job launcher (see the :MM note at the end of this post).

Take another look at rocks.Q -- there's probably nothing to edit here either:

rocks.Q:
# Queue details for rocks
Queues:    nwchem
nwchem|minProcessors:       1
nwchem|maxProcessors:       40
nwchem|runLimit:       100000
nwchem|memLimit:       0
nwchem|scratchLimit:       0
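If you're unsure what limits to put in here, SGE itself can tell you. This assumes the gridengine install lives where qmgrPath points below, and that all.q (the common default) is the queue you care about:

/opt/gridengine/bin/lx26-amd64/qconf -sql       # list the queues SGE knows about
/opt/gridengine/bin/lx26-amd64/qconf -sq all.q  # show slots, h_rt and other limits for that queue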
Finally, do some editing of your CONFIG.rocks file.

CONFIG.rocks

NWChem: /share/apps/nwchem/nwchem-6.1/bin/LINUX64/nwchem
Gaussian-03: /share/apps/gaussian/g09/g09
perlPath: /usr/bin/
qmgrPath: /opt/gridengine/bin/lx26-amd64
sourcefile: /home/rocksuser/.cshrc
frontendMachine: rocks.university.edu

SGE {
#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=$walltime
#$ -l h_vmem=$memoryG
#$ -j y
#$ -pe orte $totalprocs  
}

NWChemEnvironment{
            LD_LIBRARY_PATH /usr/lib/openmpi/1.3.2-gcc/lib/
}

NWChemCommand {
        /opt/openmpi/bin/mpirun -n $totalprocs $nwchem $infile > $outfile
}
Gaussian-03Command {
    setenv GAUSS_SCRDIR /tmp
    setenv GAUSS_EXEDIR /share/apps/gaussian/g09/bsd:/share/apps/gaussian/g09/local:/share/apps/gaussian/g09/extras:/share/apps/gaussian/g09
    time /share/apps/gaussian/g09/g09 $infile $outfile
}

Obviously, your variables will be different. NOTE that memory is given in gigabytes here ($memoryG); you could also use $memoryM for megabytes. Just adjust your launcher requirements accordingly.
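To make it clearer what the SGE block does: ECCE substitutes $walltime, $memory and $totalprocs with whatever you enter in the launcher when it writes the submit script. With hypothetical launcher settings of 24 hours, 2 GB and 8 cores (and assuming the wall time comes through in hh:mm:ss form), the directives would expand to something like:

#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=24:00:00
#$ -l h_vmem=2G
#$ -j y
#$ -pe orte 8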

Step 3. Making csh modifications on the ROCKS cluster
On the main node, use the root password (or become sudo) and move /etc/csh.cshrc and /etc/csh.login out of the way (backing them up is a good idea). It doesn't seem like you need to make any csh-related changes on the subnodes.
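Something along these lines will do -- the .orig names are just a suggestion:

sudo mv /etc/csh.cshrc /etc/csh.cshrc.orig
sudo mv /etc/csh.login /etc/csh.login.orig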

Step 4. Finalising our set up
Start ecce the normal way (e.g. run ecce from the terminal)
In the Gateway, start the Machine Browser, highlight 'rocks' and click on Setup Remote Access.
Do what you're told.

Step 5. Submit to your heart's content!

NOTE: the option to set the amount of memory is not shown in the launcher window above -- my mistake. You can edit your apps/siteconfig/Machines file and add :MM at the end of the line, e.g.
Dynamic beryllium       Unspecified     Unspecified     Unspecified     18:3    ssh     :NWChem:Gaussian-03     MN:RD:SD:UN:PW:Q:TL:MM

171. Building ECCE on ROCKS/CentOS


I installed ECCE on a couple of machines: a single workstation running ROCKS, and remotely on a 40-core cluster running ROCKS. The local, workstation install worked fine. I never bothered much about the cluster install, and only recently looked closer at it. Well, I can launch the 'gateway' but nothing else -- when I click on e.g. the organizer button I get the ROCKS version of an hourglass that never goes away, and I don't get any error messages. Turning on logging doesn't yield anything either.

Ergo, I figured that building it myself might yield a different result. It didn't on the ROCKS cluster, but everything worked just fine on the single-node ROCKS training box I keep in my office.


CentOS is a bit dated, so you'll need to build your own apr and apr-util. Build apr:
cd /share/apps/utils/
wget http://mirror.mel.bkb.net.au/pub/apache//apr/apr-1.4.6.tar.gz
wget http://mirror.mel.bkb.net.au/pub/apache//apr/apr-util-1.4.1.tar.gz
tar xvf apr-1.4.6.tar.gz
cd apr-1.4.6/
./configure --prefix=/share/apps/utils/apr
make
make install
cd ../
tar xvf apr-util-1.4.1.tar.gz
cd apr-util-1.4.1/
./configure --prefix=/share/apps/utils/apr-util --with-apr=/share/apps/utils/apr/
make
make install
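If both builds went through, the two config scripts that the ECCE build gets pointed at later should now exist:

ls /share/apps/utils/apr/bin/apr-1-config
ls /share/apps/utils/apr-util/bin/apu-1-config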


Time for ecce.
First download the ecce-v6.3-src.tar.bz2 source tarball and put it in /share/apps/ecce/, then:
cd /share/apps/ecce/
tar xvf ecce-v6.3-src.tar.bz2
cd ecce-v6.3/
export ECCE_HOME=/share/apps/ecce/ecce-v6.3
cd build/

Edit build_ecce
889       ./configure --prefix=$ECCE_HOME/${ECCE_SYSDIR}3rdparty/httpd --enable-rewrite --enable-dav --enable-ss-compression
to
889       ./configure --prefix=$ECCE_HOME/${ECCE_SYSDIR}3rdparty/httpd --enable-rewrite --enable-dav --enable-ss-compression --with-apr=/share/apps/utils/apr/bin/apr-1-config --with-apr-util=/share/apps/utils/apr-util/bin/apu-1-config

./build_ecce
Just follow the instructions, i.e. hit return, over and over again. Answer no when asked whether to run the tests again. Then run build_ecce again:
./build_ecce
Now stuff should be building. Do this another six times. From the README:
"At this stage the script will build one 3rd party package per invocation,
exiting after each package is built.  In order the 3rd party packages that
will be built are:
1. Apache Xerces XML parser
2. Mesa OpenGL
3. wxWidgets C++ GUI toolkit
4. wxPython GUI toolkit
5. Apache HTTP web server"
The httpd build ends with a minor error about "lib" missing. It's fine.

The sixth time ECCE itself is built, and that's the step that takes by far the longest. It finishes with:
 ECCE built and distribution created in /share/apps/ecce/ecce-v6.3
On a single-node desktop it seemed like I could run it a seventh time; the last step finished with the message above, though.

Go to your /share/apps/ecce/ecce-v6.3/ dir where you'll find install_ecce.v6.3.csh
Do the install
csh -f install_ecce.v6.3.csh
Follow the instructions.

You may also want to
sudo mv /etc/csh.* ~/
to get rid of the crappy csh config files.

Edit your ~/.bashrc:

alias startecceserver='csh -f /share/apps/ecce/ecce-v6.3/server/ecce-admin/start_ecce_server'
alias stopecceserver='csh -f /share/apps/ecce/ecce-v6.3/server/ecce-admin/stop_ecce_server'
export ECCE_HOME=/share/apps/ecce/ecce-v6.3/apps
export PATH=$PATH:${ECCE_HOME}/scripts

and your ~/.cshrc:

setenv ECCE_HOME /share/apps/ecce/ecce-v6.3/apps
set path = (/share/apps/nwchem/nwchem-6.1/bin/LINUX64 $path)
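With those in place you can re-source the files and drive the server with the aliases, e.g.:

source ~/.bashrc
startecceserver    # starts the ECCE data server
ecce               # launches the gateway
stopecceserver     # shuts the server down again when you're done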

On my single-node box I had to edit apps/siteconfig/DataServers and replace eccetera.emsl.pnl.gov with localhost (two instances), and do the same in apps/siteconfig/jndi.properties (one instance).
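A quick way of making those substitutions (sed -i.bak keeps a backup copy of each file, just in case):

cd /share/apps/ecce/ecce-v6.3/apps/siteconfig
sed -i.bak 's/eccetera.emsl.pnl.gov/localhost/g' DataServers jndi.properties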

In spite of the hassle, everything works on the single-node box -- the builder, organizer etc. all open just fine. The ROCKS cluster looks fine, but doesn't work.

The ROCKS cluster:
Everything seems to build and start fine -- running ecce launches the gateway -- but clicking on anything makes the CentOS version of the hourglass churn over and over for all eternity. Nothing happens.

I looked through these two threads, and I also tried the pre-built 32-bit binary. All without luck.

I've also tried editing the site_runtime file:
ECCE_MESA_OPENGL true
ECCE_MESA_EXCEPT x86_64:RedHat:Fedora:CentOS
(matches the lsb_release -is output)




02 June 2012

170. tcsh in ROCKS/CentOS with hardcoded csh.cshrc path

WHAT THIS POST DOES: It shows you how to compile your own tcsh which won't be looking at /etc/csh.cshrc. It doesn't show you how to set up the correct .cshrc files. But it certainly allows you to experiment.

Also, keep in mind that since each node's local hard drive has its own /bin directory (not exported), you need to make similar changes on each node (i.e. change the /bin/csh symlink -- see below).

The csh startup files are horribly broken on ROCKS 5.4.3.

For now I've solved it by just moving /etc/csh.cshrc out of the way. What we do here instead is symlink /bin/csh to our own tcsh, which has been hardcoded to use a non-standard configuration file, so that you can use the standard ROCKS tcsh with /etc/csh.cshrc and your own csh (tcsh) with your own config files.

To be clear: it's not the csh binary which is borked on ROCKS 5.4.3, but the configuration files.

There's a patch for the broken csh setup -- but when I applied it to a test computer things only got more broken: csh wouldn't open and stay open anymore. Good way of getting locked out. So I'm not keen on doing the same thing on someone else's production cluster. Also, I've opted for tcsh since the csh sources come with a BSD-style makefile, and I really can't deal with that right now.

What we'll do is hardcode the location of the csh.cshrc file and change it from /etc/csh.cshrc to /share/apps/utils/custom.tcshrc.

sudo mkdir /share/apps/utils
sudo chown ${USER} /share/apps/utils
cd /share/apps/utils
wget http://ftp.de.debian.org/debian/pool/main/t/tcsh/tcsh_6.18.01.orig.tar.gz
tar xvf tcsh_6.18.01.orig.tar.gz
cd tcsh-6.18.01/

Time to find out what to change:
tail -n 9999 *|strings|egrep "/etc/csh.cshrc|<=="


tells us we need to have a look at pathnames.h
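(The trick is that tail, given several files, prints a ==> filename <== header before each one, so grepping for the path together with <== shows which source file the hardcoded path lives in. A plain recursive grep gets you there too:)

grep -rn "/etc/csh.cshrc" .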
Change

124 # define _PATH_DOTCSHRC     "/etc/csh.cshrc"
to
124 # define _PATH_DOTCSHRC     "/share/apps/utils/custom.tcshrc"

./configure --prefix=/share/apps/utils/tcsh
make
make install

If all went well:
cat tcsh|strings|grep custom.tcsh
/share/apps/utils/custom.tcshrc
and
tree /share/apps/utils/tcsh -L 1
/share/apps/utils/tcsh
|-- bin
`-- share

Obviously, this doesn't really make much of a difference just yet. Now comes the scary part -- and you need root access for this:
which csh
/bin/csh

ls /bin/csh -lah
lrwxrwxrwx 1 root root 4 Feb 23 16:54 /bin/csh -> tcsh
and here's the 'dangerous' stuff:
sudo rm /bin/csh
sudo ln -s /share/apps/utils/bin/tcsh /bin/csh
sudo chown root:root /bin/csh
sudo chmod 777 /bin/csh

Since /bin/csh isn't a binary but a symlink to the tcsh in the /bin directory, we just delete the symlink and create a new one.

We can now make whatever changes we want to our custom.tcshrc while still being able to easily change back to the old setup. I do recognise that we could just have edited /etc/csh.cshrc and /etc/csh.login, but for some reason I feel a lot more comfortable using this method.
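For the record, changing back is just the reverse -- remove our symlink and recreate the original relative one shown in the ls output above:

sudo rm /bin/csh
sudo ln -s tcsh /bin/csh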