
01 June 2012

169. ECCE, node hopping and cluster nodes without direct access to WAN

NOTE: I'm still working on this. But this sort of works for now. More details coming soon.
Update: I've since posted better solutions for use with SGE; see e.g. http://verahill.blogspot.com.au/2012/06/ecce-in-virtual-machine-step-by-step.html and http://verahill.blogspot.com.au/2012/06/ecce-and-inaccessible-cluster-nodes.html

It's not easy to choose a title for this post that describes its purpose succinctly yet clearly (sort of like this sentence), so here's what we're dealing with:

The Problem:
You can access the submit node from off-site, but you can't reach the subnodes directly from off-site. This post shows how you can nevertheless submit to each subnode directly. A better technical solution is obviously to use qsub on the main node. Having said that, with very little modification this method can also be adapted to the situation described here: http://verahill.blogspot.com.au/2012/05/port-redirection-with-eccenwchem.html

A low-tech, home-built example is the following:
eth0 - main node - eth1 - (subnodes: node0, node1, node2, node3)

eth0 is WAN, and eth1 is LAN. You can ssh to the main node from the 'internet'. There's no queue manager like SGE installed -- submission of jobs is done by logging onto each node and executing the commands there.
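
At the moment, running a job therefore looks something like the following. This is just a sketch: the scratch directory and core count are made up, and the nwchem path is the one used further down in this post.

ssh user@my.cluster.edu
ssh compute-0-0
cd /scratch/myjob
mpirun -n 8 /share/apps/nwchem/nwchem-6.1/bin/LINUX64/nwchem myjob.nw > myjob.nwout &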

The example:
Main node:
    my.cluster.edu
Nodes:
   compute-0-0
   compute-0-1
   compute-0-2
   compute-0-3


The most obvious solution:
do port redirection. The downside is that it requires some technical skill on the part of the users, and anything involving networking and ssh is a PITA if they insist on using Windows.
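
For reference, a tunnel would look something like this (the local port 2222 and the user name are arbitrary examples):

ssh -L 2222:compute-0-0:22 user@my.cluster.edu

and then, from a second terminal on the same off-site machine:

ssh -p 2222 user@localhost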


The smarter solution:
I was informed of this solution here.

ROCKS-specific stuff
If your cluster is running ROCKS 5.4.3 and you're having issues opening csh shells, just move /etc/csh.cshrc out of the way as a crude fix.
sudo mv /etc/csh.cshrc ~/
Don't forget to do this on all the nodes if they have local /etc folders! That's not quite as straightforward as it sounds, since the passwords on the nodes aren't the same as on the main node.
So, on the main node (the # prompt indicates the root shell, and the IP is the compute node's address as listed in /etc/hosts):
rocks set host sec_attr compute-0-1 attr=root_pw
sudo su
# cat /etc/hosts
# ssh 10.1.255.253
# mv /etc/csh.cshrc .
# exit
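
Since the sequence above relies on root on the main node being able to ssh into the compute nodes anyway, a loop along these lines (run as root on the main node; node names as in the example above) takes care of all the nodes in one go:

for node in compute-0-0 compute-0-1 compute-0-2 compute-0-3
do
    ssh $node 'mv /etc/csh.cshrc /root/'
done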

As the user you'll be running as, edit/create your ~/.cshrc

setenv GAUSS_SCRDIR /tmp
setenv ECCE_HOME /share/apps/ecce/apps
set path = (/share/apps/nwchem/nwchem-6.1/bin/LINUX64 $path)
setenv LD_LIBRARY_PATH /opt/openmpi/lib:/share/apps/openblas/lib
Repeat on all nodes.
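
If home directories aren't NFS-shared across the nodes (on a stock ROCKS install they normally are, in which case this step isn't needed), something like the following pushes the file out to the nodes from the example above:

for node in compute-0-0 compute-0-1 compute-0-2 compute-0-3
do
    scp ~/.cshrc $node:~/
done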

SIDE NOTE: What I'm ultimately looking to achieve on the ROCKS cluster is front-node-managed SGE submissions. Easy, you say? Well, the job script that ECCE submits contains a line like
setenv PATH /opt/gridengine/bin/lx26-amd64/:/usr/bin/:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:${PATH}
and for some reason the ROCKS/CentOS (t)csh fails, most of the time, when appending ${PATH} to itself like this, seemingly because the resulting value gets too long, unless it's wrapped in double quotes. That is, this works consistently:
setenv PATH "/opt/gridengine/bin/lx26-amd64/:/usr/bin/:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:${PATH}"
On my Debian system either form works fine (bsd-csh).
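
If you want to check whether a particular (t)csh build is affected, something along these lines should show it; the self-appending is just a quick way to make PATH long, and you may have to repeat it a few times before the unquoted form starts failing:

setenv PATH ${PATH}:${PATH}
setenv PATH "${PATH}:${PATH}"

On an affected shell the unquoted form eventually errors out, while the quoted form keeps working.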

ANYWAY -- continue

Ecce:
Start setting it up using
ecce -admin



After that, do some editing in ecce-6.3/apps/siteconfig/. Here, voldemort stands for whatever name the machine was registered under in ecce -admin, and the frontendMachine line is the crucial bit: it tells ECCE to go via my.cluster.edu to reach a node it can't contact directly.

CONFIG.voldemort:
xappsPath: /usr/X11R6/bin
NWChem: /share/apps/nwchem/nwchem-6.1/bin/LINUX64/nwchem
Gaussian-03: /share/apps/gaussian/g09/g09
sourceFile:  /etc/csh.login
frontendMachine: my.cluster.edu

Machines:
compute-0-0 compute-0-0.local Dell HPCS3 Phase 1 Cluster AMD 2.6 GHz Opteron 40:5 ssh :NWChem:Gaussian-03:MN:RD:SD:UN:PW