In the example I've used krypton as the node name, and 192.168.1.180 as the IP.
My front node is called beryllium and has an IP of 192.168.1.1.
0. On the front node
Add the new node name to the front node/queue master
Add execution host
qconf -ae
which opens a text file in vim
Edited hostname (krypton) but nothing else. Saving returns
added host krypton to exec host list
Add krypton as a submit host
qconf -as krypton
krypton added to submit host list
Doing this before touching the node makes life a little bit easier.
1. Edit /etc/hosts on the node
Leave
127.0.0.1 localhost
but remove
127.0.1.1 krypton
and make sure that it says
192.168.1.180 krypton
instead.
Throw in
192.168.1.1 beryllium
as well.
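If you add nodes often, the hosts edit above can be scripted. The following is only a rough sketch: fix_hosts is a made-up helper name (not part of SGE), and it assumes a plain hosts file where the first field of each line is the IP. It also drops any existing lines for the two IPs before re-adding them, so try it on a copy before pointing it at /etc/hosts.

```shell
#!/bin/sh
# fix_hosts FILE NODE NODE_IP MASTER MASTER_IP
# Hypothetical helper: keeps "127.0.0.1 localhost" and everything else,
# removes the 127.0.1.1 line and stale lines for the two IPs,
# then appends one line each for the node and the front node.
fix_hosts() {
    file=$1
    node=$2
    node_ip=$3
    master=$4
    master_ip=$5
    tmp=$(mktemp) || return 1
    # Filter on the first field (the IP address) only.
    awk -v a="$node_ip" -v b="$master_ip" \
        '$1 != "127.0.1.1" && $1 != a && $1 != b' "$file" > "$tmp"
    printf '%s %s\n' "$node_ip" "$node" >> "$tmp"
    printf '%s %s\n' "$master_ip" "$master" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

With the names used in this post, that would be called as fix_hosts hosts.copy krypton 192.168.1.180 beryllium 192.168.1.1, where hosts.copy is a copy of the node's /etc/hosts.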
2. Install SGE on node
sudo apt-get install gridengine-exec gridengine-client
You'll be asked about
Configure automatically: yes
Cell name: rupert
Master hostname: beryllium
3. Add node to queue and group
I maintain separate queues and groups depending on how many cores each node has. See e.g. http://verahill.blogspot.com.au/2012/06/setting-up-sun-grid-engine-with-three.html for how to create queues and groups.
If they already exist, just do
qconf -aattr hostgroup hostlist krypton @fourcores
qconf -aattr queue slots "[krypton=4]" fourcores.q
to add the new node.
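Since the two commands only differ in node name, slot count and queue/group name, a small helper can print them for any node. This is just a sketch: print_add_node_cmds is a made-up name, and it assumes the naming scheme used here, where host group @NAME goes with queue NAME.q. It only prints the commands and runs nothing, so you can check them before pasting.

```shell
#!/bin/sh
# print_add_node_cmds NODE SLOTS BASENAME
# Hypothetical helper: prints the two qconf commands from this step.
# Assumes host group "@BASENAME" and queue "BASENAME.q" already exist.
print_add_node_cmds() {
    node=$1
    slots=$2
    base=$3
    printf 'qconf -aattr hostgroup hostlist %s @%s\n' "$node" "$base"
    printf 'qconf -aattr queue slots "[%s=%s]" %s.q\n' "$node" "$slots" "$base"
}
```

Calling print_add_node_cmds krypton 4 fourcores prints the same two lines as shown above for krypton.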
4. Add pe to queue if necessary
Since I have different queues depending on the number of cores of a node, I tend to have to fiddle with this.
See e.g. http://verahill.blogspot.com.au/2012/06/setting-up-sun-grid-engine-with-three.html for how to create parallel environments (PEs).
If the pe you need is already created, you can do
qconf -mq fourcores.q
and edit pe_list
5. Check
On the front node, do
qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3  0.16    7.8G    5.3G   14.9G  398.2M
boron                   lx26-amd64      6  6.02    7.6G    1.6G   14.9G     0.0
helium                  lx26-amd64      2     -    2.0G       -    1.9G       -
lithium                 lx26-amd64      3     -    3.9G       -     0.0       -
neon                    lx26-amd64      8  8.01   31.4G    1.3G   59.6G     0.0
krypton                 lx26-amd64      4  4.01   15.6G    2.8G   14.9G     0.0
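If you'd rather not eyeball the table, the check can be scripted. A rough sketch follows: host_load_state is a made-up helper name, and it assumes the column layout shown above (hostname in the first column, LOAD in the fourth). I'm also assuming that a "-" in the LOAD column means the node isn't reporting load to the front node, as for helium and lithium above.

```shell
#!/bin/sh
# host_load_state HOSTNAME
# Hypothetical helper: reads qhost output on stdin and prints
#   reporting       if the host is listed with a load value
#   no-load-report  if the host is listed but LOAD is "-"
#   not-listed      if the host does not appear at all
host_load_state() {
    awk -v h="$1" '
        $1 == h {
            found = 1
            print ($4 == "-" ? "no-load-report" : "reporting")
            exit
        }
        END { if (!found) print "not-listed" }
    '
}
```

On the front node this would be used as qhost | host_load_state krypton.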
Thank you for this tutorial, but I have a problem after following the above steps. When I submit jobs, they only execute on the 1st node and the rest stay pending, while the 2nd host (which I recently added) remains idle: it appears when I use qhost but doesn't execute jobs. What do you think is causing the problem? Thanks in advance.
Not sure what you mean by "it appears when I use qhost but doesn't execute jobs."
Either way,
1. what code are you running?
2. have you installed it on each node?
3. are you trying to run jobs across nodes, or each job on a single node?
What does qstat -j jobnumber show for the idle job(s)?