Two issues:
* Sun GridEngine (now Oracle GridEngine) is missing from Debian Jessie, and I need a queue manager for my cluster. For now the wheezy package runs fine on Debian Jessie, but I'd be happier with a supported solution. SLURM is a good alternative here, and I've used it at the TACC.
* SLURM conflicts with Jürgen Rinas' sinfo package, which I use to keep an eye on my cluster. Until this has been resolved, I'll compile and use my own version of sinfo -- basically, I'll rename sinfo and sinfod to sinfo_jr and sinfod_jr. I can't live without sinfo.
Compiling sinfo
mkdir ~/tmp/sinfo -p
cd ~/tmp/sinfo
sudo apt-get install build-essential
sudo apt-get autoremove sinfo
apt-get source sinfo
cd sinfo-0.0.47/
vim debian/rules
Change
16 dh_auto_configure -- --enable-SIMPLE_USER_CACHE --enable-CPUNO_ADJUST
21 rm $(CURDIR)/debian/sinfo/usr/bin/sshallsinfo
22 rm $(CURDIR)/debian/sinfo/usr/share/man/man1/sshallsinfo.1
25 rm $(CURDIR)/debian/sinfo/usr/lib/*/sinfo/*.la
37 chmod 755 $(CURDIR)/debian/sinfo/usr/share/sinfo/sinfo.pl.cgi
to
16 dh_auto_configure -- --enable-SIMPLE_USER_CACHE --enable-CPUNO_ADJUST --program-suffix=_jr
21 rm $(CURDIR)/debian/sinfojr/usr/bin/sshallsinfo_jr
22 rm $(CURDIR)/debian/sinfojr/usr/share/man/man1/sshallsinfo_jr.1
25 rm $(CURDIR)/debian/sinfojr/usr/lib/*/sinfo/*.la
37 chmod 755 $(CURDIR)/cgi/sinfo.pl.cgi
That's jr for Jürgen Rinas.
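If you'd rather not make the debian/rules changes by hand, they could be scripted with sed. This is only a sketch: patch_rules is a hypothetical helper name, it assumes GNU sed (for -i) and that the lines in your copy of debian/rules look exactly like the ones listed above. It keeps a .orig backup and is meant to be run once.

```shell
#!/bin/sh
# Sketch: apply the debian/rules edits listed above in one go.
# patch_rules is a hypothetical name; GNU sed is assumed for -i.
patch_rules() {
    rules=$1
    cp "$rules" "$rules.orig"    # keep a backup of the untouched file
    sed -i \
        -e '/dh_auto_configure --/ s/$/ --program-suffix=_jr/' \
        -e 's#debian/sinfo/usr/bin/sshallsinfo#debian/sinfojr/usr/bin/sshallsinfo_jr#' \
        -e 's#debian/sinfo/usr/share/man/man1/sshallsinfo\.1#debian/sinfojr/usr/share/man/man1/sshallsinfo_jr.1#' \
        -e 's#debian/sinfo/usr/lib/\*/sinfo/\*\.la#debian/sinfojr/usr/lib/*/sinfo/*.la#' \
        -e 's#debian/sinfo/usr/share/sinfo/sinfo\.pl\.cgi#cgi/sinfo.pl.cgi#' \
        "$rules"
}
```

Run it from the source directory as patch_rules debian/rules, then compare debian/rules with debian/rules.orig before building.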
Then edit debian/control and change
12 Package: sinfo
15 Conflicts: slurm-client, slurm-llnl (<< 14.03.8-1)
to
12 Package: sinfojr
15 Conflicts:
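The same two debian/control changes could be made non-interactively. A minimal sketch, assuming GNU sed and that the Package and Conflicts lines match the ones shown above; patch_control is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: rename the binary package and empty the Conflicts field.
# patch_control is a hypothetical name; GNU sed is assumed for -i.
patch_control() {
    control=$1
    cp "$control" "$control.orig"    # keep a backup of the untouched file
    sed -i \
        -e 's/^Package: sinfo$/Package: sinfojr/' \
        -e 's/^Conflicts: slurm-client.*$/Conflicts:/' \
        "$control"
}
```

Call it as patch_control debian/control. Only the binary package name changes; the Source: line is left alone.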
Build:
dpkg-buildpackage -us -uc
cd ../
sudo dpkg -i sinfojr_0.0.47-3_amd64.deb
I launch sinfod_jr at boot by putting the following in /etc/rc.local:
su verahill -c '/usr/sbin/sinfod_jr --bcast 192.168.1.255' &
SLURM:
I had a look at this post:
https://paolobertasi.wordpress.com/2011/05/24/how-to-install-slurm-on-debian/
It looked too easy to be true.
Here's what I ended up doing:
On the MASTER node:
sudo apt-get install slurm-wlm slurmctld slurmd
[..]
Generating a pseudo-random key using /dev/urandom completed.
Please refer to /usr/share/doc/munge/README.Debian for instructions to generate more secure key.
Setting up slurm-client (14.03.9-5) ...
Setting up slurm-wlm-basic-plugins (14.03.9-5) ...
Setting up slurmd (14.03.9-5) ...
Setting up slurmctld (14.03.9-5) ...
Setting up slurm-wlm (14.03.9-5) ...
[..]
Open file:///usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html in a browser to generate a configuration:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=beryllium
ControlAddr=192.168.1.1
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=rupert
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
NodeName=beryllium NodeAddr=192.168.1.1
NodeName=neon NodeAddr=192.168.1.120
PartitionName=All Nodes=beryllium,neon
Copy the above block to /etc/slurm-llnl/slurm.conf.
Note that there is no space after the comma between beryllium and neon in the Nodes= directive.
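Since a stray space in the Nodes= list is easy to miss, a quick check before starting the daemons might look like the sketch below. check_nodes_list is a hypothetical helper name, and the default path is the one used in this post.

```shell
#!/bin/sh
# Sketch: fail if a PartitionName line has whitespace right after a comma
# inside its Nodes= list. check_nodes_list is a hypothetical name.
check_nodes_list() {
    conf=${1:-/etc/slurm-llnl/slurm.conf}
    # Nodes=<non-space characters>,<whitespace> means the list was split by a space.
    if grep -nE '^PartitionName=.*Nodes=[^[:space:]]*,[[:space:]]' "$conf"; then
        echo "space found inside a Nodes= list in $conf" >&2
        return 1
    fi
    return 0
}
```

Running check_nodes_list with no argument checks /etc/slurm-llnl/slurm.conf, prints any offending line with its line number, and returns non-zero if one is found.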
scontrol show daemons
slurmctld
sudo /usr/sbin/create-munge-key
The munge key /etc/munge/munge.key already exists
Do you want to overwrite it? (y/N) y
Generating a pseudo-random key using /dev/urandom completed.
sudo systemctl enable slurmctld.service
sudo ln -s /var/lib/slurm-llnl /var/lib/slurm
sudo systemctl start slurmctld.service
sudo systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
Active: active (running) since Tue 2015-07-21 11:16:18 AEST; 40s ago
Process: 19958 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 19960 (slurmctld)
CGroup: /system.slice/slurmctld.service
└─19960 /usr/sbin/slurmctld
sudo systemctl status munge.service
● munge.service - MUNGE authentication service
Loaded: loaded (/lib/systemd/system/munge.service; disabled)
Active: active (running) since Wed 2015-07-08 00:11:18 AEST; 1 weeks 6 days ago
Docs: man:munged(8)
Main PID: 25986 (munged)
CGroup: /system.slice/munge.service
└─25986 /usr/sbin/munged
Also, add yourself to the group slurm and run chmod g+r /var/log/slurm/accounting.
On neon (and later on each node):
Install slurmd and slurm-client as shown below, then copy /etc/munge/munge.key from the master node to the execute node. Do the same with /etc/slurm-llnl/slurm.conf. Then enable and restart the services.
sudo apt-get install slurmd slurm-client
sudo ln -s /var/lib/slurm-llnl /var/lib/slurm
sudo systemctl enable slurmd.service
sudo systemctl restart slurmd.service
sudo systemctl enable munge.service
sudo systemctl restart munge.service
sudo systemctl status slurmd.service
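The copy step is described above without commands, so here is one way it might be done from the master. This is a sketch under assumptions the post does not state: root SSH login to the node is permitted, and sync_node, run and DRY_RUN are hypothetical names. With DRY_RUN set, the commands are only printed.

```shell
#!/bin/sh
# Sketch only: sync_node, run and DRY_RUN are hypothetical names, and
# root SSH access to the node is an assumption not stated in the post.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"    # dry run: print the command instead of executing it
    else
        "$@"
    fi
}

sync_node() {
    node=$1
    # Copy the shared MUNGE key and the SLURM configuration from the master.
    run sudo scp -p /etc/munge/munge.key "root@$node:/etc/munge/munge.key"
    run sudo scp -p /etc/slurm-llnl/slurm.conf "root@$node:/etc/slurm-llnl/slurm.conf"
    # munged is strict about who owns the key and who can read it.
    run ssh "root@$node" 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
    # Restart the daemons so they pick up the new key and configuration.
    run ssh "root@$node" 'systemctl restart munge.service slurmd.service'
}
```

Try DRY_RUN=1 first, for example DRY_RUN=1; sync_node neon, and read the printed commands before running it for real.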
On the main host (beryllium) I checked that everything was working:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
All up infinite 1 idle* beryllium, neon
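The sinfo man page describes a trailing * on a node state as meaning the node is not currently responding, so it may be worth checking each node individually. A small sketch: nodes_not_idle is a hypothetical helper that reads "name state" pairs on stdin and prints every node whose state is not exactly idle.

```shell
#!/bin/sh
# Sketch: list nodes whose state is anything other than plain "idle"
# (for example "idle*", "down" or "allocated").
# nodes_not_idle is a hypothetical name; input is "name state" per line.
nodes_not_idle() {
    awk '$2 != "idle" { print $1 }'
}
```

On the master it could be fed with SLURM's sinfo, one node per line and without a header: sinfo -N -h -o '%N %T' | nodes_not_idle. On an otherwise unused cluster, empty output means every node reported idle.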