* Sun GridEngine (now Oracle GridEngine) is missing from Debian Jessie. I need a queue manager for my cluster. For now the wheezy package runs fine on Debian Jessie, but I'd be happier with a supported solution. SLURM is a good alternative here, and I've used it at the TACC.
* SLURM conflicts with Jürgen Rinas' sinfo package, which I use to keep an eye on my cluster. Until this has been resolved, I'll compile and use my own version of sinfo -- basically, I'll rename sinfo and sinfod to sinfo_jr and sinfod_jr. I can't live without sinfo.
Compiling sinfo
mkdir -p ~/tmp/sinfo
cd ~/tmp/sinfo
sudo apt-get install build-essential
sudo apt-get autoremove sinfo
apt-get source sinfo
cd sinfo-0.0.47/
vim debian/rules
Change the following lines (the leading numbers are line numbers in debian/rules)

16 dh_auto_configure -- --enable-SIMPLE_USER_CACHE --enable-CPUNO_ADJUST
21 rm $(CURDIR)/debian/sinfo/usr/bin/sshallsinfo
22 rm $(CURDIR)/debian/sinfo/usr/share/man/man1/sshallsinfo.1
25 rm $(CURDIR)/debian/sinfo/usr/lib/*/sinfo/*.la
37 chmod 755 $(CURDIR)/debian/sinfo/usr/share/sinfo/sinfo.pl.cgi

to

16 dh_auto_configure -- --enable-SIMPLE_USER_CACHE --enable-CPUNO_ADJUST --program-suffix=_jr
21 rm $(CURDIR)/debian/sinfojr/usr/bin/sshallsinfo_jr
22 rm $(CURDIR)/debian/sinfojr/usr/share/man/man1/sshallsinfo_jr.1
25 rm $(CURDIR)/debian/sinfojr/usr/lib/*/sinfo/*.la
37 chmod 755 $(CURDIR)/cgi/sinfo.pl.cgi

That's jr for Jürgen Rinas.
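The manual edits to debian/rules could also be scripted, e.g. with sed. This is a sketch only, demonstrated on a scratch copy — check that the patterns still match before pointing it at the real debian/rules of whatever sinfo version you downloaded:

```shell
# Demo on a scratch file standing in for debian/rules
rules=$(mktemp)
cat > "$rules" <<'EOF'
dh_auto_configure -- --enable-SIMPLE_USER_CACHE --enable-CPUNO_ADJUST
rm $(CURDIR)/debian/sinfo/usr/bin/sshallsinfo
EOF

# Append the program suffix to the configure call (line 16 above)
sed -i 's/--enable-CPUNO_ADJUST$/--enable-CPUNO_ADJUST --program-suffix=_jr/' "$rules"

grep -- '--program-suffix=_jr' "$rules"
```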
Then edit debian/control and change

12 Package: sinfo
15 Conflicts: slurm-client, slurm-llnl (<< 14.03.8-1)

to

12 Package: sinfojr
15 Conflicts:
Build:
dpkg-buildpackage -us -uc
cd ../
sudo dpkg -i sinfo_0.0.47-3_amd64.deb
I launch sinfod_jr at boot by putting the following in /etc/rc.local:

su verahill -c '/usr/sbin/sinfod_jr --bcast 192.168.1.255' &
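Since Jessie runs systemd, a small unit file is an alternative to rc.local. This is only a sketch — the unit name is made up, and the binary path and user are taken from the setup above; adjust them to whatever your build actually installed:

```ini
# /etc/systemd/system/sinfod_jr.service (hypothetical name)
[Unit]
Description=sinfod (renamed local build)
After=network.target

[Service]
User=verahill
ExecStart=/usr/sbin/sinfod_jr --bcast 192.168.1.255

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable sinfod_jr.service instead of editing rc.local.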
SLURM:
I had a look at this post: https://paolobertasi.wordpress.com/2011/05/24/how-to-install-slurm-on-debian/
It looked too easy to be true.
Here's what I ended up doing:
On the MASTER node:
sudo apt-get install slurm-wlm slurmctld slurmd
[..]
Generating a pseudo-random key using /dev/urandom completed.
Please refer to /usr/share/doc/munge/README.Debian for instructions to generate more secure key.
Setting up slurm-client (14.03.9-5) ...
Setting up slurm-wlm-basic-plugins (14.03.9-5) ...
Setting up slurmd (14.03.9-5) ...
Setting up slurmctld (14.03.9-5) ...
Setting up slurm-wlm (14.03.9-5) ...
[..]

Then open file:///usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html in a browser and generate a configuration.
Copy the generated block to /etc/slurm-llnl/slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=beryllium
ControlAddr=192.168.1.1
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=rupert
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
NodeName=beryllium NodeAddr=192.168.1.1
NodeName=neon NodeAddr=192.168.1.120
PartitionName=All Nodes=beryllium,neon
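A cheap sanity check before distributing slurm.conf: every node listed in a Nodes= directive should also have its own NodeName= line. A small shell sketch, demonstrated here on a scratch file that mirrors the config above:

```shell
# Scratch copy of the relevant slurm.conf lines
conf=$(mktemp)
cat > "$conf" <<'EOF'
ControlMachine=beryllium
NodeName=beryllium NodeAddr=192.168.1.1
NodeName=neon NodeAddr=192.168.1.120
PartitionName=All Nodes=beryllium,neon
EOF

# Pull the comma-separated node list out of the partition definition
nodes=$(sed -n 's/.*Nodes=\([^ ]*\).*/\1/p' "$conf" | tr ',' '\n')

# Every listed node needs a matching NodeName= entry
for n in $nodes; do
    grep -q "^NodeName=$n " "$conf" && echo "$n: ok" || echo "$n: MISSING"
done
```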
Note the lack of spaces between beryllium and neon in the Nodes= directive.
Check which daemons should run on this host:

scontrol show daemons
slurmctld

Then generate the munge key:

sudo /usr/sbin/create-munge-key
The munge key /etc/munge/munge.key already exists
Do you want to overwrite it? (y/N) y
Generating a pseudo-random key using /dev/urandom completed.
sudo systemctl enable slurmctld.service
sudo ln -s /var/lib/slurm-llnl /var/lib/slurm
sudo systemctl start slurmctld.service
sudo systemctl status slurmctld.service

Also, add yourself to the group slurm and chmod g+r /var/log/slurm/accounting.

● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: active (running) since Tue 2015-07-21 11:16:18 AEST; 40s ago
  Process: 19958 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 19960 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─19960 /usr/sbin/slurmctld

sudo systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/lib/systemd/system/munge.service; disabled)
   Active: active (running) since Wed 2015-07-08 00:11:18 AEST; 1 weeks 6 days ago
     Docs: man:munged(8)
 Main PID: 25986 (munged)
   CGroup: /system.slice/munge.service
           └─25986 /usr/sbin/munged
On neon (and later on each node):
Install slurmd and slurm-client as shown below, then copy the /etc/munge/munge.key from the master node to the execute node. Do the same with /etc/slurm-llnl/slurm.conf. Then enable and restart the services.
sudo apt-get install slurmd slurm-client
sudo ln -s /var/lib/slurm-llnl /var/lib/slurm
sudo systemctl enable slurmd.service
sudo systemctl restart slurmd.service
sudo systemctl enable munge.service
sudo systemctl restart munge.service
sudo systemctl status slurmd.service
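The munge key must be byte-identical on every host and readable only by its owner, or munged refuses to start. A sketch of the permission check, run here on a scratch file — the real key lives at /etc/munge/munge.key and gets to the node with e.g. scp from the master:

```shell
# Scratch stand-in for /etc/munge/munge.key
keyfile=$(mktemp)

# create-munge-key produces a 1024-byte random key; mimic that here
dd if=/dev/urandom of="$keyfile" bs=1024 count=1 2>/dev/null

# Owner-read-only, as munged requires
chmod 400 "$keyfile"

stat -c '%a %s' "$keyfile"    # expect: 400 1024
```

On a real node, compare checksums against the master (e.g. md5sum /etc/munge/munge.key on both) before restarting munge.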
On the main host (beryllium) I checked that everything was working:
sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
All          up   infinite      1  idle* beryllium, neon
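To confirm the cluster actually accepts work, a minimal batch job can be submitted. A sketch — the partition name "All" comes from the slurm.conf above, the script path and job name are arbitrary:

```shell
# Write a minimal batch script for the "All" partition
cat > /tmp/hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --partition=All
#SBATCH --job-name=hello
hostname
EOF

# The script should contain two #SBATCH directives
grep -c '^#SBATCH' /tmp/hello.sbatch    # expect: 2

# On the cluster, submit with: sbatch /tmp/hello.sbatch
# and watch it with: squeue
```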