14 March 2014

565. Setting up slurm on debian wheezy (very basic)

I have a problem: I've got access to stampede.tacc in Texas, which uses SLURM as the queue manager. And while I've got SGE figured out (I use it on my own cluster and my collaborator's cluster, and it's what runs on the university cluster), I'm having some conceptual issues with SLURM.

I don't have any problems writing slurm scripts -- it's similar enough to SGE. But nowhere have I seen anyone use -cwd or any equivalent in their slurm scripts. Either that's because you don't have to, or it's simply an oversight in all of the examples I've come across.
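(Judging by the test job at the end of this post, where the pwd recorded in output.out is the directory the job was submitted from, the answer seems to be that you don't have to: sbatch appears to start the job in the submission directory by default. If you do want to set the working directory explicitly there's a -D/--workdir flag; a minimal, untested sketch:

#SBATCH -D /home/verahill/slurm/test   # run the job here rather than in the submission directory
)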

Learning by doing has also been an issue -- whenever I submit a test job it takes many, many hours before it's run. That's no way to learn.

Either way, it's time for me to become more familiar with slurm, so I've decided to set it up on a dedicated box.

I looked at this post while setting it up: http://paolobertasi.wordpress.com/2011/05/24/how-to-install-slurm-on-debian/

NOTE: I set up a single node. This won't deal with getting nodes to communicate, configuring master and submit nodes, or anything like that.

NOTE: the package slurm is a completely different program (network monitor). You need slurm-llnl
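To see which package is which before installing, something like this will print the package descriptions (output omitted here):

apt-cache show slurm slurm-llnl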

I also wonder whether the name has got anything to do with this Slurm...


Installation

sudo apt-get install slurm-llnl
Setting up munge (0.5.10-1) ...
Not starting munge (no keys found). Please run /usr/sbin/create-munge-key
Setting up slurm-llnl-basic-plugins (2.3.4-2+b1) ...
Setting up slurm-llnl (2.3.4-2+b1) ...
Not starting slurm-llnl
slurm.conf was not found in /etc/slurm-llnl
Please follow the instructions in /usr/share/doc/slurm-llnl/README.Debian.gz

Open the local file file:///usr/share/doc/slurm-llnl/slurm-llnl-configurator.html in a web browser and fill out the form. I got the following slurm.conf, which I put in /etc/slurm-llnl/ 
slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=ecce64bit
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=verahill
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UnkillableStepTimeout=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=300
SlurmdTimeout=300
#UnkillableStepProgram=
#UnkillableStepTimeout=60
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#EnablePreemption=no
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=ecce64bit Procs=1 State=UNKNOWN
PartitionName=debug Nodes=ecce64bit Default=YES MaxTime=INFINITE State=UP

sudo /usr/sbin/create-munge-key
sudo service slurm-llnl start
[ ok ] Starting slurm central management daemon: slurmctld.
[ ok ] Starting slurm compute node daemon: slurmd.
sudo service munge start
[ ok ] Starting MUNGE: munged.
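Since slurm authenticates via munge, it's worth checking that munge itself works before blaming slurm. The standard round-trip test is to generate a credential and decode it again:

munge -n | unmunge

If that reports success, munge is fine and any remaining problems are on the slurm side.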

At that point I tried sinfo, squeue etc., none of which returned anything other than a connection error:
squeue
slurm_load_jobs error: Unable to contact slurm controller (connect failure)
sinfo
slurm_load_partitions: Unable to contact slurm controller (connect failure)
So I rebooted, which had no effect. The log file /var/log/slurm-llnl/slurmctld.log contains
fatal: Incorrect permissions on state save loc: /var/lib/slurm-llnl/slurmctld
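To see who owns the state directory (it needs to be writable by SlurmUser, i.e. verahill in the slurm.conf above):

ls -ld /var/lib/slurm-llnl/slurmctld

Hence the fix: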
verahill@ecce64bit:~$ sudo chown verahill /var/lib/slurm-llnl/slurmctld
verahill@ecce64bit:~$ sudo service slurm-llnl restart
[ ok ] Stopping slurm central management daemon: slurmctld.
No /usr/sbin/slurmctld found running; none killed.
[ ok ] Stopping slurm compute node daemon: slurmd.
No /usr/sbin/slurmd found running; none killed.
slurmd dead but pid file exists
[ ok ] Starting slurm central management daemon: slurmctld.
[ ok ] Starting slurm compute node daemon: slurmd.
verahill@ecce64bit:~$ ps aux|grep slurm
verahill  3790  0.0  0.2 116164  2292 ?   Sl   21:12   0:00 /usr/sbin/slurmctld
root      3829  0.0  0.1  95064  1380 ?   S    21:12   0:00 /usr/sbin/slurmd
verahill@ecce64bit:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
verahill@ecce64bit:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle ecce64bit

Testing
 
verahill@ecce64bit:~$ srun --ntasks=1  --label /bin/hostname && pwd && whoami
0: ecce64bit
/home/verahill
verahill
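Note that because of the && chaining, only /bin/hostname is actually dispatched through srun; pwd and whoami run in the local shell (which on a single-node box happens to be the same machine anyway). To push all three commands through slurm, something like this should work:

srun --ntasks=1 --label bash -c 'hostname && pwd && whoami'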

Time to write a simple queue script:
job.slurm
#!/bin/bash
#SBATCH -J pbe_delta       # Job name
#SBATCH -o pbe_delta.o%j   # Name of stdout output file (%j expands to jobId)
#SBATCH -e pbe_delta.o%j   # Name of stderr output file (%j expands to jobId)
#SBATCH -N 1               # Total number of nodes requested (16 cores/node)
#SBATCH -n 1
#SBATCH -t 48:00:00        # Run time (hh:mm:ss)

date > output.out
pwd >> output.out
hostname >> output.out
ls -lah
I submitted it using
sbatch job.slurm

and on running it gives two output files:
output.out
Fri Mar 14 17:16:10 EST 2014
/home/verahill/slurm/test
ecce64bit
and pbe_delta.o4
total 16K
drwxr-xr-x 2 verahill verahill 4.0K Mar 14 17:16 .
drwxr-xr-x 3 verahill verahill 4.0K Mar 14 17:14 ..
-rw-r--r-- 1 verahill verahill  491 Mar 14 17:16 job.slurm
-rw-r--r-- 1 verahill verahill   59 Mar 14 17:16 output.out
-rw-r--r-- 1 verahill verahill    0 Mar 14 17:15 pbe_delta.o3
-rw-r--r-- 1 verahill verahill    0 Mar 14 17:16 pbe_delta.o4
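While a job is queued or running you can keep an eye on it with the usual client tools, e.g. (using job id 4 from the listing above):

squeue -u verahill
scontrol show job 4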

12 March 2014

564. Very, very briefly: ps2pdf not working on jessie

Update: see second comment below post.

ps2pdf example.ps produces

Error: /typecheck in /findfont
Operand stack:
   -66.7   --nostringval--   0   --nostringval--   --nostringval--   {}
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1910   1   3   %oparray_pop   1909   1   3   %oparray_pop   1893   1   3   %oparray_pop   1787   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   --nostringval--   %array_continue   --nostringval--   1868   6   6   %oparray_pop
Dictionary stack:
   --dict:1164/1684(ro)(G)--   --dict:0/20(G)--   --dict:78/200(L)--   --dict:196/300(L)--   --dict:44/200(L)--   --dict:194/256(L)--
Current allocation mode is local
Current file position is 393007
GPL Ghostscript 9.05: Unrecoverable error, exit code 1
The workaround is to remove the package fonts-font-awesome, which will also remove texlive-fonts-extra. As far as I understand, a fix has been included upstream and will work its way down to Jessie eventually.
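In other words, something along the lines of

sudo apt-get remove fonts-font-awesome

with apt then wanting to take texlive-fonts-extra along with it, as mentioned.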

11 March 2014

563. High disk i/o caused by find/sort <- updatedb

High disk I/O, leading to system slowdown, has been bothering me a lot recently. Most of the time I've simply blamed it on ECCE, and while the situation gets better when ECCE isn't running, it's still occasionally very bad.

Diagnosis

iotop shows
 Total DISK READ:       3.48 M/s | Total DISK WRITE:    1193.67 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                                                                     
25565 be/4 root        3.46 M/s    0.00 B/s  0.00 % 76.92 % find / ( -fstype nfs -o -fstype NFS -o -fstype proc -o -fstype afs -o -fstype smbfs -o -~$\)\|\(^/var/tmp$\)\|\(^/afs$\)\|\(^/amd$\)\|\(^/sfs$\)\|\(^/proc$\) ) -prune -o -print0
ps aux|grep 2556[0-9]
root     25562  0.0  0.0  18620   336 ?        S    12:33   0:00 /bin/sh /usr/bin/updatedb
root     25563 26.2  0.1  25996 12400 ?        S    12:33   1:51 /usr/bin/sort -z -f
root     25564  0.0  0.0   4216   116 ?        S    12:33   0:00 /usr/lib/locate/frcode -0
root     25565 24.2  0.0  19024   956 ?        R    12:33   1:09 /usr/bin/find / ( -fstype nfs -o -fstype NFS -o -fstype proc -o -fstype afs -o -fstype smbfs -o -fstype autofs -o -fstype iso9660 -o -fstype ncpfs -o -fstype coda -o -fstype devpts -o -fstype ftpfs -o -fstype devfs -o -fstype mfs -o -fstype sysfs -o -fstype shfs -o -type d -regex \(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)\|\(^/afs$\)\|\(^/amd$\)\|\(^/sfs$\)\|\(^/proc$\) ) -prune -o -print0
Heading deeper down the rabbit hole:
me@beryllium:~$ ps -p 25565 -o ppid=
25562
me@beryllium:~$ ps -p 25562 -o ppid=
25554
me@beryllium:~$ ps -p 25554 -o ppid=
25553
me@beryllium:~$ ps -p 25553 -o ppid=
25552
me@beryllium:~$ ps -p 25552 -o ppid=
 4315
me@beryllium:~$ ps -p 4315 -o ppid=
    1
me@beryllium:~$ ps aux|grep 4315
root      4315  0.0  0.0  26124   428 ?        Ss   Mar07   0:05 /usr/sbin/cron
me@beryllium:~$ ps aux|grep 25552
root     25552  0.0  0.0  64068   844 ?        S    12:33   0:00 /USR/SBIN/CRON
me@beryllium:~$ ps aux|grep 25554
root     25554  0.0  0.0  18620   588 ?        S    12:33   0:00 /bin/sh /usr/bin/updatedb
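(In hindsight, pstree would probably have shown the whole ancestry in one go, assuming a psmisc pstree recent enough to have the -s/--show-parents option:

pstree -ps 25565
)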

So the find process (25565) that is bogging down the computer is launched by updatedb, and updatedb in turn is started as a cron job. updatedb is run to refresh the locate database; locate is a powerful file search tool -- whereas find searches the filesystem on the fly, locate consults a pre-built database.
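As an illustration (made-up filename), these do roughly the same job, but the first crawls the whole filesystem every time while the second just queries the database that updatedb built:

find / -name some_file.txt 2>/dev/null
locate some_file.txt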

At this point it's probably a good idea to mention that I have a 4 TB system, plus four mounted NFS folders with many GB of content.

Either way, the only thing that remains is to identify which cron job is launching updatedb:

me@beryllium:~$ egrep "updatedb" /etc/cron.*/*
/etc/cron.daily/locate:# Please consult updatedb(1) and /usr/share/doc/locate/README.Debian
/etc/cron.daily/locate:[ -e /usr/bin/updatedb.findutils ] || exit 0
/etc/cron.daily/locate:# filesystems which are pruned from updatedb database
/etc/cron.daily/locate:# paths which are pruned from updatedb database
/etc/cron.daily/locate:if [ -r /etc/updatedb.findutils.cron.local ] ; then
/etc/cron.daily/locate: . /etc/updatedb.findutils.cron.local
/etc/cron.daily/locate:  cd / && nice -n ${NICE:-10} updatedb.findutils 2>/dev/null


Solution:
locate is a powerful command which I use frequently, but I'd be happy to change the frequency of updatedb to once per week instead of once per day, especially if running it takes hours.

sudo mv /etc/cron.daily/locate /etc/cron.weekly/locate

We can also work on excluding paths.
me@beryllium:~$ cat /etc/cron.weekly/locate |grep PRUNE
PRUNEFS="NFS nfs nfs4 afs binfmt_misc proc smbfs autofs iso9660 ncpfs coda devpts ftpfs devfs mfs shfs sysfs cifs lustre_lite tmpfs usbfs udf ocfs2"
PRUNEPATHS="/tmp /usr/tmp /var/tmp /afs /amd /alex /var/spool /sfs /media /var/lib/schroot/mount"
export FINDOPTIONS PRUNEFS PRUNEPATHS NETPATHS LOCALUSER

So my NFS folders are already excluded through PRUNEFS, but it might be worth throwing more paths into PRUNEPATHS. In my case I'm quite happy with a full run every week.
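Judging by the egrep output above, the cron script sources /etc/updatedb.findutils.cron.local if it's readable, so extra paths could presumably be added there rather than by editing the script itself. A sketch with made-up paths (assuming the local file is read after the defaults are set, which is what the script layout suggests):

# /etc/updatedb.findutils.cron.local
PRUNEPATHS="$PRUNEPATHS /home/me/scratch /mnt/backup"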

Update: I also discovered that I'd manually put an updatedb job in /etc/crontab which ran once every three hours. The cron.daily script runs at 6 am and so was unlikely to cause slowdowns during the hours when I'm actually at work; the real culprit was the job I'd set up myself.