I don't have any problems writing slurm scripts -- it's similar enough to SGE. But nowhere do I see anyone use -cwd or any equivalent in their slurm scripts. Either that is because you don't have to, or it's just an oversight in all of the examples that I've seen.
Learning by doing has also been an issue -- whenever I submit a test job it takes many, many hours before it's run. That's no way to learn.
Either way, it's time for me to become more familiar with slurm, so I've decided to set it up on a dedicated box.
I look at this post while setting it up: http://paolobertasi.wordpress.com/2011/05/24/how-to-install-slurm-on-debian/
NOTE: I set up a single node. This won't deal with getting nodes to communicate, configuring master and submit nodes, or anything lik that.
NOTE: the package slurm is a completely different program (network monitor). You need slurm-llnl
I also wonder whether the name has got anything to with this Slurm...
Installation
sudo apt-get install slurm-llnlSetting up munge (0.5.10-1) ... Not starting munge (no keys found). Please run /usr/sbin/create-munge-key Setting up slurm-llnl-basic-plugins (2.3.4-2+b1) ... Setting up slurm-llnl (2.3.4-2+b1) ... Not starting slurm-llnl slurm.conf was not found in /etc/slurm-llnl Please follow the instructions in /usr/share/doc/slurm-llnl/README.Debian.gz
Open the local file file:///usr/share/doc/slurm-llnl/slurm-llnl-configurator.html in a web browser and fill out the form. I got the following slurm.conf, which I put in /etc/slurm-llnl/
slurm.conf
# slurm.conf file generated by configurator.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more information. # ControlMachine=ecce64bit #ControlAddr= #BackupController= #BackupAddr= # AuthType=auth/munge CacheGroups=0 #CheckpointType=checkpoint/none CryptoType=crypto/munge #DisableRootJobs=NO #EnforcePartLimits=NO #Epilog= #PrologSlurmctld= #FirstJobId=1 JobCheckpointDir=/var/lib/slurm-llnl/checkpoint #JobCredentialPrivateKey= #JobCredentialPublicCertificate= #JobFileAppend=0 #JobRequeue=1 #KillOnBadExit=0 #Licenses=foo*4,bar #MailProg=/usr/bin/mail #MaxJobCount=5000 #MaxTasksPerNode=128 MpiDefault=none #MpiParams=ports=#-# #PluginDir= #PlugStackConfig= #PrivateData=jobs ProctrackType=proctrack/pgid #Prolog= #PrologSlurmctld= #PropagatePrioProcess=0 #PropagateResourceLimits= #PropagateResourceLimitsExcept= ReturnToService=1 #SallocDefaultCommand= SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid SlurmctldPort=6817 SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid SlurmdPort=6818 SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd SlurmUser=verahill #SrunEpilog= #SrunProlog= StateSaveLocation=/var/lib/slurm-llnl/slurmctld SwitchType=switch/none #TaskEpilog= TaskPlugin=task/none #TaskPluginParam= #TaskProlog= #TopologyPlugin=topology/tree #TmpFs=/tmp #TrackWCKey=no #TreeWidth= #UnkillableStepProgram= #UnkillableStepTimeout= #UsePAM=0 # # # TIMERS #BatchStartTimeout=10 #CompleteWait=0 #EpilogMsgTime=2000 #GetEnvTimeout=2 #HealthCheckInterval=0 #HealthCheckProgram= InactiveLimit=0 KillWait=30 #MessageTimeout=10 #ResvOverRun=0 MinJobAge=300 #OverTimeLimit=0 SlurmctldTimeout=300 SlurmdTimeout=300 #UnkillableStepProgram= #UnkillableStepTimeout=60 Waittime=0 # # # SCHEDULING #DefMemPerCPU=0 #EnablePreemption=no FastSchedule=1 #MaxMemPerCPU=0 #SchedulerRootFilter=1 #SchedulerTimeSlice=30 SchedulerType=sched/backfill SchedulerPort=7321 SelectType=select/linear #SelectTypeParameters= # # # JOB PRIORITY #PriorityType=priority/basic #PriorityDecayHalfLife= #PriorityCalcPeriod= #PriorityFavorSmall= #PriorityMaxAge= #PriorityUsageResetPeriod= #PriorityWeightAge= #PriorityWeightFairshare= #PriorityWeightJobSize= #PriorityWeightPartition= #PriorityWeightQOS= # # # LOGGING AND ACCOUNTING #AccountingStorageEnforce=0 #AccountingStorageHost= #AccountingStorageLoc= #AccountingStoragePass= #AccountingStoragePort= AccountingStorageType=accounting_storage/none #AccountingStorageUser= ClusterName=cluster #DebugFlags= #JobCompHost= #JobCompLoc= #JobCompPass= #JobCompPort= JobCompType=jobcomp/none #JobCompUser= JobAcctGatherFrequency=30 JobAcctGatherType=jobacct_gather/none SlurmctldDebug=3 SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log SlurmdDebug=3 SlurmdLogFile=/var/log/slurm-llnl/slurmd.log # # # POWER SAVE SUPPORT FOR IDLE NODES (optional) #SuspendProgram= #ResumeProgram= #SuspendTimeout= #ResumeTimeout= #ResumeRate= #SuspendExcNodes= #SuspendExcParts= #SuspendRate= #SuspendTime= # # # COMPUTE NODES NodeName=ecce64bit Procs=1 State=UNKNOWN PartitionName=debug Nodes=ecce64bit Default=YES MaxTime=INFINITE State=UP
sudo /usr/sbin/create-munge-key sudo service slurm-llnl start [ ok ] Starting slurm central management daemon: slurmctld. [ ok ] Starting slurm compute node daemon: slurmd. sudo service munge start [ ok ] Starting MUNGE: munged.
At that point I tried sinfo, squeue etc., none of which returned anything other than a connection error:
squeueSo I rebooted. Which had no effect.The log file /var/log/slurm-llnl/slurmctld.log containsslurm_load_jobs error: Unable to contact slurm controller (connect failure)sinfoslurm_load_partitions: Unable to contact slurm controller (connect failure)
fatal: Incorrect permissions on state save loc: /var/lib/slurm-llnl/slurmctldverahill@ecce64bit:~$ sudo chown verahill /var/lib/slurm-llnl/slurmctld verahill@ecce64bit:~$ sudo service slurm-llnl restart[ ok ] Stopping slurm central management daemon: slurmctld. No /usr/sbin/slurmctld found running; none killed. [ ok ] Stopping slurm compute node daemon: slurmd. No /usr/sbin/slurmd found running; none killed. slurmd dead but pid file exists [ ok ] Starting slurm central management daemon: slurmctld. [ ok ] Starting slurm compute node daemon: slurmd.verahill@ecce64bit:~$ ps aux|grep slurmverahill 3790 0.0 0.2 116164 2292 ? Sl 21:12 0:00 /usr/sbin/slurmctld root 3829 0.0 0.1 95064 1380 ? S 21:12 0:00 /usr/sbin/slurmdverahill@ecce64bit:~$ squeueJOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)verahill@ecce64bit:~$ sinfoPARTITION AVAIL TIMELIMIT NODES STATE NODELIST debug* up infinite 1 idle ecce64bit
Testing
verahill@ecce64bit:~$ srun --ntasks=1 --label /bin/hostname && pwd && whoami0: ecce64bit /home/verahill verahill
Time to write a simple queue script:
job.slurm
I submitted it using#!/bin/bash #SBATCH -J pbe_delta # Job name #SBATCH -o pbe_delta.o%j # Name of stdout output file(%j expands to jobId) #SBATCH -e pbe_delta.o%j # Name of stderr output file(%j expands to jobId) #SBATCH -N 1 # Total number of nodes requested (16 cores/node) #SBATCH -n 1 #SBATCH -t 48:00:00 # Run time (hh:mm:ss) date> output.out pwd >> output.out hostname >> output.out ls -lah
sbatch job.slurm
and on running it gives two output files:
output.out
and pbe_delta.o4Fri Mar 14 17:16:10 EST 2014 /home/verahill/slurm/test Ecce64bit
total 16K drwxr-xr-x 2 verahill verahill 4.0K Mar 14 17:16 . drwxr-xr-x 3 verahill verahill 4.0K Mar 14 17:14 .. -rw-r--r-- 1 verahill verahill 491 Mar 14 17:16 job.slurm -rw-r--r-- 1 verahill verahill 59 Mar 14 17:16 output.out -rw-r--r-- 1 verahill verahill 0 Mar 14 17:15 pbe_delta.o3 -rw-r--r-- 1 verahill verahill 0 Mar 14 17:16 pbe_delta.o4