20 March 2014

567. Testing daisychain slurm script

I'm using stampede.TACC for jobs that need significantly longer than the 48-hour limit to run. Luckily, John Fonner at the Texas Advanced Computing Center has been kind enough to prepare a SLURM script that gets around the limit by daisychaining jobs.

Note that you should under no circumstances do this unless you've been specifically allowed to do so by your cluster manager. 

If you get clearance, you submit the script and it will run in the background and resubmit scripts until the job is done.
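
The general idea of daisychaining can be sketched roughly as follows. This is not the actual daisychain.slurm, only a minimal illustration of the pattern, assuming the standard sbatch options --dependency and --export. The function name next_link_cmd and the upper limit on the number of links are made up for the example, and whether the real script passes thisJobNumber this way is an assumption. The sketch only prints the command for the next link instead of submitting it.

```shell
#!/bin/bash
# Minimal sketch of the daisychain pattern (NOT the real daisychain.slurm).
# next_link_cmd and its arguments are hypothetical names for illustration.
# It prints the sbatch command that would queue the next link in the chain,
# or returns non-zero once the chain has reached its maximum length.
next_link_cmd() {
    local this_job=$1   # number of the link that is running now
    local max_jobs=$2   # assumed upper limit on the number of links
    local job_id=$3     # id of the running job (SLURM_JOB_ID inside a job)
    local script=$4     # the slurm script to resubmit

    if [ "$this_job" -ge "$max_jobs" ]; then
        return 1        # last link: nothing more to submit
    fi
    # afterany: start the next link once this job ends, however it ends
    echo "sbatch --dependency=afterany:${job_id}" \
         "--export=ALL,thisJobNumber=$((this_job + 1)) ${script}"
}

# Dry run: link 1 of at most 5, running as job 1234
next_link_cmd 1 5 1234 edited.slurm
```

Printing the command first makes it possible to check the chain logic on a machine without slurm before letting anything resubmit itself.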

To get the daisychain script, do

mkdir ~/tmp
cd ~/tmp
git clone https://github.com/johnfonner/daisychain.git

This will pull the latest version of daisychain.slurm. Rename it to e.g. edited.slurm

General editing of the slurm script:

 Replace all instances of


to avoid conflicts when several jobs are running concurrently

To run the script on your own system which you've set up like shown in this post, change




If you're using stampede.TACC, stick to login1.  

3. For gaussian jobs on stampede.TACC  

module load gaussian


if [ "$thisJobNumber" -eq "1" ]; then

Set up your restart job scripts. For example, if the job section of your slurm script looks like this
mkdir $SCRATCH/gaussian_tmp
export GAUSS_SCRDIR=$SCRATCH/gaussian_tmp
if [ "$thisJobNumber" -eq "1" ]; then
    #first job
    echo "Starting First Job:"
    g09 < freq.g09in > output_$thisJobNumber
else
    #continuation
    echo "Starting Continuation Job:"
    g09 < freq_restart.g09in > output_$thisJobNumber
fi
with freq.g09in looking like
%nprocshared=16
%rwf=/scratch/0XXXX/XXXX/gaussian_tmp/ajob.rwf
%Mem=2000000000
%Chk=/home1/0XXX/XXXX/myjob/ajob.chk
#P rpbe1pbe/GEN 5D Freq() SCF=(MaxCycle=256 ) Punch=(MO) Pop=()
with freq_restart.g09in being something along the lines of
%nprocshared=16
%Mem=2000000000
%rwf=/scratch/0XXX/XXXX/gaussian_tmp/ajob.rwf
%Chk=/home1/0XXXX/XXXX/myjob/ajob.chk
#P restart
(Note that the above example is a bit special since it 1) saves the .rwf, which is huge, and 2) restarts a frequency job. For a simple geometry optimisation job it's enough to restart from the .chk file.)
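
Since the restart input only differs from the first input in its route line, it could be generated instead of written by hand. The following is a small sketch, assuming that the Link 0 lines (the ones starting with %) can be copied unchanged and that a blank line is needed to end the route section. make_restart_input is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: build a restart input from the first-job input by
# copying its Link 0 lines (those starting with %) and adding "#P restart".
make_restart_input() {
    grep '^%' "$1"           # keep %nprocshared, %rwf, %Mem, %Chk as they are
    printf '#P restart\n\n'  # route line plus the blank line that ends it
}

# Usage (file names as in the job section shown earlier):
# make_restart_input freq.g09in > freq_restart.g09in
```

Generating the file this way keeps the .rwf and .chk paths in the two inputs identical, which is what the restart depends on.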

Testing at home
I set up a home system with slurm as shown here: http://verahill.blogspot.com.au/2014/03/565-setting-up-slurm-on-debian-wheezy.html

First edit the daisychain.slurm script as shown above. Note that your slurm script must end with .slurm for the script to recognise it as a slurm script. You can get around this by editing your script and specifying a job script name.

Specifically, change the run time to
#SBATCH -t 00:00:10 # Run time (hh:mm:ss)
comment out the partition name
##SBATCH -p normal
and change the job section to
#-------------------Job Goes Here--------------------------
if [ "$thisJobNumber" -eq "1" ]; then
    echo "Starting First Job:"
    sh sleeptest.sh
else
    echo "Starting Continuation Job:"
    sh sleeptest_2.sh
fi
#----------------------------------------------------------

Next, set up key-based login for localhost (if you haven't got a keypair, create one with ssh-keygen):

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh localhost
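
Running the cat line more than once adds the same key again each time. A slightly more careful variant might look like the sketch below, which only appends the key if it isn't already present. add_key_once is a made-up name, and the chmod 600 is an assumption about what sshd expects on a typical setup.

```shell
#!/bin/bash
# Append a public key to an authorized_keys file only if it is not already
# there. add_key_once is a hypothetical helper name.
add_key_once() {
    local pub=$1 auth=$2
    touch "$auth"
    chmod 600 "$auth"
    # -x: match whole lines, -F: fixed strings, -f: read patterns from file
    if ! grep -qxFf "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
}

# Usage:
# add_key_once "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
```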

Create two job files. sleeptest.sh:
echo "first job"
date
sleep 65
date
and sleeptest_2.sh:
echo "second job"
date
sleep 9
echo "Do nothing"

Submit using
sbatch test.slurm

Make sure to change
#SBATCH -J testx          # Job name
for each job so that you can have several running concurrently.
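
If you need several copies, the job name can be changed with sed rather than by hand. A minimal sketch, assuming the script contains exactly one "#SBATCH -J" line and that the new name contains no slash. copy_with_jobname is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: copy a slurm script and give the copy its own job name
# by rewriting the "#SBATCH -J" line. All other lines are copied unchanged.
copy_with_jobname() {
    local src=$1 dest=$2 name=$3
    sed "s/^#SBATCH -J .*/#SBATCH -J ${name}   # Job name/" "$src" > "$dest"
}

# Usage:
# copy_with_jobname test.slurm test2.slurm test2
```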

15 March 2014

566. Briefly: Annoying warnings when plotting using gnuplot and octave on wheezy March 2014.

Note: I'm not going to give a proper fix for this, but rather a work-around -- and one which isn't very good at that.

When using gnuplot or plotting in octave on wheezy I keep getting the following warnings.
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-ukai.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-ukai.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-ukai.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-ukai.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-ukai.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 16: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 28: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 28: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 28: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 28: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/41-arphic-uming.conf", line 28: Having multiple <family> in <alias> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/64-arphic-uming.conf", line 8: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/64-arphic-uming.conf", line 21: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/64-arphic-uming.conf", line 34: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/64-arphic-uming.conf", line 47: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/65-droid-sans-fonts.conf", line 103: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/65-droid-sans-fonts.conf", line 138: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/90-fonts-baekmuk.conf", line 10: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/etc/fonts/conf.d/90-fonts-baekmuk.conf", line 23: Having multiple values in <test> isn't supported and may not work as expected
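
With that many repeated lines it is easier to see which config files are involved if the warnings are condensed to a list of unique file names. A small sketch that reads the warnings on stdin; offending_confs is a made-up name, and plot.gp in the usage line is a placeholder for whatever gnuplot script triggers the warnings.

```shell
#!/bin/bash
# Read Fontconfig warnings on stdin and print each offending config file once.
# The file name is the first double-quoted field of each warning line.
offending_confs() {
    grep '^Fontconfig warning:' | cut -d'"' -f2 | sort -u
}

# Usage (plot.gp is a placeholder name):
# gnuplot plot.gp 2>&1 | offending_confs
```

For the warnings above this would boil down to five files: the arphic-ukai, arphic-uming (two files), droid-sans and baekmuk configs.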

My 'solution' was a bit radical -- I had already set up a system with apt-pinning (http://verahill.blogspot.com.au/2014/03/562-pulling-in-glibc-214-from-testing.html) so I figured that pulling in the fonts from testing couldn't hurt, assuming there were no dependencies to worry about.

So I did:
sudo apt-get install -t testing fonts-arphic-uming
and this worked fine.

The old 41-arphic-uming.conf:
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">

  Serif faces
                <family>AR PL ShanHeiSun Uni</family>
                <family>AR PL ShanHeiSun Uni MBE</family>
                <family>AR PL UMing CN</family>
                <family>AR PL UMing HK</family>
                <family>AR PL UMing TW</family>
                <family>AR PL UMing TW MBE</family>
  Monospace faces
                <family>AR PL ShanHeiSun Uni</family>
                <family>AR PL ShanHeiSun Uni MBE</family>
                <family>AR PL UMing CN</family>
                <family>AR PL UMing HK</family>
                <family>AR PL UMing TW</family>
                <family>AR PL UMing TW MBE</family>
The new 41-arphic-uming.conf:
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">

  Serif faces
                <family>AR PL ShanHeiSun Uni</family>
                <family>AR PL ShanHeiSun Uni MBE</family>
                <family>AR PL UMing CN</family>
                <family>AR PL UMing HK</family>
                <family>AR PL UMing TW</family>
                <family>AR PL UMing TW MBE</family>
  Monospace faces
                <family>AR PL ShanHeiSun Uni</family>
                <family>AR PL ShanHeiSun Uni MBE</family>
                <family>AR PL UMing CN</family>
                <family>AR PL UMing HK</family>
                <family>AR PL UMing TW</family>
                <family>AR PL UMing TW MBE</family>
It just remained to pull in the rest of the offending fonts:
sudo apt-get install -t testing fonts-arphic-ukai fonts-droid fonts-baekmuk

14 March 2014

565. Setting up slurm on debian wheezy (very basic)

I have a problem: I've got access to stampede.tacc in Texas which is using slurm as the queue manager. And while I've got SGE figured out (use it on my own cluster, my collaborator's cluster and it's used on the university cluster) I'm having some conceptual issues with SLURM.

I don't have any problems writing slurm scripts -- it's similar enough to SGE. But nowhere do I see anyone use -cwd or any equivalent in their slurm scripts. Either that is because you don't have to, or it's just an oversight in all of the examples that I've seen.

Learning by doing has also been an issue -- whenever I submit a test job it takes many, many hours before it's run. That's no way to learn.

Either way, it's time for me to become more familiar with slurm, so I've decided to set it up on a dedicated box.

I looked at this post while setting it up: http://paolobertasi.wordpress.com/2011/05/24/how-to-install-slurm-on-debian/

NOTE: I set up a single node. This won't deal with getting nodes to communicate, configuring master and submit nodes, or anything like that.

NOTE: the package slurm is a completely different program (network monitor). You need slurm-llnl

I also wonder whether the name has got anything to do with this Slurm...


sudo apt-get install slurm-llnl
Setting up munge (0.5.10-1) ...
Not starting munge (no keys found). Please run /usr/sbin/create-munge-key
Setting up slurm-llnl-basic-plugins (2.3.4-2+b1) ...
Setting up slurm-llnl (2.3.4-2+b1) ...
Not starting slurm-llnl
slurm.conf was not found in /etc/slurm-llnl
Please follow the instructions in /usr/share/doc/slurm-llnl/README.Debian.gz

Open the local file file:///usr/share/doc/slurm-llnl/slurm-llnl-configurator.html in a web browser and fill out the form. I got the following slurm.conf, which I put in /etc/slurm-llnl/ 
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=ecce64bit
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=verahill
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UnkillableStepTimeout=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=300
SlurmdTimeout=300
#UnkillableStepProgram=
#UnkillableStepTimeout=60
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#EnablePreemption=no
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=ecce64bit Procs=1 State=UNKNOWN
PartitionName=debug Nodes=ecce64bit Default=YES MaxTime=INFINITE State=UP
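
Most of that file is commented-out defaults, so it can be handy to pull out the value of a single active setting. A minimal sketch, assuming the file has one Key=Value pair per line as generated by the configurator (it will not pick apart the multi-key NodeName and PartitionName lines). conf_get is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the value of an active (uncommented) key in a
# slurm.conf-style file. Lines starting with # never match the anchored key.
conf_get() {
    local key=$1 file=$2
    sed -n "s/^${key}=//p" "$file"
}

# Usage:
# conf_get SlurmUser /etc/slurm-llnl/slurm.conf
# conf_get StateSaveLocation /etc/slurm-llnl/slurm.conf
```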

sudo /usr/sbin/create-munge-key
sudo service slurm-llnl start
[ ok ] Starting slurm central management daemon: slurmctld.
[ ok ] Starting slurm compute node daemon: slurmd.
sudo service munge start
[ ok ] Starting MUNGE: munged.

At that point I tried sinfo, squeue etc., none of which returned anything other than a connection error:
slurm_load_jobs error: Unable to contact slurm controller (connect failure)
slurm_load_partitions: Unable to contact slurm controller (connect failure)
So I rebooted, which had no effect. The log file /var/log/slurm-llnl/slurmctld.log contains
fatal: Incorrect permissions on state save loc: /var/lib/slurm-llnl/slurmctld
verahill@ecce64bit:~$ sudo chown verahill /var/lib/slurm-llnl/slurmctld
verahill@ecce64bit:~$ sudo service slurm-llnl restart
[ ok ] Stopping slurm central management daemon: slurmctld.
No /usr/sbin/slurmctld found running; none killed.
[ ok ] Stopping slurm compute node daemon: slurmd.
No /usr/sbin/slurmd found running; none killed.
slurmd dead but pid file exists
[ ok ] Starting slurm central management daemon: slurmctld.
[ ok ] Starting slurm compute node daemon: slurmd.
verahill@ecce64bit:~$ ps aux|grep slurm
verahill  3790  0.0  0.2 116164  2292 ?        Sl   21:12   0:00 /usr/sbin/slurmctld
root      3829  0.0  0.1  95064  1380 ?        S    21:12   0:00 /usr/sbin/slurmd
verahill@ecce64bit:~$ squeue
verahill@ecce64bit:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle ecce64bit

verahill@ecce64bit:~$ srun --ntasks=1  --label /bin/hostname && pwd && whoami
0: ecce64bit
/home/verahill
verahill
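
The permissions error in slurmctld.log could probably have been caught before starting the daemons by comparing SlurmUser with the owner of StateSaveLocation. A small sketch of such a check, assuming that matching ownership is what slurmctld wants (which is what the chown fix suggests) and that stat is the GNU coreutils version. check_state_owner is a made-up name.

```shell
#!/bin/bash
# Check that the StateSaveLocation named in a slurm.conf is owned by SlurmUser.
# check_state_owner is a hypothetical helper; stat -c is the GNU coreutils form.
check_state_owner() {
    local conf=$1 user dir owner
    user=$(sed -n 's/^SlurmUser=//p' "$conf")
    dir=$(sed -n 's/^StateSaveLocation=//p' "$conf")
    owner=$(stat -c %U "$dir") || return 2   # directory missing or unreadable
    if [ "$owner" = "$user" ]; then
        echo "ok: $dir is owned by $user"
    else
        echo "mismatch: $dir is owned by $owner, expected $user"
        return 1
    fi
}

# Usage:
# check_state_owner /etc/slurm-llnl/slurm.conf
```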

Time to write a simple queue script:
#!/bin/bash
#SBATCH -J pbe_delta       # Job name
#SBATCH -o pbe_delta.o%j   # Name of stdout output file(%j expands to jobId)
#SBATCH -e pbe_delta.o%j   # Name of stderr output file(%j expands to jobId)
#SBATCH -N 1               # Total number of nodes requested (16 cores/node)
#SBATCH -n 1
#SBATCH -t 48:00:00        # Run time (hh:mm:ss)

date> output.out
pwd >> output.out
hostname >> output.out
ls -lah
I submitted it using
sbatch job.slurm

and on running it gives two output files, output.out:
Fri Mar 14 17:16:10 EST 2014
and pbe_delta.o4
total 16K
drwxr-xr-x 2 verahill verahill 4.0K Mar 14 17:16 .
drwxr-xr-x 3 verahill verahill 4.0K Mar 14 17:14 ..
-rw-r--r-- 1 verahill verahill  491 Mar 14 17:16 job.slurm
-rw-r--r-- 1 verahill verahill   59 Mar 14 17:16 output.out
-rw-r--r-- 1 verahill verahill    0 Mar 14 17:15 pbe_delta.o3
-rw-r--r-- 1 verahill verahill    0 Mar 14 17:16 pbe_delta.o4
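
The listing shows job.slurm and output.out side by side, which suggests that the job ran in the directory it was submitted from, and that may be why none of the examples bother with an equivalent of -cwd. sbatch also exports SLURM_SUBMIT_DIR to the job, so a script that wants to be explicit can cd there. A small sketch; job_workdir is a made-up name, and the fallback to the current directory is an assumption added so that the same script also runs outside slurm.

```shell
#!/bin/bash
# Print the directory the job should work in: the submission directory when
# running under slurm, otherwise the current directory.
job_workdir() {
    echo "${SLURM_SUBMIT_DIR:-$PWD}"
}

# Usage, near the top of the job section:
# cd "$(job_workdir)" || exit 1
```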