Note that you should under no circumstances do this unless you've been specifically allowed to do so by your cluster manager.
If you get clearance, submit the script; it will run in the background and keep resubmitting itself until the job is done.
To get the daisychain script, do
mkdir ~/tmp
cd ~/tmp
git clone https://github.com/johnfonner/daisychain.git
This will pull the latest version of daisychain.slurm. Rename it to e.g. edited.slurm.
General editing of the slurm script:
1.
Replace all instances of
~/.daisychain
with
~/daisychain_$baseSlurmJobName
to avoid conflicts when several jobs are running concurrently
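If you'd rather not edit by hand, the substitution can be scripted; a minimal sketch using GNU sed, assuming you named your copy edited.slurm as above:

```shell
# Replace every literal ~/.daisychain with ~/daisychain_$baseSlurmJobName.
# Single quotes stop the shell from expanding $baseSlurmJobName here --
# the variable should appear literally in the script and only be
# expanded later, when slurm actually runs it.
sed -i 's|~/\.daisychain|~/daisychain_$baseSlurmJobName|g' edited.slurm
```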
2.
To run the script on a system of your own that you've set up as shown in this post, change
loginNode="login1"
to
loginNode="localhost"
If you're using stampede.TACC, stick to login1.
3. For Gaussian jobs on stampede.TACC
A.
put
module load gaussian
before
if [ "$thisJobNumber" -eq "1" ]; then
B.
Set up your restart job scripts. For example, if the job section of your slurm script looks like this

mkdir $SCRATCH/gaussian_tmp
export GAUSS_SCRDIR=$SCRATCH/gaussian_tmp
if [ "$thisJobNumber" -eq "1" ]; then
        #first job
        echo "Starting First Job:"
        g09 < freq.g09in > output_$thisJobNumber
else
        #continuation
        echo "Starting Continuation Job:"
        g09 < freq_restart.g09in > output_$thisJobNumber
fi

with freq.g09in being something along the lines of

%nprocshared=16
%rwf=/scratch/0XXXX/XXXX/gaussian_tmp/ajob.rwf
%Mem=2000000000
%Chk=/home1/0XXX/XXXX/myjob/ajob.chk
#P rpbe1pbe/GEN 5D Freq() SCF=(MaxCycle=256) Punch=(MO) Pop=()

(Note that the above example is a bit special since it 1) saves the .rwf file (which is huge) and 2) restarts a frequency job. For a simple geometry optimisation it's enough to restart from the .chk file.) freq_restart.g09in:

%nprocshared=16
%Mem=2000000000
%rwf=/scratch/0XXX/XXXX/gaussian_tmp/ajob.rwf
%Chk=/home1/0XXXX/XXXX/myjob/ajob.chk
#P restart
Testing at home
I set up a home system with slurm as shown here: http://verahill.blogspot.com.au/2014/03/565-setting-up-slurm-on-debian-wheezy.html
First edit the daisychain.slurm script as shown above. Note that your job script must end in .slurm for daisychain to recognise it as a slurm script; you can get around this by editing daisychain and specifying the job script name explicitly.
Specifically, change the run time to

#SBATCH -t 00:00:10          # Run time (hh:mm:ss)

comment out the partition name

##SBATCH -p normal

and change the job section to

#-------------------Job Goes Here--------------------------
if [ "$thisJobNumber" -eq "1" ]; then
        echo "Starting First Job:"
        sh sleeptest.sh
else
        echo "Starting Continuation Job:"
        sh sleeptest_2.sh
fi
#----------------------------------------------------------
Next, set up key-based login for localhost (if you haven't got a keypair, generate one with ssh-keygen first):

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh localhost
exit
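It's worth confirming that the login really is non-interactive before submitting anything; a quick sanity check:

```shell
# BatchMode=yes makes ssh fail rather than prompt for a password,
# so the OK branch is only reached if key-based login works.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true; then
    echo "key-based login OK"
else
    echo "key-based login not set up yet"
fi
```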
Create two job files. sleeptest.sh:

echo "first job"
date
sleep 65
date

and sleeptest_2.sh:

echo "second job"
date
sleep 9
echo "Do nothing"
Submit using
sbatch test.slurm
Make sure to change

#SBATCH -J testx           # Job name

for each job so that you can have several running concurrently.
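Rather than editing the -J line in a copy of the script for every job, you can also set the job name at submission time: sbatch's -J/--job-name option on the command line takes precedence over the #SBATCH directive in the script. A sketch, assuming your edited script is called edited.slurm (job names test1/test2 are placeholders):

```shell
# Submit the same script twice under different job names;
# the command-line -J overrides the #SBATCH -J line in the file.
sbatch -J test1 edited.slurm
sbatch -J test2 edited.slurm
```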