20 March 2014

567. Testing daisychain slurm script

I'm using stampede.TACC for jobs that need significantly longer than 48 hours to run. Luckily, John Fonner at the Texas Advanced Computing Center has been kind enough to prepare a SLURM script that circumvents the limit by daisychaining jobs.

Note that you should under no circumstances do this unless you've been specifically allowed to do so by your cluster manager. 

If you get clearance, submit the script and it will run in the background, resubmitting itself until the job is done.


To get the daisychain script, do

mkdir ~/tmp
cd ~/tmp
git clone https://github.com/johnfonner/daisychain.git

This will pull the latest version of daisychain.slurm. Rename the copy to e.g. edited.slurm.

General editing of the slurm script:

1.
 Replace all instances of
 
~/.daisychain

with
 
~/daisychain_$baseSlurmJobName

to avoid conflicts when several jobs are running concurrently

2.
To run the script on your own system which you've set up like shown in this post, change

loginNode="login1"

to

loginNode="localhost"

If you're using stampede.TACC, stick to login1.  
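Both edits above can be scripted with sed. Here is a minimal sketch, demonstrated on a throwaway sample file so the effect is visible; run the same sed command on your edited.slurm:

```shell
# sample file containing both patterns the two edits target
printf 'lockDir=~/.daisychain\nloginNode="login1"\n' > sample.slurm
# replace ~/.daisychain with ~/daisychain_$baseSlurmJobName (the variable
# stays literal inside single quotes, so the slurm script expands it at
# run time), and swap login1 for localhost
sed -i -e 's|~/\.daisychain|~/daisychain_$baseSlurmJobName|g' \
       -e 's/loginNode="login1"/loginNode="localhost"/' sample.slurm
cat sample.slurm
```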

3. For Gaussian jobs on stampede.TACC
A.
put

module load gaussian

before

if [ "$thisJobNumber" -eq "1" ]; then

B.
Set up your restart job scripts. For example, if the job section of your slurm script looks like this
mkdir $SCRATCH/gaussian_tmp
export GAUSS_SCRDIR=$SCRATCH/gaussian_tmp

if [ "$thisJobNumber" -eq "1" ]; then
  #first job
  echo "Starting First Job:"
  g09 < freq.g09in > output_$thisJobNumber
else
  #continuation
  echo "Starting Continuation Job:"
  g09 < freq_restart.g09in > output_$thisJobNumber
fi
with freq.g09in looking like
%nprocshared=16
%rwf=/scratch/0XXXX/XXXX/gaussian_tmp/ajob.rwf
%Mem=2000000000
%Chk=/home1/0XXX/XXXX/myjob/ajob.chk
#P rpbe1pbe/GEN 5D Freq() SCF=(MaxCycle=256 ) Punch=(MO) Pop=()
and freq_restart.g09in being something along the lines of
%nprocshared=16
%Mem=2000000000
%rwf=/scratch/0XXX/XXXX/gaussian_tmp/ajob.rwf
%Chk=/home1/0XXXX/XXXX/myjob/ajob.chk
#P restart
(Note that the above example is a bit special since it 1) saves the .rwf file (which is huge) and 2) restarts a frequency job. For a simple geoopt job it's enough to restart from the .chk file.)
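For reference, a restart input for a plain geometry optimisation might look something like the sketch below; the %Chk path is the placeholder from the example above, and Opt=Restart tells g09 to pick the optimisation up from the checkpoint file (check the Gaussian manual for the exact behaviour with your route):

```
%nprocshared=16
%Mem=2000000000
%Chk=/home1/0XXXX/XXXX/myjob/ajob.chk
#P Opt=Restart
```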

Testing at home
I set up a home system with slurm as shown here: http://verahill.blogspot.com.au/2014/03/565-setting-up-slurm-on-debian-wheezy.html

First edit the daisychain.slurm script as shown above. Note that your slurm script must end with .slurm for the script to recognise it as a slurm script. You can get around this by editing your script and specifying a job script name.

Specifically, change the run time to
#SBATCH -t 00:00:10 # Run time (hh:mm:ss)
comment out the partition name
##SBATCH -p normal
and change the job section to
#-------------------Job Goes Here--------------------------
if [ "$thisJobNumber" -eq "1" ]; then
  echo "Starting First Job:"
  sh sleeptest.sh
else
  echo "Starting Continuation Job:"
  sh sleeptest_2.sh
fi
#----------------------------------------------------------

Next, set up key-based login to localhost (if you haven't got a keypair, generate one with ssh-keygen first):

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh localhost
  exit
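If you don't have a keypair at all, the whole setup can be sketched as follows; -N "" gives a passwordless key, which the daisychain script needs in order to resubmit jobs non-interactively:

```shell
# create ~/.ssh and a passwordless RSA keypair, but only if none exists yet
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# authorise the key for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

Afterwards, ssh localhost should log you in without a password prompt.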

Create two job files. sleeptest.sh:

echo "first job"
date
sleep 65
date

and sleeptest_2.sh:

echo "second job"
date
sleep 9
echo "Do nothing"
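If you prefer, both test scripts can be created in one go with here-documents:

```shell
# write the two test job scripts used by the daisychain test
cat > sleeptest.sh <<'EOF'
echo "first job"
date
sleep 65
date
EOF

cat > sleeptest_2.sh <<'EOF'
echo "second job"
date
sleep 9
echo "Do nothing"
EOF
```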

Submit using
sbatch edited.slurm

Make sure to change
#SBATCH -J testx          # Job name
for each job so that you can have several running concurrently.
