28 July 2015

618. Modifying ECCE to work with slurm

UPDATE: ecce stops monitoring the job after 10-20 seconds. The job continues fine though. Working on fixing the monitoring issue. This message will be removed once that's fixed. It was due to $q needing to be lowercase (i.e. 'slurm', not 'Slurm') in eccejobmonitor.

Sun Gridengine has been removed from debian jessie (it's in wheezy and sid). This has given me a good excuse to explore setting up SLURM on my debian cluster. So I did: http://verahill.blogspot.com.au/2015/07/617-slurm-on-debian-jessie-and.html

My setup is very simple, with each node having it's own working directory that they export via NFS to the main node. Also, I never run jobs across several nodes. Because of that, each node has it's own queue. Not how beowulf clusters were supposed to work, but it's the best solution for me (e.g. ROCKS does the opposite -- exports the user dir from the main node, but that makes reading and writing slow where it counts i.e. on the work nodes).

I've currently got this slurm.conf:
ControlMachine=beryllium ControlAddr= MpiDefault=none ProctrackType=proctrack/pgid ReturnToService=2 SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid SlurmdSpoolDir=/var/lib/slurm/slurmd SlurmUser=slurm StateSaveLocation=/var/lib/slurm/slurmctld SwitchType=switch/none TaskPlugin=task/none FastSchedule=1 SchedulerType=sched/backfill SelectType=select/linear AccountingStorageType=accounting_storage/filetxt AccountingStorageLoc=/var/log/slurm/accounting ClusterName=rupert JobAcctGatherType=jobacct_gather/none SlurmctldLogFile=/var/log/slurm/slurmctld.log SlurmdLogFile=/var/log/slurm/slurmd.log NodeName=beryllium NodeAddr= NodeName=neon NodeAddr= state=unknown NodeName=tantalum NodeAddr= state=unknown NodeName=magnesium NodeAddr= state=unknown NodeName=carbon NodeAddr= state=unknown NodeName=oxygen NodeAddr= state=unknown PartitionName=All Nodes=neon,beryllium,tantalum,oxygen,magnesium,carbon default=yes maxtime=infinite state=up PartitionName=mpi4 Nodes=tantalum maxtime=infinite state=up PartitionName=mpi12 Nodes=carbon maxtime=infinite state=up PartitionName=mpi8 Nodes=neon maxtime=infinite state=up PartitionName=mpix8 Nodes=oxygen maxtime=infinite state=up PartitionName=mpix12 Nodes=magnesium maxtime=infinite state=up PartitionName=mpi1 Nodes=beryllium maxtime=infinite state=up
and sinfo returns
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST All* up infinite 6 idle beryllium,carbon,magnesium,neon,oxygen,tantalum mpi4 up infinite 1 idle tantalum mpi12 up infinite 1 idle carbon mpi8 up infinite 1 idle neon mpix8 up infinite 1 idle oxygen mpix12 up infinite 1 idle magnesium mpi1 up infinite 1 idle beryllium

The first step was to figure out what files to edit:

grep -rs "qsub"
apps/siteconfig/QueueManagers:SGE|submitCommand: qsub ##script##
grep -rs "SGE"
apps/scripts/eccejobmonitor: &MsgSendUp("SGE job id '$id' in state '$state'"); [..] apps/siteconfig/Queues:magnesium|queueMgrName: SGE

Here are the files I edited:

12 QueueManagers: LoadLeveler \ 13 Maui \ 14 EASY \ 15 PBS \ 16 LSF \ 17 Moab \ 18 SGE \ 19 Shell\ 20 Slurm
185 Shell|jobIdParseExpression: \ [0-9]+ 186 187 ############################################################################### 188 # SLURM 189 # Simple Linux Utility for Resource Management 190 # 191 # 192 Slurm|submitCommand: sbatch ##script## 193 Slurm|cancelCommand: scancel ##id## 194 Slurm|queryJobCommand: squeue 195 Slurm|queryMachineCommand: sinfo -p ##queue## 196 Slurm|queryQueueCommand: squeue -a 197 Slurm|queryDiskUsageCommand: df -k 198 Slurm|jobIdParseExpression: .* 199 Slurm|jobIdParseLeadingText: job
2124 LogMsg "Globus status from eccejobstore: $state"; 2125 } 2126 elsif ($q eq 'slurm') 2127 { 2128 $cmd = "squeue 2>&1"; 2129 if (open(QUERY, "$cmd |")) 2130 { 2131 $gotState = 0; 2132 while () 2133 { 2134 LogMsg "JobCheck: Slurm qstat line: $_"; 2135 if (/^\s*$id/) 2136 { 2137 my $state = (split)[5]; 2138 2139 &MsgSendUp("Slurm job id '$id' in state '$state'"); 2140 2141 if (grep {$state eq $_} qw{R 2142 t}) 2143 { 2144 $status = $JOB_STATE_RUNNING; 2145 } 2146 elsif (grep {$state eq $_} qw{PD}) 2147 { 2148 $status = $JOB_STATE_PENDING; 2149 } 2150 $gotState = 1; 2151 last; 2152 } 2153 } 2154 if ($gotState == 0) 2155 { 2156 if ($gJobCheckState != $JOB_STATE_NONE) 2157 { 2158 $status = $JOB_STATE_DONE; 2159 } 2160 } 2161 close QUERY;

Next set up a new machine (or queue) using ecce -admin. Set up a queue -- you won't be able to select Slurm, so select e.g. PBS. Edit the apps/siteconfig/CONFIG.machinename file to e.g.
1 NWChem: /opt/nwchem/Nwchem/bin/LINUX64/nwchem 2 Gaussian-03: /opt/gaussian/g09d/g09/g09 3 perlPath: /usr/bin/ 4 qmgrPath: /usr/bin/ 5 xappsPath: /usr/bin/ 6 7 Slurm { 8 #!/bin/csh 9 #SBATCH -p mpi8 10 #SBATCH --time=$walltime 11 #SBATCH --output=slurm.out 12 #SBATCH --job-name=$submitFile 13 } 14 15 NWChemEnvironment { 16 PYTHONPATH /opt/nwchem/Nwchem/contrib/python 17 } 18 19 NWChemFilesToRemove{ core *.aoints.* *.gridpts.* } 20 21 NWChemCommand { 22 setenv PATH "/bin:/usr/bin:/sbin:/usr/sbin" 23 setenv LD_LIBRARY_PATH "/usr/lib/openmpi/lib:/opt/openblas/lib:/opt/acml/acml5.3.1/gfortran64_fma4_int64/lib:/opt/acml/acml5.3.1/gfortran64_int64/lib:/opt/intel/mkl/lib/intel64" 24 hostname 25 mpirun -n $totalprocs /opt/nwchem/Nwchem/bin/LINUX64/nwchem $infile > $outfile 26 } 27 28 Gaussian-03FilesToRemove{ core *.rwf } 29 30 Gaussian-03Command{ 31 set path = ( /opt/nbo6/bin $path ) 32 setenv GAUSS_SCRDIR /home/me/scratch 33 setenv GAUSS_EXEDIR /opt/gaussian/g09d/g09/bsd:/opt/gaussian/g09d/g09/local:/opt/gaussian/g09d/g09/extras:/opt/gaussian/g09d/g09 34 /opt/gaussian/g09d/g09/g09< $infile > $outfile 35 echo 0 36 } 37 38 Wrapup{ 39 dmesg|tail 40 find ~/scratch/* -name "*" -user me|xargs -I {} rm {} -rf 41 }
Next, edit apps/siteconfig/Queues -- in my case the machine I created is called neon-slurm:
neon-slurm|queueMgrName: Slurm neon-slurm|queueMgrVersion: 2.0~ neon-slurm|prefFile: neon-slurm.Q
And that's all. You should now be able to submit jobs via slurm. There's obviously a lot more than can be done and configured with SLURM, but this was enough to get me up and running, so that I'm now 'future-proofed' in case SGE never comes back into debian stable.

And here's what it looks like when my ecce-submitted jobs are running:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 30 mpi8 In_monom me PD 0:00 1 (Resources) 29 mpi8 b_monome me R 16:12 1 neon 31 mpix12 tl_dimer me R 34:28 1 magnesium

617. SLURM on debian jessie (and compiling Jürgen Rinas' sinfo)

Two issues:
* Sun GridEngine (now Oracle GridEngine) is missing from Debian Jessie. I need a queue manager for my cluster. For now the wheezy package runs fine in debian jessie, but I'd be happier with a supported solution. SLURM is a good alternative here, and I've used it at the TACC.

* SLURM conflicts with Jürgen Rinas' sinfo package, which I use to keep an eye on my cluster. Until this has been resolved, I'll compile and use my own version of sinfo -- basically, I'll rename sinfo and sinfod to sinfo_jr and sinfod_jr. I can't live without sinfo.

Compiling sinfo
mkdir ~/tmp/sinfo -p
cd ~/tmp/sinfo
sudo apt-get install build-essential
sudo apt-get autoremove sinfo 
apt-get source sinfo
cd sinfo-0.0.47/
vim debian/rules 

16 dh_auto_configure -- --enable-SIMPLE_USER_CACHE --enable-CPUNO_ADJUST 21 rm $(CURDIR)/debian/sinfo/usr/bin/sshallsinfo 22 rm $(CURDIR)/debian/sinfo/usr/share/man/man1/sshallsinfo.1 25 rm $(CURDIR)/debian/sinfo/usr/lib/*/sinfo/*.la 37 chmod 755 $(CURDIR)/debian/sinfo/usr/share/sinfo/sinfo.pl.cgi
16 dh_auto_configure -- --enable-SIMPLE_USER_CACHE --enable-CPUNO_ADJUST --program-suffix=_jr 21 rm $(CURDIR)/debian/sinfojr/usr/bin/sshallsinfo_jr 22 rm $(CURDIR)/debian/sinfojr/usr/share/man/man1/sshallsinfo_jr.1 25 rm $(CURDIR)/debian/sinfojr/usr/lib/*/sinfo/*.la 37 chmod 755 $(CURDIR)/cgi/sinfo.pl.cgi
That's jr for Jürgen Rinas.

Then edit debian/control and change
12 Package: sinfo
15 Conflicts: slurm-client, slurm-llnl (<< 14.03.8-1)
12 Package: sinfojr
15 Conflicts: 
dpkg-buildpackage -us -uc
cd ../
sudo dpkg -i sinfo_0.0.47-3_amd64.deb

I launch sinfodjr at boot by putting the following in /etc/rc.local:
su verahill -c '/usr/sbin/sinfodjr --bcast' &

I had a look at this post: https://paolobertasi.wordpress.com/2011/05/24/how-to-install-slurm-on-debian/

It looked to easy to be true.

Here's what I ended up doing:

On the MASTER node:
sudo apt-get install slurm-wlm slurmctld slurmd
[..] Generating a pseudo-random key using /dev/urandom completed. Please refer to /usr/share/doc/munge/README.Debian for instructions to generate more secure key. Setting up slurm-client (14.03.9-5) ... Setting up slurm-wlm-basic-plugins (14.03.9-5) ... Setting up slurmd (14.03.9-5) ... Setting up slurmctld (14.03.9-5) ... Setting up slurm-wlm (14.03.9-5) ... [..]
open file:///usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html
# slurm.conf file generated by configurator easy.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more information. # ControlMachine=beryllium ControlAddr= # #MailProg=/bin/mail MpiDefault=none #MpiParams=ports=#-# ProctrackType=proctrack/pgid ReturnToService=2 SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid #SlurmctldPort=6817 SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid #SlurmdPort=6818 SlurmdSpoolDir=/var/lib/slurm/slurmd SlurmUser=slurm #SlurmdUser=root StateSaveLocation=/var/lib/slurm/slurmctld SwitchType=switch/none TaskPlugin=task/none # # # TIMERS #KillWait=30 #MinJobAge=300 #SlurmctldTimeout=120 #SlurmdTimeout=300 # # # SCHEDULING FastSchedule=1 SchedulerType=sched/backfill #SchedulerPort=7321 SelectType=select/linear # # # LOGGING AND ACCOUNTING AccountingStorageType=accounting_storage/none ClusterName=rupert #JobAcctGatherFrequency=30 JobAcctGatherType=jobacct_gather/none #SlurmctldDebug=3 SlurmctldLogFile=/var/log/slurm/slurmctld.log #SlurmdDebug=3 SlurmdLogFile=/var/log/slurm/slurmd.log # # # COMPUTE NODES NodeName=beryllium NodeAddr= NodeName=neon NodeAddr= PartitionName=All Nodes=beryllium,neon
Copy the above block to /etc/slurm-llnl/slurm.conf

Note the lack of spaces between beryllium and neon in the Nodes= directive.
scontrol show daemons
sudo /usr/sbin/create-munge-key
The munge key /etc/munge/munge.key already exists Do you want to overwrite it? (y/N) y Generating a pseudo-random key using /dev/urandom completed.
sudo systemctl enable slurmctld.service
sudo ln -s /var/lib/slurm-llnl /var/lib/slurm
sudo systemctl start slurmctld.service
sudo systemctl status slurmctld.service 
● slurmctld.service - Slurm controller daemon Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled) Active: active (running) since Tue 2015-07-21 11:16:18 AEST; 40s ago Process: 19958 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS) Main PID: 19960 (slurmctld) CGroup: /system.slice/slurmctld.service └─19960 /usr/sbin/slurmctld
sudo systemctl status munge.service
● munge.service - MUNGE authentication service Loaded: loaded (/lib/systemd/system/munge.service; disabled) Active: active (running) since Wed 2015-07-08 00:11:18 AEST; 1 weeks 6 days ago Docs: man:munged(8) Main PID: 25986 (munged) CGroup: /system.slice/munge.service └─25986 /usr/sbin/munged
Also, add yourself to the group slurm and chmod g+r /var/log/slurm/accounting.

On neon (and later on each node): 
Install slurmd and slurm-client as shown below, then copy the /etc/munge/munge.key from the master node to the execute node. Do the same with /etc/slurm-llnl/slurm.conf. Then enable and restart the services.

sudo apt-get install slurmd slurm-client
sudo ln -s /var/lib/slurm-llnl /var/lib/slurm
sudo systemctl enable slurmd.service
sudo systemctl restart slurmd.service
sudo systemctl enable munge.service
sudo systemctl restart munge.service
sudo systemctl status slurmd.service

On the main host (beryllium) I checked that everything was well:

All          up   infinite      1  idle* beryllium, neon

26 July 2015

616. SIESTA on debian jessie with debian blacs, scalapack and SIESTA blas

Three posts showing slightly different ways of building SIESTA on debian may seem a bit excessive, but I figured I'd do a post on a simple 'bullet-proof' way of building SIESTA on debian jessie.

For ACML, see http://verahill.blogspot.com.au/2015/07/614-siesta-with-mpi-on-debian-jessie.html
For MKL (not all MKL versions work using that post): http://verahill.blogspot.com.au/2015/07/615-siesta-on-debian-jessie-with-intel.html

See those posts for detailed build instructions. I'll only give you the arch.make here -- for the rest, see either of the above posts.

I've got a node with intel mkl 2013.sp1.3.174 which hasn't got the blacs openmpi libs, so I ended up building SIESTA with the SIESTA BLAS and LAPACK libraries, and using the debian BLACS and SCALAPACK libs.

Install the libs with
sudo apt-get install libblacs-openmpi1 libopenmpi-dev libscalapack-openmpi1 libblacs-mpi-dev libscalapack-mpi-dev

Here's the arch.make:
# # This file is part of the SIESTA package. # # Copyright (c) Fundacion General Universidad Autonoma de Madrid: # E.Artacho, J.Gale, A.Garcia, J.Junquera, P.Ordejon, D.Sanchez-Portal # and J.M.Soler, 1996- . # # Use of this software constitutes agreement with the full conditions # given in the SIESTA license, as signed by all legitimate users. # .SUFFIXES: .SUFFIXES: .f .F .o .a .f90 .F90 SIESTA_ARCH=x86_64-unknown-linux-gnu--unknown FPP= FPP_OUTPUT= FC=mpif90 RANLIB=ranlib SYS=nag SP_KIND=4 DP_KIND=8 KINDS=$(SP_KIND) $(DP_KIND) FFLAGS=-g -O2 FPPFLAGS= -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT LDFLAGS= ARFLAGS_EXTRA= FCFLAGS_fixed_f= FCFLAGS_free_f90= FPPFLAGS_fixed_F= FPPFLAGS_free_F90= BLAS_LIBS= LAPACK_LIBS= BLACS_LIBS=-L/usr/lib -lblacs-openmpi SCALAPACK_LIBS=-L/usr/lib -lscalapack-openmpi COMP_LIBS=dc_lapack.a liblapack.a libblas.a NETCDF_LIBS= NETCDF_INTERFACE= MPI_LIBS=-L/usr/lib/openmpi/lib -lmpi -lmpi_f90 -lmpi_f77 LIBS=$(SCALAPACK_LIBS) $(BLACS_LIBS) $(LAPACK_LIBS) $(BLAS_LIBS) $(NETCDF_LIBS) $(MPI_LIBS) -lpthread #SIESTA needs an F90 interface to MPI #This will give you SIESTA's own implementation #If your compiler vendor offers an alternative, you may change #to it here. MPI_INTERFACE=libmpi_f90.a MPI_INCLUDE=. #Dependency rules are created by autoconf according to whether #discrete preprocessing is necessary or not. .F.o: $(FC) -c $(FFLAGS) $(INCFLAGS) $(FPPFLAGS) $(FPPFLAGS_fixed_F) $< .F90.o: $(FC) -c $(FFLAGS) $(INCFLAGS) $(FPPFLAGS) $(FPPFLAGS_free_F90) $< .f.o: $(FC) -c $(FFLAGS) $(INCFLAGS) $(FCFLAGS_fixed_f) $< .f90.o: $(FC) -c $(FFLAGS) $(INCFLAGS) $(FCFLAGS_free_f90) $<
At some point in the future, when my nodes are free, I might do a bit of basic performance testing using the different versions.