I've spent the past few days trying to get to grips with Sun Grid Engine (SGE), but have given up for now. While it seems capable, it's overkill for my purposes, especially given how difficult it is just to configure. It's a bit like my experience with OpenDX: a very capable plotting program which I couldn't make work to my satisfaction, in spite of being one of the lucky few in possession of the "Open DX -- Paths to Visualisation" book.
Long story short -- I wrote a small script in Python. It
- reads a file, list, containing the names of shell scripts
- executes the shell scripts, job1.sh..jobn.sh, sequentially: when one script finishes, the next one is started
- allows jobs to be added to and removed from list during execution
It's a 'dumb' script -- it does not try to balance jobs across nodes or look for idle cpus/cores. It just executes one job after the other, and marks each job as done by putting a * in front of its line in list.
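The bookkeeping described above can be sketched roughly as follows. This is a Python 3 illustration rather than the script itself; the names take_next_job and run_queue, and the run callback, are made up for the example. Skipping blank lines is an addition of the sketch. As in the script, a job is marked just before it is handed to the runner, and the list is re-read before every job, which is what lets you edit it while the queue is running.

```python
DONE = "*"  # marker prepended to a job line once the job has been picked up


def take_next_job(list_path):
    """Mark the first unmarked line in the list file and return its job name.

    Returns "" when every line is already marked.
    """
    with open(list_path) as fh:
        lines = fh.readlines()

    job = ""
    for idx, line in enumerate(lines):
        # Blank lines are skipped here (an assumption of this sketch).
        if line.strip() and not line.startswith(DONE) and not job:
            job = line.strip()
            lines[idx] = DONE + line

    with open(list_path, "w") as fh:
        fh.writelines(lines)
    return job


def run_queue(list_path, run):
    """Call run(job) for each job, re-reading the list before every job."""
    finished = []
    while True:
        job = take_next_job(list_path)
        if not job:
            return finished
        run(job)
        finished.append(job)
```

Passing the runner in as a function keeps the list handling separate from how a job is actually executed, so it can be tried out with a dummy runner first.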
To test it, create a file called list and put the following lines in it:

pi40.sh
pi400.sh
pi2000.sh

The scripts are the following:

pi40.sh:
echo "pi to 40 decimals"
echo "scale=40; 4*a(1)" | bc -l -q
echo "done"

pi400.sh:
echo "scale=400; 4*a(1)" | bc -l -q

pi2000.sh:
echo "scale=2000; 4*a(1)" | bc -l -q

The Python code for vspqm.py is below.
I've aliased my vspqm (edit ~/.bashrc):

alias vspqm='/home/me/work/vspqm/vspqm.py'

I then sourced ~/.bashrc.
Launch it in the directory where you keep your list file using
me@beryllium:~/work/vspqm/jobs$ vspqm list > log &
[1] 23925
me@beryllium:~/work/vspqm/jobs$ cat log
pi to 40 decimals
3.1415926535897932384626433832795028841968
done
3.141592653589793238462643383279502884197169399375105820974944592307\
[..]
3.141592653589793238462643383279502884197169399375105820974944592307\
81640628620899862803482534211706798214808651328230664709384460955058\
[..]
An nwchem example would be
list:
ac.sh
bn.sh

ac.sh:
cd acetone/
mpirun -n 4 nwchem ac.nw>ac.out
cd ../

bn.sh:
cd benzene/
mpirun -n 4 nwchem bn.nw>bn.out
cd ../
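If you have many such directories, the list file and the job scripts can be generated instead of typed. Here is a minimal Python 3 sketch, assuming each job follows the same three-line pattern as ac.sh and bn.sh; the helper names write_job and write_queue and the nprocs/out_dir parameters are hypothetical, not part of vspqm.

```python
import os


def write_job(directory, name, nprocs=4, out_dir="."):
    """Write <name>.sh, which runs nwchem on <name>.nw inside <directory>."""
    script = (
        "cd {d}/\n"
        "mpirun -n {n} nwchem {name}.nw>{name}.out\n"
        "cd ../\n"
    ).format(d=directory, n=nprocs, name=name)
    path = os.path.join(out_dir, name + ".sh")
    with open(path, "w") as fh:
        fh.write(script)
    return path


def write_queue(jobs, out_dir="."):
    """Write one script per (directory, name) pair, plus the list file."""
    names = []
    for directory, name in jobs:
        write_job(directory, name, out_dir=out_dir)
        names.append(name + ".sh")
    with open(os.path.join(out_dir, "list"), "w") as fh:
        fh.write("\n".join(names) + "\n")
    return names
```

For the two jobs above this would be called as write_queue([("acetone", "ac"), ("benzene", "bn")]).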
Our Python queue manager (which we'll call vspqm.py and chmod +x to make executable) is below. Don't forget to change #!/usr/bin/python2.4 if necessary -- I use 2.4 on ROCKS and 2.7 on Debian testing/wheezy.
#!/usr/bin/python2.4
# rudimentary queue manager. Handles a single node,
# submitting a series of jobs in sequence. use python v2.4-2.7
import os
import time
import sys

infile=sys.argv[1]
print "pyqm v 0.0.3"

def launchjob(job):
    # run one job script with sh; a non-zero return value means failure
    i=0
    print "######"
    job=job.rstrip('\n')
    i=os.system("sh "+job)
    if i==0:
        print "Job successful"
    else:
        print "Job failed"
    print "######"
    return i

def remake_list(infile):
    # copy the updated .bak file back over the list file
    qfile=open(infile,"w")
    bakfile=open(infile+".bak",'r')
    for i in bakfile:
        qfile.write(i)
    qfile.close()
    bakfile.close()
    return 0

def rewind(infile):
    # strip the leading "*" from each marked line so the list can be reused
    qfile=open(infile,"w")
    bakfile=open(infile+".bak",'r')
    for i in bakfile:
        if i[0]=="*":
            i=i[1:]
        qfile.write(i)
    qfile.close()
    bakfile.close()
    return 0

def get_next_job(infile):
    # find the first unmarked line, mark it with "*" and return it;
    # the updated list is written to the .bak file
    qfile=open(infile,"r")
    bakfile=open(infile+".bak",'w')
    lines=""
    job=""
    for line in qfile:
        if line[0]=="*":
            print "Marked as done: ",line[1:]
        if line[0]!="*" and job=="":
            print "Launching: ", line
            job=line
            line="*"+line
        lines+=line
    bakfile.write(lines)
    qfile.close()
    bakfile.close()
    return job

def main(infile):
    # keep fetching and running jobs until no unmarked line is left
    jobs=1
    while (jobs==1):
        newjob=get_next_job(infile)
        remake_list(infile)
        if newjob!="":
            jobs=1
            echojob=launchjob(newjob)
        else:
            print "No more jobs found at "+str(time.asctime())
            jobs=0
    return 0

if __name__ == "__main__":
    main(infile)
    rewind(infile)
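If you are on a newer Python, launchjob could be written with subprocess instead of os.system. The following is only a Python 3 sketch of that idea, not a drop-in part of the script above; the interpreter parameter is a hypothetical addition so the function is not tied to sh. On Unix, os.system returns the encoded wait status rather than the plain exit code, whereas subprocess.run exposes the exit code directly as returncode.

```python
import subprocess


def launch_job(job, interpreter=("sh",)):
    """Run one job script and return its exit code (0 means success)."""
    job = job.strip()
    # An argument list replaces the string "sh " + job, so a file name
    # containing spaces is passed through as a single argument.
    result = subprocess.run(list(interpreter) + [job])
    if result.returncode == 0:
        print("Job successful")
    else:
        print("Job failed with exit code", result.returncode)
    return result.returncode
```

Called as launch_job("pi40.sh") it runs sh pi40.sh in the current directory, just like the original.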