A good resource for SGE-related questions is http://rous.mit.edu/index.php/SGE_Instructions_and_Tips#Submitting_jobs_to_specific_queues
Either way, first figure out what node the job ran on. Assuming that the job number was 445:
qacct -j 445 | grep hostname
hostname     compute-0-6.local
Next figure out the PID, as this is used to name the Gau-[PID].rwf file:
grep PID g03.g03out
Entering Link 1 = /share/apps/gaussian/g09/l1.exe PID= 24286.
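If you do this often, the two lookups can be rolled into a small shell snippet. This is only a sketch using the example values above (job number 445, output file g03.g03out); adjust them for your own job:

# pull the node name out of the SGE accounting record
node=$(qacct -j 445 | awk '/hostname/ {print $2}')
# pull the PID out of the Gaussian output (strips spaces and the trailing period)
pid=$(awk -F'PID=' '/PID=/ {gsub(/[ .]/,"",$2); print $2}' g03.g03out)
echo "Job ran on $node; scratch files are named Gau-$pid.*"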
You can now craft your restart file, g09_freq.restart -- you'll need to make sure that the paths are appropriate for your system:
%nprocshared=8
%Mem=900000000
%rwf=/scratch/Gau-24286.rwf
%Chk=/home/me/jobs/testing/delta_631gplusstar-freq/delta_631gplusstar-freq.chk
#P restart

(having empty lines at the end of the file is important) and a qsub file, g09_freq.qsub:

#$ -S /bin/sh
#$ -cwd
#$ -l h_rt=999:30:00
#$ -l h_vmem=8G
#$ -j y
#$ -pe orte 8
export GAUSS_SCRDIR=/tmp
export GAUSS_EXEDIR=/share/apps/gaussian/g09/bsd:/share/apps/gaussian/g09/local:/share/apps/gaussian/g09/extras:/share/apps/gaussian/g09
/share/apps/gaussian/g09/g09 g09_freq.restart > g09_freq.out

Then submit it to the correct queue by doing
qsub -q all.q@compute-0-6.local g09_freq.qsub
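If you want to confirm that the queue instance on that node exists and is not in an error state before submitting, something like the following should work on most SGE installations (check the qstat man page if the syntax differs on yours):

qstat -f -q all.q@compute-0-6.local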
The output goes to g09_freq.log. You know the restart worked properly if it says

Skip MakeAB in pass 1 during restart.

and

Resume CPHF with iteration 214.
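A quick way of checking is to grep the output (g09_freq.log here, or g09_freq.out if you redirected stdout as in the qsub file above) for those messages:

grep -E 'Skip MakeAB|Resume CPHF' g09_freq.log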
Note that restarting analytical frequency jobs in g09 can be a hit-and-miss affair. Jobs that run out of time are easy to restart, and some jobs that die silently have also been restarted successfully. On the other hand, a job that died because my resource allocations ran out couldn't be restarted, i.e. the restart started the freq job from scratch. The same happened on a node of mine that seems to have a dodgy PSU. Finally, I also couldn't restart jobs that died silently after all the RAM had been allocated to g09 without leaving any for the OS (or at least that's the current best theory). It may thus be a good idea to back up the rwf file every now and again, in spite of its unwieldy size.
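As a rough sketch of such a backup (the rwf path and destination below are the hypothetical example paths used above; adjust them for your own system), you could run a background loop on the node, e.g. from the qsub script just before the g09 line:

# copy the rwf to a persistent location every six hours while the job runs
( while true; do
    cp /scratch/Gau-24286.rwf /home/me/jobs/testing/delta_631gplusstar-freq/
    sleep 21600
  done ) &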
Thanks for posting this - I've been trying to restart analytic frequency jobs, however I run into a problem where the restarted job looks for a new .inp file with a different name from the first job's (e.g., Gau-1112.inp vs Gau-1111.inp). The run immediately quits because it can't find it. I haven't been able to find any way to change the name of the .inp file so that it looks for the same filename upon restarting. Do you know how to fix this?
Hi Jonathan, did you ever find a solution for this problem? I've just run into this exact problem myself and have only found your comment on the internet regarding this!
Thanks,
Rory
Could you let me know more details about the issue? I might be able to help you troubleshoot what's going on.
When you run a Gaussian job you get two series of files in the scratch directory -- one Gau-1111.inp, and a bunch of files beginning with Gau-1112 (e.g. Gau-1112.rwf). The PID will be 1112.
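If in doubt, you can simply list the scratch directory on the node while the job is running and read the number off the rwf file (assuming /scratch is the scratch directory, as in the post above):

ls -l /scratch/Gau-*
# the number on the large .rwf file (e.g. Gau-1112.rwf) is the PID to use in %rwf=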
Either way, it shouldn't cause an issue -- you're creating a new input file, telling it to use the .rwf file only, and you're re-submitting your job to the same node as you ran the original job on. It should work.
I don't have a good answer as to why it doesn't -- I really only load the .rwf file, and nothing else should matter.
Thanks - is it crucial that it be resubmitted on the same node? This is not possible for me, but I'm not sure I see why that's necessary. If I have the rwf and input file in the same directories, the node shouldn't matter, right?
Jonathan, if you have the rwf and chk files in your directory it shouldn't matter what node you're submitting to, but make sure that the paths are correct.
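For example (the paths are the hypothetical ones from the post above), a restart file pointing at local copies of the rwf and chk would look something like:

%nprocshared=8
%Mem=900000000
%rwf=/home/me/jobs/testing/delta_631gplusstar-freq/Gau-24286.rwf
%Chk=/home/me/jobs/testing/delta_631gplusstar-freq/delta_631gplusstar-freq.chk
#P restart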
Could you post your qsub and Gaussian restart file, together with a description of the cluster you're using? It might help, although no guarantees of course.