There's a cluster (running ROCKS with Sun Grid Engine) which I manage remotely but did not set up myself; the IT people at that uni did the initial configuration. For some reason they named the nodes
compute-0-0.local compute-0-1.local compute-0-2.local compute-0-3.local compute-0-6.local compute-0-7.local
Recently a few extra disks were added to the system, so all jobs were suspended. However, while installing the disks, the local IT person decided to change the node names without consulting us. The nodes were now called
compute-0-0.local compute-0-1.local compute-0-2.local compute-0-3.local compute-0-4.local compute-0-5.local
instead. Suddenly there were two node-queues with jobs in them, but with no corresponding nodes. Trying to delete the jobs in those queues only led to:
all.q@compute-0-5.local BIP 0/8/8 9.12 lx26-amd64
5142 0.55500 submit__v3 me r 02/27/2013 15:02:11 8
---------------------------------------------------------------------------------
all.q@compute-0-6.local BIP 0/8/8 -NA- lx26-amd64 auo
5074 0.55500 submit__nb me dr 02/02/2013 21:53:59 8
The Solution
It wasn't immediately obvious how to fix this, but it turned out to be simple:
qconf -cq all.q@compute-0-6.local
That cleans the leftover jobs out of the queue instance, and once no jobs are attached to it the orphaned queue disappears. That's all.
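If more than one queue instance is left behind like this, the names can be picked out of the qstat -f listing rather than typed by hand. Here is a minimal sketch, assuming the listing looks like the one above: queue lines start with queue@host and carry the state flags (such as auo) in the sixth column, with "o" marking an orphaned instance. The helper name orphaned_queues is made up for illustration, and the qconf call is left commented out because it changes the cluster and needs manager or operator rights.

```shell
# Print the names of queue instances whose state flags contain "o" (orphaned).
# Reads `qstat -f` style output on stdin.
# Assumption: queue lines look like
#   all.q@compute-0-6.local BIP 0/8/8 -NA- lx26-amd64 auo
# i.e. the first field contains "@" and the sixth field holds the states.
orphaned_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /o/ { print $1 }'
}

# Usage sketch (commented out on purpose; review the list before cleaning):
# qstat -f | orphaned_queues
# qstat -f | orphaned_queues | while read -r q; do qconf -cq "$q"; done
```

Running the first usage line on its own just prints the candidate queue names, which is a safer first step than piping straight into qconf.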