The issue:
Running qhost gives
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "beryllium": got send error
Beryllium is my host node, i.e. the one running the qmaster. The same thing happens with qstat and every other SGE command I tried.
The solution:
It's an obvious one -- just restart the services. I mean, it took me twenty minutes to re-remember that, but it should have been obvious. Most of the time, services are managed using scripts in /etc/init.d/ and that's the case here too. So, hanging my head in shame, here's the solution:
ls /etc/init.d/grid*
/etc/init.d/gridengine-exec /etc/init.d/gridengine-master
sudo service gridengine-master restart
qhost should now slowly be populated
Done.
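If you'd rather check before restarting, here is a minimal sketch of a probe for the qmaster port. The function name qmaster_reachable is made up for illustration. It assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are installed, and the default of 6444 is simply the port from the error message above.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
# Assumes bash and coreutils `timeout` are available; if either is missing the
# function fails, which reads the same as "unreachable".
qmaster_reachable() {
    qr_host="$1"
    qr_port="${2:-6444}"   # 6444 is the port named in the error message
    # bash treats /dev/tcp/HOST/PORT specially and attempts a TCP connect.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$qr_host" "$qr_port" 2>/dev/null
}

# Example use (commented out so that sourcing this file has no side effects):
# qmaster_reachable beryllium || sudo service gridengine-master restart
```

A "Connection refused" like the one above makes the probe fail immediately; the three-second timeout only matters when the host doesn't answer at all.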
How I got there:
ps aux|grep sge
sgeadmin 3173 0.0 0.0 56844 3428 ? Sl Aug20 6:29 /usr/lib/gridengine/sge_execd
tree /var/spool/gridengine -L 4 -d
/var/spool/gridengine
|-- execd
|   `-- beryllium
|       |-- active_jobs
|       |-- jobs
|       `-- job_scripts
|-- qmaster
|   `-- job_scripts
`-- spooldb
Looking at /var/spool/gridengine/execd/beryllium/messages
08/20/2012 10:47:17| main|beryllium|I|starting up GE 6.2u5 (lx26-amd64)
08/30/2012 15:06:57| main|beryllium|E|commlib error: got read error (closing "beryllium/qmaster/1")
08/30/2012 15:06:58| main|beryllium|W|can't register at qmaster "beryllium": abort qmaster registration due to communication errors
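The messages file uses pipe-separated fields, with the severity (I, W or E) in the fourth one, so the interesting lines are easy to pull out. A small sketch, with sge_problems being a name I made up:

```shell
# Hypothetical helper: print only the error (E) and warning (W) lines of an
# SGE messages file. Fields are separated by '|': timestamp, thread, host,
# severity, message.
sge_problems() {
    awk -F'|' '$4 == "E" || $4 == "W"' "$1"
}

# Example use (commented out so that sourcing this file has no side effects):
# sge_problems /var/spool/gridengine/execd/beryllium/messages
```

On the three lines shown above this would drop the "starting up" line and keep the other two.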
less /var/lib/gridengine/rupert/common/act_qmaster
beryllium
So that looks OK: act_qmaster names beryllium, as it should.
Oddly, there's nothing funny in /tmp -- no execd_messages.* files.
ps aux|grep sge
sgeadmin 3173 0.0 0.0 56844 3428 ? Sl Aug20 6:29 /usr/lib/gridengine/sge_execd
sudo kill 3173
start-stop-daemon --exec /usr/sbin/sge_execd --start --user sgeadmin
Which didn't seem to do anything.
start-stop-daemon --exec /usr/sbin/sge_qmaster --start --user sgeadmin
Which didn't seem to do anything either.
/usr/lib/gridengine/gethostname -aname
critical error: Please set the environment variable SGE_ROOT.
export SGE_ROOT=/var/lib/gridengine
/usr/lib/gridengine/gethostname -aname
beryllium
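To avoid tripping over the missing variable again, something like the following could go in a shell startup file. Which file to use is up to you (that part is my assumption); the path is the one used above.

```shell
# Set SGE_ROOT only if it is not already set, then export it.
# /var/lib/gridengine is the value used in this post; adjust it if your
# installation lives elsewhere.
: "${SGE_ROOT:=/var/lib/gridengine}"
export SGE_ROOT
```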
service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmasterrm: cannot remove `/var/run/gridengine/qmaster.pid': Permission denied
.
cat /var/run/gridengine/qmaster.pid
3198
ps aux|grep 3198
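This yields nothing, so the PID file is stale: it points at a process (3198) that no longer exists. The restart above was run without sudo, which is presumably why the init script couldn't remove the file itself.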
sudo rm /var/run/gridengine/qmaster.pid
sudo service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmaster.
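The cat-then-grep check for a stale PID file can be wrapped up in a small function. This is only a sketch: pid_file_is_stale is a hypothetical name, and the /proc fallback assumes Linux.

```shell
# Hypothetical helper. Exit status:
#   0  the file names a PID that is not running (stale)
#   1  the process is still alive
#   2  the file is unreadable or does not contain a plain number
pid_file_is_stale() {
    pf_file="$1"
    [ -r "$pf_file" ] || return 2
    pf_pid=$(tr -d '[:space:]' < "$pf_file")
    case "$pf_pid" in
        ''|*[!0-9]*) return 2 ;;
    esac
    # kill -0 sends no signal, it only tests whether the process exists.
    # It also fails for processes owned by another user (sgeadmin here),
    # so fall back to looking for /proc/PID (Linux-specific assumption).
    if kill -0 "$pf_pid" 2>/dev/null || [ -d "/proc/$pf_pid" ]; then
        return 1
    fi
    return 0
}

# Example use (commented out so that sourcing this file has no side effects):
# pid_file_is_stale /var/run/gridengine/qmaster.pid && sudo rm /var/run/gridengine/qmaster.pid
```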
ps aux|grep sge
sgeadmin 32178 2.5 0.0 69004 6112 ? Sl 09:40 0:00 /usr/lib/gridengine/sge_qmaster
qstat
job-ID  prior   name       user  state submit/start at     queue            slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    715 0.75000 submit__la me    r     08/22/2012 08:10:32 six.q@boron          6
    720 0.25194 submit__63 me    r     08/22/2012 11:15:02 four.q@tantalum      4
    716 0.74817 submit__la me    qw    08/22/2012 08:11:28                      6
    719 0.70429 submit__la me    qw    08/22/2012 08:38:17                      6
    721 0.25071 submit__63 me    qw    08/22/2012 11:15:35                      4
    722 0.25000 submit__32 me    qw    08/22/2012 11:16:01                      4
qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3     -    7.8G       -   14.9G       -
boron                   lx26-amd64      6  6.10    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0
sudo service gridengine-exec restart
Restarting Sun Grid Engine Execution Daemon: sge_execd.
qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3  0.22    7.8G    3.0G   14.9G  141.3M
boron                   lx26-amd64      6  6.09    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0