The issue:
On doing qhost I get
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "beryllium": got send error
Beryllium is my hostnode.
Same thing happens with qstat and any other imaginable SGE command.
The solution:
It's an obvious one -- just restart the services. I mean, it took me twenty minutes to re-remember that, but it should have been obvious. Most of the time, services are managed using scripts in /etc/init.d/ and that's the case here too. So, hanging my head in shame, here's the solution:
ls /etc/init.d/grid*
/etc/init.d/gridengine-exec /etc/init.d/gridengine-master
sudo service gridengine-master restart
qhost should now slowly be populated
Done.
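For anyone who wants to script the check, here is a minimal sketch in shell. It only looks for the two error strings quoted above in the output of an SGE command. looks_like_qmaster_down is a made-up helper name, not something that ships with SGE, and restarting automatically may not be what you want on a busy cluster.

```shell
#!/bin/sh
# Sketch: spot the "qmaster unreachable" symptom in the output of an SGE command.
# looks_like_qmaster_down is a hypothetical helper, not part of SGE.
looks_like_qmaster_down() {
    # $1: combined stdout/stderr of a command such as qhost or qstat
    case "$1" in
        *"commlib error"*|*"unable to send message to qmaster"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (only meaningful on a machine with SGE installed):
#   out=$(qhost 2>&1)
#   if looks_like_qmaster_down "$out"; then
#       sudo service gridengine-master restart
#   fi
```

The matching is deliberately crude: it says nothing about why the qmaster is unreachable, only that the symptom is the one described here.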
How I got there:
ps aux|grep sge
sgeadmin 3173 0.0 0.0 56844 3428 ? Sl Aug20 6:29 /usr/lib/gridengine/sge_execd
tree /var/spool/gridengine -L 4 -d
/var/spool/gridengine
|-- execd
|   `-- beryllium
|       |-- active_jobs
|       |-- jobs
|       `-- job_scripts
|-- qmaster
|   `-- job_scripts
`-- spooldb
Looking at /var/spool/gridengine/execd/beryllium/messages
08/20/2012 10:47:17| main|beryllium|I|starting up GE 6.2u5 (lx26-amd64)
08/30/2012 15:06:57| main|beryllium|E|commlib error: got read error (closing "beryllium/qmaster/1")
08/30/2012 15:06:58| main|beryllium|W|can't register at qmaster "beryllium": abort qmaster registration due to communication errors
less /var/lib/gridengine/rupert/common/act_qmaster
beryllium
So looks ok.
Oddly, there's nothing funny in /tmp -- no execd_messages.* files.
ps aux|grep sge
sgeadmin 3173 0.0 0.0 56844 3428 ? Sl Aug20 6:29 /usr/lib/gridengine/sge_execd
sudo kill 3173
start-stop-daemon --exec /usr/sbin/sge_execd --start --user sgeadmin
Which didn't seem to do anything.
start-stop-daemon --exec /usr/sbin/sge_qmaster --start --user sgeadmin
which doesn't seem to do anything either.
/usr/lib/gridengine/gethostname -aname
critical error: Please set the environment variable SGE_ROOT.
export SGE_ROOT=/var/lib/gridengine
/usr/lib/gridengine/gethostname -aname
beryllium
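To avoid tripping over this again, SGE_ROOT can be set defensively before calling any of the SGE helper binaries. A small sketch, assuming /var/lib/gridengine (the path used above) is right for your install. An SGE_ROOT that is already set is left alone.

```shell
#!/bin/sh
# Keep an existing SGE_ROOT if there is one; otherwise fall back to the
# path used in this post (an assumption that may not hold on other installs).
: "${SGE_ROOT:=/var/lib/gridengine}"
export SGE_ROOT
```

Putting these two lines in a shell profile would make the setting stick between sessions.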
service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmasterrm: cannot remove `/var/run/gridengine/qmaster.pid': Permission denied
.
cat /var/run/gridengine/qmaster.pid
3198
ps aux|grep 3198
yields nothing
sudo rm /var/run/gridengine/qmaster.pid
sudo service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmaster.
ps aux|grep sge
sgeadmin 32178 2.5 0.0 69004 6112 ? Sl 09:40 0:00 /usr/lib/gridengine/sge_qmaster
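The stale pid file is the kind of thing that can be checked for before restarting. Below is a rough sketch. pidfile_is_stale is a hypothetical helper name, and the /proc fallback assumes Linux. It treats a pid file as stale only when the file holds a plain number and no process with that PID can be found.

```shell
#!/bin/sh
# Sketch: decide whether a pid file points at a process that no longer exists.
# pidfile_is_stale is a hypothetical helper, not part of SGE.
pidfile_is_stale() {
    pidfile=$1
    # No file at all: nothing to clean up, so not "stale".
    [ -f "$pidfile" ] || return 1
    pid=$(cat "$pidfile" 2>/dev/null)
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number: don't guess
    esac
    # kill -0 sends no signal; it succeeds if we are allowed to signal the process.
    kill -0 "$pid" 2>/dev/null && return 1
    # kill -0 also fails for live processes owned by someone else,
    # so double-check via /proc (Linux-specific assumption).
    [ -d "/proc/$pid" ] && return 1
    return 0
}

# Usage, with the path from this post:
#   if pidfile_is_stale /var/run/gridengine/qmaster.pid; then
#       sudo rm /var/run/gridengine/qmaster.pid
#   fi
```

The check errs on the side of reporting "not stale" whenever it is unsure, since removing the pid file of a running qmaster would be worse than leaving a dead one in place.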
qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    715 0.75000 submit__la me           r     08/22/2012 08:10:32 six.q@boron                        6
    720 0.25194 submit__63 me           r     08/22/2012 11:15:02 four.q@tantalum                    4
    716 0.74817 submit__la me           qw    08/22/2012 08:11:28                                    6
    719 0.70429 submit__la me           qw    08/22/2012 08:38:17                                    6
    721 0.25071 submit__63 me           qw    08/22/2012 11:15:35                                    4
    722 0.25000 submit__32 me           qw    08/22/2012 11:16:01                                    4
qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3     -    7.8G       -   14.9G       -
boron                   lx26-amd64      6  6.10    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0
sudo service gridengine-exec restart
Restarting Sun Grid Engine Execution Daemon: sge_execd.
qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3  0.22    7.8G    3.0G   14.9G  141.3M
boron                   lx26-amd64      6  6.09    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0
Hi, I can't fix this error. I don't know what to do. I've followed your tips step by step, but the issue just does not get solved.
Could you help me?
Difficult to troubleshoot without more information, and I'm not much of an expert on SGE. What does
ps aux|grep sge
give?
Thanks for the answer! I've restarted the services, but it still does not work (I've tried a few times after restarting; ps aux|grep sge shows sge_execd and qmaster, but the problem is still there).
ps aux|grep sge
sgeadmin 1651 0.0 0.1 53072 1660 ? Sl 11:45 0:00 /usr/lib/gridengine/sge_execd
/etc/hosts.conf
127.0.0.1 localhost
172.25.80.144 prueba.borja #qmaster
172.25.80.140 clienteprueba1 #qclient
You say you get both sge_execd and sge_qmaster, but you only show sge_execd?
ps aux|grep sge
sgeadmin 3125 0.3 0.0 137024 4972 ? Sl Nov27 50:39 /usr/lib/gridengine/sge_qmaster
sgeadmin 3169 0.0 0.0 54796 1560 ? Sl Nov27 3:51 /usr/lib/gridengine/sge_execd
Last question: is this an issue that has suddenly appeared, or have you never managed to get SGE working? I'm asking since it could also be due to the way you set it up -- SGE was very temperamental during set-up, in particular when it comes to hostnames.
It has been solved! The problem was in the hostnames: SGE does not like . or numbers! I changed the hostname prueba.borja to pruebaborja and clienteprueba1 to clienteprueba, and the issue disappeared.
Thanks! Now I'm following your guide for setting up three nodes.
Congrats on your blog.
Thanks for reporting back!