31 August 2012

225. Sun GridEngine: commlib error: got select error (Connection refused)


The issue:
On doing qhost I get
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "beryllium": got send error
beryllium is my front node, i.e. the host that runs the qmaster.

Same thing happens with qstat and any other imaginable SGE command.

The solution:
It's an obvious one -- just restart the services. I mean, it took me twenty minutes to re-remember that, but it should have been obvious. Most of the time, services are managed using scripts in /etc/init.d/ and that's the case here too. So, hanging my head in shame, here's the solution:

ls /etc/init.d/grid*
/etc/init.d/gridengine-exec  /etc/init.d/gridengine-master

sudo service gridengine-master restart

qhost should now slowly be populated

Done.

How I got there:


 ps aux|grep sge
sgeadmin  3173  0.0  0.0  56844  3428 ?        Sl   Aug20   6:29 /usr/lib/gridengine/sge_execd

 tree /var/spool/gridengine -L 4 -d
/var/spool/gridengine
|-- execd
|   `-- beryllium
|       |-- active_jobs
|       |-- jobs
|       `-- job_scripts
|-- qmaster
|   `-- job_scripts
`-- spooldb

Looking at /var/spool/gridengine/execd/beryllium/messages

08/20/2012 10:47:17|  main|beryllium|I|starting up GE 6.2u5 (lx26-amd64)
08/30/2012 15:06:57|  main|beryllium|E|commlib error: got read error (closing "beryllium/qmaster/1")
08/30/2012 15:06:58|  main|beryllium|W|can't register at qmaster "beryllium": abort qmaster registration due to communication errors
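
The messages file gets long on a node that has been up for a while. Here is a minimal sketch for pulling out just the error and warning lines, assuming the pipe-separated layout shown above (date|thread|host|level|text). The function name sge_problems is my own invention and not part of SGE.

```shell
# Print only error (E) and warning (W) lines from an SGE messages file.
# Assumes the fourth pipe-separated field holds the level (I, W, E, ...).
sge_problems() {
    awk -F'|' '$4 == "E" || $4 == "W"' "$1"
}
```

Usage would be something like: sge_problems /var/spool/gridengine/execd/beryllium/messages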

less /var/lib/gridengine/rupert/common/act_qmaster
beryllium

So looks ok.

Oddly, there's nothing funny in /tmp -- no execd_messages.* files.

 ps aux|grep sge
sgeadmin  3173  0.0  0.0  56844  3428 ?        Sl   Aug20   6:29 /usr/lib/gridengine/sge_execd
sudo kill 3173


start-stop-daemon --exec /usr/sbin/sge_execd --start --user sgeadmin
That didn't seem to do anything.

start-stop-daemon --exec /usr/sbin/sge_qmaster --start --user sgeadmin
That didn't seem to do anything either.

/usr/lib/gridengine/gethostname -aname
critical error: Please set the environment variable SGE_ROOT.
export SGE_ROOT=/var/lib/gridengine
/usr/lib/gridengine/gethostname -aname
beryllium

 service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmasterrm: cannot remove `/var/run/gridengine/qmaster.pid': Permission denied
.
cat /var/run/gridengine/qmaster.pid
3198
ps aux|grep 3198
yields nothing
sudo rm  /var/run/gridengine/qmaster.pid

sudo service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmaster.
ps aux|grep sge
sgeadmin 32178  2.5  0.0  69004  6112 ?        Sl   09:40   0:00 /usr/lib/gridengine/sge_qmaster
qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    715 0.75000 submit__la me         r     08/22/2012 08:10:32 six.q@boron                        6        
    720 0.25194 submit__63 me         r     08/22/2012 11:15:02 four.q@tantalum                    4        
    716 0.74817 submit__la me         qw    08/22/2012 08:11:28                                    6        
    719 0.70429 submit__la me         qw    08/22/2012 08:38:17                                    6        
    721 0.25071 submit__63 me         qw    08/22/2012 11:15:35                                    4        
    722 0.25000 submit__32 me         qw    08/22/2012 11:16:01                                    4    

 qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3     -    7.8G       -   14.9G       -
boron                   lx26-amd64      6  6.10    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0

sudo service gridengine-exec restart
Restarting Sun Grid Engine Execution Daemon: sge_execd.

qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3  0.22    7.8G    3.0G   14.9G  141.3M
boron                   lx26-amd64      6  6.09    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0
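
What blocked the first restart attempt was a pid file pointing at a process that no longer existed. Below is a rough sketch of a check for that situation. It assumes a Linux /proc filesystem, and the function name pidfile_is_stale is made up for illustration. It only reports; removing the file is left to you.

```shell
# Return 0 if the pid file exists but no process with that PID is running.
# Linux-specific: relies on /proc/<pid> existing only while the process lives.
pidfile_is_stale() {
    pidfile=$1
    [ -f "$pidfile" ] || return 1          # no pid file at all: nothing stale
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;           # not a plain number: leave it alone
    esac
    [ ! -d "/proc/$pid" ]
}
```

For example: pidfile_is_stale /var/run/gridengine/qmaster.pid && sudo rm /var/run/gridengine/qmaster.pid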




22 August 2012

224. Disabling tracker-miner-fs

Yes, yes, I shouldn't have a full desktop install on a computational node, but the nodes serve as instant replacement desktops if something goes awry with my main desktop, and occasionally visitors get to use them to access the internet in order to avoid getting bored.

Anyway, tracker-miner-fs is eating up 26% of my 8 GB RAM on one of my nodes running KDE, and I really don't need it. I mean, I don't know if it's useful to most people running a full DE, but on my node I most certainly, definitely don't need it.

Given the number of posts online with questions about tracker ("What is it?", "Why is it using up all my resources?" etc.) I think that there's a bit of a PR problem. If a program is noticeable because it makes demands on your computer system, the users should be allowed to know why putting up with this extra drain on resources is desirable -- or not.

Anyway.

aptitude show tracker says:
"Tracker is an advanced framework for first class objects with associated metadata and tags. It provides a one stop solution for all metadata, tags, shared object databases, search tools and indexing."

...which means what exactly in practical terms?

man tracker-miner-fs 
NAME
       tracker-miner-fs - Used to crawl the file system to mine data.

man tracker-store
NAME
       tracker-store - database indexer and query daemon

My guess would be that tracker-miner is basically indexing files for faster search, but I really don't know. It's one daemon I'm happy to expel.

There's a standard place for stuff that's supposed to be brought up with X:

ls /etc/xdg/autostart/

-rw-r--r-- 1 root root   306 May  3 09:42 at-spi-dbus-bus.desktop
-rw-r--r-- 1 root root  6216 Jun 20 06:58 evolution-alarm-notify.desktop
-rw-r--r-- 1 root root  7404 Oct 14  2011 gdu-notification-daemon.desktop
-rw-r--r-- 1 root root  5340 May 24 08:46 gnome-keyring-gpg.desktop
-rw-r--r-- 1 root root  6711 May 24 08:46 gnome-keyring-pkcs11.desktop
-rw-r--r-- 1 root root  6282 May 24 08:46 gnome-keyring-secrets.desktop
-rw-r--r-- 1 root root  5138 May 24 08:46 gnome-keyring-ssh.desktop
-rw-r--r-- 1 root root  6681 May 30 21:02 gnome-sound-applet.desktop
-rw-r--r-- 1 root root  7018 Apr 28 09:27 gsettings-data-convert.desktop
-rw-r--r-- 1 root root   460 Oct 21  2011 guake.desktop
-rw-r--r-- 1 root root   301 Jun 24 16:52 hplip-systray.desktop
-rw-r--r-- 1 root root   238 Dec  2  2011 kerneloops-applet.desktop
-rw-r--r-- 1 root root  4673 Mar 25 08:49 nm-applet.desktop
-rw-r--r-- 1 root root   250 Sep 10  2011 notification-daemon.desktop
-rw-r--r-- 1 root root  4651 Nov 12  2011 polkit-gnome-authentication-agent-1.desktop
-rw-r--r-- 1 root root  7112 Dec 23  2011 print-applet.desktop
-rw-r--r-- 1 root root  3864 Oct  1  2011 pulseaudio.desktop
-rw-r--r-- 1 root root   633 May 20 06:08 pulseaudio-kde.desktop
-rw-r--r-- 1 root root  3288 Aug 12 12:05 tracker-miner-fs.desktop
-rw-r--r-- 1 root root  3004 Aug 12 12:05 tracker-store.desktop

-rw-r--r-- 1 root root 11041 Apr  4 22:02 user-dirs-update-gtk.desktop
-rw-r--r-- 1 root root   433 Nov  3  2011 wicd-tray.desktop
-rw-r--r-- 1 root root   150 Feb  8  2012 xfce4-settings-helper-autostart.desktop
-rw-r--r-- 1 root root   357 Aug  1  2011 xfce4-volumed.desktop

Incidentally, the folder on that particular node betrays a history of previously installed desktop environments...

To stop and remove the tracker-miner processes, do
tracker-control -r 
It removes the databases it has created as well.


To disable:
Launch tracker-preferences from the KDE menu. Uncheck all options under 'Semantics'. Uncheck all places under locations. Clicking apply doesn't seem to have any effect, but if you open tracker-preferences again you'll probably find that it worked.

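To disable tracker-miner-fs and tracker-store from the terminal you can probably edit the two .desktop files in /etc/xdg/autostart/ and set the autostart keys to false. The ends of the files then look like this: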

tracker-miner-fs.desktop:
Icon=
Exec=/usr/lib/tracker/tracker-miner-fs
Terminal=false
Type=Application
Categories=Utility;
X-GNOME-Autostart-enabled=false
X-KDE-autostart-enabled=false
X-KDE-StartupNotify=false
X-KDE-UniqueApplet=true
NoDisplay=true

tracker-store.desktop:
Icon=
Exec=/usr/lib/tracker/tracker-store
Terminal=false
Type=Application
Categories=Utility;
X-GNOME-Autostart-enabled=false
X-KDE-autostart-enabled=false
X-KDE-StartupNotify=false
X-KDE-UniqueApplet=true
NoDisplay=true
OnlyShowIn=GNOME;KDE;XFCE;
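
An alternative that leaves the system-wide files alone (a package upgrade may overwrite them) is a per-user override. This is not what I did above, just a sketch: under the XDG autostart convention, a file with the same name in ~/.config/autostart takes precedence, and Hidden=true tells the desktop to treat the entry as deleted. The function name disable_autostart is hypothetical, and I haven't checked how every desktop environment on this node honours it.

```shell
# Write a per-user override that hides a system-wide autostart entry.
# $1 is the file name of the entry in /etc/xdg/autostart/.
disable_autostart() {
    name=$1
    dir=${XDG_CONFIG_HOME:-$HOME/.config}/autostart
    mkdir -p "$dir"
    printf '[Desktop Entry]\nType=Application\nHidden=true\n' > "$dir/$name"
}
```

You would call it once per entry, e.g. disable_autostart tracker-miner-fs.desktop and disable_autostart tracker-store.desktop, then log out and back in.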

Reasons why we don't simply uninstall it:

apt-cache rdepends tracker
         tracker
Reverse Depends:
  tracker-miner-fs
  tracker-gui
  tracker-gui
  tracker-gui
  nautilus
  tracker-miner-evolution
  tracker-utils
  tracker-extract
  brasero
  shared-mime-info
  tracker-utils
  tracker-miner-fs
  tracker-miner-evolution
  tracker-gui
  tracker-gui
  tracker-gui
  tracker-extract
  tracker-explorer
  tracker-dbg
  shared-mime-info
  rygel-tracker
  nautilus (twice?)
  catfish

21 August 2012

223. Moving disks, devices from one box to another -- issues with network interfaces

Long story short: edit /etc/udev/rules.d/70-persistent-net.rules

Long story:
I have a very small Beowulf cluster keeping my office warm in these antipodean winter months. For some silly reason I was using the front node, a six core + 8 GB box, as my daily desktop. That of course meant I wasn't really using it for computations. In addition to the front node I have a four core i5-somethingorother with 8 GB RAM (fast!) and a slovenly AMD X3 with 4 GB to actually run the jobs. They are connected via a gigabit switch (192.168.1.0/24) for NFS exports and a 10/100 router (192.168.2.0/24) for WAN access.

I finally decided that 1) I didn't need a six-core box to prepare latex documents, run octave jobs and make pretty gnuplot plots, and that 2) a slow 3-core AMD box was not fast enough for heavy nwchem jobs. On the other hand, I didn't want to set up/reinstall/move all my stuff from one harddrive to another.

Linux is wonderful in that it's often just a case of ripping out a harddrive and moving it to a different physical box. Windows will scream bloody murder, but linux normally handles it pretty well. Same here.

The main issue was the three network cards that I wanted to set up (three separate subnets) and which I configure via /etc/network/interfaces. I simply couldn't call the network cards what I wanted.

Well, as is obvious in hindsight, you should pay a visit to /etc/udev, and more specifically, /etc/udev/rules.d/70-persistent-net.rules

It looks something like this:

# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x10ec:0x8168 (r8169)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:YY:XX:96:XX:32", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:XX:YY:83:0a:48", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:XX:YY:64:0b:46", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"

# PCI device 0x1814:0x3062 (rt2860)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="c8:YY:XX:cf:1f:5d", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="ra*", NAME="ra0"

# USB device 0x:0x (rt2800usb)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="c8:YY:XX:c8:91:e6", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="wlan*", NAME="wlan0"

Basically, once you have figured out the MAC addresses of the different network cards (ip addr helped me more than ifconfig), you can simply go in and edit the ATTR{address}=="" values and the NAME="" values. Make sure that there are no conflicts, obviously.

After that, everything should be fine.
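
To make the editing less error-prone, here is a small sketch with two helpers. list_macs reads each interface's MAC address from sysfs, and net_rule prints one rule line in the same shape as the wired entries in the file above. Both names are made up for this post, and the rule template is simply copied from my own file, so check it against yours before pasting anything in.

```shell
# List every network interface with its MAC address, read from sysfs.
list_macs() {
    for dev in /sys/class/net/*; do
        if [ -r "$dev/address" ]; then
            printf '%s %s\n' "${dev##*/}" "$(cat "$dev/address")"
        fi
    done
}

# Print one persistent-net rule for MAC address $1 and interface name $2.
# The MAC is lowercased because sysfs reports it in lowercase.
net_rule() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    printf 'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="%s", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="%s"\n' "$mac" "$2"
}
```

Run list_macs to see what the kernel currently calls each card, then something like net_rule 00:XX:YY:83:0a:48 eth1 (with your real address) to get a line you can compare against the existing rules.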

If you are using network-manager (i.e. stock GNOME setup) then you will want to pay attention to the files in /etc/NetworkManager/system-connections/ as well -- open and edit suspiciously named files like e.g. eth0.

They'll look like this:

[802-3-ethernet]
duplex=full
mac-address=00:YY:XX:96:93:32
[connection]
id=eth0
uuid=fa5YYYY-XXXX-43a3-8502-f8ba2d28ZZZZ
type=802-3-ethernet
timestamp=1326324509
[ipv6]
method=auto
[ipv4]
method=auto