04 September 2012

226. ACML libs and nwchem -- what libs to choose to avoid 'Singularity in Pulay matrix' hang.

The problem:
If I compile nwchem against the acml libs (gfortran64_fma4 in acml-5-2-0-gfortran-64bit.tgz) everything appears to go fine, but once I try to run stuff I get


           Memory utilization after 1st SCF pass:
           Heap Space remaining (MW):       12.94            12937848
          Stack Space remaining (MW):       13.11            13107006
   convergence    iter        energy       DeltaE   RMS-Dens  Diis-err    time
 ---------------- ----- ----------------- --------- --------- ---------  ------
 d= 0,ls=0.0,diis     1    -74.9488845804 -7.49D+01  1.85D-02  1.70D-01     0.4
  Singularity in Pulay matrix. Error and Fock matrices removed.


and then the node hangs with 100% CPU.

The (obvious) solution:
To some this will be obvious, but to someone not skilled in the art, like myself, it isn't.
Of course, I could've just RTFM...but what academic does that?
"ACML and MKL can support 64-bit integers if the appropriate library is chosen. For MKL, one can choose the ILP64 Version of Intel® MKL, while for ACML the int64 libraries should be chosen, e.g. in the case of ACML 4.4.0 using a PGI compiler /opt/acml/4.4.0/pgi64_int64/lib/libacml.a"
So, when you download your libraries from the AMD website, make sure to download at a minimum the 64-bit integer file (e.g. acml-5-2-0-gfortran-64bit-int64.tgz).
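Since the whole fix hinges on BLASOPT pointing at an int64 build, a small guard before compiling can catch the mistake early. This is only a sketch: the function name blasopt_uses_int64 is made up, and it assumes the ACML directory naming used in this post (an _int64 suffix on the library directory).

```shell
# Hypothetical helper: succeed if the string mentions acml and, later on,
# an _int64 directory. It only inspects the text; it does not open the library.
blasopt_uses_int64() {
    case "$1" in
        *acml*_int64*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage before running make:
# blasopt_uses_int64 "$BLASOPT" || echo "BLASOPT does not look like an int64 ACML build" >&2
```

A plain string match like this won't catch every misconfiguration, but it is cheap and would have flagged the gfortran64_fma4 path I started with.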

How I built nwchem:

export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=`pwd`
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all"
#export PYTHONVERSION=2.7
export PYTHONHOME=/usr
export BLASOPT="-L/opt/acml/acml5.2.0/gfortran64_fma4_int64/lib -lacml"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/lib/openmpi/lib
export MPI_INCLUDE=/usr/lib/openmpi/include
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/acml/acml5.2.0/gfortran64_fma4_int64/lib"
export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"
cd $NWCHEM_TOP/src
make clean
make nwchem_config
make FC=gfortran 2> make.err 1>make.log
export FC=gfortran
cd ../contrib
./getmem.nwchem


Don't forget to add the acml libs to the LD_LIBRARY_PATH in your ~/.bashrc, e.g.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/acml/acml5.2.0/gfortran64_fma4_int64/lib
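One caveat with the export line above: if LD_LIBRARY_PATH starts out empty, the result begins with a colon, and the dynamic loader treats an empty entry as the current directory. Sourcing ~/.bashrc repeatedly also piles up duplicates. The following is a minimal sketch of a safer append; path_append is a made-up name, not something that ships with bash.

```shell
# Hypothetical helper: print LIST with DIR appended, unless DIR is already in it.
# It only prints the new value; the caller decides whether to export it.
path_append() {
    list=$1
    dir=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;        # already present, leave unchanged
        "::")       printf '%s\n' "$dir" ;;         # list was empty, no leading colon
        *)          printf '%s\n' "$list:$dir" ;;   # normal append
    esac
}

# Usage in ~/.bashrc (paths from this post):
# LD_LIBRARY_PATH=$(path_append "$LD_LIBRARY_PATH" /usr/lib/openmpi/lib)
# LD_LIBRARY_PATH=$(path_append "$LD_LIBRARY_PATH" /opt/acml/acml5.2.0/gfortran64_fma4_int64/lib)
# export LD_LIBRARY_PATH
```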



31 August 2012

225. Sun GridEngine: commlib error: got select error (Connection refused)


The issue:
On doing qhost I get
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "beryllium": got send error
beryllium is my head node (the one running the qmaster).

Same thing happens with qstat and any other imaginable SGE command.

The solution:
It's an obvious one -- just restart the services. I mean, it took me twenty minutes to re-remember that, but it should have been obvious. Most of the time, services are managed using scripts in /etc/init.d/ and that's the case here too. So, hanging my head in shame, here's the solution:

ls /etc/init.d/grid*
/etc/init.d/gridengine-exec  /etc/init.d/gridengine-master

sudo service gridengine-master restart

qhost should now slowly be populated

Done.

How I got there:


 ps aux|grep sge
sgeadmin  3173  0.0  0.0  56844  3428 ?        Sl   Aug20   6:29 /usr/lib/gridengine/sge_execd

 tree /var/spool/gridengine -L 4 -d
/var/spool/gridengine
|-- execd
|   `-- beryllium
|       |-- active_jobs
|       |-- jobs
|       `-- job_scripts
|-- qmaster
|   `-- job_scripts
`-- spooldb

Looking at /var/spool/gridengine/execd/beryllium/messages

08/20/2012 10:47:17|  main|beryllium|I|starting up GE 6.2u5 (lx26-amd64)
08/30/2012 15:06:57|  main|beryllium|E|commlib error: got read error (closing "beryllium/qmaster/1")
08/30/2012 15:06:58|  main|beryllium|W|can't register at qmaster "beryllium": abort qmaster registration due to communication errors

less /var/lib/gridengine/rupert/common/act_qmaster
beryllium

So looks ok.

Oddly, there's nothing funny in /tmp -- no execd_messages.* files.

 ps aux|grep sge
sgeadmin  3173  0.0  0.0  56844  3428 ?        Sl   Aug20   6:29 /usr/lib/gridengine/sge_execd
sudo kill 3173


start-stop-daemon --exec /usr/sbin/sge_execd --start --user sgeadmin
Which didn't seem to do anything.

start-stop-daemon --exec /usr/sbin/sge_qmaster --start --user sgeadmin
which doesn't seem to do anything either.

/usr/lib/gridengine/gethostname -aname
critical error: Please set the environment variable SGE_ROOT.
export SGE_ROOT=/var/lib/gridengine
/usr/lib/gridengine/gethostname -aname
beryllium

 service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmasterrm: cannot remove `/var/run/gridengine/qmaster.pid': Permission denied
.
cat /var/run/gridengine/qmaster.pid
3198
ps aux|grep 3198
yields nothing
sudo rm  /var/run/gridengine/qmaster.pid
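The check I did by hand here (read the pid file, see whether that process still exists) can be wrapped in a function. This is a sketch under assumptions: pidfile_is_stale is a made-up name, and it expects the pid file to hold a single PID on its first line, as qmaster.pid did above.

```shell
# Hypothetical helper: succeed (exit 0) if PIDFILE names a process that is gone.
pidfile_is_stale() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1              # no readable pid file: nothing to call stale
    pid=$(head -n 1 "$pidfile" | tr -cd '0-9') # keep only the digits of the first line
    [ -n "$pid" ] || return 0                  # a pid file without a PID counts as stale
    # kill -0 sends no signal, it only tests whether we could signal the process.
    # The /proc test covers live processes owned by another user.
    if kill -0 "$pid" 2>/dev/null || [ -e "/proc/$pid" ]; then
        return 1                               # process is still around
    fi
    return 0
}

# Usage (paths from this post):
# if pidfile_is_stale /var/run/gridengine/qmaster.pid; then
#     sudo rm /var/run/gridengine/qmaster.pid
#     sudo service gridengine-master restart
# fi
```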

sudo service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmaster.
ps aux|grep sge
sgeadmin 32178  2.5  0.0  69004  6112 ?        Sl   09:40   0:00 /usr/lib/gridengine/sge_qmaster
qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    715 0.75000 submit__la me         r     08/22/2012 08:10:32 six.q@boron                        6        
    720 0.25194 submit__63 me         r     08/22/2012 11:15:02 four.q@tantalum                    4        
    716 0.74817 submit__la me         qw    08/22/2012 08:11:28                                    6        
    719 0.70429 submit__la me         qw    08/22/2012 08:38:17                                    6        
    721 0.25071 submit__63 me         qw    08/22/2012 11:15:35                                    4        
    722 0.25000 submit__32 me         qw    08/22/2012 11:16:01                                    4    

 qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3     -    7.8G       -   14.9G       -
boron                   lx26-amd64      6  6.10    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0

sudo service gridengine-exec restart
Restarting Sun Grid Engine Execution Daemon: sge_execd.

qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
beryllium               lx26-amd64      3  0.22    7.8G    3.0G   14.9G  141.3M
boron                   lx26-amd64      6  6.09    7.6G    1.4G   14.9G  240.8M
tantalum                lx26-amd64      4  4.01    7.7G    1.6G   14.9G     0.0




22 August 2012

224. Disabling tracker-miner-fs

Yes, yes, I shouldn't have a full desktop install on a computational node, but the nodes serve as instant replacement desktops if something goes awry with my main desktop, and occasionally visitors get to use them to access the internet in order to avoid getting bored.

Anyway, tracker-miner-fs is eating up 26% of my 8 GB of RAM on one of my nodes running KDE, and I really don't need it. I mean, I don't know if it's useful to most people running a full DE, but on my node I most certainly, definitely don't need it.

Given the number of posts online with questions about tracker ("What is it?", "Why is it using up all my resources?", etc.) I think that there's a bit of a PR problem. If a program is noticeable because it makes demands on your computer system, the users should be told why putting up with this extra drain on resources is desirable -- or not.

Anyway.

aptitude show tracker says:
"Tracker is an advanced framework for first class objects with associated metadata and tags. It provides a one stop solution for all metadata, tags, shared object databases, search tools and indexing."

...which means what exactly in practical terms?

man tracker-miner-fs 
NAME
       tracker-miner-fs - Used to crawl the file system to mine data.
man tracker-store
NAME
       tracker-store - database indexer and query daemon
My guess would be that tracker-miner is basically indexing files for faster search, but I really don't know. It's one daemon I'm happy to expel.

There's a standard place for stuff that's supposed to be brought up with X:

ls -l /etc/xdg/autostart/

-rw-r--r-- 1 root root   306 May  3 09:42 at-spi-dbus-bus.desktop
-rw-r--r-- 1 root root  6216 Jun 20 06:58 evolution-alarm-notify.desktop
-rw-r--r-- 1 root root  7404 Oct 14  2011 gdu-notification-daemon.desktop
-rw-r--r-- 1 root root  5340 May 24 08:46 gnome-keyring-gpg.desktop
-rw-r--r-- 1 root root  6711 May 24 08:46 gnome-keyring-pkcs11.desktop
-rw-r--r-- 1 root root  6282 May 24 08:46 gnome-keyring-secrets.desktop
-rw-r--r-- 1 root root  5138 May 24 08:46 gnome-keyring-ssh.desktop
-rw-r--r-- 1 root root  6681 May 30 21:02 gnome-sound-applet.desktop
-rw-r--r-- 1 root root  7018 Apr 28 09:27 gsettings-data-convert.desktop
-rw-r--r-- 1 root root   460 Oct 21  2011 guake.desktop
-rw-r--r-- 1 root root   301 Jun 24 16:52 hplip-systray.desktop
-rw-r--r-- 1 root root   238 Dec  2  2011 kerneloops-applet.desktop
-rw-r--r-- 1 root root  4673 Mar 25 08:49 nm-applet.desktop
-rw-r--r-- 1 root root   250 Sep 10  2011 notification-daemon.desktop
-rw-r--r-- 1 root root  4651 Nov 12  2011 polkit-gnome-authentication-agent-1.desktop
-rw-r--r-- 1 root root  7112 Dec 23  2011 print-applet.desktop
-rw-r--r-- 1 root root  3864 Oct  1  2011 pulseaudio.desktop
-rw-r--r-- 1 root root   633 May 20 06:08 pulseaudio-kde.desktop
-rw-r--r-- 1 root root  3288 Aug 12 12:05 tracker-miner-fs.desktop
-rw-r--r-- 1 root root  3004 Aug 12 12:05 tracker-store.desktop

-rw-r--r-- 1 root root 11041 Apr  4 22:02 user-dirs-update-gtk.desktop
-rw-r--r-- 1 root root   433 Nov  3  2011 wicd-tray.desktop
-rw-r--r-- 1 root root   150 Feb  8  2012 xfce4-settings-helper-autostart.desktop
-rw-r--r-- 1 root root   357 Aug  1  2011 xfce4-volumed.desktop
Incidentally, the folder on that particular node betrays a history of previously installed desktop environments...

To stop and remove the tracker-miner processes, do
tracker-control -r 
It removes the databases it has created as well.


To disable:
Launch tracker-preferences from the KDE menu. Uncheck all options under 'Semantics'. Uncheck all places under locations. Clicking apply doesn't seem to have any effect, but if you open tracker-preferences again you'll probably find that it worked.

To disable tracker-miner-fs and tracker-store from the terminal, you can probably edit their entries in /etc/xdg/autostart/ so that the two autostart keys are set to false. The tail end of each file then looks like this:

tracker-miner-fs.desktop:
Icon=
Exec=/usr/lib/tracker/tracker-miner-fs
Terminal=false
Type=Application
Categories=Utility;
X-GNOME-Autostart-enabled=false
X-KDE-autostart-enabled=false
X-KDE-StartupNotify=false
X-KDE-UniqueApplet=true
NoDisplay=true

tracker-store.desktop:
Icon=
Exec=/usr/lib/tracker/tracker-store
Terminal=false
Type=Application
Categories=Utility;
X-GNOME-Autostart-enabled=false
X-KDE-autostart-enabled=false
X-KDE-StartupNotify=false
X-KDE-UniqueApplet=true
NoDisplay=true
OnlyShowIn=GNOME;KDE;XFCE;
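If you'd rather not open an editor on every node, the same change can be scripted. What follows is a rough sketch, not something I've rolled out everywhere: disable_autostart is a made-up name, it touches only the two autostart keys shown above, and it assumes [Desktop Entry] is the only group in the file, since a missing key is simply appended at the end.

```shell
# Hypothetical helper: force the GNOME and KDE autostart keys in FILE to false.
disable_autostart() {
    file=$1
    for key in X-GNOME-Autostart-enabled X-KDE-autostart-enabled; do
        if grep -q "^${key}=" "$file"; then
            # Key present: rewrite its value. Going through a temp file and
            # cat keeps the original file's owner and permissions.
            sed "s/^${key}=.*/${key}=false/" "$file" > "$file.tmp" &&
                cat "$file.tmp" > "$file"
            rm -f "$file.tmp"
        else
            # Key missing: append it.
            printf '%s=false\n' "$key" >> "$file"
        fi
    done
}

# Usage (needs root, because the files belong to root):
# disable_autostart /etc/xdg/autostart/tracker-miner-fs.desktop
# disable_autostart /etc/xdg/autostart/tracker-store.desktop
```

Keep in mind that a package upgrade may put the original files back, so this might have to be re-run afterwards.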

Reasons why we don't simply uninstall it:

apt-cache rdepends tracker
         tracker
Reverse Depends:
  tracker-miner-fs
  tracker-gui
  tracker-gui
  tracker-gui
  nautilus
  tracker-miner-evolution
  tracker-utils
  tracker-extract
  brasero
  shared-mime-info
  tracker-utils
  tracker-miner-fs
  tracker-miner-evolution
  tracker-gui
  tracker-gui
  tracker-gui
  tracker-extract
  tracker-explorer
  tracker-dbg
  shared-mime-info
  rygel-tracker
  nautilus (twice?)
  catfish
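The listing above repeats itself a lot (tracker-gui and nautilus show up several times). A small filter collapses the repeats so it's easier to see what actually depends on tracker. This is a sketch: unique_rdepends is a made-up name, and it assumes the output has the "Reverse Depends:" header line shown above.

```shell
# Hypothetical helper: read apt-cache rdepends output on stdin and print
# each reverse dependency once, sorted.
unique_rdepends() {
    awk '
        seen {                       # only lines after the header are packages
            gsub(/^[ |]+/, "")       # strip leading indentation and "|" markers
            if ($0 != "") print
        }
        /^Reverse Depends:/ { seen = 1 }
    ' | sort -u
}

# Usage:
# apt-cache rdepends tracker | unique_rdepends
```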