15 March 2012

108. Building local version of sinfo without root/sudo on ROCKS/CentOS

Edit 04/04/2012: there were several errors and omissions. These have been fixed now.

Because I don't want to mess up a cluster which is on a different continent I'm trying to use my superuser powers as little as possible.

Here's how to make a local version of sinfo -- you'll still need to make sinfod runs as a service on all the nodes.

There's no reason the instructions here shouldn't work on most linux distros, including Debian.

boost:
cd ~/tmp
wget http://sourceforge.net/projects/boost/files/boost/1.49.0/boost_1_49_0.tar.gz/download
tar -xvf boost_1_49_0.tar.gz
cd boost_1_49_0/
./bootstrap.sh --prefix=/export/home/me/.libboost


Edit tools/build/user-config.jam and add
using mpi ;
The space between mpi and ; is needed.

Start installation:
./b2 install

cd /export/home/me/.libboost/lib
ln -s libboost_signals.so libboost_signals-mt.so
ln -s libboost_serialization.so libboost_serialization-mt.so
ln -s libboost_date_time.so libboost_date_time-mt.so
ln -s libboost_wserialization.so libboost_wserialization-mt.so
ln -s libboost_regex.so libboost_regex-mt.so


asio:
cd ~/tmp
wget "http://downloads.sourceforge.net/project/asio/asio/1.5.3%20%28Development%29/asio-1.5.3.tar.bz2?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fasio%2F&ts=1331441086&use_mirror=aarnet"
tar -xvf asio-1.5.3.tar.bz2
cd asio-1.5.3/

./configure --prefix=/export/home/me/.asio --with-boost=/export/home/me/.libboost/include
make
make install

sinfo/d:
wget http://www.ant.uni-bremen.de/whomes/rinas/sinfo/download/sinfo-0.0.45.tar.gz
tar -xvf sinfo-0.0.45.tar.gz
cd sinfo-0.0.45/

export LIBS=-L/export/home/me/.libboost/lib
export LDFLAGS=$LIBS
export CPPFLAGS="-I/export/home/me/.libboost/include -I/export/home/me/.asio/include/"
./configure --prefix=/export/home/me/.sinfo --disable-IPv6
make

make install 

Getting started:
In order to make something happen at boot you need sudo/root access. However, HPC clusters are rarely rebooted, so even if you launch something as a user it will persist for a long time. If you're lucky the right ports are open -- and they should be open between nodes.

You also need to add this to your ~/.bashrc:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/export/home/me/.libboost/lib

Start sinfod (the daemon) using:
~/.sinfo/sbin/./sinfod --quiet

ps aux |grep sinfod 
will show it it's running

And check that everything is ok using
~/.sinfo/bin/./sinfo



13 March 2012

107. Fun with gnu screen -- setting up a screenrc

This is going to be a short post.

The problem:
If I have three nodes on a cluster, each with htop installed, how do I set up gnu screen so that it automatically signs in to each node and starts a copy of htop on each node, and display the output in a split window? Basically, how to you open screen instances and inject commands in them?

...like so. Guess which computer is running a full gnome-shell environment...


The solution:
create a file called multitop.screenrc:

screen ssh 192.168.1.101
stuff htop\015
title node01
split
focus
screen ssh 192.168.1.102
title node02
stuff htop\015
split
focus
screen ssh 192.168.1.103
title node03
stuff htop\015
stuff string injects string into the active window. \015 basically injects <enter>.

start screen using
screen -c multitop.screenrc 
and as usual,don't exit using exit or q, but use C+a - d to leave it running.


To make things a bit prettier and to enable users to reconnect to a running session, edit /etc/screenrc:

multiuser on
acladd me
defscrollback 5432
termcapinfo xterm|xterms|xs|rxvt ti@:te@
caption     always        "%{+b rk}%H%{gk} |%c %{yk}%d.%m.%Y | %72=Load: %l %{wk}"
hardstatus alwayslastline "%?%{yk}%-Lw%?%{wb}%n*%f %t%?(%u)%?%?%{yk}%+Lw%"
and
sudo chmod +s /usr/bin/screen
sudo chmod 755 /var/run/screen


If you start a session using e.g.
screen -S mytest
as user me

then other users can connect using
screen -x me/mytest

106. htop 1.0.1 and sinfo-0.0.45 on rock 5.4.3/centos 5.6

There are a number of performance monitor tools in the debian repos. ROCKS 5.4.3/Centos doesn't seem quite as well-equipped.

First out, htop:

htop:
wget http://downloads.sourceforge.net/project/htop/htop/1.0.1/htop-1.0.1.tar.gz
tar -xvf htop-1.0.1.tar.gz
cd htop-1.0.1/
./configure --prefix=/home/me/.htop
make
make install

It's as simple as that.
Add e.g.
alias htop='/home/me/.htop/bin/htop'
to your ~/.bashrc
Note: this works on Scientific Linux (boron) 5.4 as well.

sinfo:
Update 13/03/2012:
Sinfo <0.0.44 has IPv6 enabled by default.
On sinfo >=0.0.45 you can disable IPv6 using ./configure --disable-IPv6

Sinfo is probably the snazziest cluster monitoring tool that I know of. Sure, ganglia etc. are nice too, but they run as web service. Sinfo is a 'simple' curses program, but building it on CentOS was a bit of a challenge.

Be aware that sinfo versions prior to 0.045 expect ipv6 to work -- by default ROCKS disables IPv6, so use sinfo 0.0.45 and above.





First boost:
(yum install boost-devel didn't do anything for me)
cd ~/tmp
wget http://sourceforge.net/projects/boost/files/boost/1.49.0/boost_1_49_0.tar.gz/download
tar -xvf boost_1_49_0.tar.gz
cd boost_1_49_0/
./bootstrap.sh --prefix=/usr

Edit Jamroot and add
using mpi ;
The space between mpi and ; is needed.

Symlink to your mpic++, e.g. if your mpic++ is in /opt/openmpi:
sudo ln -s /opt/openmpi/bin/mpic++ /usr/bin/mpic++

The following step takes a long time:
sudo ./b2 -a install --layout=versioned --build-type=complete

These days all the libboost libs are multithread aware (or so I hear), and in debian it turns out that the -mt.so libs are just symbolic links to the 'regular' libs.
sudo ln -s /usr/lib/libboost_signals.so /usr/lib/libboost_signals-mt.so
sudo ln -s /usr/lib/libboost_date_time.so /usr/lib/libboost_date_time-mt.so
sudo ln -s /usr/lib/libboost_serialization.so /usr/lib/libboost_serialization-mt.so
sudo ln -s /usr/lib/libboost_wserialization.so /usr/lib/libboost_wserialization-mt.so
sudo ln -s /usr/lib/libboost_regex.so /usr/lib/libboost_regex-mt.so

sudo ln -s /usr/lib/libboost_signals.so.1.49.0 /usr/lib64/libboost_signals.so.1.49.0

Then asio
cd ~/tmp
wget "http://downloads.sourceforge.net/project/asio/asio/1.5.3%20%28Development%29/asio-1.5.3.tar.bz2?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fasio%2F&ts=1331441086&use_mirror=aarnet"
tar -xvf asio-1.5.3.tar.bz2
cd asio-1.5.3/
./configure
make
sudo make install

Then sinfo
cd ~/tmp
wget http://www.ant.uni-bremen.de/whomes/rinas/sinfo/download/sinfo-0.0.45.tar.gz
tar -xvf sinfo-0.0.45.tar.gz
cd sinfo-0.0.45/
./configure --disable-IPv6

The build should be fine.

Configuration:
you'll end up with
/usr/local/sbin/sinfod
/usr/local/bin/sinfo
You may want to make sure there are paths to them by adding the following to your ~/.bashrc:
export PATH=$PATH:/usr/local/bin:/usr/local/sbin
The changes take effect next time you log in to a shell, or just run
source ~/.bashrc
for immediate effect.

Also, create a file called /etc/default/sinfo with the following in it:
OPTS="--quiet --bcastaddress=192.168.1.255"

Start sinfod with
sinfod --quiet --bcastaddress=192.168.1.255

then check that it's running
ps aux | grep sinfod

If it's not running, then try
sinfod -F

If it gives something along the lines of
exception:open:address family not supported
you most likely
1) haven't enabled ipv6 for your interface and
2) didn't disable IPv6 during compilation and/or
3) used version<0.045

Check by doing ifconfig -- does it return both an ipv4 and an ipv6 address?

Enabling ipv6
Unless you know what you're doing, don't fiddle with the network interfaces on a production cluster -- network interfaces on a multinode cluster are typically highly tuned to minimise latency, so don't mess it up.

Anyway. First check your /etc/modules.conf and - if present - comment out
alias ipv6 off
options ipv6 disable=1
Edit your /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1:0
IPADDR=192.168.1.111
NETMASK=255.255.255.0
BOOTPROTO=none
MTU=1500
TYPE=Ethernet
GATEWAY=192.168.1.1
USERCTL=no
IPV6INIT=yes
PEERDNS=yes
ONPARENT=yes
IPV6ADDR=fe80::2f0:4dff:f383:b44/64
IPV6_DEFAULTGW=fe80::2f0:4dff:fe83:a48/64
I just made up the IPV6ADDR, and took the IPV6_DEFAULTGW from my gateway machine (running debian, so ipv6 enabled by default)

Assuming that your firewall is allowing traffic at port 60003 and free traffic in and out on 192.168.1.255 things should work fine.



Errors


Error (boost):
MPI auto-detection failed: unknown wrapper compiler mpic++
Please report this error to the Boost mailing list: http://www.boost.org
You will need to manually configure MPI support.
Solution:
make sure you've symlinked to your mpic++ instance in /usr/bin
e.g. if your mpic++ is in /opt/openmpi/bin/mpic++
sudo ln -s /opt/openmpi/bin/mpic++ /usr/bin/mpic++


Error (sinfo):
message.cc: In member function 'void Message::popFrontMemory(void*, size_t)':
message.cc:183: error: 'memory' was not declared in this scope
message.cc:193: error: 'boost' has not been declared
message.cc:193: error: expected primary-expression before 'char'
message.cc:193: error: expected `;' before 'char'
message.cc:196: error: 'newMemory' was not declared in this scope
message.cc:196: error: 'memory' was not declared in this scope
make[2]: *** [message.lo] Error 1
make[2]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/libmessage'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/libmessage'
make: *** [all-recursive] Error 1
Solution:
You need to make sure that the libs are found -- either symlink manually between your build directory and /usr/lib, or use boostrap.sh --prefix=/usr. See above for how to do it.

Error (sinfo):
udpmessagereceiver.h:14: error: 'asio' has not been declared
udpmessagereceiver.h:14: error: ISO C++ forbids declaration of 'endpoint' with no type
udpmessagereceiver.h:14: error: expected ';' before 'sender_endpoint'
udpmessagereceiver.h:16: error: 'asio' has not been declared
udpmessagereceiver.h:16: error: ISO C++ forbids declaration of 'io_service' with no type
udpmessagereceiver.h:16: error: expected ';' before '&' token
udpmessagereceiver.h:17: error: 'asio' has not been declared
udpmessagereceiver.h:17: error: ISO C++ forbids declaration of 'socket' with no type
udpmessagereceiver.h:17: error: expected ';' before 'sock'
udpmessagereceiver.h:20: error: expected ',' or '...' before '::' token
udpmessagereceiver.h:20: error: ISO C++ forbids declaration of 'asio' with no type
udpmessagereceiver.h:23: error: 'asio' has not been declared
udpmessagereceiver.h:23: error: expected `)' before '&' token
udpmessagereceiver.cc:5: error: 'asio' has not been declared
udpmessagereceiver.cc:5: error: expected `)' before '&' token
make[1]: *** [udpmessagereceiver.lo] Error 1
make[1]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/libmessageio'
make: *** [all-recursive] Error 1

Solution: you've only got boost::asio installed, not the independent asio. See above for how to compile and install asio.

Error (sinfo):

/usr/bin/ld: cannot find -lboost_signals-mt
collect2: ld returned 1 exit status
make[2]: *** [sinfod] Error 1
make[2]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/sinfod'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/sinfod'
make: *** [all-recursive] Error 1
Solution:
You need a symlink pointing form /usr/lib/libboost_signals-mt.so to /usr/lib/libboost_signals.so
ln -s /usr/lib/libboost_signals.so /usr/lib/libboost_signals-mt.so 

Error (sinfod):
sinfod --quiet --bcastaddress=192.168.1.255 gives nothing and sinfod exits silently immediately
sinfod -F gives
exception:open:address family not supported
Here's the relevant strace output:
[..]
 socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 6
[..]
 socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDP) = -1 EAFNOSUPPORT (Address family not supported by protocol)
futex(0x333a40d350, FUTEX_WAKE_PRIVATE, 2147483647) = 0
close(6)                                = 0
close(3)                                = 0
close(4)                                = 0
close(5)                                = 0
write(2, "Exception: ", 11)             = 11
write(2, "open: Address family not support"..., 46) = 46
write(2, "\n", 1)                       = 1
exit_group(0)                           = ?

Solution: enable ipv6 (see above)