07 September 2012

229. Compile ATLAS (+ gromacs, nwchem) on AMD FX 8150 on Debian Testing (Wheezy)

Xianyi's openblas doesn't seem to be ready for AMD FX 8150 yet. I've played with ATLAS in the past, but  for some reason didn't see the same performance with NWChem and ATLAS as I saw with NWChem and Openblas, so I never ended up using it.

I'm also having issues using openblas with CPMD and quantum espresso, and ATLAS is a well-established, respectable project, so it's time to give it another shot. As in most cases in these situations, it's probably a matter of PEBKAC.

Building ATLAS
Anyway. On we go...

mkdir /opt/ATLAS
chown ${USER} /opt/ATLAS
mkdir ~/tmp
cd ~/tmp

wget http://www.netlib.org/lapack/lapack-3.4.1.tgz
 wget http://downloads.sourceforge.net/project/math-atlas/Developer%20%28unstable%29/3.9.72/atlas3.9.72.tar.bz2
tar xvf atlas3.9.72.tar.bz2
cd ATLAS/

Edit ATLAS/Make.top
change the V on line 6 to lowercase i.e. from
- $(ICC) -V 2>&1 >> bin/INSTALL_LOG/ERROR.LOGto
- $(ICC) -v 2>&1 >> bin/INSTALL_LOG/ERROR.LOG
mkdir build/
cd build/


sudo apt-get install cpufreq-utils
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
sudo cpufreq-set -g performance

Unfortunately that only takes care of cpu0:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
but

cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
ondemand
So...since we have 8 cores (cpu0-cpu7):

sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor
sudo cp /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor

OK, we're ready to compile:
.././configure --prefix=/opt/ATLAS -Fa alg '-fPIC' --with-netlib-lapack-tarfile=$HOME/tmp/lapack-3.4.1.tgz --shared

Some of the info that's important is:
OS configured as Linux (1)
Assembly configured as GAS_x8664 (2)
Vector ISA Extension configured as  AVXFMA4 (4,496)
Architecture configured as  AMDDOZER (34)
Clock rate configured as 3600Mhz
If that checks out you don't need to manually set your architecture. To get a list over options, do
 make xprint_enums ; ./xprint_enums

If all is well,

make
make install

You should now be done.

Linking Gromacs against ATLAS

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/ATLAS/lib
#single precision
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/single/lib -L/opt/ATLAS/lib -lsatlas -ltatlas -lgfortran"
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/single/include -I/opt/ATLAS/include"
./configure --disable-mpi --enable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_spatlas --prefix=/opt/gromacs/gromacs-4.5.5
make -j6 2>make.err 1>make.log
make install

#double precision
make distclean
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/double/lib -L/opt/ATLAS/lib -lsatlas -ltatlas -lgfortran" 
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/double/include -I/opt/ATLAS/include"
./configure --disable-mpi --disable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_dpatlas --prefix=/opt/gromacs/gromacs-4.5.5
make -j6 2>make2.err 1>make2.log
make install

#single + mpi
make distclean
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/single/lib -L/opt/ATLAS/lib -lsatlas -ltatlas -lgfortran"
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/single/include -I/opt/ATLAS/include"
./configure --enable-mpi --enable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_spmpiatlas --prefix=/opt/gromacs/gromacs-4.5.5
make -j6 2>make3.err 1>make3.log
make install

#double + mpi
make distclean
export LDFLAGS="-L/opt/fftw/fftw-3.3.2/double/lib -L/opt/ATLAS/lib -lsatlas  -ltatlas  -lgfortran" 
export CPPFLAGS="-I/opt/fftw/fftw-3.3.2/double/include -I/opt/ATLAS/include"
./configure --enable-mpi --disable-float --with-fft=fftw3 --with-external-blas --with-external-lapack --program-suffix=_dpmpiatlas --prefix=/opt/gromacs/gromacs-4.5.5
make -j6 2>make4.err 1>make4.log
make install

Linking NWChem against ATLAS

export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=`pwd`
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all"
export BLASOPT="-L/opt/ATLAS/lib -lsatlas -ltatlas"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/lib/openmpi/lib
export MPI_INCLUDE=/usr/lib/openmpi/include
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/ATLAS/lib"
export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"
export LDFLAGS="-I/opt/ATLAS/include"
cd $NWCHEM_TOP/src
make clean
make nwchem_config
make FC=gfortran 2> make.err 1>make.log
export FC=gfortran
cd $NWCHEM_TOP/contrib
./getmem.nwchem

228. Setting up Asus (nvidia) GF 210 on Debian Testing

NOTE:  Unless I remove the legacy driver, *DM will not start. Instead I only get a blank screen with a blinking cursor. See below for solution.

Here's how to get ASUS (nvidia) GF210 up and running in debian testing (wheezy)

First edit /etc/modules, and add
blacklist nouveau

You can either reboot at this point or try
sudo rmmod nouveau

To see whether nouveau got unloaded, do
lsmod |grep nouv

If nothing is returned, then you're good.

Make sure that your card got recognised:

lspci
01:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2)

I like smxi, so here's how to get the drivers up and running using smxi, which is a fancy shell script.

sudo su
cd /usr/local/bin
wget -Nc smxi.org/smxi.zip
unzip smxi.zip
smxi

The first time you run smxi you have a couple of things to sort out -- lots of little questions to answer. If you don't feel comfortable yet with linux, avoid liquorix since it'll make your debian box deviate more from the standard setup (the liquorix kernel is fine and safe and I've used it in the past before I started rolling my own kernel, but it's more difficult for someone to troubleshoot your system the more it deviates from their own). Other than that most questions aren't that important. I enable non-free immediately after setting up a new box, and while there are sound political reasons for NOT doing it, there are plenty of practical reasons in favour of it.

Anyway, eventually you're done with the setup, and with making sure that your system is up to date. Select Continue to Graphics, then select debian-nvidia

If all goes well you'll get the dkms install of the nvidia driver. You're probably asked whether to generate a new xorg.conf, which you should. You may also get a message about the nvidia driver needing to be added to your xorg.conf.

Once you're done installing the driver, you're asked whether to start your desktop or to quit. While it's fine to start your desktop at this point, why not select quit and check that all went well?

lsmod|grep nvi
nvidia               8028141  0
i2c_core               24002  2 i2c_piix4,nvidia

cat /etc/X11/xorg.conf|grep nvi
        Driver  "nvidia"
        Driver  "nvidia"
        Driver  "nvidia"

Looks fine.

I often have problems with the legacy drivers (blank screen with blinking cursor), so
sudo apt-get purge nvidia-*-legacy-173xx-*

Do
aptitude search 173 
to make sure all the legacy drivers are gone.

Purge if there's still something around.

Framebuffer 
If possible I like to enable framebuffer (it gives you fancier graphics capabilities in terminal mode e.g. browsing with images using w3m). I've had all manner of headaches doing so with the newer nvidia drivers though, so don't be too surprised if it doesn't pan out.

Edit the following line in your /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet text vga=0x0318 nomodeset"
To see what code to use, look here.
This method is supposed to be deprecated, but I don't have any experience using vbetool.

As for the other options, only 'text' is important -- it will make you boot into the terminal and you will have to start your (default) desktop by doing startx. IF you want to boot into e.g. gdm3, kdm or another *dm, then DO NOT ADD text.

Reboot. To see whether your framebuffer is active do
ls /etc/fb*
/dev/fb0


04 September 2012

227. New compute node using AMD FX-8150. Gromacs, nwchem performance/benchmarks

Update: reconfiguring your nwchem binary using getmem.nwchem can speed things up considerably. Most of the runtimes are obtained without using getmem.nwchem and are thus all using the same amount of memory, regardless of what is available. Binaries which have been reconfigured are shown as such.

The short summary: I first wasn't that happy with my choice of the the AMD FX-8150, but after sorting out the ACML libs and getting things benchmarked I'm much more satisfied. The only situation in which I'm not seeing this processor outclass the other systems seems to be that using the Commercial Ab Initio Package, which arrived as a precompiled binary (Portland Fortran).

In general it seems that the FX 8150 is about 10% faster than the i5-2400 for the computations I've tested here -- but beware that the AMD processor is using the machine vendor math libs, while the intel unit is using openblas.

Note that the AMD Phenom II X6 1055T is SLOWER with the ACML libs than with Openblas.

The Lengthy Preamble
I seem to remember promising myself not to get another AMD since, while they may or may not be 'the good guys' (e.g. the Intel/Dell thingy), empirically I keep on seeing my Quadcore Intel i5-2400 3.1 GHz sweeping the floor with my Phenom II X6 1055T 2.8 GHz. Sure, part of the issue is the clock frequency, but  the difference seems to be a lot bigger than that.

At any rate, I ended up building a new node for my little cluster. Remember that these are Australian prices. Oh, how I miss you, Newegg -- not just because of the price, but because of the choice.

Luckily it seems like my choice of the FX-8150 has paid off. Also, at the moment of writing the intel i5-2400 and the AMD fx-8150 sell for the same price locally.


The Setup

It's basically an eight-core 3.6 GHz box with 16 GB RAM (expandable to 32 Gb, 4 slots) and a 7200 rpm HDD. I've heard the eight-core FX 8150 uncharitably described as a quad-core with advanced hyper-threading, but I wouldn't be qualified to comment. Interestingly, sinfo registers it as a quad core, while htop and all other programs considers it an 8-core. Finally, looking at this image it looks like the whole 8 core thing is a bit of a cheat -- the whole 4 floating point vs 8 integer processing units.

Gigabyte GA-990FXA-D3 AM3 990FX DDR3 Motherboard AU$ 128 Link
AMD AM3+ x8 FX-8150 3.6Ghz Boxed CPU AU$209 Link
PV316G186C0K 16G Kit(8Gx2) DDR3 1866 AU$ 129 Link
Hitachi 3.5" Desktar 1TB SATA3 HDD 7200rpm AU$83 Link
Corsair GS800 V2 ATX Power Supply Unit AU$ 138 Link
TP-LINK TG-3269 PCI Gigabit PCI Network Card AU$ 8Link
ASUS Vento TA-U11 without PSU AU$99 Link
ASUS 1GB GF210 PCI-E VGA Card Link

NOTE that the mobo does NOT have onboard video. I didn't pick up on that before buying the parts, but luckily had an old ATI card floating around.

The fan on the PSU is a bit annoying. It stays off for the most part (some posts say it should never be completely off, one post said it should be) but starts up in a weird way -- basically the electricity is given in small jolts. Or it's just broken. Other than that it works fine.

Preparation
It's for reasons like these I write this blog. After having installed debian testing I set up NFS, added the box as a node under Sun Grid Engine (Link), set up Gaussian (Link), and compiled Gromacs.

I encountered separate issues trying to compile Openblas (Bulldozer cores aren't supported) and Nwchem with internal libs (odd stuff). I've given up on Openblas and managed to compile nwchem against the AMD ACML. Same went for gromacs -- I eventually recompiled gromacs against ACML. Maybe it's unfair to compare ACML vs Openblas on the i5-2400, but ACML is free, MKL isn't.


Performance -- setup
Note that while I do use NFS it's not in the 'traditional way'. Each node exports a local folder to the front node so that SGE can see it. However, when you run your calcs everything is stored in a local folder, and using a locally compiled version of the number crunching software. In other words, network performance should not affect the benchmarks.

Neon is NOT using openblas, while Boron and Tantalum are. Xianyi's version of openblas won't compile on Bulldozer at the moment (it seems). I will rebuild gromacs with the ACML libs and do the benchmark again.

Also, please note that these 'benchmarks' aren't absolute -- I'm not an expert on optimising performance. You can probably use them to get an idea of the relative computational grunt of the different hardware combinations though.

FX 8150 is a lot more fun with ACML. The Phenom II 1055T is no fun with ACML.

I recompiled nwchem and gromacs on Boron (see below) to see what ACML vs Openblas would be like. I've yet to run those jobs, but will post the results when I have.

Unlike the FX-8150, the Phenom II X6 1055T does not support AVX, FMA3 or FMA4.

Configuration:
Boron (B): Phenom II X6 2.8 GHz, 8Gb RAM (2.8*6=16.8 GFLOPS predicted)
Neon (Ne): FX-8150 X8 3.6 GHz, 16 Gb RAM (3.6*8=28.8 GFLOPS predicted (int), 3.6*4=14.4 GFLOPS (fpu))
Tantalum (Ta): Quadcore i5-2400 3.1 GHz, 8 Gb RAM (3.1*4=12.4 GFLOPS predicted)
Vanadium (V):  Dual socket 2x Quadcore Xeon X3480 3.06 GHz, 8Gb. CentOS (ROCKS 5.4.3)/openblas.

Results

Gromacs --double (1 ns 6x6x6 nm tip4p water box; dynamic load balancing, double precision, 500k steps)
B  :  10.662 ns/day (11.8  GFLOPS, runtime 8104 seconds)***
B  :    9.921 ns/day ( 10.9 GFLOPS, runtime 8709 seconds)**
Ne:  10.606 ns/day (11.7  GFLOPS, runtime 8146 seconds) *
Ne:  12.375 ns/day (13.7  GFLOPS, runtime 6982 seconds)**
Ne:  12.385 ns/day (13.7  GFLOPS, runtime 6976 seconds)****
Ta:  10.825 ns/day (11.9  GFLOPS, runtime 7981 seconds)***
V :   10.560 ns/dat (11.7  GFLOPS, runtime 8182 seconds)***
*no external blas/lapack.
**using ACML libs
*** using openblas
**** using ATLAS

Gromacs --single (1 ns 6x6x6 nm tip4p water box; dynamic load balancing, single precision, 500 k steps)
B  :   17.251 ns/day (19.0 GFLOPS, runtime 5008 seconds)***
Ne:   21.874 ns/day (24.2 GFLOPS, runtime  3950 seconds)**
Ne:   21.804 ns/day (24.1 GFLOPS, runtime 3963  seconds)****
Ta:   17.345 ns/day (19.2 GFLOPS, runtime  4982 seconds)***
V :   17.297 ns/day (19.1 GFLOPS, runtime 4995 seconds)***
*no external blas/lapack.
**using ACML libs
*** using openblas
**** using ATLAS

NWChem (opt biphenyl cation, cp-md/pspw):
B  :   5951 seconds**
B  :   4084 seconds ***
B  :   1988 seconds***x
Ne:   3689 seconds**
Ta :   4102 seconds***
V :    5396 seconds***

*no external blas/lapack.
**using ACML libs
*** using openblas
x Reconfigured using getmem.nwchem

NWChem (opt biphenyl cation, geovib, 6-31G**/ub3lyp):
B  :  2841 seconds **
B  :  2410 seconds***
B  :  2101 seconds ***x
Ne: 1665 seconds **
Ta : 1785 seconds***
Ta : 1789 seconds***x
V  : 2600 seconds***

*no external blas/lapack.
**using ACML libs
*** using openblas
x Reconfigured using getmem.nwchem

A Certain Commercial Ab Initio Package (Freq calc of pre-optimised H14C19O3 at 6-31+G*/rb3lyp):
B  :    2h 00 min (CPU time 10h 37 min)
Ne:   1h 37 min (CPU time: 11h 13 min)
Ta:   1h 26 min (CPU time: 5h 27 min)
V  :   2h 15 min (CPU time 15h 50 min)
Using precompiled binaries.

More:
Since I couldn't use Xianyi's openblas with FX 8150 I downloaded the AMD ACML. I've had issues with that before, which is why I haven't been using that as a rule. This time I was motivated enough to hammer it out though. Anyway, here's the cpuid output from the acml 5.2.0:
./cpuid.exe 
Chip manufacturer: AuthenticAMD
AuthenticAMD family 15 extended family 6 model 1
Model Name: AMD FX(tm)-8150 Eight-Core Processor        
Chip supports SSE
Chip supports SSE2
Chip supports SSE3
Chip supports AVX
Chip does not support FMA3
Chip supports FMA4
See the other post from today about build nwchem with acml (hint: use the fma4_int64 libs but avoid mp).

Here's 1055T:
Chip manufacturer: AuthenticAMD
AuthenticAMD family 15 extended family 1 model 10
Model Name: AMD Phenom(tm) II X6 1055T Processor
Chip supports SSE
Chip supports SSE2
Chip supports SSE3
Chip does not support AVX
Chip does not support FMA3
Chip does not support FMA4


Issues

Openblas:
You will get SGEMM related errors trying to build openblas according to the instructions I've posted on this site before. Apparently it has to do with the way the architecture is autoselected during build. Or something. I couldn't make it work.

NwChem:
I tried building nwchem with the internal libs, but had no luck. See other posts on this blog for general instructions. Building with the AMD ACML worked fine though.


Files:

NWChem (opt biphenyl cation, cp-md/pspw):
Title "Test 1"
Start  biphenyl_cation_twisted-1
echo
charge 1
geometry autosym units angstrom
 C     0.00000     -3.54034     0.00000
 C     -1.20296     -2.84049     -0.216000
 C     -1.20944     -1.46171     -0.206253
 C     0.00000     -0.721866     0.00000
 C     1.20944     -1.46171     0.206253
 C     1.20296     -2.84049     0.216000
 C     0.00000     0.721866     0.00000
 C     1.20944     1.46171     -0.206253
 C     1.20296     2.84049     -0.216000
 C     -1.20944     1.46171     0.206253
 C     0.00000     3.54034     0.00000
 C     -1.20296     2.84049     0.216000
 H     0.00000     -4.62590     0.00000
 H     -2.12200     -3.38761     -0.395378
 H     -2.13673     -0.938003     -0.401924
 H     2.12200     -3.38761     0.395378
 H     2.12200     3.38761     -0.395378
 H     -2.13673     0.938003     0.401924
 H     0.00000     4.62590     0.00000
 H     -2.12200     3.38761     0.395378
 H     2.13673     0.938003     -0.401924
 H     2.13673     -0.938003     0.401924
end
nwpw
  simulation_cell
     lattice_vectors
      2.000000e+01 0.000000e+00 0.000000e+00
      0.000000e+00 2.000000e+01 0.000000e+00
      0.000000e+00 0.000000e+00 2.000000e+01
  end
  mult 2
  np_dimensions -1  -1
  tolerances 1e-7  1e-7
end
driver
  default
end
task pspw optimize
NWChem (opt biphenyl cation, geovib, 6-31G**/ub3lyp):
Title "Test 2"
Start  biphenyl_cation_twisted
echo
charge 1
geometry autosym units angstrom
 C     0.00000     -3.56301     0.00000
 C     -1.13927     -2.85928     -0.393841
 C     -1.13879     -1.46545     -0.394153
 C     0.00000     -0.742814     0.00000
 C     1.13879     -1.46545     0.394153
 C     1.13927     -2.85928     0.393841
 C     0.00000     0.742814     0.00000
 C     1.13879     1.46545     -0.394153
 C     1.13927     2.85928     -0.393841
 C     -1.13879     1.46545     0.394153
 C     0.00000     3.56301     0.00000
 C     -1.13927     2.85928     0.393841
 H     0.00000     -4.64896     0.00000
 H     -2.02827     -3.39662     -0.711607
 H     -2.02148     -0.928265     -0.727933
 H     2.02827     -3.39662     0.711607
 H     2.02827     3.39662     -0.711607
 H     -2.02148     0.928265     0.727933
 H     0.00000     4.64896     0.00000
 H     -2.02827     3.39662     0.711607
 H     2.02148     0.928265     -0.727933
 H     2.02148     -0.928265     0.727933
end
basis "ao basis" cartesian print
  H library "6-31G**"
  C library "6-31G**"
END
dft
  mult 2
  XC b3lyp
  mulliken
end
driver
end
task dft optimize
task dft freq numerical