04 September 2012

227. New compute node using AMD FX-8150. Gromacs, nwchem performance/benchmarks

Update: reconfiguring your nwchem binary using getmem.nwchem can speed things up considerably. Most of the runtimes are obtained without using getmem.nwchem and are thus all using the same amount of memory, regardless of what is available. Binaries which have been reconfigured are shown as such.

The short summary: At first I wasn't that happy with my choice of the AMD FX-8150, but after sorting out the ACML libs and getting things benchmarked I'm much more satisfied. The only situation in which I'm not seeing this processor outclass the other systems is with the Commercial Ab Initio Package, which arrived as a precompiled binary (Portland Fortran).

In general it seems that the FX 8150 is about 10% faster than the i5-2400 for the computations I've tested here -- but beware that the AMD processor is using its vendor's own math libs (ACML), while the Intel unit is using openblas.

Note that the AMD Phenom II X6 1055T is SLOWER with the ACML libs than with Openblas.

The Lengthy Preamble
I seem to remember promising myself not to get another AMD since, while they may or may not be 'the good guys' (e.g. the Intel/Dell thingy), empirically I keep on seeing my Quadcore Intel i5-2400 3.1 GHz wiping the floor with my Phenom II X6 1055T 2.8 GHz. Sure, part of the issue is the clock frequency, but the difference seems to be a lot bigger than that.

At any rate, I ended up building a new node for my little cluster. Remember that these are Australian prices. Oh, how I miss you, Newegg -- not just because of the price, but because of the choice.

Luckily it seems like my choice of the FX-8150 has paid off. Also, at the moment of writing the Intel i5-2400 and the AMD FX-8150 sell for the same price locally.


The Setup

It's basically an eight-core 3.6 GHz box with 16 GB RAM (expandable to 32 GB, 4 slots) and a 7200 rpm HDD. I've heard the eight-core FX 8150 uncharitably described as a quad-core with advanced hyper-threading, but I wouldn't be qualified to comment. Interestingly, sinfo registers it as a quad core, while htop and all other programs consider it an 8-core. Finally, looking at this image it looks like the whole 8 core thing is a bit of a cheat -- it has 4 floating point units vs 8 integer processing units.

Gigabyte GA-990FXA-D3 AM3 990FX DDR3 Motherboard AU$ 128 Link
AMD AM3+ x8 FX-8150 3.6 GHz Boxed CPU AU$ 209 Link
PV316G186C0K 16G Kit(8Gx2) DDR3 1866 AU$ 129 Link
Hitachi 3.5" Deskstar 1TB SATA3 HDD 7200rpm AU$ 83 Link
Corsair GS800 V2 ATX Power Supply Unit AU$ 138 Link
TP-LINK TG-3269 PCI Gigabit PCI Network Card AU$ 8 Link
ASUS Vento TA-U11 without PSU AU$ 99 Link
ASUS 1GB GF210 PCI-E VGA Card Link

NOTE that the mobo does NOT have onboard video. I didn't pick up on that before buying the parts, but luckily had an old ATI card floating around.

The fan on the PSU is a bit annoying. It stays off for the most part (some posts say it should never be completely off, one post said it should be) but starts up in a weird way -- basically the electricity is given in small jolts. Or it's just broken. Other than that it works fine.

Preparation
It's for reasons like these I write this blog. After having installed Debian testing I set up NFS, added the box as a node under Sun Grid Engine (Link), set up Gaussian (Link), and compiled Gromacs.

I encountered separate issues trying to compile Openblas (Bulldozer cores aren't supported) and Nwchem with internal libs (odd stuff). I've given up on Openblas and managed to compile nwchem against the AMD ACML. Same went for gromacs -- I eventually recompiled gromacs against ACML. Maybe it's unfair to compare ACML vs Openblas on the i5-2400, but ACML is free, MKL isn't.


Performance -- setup
Note that while I do use NFS it's not in the 'traditional way'. Each node exports a local folder to the front node so that SGE can see it. However, when you run your calcs everything is stored in a local folder, and the number crunching software is compiled and installed locally on each node. In other words, network performance should not affect the benchmarks.

Neon is NOT using openblas, while Boron and Tantalum are. Xianyi's version of openblas won't compile on Bulldozer at the moment (it seems). I will rebuild gromacs with the ACML libs and do the benchmark again.

Also, please note that these 'benchmarks' aren't absolute -- I'm not an expert on optimising performance. You can probably use them to get an idea of the relative computational grunt of the different hardware combinations though.

FX 8150 is a lot more fun with ACML. The Phenom II 1055T is no fun with ACML.

I recompiled nwchem and gromacs on Boron (see below) to see what ACML vs Openblas would be like. I've yet to run those jobs, but will post the results when I have.

Unlike the FX-8150, the Phenom II X6 1055T does not support AVX, FMA3 or FMA4.

Configuration:
Boron (B): Phenom II X6 2.8 GHz, 8 GB RAM (2.8*6=16.8 GFLOPS predicted)
Neon (Ne): FX-8150 X8 3.6 GHz, 16 GB RAM (3.6*8=28.8 GFLOPS predicted (int), 3.6*4=14.4 GFLOPS (fpu))
Tantalum (Ta): Quadcore i5-2400 3.1 GHz, 8 GB RAM (3.1*4=12.4 GFLOPS predicted)
Vanadium (V): Dual socket 2x Quadcore Xeon X3480 3.06 GHz, 8 GB RAM. CentOS (ROCKS 5.4.3)/openblas.
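
The predicted figures above are nothing more than clock speed times the number of units, i.e. they assume one floating point operation per unit per cycle. Here's a minimal Python sketch of that arithmetic. The function name and the flops_per_cycle parameter are my own, and this is the same crude estimate as in the list, not a proper peak-FLOPS calculation:

```python
def predicted_gflops(clock_ghz, units, flops_per_cycle=1):
    """Crude estimate: clock (GHz) x number of units x flops per cycle."""
    return clock_ghz * units * flops_per_cycle

# (clock in GHz, number of units) as listed in the configuration above
nodes = {
    "Boron (Phenom II X6)": (2.8, 6),
    "Neon (FX-8150, integer units)": (3.6, 8),
    "Neon (FX-8150, FPUs)": (3.6, 4),
    "Tantalum (i5-2400)": (3.1, 4),
}

for name, (clock, units) in nodes.items():
    print(f"{name}: {predicted_gflops(clock, units):.1f} GFLOPS predicted")
```

Raising flops_per_cycle would give a more realistic peak for SIMD code, but I don't know the right value for each chip, so it's left at 1 here.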

Results

Gromacs --double (1 ns 6x6x6 nm tip4p water box; dynamic load balancing, double precision, 500k steps)
B :  10.662 ns/day (11.8 GFLOPS, runtime 8104 seconds) ***
B :   9.921 ns/day (10.9 GFLOPS, runtime 8709 seconds) **
Ne:  10.606 ns/day (11.7 GFLOPS, runtime 8146 seconds) *
Ne:  12.375 ns/day (13.7 GFLOPS, runtime 6982 seconds) **
Ne:  12.385 ns/day (13.7 GFLOPS, runtime 6976 seconds) ****
Ta:  10.825 ns/day (11.9 GFLOPS, runtime 7981 seconds) ***
V :  10.560 ns/day (11.7 GFLOPS, runtime 8182 seconds) ***
*no external blas/lapack.
**using ACML libs
*** using openblas
**** using ATLAS
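
The ns/day figures come straight from the Gromacs output, but they can be sanity checked against the runtimes: for a 1 ns run, ns/day is just the number of seconds in a day divided by the runtime. A small Python sketch, assuming the runtime is wall-clock time for the whole 1 ns run (the function name and the run labels are mine):

```python
SECONDS_PER_DAY = 86400

def ns_per_day(simulated_ns, runtime_seconds):
    """Simulated nanoseconds per day of wall-clock time."""
    return simulated_ns * SECONDS_PER_DAY / runtime_seconds

# (label, runtime in seconds) from the double precision list above
runs = [
    ("B  openblas", 8104),
    ("B  ACML", 8709),
    ("Ne no external blas", 8146),
    ("Ne ACML", 6982),
    ("Ne ATLAS", 6976),
    ("Ta openblas", 7981),
    ("V  openblas", 8182),
]

for label, seconds in runs:
    print(f"{label}: {ns_per_day(1.0, seconds):.3f} ns/day")
```

The values agree with what Gromacs reported to within rounding.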

Gromacs --single (1 ns 6x6x6 nm tip4p water box; dynamic load balancing, single precision, 500k steps)
B :  17.251 ns/day (19.0 GFLOPS, runtime 5008 seconds) ***
Ne:  21.874 ns/day (24.2 GFLOPS, runtime 3950 seconds) **
Ne:  21.804 ns/day (24.1 GFLOPS, runtime 3963 seconds) ****
Ta:  17.345 ns/day (19.2 GFLOPS, runtime 4982 seconds) ***
V :  17.297 ns/day (19.1 GFLOPS, runtime 4995 seconds) ***
*no external blas/lapack.
**using ACML libs
*** using openblas
**** using ATLAS

NWChem (opt biphenyl cation, cp-md/pspw):
B :  5951 seconds **
B :  4084 seconds ***
B :  1988 seconds *** x
Ne:  3689 seconds **
Ta:  4102 seconds ***
V :  5396 seconds ***

*no external blas/lapack.
**using ACML libs
*** using openblas
x Reconfigured using getmem.nwchem

NWChem (opt biphenyl cation, geovib, 6-31G**/ub3lyp):
B :  2841 seconds **
B :  2410 seconds ***
B :  2101 seconds *** x
Ne:  1665 seconds **
Ta:  1785 seconds ***
Ta:  1789 seconds *** x
V :  2600 seconds ***

*no external blas/lapack.
**using ACML libs
*** using openblas
x Reconfigured using getmem.nwchem
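
To put a number on 'faster', the runtimes can be turned into a relative speed-up. A hedged Python sketch using the NWChem runtimes above for Neon (ACML) and Tantalum (openblas, not reconfigured); the function name is mine:

```python
def percent_faster(reference_seconds, candidate_seconds):
    """How much faster (in %) the candidate finishes the same job than the reference."""
    return (reference_seconds / candidate_seconds - 1.0) * 100.0

# (job, Tantalum runtime, Neon runtime) in seconds, taken from the lists above
jobs = [
    ("cp-md/pspw", 4102, 3689),
    ("geovib 6-31G**/ub3lyp", 1785, 1665),
]

for job, ta_seconds, ne_seconds in jobs:
    print(f"{job}: Neon is {percent_faster(ta_seconds, ne_seconds):.1f}% faster than Tantalum")
```

This compares different math libraries as well as different CPUs, so treat the percentages as a rough guide only.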

A Certain Commercial Ab Initio Package (Freq calc of pre-optimised H14C19O3 at 6-31+G*/rb3lyp):
B :  2h 00 min (CPU time: 10h 37 min)
Ne:  1h 37 min (CPU time: 11h 13 min)
Ta:  1h 26 min (CPU time: 5h 27 min)
V :  2h 15 min (CPU time: 15h 50 min)
Using precompiled binaries.
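
Dividing CPU time by wall time gives a rough idea of how many cores were kept busy on average. A minimal Python sketch, assuming the reported CPU time is summed over all threads (I haven't verified that for this package) and using my own function names:

```python
def to_minutes(hours, minutes):
    """Convert an (hours, minutes) pair to minutes."""
    return hours * 60 + minutes

def busy_cores(wall, cpu):
    """Average number of busy cores: CPU time divided by wall time."""
    return to_minutes(*cpu) / to_minutes(*wall)

# node -> (wall time, CPU time), each as (hours, minutes), from the list above
timings = {
    "B": ((2, 0), (10, 37)),
    "Ne": ((1, 37), (11, 13)),
    "Ta": ((1, 26), (5, 27)),
    "V": ((2, 15), (15, 50)),
}

for node, (wall, cpu) in timings.items():
    print(f"{node}: about {busy_cores(wall, cpu):.1f} cores busy on average")
```

Under that assumption Neon keeps close to seven of its eight cores busy and still loses to Tantalum on wall time, so the precompiled binary seems to get less work done per core on the FX-8150.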

More:
Since I couldn't use Xianyi's openblas with the FX 8150 I downloaded the AMD ACML. I've had issues with it before, which is why I haven't been using it as a rule. This time I was motivated enough to hammer it out though. Anyway, here's the cpuid output from ACML 5.2.0:
./cpuid.exe 
Chip manufacturer: AuthenticAMD
AuthenticAMD family 15 extended family 6 model 1
Model Name: AMD FX(tm)-8150 Eight-Core Processor        
Chip supports SSE
Chip supports SSE2
Chip supports SSE3
Chip supports AVX
Chip does not support FMA3
Chip supports FMA4
See the other post from today about building nwchem with acml (hint: use the fma4_int64 libs but avoid mp).

Here's 1055T:
Chip manufacturer: AuthenticAMD
AuthenticAMD family 15 extended family 1 model 10
Model Name: AMD Phenom(tm) II X6 1055T Processor
Chip supports SSE
Chip supports SSE2
Chip supports SSE3
Chip does not support AVX
Chip does not support FMA3
Chip does not support FMA4
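
If you don't have the ACML cpuid tool at hand, the same information can be read from /proc/cpuinfo on Linux. A rough Python sketch (the function name is mine; in the kernel's flag list SSE3 shows up as 'pni' and FMA3 as plain 'fma'):

```python
def cpu_features(cpuinfo_text, wanted=("sse", "sse2", "pni", "avx", "fma", "fma4")):
    """Return {flag: bool} based on the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the colon is a whitespace separated list of flags
            flags = set(line.split(":", 1)[1].split())
            return {name: name in flags for name in wanted}
    return {}

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not on Linux, or /proc is not available
    for name, present in cpu_features(text).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

On the FX-8150 I'd expect this to report avx and fma4 but not fma, matching the cpuid output above.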


Issues

Openblas:
You will get SGEMM related errors trying to build openblas according to the instructions I've posted on this site before. Apparently it has to do with the way the architecture is autoselected during build. Or something. I couldn't make it work.

NwChem:
I tried building nwchem with the internal libs, but had no luck. See other posts on this blog for general instructions. Building with the AMD ACML worked fine though.


Files:

NWChem (opt biphenyl cation, cp-md/pspw):
Title "Test 1"
Start  biphenyl_cation_twisted-1
echo
charge 1
geometry autosym units angstrom
 C     0.00000     -3.54034     0.00000
 C     -1.20296     -2.84049     -0.216000
 C     -1.20944     -1.46171     -0.206253
 C     0.00000     -0.721866     0.00000
 C     1.20944     -1.46171     0.206253
 C     1.20296     -2.84049     0.216000
 C     0.00000     0.721866     0.00000
 C     1.20944     1.46171     -0.206253
 C     1.20296     2.84049     -0.216000
 C     -1.20944     1.46171     0.206253
 C     0.00000     3.54034     0.00000
 C     -1.20296     2.84049     0.216000
 H     0.00000     -4.62590     0.00000
 H     -2.12200     -3.38761     -0.395378
 H     -2.13673     -0.938003     -0.401924
 H     2.12200     -3.38761     0.395378
 H     2.12200     3.38761     -0.395378
 H     -2.13673     0.938003     0.401924
 H     0.00000     4.62590     0.00000
 H     -2.12200     3.38761     0.395378
 H     2.13673     0.938003     -0.401924
 H     2.13673     -0.938003     0.401924
end
nwpw
  simulation_cell
     lattice_vectors
      2.000000e+01 0.000000e+00 0.000000e+00
      0.000000e+00 2.000000e+01 0.000000e+00
      0.000000e+00 0.000000e+00 2.000000e+01
  end
  mult 2
  np_dimensions -1  -1
  tolerances 1e-7  1e-7
end
driver
  default
end
task pspw optimize

NWChem (opt biphenyl cation, geovib, 6-31G**/ub3lyp):
Title "Test 2"
Start  biphenyl_cation_twisted
echo
charge 1
geometry autosym units angstrom
 C     0.00000     -3.56301     0.00000
 C     -1.13927     -2.85928     -0.393841
 C     -1.13879     -1.46545     -0.394153
 C     0.00000     -0.742814     0.00000
 C     1.13879     -1.46545     0.394153
 C     1.13927     -2.85928     0.393841
 C     0.00000     0.742814     0.00000
 C     1.13879     1.46545     -0.394153
 C     1.13927     2.85928     -0.393841
 C     -1.13879     1.46545     0.394153
 C     0.00000     3.56301     0.00000
 C     -1.13927     2.85928     0.393841
 H     0.00000     -4.64896     0.00000
 H     -2.02827     -3.39662     -0.711607
 H     -2.02148     -0.928265     -0.727933
 H     2.02827     -3.39662     0.711607
 H     2.02827     3.39662     -0.711607
 H     -2.02148     0.928265     0.727933
 H     0.00000     4.64896     0.00000
 H     -2.02827     3.39662     0.711607
 H     2.02148     0.928265     -0.727933
 H     2.02148     -0.928265     0.727933
end
basis "ao basis" cartesian print
  H library "6-31G**"
  C library "6-31G**"
END
dft
  mult 2
  XC b3lyp
  mulliken
end
driver
end
task dft optimize
task dft freq numerical
