12 September 2012

233. Compiling netlib's lapack and blas on Debian Testing (Wheezy)

In addition to specific BLAS/LAPACK libs such as ACML, MKL, and ATLAS netlib provides (what I understand to be) reference versions of BLAS and LAPACK. Presumably these are slower than optimised versions of blas/lapack, but it doesn't hurt being familiar with them.

Here's how to compile those versions.



BLAS
sudo mkdir /opt/netlib
sudo chown $USER /opt/netlib
mkdir /opt/netlib/blas/lib -p
wget http://www.netlib.org/blas/blas.tgz
tar xvf blas.tgz
cd BLAS/

Edit make.inc
OPTS = -O3 -shared -m64 -march=native -fPIC


make all
gfortran -shared -Wl,-soname,libnetblas.so -o libblas.so.1.0.1 *.o -lc
ln -s libblas.so.1.0.1 libnetblas.so
cp lib*blas* /opt/netlib/blas/lib


To see whether everything linked ok:
 ldd libnetblas.so 
        linux-vdso.so.1 =>  (0x00007ffff1bc6000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00002b42ec030000)
        libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00002b42ec3b8000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00002b42ec6ce000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00002b42ec950000)
        libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00002b42ecb67000)
        /lib64/ld-linux-x86-64.so.2 (0x00002b42ebaf3000)




LAPACK
(inspired by this and this)
mkdir -p /opt/netlib/lapack
sudo apt-get install cmake-curses-gui
cd ~/tmp
wget http://www.netlib.org/lapack/lapack-3.4.1.tgz
tar xvf lapack-3.4.1.tgz
cd lapack-3.4.1/
mkdir build
cd build
ccmake ../

Hit 'c' to generate a configuration. Navigate with arrow keys and hit enter to change values. Change to the values in red:
 
 BUILD_COMPLEX                   *ON
 BUILD_COMPLEX16                 *ON
 BUILD_DOUBLE                    *ON
 BUILD_SHARED_LIBS               *ON
 BUILD_SINGLE                    *ON
 BUILD_STATIC_LIBS               *ON
 BUILD_TESTING                   *ON
 CMAKE_BUILD_TYPE                *     
 CMAKE_INSTALL_PREFIX            */opt/netlib/lapack
 LAPACKE                         *OFF
 LAPACKE_WITH_TMG                *OFF
 USE_OPTIMIZED_BLAS              *ON
 USE_XBLAS                       *OFF

Then hit 'c' which might give you (change the values in red) -- I got some errors about ACML/eula here, but don't worry about that.

NOTE: this will only work if you already have blas installed in a standard location. If you don't get the BLAS_FOUND etc. then you should hit 'c' again and then 'g'. Next edit your CMakeCache.txt and paste the variables (without line numbers) you find below this section, then do ccmake ../ again and make sure everything looks ok, and generate using 'g'.

 BLAS_FOUND                       TRUE
 BLAS_GENERIC_FOUND               ON
 BLAS_GENERIC_blas_LIBRARY        /opt/netlib/blas/lib/libnetblas.so
 BLAS_LIBRARIES                   /opt/netlib/blas/lib/libnetblas.so
 BLAS_LINKER_FLAGS
 BUILD_COMPLEX                   *ON
 BUILD_COMPLEX16                 *ON
 BUILD_DOUBLE                    *ON
 BUILD_SHARED_LIBS               *OFF
 BUILD_SINGLE                    *ON
 BUILD_STATIC_LIBS               *ON
 BUILD_TESTING                   *ON
 CMAKE_BUILD_TYPE                *     
 CMAKE_INSTALL_PREFIX            */usr/local 
 LAPACKE                         *OFF
 LAPACKE_WITH_TMG                *OFF
 USE_OPTIMIZED_BLAS              *ON
 USE_XBLAS                       *OFF
The hit 'c' again. If there were no issues, hit 'g' which writes the configuration and exits.


make
[100%] Building Fortran object TESTING/EIG/CMakeFiles/xeigtstz.dir/__/__/INSTALL/dsecnd_INT_ETIME.f.o
Linking Fortran executable ../../bin/xeigtstz
[100%] Built target xeigtstz

make install
Install the project...
-- Install configuration: ""
-- Installing: /opt/netlib/lapack/lib/pkgconfig/lapack.pc
-- Installing: /opt/netlib/lapack/lib/cmake/lapack-3.4.1/lapack-config.cmake
-- Installing: /opt/netlib/lapack/lib/cmake/lapack-3.4.1/lapack-config-version.cmake
-- Installing: /opt/netlib/lapack/lib/cmake/lapack-3.4.1/lapack-targets.cmake
-- Installing: /opt/netlib/lapack/lib/cmake/lapack-3.4.1/lapack-targets-noconfig.cmake
-- Installing: /opt/netlib/lapack/lib/liblapack.so
-- Removed runtime path from "/opt/netlib/lapack/lib/liblapack.so"
-- Installing: /opt/netlib/lapack/lib/libtmglib.so
-- Removed runtime path from "/opt/netlib/lapack/lib/libtmglib.so"


tree /opt/netlib/ -d
/opt/netlib/
|-- blas
|   `-- lib
`-- lapack
    `-- lib
        |-- cmake
        |   `-- lapack-3.4.1
        `-- pkgconfig

7 directories

CMakeCache.txt variables:
 16 
 17 BLAS_FOUND:STRING=TRUE
 18 
 19 //Whether not the GENERIC library was found and is usable
 20 BLAS_GENERIC_FOUND:BOOL=TRUE
 21 
 22 //Path to a library.
 23 BLAS_GENERIC_blas_LIBRARY:FILEPATH=/opt/netlib/blas/lib/libnetblas.so
 24 
 25 BLAS_LIBRARIES:PATH=/opt/netlib/blas/lib/libnetblas.so
 26 


Testing the libraries:
I built gromacs against the new libs to make sure they 'worked'

sudo mkdir /opt/gromacs
sudo chown ${USER} /opt/gromacs
cd ~/tmp
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-4.5.5.tar.gz
tar xvf gromacs-4.5.5.tar.gz
cd gromacs-4.5.5/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/netlib/blas/lib:/opt/netlib/lapack/lib
export LDFLAGS="-l:/opt/netlib/blas/lib/libnetblas.so -l:/opt/netlib/lapack/lib/liblapack.so"
./configure --disable-mpi --enable-float --with-external-blas --with-external-lapack --program-suffix=_netlib --prefix=/opt/gromacs/gromacs-4.5.5
make
make install

Check that it linked ok:
ldd /opt/gromacs/gromacs-4.5.5/bin/grompp_netlib
        linux-vdso.so.1 =>  (0x00007fffb83f2000)
        libgmxpreprocess.so.6 => /opt/gromacs/gromacs-4.5.5/lib/libgmxpreprocess.so.6 (0x00002b6411cfa000)
        libmd.so.6 => /opt/gromacs/gromacs-4.5.5/lib/libmd.so.6 (0x00002b6411fcd000)
        libfftw3f.so.3 => /usr/lib/x86_64-linux-gnu/libfftw3f.so.3 (0x00002b64123ad000)
        libxml2.so.2 => /usr/lib/x86_64-linux-gnu/libxml2.so.2 (0x00002b64127b0000)
        libgmx.so.6 => /opt/gromacs/gromacs-4.5.5/lib/libgmx.so.6 (0x00002b6412b10000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00002b6412fe5000)
        libnetblas.so => /opt/netlib/blas/lib/libnetblas.so (0x00002b64131e9000)
        liblapack.so => /opt/netlib/lapack/lib/liblapack.so (0x00002b64134cc000)
        libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x00002b6413ece000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00002b64140e6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00002b6414369000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00002b6414585000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00002b641490c000)
        liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00002b6414b24000)
        /lib64/ld-linux-x86-64.so.2 (0x00002b6411ad8000)
        libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00002b6414d47000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00002b641505d000)
        libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00002b6415274000)

Here are some input files (it's not a 'real' md run -- I just needed something small and quick to run):
step1.top:
#include "/opt/gromacs/gromacs-4.5.5/share/gromacs/top/ffoplsaa.itp"
#include "/opt/gromacs/gromacs-4.5.5/share/gromacs/top/oplsaa.ff/tip4p.itp"

[system]
test 

[molecules]

step1.mdp:

integrator = md
define      = -DFLEXIBLE
emtol      = 1000.0
emstep     = 0.001
nsteps     = 5000
nstlist    = 1
ns_type    = grid 
rlist      = 0.9
coulombtype= PME  
rcoulomb   = 0.9  
rvdw       = 1.0  
pbc        =  xyz

genbox_netlib -o step1.gro -cs /opt/gromacs/gromacs-4.5.5/share/gromacs/top/tip4p.gro -box 4x4x4 -p step1.top

grompp_netlib -f step1.mdp -po step2.mdp -p step1.top -pp step2.top -c step1.gro -o step2.tpr

mdrun_netlib -v -s step2.tpr -o step3.trr -x step3.xtc -cpo step3.cpt -c step3.gro -e step3.edr -g step3.log

On my old AMD II X3 I got about 7.7 GFLOPS with Openblas and 7.8 GFLOPS with the above libs. Note that the run is shorter than a minute so it's pretty useless for benchmarking. However, there's no obvious MAJOR penalty.


If you don't have cmake:
cp INSTALL/make.inc.gfortran make.inc

Edit make.inc
 15 FORTRAN  = gfortran
 16 OPTS     = -O2 -fPIC -m64
 17 DRVOPTS  = $(OPTS)
 18 NOOPT    = -O0 -fPIC -m64
 19 LOADER   = gfortran
 20 LOADOPTS =
Edit Makefile
 11 #lib: lapacklib tmglib
 12 lib: blaslib variants lapacklib tmglib
Run make
make
-->  Tests passed: 13176


   -->   LAPACK TESTING SUMMARY  <--
  Processing LAPACK Testing output found in the TESTING direcory
SUMMARY              nb test run  numerical error    other error  
================    =========== ================= ================  
REAL              1077227  0 (0.000%) 0 (0.000%) 
DOUBLE PRECISION 1078039  0 (0.000%) 0 (0.000%) 
COMPLEX           522814  0 (0.000%) 0 (0.000%) 
COMPLEX16          552410  0 (0.000%) 0 (0.000%) 

--> ALL PRECISIONS 3230490  0 (0.000%) 0 (0.000%) 




Older version:
In the oldest version of this post I did the blas compilation by hand:

gfortran -O2 -fPIC -m64 -march=native -funroll-all-loops -c *.f

To build a static library:
ar rvs libblas.a *.o

To build a shared/dynamic library:
gfortran -shared -Wl,-soname,libnetblas.so -o libblas.so.1.0.1 *.o -lc 


ldd libblas.so.1.0.1
        linux-vdso.so.1 =>  (0x00007fff301af000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00002aeeac390000)
        libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00002aeeac718000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00002aeeaca2e000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00002aeeaccb0000)
        libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00002aeeacec7000)
        /lib64/ld-linux-x86-64.so.2 (0x00002aeeabedd000)

Either way:
cp libblas* /opt/netlib/blas/lib

To test:
wget http://www.netlib.org/blas/sblat1
mv sblat1 sblat1.f

And EITHER
gfortran sblat1.f -l:libblas.a

OR
ln -s libblas.so.1.0.1 libnetblas.so
gfortran sblat1.f -l:libnetblas.so

THEN
./a.out
 
Real BLAS Test Program Results
Test of subprogram number  1             SDOT 
                                    ----- PASS -----

 Test of subprogram number  2            SAXPY 
                                    ----- PASS -----

 Test of subprogram number  3            SROTG 
                                    ----- PASS -----

 Test of subprogram number  4             SROT 
                                    ----- PASS -----

 Test of subprogram number  5            SCOPY 
                                    ----- PASS -----

 Test of subprogram number  6            SSWAP 
                                    ----- PASS -----

 Test of subprogram number  7            SNRM2 
                                    ----- PASS -----

 Test of subprogram number  8            SASUM 
                                    ----- PASS -----

 Test of subprogram number  9            SSCAL 
                                    ----- PASS -----

 Test of subprogram number 10            ISAMAX
                                    ----- PASS -----

11 September 2012

232. Compile parallel (threaded) povray 3.7-rc6 on Debian Wheezy

Update 13 May 2013: This build won't work with v3.7-rc7 on debian wheezy if you have libjpeg62 installed. See http://verahill.blogspot.com.au/2013/05/413-povray-37-rc7-on-debian-wheezy.html.

Remove libjpeg62 and it works fine though.

Original post
Expanding my little cluster has got me thinking about additional uses for it. The primary purpose is obviously work i.e. MD simulations using gromacs and ab initio calcs using NWChem and Gaussian. I'm also testing it with John the Ripper to see how well the users of the linux box in the lab are choosing their passwords.

At that point I realised that it'd be sweet to have at least an OMP capable version of povray to speed things up when polishing figures for those elusive journal covers.

Debian testing currently uses v. 3.6.1 of povray but

  POV-Ray 3.6 does not support multithreaded rendering. POV-Ray 3.7 does.

So compile we will although v 3.7 is beta, so be aware.
sudo mkdir /opt/povray
sudo chown $USER /opt/povray

wget http://povray.org/redirect/www.povray.org/beta/source/povray-3.7.0.RC6.tar.gz
tar xvf povray-3.7.0.RC6.tar.gz
cd povray-3.7.0.RC6/
sudo apt-get install libboost-all-dev libpng-dev libjpeg-dev libtiff-dev build-essential libsdl-dev

Note: libboost-all-dev is big. It might be enough with libboost-thread-dev

./configure --prefix=/opt/povray --program-suffix=_3.7 COMPILED_BY="me@here"
===============================================================================
POV-Ray 3.7.0.RC5 has been configured.

Built-in features:
  I/O restrictions:          enabled
  X Window display:          disabled
  Supported image formats:   gif tga iff ppm pgm hdr png jpeg tiff
  Unsupported image formats: openexr

Compilation settings:
  Build architecture:  x86_64-unknown-linux-gnu
  Built/Optimized for: x86_64-unknown-linux-gnu (using -march=native)
  Compiler vendor:     gnu
  Compiler version:    g++ 4.7
  Compiler flags:      -pipe -Wno-multichar -Wno-write-strings -fno-enforce-eh-specs -s -O3 -ffast-math -march=native -pthread

Type 'make check' to build the program and run a test render.
Type 'make install' to install POV-Ray on your system.

The POV-Ray components will be installed in the following directories:
  Program (executable):       /opt/povray/bin
  System configuration files: /opt/povray/etc/povray/3.7
  User configuration files:   $HOME/.povray/3.7
  Standard include files:     /opt/povray/share/povray-3.7/include
  Standard INI files:         /opt/povray/share/povray-3.7/ini
  Standard demo scene files:  /opt/povray/share/povray-3.7/scenes
  Documentation (text, HTML): /opt/povray/share/doc/povray-3.7
  Unix man page:              /opt/povray/share/man
===============================================================================

The way it is configured we can keep our debian version of povray and install the newer version (povray_3.7)

make
make install

Seems like -geometry 1000x1000 doesn't work anymore. Instead use -H1000 -W1000

I've played around with it a little bit and it does parallel (threaded) execution nicely.

wget http://www.ms.uky.edu/~lee/visual05/povray/fourcube7.pov
./povray_3.7 -H1000 -W1000 fourcube7.pov +A0.1
takes 9 seconds on an AMD II X3. The standard, serial Debian version takes 21 seconds.

231. Compiling john the ripper: single/serial, parallel/OMP and MPI

Update: updated for v1.7.9-jumbo-7 since hccap2john in 1.7.9-jumbo-6 was broken

For no particular reason at all, here's how to compile John the Ripper on Debian Testing (Wheezy). It's very easy, and this post is probably a bit superfluous. The standard version only supports serial and parallel (OMP). See below for MPI.


The regular version: 

mkdir ~/tmp
cd ~/tmp
wget http://www.openwall.com/john/g/john-1.7.9.tar.gz
tar xvf john-1.7.9.tar.gz
cd john-1.7.9/src

If you don't edit the Makefile you build a serial/single-threaded binary.
If you want to build a threaded version for a single node with a multicore processor (OMP) do:
Edit Makefile and uncomment row 19 or 20

 18 # gcc with OpenMP
 19 OMPFLAGS = -fopenmp
 20 OMPFLAGS = -fopenmp -msse2
make clean linux-x86-64
cd ../run

You now have a binary called john in your ../run folder.


The Jumbo version:
If you want to build a distributed version with MPI (can split jobs across several nodes) you need the enhanced, community version:

sudo apt-get install openmpi-bin libopenmpi-dev
cd ~/tmp
wget http://www.openwall.com/john/g/john-1.7.9-jumbo-7.tar.gz
tar xvf john-1.7.9-jumbo-7.tar.gz 
cd john-1.7.9-jumbo-7/src

Edit the Makefile
  20 ## Uncomment the TWO lines below for MPI (can be used together with OMP as well)
  21 ## For experimental MPI_Barrier support, add -DJOHN_MPI_BARRIER too.
  22 ## For experimental MPI_Abort support, add -DJOHN_MPI_ABORT too.
  23 CC = mpicc -DHAVE_MPI
  24 MPIOBJ = john-mpi.o

and do
make clean linux-x86-64-native
cd ../run

I had a look at the passwords on one of our lab boxes -- it immediately discovered that someone had used 'password' as the password...


These test were run on my old AMD II X3 445. Processes which don't speed up with MP are highlighted in red. LM DES is borderline -- it's faster, but doesn't scale well.

Here's the single thread/serial version:
./john --test
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     2906K c/s real, 2918K c/s virtual
Only one salt:  2796K c/s real, 2807K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     95564 c/s real, 95948 c/s virtual
Only one salt:  93593 c/s real, 93781 c/s virtual
Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    14094 c/s real, 14122 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    918 c/s real, 919 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short:  474316 c/s real, 475267 c/s virtual
Long:   1350K c/s real, 1356K c/s virtual
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    39843K c/s real, 39923K c/s virtual
Benchmarking: generic crypt(3) [?/64]... DONE
Many salts:     262867 c/s real, 263393 c/s virtual
Only one salt:  260121 c/s real, 260642 c/s virtual
Benchmarking: Tripcode DES [48/64 4K]... DONE
Raw:    369843 c/s real, 370584 c/s virtual
Benchmarking: dummy [N/A]... DONE
Raw:    99512K c/s real, 99712K c/s virtual
Here's the OMP version:
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     6706K c/s real, 2555K c/s virtual
Only one salt:  5015K c/s real, 2091K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     205670 c/s real, 85411 c/s virtual
Only one salt:  238524 c/s real, 86720 c/s virtual
Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    38400 c/s real, 13812 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    2306 c/s real, 845 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short:  474675 c/s real, 476581 c/s virtual
Long:   1332K c/s real, 1335K c/s virtual
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    49046K c/s real, 16785K c/s virtual
Benchmarking: generic crypt(3) [?/64]... DONE
Many salts:     721670 c/s real, 246640 c/s virtual
Only one salt:  699168 c/s real, 239605 c/s virtual
Benchmarking: Tripcode DES [48/64 4K]... DONE
Raw:    367444 c/s real, 369657 c/s virtual
Benchmarking: dummy [N/A]... DONE
Raw:    100351K c/s real, 100552K c/s virtual
And here's the MPI version:
mpirun -n 3 ./john --test
(note that this includes a great many more tests than the default version)
Benchmarking: Traditional DES [128/128 BS SSE2-16]... (3xMPI) DONE
Many salts:     8533K c/s real, 8707K c/s virtual
Only one salt:  7705K c/s real, 8110K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... (3xMPI) DONE
Many salts:     279808 c/s real, 282634 c/s virtual
Only one salt:  273362 c/s real, 276096 c/s virtual
Benchmarking: FreeBSD MD5 [128/128 SSE2 intrinsics 12x]... (3xMPI) DONE
Raw:    65124 c/s real, 65781 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... (3xMPI) DONE
Raw:    2722 c/s real, 2749 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... (3xMPI) DONE
Short:  1387K c/s real, 1415K c/s virtual
Long:   3880K c/s real, 3959K c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... (3xMPI) DONERaw:    114781K c/s real, 115940K c/s virtual

I don't quite understand the Kerberos results.



Other targets of interest are:

linux-x86-64-avx         Linux, x86-64 with AVX (2011+ Intel CPUs)
linux-x86-64-xop         Linux, x86-64 with AVX and XOP (2011+ AMD CPUs)
linux-x86-64             Linux, x86-64 with SSE2 (most common)
linux-x86-avx            Linux, x86 32-bit with AVX (2011+ Intel CPUs)
linux-x86-xop            Linux, x86 32-bit with AVX and XOP (2011+ AMD CPUs)
linux-x86-sse2           Linux, x86 32-bit with SSE2 (most common, if 32-bit)
linux-x86-mmx            Linux, x86 32-bit with MMX (for old computers)
linux-x86-any            Linux, x86 32-bit (for truly ancient computers)

The FX 8150 does AVX and XOP, while my 1055T doesn't.

The community version has more options:

linux-x86-64-native      Linux, x86-64 'native' (all CPU features you've got)
linux-x86-64-gpu         Linux, x86-64 'native', CUDA and OpenCL (experimental)
linux-x86-64-opencl      Linux, x86-64 'native', OpenCL (experimental)
linux-x86-64-cuda        Linux, x86-64 'native', CUDA (experimental)
linux-x86-64-avx         Linux, x86-64 with AVX (2011+ Intel CPUs)
linux-x86-64-xop         Linux, x86-64 with AVX and XOP (2011+ AMD CPUs)
linux-x86-64[i]          Linux, x86-64 with SSE2 (most common)
linux-x86-64-icc         Linux, x86-64 compiled with icc
linux-x86-64-clang       Linux, x86-64 compiled with clang
linux-x86-gpu            Linux, x86 32-bit with SSE2, CUDA and OpenCL (experimental)
linux-x86-opencl         Linux, x86 32-bit with SSE2 and OpenCL (experimental)
linux-x86-cuda           Linux, x86 32-bit with SSE2 and CUDA (experimental)
linux-x86-sse2[i]        Linux, x86 32-bit with SSE2 (most common, 32-bit)
linux-x86-native         Linux, x86 32-bit, with all CPU features you've got (not necessarily best)
linux-x86-mmx            Linux, x86 32-bit with MMX (for old computers)
linux-x86-any            Linux, x86 32-bit (for truly ancient computers)
linux-x86-clang          Linux, x86 32-bit with SSE2, compiled with clang
linux-alpha              Linux, Alpha
linux-sparc              Linux, SPARC 32-bit
linux-ppc32-altivec      Linux, PowerPC w/AltiVec (best)
linux-ppc32              Linux, PowerPC 32-bit
linux-ppc64              Linux, PowerPC 64-bit
linux-ia64               Linux, IA-64