Pages

29 September 2012

249. Quick but precise isotopic pattern (isotope envelope) calculator in Octave

UPDATE: Below is an accurate calculator,  but it is impractically slow for large molecules. A practical AND accurate calculator is found here:http://verahill.blogspot.com.au/2012/10/isotopic-pattern-caculator-in-python.html

Use the post below to learn about the fundamental theory, but then look at the other post to understand how to implement it.

Old post:
Getting fast and accurate isotopic patterns can be tricky using tools available online, for download or which form part of commercial packages. A particular problem is that different tools give slightly different values -- so which one to trust?

The answer: the tool for which you know that the algorithm is sound.

The extreme conclusion of that way of thinking is to write your own calculator.
Below is the conceptual process of calculating the isotopic pattern of a molecule using GNU Octave.

You need the linear algebra package:
sudo apt-get install octave octave-linear-algebra

b is the isotopic distribution for an element, and bb are the masses of those isotopes.

Once you've got a computational engine it's not too difficult to expand it for more general cases, account for charge, and instrument resolution.


Molecule: Cl4

b=[0.7578,0.2422];
bb=[34.96885,36.96885];
e=prod(cartprod(b,b,b,b),2);
ee=sum(cartprod(bb,bb,bb,bb),2);
n=4;
g=histc([ee e],linspace(min(ee),max(ee),n*(max(ee)-min(ee)+1)),2);
h=linspace(min(ee),max(ee),n*(max(ee)-min(ee)+1));
distr=e'*g;
plot(h,100.*distr/max(distr))
[h' (100.*distr/max(distr))']
Here's the output for n=1:
   139.87540    78.22048
   140.87540     0.00000
   141.87540   100.00000
   142.87540     0.00000
   143.87540    47.94141
   144.87540     0.00000
   145.87540    10.21502
   146.87540     0.00000
   147.87540     0.81620

And here's the output from Matt Monroe's calculator:
Isotopic Abundances for Cl4
  Mass/Charge Fraction  Intensity
   139.87541 0.3297755   78.22
   140.87541 0.0000000    0.00
   141.87541 0.4215974  100.00
   142.87541 0.0000000    0.00
   143.87541 0.2021197   47.94
   144.87541 0.0000000    0.00
   145.87541 0.0430662   10.22
   146.87541 0.0000000    0.00
   147.87541 0.0034411    0.82


Another molecule: Li2Cl2

Here's the code:
a=[0.0759,0.9241];
aa=[6.01512,7.01512];
b=[0.7578,0.2422];
bb=[34.96885,36.96885];
e=prod(cartprod(a,a,b,b),2);
ee=sum(cartprod(aa,aa,bb,bb),2);
n=1;
g=histc([ee e],linspace(min(ee),max(ee),n*(max(ee)-min(ee)+1)),2);
h=linspace(min(ee),max(ee),n*(max(ee)-min(ee)+1));
distr=e'*g;
plot(h,100.*distr/max(distr))
[h' (100.*distr/max(distr))']

ans =

    81.96794     0.67170
    82.96794    16.35626
    83.96794   100.00000
    84.96794    10.45523
    85.96794    63.71604
    86.96794     1.67079
    87.96794    10.17116

vs Matt Monroe's calculator:
Isotopic Abundances for Li2Cl2
  Mass/Charge Fraction  Intensity
    81.96795 0.0033082    0.67
    82.96795 0.0805564   16.36
    83.96795 0.4925109  100.00
    84.96795 0.0514932   10.46
    85.96795 0.3138084   63.72
    86.96795 0.0082288    1.67
    87.96795 0.0500941   10.17

We can then expand the code to allow for plotting
a=[0.0759,0.9241];
aa=[6.01512,7.01512];
b=[0.7578,0.2422];
bb=[34.96885,36.96885];
e=prod(cartprod(a,a,b,b),2);
ee=sum(cartprod(aa,aa,bb,bb),2);
n=1;

g=histc([ee e],linspace(min(ee),max(ee),n*(max(ee)-min(ee)+1)),2);
h=linspace(min(ee),max(ee),n*(max(ee)-min(ee)+1));
distr=e'*g;
gauss= @(x,c,r,s) r.*1./(s.*sqrt(2*pi)).*exp(-0.5*((x-c)./s).^2);
k=100.*distr/max(distr);

npts=1000;
resolution=0.25;

x=linspace(min(ee)-1,max(ee)+1,npts);
l=cumsum(gauss(x,h',k',resolution));
l=100*l./max(l(rows(l),:));
plot(x,l(rows(l),:))

which gives:

Compare with Matt Monroe's calculator:

28 September 2012

248. Matt Monroe's Molecular Weight Calculator under Wine on Linux

I've downloaded the source code to Matt Monroe's molecular weight calculator in the past, and having replaced wsearch32 (+wine) with OpenChrom I figured I'd go online, download it and see what Mono can do for me. I had a vague recollection that the source code was only freely available online for a short while, and as it turns out I couldn't find it this time.

Anyway, not finding the source code I decided to update my Molecular Weight calculator from version 6.46 to 6.49 which (finally) allows you to set the charge of an ion WITHOUT having the mass of a H+ added for each charge. It's not difficult to compensate for, but it's always confusing to new students.

1. Install Wine and winetricks, add dlls
You can either install wine from the repos (old version)
sudo apt-get install wine-bin

Or you can download a newer, unstable version from dev.carbon-project.org:
http://verahill.blogspot.com.au/2012/01/debian-testingwheezy-64-bit-installing.html

Or you can compile your own:
http://verahill.blogspot.com.au/2013/01/308-compiling-wine-1521-on-debian.html



The mono step was a right headache and would fail unless I nuked everything winetricks and wine knew about each other/

To get winetricks and set everything up:

sudo apt-get install cabextract
wget http://winetricks.org/winetricks
sudo mv winetricks /usr/local/bin/
sudo chmod +x /usr/local/bin/winetricks
wget http://downloads.sourceforge.net/project/wine/Wine%20Mono/0.0.4/wine-mono-0.0.4.msi
wine msiexec /i wine-mono-0.0.4.msi
You're now asked whether to download and install mono...sigh...more often that not this has failed in the past.
winetricks vcrun6sp6
Download the file from the browser window that just opened
cd ~/.cache/winetricks/vcrun6sp6 
mv ~/Downloads/Vs6sp6.exe .
winetricks vcrun6sp6
winetricks corefonts
winetricks riched30
wine uninstaller --remove '{E45D8920-A758-4088-B6C6-31DBB276992E}'
winetricks dotnet20
cd ~/.cache/winetricks/dotnet20/
mv ~/Downloads/dotnetfx.exe .
winetricks dotnet20
Ignore this error. Installation will take a while after that. Have patience. Like 10 minutes kind of patience.
And finally:
winetricks wsh57


2. Download the molecular weight calculator and install
If you go to http://www.alchemistmatt.com/mwtwin.html
you get redirected to here: http://omics.pnl.gov/software/MWCalculator.php


cd ~/tmp
mkdir molw
cd molw/
wget http://omics.pnl.gov/installers/MolecularWeightCalculator_Installer.zip
wget http://omics.pnl.gov/installers/MwtWinDll_SourceAndSupportingDLLs.zip
ls *.zip|xargs -I {} unzip {}
unzip MwtWinDll_Source_v3.4.4518.zip

You'll get some warning about Revisionhistory.txt etc. being overwritten. That's fine.

Launch the install with
wine msiexec /i MolecularWeightCalculator.msi




If you try to launch the mol weight calculator at this point you'll get an error about a missing MwtWinDLL.dll:

So sort that out:

cd ~/tmp/molw/bin
regsvr32 MwtWinDll.dll
Successfully registered DLL MwtWinDll.dll
[If I tried to copy the dll to the wine structure first and register that copy I got:
DllRegisterServer not implemented in DLL C:\windows\system\MwtWinDll.dll]
If it seems weird to install wine-mono and then remove it as is done above, it's to get around a bug which causes dotnet20 installation to fail/


Anyway, you're pretty much done:
 wine ~/.wine/drive_c/Program\ Files/Molecular\ Weight\ Calculator/mwtwin.exe

Yay!






Comment:
Getting there was a bit of a trek, passing though a whole lot of different sets of dlls:
winetricks msflxgrd
winetricks vcrun2005
winetricks vb6run
winetricks mdac28
winetricks comctl32ocx
winetricks comctl32

The solution above should suffice though.

I even ended up installing mol weight calc on a windows box and using dependency walker, but not even that sorted it out -- googling for scrrun did it in the end.

In particular this last error was bloody annoying:
"Object doesn't support this action." What, saving?

"Error saving default options file. Use the /X switch at the command line to prevent this error."
But it was solved by doing winetricks wsh57

27 September 2012

247. Setting up Openchrom (and using it to open Agilent .D ESI-MS files on Linux)

I've been using Wsearch (http://www.wsearch.com.au/wsearch32/wsearch32.htm) to process agilent chemstation ESI-MS spectra for the past few years. It and Matt Monroe's Molecular Weight calculator (why, oh why is there no comparable molecular weight calculator for linux?) have been the only two reasons why I've bother with Wine under Linux. Openchrom is written in java and so will run on both Good (Linux) and Evil (OS X and Win) operating systems.

Having finally discovered openchrom (v 0.6 so still early days) I can now finally retire wsearch from my own computers (it's still a good piece of software, but it's crippled to encourage the purchasing of a 'full' version, and I've had no luck purchasing a license from the author in spite of having tried several times during the past couple of years). OpenChrom can export an entire agilent experiment as a '3D' csv file which makes processing a lot more fun.

As an aside: I hate proprietary file formats since they prevent me from using my own tools (cat, sed, gawk, gnuplot, octave) when processing -- or at a minimum make it more difficult. Most universities and grant agencies now add a provision regarding data management in their grant acceptance agreements/work conduct policies. In general these provisions state that the data shall be made available publicly and /or managed by a university repository. What is REALLY missing is a clause about using open formats -- and that should be taken into account when acquiring new instrumentation. All else being equal, an instrument which is 'open' will be a lot cheaper to manage in the long run since you won't have to feel locked in in terms of software. That's incidentally a reason why I like Metrohm since they provide details of their RS-232 interface allowing you to write your own software.

Anyway, here's how to get set up:


1. Install Java v1.7 (need > 1.6)
You can either use openjdk 7 or (Oracle) Java. See here for a general guide to installing Oracle/Sun Java.

As for openjdk, you can easily install it:
sudo apt-get install openjdk-7-jdk

(the openjdk-7-jre package is enough if you don't want the full developer's kit)

Anyway.

Make sure that you've selected the right version:
 sudo update-alternatives --config java
There are 7 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
------------------------------------------------------------
  0            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java   1061      auto mode
  1            /usr/bin/gij-4.4                                 1044      manual mode
  2            /usr/bin/gij-4.6                                 1046      manual mode
  3            /usr/bin/gij-4.7                                 1047      manual mode
  4            /usr/lib/jvm/j2re1.6-oracle/bin/java             314       manual mode
  5            /usr/lib/jvm/j2sdk1.6-oracle/jre/bin/java        315       manual mode
  6            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java   1061      manual mode
 *7            /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java   1051      manual mode



2. Get openchrom
cd ~/tmp
wget http://sourceforge.net/projects/openchrom/files/REL-0.6.0/openchrom_linux.gtk.x86_64_0.6.0.zip
unzip openchrom_linux.gtk.x86_64_0.6.0.zip
cd linux.gtk.x86_64/OpenChrom/
sudo mkdir /opt/openchrom
sudo chown $USER /opt/openchrom 
cp * -R /opt/openchrom
chmod +x /opt/openchrom/openchrom

Stick

alias openchrom='/opt/openchrom/openchrom'

in your ~/.bashrc and source it.




3. Get plugins
On first boot you're asked whether you want to get additional plugins using the 'Openchrom marketplace'. Since I'm mainly processing data from an Agilent ESI-MS, I wanted the plugin for Agilent files. The website says that you need a license key for plugins BUT that it's free to register for one.

This is a 30-days trial version. Afterwards, you need a valid serial key.
You can get a free serial key after registration on http://www.openchrom.net.
You can use the converter for commercial or non-commercial purposes free of charge, but you are not allowed to redistribute this software without my permission.

Note, that clicking on links on the website didn't lead me to a link to download the plugin. Instead, in OpenChrom click on the Plug-ins menu:






As always, make sure you trust your suppliers.


And then you're done installing.

There's nothing odd about registering other than this: you will receive an email with a confirmation of your registration in clear text WITH YOUR PASSWORD. So...be aware of that.


4. You can now browse in the tree to the left and select your .D folder:

There's a bit of clever thinking when it comes to the functionality of the program. The upside of this I think will eventually be that it's easy to get a consistent experience for a set of users (not unimportant for a research group). The downside is that it's a bit clunky getting started. Play with it for an hour and you'll get the hang of it, so it's not really that much of a hurdle. Also, too many options seem to be context sensitive -- I am having real trouble finding various options under the 'Accurate' perspective which I can find under the 'default' perspective.


5. Some comments:

It's still early days for OpenChrom (v 0.6) , and there are a few minor issues which may or may not affect you:

* Registration keys. They are easy enough to get (register online, log in, click on the plug in online that you want and you'll see the key), but if you have installed a new plug in and open your first spectrum right after that you'll be asked for registration keys. It won't tell you for which plug in the dialogue you're seeing is though, so if you've just installed three different plug ins you'll have to do some trial-and-error. This is fixed in the upcoming version.

* Raw/gaussian plot of mass spectrum. This took a while to figure out, but you have to use perspectives. The default (heavily zoomed in) view looks like this:
This might be good enough for those organic types...us 108 element inorganic types want more detail
If you go to the top right corner, click on 'other' (perspectives)
and select accurate you get
Bingo!
And then there's the obligatory wish list:

* A good quality isotopic pattern calculator would be nice. Anyone who has compared the output from different pieces of software will have discovered that different calculators may yield very different patterns. I think some of it boils down to truncation rather than incorrect isotopic ratios, but that just highlights how difficult it can be to implement a seemingly simple concept. The only calculator which I trust AND find useful is Matt Monroe's calculator -- the predicted patterns look good, and you get proper Gaussian broadening which means that it looks 'right'. This would be perfect as a plug-in. If only I knew how to properly implement it...

* A good quality ion generator -- some pieces of software (Hi Matt) allow you to select a handful of elements or fragments, pick a range of charges, input an m/z value and based on that spits out a list over possible identities for your signal. It's a good thing to have by your side the first time you look at a complex mixture trying to figure out what products may be present. This would be perfect as a plug-in. I've written this type of programmes before, but in python using for-loops...a vectorized version should be faster and maybe even easier to write.

25 September 2012

246. Cluster network performance testing (very basic) on Debian Testing using a gigabit switch

Playing with hpcc got me thinking about my network connection.

My cluster looks like this:
I've got four nodes which are connected via two networks, 192.168.2.0/24 and 192.168.1.0/24. The 192.168.1.0/24 network is connected using a gigabit switch. Be (see below) acts as the gateway. The 192.168.2.0/24 network is connected via a crappy old netgear 10/100 router (dhcp) and provides access to the outside world (hello mac spoofing :) ). Each box shares a folder via nfs using a unique name.
_Nodes_
Be: AMD II X3, 8 GB ram (192.168.1.1): Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
Ta: Intel i5-2400, 8 GB ram (192.168.1.150):  Intel Corporation 82579LM Gigabit Network Connection (rev 04)
B: AMD Phenom II X6, 8 GB ram (192.168.1.101): Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
Ne: AMD FX 8150 X8, 16 GB ram (192.168.1.120): Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)

So, time to test the network performance:
sudo apt-get install iperf

On all your boxes (e.g. using clusterssh) start the iperf daemon
iperf -s

Then on each of your nodes run:
iperf -c 192.168.1.1 && iperf -c 192.168.1.101 && iperf -c 192.168.1.150 && iperf -c 192.168.1.120

------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 45.7 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.101 port 37893 connected with 192.168.1.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   564 MBytes   473 Mbits/sec
------------------------------------------------------------
Client connecting to 192.168.1.101, TCP port 5001
TCP window size:  169 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.101 port 35926 connected with 192.168.1.101 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  15.5 GBytes  13.3 Gbits/sec
------------------------------------------------------------
Client connecting to 192.168.1.150, TCP port 5001
TCP window size: 22.9 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.101 port 48257 connected with 192.168.1.150 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   564 MBytes   473 Mbits/sec
------------------------------------------------------------
Client connecting to 192.168.1.120, TCP port 5001
TCP window size: 22.9 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.101 port 43236 connected with 192.168.1.120 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   617 MBytes   517 Mbits/sec


Overall, this is what I got
Client/Server (MBit/s)
     Be     B     Ta    Ne
Be   13.7G  310   308   316
B    564    15.5G 564   617
Ta   726    660   19.7G 936
Ne   882    484   917   19.4G

I'm not sure whether to expect a metric gigabit (1000 metric MBit) or a binary one (1024 binary MBit), but looking at our results our best is 936 Mbit/s and worst 308 Mbit/s. All of them should thus ideally reach at least 936 MBit/s. They all have gigabit network card.

And now, try to improve it:
I went through the whole shebang with
sudo ifconfig eth1 mtu 9000
sudo ifconfig eth1 mtu 8000
etc.
Anyway, I got the following MTUs that way:
Be  7100
B    7100
Ne  9000
Ta   9000

I then set the MTUs to 7100 on all the nodes and tried pinging from node to node, e.g.:
ping -s 7072 -M do 192.168.1.101

Well, that maxed out at 1472 i.e. about MTU 1500 which was the original value. So I'm a bit confused.


Settings:
Be:
eth1      Link encap:Ethernet  HWaddr 00:f0:4d:83:0a:48  
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2f0:4dff:fe83:a48/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:24124966 errors:0 dropped:27064 overruns:0 frame:0
          TX packets:19569426 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:25859945667 (24.0 GiB)  TX bytes:14200267703 (13.2 GiB)
B:
eth1      Link encap:Ethernet  HWaddr 02:00:8c:50:2f:6b  
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::8cff:fe50:2f6b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:14540970 errors:0 dropped:36651 overruns:0 frame:0
          TX packets:16801915 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:12347398135 (11.4 GiB)  TX bytes:18008416370 (16.7 GiB)
Ta:
eth1      Link encap:Ethernet  HWaddr 78:2b:cb:b3:a4:b7  
          inet addr:192.168.1.150  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::7a2b:cbff:feb3:a4b7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:14717233 errors:0 dropped:68232 overruns:0 frame:0
          TX packets:17769966 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:13860096243 (12.9 GiB)  TX bytes:20207270880 (18.8 GiB)
          Interrupt:20 Memory:e1a00000-e1a20000 
Ne:
eth1      Link encap:Ethernet  HWaddr 90:2b:34:93:75:e6  
          inet addr:192.168.1.120  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::922b:34ff:fe93:75e6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:13567520 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10710054 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:13086635236 (12.1 GiB)  TX bytes:12381041605 (11.5 GiB)

245. Recompile debian's hpcc with other libs

I installed hpcc using apt-get, but -- and this is a first -- when trying to run it complained over missing libs.



Why compile?

hpcc
hpcc: error while loading shared libraries: libatlas.so.3gf: cannot open shared object file: No such file or directory
Doing
aptitude show hpcc 
Depends: libatlas3gf-base, libc6 (>= 2.7), libopenmpi1.3, mpi-default-bin
apt-cache search libatlas.so.3gf

libatlas3-base - Automatically Tuned Linear Algebra Software, generic shared
libatlas3gf-base - Transitional package to libatlas3-base
and doing
 aptitude search atlas|grep ^i


i   libatlas-dev                    - Automatically Tuned Linear Algebra Softwar
i A libatlas3gf-base                - Transitional package to libatlas3-base
but
locate libatlas.so.3gf
comes up empty.

So build your own:
sudo mkdir /opt/hpcc
sudo chown $USER /opt/hpcc
cd /opt/hpcc
wget http://ftp.de.debian.org/debian/pool/main/h/hpcc/hpcc_1.4.1.orig.tar.gz
tar xvf hpcc_1.4.1.orig.tar.gz
cd hpcc-1.4.1/
wget http://ftp.de.debian.org/debian/pool/main/h/hpcc/hpcc_1.4.1-2.debian.tar.gz
tar xvf hpcc_1.4.1-2.debian.tar.gz
patch -i debian/patches/add-Make.Debian.patch

Edit Make.Debian. For some reason LAdir is ignored, hence the -L option in LAlib
 78 # ----------------------------------------------------------------------
 79 # - MPI directories - library ------------------------------------------
 80 # ----------------------------------------------------------------------
 81 # MPinc tells the  C  compiler where to find the Message Passing library
 82 # header files,  MPlib  is defined  to be the name of  the library to be
 83 # used. The variable MPdir is only used for defining MPinc and MPlib.
 84 #
 85 MPdir        =/usr/lib/openmpi/lib/
 86 MPinc        =
 87 MPlib        =-lmpi
 88 #
 89 # ----------------------------------------------------------------------
 90 # - Linear Algebra library (BLAS or VSIPL) -----------------------------
 91 # ----------------------------------------------------------------------
 92 # LAinc tells the  C  compiler where to find the Linear Algebra  library
 93 # header files,  LAlib  is defined  to be the name of  the library to be
 94 # used. The variable LAdir is only used for defining LAinc and LAlib.
 95 #
 96 LAdir        = /opt/ATLAS/lib
 97 LAinc        =
 98 LAlib        = -L/opt/ATLAS/lib -ltatlas
 99 #

The above assumes that you've compiled your own openblas as shown elsewhere on this blog. You can use whatever math libs you want. Again, there are a couple described on this blog (acml, netlib blas/lapack, openblas, ATLAS). I've had success with the netlib blas/lapack and atlas (built with netlib lapack).

mv Make.Debian hpl/
make arch=Debian

Hopefully everything went well. Now you need an input file.
cp _hpccinf.txt hpccinf.txt

Edit hpccinf.txt:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
1000         Ns
1            # of NBs
80           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
3            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB

Launch by doing
mpirun -n X ./hpcc
where X=Ps times Qs (e.g. 3 in the example above).

I put the hpccinf.txt in a shared (nfs) folder (~/jobs), created a file called myhost

tantalum slots=2 max_slots=4
boron slots=2 max_slots=6
neon slots=2 max_slots=8
 and then launched using
mpirun -n 4 -hostfile myhost /opt/hpcc/hpcc-1.4.1/./hpcc

21 September 2012

244. Molden on debian testing

Update: avogadro can write gamess input files, but seems to offer little in the way of showing detailed output from gamess output files. Also, some of the input files contain keywords which don't exist.

Original post:
Nothing beats a good GUI, so after butting heads with gabedit again (and losing - again. Although in this case I think I tried to make it do something it wasn't designed to) I've decided to try Molden.

To download, go here, make sure to be a good citizen and register yourself as a user (will help motivate funding for development) then download: http://www.cmbi.ru.nl/molden/howtoget.html

cd ~/tmp
wget ftp://ftp.cmbi.ru.nl/pub/molgraph/molden/molden5.0.tar.gz
tar xvf molden5.0.tar.gz
cd molden5.0/

Edit makefile and remove -lXmu from line 20:

16 CC = cc
17 FC = gfortran
18 LIBS =  -lX11 -lm
19 LDR = ${FC} 
20 LIBSG = -L/usr/X11R6/lib -lGLU -lGL -lX11 -lm

cd surf/

edit Makefile and change it from

 46 depend: $(DEPEND)
 47     @ echo making dependencies...
 48     @ echo ' ' > makedep
 49     @ makedepend $(INCLUDE) -f makedep $(DEPEND)

to

 46 depend: $(DEPEND)
 47     @ echo making dependencies...
 48     @ echo ' ' > makedep
 49     @ $(CC) $(INCLUDE) -M $(DEPEND) > makedep

Save and go back up one level, and run make:
 cd ../
 make

You're pretty much done.

I like putting things in /opt, so
sudo mkdir /opt/molden
sudo chown $USER /opt/molden
cp ~/tmp/molden5.0/* -R /opt/molden

stick
export PATH=$PATH:/opt/molden
in your ~/.bashrc

Type
molden
to run

Molden can read output files from gamess -- still exploring the exact capabilities, but e.g the convergence information can be accessed:


and you can get nifty contour plots of the electron density of orbitals etc.


Error:

If you don't edit the surf/Makefile as shown above you'll get

make[1]: Leaving directory `/home/me/tmp/molden5.0/ambfor'
make -C surf depend
make[1]: Entering directory `/home/me/tmp/molden5.0/surf'
making dependencies...
make[1]: makedepend: Command not found
make[1]: *** [depend] Error 127
make[1]: Leaving directory `/home/me/tmp/molden5.0/surf'

20 September 2012

243. My own personal benchmarks for NWChem, gromacs with atlas, openblas, acml on AMD and intel

Update: you can compile against acml on intel as well, and against mkl on amd. Still need to do some performance testing to see how well it works. The artificial penalty of running mkl on AMD is well-publicised and led to a lawsuit, but I don't know how acml performs on mkl.


The title says it all, really. Since I'm back to exploring ways of improving performance for my little cluster I figured I'd break this out as a separate post. Most of this data was found here before: http://verahill.blogspot.com.au/2012/09/new-compute-node-using-amd-fx-8150.html

All units are running up-to-date debian testing (wheezy).

Configuration:
Boron (B): Phenom II X6 2.8 GHz, 8Gb RAM (2.8*6=16.8 GFLOPS predicted)
Neon (Ne): FX-8150 X8 3.6 GHz, 16 Gb RAM (3.6*8=28.8 GFLOPS predicted (int), 3.6*4=14.4 GFLOPS (fpu))
Tantalum (Ta): Quadcore i5-2400 3.1 GHz, 8 Gb RAM (3.1*4=12.4 GFLOPS predicted)
Vanadium (V):  Dual socket 2x Quadcore Xeon X3480 3.06 GHz, 8Gb. CentOS (ROCKS 5.4.3)/openblas.

Results

Gromacs --double (1 ns 6x6x6 nm tip4p water box; dynamic load balancing, double precision, 500k steps)
B  :  10.662 ns/day (11.8  GFLOPS, runtime 8104 seconds)***
B  :    9.921 ns/day ( 10.9 GFLOPS, runtime 8709 seconds)**
Ne:  10.606 ns/day (11.7  GFLOPS, runtime 8146 seconds) *
Ne:  12.375 ns/day (13.7  GFLOPS, runtime 6982 seconds)**
Ne:  12.385 ns/day (13.7  GFLOPS, runtime 6976 seconds)****
Ta:  10.825 ns/day (11.9  GFLOPS, runtime 7981 seconds)***
V :   10.560 ns/dat (11.7  GFLOPS, runtime 8182 seconds)***
*no external blas/lapack.
**using ACML libs
*** using openblas
**** using ATLAS

Gromacs --single (1 ns 6x6x6 nm tip4p water box; dynamic load balancing, single precision, 500 k steps)
B  :   17.251 ns/day (19.0 GFLOPS, runtime 5008 seconds)***
Ne:   21.874 ns/day (24.2 GFLOPS, runtime  3950 seconds)**
Ne:   21.804 ns/day (24.1 GFLOPS, runtime 3963  seconds)****
Ta:   17.345 ns/day (19.2 GFLOPS, runtime  4982 seconds)***
V :   17.297 ns/day (19.1 GFLOPS, runtime 4995 seconds)***
*no external blas/lapack.
**using ACML libs
*** using openblas
**** using ATLAS

NWChem (opt biphenyl cation, cp-md/pspw):
B  :   5951 seconds**
B  :   4084 seconds ***
B  :   5782 seconds ***xy
Ne:    3689 seconds**
Ta :   4102 seconds***
Ta :   4230 seconds***xy
V :    5396 seconds***

*no external blas/lapack.
**using ACML libs
*** using openblas
x Reconfigured using getmem.nwchem

NWChem (opt biphenyl cation, geovib, 6-31G**/ub3lyp):
B  :  2841 seconds **
B  :  2410 seconds***
B  :  2101 seconds ***x
B  :  2196 seconds ***xy
Ne: 1665 seconds **
Ta : 1785 seconds***
Ta : 1789 seconds***xy
V  : 2600 seconds***

*no external blas/lapack.
**using ACML libs
*** using openblas
x Reconfigured using getmem.nwchem
y NWChem 6.1.1

A Certain Commercial Ab Initio Package (Freq calc of pre-optimised H14C19O3 at 6-31+G*/rb3lyp):
B  :    2h 00 min (CPU time 10h 37 min)
Ne:   1h 37 min (CPU time: 11h 13 min)
Ta:   1h 26 min (CPU time: 5h 27 min)
V  :   2h 15 min (CPU time 15h 50 min)
Using precompiled binaries.


Gamess:
(I'm still working on learning how to run gamess efficiently, so take these values with a huge saucer of salt for now). bn.inp does a geometry optimisation of a biphenyl cation (mult 2) at ub3lyp/6-31G**. bn.inp has no $STATPT card while bn3.inp does and it makes a huge difference -- but is this because it does 20 steps (nsteps=20), then kills the run? The default is 50 steps and it does seem like all the runs do the maximum number of steps, then exit.

 Again, still learning. See below for input files. Will fix this post as I learn what the heck I'm doing. The relative run times on each node are still comparable though, but just don't use the numbers to compare the run speed of e.g. nwchem vs gamess.

Gamess using bn.inp with atlas
B:    9079 seconds
Ne: 7252 seconds
Ta:  9283 seconds

Gamess using bn.inp with openblas
B:   9071 seconds
Ta: 9297 seconds

Gamess using bn.inp with acml
Ne: 7062 seconds

Gamess using bn3.inp with atlas. 
B: 4016 seconds
Ne: 3162 seconds
Ta: 4114 seconds

MPQC:
Here I've used the version in the debian repos. I've created a hostfile
neon slots=8 max_slots=8
tantalum slots=4 max_slots=4
boron slots=6 max_slots=6

and then just looked at changing the order and slots assignment as well as total number of cores assigned using mpirun.

Simple test case looking at number of cores/distribution:
n cores:  Seconds: Configuration(cores,exec nodes)
4    :   11   : 4(Ta)
4    :   17   : 4(Ne)
4    :   17   : 4(B)
4    :   42   : 2(Ta)+2(B)
6    :   12   : 6(B)
6    :   13   : 6(Ne)
6    :   74   : 2(Ta)+2(B)+2(Ne)
8    :   12   : 8(Ne)
10  :   43   : 4(Ta)+6(B)
12  :   47   : 4(Ta)+8(Ne)
14  :   55   : 6(B)+8(Ne)
18  :   170 :  4(Ta)+6(B)+8(Ne)

My beowulf cluster doesn't seem to be much of a super computer. All in all, this looks like a pretty good argument in favour of upgrading to infiniband...


bn.inp:
 $CONTRL 
COORD=CART UNITS=ANGS scftyp=uhf dfttyp=b3lyp runtyp=optimize 
ICHARG=1 MULT=2 maxit=100
$END
 $system mwords=2000 $end
 $BASIS gbasis=n31 ngauss=6 ndfunc=1 npfunc=1 $END
 $guess guess=huckel $end

 $DATA
biphenyl
C1
C      6.0      0.0000000000   -3.5630100000    0.0000000000 
C      6.0     -1.1392700000   -2.8592800000   -0.3938400000 
C      6.0     -1.1387900000   -1.4654500000   -0.3941500000 
C      6.0      0.0000000000   -0.7428100000    0.0000000000 
C      6.0      1.1387900000   -1.4654500000    0.3941500000 
C      6.0      1.1392700000   -2.8592800000    0.3938400000 
C      6.0      0.0000000000    0.7428100000    0.0000000000 
C      6.0      1.1387900000    1.4654500000   -0.3941500000 
C      6.0      1.1392700000    2.8592800000   -0.3938400000 
C      6.0     -1.1387900000    1.4654500000    0.3941500000 
C      6.0      0.0000000000    3.5630100000    0.0000000000 
C      6.0     -1.1392700000    2.8592800000    0.3938400000 
H      1.0      0.0000000000   -4.6489600000    0.0000000000 
H      1.0     -2.0282700000   -3.3966200000   -0.7116100000 
H      1.0     -2.0214800000   -0.9282700000   -0.7279300000 
H      1.0      2.0282700000   -3.3966200000    0.7116100000 
H      1.0      2.0282700000    3.3966200000   -0.7116100000 
H      1.0     -2.0214800000    0.9282700000    0.7279300000 
H      1.0      0.0000000000    4.6489600000    0.0000000000 
H      1.0     -2.0282700000    3.3966200000    0.7116100000 
H      1.0      2.0214800000    0.9282700000   -0.7279300000 
H      1.0      2.0214800000   -0.9282700000    0.7279300000 
 $END


bn3.inp:
$CONTRL 
COORD=CART UNITS=ANGS scftyp=uhf dfttyp=b3lyp runtyp=optimize 
ICHARG=1 MULT=2 maxit=100
$END
 $system mwords=2000 $end
 $BASIS gbasis=n31 ngauss=6 ndfunc=1 npfunc=1 $END
 $STATPT OPTTOL=0.0001 NSTEP=20 HSSEND=.TRUE. $END
 $guess guess=huckel $end

 $DATA
biphenyl
C1
C      6.0      0.0000000000   -3.5630100000    0.0000000000 
C      6.0     -1.1392700000   -2.8592800000   -0.3938400000 
C      6.0     -1.1387900000   -1.4654500000   -0.3941500000 
C      6.0      0.0000000000   -0.7428100000    0.0000000000 
C      6.0      1.1387900000   -1.4654500000    0.3941500000 
C      6.0      1.1392700000   -2.8592800000    0.3938400000 
C      6.0      0.0000000000    0.7428100000    0.0000000000 
C      6.0      1.1387900000    1.4654500000   -0.3941500000 
C      6.0      1.1392700000    2.8592800000   -0.3938400000 
C      6.0     -1.1387900000    1.4654500000    0.3941500000 
C      6.0      0.0000000000    3.5630100000    0.0000000000 
C      6.0     -1.1392700000    2.8592800000    0.3938400000 
H      1.0      0.0000000000   -4.6489600000    0.0000000000 
H      1.0     -2.0282700000   -3.3966200000   -0.7116100000 
H      1.0     -2.0214800000   -0.9282700000   -0.7279300000 
H      1.0      2.0282700000   -3.3966200000    0.7116100000 
H      1.0      2.0282700000    3.3966200000   -0.7116100000 
H      1.0     -2.0214800000    0.9282700000    0.7279300000 
H      1.0      0.0000000000    4.6489600000    0.0000000000 
H      1.0     -2.0282700000    3.3966200000    0.7116100000 
H      1.0      2.0214800000    0.9282700000   -0.7279300000 
H      1.0      2.0214800000   -0.9282700000    0.7279300000 
 $END

19 September 2012

242. Briefly: Compiling NWChem 6.1.1 with Python on Debian Testing (Wheezy)

Back at the end of June a minor version of NWChem (bug fixes) was released.

There isn't much difference between compiling 6.1.1 and 6.1. Mainly, the difference is in what line to edit for python compatibility (NWChem 6.1 here:http://verahill.blogspot.com.au/2012/05/building-nwchem-61-on-debian.html )

1. Install dev packages
sudo apt-get install libopenmpi-dev openmpi-bin python2.7-dev zlib1g-dev libssl-dev

2. Compile openblas  or download e.g. acml.
(compiles fine, but doesn't run, with atlas)

3. NWchem goodness:

sudo mkdir /opt/nwchem
sudo chown $USER /opt/nwchem
cd /opt/nwchem
wget http://www.nwchem-sw.org/images/Nwchem-6.1.1-src.2012-06-27.tar.gz
(or go here http://www.nwchem-sw.org/index.php/Download)

tar xvf Nwchem-6.1.1-src.2012-06-27.tar.gz
cd nwchem-6.1.1-src/src/config/

Edit makefile.h and hange (line numbers are just for convenience -- don't add them)
1956 #   EXTRA_LIBS += -ltk -ltcl -L/usr/X11R6/lib -lX11 -ldl
1957      EXTRA_LIBS +=    -lnwcutil  -lpthread -lutil -ldl
1958   LDOPTIONS = -Wl,--export-dynamic

to
1956 #   EXTRA_LIBS += -ltk -ltcl -L/usr/X11R6/lib -lX11 -ldl
1957      EXTRA_LIBS +=    -lnwcutil  -lpthread -lutil -ldl -lssl -lz
1958   LDOPTIONS = -Wl,--export-dynamic

cd to /opt/nwchem/nwchem-6.1.1-src/ Create a file called buildconf.sh with the following content:
export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=`pwd`
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all python"
export PYTHONVERSION=2.7
export PYTHONHOME=/usr
export BLASOPT="-L/opt/openblas/lib -lopenblas"
#export BLASOPT="-L/opt/acml/acml5.2.0/gfortran64_int64/lib -lacml"
#export BLASOPT="-L/opt/ATLAS/lib -lsatlas -ltatlas"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/lib/openmpi/lib
export MPI_INCLUDE=/usr/lib/openmpi/include
export LIBRARY_PATH=$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/openblas/lib
#export LIBRARY_PATH=$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/acml/acml5.2.0/gfortran64_int64/lib
#export LIBRARY_PATH=$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/ATLAS/lib
export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"
cd $NWCHEM_TOP/src
make clean
make nwchem_config
make FC=gfortran 1> make.log 2>make.err
export FC=gfortran
cd ../contrib
./getmem.nwchem

Start the compilation:
time sh buildconf.sh 

 On a quadcore i5-2400 it took 18 minutes.

241. pKa, part 3: ccCA in NWChem. Doing something wrong?

First of all, I'm having problems reproducing the output from 'task ccca' by following the methods described in J. Chem. Theory Comput 2008, 4, 328-334 (scaling 0.9854) or Mol. Phys. 2009,107(8-12),1107-1121. The discrepancies are the energies reported for the MP2/cc-pVTZ-DK and CCSD(T)/cc-PVTZ which leads to a difference in calculated relativistic and correlation corrections. More about that at some other time.

Here's using ccCA in NWChem on acetic acid/acetate and formic acid/formate.
More about how it works in another post.

Basically, the way I am using it the results are very, very poor with  ccCA. All I can think is that I must be doing something wrong.


INPUT files 

Acetic acid input:

Title "aceticacid"
Start  aceticacid

echo

charge 0

geometry autosym units angstrom
 C     -0.312051     -1.36877     0.00000
 H     -0.929226     -1.55822     -0.878253
 H     -0.929226     -1.55822     0.878253
 H     0.548700     -2.02934     0.00000
 C     0.150590     0.0606620     0.00000
 O     -0.897092     0.922315     0.00000
 H     -0.521850     1.81528     0.00000
 O     1.29371     0.435169     0.00000
end

basis
* library "cc-pvtz"
end

dft
  mult 1
  direct
  noio
  XC b3lyp
  grid fine
  mulliken
end

driver
  default
end

ccca
  optimize
end

task ccca

Acetate input:

Title "acetate_ccca"
Start  acetate_ccca
echo
charge -1

geometry autosym units angstrom
 C     -0.0311237     -1.36218     0.00000
 H     0.501926     -1.73727     0.878691
 H     0.501926     -1.73727     -0.878691
 H     -1.05131     -1.75101     0.00000
 C     -0.00500996     0.204086     0.00000
 O     1.14247     0.706045     0.00000
 O     -1.12049     0.771493     0.00000
end

basis
* library "cc-pvtz"
end

dft
  mult 1
  direct
  noio
  XC b3lyp
  grid fine
  mulliken
end

driver
  default
end

ccca
   optimize
end

task ccca


Formic acid input:
Title "formicacid"
Start  formicacid

echo

charge 0

geometry autosym units angstrom
 C     0.410955     -0.132154     0.00000
 H     1.50430     -0.0475164     0.00000
 O     -0.134104     1.09718     0.00000
 H     -1.09846     0.988665     0.00000
 O     -0.203188     -1.15938     0.00000
end

basic
* library "cc-pvtz"
end

dft
  mult 1
  direct
  noio
  XC b3lyp
  grid fine
  mulliken
end

driver
end

ccca
   optimize
end

task ccca


Formate input:
Title "formate"
Start  formate

echo

charge -1

geometry autosym units angstrom
 C     0.00000     0.00000     0.329396
 H     0.00000     0.00000     1.47310
 O     -1.13532     0.00000     -0.189103
 O     1.13532     0.00000     -0.189103
end

basis "ao basis" spherical print
* library "cc-pvtz"
END

dft
  mult 1
  direct
  noio
  XC b3lyp
  grid fine
  mulliken
end

driver
end

ccca
   optimize
end

task ccca


OUTPUT 

Acetic acid
 Temperature                      =   298.15K
 frequency scaling parameter      =   0.9889

 Zero-Point correction to Energy  =   38.155 kcal/mol  (  0.060805 au)
 Thermal correction to Energy     =   41.060 kcal/mol  (  0.065434 au)
 Thermal correction to Enthalpy   =   41.653 kcal/mol  (  0.066378 au)

 Total Entropy                    =   69.467 cal/mol-K
   - Translational                =   38.179 cal/mol-K (mol. weight =  60.0211)    - Rotational                   =   23.830 cal/mol-K (symmetry #  =        1)
   - Vibrational                  =    7.458 cal/mol-K

 Cv (constant volume heat capacity) =   14.439 cal/mol-K
   - Translational                  =    2.979 cal/mol-K
   - Rotational                     =    2.979 cal/mol-K
   - Vibrational                    =    8.480 cal/mol-K


 ccCA: calculations done, now printing results
 
 ccCA-P  reference energy =   -228.82035086993045     
 ccCA-S3 reference energy =   -228.82800135561300     
 ccCA-S4 reference energy =   -228.82030423449530     
 ccCA-PS3 reference energy=   -228.82417611277174     
 DK correction            =  -0.13322049012506909     
 CCSD(T) correction       =   -4.8762979862686961E-002
 CV correction            =  -0.20936881324035994     
 ---------------------------
 Total ccCA-P   energy    =   -229.21170315315857     
 Total ccCA-S3  energy    =   -229.21935363884111     
 Total ccCA-S4  energy    =   -229.21165651772341     
 Total ccCA-PS3 energy    =   -229.21552839599985     
 
 Thermochemistry available:
            ZPE   =   6.0851792771826778E-002
 ccCA-P   E+ZPE   =  -229.15085136038675     
 ccCA-S3  E+ZPE   =  -229.15850184606930     
 ccCA-S4  E+ZPE   =  -229.15080472495160     
 ccCA-PS3 E+ZPE   =  -229.15467660322804     
 Wrote ccCA-P    energy to the RTDB 
 Leaving ccCA module...

 Task  times  cpu:     5565.6s     wall:     5650.7s

Acetate

 Temperature                      =   298.15K
 frequency scaling parameter      =   0.9889

 Zero-Point correction to Energy  =   29.591 kcal/mol  (  0.047157 au)
 Thermal correction to Energy     =   31.853 kcal/mol  (  0.050762 au)
 Thermal correction to Enthalpy   =   32.446 kcal/mol  (  0.051706 au)

 Total Entropy                    =   64.067 cal/mol-K
   - Translational                =   38.129 cal/mol-K (mol. weight =  59.0133)
   - Rotational                   =   23.739 cal/mol-K (symmetry #  =        1)
   - Vibrational                  =    2.199 cal/mol-K

 Cv (constant volume heat capacity) =   11.235 cal/mol-K
   - Translational                  =    2.979 cal/mol-K
   - Rotational                     =    2.979 cal/mol-K
   - Vibrational                    =    5.276 cal/mol-K

 ccCA: calculations done, now printing results
 
 ccCA-P  reference energy =   -228.25857936176124     
 ccCA-S3 reference energy =   -228.26625083689740     
 ccCA-S4 reference energy =   -228.25849407721080     
 ccCA-PS3 reference energy=   -228.26241509932930     
 DK correction            =  -0.13318127658752132     
 CCSD(T) correction       =   -4.4728554700242285E-002
 CV correction            =  -0.20921905251765338     
 ---------------------------
 Total ccCA-P   energy    =   -228.64570824556665     
 Total ccCA-S3  energy    =   -228.65337972070282     
 Total ccCA-S4  energy    =   -228.64562296101622     
 Total ccCA-PS3 energy    =   -228.64954398313472     
 
 Thermochemistry available:
            ZPE   =   4.7193435242008613E-002
 ccCA-P   E+ZPE   =  -228.59851481032464     
 ccCA-S3  E+ZPE   =  -228.60618628546081     
 ccCA-S4  E+ZPE   =  -228.59842952577421     
 ccCA-PS3 E+ZPE   =  -228.60235054789271     
 Wrote ccCA-P    energy to the RTDB 
 Leaving ccCA module...

 Task  times  cpu:     3859.1s     wall:     3910.2s


Formic acid

 Temperature                      =   298.15K
 frequency scaling parameter      =   0.9889

 Zero-Point correction to Energy  =   20.909 kcal/mol  (  0.033320 au)
 Thermal correction to Energy     =   22.902 kcal/mol  (  0.036497 au)
 Thermal correction to Enthalpy   =   23.495 kcal/mol  (  0.037441 au)

 Total Entropy                    =   59.329 cal/mol-K
   - Translational                =   37.387 cal/mol-K (mol. weight =  46.0055)    - Rotational                   =   21.008 cal/mol-K (symmetry #  =        1)
   - Vibrational                  =    0.934 cal/mol-K

 Cv (constant volume heat capacity) =    8.703 cal/mol-K
   - Translational                  =    2.979 cal/mol-K
   - Rotational                     =    2.979 cal/mol-K
   - Vibrational                    =    2.744 cal/mol-K

 ccCA: calculations done, now printing results

 ccCA-P  reference energy =   -189.56748775122853
 ccCA-S3 reference energy =   -189.57364633780318
 ccCA-S4 reference energy =   -189.56733835209894
 ccCA-PS3 reference energy=   -189.57056704451585
 DK correction            =  -0.11856238070660652
 CCSD(T) correction       =   -3.0831132609506540E-002
 CV correction            =  -0.16057296161548607
 ---------------------------
 Total ccCA-P   energy    =   -189.87745422616013
 Total ccCA-S3  energy    =   -189.88361281273478
 Total ccCA-S4  energy    =   -189.87730482703054
 Total ccCA-PS3 energy    =   -189.88053351944745

 Thermochemistry available:
            ZPE   =   3.3346398704552728E-002
 ccCA-P   E+ZPE   =  -189.84410782745556
 ccCA-S3  E+ZPE   =  -189.85026641403022
 ccCA-S4  E+ZPE   =  -189.84395842832598
 ccCA-PS3 E+ZPE   =  -189.84718712074289
 Wrote ccCA-P    energy to the RTDB
 Leaving ccCA module...

 Task  times  cpu:     1369.3s     wall:     1407.5s

Formate
 Temperature                      =   298.15K
 frequency scaling parameter      =   0.9889

 Zero-Point correction to Energy  =   12.385 kcal/mol  (  0.019737 au)
 Thermal correction to Energy     =   14.252 kcal/mol  (  0.022713 au)
 Thermal correction to Enthalpy   =   14.845 kcal/mol  (  0.023656 au)

 Total Entropy                    =   56.927 cal/mol-K
   - Translational                =   37.321 cal/mol-K (mol. weight =  44.9976)
   - Rotational                   =   19.229 cal/mol-K (symmetry #  =        2)
   - Vibrational                  =    0.377 cal/mol-K

 Cv (constant volume heat capacity) =    7.310 cal/mol-K
   - Translational                  =    2.979 cal/mol-K
   - Rotational                     =    2.979 cal/mol-K
   - Vibrational                    =    1.351 cal/mol-K


 ccCA: calculations done, now printing results
 
 ccCA-P  reference energy =   -189.01189831122844     
 ccCA-S3 reference energy =   -189.01808726799254     
 ccCA-S4 reference energy =   -189.01171261173857     
 ccCA-PS3 reference energy=   -189.01499278961049     
 DK correction            =  -0.11851294005154500     
 CCSD(T) correction       =   -2.6430727035545942E-002
 CV correction            =  -0.16057463040127118     
 ---------------------------
 Total ccCA-P   energy    =   -189.31741660871680     
 Total ccCA-S3  energy    =   -189.32360556548090     
 Total ccCA-S4  energy    =   -189.31723090922694     
 Total ccCA-PS3 energy    =   -189.32051108709885     
 
 Thermochemistry available:
            ZPE   =   1.9751889903778755E-002
 ccCA-P   E+ZPE   =  -189.29766471881302     
 ccCA-S3  E+ZPE   =  -189.30385367557713     
 ccCA-S4  E+ZPE   =  -189.29747901932316     
 ccCA-PS3 E+ZPE   =  -189.30075919719508     
 Wrote ccCA-P    energy to the RTDB 
 Leaving ccCA module...

Solvation energy

Solvation energy may seem easy to calculate, but difficult to calculate accurately using implicit methods, in particular for ions. I used the optimized structures from above, and then did a single-point COSMO (rsolv 0. Not ideal)  at RDFT(b3lyp)/cc-pVTZ/

Acetic acid: -8.59 kcal/mol
Acetate: -72.33 kcal/mol
Formic acid: -8.90
Formate: -72.59 kcal/mol
H+: -624.61 kcal/mol (lit. value)


Free energies:  
G: ccCA-P+(Hcorr-T*Scorr)
 G(acetic acid): -229.21170315315857*627.503+(41.653-298.15*69.467/1000)-8.59
 G(acetate): -228.64570824556665*627.503+(32.446-298.15*64.067/1000)-72.33
 G(formic acid): -189.87745422616013*627.503+(23.495-298.15*59.329/1000)-8.90
 G(formate):  -189.31741660871680*627.503+(14.845-298.15*56.927/1000)-72.59
 G(H+): -6.28 kcal/mol (lit. value) -264.61 (lit. value)=-270.89 kcal/mol


Results: Direct approach:
Don't forget to account for the standard state. R=1.9858775(34)×10−3 kcal/(K.mol)
DG(acetic/acetate)=G(acetate)+G(H+)-G(acetic)+RT*ln(1/24.46)=
(-228.64570824556665*627.503+(32.446-298.15*64.067/1000)-72.33-270.89)-(-229.21170315315857*627.503+(41.653-298.15*69.467/1000)-8.59)+1.9858775*298.15*log(1/24.46)/1000=
11.0435795929699 kcal/mol=46.2063370169861 kJ/mol =>
pKa=DG*log10(e)/RT=
11.0435795929699*log10(e)/(1.9858775*10**(-3)*298.15)
=8.1 (which is very bad -- it should be close to 4.75)

DG(formic/formate)=G(formate)+G(H+)-G(formic)-RT*ln(1/24.46)=
(-189.31741660871680*627.503+(14.845-298.15*56.927/1000)-72.59-270.89)-(-189.87745422616013*627.503+(23.495-298.15*59.329/1000)-8.90)+1.9858775*298.15*log(1/24.46)/1000=
7.01850845285312 kcal/mol=29.3654393667375 kJ/mol
pKa=5.1 (which is quite bad -- it should be close to 3.75)


Results: Isodesmic approach
From an older post:
"Assuming that we know that formic acid has a pKa of 3.75, then DG_solution=pKa*RT/log(e)=3.75*8.314*298.15/log10(e)/1000=21.404 kJ/mol. The reverse reaction is -21.404 kJ/mol."
That's about 5.116 kcal/mol

DG(acetate)+DG(formic)-(DG(acetic)+DG(formate))+DG(ref)=
((-228.64570824556665*627.503+(32.446-298.15*64.067/1000)-72.33)+(-189.87745422616013*627.503+(23.495-298.15*59.329/1000)-8.90))-((-229.21170315315857*627.503+(41.653-298.15*69.467/1000)-8.59)+(-189.31741660871680*627.503+(14.845-298.15*56.927/1000)-72.59))+5.116=
4.02507114014588 kcal/mol+5.116 kcal/mol =9.14107114014588 kcal/mol <=> pKa= 6.7
Which is better, but still not as good as here.

Using the E+zpe energies doesn't help much:
((-228.59851481032464*627.503+(32.446-298.15*64.067/1000)-72.33)+(-189.84410782745556*627.503+(23.495-298.15*59.329/1000)-8.90))-((-229.15085136038675*627.503+(41.653-298.15*69.467/1000)-8.59)+(-189.29766471881302*627.503+(14.845-298.15*56.927/1000)-72.59))+5.116=
9.10100587110175 kcal/mol <=> pKa=6.68

I really have no idea why the results are so bad when I had reasonable results with DFT/b3lyp/6-31++G** which should be worse than the E(CBS)+E(CC)+E(CV)+E(DK) approach for calculating electronic energies.

Solvation energies are a bit different and could explain some of the difference. Using the solvation energies from here I got:
((-228.64570824556665*627.503+(32.446-298.15*64.067/1000)-73.23)+(-189.87745422616013*627.503+(23.495-298.15*59.329/1000)-9.99))-((-229.21170315315857*627.503+(41.653-298.15*69.467/1000)-9.32)+(-189.31741660871680*627.503+(14.845-298.15*56.927/1000)-72.47))+5.116=
7.76107114008302 kcal/mol <=> pKa=5.69. Not 4.75, but closer.

Using rsolv 1.3 I get
((-228.64570824556665*627.503+(32.446-298.15*64.067/1000)-71.09)+(-189.87745422616013*627.503+(23.495-298.15*59.329/1000)-6.37))-((-229.21170315315857*627.503+(41.653-298.15*69.467/1000)-6.53)+(-189.31741660871680*627.503+(14.845-298.15*56.927/1000)-71.90))+5.116=
10.1610711401063 kcal/mol which is bad.


More thinking.
  This paper says that the gas phase free energy for the deprotonation of acetic acid should be 341.1 kcal/mol
(-228.64570824556665*627.503+(32.446-298.15*64.067/1000)-6.82)-(-229.21170315315857*627.503+(41.653-298.15*69.467/1000))=340.75 kca/mol
We're within 1 kcal/mol 

The same paper states 338.5 kcal/mol for formic acid:
(-189.31741660871680*627.503+(14.845-298.15*56.927/1000)-6.82)-(-189.87745422616013*627.503+(23.495-298.15*59.329/1000))=336.67 kca/mol

For our direct solution phase pKa calculation formic acid was off by about 1.9 kcal/mol which is similar to the error here.

18 September 2012

240. Harmonic frequency scaling in NWChem

...is difficult to find information about, but apparently it IS possible:
http://www.emsl.pnl.gov/docs/nwchem/nwchem-support/2012/08/0021._NWCHEM_FW:_BOUNCE_nwchem-users_emsl.pnl

Example input:

Title "acetate"

Start  acetate

echo

charge -1

geometry autosym units angstrom
 C     0.0191498     0.0215213     0.0191498
 H     -0.621688     -0.620669     0.658428
 H     0.658428     -0.620669     -0.621688
 H     -0.628475     0.649316     -0.628475
 C     0.881900     0.883055     0.881900
 O     1.74905     0.315740     1.74905
 O     0.799516     2.22959     0.799516
end

basis
* library "cc-pvtz"
end

dft
  mult 1
  direct
  noio
  XC b3lyp
  grid fine
  mulliken
end

driver
end

set vib:scalefreq  0.9854

task dft optimize
task dft freq

which gives:

 Temperature                      =   298.15K
 frequency scaling parameter      =   0.9854

 Zero-Point correction to Energy  =   29.486 kcal/mol  (  0.046990 au)
 Thermal correction to Energy     =   31.752 kcal/mol  (  0.050601 au)
 Thermal correction to Enthalpy   =   32.345 kcal/mol  (  0.051544 au)

 Total Entropy                    =   64.086 cal/mol-K
   - Translational                =   38.129 cal/mol-K (mol. weight =  59.0133)
   - Rotational                   =   23.739 cal/mol-K (symmetry #  =        1)
   - Vibrational                  =    2.218 cal/mol-K

 Cv (constant volume heat capacity) =   11.268 cal/mol-K
   - Translational                  =    2.979 cal/mol-K
   - Rotational                     =    2.979 cal/mol-K
   - Vibrational                    =    5.309 cal/mol-K

c.f.

 Temperature                      =   298.15K
 frequency scaling parameter      =   1.0000

 Zero-Point correction to Energy  =   29.923 kcal/mol  (  0.047686 au)
 Thermal correction to Energy     =   32.173 kcal/mol  (  0.051272 au)
 Thermal correction to Enthalpy   =   32.766 kcal/mol  (  0.052215 au)

 Total Entropy                    =   64.008 cal/mol-K
   - Translational                =   38.129 cal/mol-K (mol. weight =  59.0133)
   - Rotational                   =   23.739 cal/mol-K (symmetry #  =        1)
   - Vibrational                  =    2.141 cal/mol-K

 Cv (constant volume heat capacity) =   11.132 cal/mol-K
   - Translational                  =    2.979 cal/mol-K
   - Rotational                     =    2.979 cal/mol-K
   - Vibrational                    =    5.173 cal/mol-K

239. Sun GridEngine: resetting queue status on node

I occasionally run into problems with space during a run on my cluster, which causes the job to fail and the node to be marked as unavailable:

qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
eight.q@neon                   BIP   0/0/8          0.45     lx26-amd64    
---------------------------------------------------------------------------------
five.q@boron                   BIP   0/0/5          6.01     lx26-amd64    
---------------------------------------------------------------------------------
six.q@boron                    BIP   0/6/6          6.01     lx26-amd64    
    788 0.75000 submit__la user         r     09/07/2012 18:36:56     6        
---------------------------------------------------------------------------------
two.q@beryllium                BIP   0/0/2          0.24     lx26-amd64    
---------------------------------------------------------------------------------
four.q@tantalum                BIP   0/0/4          0.05     lx26-amd64    E
---------------------------------------------------------------------------------
three.q@beryllium              BIP   0/0/3          0.24     lx26-amd64    
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.24     lx26-amd64    
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/1          6.01     lx26-amd64    
---------------------------------------------------------------------------------
main.q@tantalum                BIP   0/0/1          0.05     lx26-amd64    

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    789 0.67310 zoli.qsub  user         qw    09/09/2012 10:00:35     6        
    802 0.60527 submit__bi user         qw    09/10/2012 20:45:24     6        
    803 0.60525 submit__bi user         qw    09/10/2012 20:46:00     6        
    927 0.25071 submit__ac user         qw    09/18/2012 08:24:00     4        
    928 0.25000 submit__ac user         qw    09/18/2012 08:45:52     4  

Before you do anything else, free up space and consider moving your scratch dir to a different/separate disk.

Since I keep forgetting how to reset it, here it is -- as a SGE admin do:
 /usr/bin/qmod -c four.q@tantalum
me@beryllium changed state of "four.q@tantalum" (no error)
And now everything is good:

qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
eight.q@neon                   BIP   0/0/8          0.25     lx26-amd64    
---------------------------------------------------------------------------------
five.q@boron                   BIP   0/0/5          5.91     lx26-amd64    
---------------------------------------------------------------------------------
six.q@boron                    BIP   0/6/6          5.91     lx26-amd64    
    788 0.75000 submit__la user         r     09/07/2012 18:36:56     6        
---------------------------------------------------------------------------------
two.q@beryllium                BIP   0/0/2          0.44     lx26-amd64    
---------------------------------------------------------------------------------
four.q@tantalum                BIP   0/4/4          0.17     lx26-amd64    
    927 0.25071 submit__ac user         r     09/18/2012 11:01:26     4        
---------------------------------------------------------------------------------
three.q@beryllium              BIP   0/0/3          0.44     lx26-amd64    
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.44     lx26-amd64    
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/1          5.91     lx26-amd64    
---------------------------------------------------------------------------------
main.q@tantalum                BIP   0/0/1          0.17     lx26-amd64    

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    789 0.67310 zoli.qsub  user         qw    09/09/2012 10:00:35     6        
    802 0.60527 submit__bi user         qw    09/10/2012 20:45:24     6        
    803 0.60525 submit__bi user         qw    09/10/2012 20:46:00     6        
    928 0.25000 submit__ac user         qw    09/18/2012 08:45:52     4   

17 September 2012

238. Calculating pKa, part 2: CBS extrapolation basics

I'm typing this up as I'm learning. Be prepared that there may be outright errors, fuzzy thinking and misunderstood concepts in this post. Use it as an inspiration for learning about Complete Basis Set extrapolation -- not as an authoritative guide. What follows is just what I think I've understood. It's a short post since I'll just put it here in order that I can refer to it later.

The general idea
The basic idea behind Complete Basis Set extrapolation seems to be that as you make your basis sets larger and larger by adding Zeta functions to your valence orbitals (making your valence orbitals more 'flexible') the energies you get start to approach those you would get with an infinitely large or complete (and thus 'true') basis set. Exactly how quickly this approach (exponential?) occurs is a matter for debate, so there's a number of ways of extrapolating it.

In practical terms, what seems to be done is:
1. The structure of a molecule is first optimised using a specific level of theory (e.g. MP2/cc-aug-pvdz) in the gas phase. This structure is used for all subsequent calculations.
2. A single point calculation at MP2/aug-cc-pVDZ is done. The electronic energy is recorded.
3. A single point calculation at MP2/aug-cc-pVTZ is done. The electronic energy is recorded.
4. A single point calculation at MP2/aug-cc-pVQZ is done. The electronic energy is recorded.
5. A single point calculation at MP2/aug-cc-pV5Z is done. The electronic energy is recorded.
etc.

where D=double(2), T=triple (3), Q=Quadruple (4)

Using MP2 and the Dunning-Knoll series of correlation consistent basis sets, for formate I got
2   -188.772348824917 (MP2/aug-cc-pVDZ)
3   -188.930859478806 (MP2/aug-cc-pVTZ)
4   -188.984671946038 (MP2/aug-cc-pVQZ)


Which looks like this:
Not everyone will appreciate the default choice of green and red by gnuplot...
The fitted line is MP2energy=1*exp(-0.599028936687644*Zetafunctions)-189.082170648532

Plotting an exponential with three unknown based only on three points is obviously poor form, but the point is fairly clear.

The extrapolated energy is -189.082170648532 hartree. For practical use you will need to obtain enthalpy/entropy corrections and solvation energies using a set method. Also, there are normally corrections for unpaired electrons etc.

The Peterson Scheme:
 (Peterson, K. A.; Woon, D. E.; Dunning, T. H., Jr. J. Chem. Phys. 1994100, 7410, link)

Instead of a straight A*exp(-B*x)+C fit you do E(x)=A+B*exp(-(x-1))+C*exp(-(x-1)**2), where A is the CBS energy.

So, using gnuplot we can put our energies in 'cbs.dat':
2 -188.772348824917
3 -188.930859478806
4 -188.984671946038

and fit it in gnuplot using:

set xrange [0:5]
f(x)=A+B*exp(-(x-1))+C*exp(-(x-1)**2)
fit f(x) 'cbs.dat' u 1:2 via A,B,C
plot 'cbs.dat' u 1:2, f(x) lc 3



which gives us a CBS energy of -189.016 Hartree (c.f. -189.082 using a simple exponential).

The basic underlying principle behind CBS is thus pretty clear even to someone who's not skilled in the art of computational chemistry.

The difficulties seem to be:
1. What basis set family to use (cc-pVXZ ?)
2. Whether to use diffuse/polarisation/extra orbital functions (aug, p, d+)?
3.  What level to do solvation calc on (DFT, MP2; COSMO, CPCM)?
4. What level to do structural optimisation at?
5. What level to do frequency calculation at?

Remember that a great many basis sets haven't been parametrised for elements beyond Ar.

Also, as far as solvation calculations are concerned an issue seems to be that the correct way to calculate solvation energy seems to be using gas phase structures -- and NOT structures optimised using a solvation model. This can cause issues if the gas phase and solution phase conformations are considerably different.

16 September 2012

237. Briefly: Packet corrupt during ssh sessions

Whenever I'm using two subnets for my cluster I seems to be having problems with:
Corrupted MAC on input.
Disconnecting: Packet corrupt
It particularly happens when there's a lot of information being passed to the screen. It's a right killer when you're compiling on a remote system. However, while I've been able to get around that by running a GNU Screen session on the remote box it was time to solve it.\


I googled and found:
https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.17/+bug/60764


The subnet I'm having issues with is operation across a switch. One of the computers on the network is defined as the gateway and I tend to have problems when connecting from it to the other computers on the network.
The gateway server has two interfaces, eth0 which is connected to a router which is connected to the outside world, and eth1 which is connected to the local subnet I'm having problems with.

The fix has been as as simple as
sudo apt-get install ethtool
sudo ethtool -K eth1 rx off tx off

I only had to run this on the gateway box and so far I've had no issues. Depending on how you're managing your network interfaces (i.e. wicd, network-manager, /etc/networking/*) you may want to add it to the post-up section of your /etc/network/interfaces:
post-up ethtool -K eth1 rx off tx off

Links to this post:
http://winscp.net/forum/viewtopic.php?t=12469

14 September 2012

236. Calculating pKa, part 1:Example (attempt) of an isodesmic reactions in NWChem

Back to learning about computational approaches to chemistry. The usual warnings apply: why would you trust anything that I say about anything? I'm writing anonymously, and I may misunderstand things at times. So make sure that you compare what I write with that of other sources and make up your own mind.

Anyway, I found a fairly detailed presentation in which they were using Gaussian 98 here: https://www.uow.edu.au/~adamt/Trevitt_Research/Links_files/pKa%20workshop%20slides.pdf


While that should be ok to reproduce, it's not that straightforward to do even with Gaussian, since G09 and G03 don't report solvation parameters in the same way (or detail) as G98.

I'm also a lot keener on NWChem than Gaussian for various reasons, not least that it's 'free' (both libre and gratis) while Gaussian inc. has been accused of doing somewhat unfriendly things in the name of protecting their business interests.

See here and here for an example, and then here for a rebuttal from Gaussian inc. I know that EMSL/Pacific Northwest National Lab that develop NWChem and ECCE are prohibited from using Gaussian since they are considered as being competitors.


Back to science.

Our test example will be acetate, and we'll use formic acid to correct our results.
The fact that this post is very long is due to the amount of detail supplied -- I prefer to show some of the more obvious things so that people can learn from what I post -- and I learn by writing the post.

But first let's just do everything using direct methods.

We work with a thermochemical cycle:

IF we can't calculate the DG_solution directly (i.e. too expensive) we can optimise our structures in the gas phase, and then calculate the solvation energy for those structures.

Then DG_sol=DG_gas+DG_solvation(B)-DGsolvation(A).
(more generally sum of DG_solv(prod) - sum of DG_solv(reactants)).


1. pKa of Acetic acid using direct methods

We can either do
H3CCOOH -> H3COO- + H+
or
H3CCOOH + H2O -> H3COO- + H3O+

Steps:
Optimise acetic acid and acetate in the gas phase and do frequency calculations to get the enthalpy and entropy. Then use the gas phase structures and do single point calculations using COSMO to get the electrostatic solvation energies. Finally, use standard state corrections.

A. Optimise acetic acid in the gas phase and do frequency calculation

Title "aceticacid_gas"
Start  aceticacid_gas
echo
charge 0

geometry autosym units angstrom
 C     0.0402340     0.0308110     0.0402340
 H     -0.600803     -0.611482     0.679201
 H     0.679201     -0.611482     -0.600803
 H     -0.607055     0.659335     -0.607055
 C     0.903928     0.890481     0.903928
 O     0.814831     2.23989     0.814831
 O     1.78275     0.299052     1.78275
 H     2.25438     1.03276     2.25438
end

basis "ao basis" spherical print
  H library "6-31++G**"
  O library "6-31++G**"
  C library "6-31++G**"
END

dft
  mult 1
  direct
  noio
  XC b3lyp
  grid fine
  mulliken
end

driver
  default
end

task dft optimize
task dft freq

which gives

         Total DFT energy =     -229.102415550663
      One electron energy =     -550.833155571059
           Coulomb energy =      230.555870626149
    Exchange-Corr. energy =      -29.497734561879
 Nuclear repulsion energy =      120.672603956126

 Numeric. integr. density =       31.999999816803

     Total iterative time =      3.6s
and
 Temperature                      =   298.15K
 frequency scaling parameter      =   1.0000

 Zero-Point correction to Energy  =   38.683 kcal/mol  (  0.061646 au)
 Thermal correction to Energy     =   41.568 kcal/mol  (  0.066243 au)
 Thermal correction to Enthalpy   =   42.160 kcal/mol  (  0.067186 au)

 Total Entropy                    =   69.198 cal/mol-K
   - Translational                =   38.179 cal/mol-K (mol. weight =  60.0211)
   - Rotational                   =   23.855 cal/mol-K (symmetry #  =        1)
   - Vibrational                  =    7.164 cal/mol-K

 Cv (constant volume heat capacity) =   14.327 cal/mol-K
   - Translational                  =    2.979 cal/mol-K
   - Rotational                     =    2.979 cal/mol-K
   - Vibrational                    =    8.368 cal/mol-K

So that G=-229.102415550663*(627.503 kcal/Hartree)+(42.160 kcal/mol-298.15*(69.198 cal/molK)/1000)=-1.4372e+05 kcal/mol

B. Optimise acetate in the gas phase and do frequency calculation

Title "acetate_gas"

Start  acetate_gas

echo

charge -1

geometry autosym units angstrom
 C     0.0405721     0.0285481     0.0405721
 H     -0.601438     -0.613690     0.678620
 H     0.678620     -0.613690     -0.601438
 H     -0.605857     0.658809     -0.605857
 C     0.904975     0.886806     0.904975
 O     0.825186     2.23364     0.825186
 O     1.77103     0.316179     1.77103
end

basis "ao basis" spherical print
  H library "6-31++G**"
  O library "6-31++G**"
  C library "6-31++G**"
END

dft
  mult 1
  direct
  noio
  XC b3lyp
  grid fine
  mulliken 
end

driver
  default
end

task dft optimize
task dft freq                     

which gives

         Total DFT energy =     -228.540046314754
      One electron energy =     -539.474204198295
           Coulomb energy =      229.063209931484
    Exchange-Corr. energy =      -29.339040486703
 Nuclear repulsion energy =      111.209988438759

 Numeric. integr. density =       32.000000595562

     Total iterative time =      2.9s

and

 Temperature                      =   298.15K
 frequency scaling parameter      =   1.0000

 Zero-Point correction to Energy  =   30.023 kcal/mol  (  0.047845 au)
 Thermal correction to Energy     =   32.271 kcal/mol  (  0.051427 au)
 Thermal correction to Enthalpy   =   32.863 kcal/mol  (  0.052371 au)

 Total Entropy                    =   64.022 cal/mol-K
   - Translational                =   38.129 cal/mol-K (mol. weight =  59.0133)
   - Rotational                   =   23.766 cal/mol-K (symmetry #  =        1)
   - Vibrational                  =    2.127 cal/mol-K

 Cv (constant volume heat capacity) =   11.112 cal/mol-K
   - Translational                  =    2.979 cal/mol-K
   - Rotational                     =    2.979 cal/mol-K
   - Vibrational                    =    5.153 cal/mol-K
so that G=-228.540046314754*(627.503 kcal/Hartree)+(32.271 kcal/mol-298.15*(64.022 cal/molK)/1000)=-1.4338e+05 kcal/mol

Putting A and B together: (-228.540046314754*(627.503)+(32.271-298.15*(64.022/1000)))-(-229.102415550663*(627.503)+(42.160-298.15*(69.198/1000)))=344.54 kcal/mol

We haven't accounted for solvation or the proton yet.

C. Solvation of acetic acid

Title "aceticacid_solvation"
Start  aceticacid_solvation

echo

charge 0

geometry autosym units angstrom
 C     -0.313400     -1.37257     0.00000
 H     -0.932151     -1.56367     -0.882188
 H     -0.932151     -1.56367     0.882188
 H     0.551887     -2.03461     0.00000
 C     0.149260     0.0607660     0.00000
 O     1.30165     0.439500     0.00000
 O     -0.897630     0.927778     0.00000
 H     -0.523912     1.82535     0.00000
end

basis "ao basis" spherical print
  H library "6-31++G**"
  O library "6-31++G**"
  C library "6-31++G**"
END

dft
  mult 1
  direct
  noio
  XC b3lyp
  grid fine
  mulliken
end

cosmo
end

task dft energy

which gives
                  COSMO solvation results
                  -----------------------

                 gas phase energy =      -229.1024156124
                 sol phase energy =      -229.1172649438
 (electrostatic) solvation energy =         0.0148493315 (    9.32 kcal/mol)

D. Solvation of acetate

Title "acetate_solvation"
Start  acetate_solvation

echo

charge -1

geometry autosym units angstrom
 C     -0.0308736     -1.36399     0.00000
 H     0.503418     -1.74042     0.882388
 H     0.503418     -1.74042     -0.882388
 H     -1.05531     -1.75261     0.00000
 C     -0.00485953     0.199667     0.00000
 O     -1.12642     0.778065     0.00000
 O     1.14855     0.713544     0.00000
end

basis "ao basis" spherical print
  H library "6-31++G**"
  O library "6-31++G**"
  C library "6-31++G**"
END

dft
  mult 1
  direct
  noio
  XC b3lyp
  grid fine
  mulliken
end

cosmo
end

task dft energy

which gives
                  COSMO solvation results
                  -----------------------
  
                 gas phase energy =      -228.5400463128
                 sol phase energy =      -228.6567490452
 (electrostatic) solvation energy =         0.1167027324 (   73.23 kcal/mol)


Putting A, B and solvation energies together: 
[(-228.540046314754*(627.503)+(32.271-298.15*(64.022/1000)))-(-229.102415550663*(627.503)+(42.160-298.15*(69.198/1000)))]-[73.23-9.32]=344.54-63.910=280.63 kcal/mol

E. The Proton
If you try to do any calculations on an isolated proton you get and SCFE of zero, and you won't do much better in terms of thermochemical data. Yet, monoatomic gases obviously still posses entropy and enthalpy. Instead, the document I cite above uses the ideal gas partition function for ideal monoatomic gases which gives a value of 6.28 kcal/mol for the free energy of a proton.  The last reference states that the free energy for a proton in the gas phase is experimentally determined to be -6.28 kcal/mol and that the free energy of hydration is -264.61 kcal/mol (experimental).

[(-228.540046314754*(627.503)+(32.271-298.15*(64.022/1000)))-6.28-(-229.102415550663*(627.503)+(42.160-298.15*(69.198/1000)))]-[73.23+264.61-9.32]=338.26-328.52=9.74 kcal/mol =40.752 kJ/mol
We're not done yet though -- make sure to continue to 'F. Standard States'

F. Standard States

We know that

DG=DG^0+RT ln (Q).

We have a bit of a problem. We're doing calculations in the gas phase (pressure) but looking at predicting solution values (concentration). Also, I don't fully get it yet, so my explanation is probably a bit fuzzy.

 So if A+B-> C+D we get Q=(C*D)/(A*B) but for A -> C+D we get Q=(C*D)/(A) or (1 bar x1 bar/1 bar)= 1 bar = 101350 Pa => nRT/P=1*8.314*298.15/101350=0.024458 m3= 24.46 L. The concentration of each species is 1 mol/bar which in volume terms means 1/24.46 L.

And the A -> C + D situation is what we have if we look at
HA -> H+ + A-

So,
Q=((1/24.46)*(1/24.46)/(1/24.46))=1/2.4.46 which gives
DG=DG^0-RTln(24.46)=DG^0-7924.9 J/mol

Thus, we need to correct for the standard state: 40.752 -7924.9/1000=32.827 kJ/mol

G. Calculating pKa

Since (in solution so that concentrations are 1 M)
DG=DG^0+RTln(K), where K=([H3COO][H+])/([H3COOH])
DG=DG^0+(RT ln([HCOO]/[H3COOH])+RTln (10^(-pH)), where RT ln([HCOO]/[H3COOH])=RTln(1/1)=0 so
DG=DG^0+RTln (10^(-pH))=DG^0+RT log (10^(-pH)/log(e)=DG^0-(RT/log(e)) * pH
Which for equilbrium, where pH=pKa and DG=0, turns into
pKa=DG^0*log(e)/RT


pKa= DG*log(e)/(RT)=(32.827*1000)*log(e)/(8.314*298.15)=5.75


Not that great as predictions go (exp: pKa=4.75). Looking at some of the literature one error lies in the size of the solvation energies. Possibly one should tune the parameters used in the COSMO.

2. Isodemic reaction/correction

Using formic acid
This approach is based on 1) the similarity between two compounds and 2) us knowing the DG_solution parameter for one of them.

Assuming that we know that formic acid has a pKa of 3.75, then DG_solution=pKa*RT/log(e)=3.75*8.314*298.15/log10(e)/1000=21.404 kJ/mol. The reverse reaction is -21.404 kJ/mol.

We skip a few steps.
Here are the calculated parameters for formic acid (using the same method as above):

Formic acid


SCFE: -189.772804709496 Hartree
Enthalpy correction: 23.773 kcal/mol
Entropy correction: 59.339 cal/mol
Solvation energy: 9.99 kcal/mol




Formate
SCFE:  -189.217943605798 Hartree
Enthalpy correction: 15.016 kcal/mol
Entropy correction: 56.992 cal/mol
Solvation energy: 72.47 kcal/mol

[Just for kicks we quickly look at what the prediction is:
accounting for everything (solvation, proton etc.)
((-189.217943605798*627.503+(15.016-298.15*56.992/1000)-72.47) +(-6.28-264.61))-(-189.772804709496*627.503+(23.773-298.15*59.339/1000)-9.99)=6.7498 kcal/mol
6.7498*4.184-7924.9/1000=20.316 kJ/mol <=> pKa=1000*20.316*log10(e)/(8.314*298.15)=3.56]

The isodesmic approach:

Here we look at
H3CCOOH + -OOCH -> H3CCOO- +HOOCH

This combined reaction has a
DG_solution=DG_solution(acetic acid/acetate)+DG_solution(formate/formic acid)
<=>
DG_solution(acetic acid/acetate)= DG_solution-DG_solution(formic acid/formate)
=DG_solution-(-21.404 kJ/mol)
=DG_gas+[DG_sol(acetate)+DG_sol(formic acid)-DG_sol(acetic acid)-DG_sol(formate)]+21.404 kJ/mol
=
(-228.540046314754*(627.503)+(32.271-298.15*(64.022/1000)))+(-189.772804709496*627.503+(23.773-298.15*(59.339/1000)))-(-229.102415550663*(627.503)+(42.160-298.15*(69.198/1000)))-(-189.217943605798*627.503+(15.016-298.15*(56.992/1000)))+(-73.23-9.99+9.32+72.47)+5.116 kcal/mol= 8.11 kcal/mol= 33.93 kJ/mol

Here we don't need to fiddle with standard states or experimental values for solvation of the proton.

pKa= DG*log(e)/(RT)=(33.93*1000)*log(e)/(8.314*298.15)=5.94
which is even worse than before...
We want ca 27 kJ/mol = 6.48 kcal/mol. Paradoxically this may be due to the ab initio approach to the pKa of formic acid actually giving a very reasonable value.

Using Propanoic acid
So let's try the isodesmic approach using propanoic acid as our reference instead.


Propanoic acid


SCFE:  -268.419515785389 Hartree
Enthalpy correction:  60.894 kcal/mol
Entropy correction:  75.184 cal/mol
Solvation energy:  8.15 kcal/mol




Propanoate
SCFE:   -267.857478200414 Hartree
Enthalpy correction: 52.096 kcal/mol
Entropy correction:  75.392 cal/mol
Solvation energy:  72.14 kcal/mol

pKa=4.86 <=> 27.739 kJ/mol= 6.63 kcal/mol

(-228.540046314754*(627.503)+(32.271-298.15*(64.022/1000)))+(-268.419515785389*627.503+(60.894-298.15*(75.184/1000)))-(-229.102415550663*(627.503)+(42.160-298.15*(69.198/1000)))-(-267.857478200414*627.503+(52.096-298.15*(75.392/1000)))+(-73.23-8.15+9.32+72.14)+6.63 kcal/mol=7.4324 kcal/mol=31.097

pKa= DG*log(e)/(RT)=(31.097*1000)*log(e)/(8.314*298.15)=5.45

It's a bit better, but still a bit off.

3. Conclusion:
The isodesmic approach is not magic and it relies on the similarity of two compounds, one for which there are experimental data, causing similar computational issues. Under the right conditions it's a useful approach, whereas under other conditions -- where a body of experimental data exists -- it might just be easier to determine the correlation between experimental and calculated data via fitting.

The approach worked better for the acetate/propanoate pair than the formate/acetate pair -- and one would consider acetic acid and propanoic acid to be more similar than formic acid and the higher acids. We're still far off from obtaining a perfect result though.

An additional problem is obviously the sensitivity of pKa to the DG -- one pH unit is about 1.36 kcal/mol, which is very small given the usual errors in DFT level calculations. I've seen indications online (google!) that the accuracy of b3lyp is about 3 kcal/mol, and one can always debate the accuracy of a highly empirical method like COSMO.