
03 June 2012

172. ECCE and a ROCKS cluster: step by step

This is quite similar to a recent post, but here's a step-by-step, detailed account of how to set up ECCE for remote job submission to a ROCKS 5.4.3 cluster (one front node, 4 subnodes).

Coming soon (give it a week): Setting up a virtualbox machine with ecce for (stubborn) windows and ROCKS/CentOS users.

What isn't shown are all the failed attempts and dead ends I encountered before getting to the point where I had a working system. I compiled ECCE. I compiled tcsh. I tried compiling bsd csh, which required me to compile bmake, etc. This stuff looks simple, and it is simple -- but not obvious.

NOTE: From the outside we connect to rocks.university.edu. From inside the cluster the submit node is called rocks.local, and the subnodes are called node0, node1, etc. Refer back to this naming if you get confused later.

Step 1. Create the site in ecce
From the terminal, do
ecce -admin
and add a new machine

Don't forget to hit Add/Change queue to make the changes to the queue section take effect. Then hit Add/Change. Oh, and pay attention to the Allocation Account tick box -- if it's ticked you can't submit anything unless you add an account. Important: the machine name you add here is the local name or local IP of the submit node -- it's not the 'public' name or URL. We'll add that somewhere else later. Don't forget to select the queue manager (I forgot in the screenshot).

Close.

Step 2. Editing your CONFIG file
Since you're already in the terminal, go to ecce-v6.3/apps/siteconfig

Take a quick peek at your Machines file (no editing needed) -- the line for rocks looks like this:
rocks rocks.local Dell beo Intel 40:5 ssh :NWChem:Gaussian-03 MN:RD:SD:UN:PW:Q:TL

Take another look at rocks.Q -- there's probably nothing to edit here either:

rocks.Q
# Queue details for rocks
Queues:    nwchem
nwchem|minProcessors:       1
nwchem|maxProcessors:       40
nwchem|runLimit:       100000
nwchem|memLimit:       0
nwchem|scratchLimit:       0
Finally, do some editing of your CONFIG.rocks file.

CONFIG.rocks

NWChem: /share/apps/nwchem/nwchem-6.1/bin/LINUX64/nwchem
Gaussian-03: /share/apps/gaussian/g09/g09
perlPath: /usr/bin/
qmgrPath: /opt/gridengine/bin/lx26-amd64
sourcefile: /home/rocksuser/.cshrc
frontendMachine: rocks.university.edu

SGE {
#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=$walltime
#$ -l h_vmem=$memoryG
#$ -j y
#$ -pe orte $totalprocs  
}

NWChemEnvironment{
            LD_LIBRARY_PATH /usr/lib/openmpi/1.3.2-gcc/lib/
}

NWChemCommand {
        /opt/openmpi/bin/mpirun -n $totalprocs $nwchem $infile > $outfile
}
Gaussian-03Command {
    setenv GAUSS_SCRDIR /tmp
    setenv GAUSS_EXEDIR /share/apps/gaussian/g09/bsd:/share/apps/gaussian/g09/local:/share/apps/gaussian/g09/extras:/share/apps/gaussian/g09
        time /share/apps/gaussian/g09/g09 $infile  $outfile }

Obviously, your variables will be different. NOTE that memory is given in gigabytes here ($memoryG) -- you could also use $memoryM for megabytes. Just adjust your launcher requirements accordingly (presumably, asking for 2 GB in the launcher makes $memoryG expand to 2G).

Step 3. Making csh modifications on the ROCKS cluster
On the main node, use the root password (or sudo) and move /etc/csh.cshrc and /etc/csh.login out of the way (backing them up is a good idea). It doesn't seem like you need to make any csh-related changes on the subnodes.
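Something along these lines should do it (the .bak names are just a suggestion):
sudo mv /etc/csh.cshrc /etc/csh.cshrc.bak
sudo mv /etc/csh.login /etc/csh.login.bak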

Step 4. Finalising our set up
Start ecce the normal way (e.g. run ecce from the terminal)
In the Gateway, start the Machine Browser, highlight 'rocks' and click on Setup Remote Access.
Do what you're told.

Step 5. Submit to your heart's content!

NOTE: the option to set the amount of memory is not shown in the launcher window above: my mistake. You can edit your apps/siteconfig/Machines file and add :MM at the end of the line, e.g.
Dynamic beryllium       Unspecified     Unspecified     Unspecified     18:3    ssh     :NWChem:Gaussian-03     MN:RD:SD:UN:PW:Q:TL:MM

171. Building ECCE on ROCKS/CentOS


I installed ECCE in a couple of places: locally on a single workstation running ROCKS, and remotely on a 40-core cluster running ROCKS. The local, workstation install worked fine. I never really bothered much about the cluster install, and only recently looked closer at it. Well, I can launch the 'gateway' but nothing else -- when I click on e.g. the organizer button I get the rocks version of an hourglass that never goes away -- and I don't get any error messages. Turning on logging doesn't yield anything either.

Ergo, I figured that building it myself might yield a different result. It didn't on the ROCKS cluster, but everything worked just fine on the single-node ROCKS training box I keep in my office.


CentOS is a bit dated, so you'll need to build your own apr and apr-util. Build apr:
cd /share/apps/utils/
wget http://mirror.mel.bkb.net.au/pub/apache//apr/apr-1.4.6.tar.gz
wget http://mirror.mel.bkb.net.au/pub/apache//apr/apr-util-1.4.1.tar.gz
tar xvf apr-1.4.6.tar.gz
cd apr-1.4.6/
./configure --prefix=/share/apps/utils/apr
make
make install
cd ../
tar xvf apr-util-1.4.1.tar.gz
cd apr-util-1.4.1/
./configure --prefix=/share/apps/utils/apr-util --with-apr=/share/apps/utils/apr/
make
make install


Time for ECCE.
First download the source (ecce-v6.3-src.tar.bz2), then:
cd /share/apps/ecce/
tar xvf ecce-v6.3-src.tar.bz2
cd ecce-v6.3/
export ECCE_HOME=/share/apps/ecce/ecce-v6.3
cd build/

Edit build_ecce
889       ./configure --prefix=$ECCE_HOME/${ECCE_SYSDIR}3rdparty/httpd --enable-rewrite --enable-dav --enable-ss-compression
to
889       ./configure --prefix=$ECCE_HOME/${ECCE_SYSDIR}3rdparty/httpd --enable-rewrite --enable-dav --enable-ss-compression --with-apr=/share/apps/utils/apr/bin/apr-1-config --with-apr-util=/share/apps/utils/apr-util/bin/apu-1-config

./build_ecce
Just follow the instructions i.e. hit return, over and over again. Answer no to running tests again. Then run build_ecce again:
./build_ecce
Now stuff should be building. Do this another six times. From the README:
"At this stage the script will build one 3rd party package per invocation,
exiting after each package is built.  In order the 3rd party packages that
will be built are:
1. Apache Xerces XML parser
2. Mesa OpenGL
3. wxWidgets C++ GUI toolkit
4. wxPython GUI toolkit
5. Apache HTTP web server"
The httpd build ends with a minor error about "lib" missing. It's fine.

The sixth time ECCE itself is built, and that's the step that takes by far the longest. It finishes with:
 ECCE built and distribution created in /share/apps/ecce/ecce-v6.3
On a single-node desktop it seemed to run a seventh time; either way, the last step finished with the message above.

Go to your /share/apps/ecce/ecce-v6.3/ dir where you'll find install_ecce.v6.3.csh
Do the install
csh -f install_ecce.v6.3.csh
Follow the instructions.

You may also want to
sudo mv /etc/csh.* ~/
to get rid of the crappy csh config files.

Edit your ~/.bashrc:

alias startecceserver='csh -f /share/apps/ecce/ecce-v6.3/server/ecce-admin/start_ecce_server'
alias stopecceserver='csh -f /share/apps/ecce/ecce-v6.3/server/ecce-admin/stop_ecce_server'
export ECCE_HOME=/share/apps/ecce/ecce-v6.3/apps
export PATH=$PATH:${ECCE_HOME}/scripts

and your ~/.cshrc:

setenv ECCE_HOME /share/apps/ecce/ecce-v6.3/apps
set PATH= (/share/apps/nwchem/nwchem-6.1/bin/LINUX64 $PATH)

On my single-node box I had to edit apps/siteconfig/DataServers and replace eccetera.emsl.pnl.gov with localhost (two instances), as well as the apps/siteconfig/jndi.properties file (one instance).
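If you'd rather do that in one go, something like this (run from the ecce-v6.3 directory; check the files afterwards) should work:
sed -i 's/eccetera.emsl.pnl.gov/localhost/g' apps/siteconfig/DataServers apps/siteconfig/jndi.properties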

In spite of the hassle on the single-node box, everything works there -- the builder, organizer etc. all open just fine. The ROCKS cluster looks fine, but doesn't work.

The ROCKS Cluster:
Everything seems to work fine -- starting ecce launches the gateway, but clicking on anything sees the centos version of the hourglass churn over and over for all eternity. Nothing happens.

I looked through these two threads, and I also tried the pre-built 32 bit binary. All without luck.

I've also tried editing the site_runtime file:
ECCE_MESA_OPENGL true
ECCE_MESA_EXCEPT x86_64:RedHat:Fedora:CentOS
(matches the lsb_release -is output)




01 June 2012

169. ECCE, node hopping and cluster nodes without direct access to WAN

NOTE: I'm still working on this. But this sort of works for now. More details coming soon.
Update:  I've posted better solutions more recently for use with SGE. See aspects of e.g. http://verahill.blogspot.com.au/2012/06/ecce-in-virtual-machine-step-by-step.html and http://verahill.blogspot.com.au/2012/06/ecce-and-inaccessible-cluster-nodes.html

It's not easy choosing a good title for this post which describes the purpose of it succinctly yet clearly (sort of like this sentence), so here's what we're dealing with:

The Problem:
You can access the submit node from off-site. You can't access the subnodes directly from off-site. This post shows you how you can submit to each subnode directly. A better technical solution is obviously to use qsub on the main node. Having said that, with very little modification this method can also be adapted to the situation described here: http://verahill.blogspot.com.au/2012/05/port-redirection-with-eccenwchem.html

A low-tech, home-built example is the following:
eth0 - Main node - eth1 - (subnodes: node0, node1, node2, node3)

eth0 is WAN, and eth1 is LAN. You can ssh to the main node from the 'internet'. There's no queue manager like SGE installed -- submission of jobs is done by logging onto each node and executing the commands there.

The example:
Main node:
    my.cluster.edu
Nodes:
   compute-0-0
   compute-0-1
   compute-0-2
   compute-0-3


The most obvious solution:
do port redirection. The downside is that it requires some technical skill on the part of the users, and anything involving networking and ssh is a PITA if they insist on using windows.


The smarter solution:
I was informed of this solution here.

ROCKS-specific stuff
If your cluster is running ROCKS 5.4.3 and you're having issues opening csh shells, just move /etc/csh.cshrc out of the way as a crude fix.
sudo mv /etc/csh.cshrc ~/
Don't forget to do this on all the nodes if they have local /etc folders! And it's not that easy -- the passwords aren't the same on the nodes as on the main node.
So, on the main node:
rocks set host sec_attr compute-0-1 attr=root_pw
sudo su
# cat /etc/hosts
#ssh 10.1.255.253
#mv /etc/csh.cshrc .
#exit

As the user you'll be running as, edit/create your ~/.cshrc

setenv GAUSS_SCRDIR /tmp
setenv ECCE_HOME /share/apps/ecce/apps
set PATH= (/share/apps/nwchem/nwchem-6.1/bin/LINUX64 $PATH)
setenv LD_LIBRARY_PATH /opt/openmpi/lib:/share/apps/openblas/lib
Repeat on all nodes.
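Assuming the home directories aren't already shared over NFS, a quick way to push the file out (node names as in the example above) is something along these lines:
for node in compute-0-0 compute-0-1 compute-0-2 compute-0-3; do scp ~/.cshrc $node:~/ ; done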

SIDE NOTE: What I'm ultimately looking to achieve on the ROCKS cluster is front-node managed SGE submissions. Easy, you say? Well, ECCE submits a
setenv PATH /opt/gridengine/bin/lx26-amd64/:/usr/bin/:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:${PATH}
and for some reason the ROCKS/CentOS (t)csh can't handle (most of the time) adding ${PATH} to itself since it's too long, unless quotes are added (it seems), i.e. this works consistently:
setenv PATH "/opt/gridengine/bin/lx26-amd64/:/usr/bin/:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:${PATH}"
On my debian system either works fine (bsd-csh).

ANYWAY -- continue

Ecce:
Start setting it up using
ecce -admin



After that, do some editing in ecce-6.3/apps/siteconfig/

CONFIG.voldemort:
xappsPath: /usr/X11R6/bin
NWChem: /share/apps/nwchem/nwchem-6.1/bin/LINUX64/nwchem
Gaussian-03: /share/apps/gaussian/g09/g09
sourceFile:  /etc/csh.login
frontendMachine: my.cluster.edu

Machines:
compute-0-0 compute-0-0.local Dell HPCS3 Phase 1 Cluster AMD 2.6 GHz Opteron 40:5 ssh :NWChem:Gaussian-03 MN:RD:SD:UN:PW




169. ECCE (PNNL/EMSL) and documentation

The title makes it sound like this is going to be a very dry and boring post, but that's not the intention.

Any current or fledgling user of ECCE will probably agree with me when I say that ECCE is somewhat under-documented (there are practical and understandable reasons for this; besides, no-one likes writing documentation). Most of the documentation that does exist was written for earlier versions of ECCE, which used a different workflow. Other features are completely undocumented, or the information is simply difficult to find.

As it turns out, a lot of the technical documentation is to be found in the ecce-6.3/apps/siteconfig/ directory, since many of the settings files are self-documenting. Another thing that has been pointed out to me is that the release notes deserve extra attention precisely because there is no large, coherent body of documentation.

In particular look at:
site_runtime
Some minor settings you mostly won't have to bother with, though you may want to look at the OPENGL settings.
Also, if you're troubleshooting you can uncomment the ECCE_RCOM_LOGMODE line:

# send remote communications output to the console for diagnosing problems
#ECCE_RCOM_LOGMODE true

This allows ECCE to echo every remote command it executes.

submit.site
A treasure trove of ideas and information. Anything you set here will be the default. Defaults can be overridden in the site-specific CONFIG.<> files. It's a judgement call whether to put a specific setting here or in the CONFIG.<> files. Personally, I always change to

NWChemCommand {
  mpirun -n $totalprocs  $nwchem $infile > $outfile
}
as a default setting here.


remote_shells.site
Editing this can allow you to adapt to restrictions in what kind of shells you can run on the remote site or node. The documentation here is a bit iffier.

CONFIG-Examples/ 
Examples of configurations, as the name would suggest. It could do with an overview of the different sites and situations in the README file though.


31 May 2012

168. Port redirection with ECCE/nwchem

Update 1/6/2012: There may be an alternative, better way of doing this: "Another feature that may or may not be useful to you with this special node that is setup with a higher ulimit for submitting your ECCE jobs is that ECCE has a "hop" feature that lets it go from a main login node on a machine to other nodes before actually running commands (e.g. submitting jobs). If you look at the $ECCE_HOME/siteconfig/CONFIG-Examples/CONFIG.mpp2 file, you'll see this "frontendMachine" directive that is what is used to do this. I'm thinking this might allow you to skip the port redirect options with ssh and just "hop" to your special node from regular login node on the compute host. But, I don't think I'd worry about it if what you have now is working fine."
What I describe below works fine, but it does require that users are able to set up the port redirect themselves. The solution hinted at above would be a better technical solution for a larger group of users with varying technical abilities since it should be enough to copy CONFIG.<> and <>.Q files.
Here's how to use main-subnode hopping: http://verahill.blogspot.com.au/2012/06/ecce-and-inaccessible-cluster-nodes.html

Original post:

If you for some reason can't connect to your computational cluster on port 22, here's how to sort it out.


Setting it up the first time

Adding a node:
I have access to an idle computer off-campus, which sits behind a trusty old linksys router. To access it directly I need to use port forwarding.

Edit  ecce-6.3/apps/siteconfig/remote_shells.site and add

ssh_p9999: ssh -p 9999|scp -P 9999
The minuscule p for ssh and capital P for scp are important. They do the same thing, but the different programmes expect different cases.

In the ECCE gateway (the small floating window with icons that start when you start ECCE) select Machine Browser. Under Machine, select Register Machine:


The localhost bit is the key here -- since you'll be doing port redirect you'll formally connect to a port on localhost. Hit Add/Change.
In the main Machine Browser window, highlight oxygen and click on Setup Remote Access. Your ssh_p9999 thingy should show up there. Don't bother testing anything just yet (i.e. Machine status etc.). Since I'm writing this from memory I don't know whether you need to have the port-redirect active at this point or not. If you do, see below under Running.


It really is that simple. See below for how to actually use it.

Adding a remote site:
I work at an Australian university where I appear to be the only person using ECCE. Thus, while the ulimit on the local SGE-managed cluster is a meagre 32 procs, this hadn't been a problem until now. However, ECCE launches five procs per job, so by using up my proc allocation I've been locked out of the cluster on a regular basis.

As a solution, I've been offered my very own shiny submit node with a heftier 37k procs allowed. The downside is that it's only accessible from the standard submit node. Luckily, it's not much more difficult than doing port redirects for a remote node.


Edit  ecce-6.3/apps/siteconfig/remote_shells.site and add

ssh_p5454: ssh -p 5454|scp -P 5454
The minuscule p for ssh and capital P for scp are important.


In the terminal, run
ecce -admin

Most of the fields above should be fairly self-explanatory. A few things to watch out for:

  • ECCE actually looks at the proc to node ratio and will impose strict limitations on the number of cores you can use per node. 50/20 means that if you want to use four cores ECCE forces the use of two nodes. Depending on how you run your jobs (configuration) this may or may not have any real impact. To be safe, pick something like 700 cores and 20 nodes.
  • Path means path. I think ECCE defaults to giving the perl path as /usr/bin/perl, but it should be /usr/bin. Same goes for the qsub path.
  • You need to create a queue. The queue name isn't used anywhere other than in ECCE, so it can be a smart way of setting up defaults. What I'm saying is: it's not that important to get it 'right' since it bears no relation to anything on your cluster.
Click on close.

In 'regular' ecce (i.e. started without -admin) go to the machine browser window, highlight the added site, hit Set up Remote Access, and pick ssh_p5454 as shown below. Don't bother testing anything just yet (e.g. Machine status etc.). Since I'm writing this from memory I don't know whether you need to have the port-redirect active at this point or not. If you do, see below under Running.



As always, setting up a site takes a bit of customisation. Here's my ecce-6.3/apps/siteconfig/CONFIG.gn54 on my ecce workstation.
NWChem: /opt/sw/nwchem-6.1/bin/nwchem
Gaussian-03: /usr/local/bin/G09
perlPath: /usr/bin
qmgrPath: /opt/n1ge62/bin/lx24-amd64
SGE {
#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=$wallTime
#$ -l h_vmem=4G
#$ -j y
}

NWChemEnvironment{
    LD_LIBRARY_PATH /usr/lib/openmpi/1.3.2-gcc/lib/
    PATH /opt/n1ge62/bin/lx24-amd64/
}
NWChemCommand {
#$ -pe mpi_smp$totalprocs  $totalprocs
module load nwchem/6.1
mpirun -n $totalprocs $nwchem $infile > $outfile
}
Gaussian-03Command {
#$ -pe g03_smp4 4
module load gaussian/g09
time G09< $infile > $outfile }
and here's my ecce-6.3/apps/siteconfig/gn54.Q  on my ecce workstation.

# Queue details for gn54
Queues:    nwchem squ8
nwchem|minProcessors:       1
nwchem|maxProcessors:       8
nwchem|runLimit:       4320
nwchem|memLimit:       4000
nwchem|scratchLimit:       0
squ8|minProcessors:       1
squ8|maxProcessors:       6
squ8|runLimit:       4320
squ8|memLimit:       4000
squ8|scratchLimit:       0 
Finally, you need to make sure that  nwchem can find everything - put a file called .nwchemrc in your home folder on the remote node with the correct paths in it, e.g.

nwchem_basis_library /opt/sw/nwchem-6.1/data/libraries/
nwchem_nwpw_library /opt/sw/nwchem-6.1/data/libraryps/
ffield amber
amber_1 /opt/sw/nwchem-6.1/data/amber_s/
amber_2 /opt/sw/nwchem-6.1/data/amber_q/
amber_3 /opt/sw/nwchem-6.1/data/amber_x/
amber_4 /opt/sw/nwchem-6.1/data/amber_u/
spce /opt/sw/nwchem-6.1/data/solvents/spce.rst
charmm_s /opt/sw/nwchem-6.1/data/charmm_s/
charmm_x /opt/sw/nwchem-6.1/data/charmm_x/


That's it.

Running

Starting port redirect:
Before you can pipe anything through to your supersecure remote node/remote site, you need to open a terminal window and do
ssh -C username@remoteserver -L 9999:hiddenserver:22
for the remote node above, or
ssh -C username@remoteserver -L 5454:hiddenserver:22
for the remote site above.

Just to make the syntax clear:
Remote node:
My linksys router is at IP address 110.99.99.99, and the remote node behind it is at 192.168.1.106. My username is verahill.
ssh -C verahill@110.99.99.99 -L 9999:192.168.1.106:22

Remote site:
The standard submit node is called msgln4.university.au, and the hidden node is called gn54.university.au. My username is lindqvist
ssh -C lindqvist@msgln4.university.au -L 5454:gn54.university.au:22
You may need to install and use autossh if you keep on being booted off due to inactivity! The syntax is identical.
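For example, reusing the remote node case above (depending on the autossh version you may also need -M to pick a monitoring port):
autossh -C verahill@110.99.99.99 -L 9999:192.168.1.106:22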

In ECCE:
Everything in ECCE now works exactly like before - just select the target computer/site/node and go.





What doesn't work:
So far I haven't been able to sort out the whole 'open' remote terminal, which means that tail -f output doesn't work either. I'm leaving that for a rainy day with too much time on my hands.

28 May 2012

167. ECCE/Nwchem on An Australian University computational cluster using qsub with g09/nwchem

EDIT:
I've just learned the First Rule of Remote Computing:
always start by checking the number of concurrent processes you're allowed on the head node, or you can lock yourself out faster than you can say "IT support".

do
ulimit -u
If it's anywhere under 1000, then you need to be careful.
Default ulimit on ROCKS: 73728
Default ulimit on Debian/Wheezy:  63431
Ulimit on the Oz uni cluster: 32

ECCE launches FIVE processes per job.
Each pipe you add to a command launches another proc. Logging in launches a proc -- if you've reached your quota, you can't log in until a process finishes.

cat test.text|sed 's/\,/\t/g'|gawk '{print $2,$3,$4}' 
yields three processes -- ten percent of my entire quota.

NOTE:
Running something on a cluster where you have limited access is very different from running on a cluster you manage yourself. Apart from knowing the physical layout, you normally have sudo powers on a local cluster.

One potential issue is excessive disk usage -- both in terms of storage space and in terms of raw I/O (writing to an NFS-mounted disk isn't efficient anyway).
So in order to cut down on that:
1. Define a scratch directory using e.g. (use the correct path)
scratch_dir /scratch
The point being that /scratch is a local directory on the execution node

2. Make sure that you specify
dft
     direct
     ..
end
or even
dft
    noio
    ...
end
to do as little disk caching as possible.

I accidentally ended up storing 52 GB of aoints files from a single job. It may have been what locked me out of the submit node for three hours...

A good way to check your disk-usage is
ls -d * |xargs du -hs

Now, continue reading:



Setting everything up the first time:
First figure out where the mpi libs are:
qsub.tests:

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -l h_rt=00:14:00
#$ -l h_vmem=4G
#$ -j y
locate libmpi.so
Assuming that the location is /usr/lib/openmpi/1.3.2-gcc/lib/, put 
export LD_LIBRARY_PATH=/usr/lib/openmpi/1.3.2-gcc/lib/
in your ~/.bashrc


Next, look at ls /opt/sw/nwchem-6.1/data -- if there's a default.nwchemrc file, then
ln -s /opt/sw/nwchem-6.1/data/default.nwchemrc ~/.nwchemrc

If not, create ~/.nwchemrc with the locations of the different basis sets, amber files and plane-wave sets listed as follows:

nwchem_basis_library /opt/sw/nwchem-6.1/data/libraries/
nwchem_nwpw_library /opt/sw/nwchem-6.1/data/libraryps/
ffield amber
amber_1 /opt/sw/nwchem-6.1/data/amber_s/
amber_2 /opt/sw/nwchem-6.1/data/amber_q/
amber_3 /opt/sw/nwchem-6.1/data/amber_x/
amber_4 /opt/sw/nwchem-6.1/data/amber_u/
spce /opt/sw/nwchem-6.1/data/solvents/spce.rst
charmm_s /opt/sw/nwchem-6.1/data/charmm_s/
charmm_x /opt/sw/nwchem-6.1/data/charmm_x/


Using nwchem:
A simple qsub file would be:

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -l h_rt=00:14:00
#$ -l h_vmem=4G
#$ -j y
#$ -pe orte 4
module load nwchem/6.1
time mpirun -n 4 nwchem  test.nw > nwchem.out


with test.nw being the actual nwchem input file which is present in your cwd (current working directory).


Using nwchem with ecce:
This is the proper way of using nwchem. If you haven't already, look here: http://verahill.blogspot.com.au/2012/05/setting-up-ecce-with-qsub-on-australian.html

Then edit your  ecce-6.3/apps/siteconfig/CONFIG.msgln4  file:

NWChem: /opt/sw/nwchem-6.1/bin/nwchem
Gaussian-03: /usr/local/bin/G09
perlPath: /usr/bin/perl
qmgrPath: /usr/bin/qsub

SGE {
#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=$wallTime
#$ -l h_vmem=4G
#$ -j y
}

NWChemFilesToDelete{ core *.aoints.* }

NWChemEnvironment{
    LD_LIBRARY_PATH /usr/lib/openmpi/1.3.2-gcc/lib/
}

NWChemCommand {
#$ -pe mpi_smp4  4
module load nwchem/6.1

mpirun -n $totalprocs $nwchem $infile > $outfile
}

Gaussian-03Command {
#$ -pe g03_smp4 4
module load gaussian/g09

time G09< $infile > $outfile }

Gaussian-03FilesToDelete{ core *.rwf }

Wrapup{
find /scratch/* -name "*" -user $USER |xargs -I {} rm {} -rf
}

And you should be good to go. IMPORTANT: don't copy the settings blindly -- what works at your uni might be different from what works at my uni. But use the above as an inspiration and validation of your thought process. The most important thing to look out for in terms of performance is probably your -pe switch.
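If you're unsure which -pe values your SGE site actually offers, qconf will tell you (the PE name orte below is just an example):
qconf -spl
qconf -sp orte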

Since I'm having problems with the low ulimit, I wrote a small bash script which I've set to run every ten minutes as a cronjob. Of course, if you've used up your 32 procs you can't run the script... Also, instead of piping stuff right and left (each pipe creates another fork/proc) I've written it so that it dumps stuff to disk. That way you have a list of procs in case you need to kill something manually:

 The script: ~/clean_ps.sh
date
# dump the full process list to disk, then print the current number of procs
ps ux>~/.job.list
ps ux|gawk 'END {print NR}'

# kill leftover interactive shells (-sh -i)
cat ~/.job.list|grep "\-sh \-i">~/.job2.list
cat ~/.job2.list|gawk '{print$2}'>~/.job3.list
cat ~/.job3.list|xargs -I {} kill -15 {}

# kill stray 'echo' processes
cat ~/.job.list|grep "echo">~/.job4.list
cat ~/.job4.list|gawk '{print$2}'>~/.job5.list
cat ~/.job5.list|xargs -I {} kill -15 {}

# kill stray 'notty' processes
cat ~/.job.list|grep "notty">~/.job6.list
cat ~/.job6.list|gawk '{print$2}'>~/.job7.list
cat ~/.job7.list|xargs -I {} kill -15 {}

# kill stray perl processes
cat ~/.job.list|grep "perl">~/.job8.list
cat ~/.job8.list|gawk '{print$2}'>~/.job9.list
cat ~/.job9.list|xargs -I {} kill -15 {}

# show the queue and the new proc count
qstat -u ${USER}
ps ux |gawk 'END {print NR}'
echo "***"

and the cron job is set up using
crontab -e
 */10 * * * * sh ~/clean_ps.sh>> ~/.cronout

Obviously this kills any job monitoring from the point of view of ecce. However, it keeps you from being locked out. You can manually check the job status using qstat -u ${USER}, then reconnect when a job is ready. Not that convenient, but liveable.

166. Briefly: nvidia API mismatch on debian when running ecce

UPDATE: There's a much better way to do this: "One thing that I did notice is your issues with OpenGL where you suggested moving the shared libraries to another directory. While that's perfectly workable, this would be another instance where consulting the $ECCE_HOME/siteconfig/site_runtime file would be useful. There you would learn about the $ECCE_MESA_OPENGL and $ECCE_MESA_EXCEPT variables that control whether to use the ECCE-supplied GL libraries or native ones (e.g. hardware OpenGL card drivers) on your machine." I'll update this post again when I've had a time to look into it. Lecture slides and grant rejoinders don't write themselves...

Original post:
If you get an error along the lines of this:
 http://www.linuxquestions.org/questions/debian-26/api-mismatch-nvidia-kernel-module-871115/
only when you're running ECCE, i.e. there's an API mismatch error with a difference in kernel module version vs the nvidia driver component (in my case 295.49 and 290.10, respectively), then you may want to have a look in your apps folder before you launch a major investigation, e.g.
ecce-6.3/apps/rhel5-gcc4.1.2-m64/3rdparty/mesa/lib
libGL.so    libGL.so.295.49  libGLU.so.1           libnvidia-glcore.so.295.49
libGL.so.1  libGLU.so        libGLU.so.1.3.071100  libnvidia-tls.so.295.49
You can symlink the correct drivers, or -- which is even easier -- just move your 3rdparty/mesa directory to e.g. 3rdparty/bakmesa and see if that solves it.
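In other words, something like this (paths as in the listing above):
cd ecce-6.3/apps/rhel5-gcc4.1.2-m64/3rdparty
mv mesa bakmesa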

24 May 2012

162. PSPW/Carr-Parrinello using ECCE

This is more of a note to self about Car-Parrinello using ecce/nwchem. As always, this isn't about the science, but about making the computation run at all. And what I may consider a bug may in fact be a feature.

If you simply click your way through ecce and try to launch a pspw Car-Parrinello calc, it will fail.

Two problems:

  • task pspw car-parrinello expects a .movecs file to be present. You can 'solve' this by putting task pspw steepest_descent before your task pspw car-parrinello statement
  • if you relaunch a run, often you get a crash with an error referring to writing after EOF. You can solve this by cleaning out your run directory.

Problem number one comes down to this:


"Velocity Wavefunction Datafile
The one-electron orbital velocities are stored in a velocity wavefunction datafile. This is a binary file and cannot be directly edited. This datafile is used by the Car-Parrinello task and can be generated using the v_wavefunction_initializer task."


The dependence on certain files being present (at a minimum, .movecs), the need for optimisation before CPMD, and the fact that either
task pspw energy
or
task pspw optimize
will more often than not prevent your ecce.out file from showing any of the MD output -- all of this means it's still better to set up your input by hand and run everything by hand in a dedicated directory. Ecce doesn't copy runtime files back and forth, and that's the main problem here.

Really, what I find a problem is that I'd like to optimize and equilibrate a set of molecules, then continue using the equilibrated set.

If all you're trying to do is to get something, anything to work to get a feel for how this stuff works, then continue reading.

Problematic example file:

  1 scratch_dir /home/me/jobs/scratch
  2 Title "biphenyl_ground_twisted_cpmd_1-1"
  3
  4 Start  biphenyl_ground_twisted_cpmd_1-1
  5
  6 echo
  7
  8 charge 0
  9
 10 geometry autosym units angstrom
 11  C     0.00676622     3.53807     0.0197363
 12  C     -1.29633     2.88855     0.554869
 13  C     -1.31879     1.38415     0.519460
 14  C     0.00129627     0.730174     -0.000557722
 15  C     1.28746     1.38368     -0.578129
 16  C     1.31931     2.90453     -0.512952
 17  C     -0.0100394     -0.758319     -0.0224661
 18  C     1.33004     -1.36336     0.563945
 19  C     1.24425     -2.89848     0.485842
 20  C     -1.31683     -1.36948     -0.531559
 21  C     0.0254501     -3.54181     0.0318405
 22  C     -1.30632     -2.89413     -0.540694
 23  H     0.0916004     4.71976     0.273191
 24  H     -2.74374     3.75562     1.47791
 25  H     -2.78549     0.594633     1.40665
 26  H     2.70470     3.74589     -1.20122
 27  H     3.09496     -3.77313     1.52949
 28  H     -2.76973     -0.640827     -1.49262
 29  H     0.0203915     -4.70472     -0.288098
 30  H     -2.76848     -3.72979     -1.38695
 31  H     2.88815     -0.631319     1.39550
 32  H     2.66933     0.621436     -1.58686
 33 end
 34
 35 ecce_print ecce.out
 36
 37 nwpw
 38   mult 1
 39   np_dimensions -1  -1
 40   tolerances 1e-7  1e-7
 41   car-parrinello
 42     time_step 5.000000e+00
 43     fake_mass 5.000000e+02
 44     loop 10 100
 45     scaling 1.000000e+00 1.000000e+00
 46   end
 47 end
 48
 49 task pspw car-parrinello

Quick 'solution' (add these at lines 3 and 48 of the listing):
3 memory 200mw
48 task pspw steepest_descent

The line numbers are added by me. Remove them before running.

You can also stick task pspw energy or optimize in there -- but the way ecce does it, with just a task pspw car-parrinello, won't work. Either way, it'd be nice to be able to carry over the movecs files between calculations.


See below for various errors:


Error #1:
If you set up the run using ecce, it won't work and there won't be any real error message to explain why the run exits immediately.

294      >>>  JOB STARTED       AT Thu May 24 14:19:04 2012  <<<
295           ================ input data ========================
296   library name resolved from: compiled reference
297   NWCHEM_NWPW_LIBRARY set to: </opt/nwchem/nwchem-6.1/src/nwpw/libraryps/>
298   library name resolved from: compiled reference
299   NWCHEM_NWPW_LIBRARY set to: </opt/nwchem/nwchem-6.1/src/nwpw/libraryps/>
300
301 -----ECCE Log Information-----
302 Starting Job: Thu May 24 14:19:02 EST 2012
303 Using /home/me/jobs/scratch as nwchem SCRATCH_DIR
304 nwchem exit status = -1
305 Final exit status = -1
306 Completed Job: Thu May 24 14:19:05 EST 2012

If you launch the run in the terminal (without mpirun -- mpi suppresses error messages sometimes) you get:
     >>>  JOB STARTED       AT Thu May 24 14:20:15 2012  <<<
          ================ input data ========================
  library name resolved from: compiled reference
  NWCHEM_NWPW_LIBRARY set to: </opt/nwchem/nwchem-6.1/src/nwpw/libraryps/>
  library name resolved from: compiled reference
  NWCHEM_NWPW_LIBRARY set to: </opt/nwchem/nwchem-6.1/src/nwpw/libraryps/>
ERROR:  Could not open pipe from input file
The reason is that ecce doesn't carry files over from previous simulations -- you need the .movecs file. This can be generated by
  task pspw steepest_descent 

If you could run all your jobs in the same directory that wouldn't be a problem.

Error #2
438      >>>  JOB STARTED       AT Thu May 24 14:24:38 2012  <<<
439           ================ input data ========================
440  ------------------------------------------------------------------------
441  out of heap memory        0
442  ------------------------------------------------------------------------
443  ------------------------------------------------------------------------
444   current input line :
445     48: task pspw energy
446  ------------------------------------------------------------------------
447  ------------------------------------------------------------------------
448  ------------------------------------------------------------------------
449  For more information see the NWChem manual at http://www.nwchem-sw.org/        index.php/NWChem_Documentation

We chucked task pspw steepest_descent in before our task pspw car-parrinello and now get a new error: out of heap memory. Easily fixed: you can set e.g. 200 MW under pspw/details or add
memory 200 MW
by hand.

Of course, if you add it by clicking in ecce then your task pspw steepest_descent line will be removed, so you'll have to add that by hand again.


Error #3
According to the manual "This [movecs] datafile is used by the Car-Parrinello task and can be generated using the v_wavefunction_initializer task."
Well, try
task v_wavefunction_initialize
and you get

>>>> PSPW Serial Module - v_wavefunction_initializer <<<<
0:Segmentation Violation error, status=: 11
(rank:0 hostname:beryllium pid:24675):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
Last System Error Message from Task 0:: No such file or directory
And you find that there's a file called ????


Error 4
Going full out:

task pspw wavefunction_initializer
task pspw pseudopotential_formatter
task pspw v_wavefunction_initializer
task pspw car-parrinello
gives


 >>>> PSPW Serial Module - wavefunction_initializer <<<<
0:Floating Point Exception error, status=: 8
(rank:0 hostname:beryllium pid:26026):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigFpeHandler():249 cond:0
Last System Error Message from Task 0:: No such file or directory

Error 5:
If you haven't cleared out your run directory you get this via ecce
        ============ Car-Parrinello iteration ==============
     >>>  ITERATION STARTED AT Thu May 24 18:03:07 2012  <<<
    iter.         KE+Energy             Energy        KE_psi        KE_ion   Temperature
    ------------------------------------------------------------------------------------
      10  -0.1662131203E+02  -0.1662582428E+02   0.43690E-02   0.39005E-02        143.80

-----ECCE Log Information-----
Starting Job: Thu May 24 18:00:41 EST 2012

and this if you run in the terminal

At line 847 of file cpmdv5.F (unit = 31, file = './cpmd_test.emotion')
Fortran runtime error: Sequential READ or WRITE not allowed after EOF marker, possibly use REWIND or BACKSPACE
      10   0.1175850345E+06   0.3940717762E+03   0.18793E-03   0.27950E-02       1852.03
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 26303 on
node beryllium exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
------------------------------------------------------------------------

21 May 2012

158. Setting up ecce with qsub at An Australian University computational cluster

EDIT: this works for G09 on that particular cluster. Come back in a week or two for a more general solution (end of May 2012/beginning of June 2012).

I don't feel comfortable revealing where I work. But imagine that you end up working at an Australian University in, say, Melbourne. I do recognise that I will be giving enough information here to make it possible to identify  who I am (and there are many reasons not to want to be identifiable -- partly because students can be mean and petty, and partly because I suffer from the delusion that IT rules apply to Other People, and not me -- and have described ways of doing things you're not supposed to be doing in this blog)

Anyway.

My old write-ups of ecce are pretty bad, if not outright inaccurate. Anyway, I presume that in spite of that you've managed to set up ECCE well enough to run stuff on nodes of your local cluster.

Now it's time for the next level -- on a remote site using SGE/qsub

So far I've only tried this out with G09 -- they are currently looking to set up nwchem on the university cluster. Not sure what the best approach to the "#$ -pe g03_smp2 2" switch is for nwchem.

--START HERE --

EVERYTHING I DESCRIBE IS DONE ON YOUR DESKTOP, NOT ON THE REMOTE SYSTEM. Sorry for shouting, but don't go a-messing with the remote computational cluster -- we only want to teach ecce how to submit jobs remotely. The remote cluster should be unaffected.

1. Creating the Machine
To set up a site with a queue manager, start
ecce -admin

Do something along the lines of what's shown in the figure above.

If you're not sure whether your qsub belongs to PBS or SGE, type qstat -help and look at the first line returned, e.g. SGE 6.2u2_1.

2. Configure the site
Now, edit your ecce-6.3/apps/siteconfig/CONFIG.msgln4  (local nodes go into ~/.ECCE  but remote SITES go in apps/siteconfig --  and that's what we're working with here).

   NWChem: /usr/local/bin/NWCHEM
   Gaussian-03: /usr/local/bin/G09
   perlPath: /usr/bin/perl
   qmgrPath: /usr/bin/qsub
 
   SGE {
   #$ -S /bin/csh
   #$ -cwd
   #$ -l h_rt=$wallTime
   #$ -l h_vmem=4G
   #$ -j y
   #$ -pe g03_smp2 2

   module load gaussian/g09
    }
A word of advice -- open the file in vim (save using :wq!) or do a chmod +w on it first since it will be set to read-only by default.


3. Queue limits
The same goes for the next file, which controls various job limits, ecce-6.3/apps/siteconfig/msgln4.Q:
# Queue details for msgln4
Queues:    squ8

squ8|minProcessors:       2
squ8|maxProcessors:       6
squ8|runLimit:       4320
squ8|memLimit:       4000
squ8|scratchLimit:       0
4. Connect
In the ecce launcher-mathingy click on Machine Browser, and Set Up Remote Access for the remote cluster. Basically, type in your user name and password.

Click on machine status to make sure that it's connecting

5. Test it out!
If all is well you should be good to go

17 May 2012

153. dft gridsize: ecce defaults to medium for nwchem and fine for g09

I set out to reproduce Malagoli and Brédas in Chemical Physics Letters, 2000, 327, 13-17 (Link). Essentially it's a paper on calculating reorganisational energies in a few simple organic species, such as biphenyl.

I'm a computational noob -- I'm stronger in the computer department than the computational one. The following will most likely only be useful to other newcomers like myself. Anyway...

The authors used ub3lyp/6-31g** and g03. While it may have taken a substantial amount of time in 2000, today the entire paper can be reproduced in a few hours on a simple beowulf cluster. So I set out to do just that -- partly to make sure that I understood the approach, partly to make sure that g09 and g03 gave the same results, and, importantly, to make sure I can use nwchem for these calculations if I so desire. I'm getting to the point where nwchem is almost as fast as gaussian for some calculations (cosmo excepted...) and I much prefer the nwchem syntax and python support.

For biphenyl, g09 gave (in Hartree)
Neutral, ground state:     -463.3219416650 (geometry opt)
Cation, neutral geometry: -463.0352702000 (single point calc)
Cation, cation geometry:  -463.0422747850 (geometry opt)
Neutral, cation geometry: -463.3157130000 (single point calc)

That works out to a reorganisation energy of ca 0.36 eV -- same as in the paper.
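For reference, here's the arithmetic, assuming the usual definition of the total reorganisation energy as the sum of the two relaxation energies (1 Hartree is ca 27.211 eV):
lambda = [E(cation@neutral) - E(cation@cation)] + [E(neutral@cation) - E(neutral@neutral)]
       = 0.0070046 + 0.0062287 = 0.0132333 Hartree, i.e. ca 0.36 eV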

while nwchem gave
Neutral, ground state:      -463.3219454524 (geometry opt)
Cation, neutral geometry: -459.0525026266 (single point calc)
Cation, cation geometry:  -463.0422952505 (geometry opt)
Neutral, cation geometry: -459.0556262940 (single point calc)

The order of the stabilities doesn't even match!

Recalculating the energies by adding

dft
    grid fine
    .....
end

gave the 'correct' results:
Cation, neutral geometry: -463.0353366421
Neutral, cation geometry: -463.3157251871

which gives about 0.36 eV

Here's what puzzles me a bit: you definitely need to add 'fine' in nwchem, and that's what g09 defaults to. But the energies reported for the optimised structures (cation, cation geometry and neutral, ground state) must have been calculated using a fine grid too or one would presume that they'd be off too. Repeating this with the other structures in the paper gives the same result.

Anyway, if you get weird and wacky energies from your single point calcs, maybe you should make sure you've specified the dft grid size consistently.


07 May 2012

144. [Fixed] Upgraded ECCE (6.2 -> 6.3) won't let you save input files, can't write basis etc.

Update 1/6/2012: As always there's a better, smarter way of doing this, and as usual it involves reading the manual, or in this case reading the documentation that comes with ecce:
"... you had issues upgrading from ECCE 6.2 to 6.3 in regards to ECCE not being able to find scripts it needed to generate input files such as creating basis sets. Your solution was the manual way for something that is a basic part of ECCE setup for users (maybe you've since figured this out). There is a $ECCE_HOME/scripts/runtime_setup.sh sh/bash environment setup script that you can invoke to set up the paths as needed. This is documented in the list of steps needed when you install ECCE after it is done extracting the distribution right before the install finishes. When you actually invoke ecce then the rest of the environment (such as putting the scripts/parsers directory in the path) is done by the ecce_env script."

From what I understand, the path which is set is ECCE_HOME, and $ECCE_HOME/scripts is then added to PATH. But that's not enough. I think the 'word too long' thingy is playing up again.

The problem:
After upgrading ecce from 6.2 to 6.3 I keep getting errors of this type after drawing a structure, and then choosing a basis set:

ERROR: Input files could not be generated--failed writing basis set
Calculation saved as small.

And yes, no input files are generated.

Since I launch ecce from the terminal, I get the following error messages when I try to launch nwchem jobs:

sh: 1: std2NWChem: not found

and gaussian jobs:
sh: 1: std2Gaussian-03: not found

Unrelated: I also get a lot of
Word too long.
Word too long.
Word too long.
Word too long.
which has to do with csh somehow. I am not a fan of csh.

My ecce-6.3/apps/siteconfig/CONFIG.beryllium looks correct.

The investigation:

First, in ecce-6.3
 cat */*/*/*|strings|grep std2
[..]
std2GAMESS(US)          - Script that translates basis set from standard
std2Gaussian-92         - Identical to std2Gaussian-94.
std2Gaussian-94         - Script that translates basis set from standard
std2NWChem              - Script that translates basis set from standard
[..]
At least now we know what it's related to.
tail -n 9999 */*/*/*|strings|egrep "std2|<=="
==> apps/scripts/parsers/README <==
   "std2Gaussian-94".
std2GAMESS(US)          - Script that translates basis set from standard
std2Gaussian-92         - Identical to std2Gaussian-94.
std2Gaussian-94         - Script that translates basis set from standard
std2NWChem              - Script that translates basis set from standard
locate std2
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2Amica
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2GAMESS-UK
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2GAMESS-US
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2Gaussian-03
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2Gaussian-92
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2Gaussian-94
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2Gaussian-98
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2Hondo
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2Meldef
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2NWChem
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2aceII
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2molcas
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2molpro
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2supermolecule
/home/me/.ecce/ecce-6.3/apps/scripts/parsers/std2tx93
apps/scripts/ecce_env shows that ecce checks whether parsers is in path -- and adds the parsers directory if it isn't:

if (`echo $PATH | grep -c "${ECCE_HOME}/scripts/parsers"` == 0 ) then
  set path = (${ECCE_HOME}/scripts/parsers $path)
endif

Something seems to go wrong here.

The solution

Add this to ~/.bashrc
export ECCE_HOME=/home/me/.ecce/ecce-6.3/apps
export PATH=${ECCE_HOME}/scripts:${ECCE_HOME}/scripts/parsers:${PATH}
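A quick check that the fix took (in a new terminal):
which std2NWChem
which should return the full path to the parser script.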

06 May 2012

143. MD = Ecce + NWChem. 4. Dynamics

Update 19 June 2013:
A much better written and complete guide to getting started with MD in ECCE+NWChem is found here: http://saccharides.blogspot.tw/2013/06/ecce-md-calculation.html

Original post:

Part 1:
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-1-prepare.html

Part 2:
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-2-md-optimize.html

Part 3:
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-3-energy.html


Rightclick, new, dynamics

 Double-click on the new icon


Click on editor


Make sure the number of steps is high enough -- I had a lot of errors when it was 30 or 300. You should also use an equilibration period -- here I didn't.
 Nothing much to see here.
 Here's the output -- without initial equilibration
and with 1000 equil steps
And that's about it. Time to start looking into doing MM/QM...

05 May 2012

142. MD = Ecce + NWChem. 3. Energy

Update 19 June 2013:
A much better written and complete guide to getting started with MD in ECCE+NWChem is found here: http://saccharides.blogspot.tw/2013/06/ecce-md-calculation.html

Original post:

Part 1:
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-1-prepare.html

Part 2:
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-2-md-optimize.html

Part 4.
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-4dynamics.html

Rightclick and select new, Energy
 Double-click on the new icon
 Click on editor
 Set up your calc, then launch
It should run quickly. There's little to see in terms of output except for single-point energies.

141. MD = Ecce + NWChem. 2. MD Optimize

Update 19 June 2013:
A much better written and complete guide to getting started with MD in ECCE+NWChem is found here: http://saccharides.blogspot.tw/2013/06/ecce-md-calculation.html

Original post:

Part 1 is here: http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-1-prepare.html

Part 3:
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-3-energy.html


Part 4.
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-4dynamics.html



Rightclick on your project and select create new, Optimize

Double-click on the icon that was created
 Click on editor
You can now configure the optimization.
 Things you may want to look at are the number of steps
 and whether you use SHAKE or not -- 100% of failed attempts so far have been related to SHAKE.
Launch. This will take longer than the Prepare step. A lot longer.


Displaying the result might also take a while -- here's a 2x2x2 box with 100 iterations:
without showing solvent

showing solvent


140. MD = ECCE + NWChem: 1. Prepare

Update 19 June 2013:
A much better written and complete guide to getting started with MD in ECCE+NWChem is found here: http://saccharides.blogspot.tw/2013/06/ecce-md-calculation.html

Original post:
Here's a multi-part description of how to set up a minimal MD simulation using ECCE/NWChem. To make things real easy we'll do an example with a fully described and parametrised system.

Part 2:
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-2-md-optimize.html

Part 3:
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-3-energy.html

Part 4:
http://verahill.blogspot.com.au/2012/05/md-ecce-nwchem-4dynamics.html


How to install and configure ecce is described elsewhere on this blog.

Start here
Start Ecce, select Organizer
 Rightclick in the organiser and create a new MD Study
 Rightclick on the new project, and create a MD Prepare
 This is what you should have now. Double click on the Prepare icon, then click on the Editor icon
 You get this menu. Click on the Builder icon in the bottom left corner.
Click on the Import from Structure Library icon (hidden behind the dropped down menu in the screenshot). Go to RNA bases, monomers and select Cytidine. Also, check the Atom table and Coordinates options in the Tools menu.
 Hit ctrl+s to save, and close the window.

Back in the organiser set the size of the box for the system
 And click on solvate. You'll see what commands are being added to your input files.
You might not be able to launch at this point, with ecce complaining about being unable to find the pdb. For me, opening the builder and clicking on Center in the Coordinates box on the right (make it show by checking the right box in the Tools menu in the Builder) sorted it out.

If all goes well you'll be able to launch your job, which will finish very fast.
 And here's what we have at this point.

Part 2 will do the next step -- MD Optimize

20 March 2012

113. Using ECCE to run nwchem jobs

EDIT: This post is getting messier as I'm hammering things out...but I've gotten everything to work in the end, so please persist.  The workflow described below is not the ideal one, but it'll get you started. I'll link here when I put up a newer, more reasonable tutorial.

EDIT2: I'm really warming to ECCE as I'm learning more about it. I still think it'd be nice if it was open source, and I can't understand why it has to be reliant on csh (which is pretty much broken on ROCKS, and uncomfortable at the best of times), but it's pretty neat once you've got all the details ironed out. Error feedback/report could be better though.

EDIT 3: ECCE is going open source in the (northern) summer of 2012! As users we no longer have any excuses to complain.

Here's a quick introduction to getting started with using ECCE as the interface to nwchem, similar to how gaussview can be used to set up gaussian jobs.

This presumes that you've set up ECCE and preferably compiled your own version of nwchem:
http://verahill.blogspot.com.au/2012/03/ecce-on-debian-but-not-on-rockscentos.html
http://verahill.blogspot.com.au/2012/03/nwchem-61-with-openmpi-on-rocks.html
http://verahill.blogspot.com.au/2012/01/debian-testing-64-wheezy-nwhchem.html


##Important##
Once I had figured all of this out I rebuilt nwchem and re-installed ecce in the proper locations. You might want to do the same.

A. If you're going to use several nodes you should put nwchem in the same position in the file system hierarchy on all nodes e.g.
/opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem

Also, make sure you share a folder (see how to use NFS) between the nodes which you can use for run time files e.g. /work

EDIT 4: This (probably) isn't necessary. In fact, using NFS in the wrong way will slow things down.

Set the permissions right (chown your user and set to 777 -- 755 is enough for nfs sharing between debian nodes, but between ROCKS and Debian you seem to need 777), and open your firewall on all ports for communication between the nodes.
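For the /work share mentioned above, that would be something along the lines of (adjust paths and ownership to taste):
sudo chown -R $USER /work
sudo chmod -R 777 /work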

B. Make sure that ECCE_HOME has been set in ~/.bashrc e.g.
export ECCE_HOME=/opt/ecce/apps

and in ~/.cshrc
setenv ECCE_HOME /opt/ecce/apps

C.
edit /opt/ecce/apps/siteconfig/submit.site (location depends on where you install ecce)
Change lines 65+ from
#NWChemCommand {
#  $nwchem $infile > $outfile
#}
to (for multiple nodes)
NWChemCommand {
mpirun -hostfile /work/hosts.list -n $totalprocs --preload-binary /opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem $infile > $outfile
}
to use mpirun for parallel job submissions and assuming you have a hosts file in /work. For running on a single node you can use


NWChemCommand {
mpirun  -n $totalprocs $nwchem  $infile > $outfile
}

use either --preload-binary /opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem or $nwchem -- see what works for you. You probably can't use preload if you're running different linux distros (e.g. debian and centos)

My hosts.list looks like this:

tantalum slots=4 max_slots=4
beryllium slots=4 max_slots=5

Make sure that you don't accidentally put 2 jobs on node 0, then 2 jobs on node 1, then another 2 jobs on node 0, since they won't be consecutively numbered and will crash armci. You can avoid this by setting slots and max_slots to the same number.
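A cheap way to sanity-check the placement before submitting a real job is to run something trivial across all the slots, e.g.
mpirun -hostfile /work/hosts.list -n 8 hostname
and check that the ranks end up on the hosts you expect.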


D.
You may have to edit /etc/openmpi/openmpi-mca-params.conf if you have several (real or virtual) interfaces and add e.g.


btl=tcp,sm,self
btl_tcp_if_include=eth1,eth2
btl_tcp_if_exclude=eth0,virtbr0


Start ECCE:
First start the server
csh /home/me/tmp/ecce/ecce-v6.2/server/ecce-utils/start_ecce_server
then launch ecce

ecce

This will launch what the ecce people call the 'gateway':
The Gateway

0. Make sure you've got your machine set up
Click on Machine browser
Make sure that you can connect to the node e.g. by clicking on disk usage

Set the application paths. Don't fiddle with nodes -- just change number of processors to the total for all nodes.



1. Draw SiCl4 
Click on the Builder in the Gateway, which gives you the following:
The builder window

Click on More to get the periodic table which gives you access to Si

Select Geometry -- here, Tetrahedral

Si -- with four 'nubs' (yup, that's what the ecce ppl call them)

Time to attach Cl atoms to the nubs. Select Cl and pick Terminal geometry.

Click on a 'nub' to replace it with a Cl

And do it until you've replaced all 'nubs'. Hold down right mouse button to rotate

Click on the broom next to the bond menu on the right to pre-optimize  the structure using MM

And save. You will probably be limited to saving your jobs in folders below the ecce  folder.


2. Set up your job
Click on the Organizer icon in the 'gateway', which takes you here:

Click on the first icon, Editor

Focus on selecting Theory and Run type. Here's we'll do a geometry optimisation.

Click on Details for Theory

Click on Details for Run type

Constraints are optional

In the organizer, click on the third icon to set the basis set. Defined atoms for a particular basis set are indicated by an orange lower-right corner

You can get Details about the basis set

If you don't have a Navy Triangle you can't run. Click on Editor and see what might be wrong.

Ready to run. Click on Launch.
4. Running
I'm still working on enabling more than a single core...
Once you've clicked on launch you'll get

 If you click on viewer you can monitor the job

Optimization in progress
5. Re-launch a job at higher theory
In the Organizer, select your last job and then click on Edit, Duplicate Setup with Last Geometry
You then get a copy to edit

Change the basis set, save, then click on Final Edit

This is the nwchem input file in a vim instance

Add a line to the end saying task scf freq to calculate the vibrations (there's another job option called geovib which does optim+freq, but here we do it by hand)
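The tail of the edited input would then look something like this (a sketch -- the existing task line is whatever ecce generated for your run type):
task scf optimize
task scf freq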

Launch

Running...

You can now look at the vibrations

And you can visualise MOs -- here's the HOMO which looks like all isolated p orbitals on the chlorine

You can also calculate 'properties'

These include GIAO shielding

Performance:
Here's phenol (scf/6-31g*) across three gigabit-linked nodes. The dotted line denotes node boundaries.


Here's a number of alkanes (scf/6-31g) on 4 cores on a single node: