I FOLLOW THAT POST ALMOST VERBATIM
This post will be more of an "I followed this guide and it actually works on debian testing/wheezy too and here's how" post, since it doesn't add anything significant to the post above, other than detail.
Since I ran into problems over and over again, I'm posting as much as I can here. Hopefully you can ignore most of the post for this reason.
Some reading before you start:
Having toyed with this for a while I've noticed one important factor in getting this to work:
the hostnames you use when you configure SGE MUST match those returned by hostname. It doesn't matter what you've defined in your /etc/hosts file. This can obviously cause a little bit of trouble when you've got multiple subnets set up (my computers communicate via a 10/100 net for WAN and a 10/100/1000 net for computations). My front node is called beryllium (i.e. this is what is returned when hostname is executed) but it's known as corella on the gigabit LAN. Same goes for one of my subnodes: it's called borax on the giganet and boron on the slow LAN; hostname here returns boron. I should obviously go back and redo this for the gigabit subnet later -- I'm just posting what worked.
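A quick sanity check on each machine before you configure anything -- the name returned by hostname is the one to feed to qconf, regardless of what /etc/hosts says:
hostname
grep "$(hostname)" /etc/hosts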
While setting it up on the front node takes a little while, the good news is that very little work needs to be done on each node. This would become important when you are working with a large number of nodes -- with the power of xargs and a name list, setting them up on the front node should be a breeze.
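For example, something along these lines should do the trick once the queue and host group described further down exist. This is only a sketch: nodes.txt is a hypothetical file with one node hostname per line, and 4 slots per node is just an example value.
# register every node in nodes.txt as a submit host, host group member and slot provider
xargs -I {} qconf -as {} < nodes.txt
xargs -I {} qconf -aattr hostgroup hostlist {} @allhosts < nodes.txt
xargs -I {} qconf -aattr queue slots "[{}=4]" main.q < nodes.txt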
My front node is beryllium, and one of my subnodes is boron. I've got key-based, password-less ssh login set up.
Set up your front node before you touch your subnodes. Add all the node name to your front node before even installing gridengine-exec on the subnode.
I've spent a day struggling with this. The order of events listed here is the first thing that worked. You make modifications at your own peril (and frustration). I tried openjdk with little luck, hence the sun java.
NFS
Finally, I've got nfs set up to share a folder from the front node (~/jobs) to all my subnodes. See here for instructions on how to set it up: http://verahill.blogspot.com.au/2012/02/debian-testing-wheezy-64-sharing-folder.html
When you use ecce, you can and SHOULD use local scratch folders i.e. use your nfs shared folder as the runtime folder, but set scratch to e.g. /tmp which isn't an nfs exported folder.
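For reference, a minimal sketch of what that share can look like -- the path and subnet are just examples, follow the linked post for the real setup:
# /etc/exports on the front node (beryllium), exporting ~/jobs to the LAN:
/home/verahill/jobs 192.168.1.0/24(rw,sync,no_subtree_check)
# reload the export table on the front node:
sudo exportfs -ra
# and on each subnode, e.g. via /etc/fstab:
beryllium:/home/verahill/jobs /home/verahill/jobs nfs defaults 0 0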
Before you start, stop and purge
If you've tried installing and configuring gridengine in the past, there may be processes and files which will interfere. On each computer do:
ps aux|grep sge
use sudo kill to kill any sge processes
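A blunt one-liner that does the same thing (a sketch -- check what it matches before killing anything; the bracketed pattern just stops grep from matching itself):
ps aux | grep '[s]ge_' | awk '{print $2}' | xargs -r sudo kill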
Then
sudo apt-get purge gridengine-*
First install sun/oracle java on all nodes.
[UPDATE 24 Aug 2013: openjdk-6-jre or openjdk-7-jre work fine, so you can skip this]
There's no sun/oracle java in the debian testing repos anymore, so we'll follow this: http://verahill.blogspot.com.au/2012/04/installing-sunoracle-java-in-debian.html
sudo apt-get install java-package
Download the jre-6u31-linux-x64.bin from here: http://java.com/en/download/manual.jsp?locale=en
make-jpkg jre-6u31-linux-x64.bin
sudo dpkg -i oracle-j2re1.6_1.6.0+update31_amd64.deb
Then select your shiny oracle java by doing:
sudo update-alternatives --config java
sudo update-alternatives --config javaws
Do that on every node, front node and subnodes. You don't have to repeat all the steps though: you just built oracle-j2re1.6_1.6.0+update31_amd64.deb, so copy that to your nodes, do sudo dpkg -i oracle-j2re1.6_1.6.0+update31_amd64.deb and then do the sudo update-alternatives dance.
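As a rough sketch, assuming the same hypothetical nodes.txt as above and the .deb sitting in your current directory (update-alternatives --config is interactive, hence ssh -t):
for node in $(cat nodes.txt); do
  scp oracle-j2re1.6_1.6.0+update31_amd64.deb $node:
  ssh -t $node 'sudo dpkg -i oracle-j2re1.6_1.6.0+update31_amd64.deb && sudo update-alternatives --config java && sudo update-alternatives --config javaws'
done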
Front node:
sudo apt-get install gridengine-client gridengine-qmon gridengine-exec gridengine-master
(at the moment this installs v 6.2u5-7)
I used the following:
Configure automatically: yes
=> SGE_ROOT: /var/lib/gridengine
Cell name: rupert
Master hostname: beryllium
=> SGE_CELL: rupert
=> Spool directory: /var/spool/gridengine/spooldb
=> Initial manager user: sgeadmin
sudo -u sgeadmin qconf -am ${USER}
sgeadmin@beryllium added "verahill" to manager list
and to the user list:
qconf -au ${USER} users
added "verahill" to access list "users"We add beryllium as a submit host
qconf -as beryllium
beryllium added to submit host list
Create the group allhosts:
qconf -ahgrp @allhosts
group_name @allhosts
hostlist NONE
I made no changes
Add beryllium to the hostlist
qconf -aattr hostgroup hostlist beryllium @allhosts
verahill@beryllium modified "@allhosts" in host group list
qconf -aq main.q
This opens another text file. I made no changes.
verahill@beryllium added "main.q" to cluster queue list
Add the host group to the queue:
qconf -aattr queue hostlist @allhosts main.q
verahill@beryllium modified "main.q" in cluster queue list
1 core on beryllium is added to SGE:
qconf -aattr queue slots "[beryllium=1]" main.q
verahill@beryllium modified "main.q" in cluster queue list
Add an execution host:
qconf -ae
which opens a text file in vim
I edited hostname (boron) but nothing else. Saving returns
added host boron to exec host list
Add boron as a submit host:
qconf -as boron
boron added to submit host list
Add 3 cores for boron:
qconf -aattr queue slots "[boron=3]" main.q
Add boron to the queue
qconf -aattr hostgroup hostlist boron @allhosts
Here's my history list in case you can't be bothered to read everything in detail above.
2015 sudo apt-get install gridengine-client gridengine-qmon gridengine-exec gridengine-master
2016 sudo -u sgeadmin qconf -am ${USER}
2017 qconf -help
2018 qconf user_list
2019 qconf -au ${USER} users
2020 qconf -as beryllium
2021 qconf -ahgrp @allhosts
2022 qconf -aattr hostgroup hostlist beryllium @allhosts
2023 qconf -aq main.q
2024 qconf -aattr queue hostlist @allhosts main.q
2025 qconf -aattr queue slots "[beryllium=1]" main.q
2026 qconf -as boron
2027 qconf -ae
2028 qconf -aattr hostgroup hostlist beryllium @allhosts
2029 qconf -aattr queue slots "[boron=3]" main.q
2030 qconf -aattr hostgroup hostlist boron @allhosts
Next, set up your subnodes:
My example here is a subnode called boron.
On the subnode:
sudo apt-get install gridengine-exec gridengine-client
Configure automatically: yes
This node is called boron.
Cell name: rupert
Master hostname: beryllium
Check whether sge_execd got started after the install:
ps aux|grep sge
sgeadmin 25091 0.0 0.0 31712 1968 ? Sl 13:54 0:00 /usr/lib/gridengine/sge_execd
If not, and only if not, do
/etc/init.d/gridengine-exec start
cat /tmp/execd_messages.*
If there's no message corresponding to the current iteration of sge (i.e. you may have old error messages from earlier attempts) then you're probably in a good place.
Back to the front node:
qhost
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
beryllium lx26-amd64 6 0.57 7.8G 3.9G 14.9G 597.7M
boron lx26-amd64 3 0.62 3.8G 255.6M 14.9G 0.0
If the exec node isn't recognised (i.e. it's listed but without CPU info or anything else) then you're in a dark place. You'll probably find a message along the lines of "request for user soandso does not match credentials" in the /tmp/execd_messages.* files on the exec node. The only way I got that solved was stopping all sge processes everywhere, purging all gridengine-* packages on all nodes and starting from the beginning -- hence why I posted the history output above.
qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
main.q@beryllium BIP 0/0/1 0.64 lx26-amd64
---------------------------------------------------------------------------------
main.q@boron BIP 0/0/3 0.72 lx26-amd64
Time to see how far we've got:
Create a file called test.qsub on your front node:
#$ -S /bin/csh
#$ -cwd
tree -L 1 -d
hostname
Submit it:
qsub test.qsub
Your job 2 ("test.qsub") has been submitted
qstat -u ${USER}
job-ID prior name user state submit/start at queue slots ja-task-ID
2 0.00000 test.qsub verahill qw 06/05/2012 14:03:10 1
ls
test.qsub test.qsub.e2 test.qsub.o2
cat test.qsub.[oe]*
.
0 directories
beryllium
Tree could have had more exciting output I s'pose, but I didn't have any subfolders in my run directory.
So far, so good. We still need to set up parallel environments (e.g. orte, mpi).
Before that, we'll add another node, which is called tantalum and has a quadcore cpu.
On the front node:
qconf -as tantalum
qconf -ae
Replace the template hostname with tantalum. Then:
qconf -aattr queue slots "[tantalum=4]" main.q
qconf -aattr hostgroup hostlist tantalum @allhosts
qhost
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
beryllium lx26-amd64 6 0.67 7.8G 3.7G 14.9G 597.7M
boron lx26-amd64 3 0.14 3.8G 248.0M 14.9G 0.0
tantalum - - - - - - -
On tantalum:
Install java by copying the oracle-j2re1.6_1.6.0+update31_amd64.deb which got created when you set it up the first time.
sudo dpkg -i oracle-j2re1.6_1.6.0+update31_amd64.deb
sudo update-alternatives --config java
sudo update-alternatives --config javaws
Install gridengine:
sudo apt-get install gridengine-exec gridengine-client
On the front node:
qhost
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
beryllium lx26-amd64 6 0.62 7.8G 3.7G 14.9G 601.0M
boron lx26-amd64 3 0.15 3.8G 248.6M 14.9G 0.0
tantalum lx26-amd64 4 4.02 7.7G 977.0M 14.9G 24.1M
qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
main.q@beryllium BIP 0/0/1 0.71 lx26-amd64
---------------------------------------------------------------------------------
main.q@boron BIP 0/0/3 0.72 lx26-amd64
---------------------------------------------------------------------------------
main.q@tantalum BIP 0/0/4 4.01 lx26-amd64
It's a beautiful thing when everything suddenly works.
In order to use all the cores on each node we need to set up parallel environments.
qconf -ap orte
pe_name orte
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
To use a parallel environment, include #$ -pe orte 3 for 3 slots in your test.qsub:
#$ -S /bin/csh
#$ -cwd
#$ -pe orte 3
tree -L 1 -d
hostname
Submit it:
qsub test.qsub
Your job 14 ("test.qsub") has been submittedqstat
job-ID prior name user state submit/start at queue slots ja-task-ID
14 0.00000 test.qsub verahill qw 06/05/2012 15:43:25 3
verahill@beryllium:~/mine/qsubtest$ cat test.qsub.*
.
0 directories
boron
It got executed on boron.
We are basically done with a basic setup now. To read more, use google. Some additional info that might be helpful is here: http://wiki.gridengine.info/wiki/index.php/StephansBlog
We're going to set up a few more parallel environments:
qconf -ap mpi1
pe_name mpi1
slots 9
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
qconf -ap mpi2
pe_name mpi2
slots 9
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule 2
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
qconf -ap mpi3
pe_name mpi3
slots 9
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule 3
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
qconf -ap mpi4
pe_name mpi4
slots 9
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule 4
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
And we'll call these using the #$ -pe mpi$totalprocs $totalprocs directive below: a four-core job, for example, ends up as #$ -pe mpi4 4, and allocation_rule 4 then puts all four slots on a single host.
We need to add them (update: you need to add them to a queue. Which one is irrelevant, as long as the environment and queue parameters are consistent) to our main.q file though:
qconf -mq main.q
pe_list make orte mpi1 mpi2 mpi3 mpi4
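If you'd rather not edit the queue interactively, adding the parallel environments one at a time with qconf's attribute commands should (as far as I can tell) work too:
qconf -aattr queue pe_list mpi1 main.q
qconf -aattr queue pe_list mpi2 main.q
qconf -aattr queue pe_list mpi3 main.q
qconf -aattr queue pe_list mpi4 main.q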
This obviously isn't the end of my travails -- now I need to get nwchem and gaussian happy.
I've got this in my CONFIG.Dynamic (inside joke) file
NWChem: /opt/nwchem/nwchem-6.1/bin/LINUX64/nwchem
Gaussian-03: /opt/gaussian/g09/g09
perlPath: /usr/bin/perl
qmgrPath: /usr/bin/
SGE {
#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=$wallTime
#$ -l h_vmem=4G
#$ -j y
#$ -pe mpi$totalprocs $totalprocs
}
NWChemCommand {
setenv LD_LIBRARY_PATH "/usr/lib/openmpi/lib:/opt/openblas/lib"
setenv PATH "/bin:/usr/bin:/sbin:/usr/sbin"
mpirun -n $totalprocs /opt/nwchem/nwchem-6.1/bin/LINUX64/nwchem $infile > $outfile
}
Gaussian-03Command{
setenv GAUSS_SCRDIR /scratch
setenv GAUSS_EXEDIR /opt/gaussian/g09/bsd:/opt/gaussian/g09/local:/opt/gaussian/g09/extras:/opt/gaussian/g09
/opt/gaussian/g09/g09 $infile $outfile >g09.log
}
And now everything works!
See below for a few of the annoying errors I encountered during my adventures:
Error -- missing gridengine-client
The gaussian set-up worked fine. The nwchem setup worked on one node but not at all on another -- my problem sounded identical to that described here (two nodes, same binaries, still one works and one doesn't):
http://www.open-mpi.org/community/lists/users/2010/07/13503.php
And it's the same as this one too http://www.digipedia.pl/usenet/thread/11269/867/
[boron:18333] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 161
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[boron:18333] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../orte/runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[boron:18333] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../orte/tools/orterun/orterun.c at line 543
It took a while to troubleshoot this one. As always, when you're troubleshooting you discover the odd thing or two:
/usr/bin/rsh -> /etc/alternatives/rsh
which is normal, but
/etc/alternatives/rsh -> /usr/bin/krb5-rsh
There are some krb packages on tantalum, but nothing on boron:
boron:
locate rsh|grep "usr/bin"
/usr/bin/rsh
tantalum:
locate rsh|grep "usr/bin"
/usr/bin/glib-genmarshal
/usr/bin/qrsh
/usr/bin/rsh
sudo apt-get autoremove krb5-clients
Of course, that did not get it working...
The annoying thing is that nwchem/mpirun on boron work perfectly together, also when submitting jobs directly via ECCE. It's just with qsub that I'm having trouble. The search continues:
On the troublesome node:
aptitude search mpi|grep ^i
i libblacs-mpi-dev - Basic Linear Algebra Comm. Subprograms - D
i A libblacs-mpi1 - Basic Linear Algebra Comm. Subprograms - S
i A libexempi3 - library to parse XMP metadata (Library)
i libopenmpi-dev - high performance message passing library -
i A libopenmpi1.3 - high performance message passing library -
i libscalapack-mpi-dev - Scalable Linear Algebra Package - Dev. fil
i A libscalapack-mpi1 - Scalable Linear Algebra Package - Shared l
i A mpi-default-bin - Standard MPI runtime programs (metapackage
i A mpi-default-dev - Standard MPI development files (metapackag
i openmpi-bin - high performance message passing library -
i A openmpi-checkpoint - high performance message passing library -
i A openmpi-common - high performance message passing library -
Library conflict?
sudo apt-get autoremove mpi-default-*
And then recompile nwchem. Still no change.
Finally I found the real problem:
gridengine-client was missing on the troublesome node. Once I had installed that, everything worked!
Errors:
If your parallel job won't start (it sits in qw forever), and qstat -j jobid gives you
scheduling info: cannot run in PE "orte" because it only offers 0 slots
make sure that qstat -f lists all your nodes.
This is good:
qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
main.q@beryllium BIP 0/0/1 0.71 lx26-amd64
---------------------------------------------------------------------------------
main.q@boron BIP 0/0/3 0.72 lx26-amd64
---------------------------------------------------------------------------------
main.q@tantalum BIP 0/0/4 4.01 lx26-amd64
This is bad:
qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
main.q@beryllium BIP 0/0/1 0.64 lx26-amd64
To fix it, do
qconf -aattr hostgroup hostlist tantalum @allhosts
on the front node for all your node names (change tantalum to the correct name)
An unhelpful error message:
qstat -u verahill
job-ID prior name user state submit/start at queue slots ja-task-ID
3 0.50000 test.qsub verahill Eqw 06/05/2012 11:45:18 1
cat test.qsub.[eo]*
/builder/src-buildserver/Platform-7.0/src/linux/lwmsg/src/connection-wire.c:325: Should not be here
This came from a faulty qsub directive: I used
#$ -S csh
instead of
#$ -S /bin/csh
i.e. you should use the latter.
I think it's a potentially common enough mistake that it's worth posting here. See http://helms-deep.cable.nu/~rwh/blog/?p=159 for more errors.
Links to this post:
http://gridengine.org/pipermail/users/2012-November/005207.html
http://web.archiveorange.com/archive/v/JfPLjOHE5fXSiyFH0yzc
Hi,
I followed your tutorial and everything went fine. I first installed java version "1.7.0_09" on both my master and my exec node. Next I installed the master without any problems, but when I installed the exec it gave me the following error:
11/13/2012 13:52:42| main|node0|E|communication error for "node0/execd/1" running on port 6445: "can't bind socket"
11/13/2012 13:52:43| main|node0|E|commlib error: can't bind socket (no additional information available)
11/13/2012 13:53:11| main|node0|C|abort qmaster registration due to communication errors
11/13/2012 13:53:13| main|node0|W|daemonize error: child exited before sending daemonize state
What can I do to try and fix this? Both my nodes have java, I'm able to ssh from my master to my node, my node gets its IP from my master's dhcp, and there are no iptables rules.
Hope you can help :)
Not sure what you did based on your description -- you're still working on setting up your front node, and while gridengine-master installed, the following packages didn't work out: gridengine-client gridengine-qmon gridengine-exec?
It's hard to guess at what the problem can be. First obviously restart your services. Next, have a look at this post for some steps you can take to get info for troubleshooting: http://verahill.blogspot.com.au/2012/08/sun-gridengine-commlib-error-got-select.html
(i.e. don't look at the solution as much as the different tests).
In particular, look at /tmp/execd_messages.*
Also, be careful about setting node names to the same as hostname.
Once SGE is up and running it's robust, but getting it working that first time can sometimes take a bit of work.
Hi,
Well, I got the services started now: I currently use my master and a node (12 cores / 64 GB). When I try to run the following script:
#!/bin/bash
#$-cwd
#$-N SA
#$-t 1-4200:1
/var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
main.q@camilla.UGent.be BIP 0/0/1 0.00 lx26-amd64
---------------------------------------------------------------------------------
main.q@node0 BIP 0/0/24 0.01 lx26-amd64
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
31 0.00000 SA root qw 11/14/2012 09:05:47 1 626-4200:1
root@camilla:/nfs/share/sge# qstat -explain c -j 31
==============================================================
job_number: 31
exec_file: job_scripts/31
submission_time: Wed Nov 14 09:05:47 2012
owner: root
uid: 0
group: root
gid: 0
sge_o_home: /root
sge_o_log_name: root
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
sge_o_shell: /bin/bash
sge_o_workdir: /nfs/share/sge
sge_o_host: camilla
account: sge
cwd: /nfs/share/sge
mail_list: root@camilla
notify: FALSE
job_name: SA
jobshare: 0
env_list:
script_file: HistDisCaCO31.sh
job-array tasks: 1-4200:1
scheduling info: queue instance "main.q@camilla" dropped because it is full
queue instance "main.q@node0" dropped because it is full
not all array task may be started due to 'max_aj_instances'
It seems as if it's full but the load_avg is 0...
Try setting up a simple test job first to see if anything gets executed e.g.
#!/bin/bash
#$-cwd
hostname
I've also never worked with arrays -- you may need to create a parallel environment (e.g. orte) and add it to your queue to allow for several slots.
What allocation_rule are you using? And what do you have under load_thresholds ?
When I do qsub test.sh (with the script you suggested) I get the following in the output file:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
camilla
So it does give the hostname, but it also gives an error.
It wasn't loading because of some errors, but now it loads and uses all the resources. But I also get some error files with the message:
stdin: is not a tty
Okay, the script works: I had to modify my main.q and set shell_start_mode to unix_behavior.
Have you ever come across the following error:
error reason 1: 11/22/2012 10:08:40 [0:9192]: execvlp(/var/spool/gridengine/execd/camilla/job_scripts/73, "/var/spool/
error reason 2: 11/22/2012 12:08:55 [0:4274]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/73, "/var/spool/
error reason 3: 11/22/2012 12:08:55 [0:4275]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/73, "/var/spool/
This error just keeps coming when I run my script.
What does your script look like?
Also, does the job get marked with an E in your qstat? What is the reason given? NFS related?
That's my script:
#!/bin/bash
#$-cwd
#$-N SA
#$-t 1-4200:1
#$-S /bin/sh
/var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
I don't think it's an nfs problem; the directory I'm working in is also available on the exec node. Yes, the job gets marked: Eqw
Does
qstat -j {jobid} -explain E
give any more detail?
I don't really use mathematica so I can't test anything here.
The output you posted in the original post looks truncated -- is that the whole error? E.g. there are unclosed "
Yes, you're right, the message was too long for qstat -j jobid to show the full error. I found it in the messages file:
11/22/2012 12:26:11| main|camilla|E|shepherd of job 76.226 exited with exit status = 27
11/22/2012 12:26:11| main|camilla|E|can't open usage file "active_jobs/76.226/usage" for job 76.226: No such file or directory
11/22/2012 12:26:11| main|camilla|E|11/22/2012 12:26:10 [0:11412]: execvlp(/var/spool/gridengine/execd/camilla/job_scripts/76, "/var/spool/gridengine/execd/camilla/job_scripts/76") failed: No such file or directory
It's fairly similar to this:
http://comments.gmane.org/gmane.comp.clustering.gridengine.users/20808
which is unresolved.
However
https://arc.liv.ac.uk/pipermail/gridengine-users/2006-March/009277.html
and
https://arc.liv.ac.uk/pipermail/gridengine-users/2005-February/003691.html
points towards permissions.
The user I launch my jobs with on the front node is an sge_admin, and I have duplicated that user on all nodes (but not as sge_admin on the nodes).
Hi again. I've got a few questions setting up PE OMPI.
Are you submitting 'test.qsub' on an NFS share?
What user should launch the jobs? Must he exist on all nodes?
Thank you!
Not an expert, but yes, I presume that the user must exist and have read/write/execute permission on all nodes for the binary and the save directory.
You can have your job execute wherever -- the .qsub files are highly configurable. E.g. you can share a folder via NFS between all nodes, then have the script copy the input and execute it in a local folder.
I submit jobs on NFS shares -- I get around the speed issue by exporting one share from each node to the master, e.g. ~/node, ~/node2, and then requesting that the jobs be launched in that directory. It works well with my setup for e.g. ECCE/NWChem/G09.
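As a rough sketch of that per-node share idea (hostnames and paths are just examples, not my actual setup):
# /etc/exports on the subnode (e.g. boron), exporting a work directory to the master only:
/home/verahill/work beryllium(rw,sync,no_subtree_check)
# /etc/fstab entry on the master (beryllium), mounting it as ~/node:
boron:/home/verahill/work /home/verahill/node nfs defaults 0 0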