18 September 2012

239. Sun GridEngine: resetting queue status on node

I occasionally run out of disk space during a run on my cluster, which causes the job to fail and the queue on that node to be put into the error state (the E in the states column):

qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
eight.q@neon                   BIP   0/0/8          0.45     lx26-amd64    
---------------------------------------------------------------------------------
five.q@boron                   BIP   0/0/5          6.01     lx26-amd64    
---------------------------------------------------------------------------------
six.q@boron                    BIP   0/6/6          6.01     lx26-amd64    
    788 0.75000 submit__la user         r     09/07/2012 18:36:56     6        
---------------------------------------------------------------------------------
two.q@beryllium                BIP   0/0/2          0.24     lx26-amd64    
---------------------------------------------------------------------------------
four.q@tantalum                BIP   0/0/4          0.05     lx26-amd64    E
---------------------------------------------------------------------------------
three.q@beryllium              BIP   0/0/3          0.24     lx26-amd64    
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.24     lx26-amd64    
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/1          6.01     lx26-amd64    
---------------------------------------------------------------------------------
main.q@tantalum                BIP   0/0/1          0.05     lx26-amd64    

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    789 0.67310 zoli.qsub  user         qw    09/09/2012 10:00:35     6        
    802 0.60527 submit__bi user         qw    09/10/2012 20:45:24     6        
    803 0.60525 submit__bi user         qw    09/10/2012 20:46:00     6        
    927 0.25071 submit__ac user         qw    09/18/2012 08:24:00     4        
    928 0.25000 submit__ac user         qw    09/18/2012 08:45:52     4  

Before you do anything else, free up disk space on the affected node, and consider moving your scratch directory to a separate disk.
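
A quick way to check how full the scratch filesystem is might look like the sketch below. The function name scratch_usage_pct and the SCRATCH_DIR variable are made up for illustration, and /tmp is only a placeholder for wherever your jobs actually write.

```shell
# Print the usage (in percent, without the % sign) of the filesystem
# that holds the given directory. Uses POSIX `df -P` so that each
# filesystem is reported on a single line.
scratch_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# SCRATCH_DIR is a hypothetical variable; /tmp is just a fallback.
echo "scratch usage: $(scratch_usage_pct "${SCRATCH_DIR:-/tmp}")%"
```

Run this on the node whose queue is in the error state (tantalum in the listing above), not on the submit host.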

Since I keep forgetting how to clear the error state, here it is. As an SGE admin, run:
 /usr/bin/qmod -c four.q@tantalum
me@beryllium changed state of "four.q@tantalum" (no error)
The E flag is now gone and the queue accepts jobs again:

qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
eight.q@neon                   BIP   0/0/8          0.25     lx26-amd64    
---------------------------------------------------------------------------------
five.q@boron                   BIP   0/0/5          5.91     lx26-amd64    
---------------------------------------------------------------------------------
six.q@boron                    BIP   0/6/6          5.91     lx26-amd64    
    788 0.75000 submit__la user         r     09/07/2012 18:36:56     6        
---------------------------------------------------------------------------------
two.q@beryllium                BIP   0/0/2          0.44     lx26-amd64    
---------------------------------------------------------------------------------
four.q@tantalum                BIP   0/4/4          0.17     lx26-amd64    
    927 0.25071 submit__ac user         r     09/18/2012 11:01:26     4        
---------------------------------------------------------------------------------
three.q@beryllium              BIP   0/0/3          0.44     lx26-amd64    
---------------------------------------------------------------------------------
main.q@beryllium               BIP   0/0/1          0.44     lx26-amd64    
---------------------------------------------------------------------------------
main.q@boron                   BIP   0/0/1          5.91     lx26-amd64    
---------------------------------------------------------------------------------
main.q@tantalum                BIP   0/0/1          0.17     lx26-amd64    

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    789 0.67310 zoli.qsub  user         qw    09/09/2012 10:00:35     6        
    802 0.60527 submit__bi user         qw    09/10/2012 20:45:24     6        
    803 0.60525 submit__bi user         qw    09/10/2012 20:46:00     6        
    928 0.25000 submit__ac user         qw    09/18/2012 08:45:52     4   
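
If more than one queue ends up in the error state, the same fix can be scripted. The sketch below is only an illustration: error_queues is a made-up helper that reads qstat -f output in the layout shown above and prints the queue instances with an E in the states column. It assumes qstat and qmod are on the PATH and that you have SGE admin rights. It can be worth running qstat -f -explain E first to see why each queue was flagged.

```shell
# Read `qstat -f` output on stdin and print every queue instance
# (name@host) whose sixth column, the states column, contains an E.
# Job lines and separator lines are skipped because their first
# field contains no "@".
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Clear the error state on each affected queue instance.
# Skipped silently on machines without the SGE client tools.
if command -v qstat >/dev/null 2>&1 && command -v qmod >/dev/null 2>&1; then
    qstat -f | error_queues | while read -r q; do
        qmod -c "$q"
    done
fi
```

Clearing the flag does not fix the underlying problem, so free up disk space before running this; otherwise the next job is likely to fail the same way.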
