SGE cheat sheet
SGE is a complex system, and the value of any particular tip depends on how you use it.
Very few tips are universal.
Derived from a cheat sheet by Indy Siva (Mar 09, 2011). See also
Oliver Wiki SGE
[Oct 07, 2014] Enabling schedd_job_info
Use the command qconf -msconf and set the schedd_job_info parameter of the scheduler configuration to true.
The sge_qstat defaults file
The sge_qstat file defines the command line switches that qstat uses by default.
If available, the default sge_qstat file is read and processed by qstat(1).
There is a cluster global and a user private sge_qstat file. The user private file has
the highest precedence, followed by the cluster global sge_qstat file. Command line
switches given to qstat(1) override all switches contained in the user private or
cluster global sge_qstat file.
The default sge_qstat file may contain an arbitrary number of lines,
although it is unclear what the value of lines after the first is. Blank lines
and lines with a '#' sign in the first column are skipped. Each line
can contain any number of qstat(1) options; more than one option per
line is allowed.
Here is an example of a sge_qstat default options file (note the
leading blank before the first "-"):
=====================================================
# Just show me my own running and suspended jobs
-s rs -u $USER
=====================================================
Having defined a default sge_qstat file like this, using qstat without parameters
qstat
has the same effect as if qstat was executed with:
qstat -s rs -u <current_user>
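As a sketch of how such a defaults file might be created: the helper below writes one comment line and one line of qstat options to a file. The name write_qstat_defaults is made up for this example. The usual locations, $HOME/.sge_qstat for the user private file and $SGE_ROOT/$SGE_CELL/common/sge_qstat for the cluster global one, are assumptions based on sge_qstat(5) and should be checked against your installation.

```shell
# write_qstat_defaults FILE COMMENT OPTIONS
# Hypothetical helper: writes a qstat defaults file with one comment line
# and one line of qstat options. Overwrites FILE if it already exists.
write_qstat_defaults() {
    file=$1
    comment=$2
    options=$3
    {
        printf '# %s\n' "$comment"
        printf '%s\n' "$options"
    } > "$file"
}

# Example call (assumed location of the user private file):
#   write_qstat_defaults "$HOME/.sge_qstat" \
#       'Just show me my own running and suspended jobs' '-s rs -u $USER'
```

The options are passed in single quotes so that the shell does not expand $USER before the line is written to the file.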
Commonly Used commands
qacct -j 9999                             get information about finished job 9999
qconf -mhgrp @allhosts                    edit hostgroup "@allhosts"
qstat -f [-q \*@node23]                   full display info [for node23 only]
qconf -sq all.q                           show "all.q" queue info
qconf -mq all.q                           modify "all.q" queue: update hostlist, #slots
qconf -aq all.q                           create queue named "all.q"
qconf -mc                                 edit the complex (resource attribute) configuration
qconf -rattr queue slots 0 all.q@node23   #slots -> 0 (== pbsnodes -o)
qstat -s r -q all.q@node23                show all running jobs on node23
qhost -h node23,node24                    show host info for multiple nodes
qhost -q -h node23,node24                 same, plus queue info
qmod -e all.q@node23                      enable node23 in queue all.q (-d == disable)
qsub -j y -o `pwd` -q all.q test.sh       submit test.sh job on queue all.q
qping -info node23 6445 execd 1           check status of execd on node23
qstat                                     current user jobs
qstat -u "*"                              all user jobs
qstat -g c                                cluster queue summary: load and used/available slots
qstat -f                                  detailed list of machines and job state
qstat -explain c -j job-id                specific job status
qstat -f -u "*"                           detailed list, including the jobs of all users
qdel job-id                               delete job
qsub -l h_vmem=### job.sh                 mem limit, see queue_conf(5) RESOURCE LIMITS
qsub -w v job.00                          troubleshoot problems with queue/scheduling
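The qsub lines above take a job script such as test.sh. A minimal sketch of what such a script could look like follows, with some of the same options embedded as "#$" directives instead of being given on the command line. The script body is purely illustrative; -cwd is used here in place of -o `pwd` and is an assumption about what you want.

```shell
#!/bin/sh
# test.sh: minimal job script sketch. Lines starting with "#$" are read by
# qsub as embedded options; to the shell they are ordinary comments.
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -q all.q

# JOB_ID is set by Grid Engine inside a running job; fall back to a
# placeholder so the script can also be tried outside the cluster.
job_id=${JOB_ID:-interactive}
host=$(uname -n)
echo "job $job_id running on $host"
```

With the directives in place the script can be submitted as plain "qsub test.sh"; options given on the command line still override the embedded ones.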
Adding and removing administrative privileges from a host
Softpanorama page available qconf
qconf -ah <hostname> # gives host administrative privileges
qconf -dh <hostname> # removes administrative privileges from host
Adding an execution host
- Make the new host an administrative host (the installer expects this)
qconf -ah <hostname>
- As root on this new host, run the following script from $SGE_ROOT; it registers the host as an execution host
./install_execd
Removing an execution host
qconf -de <hostname> # remove host from the list of execution hosts
Adding and removing submit hosts
qconf -as <hostname> # host is now a submit host
qconf -ds <hostname> # jobs may not be submitted from host
Displaying current administrative/submit/execution hosts
qconf -sh # show current administrative hosts
qconf -ss # show current submit hosts
qconf -sel # show current execution host list
Administering queues
qconf -aq <queuename> # adding a queue
qconf -dq <queuename> # delete a queue
qconf -mq <queuename> # modify a queue
qconf -Aq <filename> # adding a queue from file
qconf -mattr queue ... # change single attributes of more than one queue
qalter -w v <jobid>
This command lists the reasons why a job is not dispatchable in principle.
For this purpose a dry scheduling run is performed.
The special feature of this dry scheduling run is that all consumable resources
(including slots) are considered to be fully available for this job. Similarly, all load values are ignored because they vary over time.
Job or Queue goes in error state "E"
Job or queue errors are indicated by an uppercase "E" in the qstat output. A job enters the error state when Grid Engine tried to execute a job in a queue, but it failed for a reason that is specific to the job. A queue enters the error state when Grid Engine tried to execute a job in a queue, but it failed for a reason that is specific to the queue.
Grid Engine offers several ways for users and administrators to get diagnostic information in case of job execution errors. Since both the queue and the job error state result from a failed job execution, these diagnostic options apply to both types of error state:
Since Grid Engine 6.0, a one-line error reason is available for jobs in error state through
qstat -j <jobid> | grep error
As of 6.0 this is the recommended first source of diagnostic information for the end user.
For queues in error state a one-line error reason is available through
qstat -explain E
As of 6.0 this is the recommended first source of diagnostic information for administrators in case
of queue errors.
- user abort mail: If jobs are submitted with the submit option "-m a", an abort mail is sent to the
address specified with the "-M user[@host]" option. The abort mail contains diagnostic information about
job errors and is the recommended source of information for users.
- qacct accounting: If no abort mail is available the user can run
qacct -j <jobid>
to get information about the job error from Grid Engine job accounting.
- administrator abort mail: An administrator can request administrator mails about job execution problems
by specifying an appropriate email address (see administrator_mail in sge_conf(5)). Administrator
mails contain more detailed diagnostic information than user abort mails and are recommended in case
of frequent job execution errors.
- messages files: If no administrator mail is available, the Qmaster messages file should be
investigated first. Log entries related to a certain job can be found by searching for the appropriate job ID.
In the 'default' installation the Qmaster messages file is located at $SGE_ROOT/default/spool/qmaster/messages
Additional information can sometimes be found in the messages file of the Execd where the job was started.
Use qacct -j <jobid> to figure out the host where the job was started and search in $SGE_ROOT/default/spool/<host>/messages
for the jobid.
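The first steps of this diagnosis can be scripted. The sketch below assumes the default qstat listing, in which the job ID is the first column and the state is the fifth, and a messages file in which the job ID appears as a separate word. Both function names are made up for this example.

```shell
# jobs_in_error: read default "qstat" output on stdin and print the IDs of
# jobs whose state column (5th field) contains "E", for example "Eqw".
# Header and separator lines are skipped because their first field is not
# a number.
jobs_in_error() {
    awk '$1 ~ /^[0-9]+$/ && $5 ~ /E/ { print $1 }'
}

# job_log_lines JOBID FILE: print the lines of a messages file that
# mention JOBID as a whole word.
job_log_lines() {
    grep -w -- "$1" "$2"
}

# Example use on a cluster (not run here):
#   qstat -u "*" | jobs_in_error | while read id; do
#       qstat -j "$id" | grep error
#       job_log_lines "$id" "$SGE_ROOT/default/spool/qmaster/messages"
#   done
```

Matching the job ID as a whole word keeps job 102 from also matching log lines about job 1020.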
Suspend and resubmit stalled jobs
Reference: http://gis.fem-environment.eu/grid-engine-howto/
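# as user: collect the IDs of the user's jobs (here: user neteler)
# (cut -f2 relies on the job-ID column starting with a blank)
qstat | grep neteler | tr -s ' ' ' ' | cut -d' ' -f2 > /tmp/to_suspend.sge
cat /tmp/to_suspend.sge
# as root (?):
su -
for i in `cat /tmp/to_suspend.sge` ; do qmod -sj $i ; done
qstat
# delete the crashed blade from the host group (opens an editor):
qconf -mhgrp "@allhosts"
# verify the new list:
qconf -shgrp "@allhosts"
# remove crashed blade from list of execution hosts
# (refused while a host group or queue still references the host):
qconf -de blade14
# verify queue stats:
qstat -f
# resubmit jobs to other nodes (as job user!!)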
for i in `cat /tmp/to_suspend.sge` ; do qresub $i ; done
qstat
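The grep/tr/cut pipeline used above to collect job IDs depends on the exact spacing of the qstat listing. A slightly more forgiving sketch, assuming only that the job ID is the first whitespace-separated field of each job line, is shown below. The function name extract_job_ids is made up for this example.

```shell
# extract_job_ids: read "qstat" output on stdin and print each job ID once.
# Header and separator lines are skipped because their first field is not
# a number; jobs that are listed on several lines are deduplicated.
extract_job_ids() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -n -u
}

# Example use on a cluster (not run here):
#   qstat -u neteler | extract_job_ids > /tmp/to_suspend.sge
```

Letting qstat -u select the user avoids matching the user name in other columns, such as the job name.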
This command sends a signal to a running job:
qmod -sj | -usj | -cj <jobid> (suspend | unsuspend | clear error)
qmod -sj 3312136
qmod -usj 3312136
root - unsuspended job 3312136
Parallel Environment Configuration
qconf -sp pename     Show the configuration for the specified parallel environment.
qconf -spl           Show a list of all currently configured parallel environments.
qconf -ap pename     Add a new parallel environment.
qconf -Ap filename   Add a parallel environment from file filename.
qconf -mp pename     Modify the specified parallel environment using an editor.
qconf -Mp filename   Modify a parallel environment from file filename.
qconf -dp pename     Delete the specified parallel environment.
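For the file-based variants (-Ap, -Mp) a definition file is needed. The sketch below writes one with the field names described in sge_pe(5). The helper name write_pe_file is made up, all values other than the name and slot count are placeholder assumptions, and the set of fields differs between Grid Engine versions, so compare the result with the output of qconf -sp on your own system before using it.

```shell
# write_pe_file NAME SLOTS FILE
# Hypothetical helper: writes a parallel environment definition that can be
# passed to "qconf -Ap FILE". All values other than NAME and SLOTS are
# placeholder assumptions, not recommendations.
write_pe_file() {
    {
        printf 'pe_name            %s\n' "$1"
        printf 'slots              %s\n' "$2"
        printf 'user_lists         NONE\n'
        printf 'xuser_lists        NONE\n'
        printf 'start_proc_args    /bin/true\n'
        printf 'stop_proc_args     /bin/true\n'
        printf 'allocation_rule    $pe_slots\n'
        printf 'control_slaves     FALSE\n'
        printf 'job_is_first_task  TRUE\n'
        printf 'urgency_slots      min\n'
        printf 'accounting_summary FALSE\n'
    } > "$3"
}

# Example use on a cluster (not run here; "mpi_test" is a made-up name):
#   write_pe_file mpi_test 16 /tmp/mpi_test.pe && qconf -Ap /tmp/mpi_test.pe
```

The single quotes around the printf formats keep $pe_slots literal, which is how the allocation rule is spelled in the definition file.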
List of currently defined queues
qconf -sql
How do I control my jobs?
Based on the status of the job displayed, you can control the job by the following actions:
- Modify a job: As a user, you have certain rights that apply exclusively to your jobs. The
commands used are qalter (change the attributes of a pending job) and qmod (change the state of a job). Check the man pages for the options that you are allowed to use.
- Suspend (or resume) a job: This sends a signal to the job (by default SIGSTOP to suspend and SIGCONT to resume) and applies only to running jobs.
In practice you type qmod -sj job_id (or qmod -usj job_id to resume), where job_id is given by qstat or qsub.
- Delete a job: You can delete a job that is running or waiting in the queue by using the qdel
command like this: qdel job_id, where job_id is given by qstat or qsub. The -f (force) option is only needed
when the execution daemon that controls a running job does not respond to the delete request; it removes the job
from the qmaster's job list anyway, and its use may be restricted to managers and operators.
Job Priorities
The Grid Engine software also sometimes lets users set priorities among their own jobs. A user who
submits several jobs can specify, for example, that job 3 is the most important and that jobs 1 and
2 are equally important but less important than job 3.
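- qsub -p option. The priority ranges from -1023 (lowest) to 1024 (highest); the default is 0. Ordinary users can only lower the priority of their jobs, while operators and managers can also raise it. This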
priority tells the scheduler how to choose among a single user's jobs when several of that user's
jobs are in the system simultaneously. The relative importance assigned to a particular job depends
on the maximum and minimum priorities that are given to any of that user's jobs, and on the priority
value of the specific job.
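As a small illustration of the -p option, the sketch below checks a priority value against the range documented in qsub(1), -1023 to 1024, and prints the resulting command line instead of submitting anything. The function name priority_cmd and the script name job.sh are made up for this example.

```shell
# priority_cmd PRIORITY SCRIPT
# Hypothetical helper: checks that PRIORITY is an integer in the range
# accepted by "qsub -p" (-1023 to 1024) and prints the qsub command line
# instead of running it. Returns 1 on invalid input.
priority_cmd() {
    case $1 in
        ''|-|*[!0-9-]*|?*-*)
            echo "invalid priority: $1" >&2
            return 1
            ;;
    esac
    if [ "$1" -lt -1023 ] || [ "$1" -gt 1024 ]; then
        echo "priority out of range: $1" >&2
        return 1
    fi
    echo "qsub -p $1 $2"
}

# Example: give job.sh a lower priority than the user's other jobs.
#   priority_cmd -100 job.sh     prints: qsub -p -100 job.sh
```

Ordinary users would normally pass zero or a negative value here, since raising a job above the default priority is reserved for operators and managers.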
Clean up!
# clear the error state of the listed queues:
/usr/local/SGE/bin/lx24-amd64/qmod -cq lam mpich2 long short nolimit