
SGE cheat sheet


SGE is a complex system, and the value of a given tip depends on how you use it. Very few tips are universal.

Derived from a cheat sheet by Indy Siva (Mar 09, 2011). See also Oliver Wiki SGE.

Commonly Used commands
Adding and removing administrative privileges from a host
Adding an execution host
Removing an execution host
Adding and removing submit hosts
Displaying current administrative/submit/execution hosts
Administering queues
Job or Queue goes in error state "E"
Suspend and resubmit stalled jobs
Parallel Environment Configuration
List of currently defined queues
How do I Control my jobs ?
Job Priorities
Clean up!


NEWS CONTENTS

20141007 : Enabling schedd_job_info ( Oct 07, 2014 )
210210 : sge_qstat -- the default qstat options file ( softpanorama.org, )

Old News ;-)

[Oct 07, 2014] Enabling schedd_job_info

Use the command qconf -msconf to edit the scheduler configuration, and set schedd_job_info to true there.

A space is necessary before the "-" (minus) in sge_qstat -- the default qstat options file

sge_qstat defines the command line switches that will be used by qstat by default. If available, the default sge_qstat file is read and processed by qstat(1).

There is a cluster global and a user private sge_qstat file. The user private file has the highest precedence and is followed by the cluster global sge_qstat file. Command line switches used with qstat(1) override all switches contained in the user private or cluster global sge_qstat file.

The default sge_qstat file may contain an arbitrary number of lines, although the value of lines after the first is unclear. Blank lines and lines with a '#' sign in the first column are skipped. Each line can contain a set of qstat(1) options; more than one option per line is allowed.

Here is an example of a sge_qstat default options file (note the leading blank before the first "-"):
=====================================================
# Just show me my own running and suspended jobs
 -s rs -u $USER
=====================================================
Having defined a default sge_qstat file like this and using qstat without parameters
qstat
has the same effect as if qstat was executed with:
qstat -s rs -u <current_user>
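
As a minimal sketch of setting this up, the shell function below writes such a defaults file. It assumes the user private file lives at $HOME/.sge_qstat (check sge_qstat(5) for your version); the function name write_qstat_defaults is made up for illustration. It writes the literal user name rather than $USER, because this page does not establish whether qstat expands variables in the file.

```shell
# Write a qstat defaults file (hypothetical helper, not part of SGE).
# $1 = target file (assumed default: $HOME/.sge_qstat)
# $2 = user name to filter on (default: the current user)
write_qstat_defaults() {
    target="${1:-$HOME/.sge_qstat}"
    user="${2:-$(id -un)}"
    {
        printf '%s\n' '# Just show me my own running and suspended jobs'
        # note the leading blank before the first "-"
        printf ' -s rs -u %s\n' "$user"
    } > "$target"
}
```

Trying it on a scratch file first, for example write_qstat_defaults /tmp/sge_qstat.test, lets you inspect the result before touching the real file.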

Recommended Tips


Commonly Used commands

qacct -j 9999  -- get information about finished job 9999

qconf -mhgrp @allhosts                   edit hostgroup "@allhosts"
qstat -f [-q \*@node23]                  full display info [for node23 only]
qconf -sq all.q                          show "all.q" queue info
qconf -mq all.q                          modify "all.q" queue: update hostlist, #slots
qconf -aq all.q                          create queue named "all.q"
qconf -mc                                modify the complex (resource attribute) configuration

qconf -rattr queue slots 0 all.q@node23  #slots -> 0 (== pbsnodes -o)

qstat -s r -q all.q@node23               show all running jobs on node23

qhost -h node23,node24                   show host info for multiple nodes
qhost -q -h node23,node24                ibid, plus queue info

qmod -e all.q@node23                     enable node23 in queue all.q (-d == disable)

qsub -j y -o `pwd` -q all.q test.sh      submit test.sh job on queue all.q

qping -info node23 6445 execd 1          check status of execd on node23

qstat                                    current user jobs
qstat -u "*"                             all user jobs
qstat -g c                               show available nodes and load
qstat -f                                 detailed list of machines and job state 
qstat -explain c -j job-id               specific job status
qstat -f -u "*"                          full listing including jobs of all users

qdel job-id                              delete job
qsub -l h_vmem=### job.sh                mem limit, see queue_conf(5) RESOURCE LIMITS



qsub -w v job.00                         Troubleshoot problems with queue/scheduling

Adding and removing administrative privileges from a host

See also the Softpanorama qconf page.

qconf -ah <hostname> # gives host administrative privileges

qconf -dh <hostname> # removes administrative privileges from host

Adding an execution host

See also the Softpanorama qconf page.

Make the new host an execution host
```
qconf -ae    # opens an editor; set "hostname" to the new host
```
As root on this new host, run the following script from $SGE_ROOT
```
install_execd
```

Removing an execution host

See also the Softpanorama qconf page.

First, delete the queues associated with this host
```
qconf -dq <queuenames...>
```
Delete the host
```
qconf -de <hostname>
```
Finally, delete the configuration for the host
```
qconf -dconf <hostname>
```
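
The three removal steps above can be sketched as a dry-run helper that only prints the commands, so they can be reviewed before anything is changed. The function name remove_exec_host_plan and the queue name in the usage example are hypothetical; the qconf switches are the ones listed above.

```shell
# Print (do not run) the commands for removing an execution host.
# Usage: remove_exec_host_plan <hostname> [queuename...]
remove_exec_host_plan() {
    plan_host="$1"
    shift
    # 1. delete the queues associated with the host, one command per queue
    for plan_queue in "$@"; do
        echo "qconf -dq $plan_queue"
    done
    # 2. delete the execution host entry
    echo "qconf -de $plan_host"
    # 3. delete the host-specific configuration
    echo "qconf -dconf $plan_host"
}
```

For example, remove_exec_host_plan blade14 test.q prints one qconf -dq line followed by the -de and -dconf lines; pipe the output to sh only after checking it.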

Adding and removing submit hosts

qconf -as <hostname> # host is now a submit host

qconf -ds <hostname> # jobs may not be submitted from host

Displaying current administrative/submit/execution hosts

qconf -sh # show current administrative hosts

qconf -ss # show current submit hosts

qconf -sel # show current execution host list

Administering queues

qconf -aq <queuename> # adding a queue

qconf -dq <queuename> # delete a queue

qconf -mq <queuename> # modify a queue

qconf -Aq <filename> # adding a queue from file

qconf -mattr queue ... # change single attributes of more than one queue

qalter -w v <jobid>

This command lists the reasons why a job is not dispatchable in principle. For this purpose a dry scheduling run is performed. What is special about this dry run is that all consumable resources (including slots) are considered fully available for the job. Similarly, all load values are ignored because they vary.

Job or Queue goes in error state "E"

Job or queue errors are indicated by an uppercase "E" in the qstat output. A job enters the error state when Grid Engine tried to execute a job in a queue, but it failed for a reason that is specific to the job. A queue enters the error state when Grid Engine tried to execute a job in a queue, but it failed for a reason that is specific to the queue.

Grid Engine offers a set of possibilities for users and administrators to get diagnosis information in case of job execution errors. Since both the queue and the job error state result from a failed job execution, the diagnosis possibilities are applicable to both types of error states:

Since Grid Engine 6.0, a one-line error reason for jobs in error state is available through

qstat -j <jobid> | grep error

As of 6.0 this is the recommended first source of diagnosis information for the end user.

For queues in error state a one-line error reason is available through

qstat -explain E

As of 6.0 this is the recommended first source of diagnosis information for administrators in case of queue errors.

user abort mail: If jobs are submitted with the submit option "-m a", an abort mail is sent to the address specified with the "-M user[@host]" option. The abort mail contains diagnosis information about job errors and is the recommended source of information for users.

qacct accounting: If no abort mail is available, the user can run

qacct -j <jobid>

to get information about the job error from Grid Engine job accounting.

administrator abort mail: An administrator can request administrator mails about job execution problems by specifying an appropriate email address (see administrator_mail in sge_conf(5)). Administrator mails contain more detailed diagnosis information than user abort mails and are the recommended source in case of frequent job execution errors.

messages files: If no administrator mail is available, the Qmaster messages file should be investigated first. Log entries related to a certain job can be found by searching for the job ID. In the 'default' installation the Qmaster messages file is located at $SGE_ROOT/default/spool/qmaster/messages. Additional information can sometimes be found in the messages file of the Execd where the job was started. Use qacct -j <jobid> to find the host where the job was started and search $SGE_ROOT/default/spool/<host>/messages for the job ID.
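
Searching a messages file for a job ID can be sketched as below. The helper name job_messages is hypothetical; the default path is the 'default' installation location quoted above. grep -w is used so that searching for 9999 does not also match 99991.

```shell
# Print the lines of a messages file that mention a given job ID.
# $1 = job ID
# $2 = messages file (defaults to the Qmaster file of a 'default' installation)
job_messages() {
    jobid="$1"
    file="${2:-$SGE_ROOT/default/spool/qmaster/messages}"
    # -w: match the job ID only as a whole word
    grep -w -- "$jobid" "$file"
}
```

For an Execd log, pass the file explicitly, for example job_messages 9999 "$SGE_ROOT/default/spool/node23/messages".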

Suspend and resubmit stalled jobs

Reference: http://gis.fem-environment.eu/grid-engine-howto/

# as user:
qstat | grep neteler | tr -s ' ' ' '  | cut -d' ' -f2 > /tmp/to_suspend.sge
cat /tmp/to_suspend.sge

# as root (?):
su -
for i in `cat /tmp/to_suspend.sge` ; do qmod -sj $i ; done
qstat

# remove crashed blade from list of execution hosts:

qconf -de blade14

# delete host from the host group list (opens an editor):

qconf -mhgrp "@allhosts"

# verify the new list:

qconf -shgrp "@allhosts"

# verify queue stats:
qstat -f

# resubmit jobs to other nodes (as job user!!)

for i in `cat /tmp/to_suspend.sge` ; do qresub $i ; done
qstat
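
The grep/tr/cut pipeline used above to collect job IDs matches the user name anywhere on the line, including inside job names. A slightly stricter sketch with awk is shown below; it assumes the default qstat layout (two header lines, job ID in column 1, user name in column 4), and the function name jobs_of_user is made up for illustration.

```shell
# Read qstat output on stdin and print the job IDs owned by the given user.
# Assumes two header lines, job-ID in field 1 and user in field 4.
jobs_of_user() {
    awk -v u="$1" 'NR > 2 && $4 == u { print $1 }'
}
```

Usage would be along the lines of qstat | jobs_of_user neteler > /tmp/to_suspend.sge.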

This command sends a signal to a running job:

qmod -sj | -usj | -cj (suspend | unsuspend | clear error)

qmod -sj 3312136

qmod -usj 3312136
root - unsuspended job 3312136

Parallel Environment Configuration

qconf -sp pename 	Show the configuration for the specified parallel environment.
qconf -spl 	Show a list of all currently configured parallel environments.
qconf -ap pename 	Add a new parallel environment.
qconf -Ap filename 	Add a parallel environment from file filename.
qconf -mp pename 	Modify the specified parallel environment using an editor.
qconf -Mp filename 	Modify a parallel environment from file filename.
qconf -dp pename 	Delete the specified parallel environment.

List of currently defined queues

qconf -sql

How do I Control my jobs ?

Based on the status of the job displayed, you can control the job by the following actions:

Modify a job: As a user, you have certain rights that apply exclusively to your jobs. The commands used are qalter (change attributes of a submitted job) and qmod (suspend, resume, clear error state). Check the man pages for the options that you are allowed to use.
Suspend (or resume) a job: This sends a stop (or continue) signal to the job, as the UNIX kill command would, and applies only to running jobs. In practice you type qmod -sj job_id (or qmod -usj job_id to resume), where job_id is given by qstat or qsub.
Delete a job: You can delete a job that is running or waiting in the queue by using the qdel command like this: qdel job_id, where job_id is given by qstat or qsub. If a running job cannot be removed this way (for example because the execution daemon does not respond), the -f (force) option of qdel removes it from the qmaster's job list; depending on the cluster configuration this may require manager privileges.

Job Priorities

The Grid Engine software also sometimes lets users set priorities among their own jobs. A user who submits several jobs can specify, for example, that job 3 is the most important and that jobs 1 and 2 are equally important but less important than job 3.

Use the qsub -p option. The priority is an integer in the range -1023 (lowest) to 1024 (highest); the default is 0, and ordinary users can only lower the priority of their jobs. This priority tells the scheduler how to choose among a single user's jobs when several of that user's jobs are in the system simultaneously. The relative importance assigned to a particular job depends on the maximum and minimum priorities that are given to any of that user's jobs, and on the priority value of the specific job.

Clean up!

 /usr/local/SGE/bin/lx24-amd64/qmod -cq lam mpich2 long short nolimit
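
Before clearing error states with qmod -cq, it can help to see which queue instances are actually flagged. The sketch below filters qstat -f output; it assumes the queue lines have six fields with the states column last, which may differ between Grid Engine versions, and the function name queues_in_error is hypothetical.

```shell
# Read "qstat -f" output on stdin and print the queue instances whose
# states column (assumed to be the 6th field) contains E.
queues_in_error() {
    # queue instance lines look like name@host and have exactly 6 fields
    # when a state is set; job lines and separators are skipped
    awk 'NF == 6 && $1 ~ /@/ && $6 ~ /E/ { print $1 }'
}
```

Usage: qstat -f | queues_in_error, then pass the reported queues to qmod -cq.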
