
Some Condor hints (to be extended)

Sample submit file:

universe = vanilla
executable = /home/steffeng/condor/test.sh
initialdir = /home/steffeng/condor
# $(Process) counts from 0 to 399 (one value per queued job) and arrives as $1 in test.sh
arguments  = $(Process)
#
log        = $(Process).log
output     = $(Process).out
notification = Never
#
request_cpus = 1
queue 400
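
Assuming this is saved as test.sub (the file name is just an example), the cluster gets submitted and monitored with:

condor_submit test.sub
condor_q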

Universes

Dynamic slots

Using (and addressing remotely) temporary storage

While your job may behave nicely if it runs on your personal machine, multiplying it by 100 may result in problems.

Please keep in mind that your home directory resides on a fileserver and is accessed via NFS - which is basically synchronous, adding considerable delays. Also, several dozen write accesses from different machines will keep the disk heads busy, as metadata and data have to be written to various locations on the disk. Not to mention that appending a few bytes to the end of a file requires a full block read/update/writeback cycle. Even a RAID controller has limited opportunities to balance this kind of load.

The better approach is to write to a local disk (where only a handful of jobs compete for the filesystem), make implicit use of the OS' file buffering mechanisms, and at the end of your job send all the data back to the central storage in one big chunk.

The local scratch space has been named /local.

Although this request sounds a bit like “belt and braces”: please use /local/tmp/$user for your temporary data!

Try to avoid unnecessary copies of files when a symbolic link will do just as well.

Scratch space can be addressed from any other node via /hypatia/${nodename}, which is mapped to ${nodename}:/local by the automounter.
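
For example (node042 being a made-up node name), a file a job wrote to /local/tmp/$user/result.dat on node042 can be inspected from any other machine as:

ls -l /hypatia/node042/tmp/$user/result.dat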

Wrapper scripts

The general structure of your wrapper script would be: create a working directory on the local disk, below /local/tmp/$user (possibly with a layout similar to the one you use below your home), or use $$ (the unique process id) to build something that doesn't get harmed by jobs running in parallel on the same local machine; install a trap for the signals you'd like to catch (e.g. 0 1 2 … 15; "9" can't be trapped :)) which removes that directory again; run your job there; and at the end send the results back to central storage in one big chunk.
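
A minimal sketch (the directory layout, the results destination and the cleanup details are assumptions, adapt them to your job):

#!/bin/sh
# unique scratch directory: $$ (the shell's pid) keeps parallel jobs on one node apart
workdir=/local/tmp/$USER/job.$$
mkdir -p "$workdir" || exit 1
# clean up on exit and on the catchable signals (0=EXIT, 1=HUP, 2=INT, 15=TERM; 9 can't be trapped)
trap 'rm -rf "$workdir"' 0 1 2 15
cd "$workdir" || exit 1
# ... the actual work goes here, writing all its output into $workdir ...
# finally send the results back to central storage in one big chunk
cp -r "$workdir" "$HOME/condor/results.$$"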

In the worst case, some stuff remains on the /local disk - we'll discuss cleanups later.

DAGs

“The safest way of shutting down a running DAG is using a halt file. If your DAG is foo.dag, create the file foo.dag.halt. This will allow the DAG jobs to drain from the system, and create a rescue DAG at the end. The down side of this is that if you have long-running node jobs the draining can take a long time.

You can always condor_rm any running DAGs, but you'll lose whatever work the already-running node jobs have completed.”
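
For example, for a DAG submitted from foo.dag (the file name being just a placeholder):

touch foo.dag.halt     # drain: running node jobs finish, nothing new is started
rm foo.dag.halt        # remove the halt file again to let the DAG continue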

Working your way through a text file with many parameter lines, one by one

# $1 is the $(Process) number from the submit file; sed counts lines from 1
line=$(( $1 + 1 ))
# fetch the corresponding parameter line from the list file
commandline=`sed -ne "${line}p" $listfile`
# hand over to the real executable with these parameters as its arguments
exec $executable $commandline

(You get the idea? You might even put the executable part into the commandline itself.)
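
In the submit file, such a wrapper would simply receive $(Process) as its only argument, e.g. (the wrapper name and the number of queued jobs are placeholders; $executable and $listfile are set inside the wrapper itself):

executable = /home/steffeng/condor/runline.sh
arguments  = $(Process)
queue 400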

Removing jobs from the queue

You may have accidentally submitted a job cluster and want to get rid of it. It's generally a bad idea to kill running jobs (think of the cleanup part!), so it's better done in steps.

While there is a -global option, you'd better stick to the submit machine (the scheduler) you used to submit the jobs.
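
A possible sequence, with 1234 standing in for your own cluster id:

condor_q
# first get rid of the jobs in cluster 1234 that aren't running (yet)
condor_rm -constraint 'ClusterId == 1234 && JobStatus =!= 2'
# then, once the running jobs have finished (or you decide to sacrifice them):
condor_rm 1234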

Some additional info, for the daring:

If you don't care too much, the command to kill everything that's not running (job status 2) would be condor_rm -constraint "JobStatus =!= 2"

Since the signal will be sent to the executable named in the submit file, you might think about establishing a "trap" to run some cleanups - this won't work if the signal sent is a SIGKILL (9), which cannot be trapped. (There is contradictory information on the 'net as to whether SIGKILL or SIGTERM will be used. You'll have to find out.) If you don't like that, you may add a SIGTERM (15) handler and use the condor_vacate_job command, which will send a SIGTERM to the job before putting it back to Idle.
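
A hedged sketch of such a handler inside the wrapper script (the scratch path and what exactly gets rescued are assumptions):

# rescue partial results, clean up the scratch space, then give up
cleanup() {
    cp -r "$workdir" "$HOME/condor/partial.$$" 2>/dev/null
    rm -rf "$workdir"
    exit 1
}
trap cleanup 15    # SIGTERM, as sent by condor_vacate_job

With this in place, condor_vacate_job 1234.5 (an example job id) evicts the job gracefully instead of killing it outright.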

Wrong number of jobs?

If you submitted too few jobs and want to add more, but are bound to keep the $(Process) number as an index: While you cannot extend an existing job cluster, you can create one that consists of two parts, one held (possibly forever) and a second that's going to run.

Instead of your single queue $number line, add:

hold = True
queue ${number_of_jobs_to_be_skipped}
#
hold = False
queue ${number_of_additional_jobs}

After submitting, the first part will be on hold (and can be removed as described above) while the second part gets run with the right $(Process) arguments! Of course, this mechanism can also be used if you want to run a few processes as "test balloons" with the option of adding more later - you may do the splitting the other way round (set hold to True for the second part). And, of course, you can also submit a whole job cluster into the hold state and then start to release parts of it (if you come up with a cron-based solution, please get it included here!).

Update: a method that works without holding (presented on the condor-users mailing list):

noop_job = True
queue ${number_of_jobs_to_be_skipped}
#
noop_job = False
queue ${number_of_jobs_to_be_run}
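
With noop_job = True the skipped jobs are never actually run - they are marked as completed right away and leave the queue - but they still consume their $(Process) numbers, so the "real" jobs start at the correct index.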

To be extended

Anything you'd like to share which might be important for others?

(last modified by Steffen Grunewald on 16 Oct 2018)