Some Condor hints (to be extended)

Sample submit file:

universe = vanilla
executable = /home/steffeng/condor/test.sh
initialdir = /home/steffeng/condor
arguments  = $(Process)
#
log        = $(Process).log
output     = $(Process).out
notification = Never
#
request_cpus = 1
queue 400
  • The executable must be executable (chmod +x). In general, it's a good idea to use a wrapper script around your executable proper, to create a working directory (see “Using temporary storage” below), copy/symlink input and output data, and clean up when your work is done (this is one of the most famous use cases of the trap statement).
  • It is recommended to define the initialdir explicitly, with a full path.
  • If you queue n jobs, they will get Process numbers from 0 to (n-1).
  • You are expected to announce not only your CPU requirements (request_cpus=n), but also those for memory (request_memory) and GPUs (request_gpus); see the sketch right after this list.
  • As Condor is a High-Throughput Computing scheduler, a larger number of shorter jobs is usually preferable, as long as the runtimes stay above a few minutes (Condor tends to get confused by too-short runs).
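
As a rough sketch (the memory and GPU values below are purely illustrative and must be adapted to your job), the corresponding lines in the submit file could read:

# memory is interpreted as MiB unless a unit is given
request_cpus   = 1
request_memory = 2048
request_gpus   = 1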

Universes

  • The vanilla universe is what you want to use most of the time.
  • It is not planned to use the standard universe as that would imply “stupid checkpointing”.
    • With memory images several GB in size and a few hundred nodes, checkpoint servers are easily overloaded - which makes little sense if your parameter set is rather small and allows for “smart checkpointing” instead. Hypatia does not implement (standard universe-type) checkpointing.
  • We have a parallel universe as well.
    • The DedicatedScheduler runs on hypatia2, which means parallel jobs have to be submitted from there (only).
    • Use machine_count instead of request_cpus.
    • The queue count (again) is the number of separate jobs, not threads!
    • See the OpenMPI article for details.

Dynamic slots

  • All machines are configured with a single dynamic (“parent”, “partitionable”) slot, containing all resources of the node.
  • Requesting a number of CPU cores, a certain amount of memory, etc. will “partition” (split some resources off) that slot, resulting in a sub-slot running the job, and a (free) slot keeping track of the remaining resources.
  • This process is repeated until the parent slot runs out of resources.
  • This has a drawback: condor_status -t will always show unclaimed slots, even if these are depleted (no CPU available).
  • To get a better picture of the available resources, check the Hypatia Status page.

Using (and addressing remotely) temporary storage

While your job may behave nicely if it runs on your personal machine, multiplying it by 100 may result in problems.

Please keep in mind that your home directory resides on a fileserver and is accessed via NFS - which is basically synchronous, adding considerable delays. Also, several dozen write accesses from different machines will keep the disk heads busy, as metadata and data have to be written to various locations on the disk. Not to forget that appending a few bytes to the end of a file requires a full block read/update/writeback cycle. Even a RAID controller has limited opportunities to balance this kind of load.

The better approach is to write to a local disk (where only a handful of jobs compete for the filesystem), make implicit use of the OS' file buffering mechanisms, and at the end of your job send all the data back to the central storage in one big chunk.

The local scratch space has been named /local.

Although this request sounds a bit like “belt and braces”: please use /local/tmp/$user for your temporary data!

Try to avoid unnecessary copies of files when a symbolic link will do.

Scratch space can be addressed from each other node via /hypatia/${nodename} which is mapped to ${nodename}:/local by the automounter.
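
For example (the node name node023 and the directory myrun are just placeholders), data a job wrote below /local on that node can be inspected from any other machine in the cluster:

# on the execute node itself
ls /local/tmp/$USER/myrun
# from any other node, via the automounter
ls /hypatia/node023/tmp/$USER/myrun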

Wrapper scripts

The general structure of your wrapper script would be as follows (a complete sketch is given after the list):

  • invent a unique directory name $worktree (perhaps derived from the command line, and possibly similar to the one you use below your home), or use $$ (the unique process id) to build something that doesn't get harmed by jobs running in parallel on the same local machine

  • rm -rf /local/tmp/$user/$worktree (to make sure you don't get stuff from a previous run you don't want)
  • mkdir -p /local/tmp/$user/$worktree (-p will create the whole directory hierarchy if it doesn't exist yet)
  • set up a “trap” to remove the working tree if the script gets murdered, e.g.: trap "/bin/rm -rf /local/tmp/$user/$worktree; exit 0" 0 1 2 6 9 15

where 0 1 2 … 15 are the signals you'd like to catch; “9” can't be trapped :)

  • cd …/$worktree
  • run your code
  • rsync -a …/$worktree /home/…/target/ (this will create a single-level subdirectory below target/, not the whole tree!)
  • clean up (or get murdered… it's possible to catch exit code 0 in the trap, as well)

In the worst case, some stuff remains on the /local disk - we'll discuss cleanups later.
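
Putting these steps together, a minimal wrapper sketch could look like the one below; my_program, the results directory and the argument handling are placeholders you will have to adapt to your own setup.

#!/bin/bash
# minimal wrapper sketch - adapt paths, program name and arguments
user=$(whoami)
worktree="run_${1}_$$"                  # job argument plus process id keeps it unique
workdir=/local/tmp/$user/$worktree
target=/home/$user/condor/results       # placeholder for your result store

rm -rf "$workdir"                       # no leftovers from a previous run, please
mkdir -p "$workdir"

# remove the working tree when the script exits or gets murdered (SIGKILL cannot be trapped)
trap "/bin/rm -rf $workdir; exit 0" 0 1 2 6 15

cd "$workdir" || exit 1

# run the actual code; copy or symlink input data into $workdir before this step if needed
/home/$user/condor/my_program "$@" > output.dat

# send everything home in one big chunk (this creates $target/$worktree/)
rsync -a "$workdir" "$target/"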

DAGs

  • Citing Kent Wenger (2014-06-23, condor-users mailing list):

“The safest way of shutting down a running DAG is using a halt file. If your DAG is foo.dag, create the file foo.dag.halt. This will allow the DAG jobs to drain from the system, and create a rescue DAG at the end. The down side of this is that if you have long-running node jobs the draining can take a long time.

You can always condor_rm any running DAGs, but you'll lose whatever work the already-running node jobs have completed.”
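
For a DAG submitted as foo.dag, creating the halt file is a single command:

touch foo.dag.halt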

Working your way through a text file with many parameter lines, one by one

  • Remember that the number of the job within a job cluster, $(Process), can be passed to your executable script by defining arguments = "… $(Process) …" .
  • Your jobs will run with the numbers 0 through (n-1), where n is the count given in the queue line.
  • Note that it's possible to create “homogeneous” submit files even if your jobs will perform inhomogeneous tasks - if you can hide these inhomogeneities in the individual command lines.
  • To select a single line from a file, I suggest something like the following (a fuller sketch follows after this list):
# $1 is the 0-based Process number; sed counts lines from 1
line=$(( $1 + 1 ))
commandline=$(sed -ne "${line}p" "$listfile")
exec $executable $commandline

(You get the idea? You might even put the executable part into the commandline itself.)

  • Don't forget the queue line at the end of your submit file.
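
To illustrate the last two points (params.txt and the commands in it are purely hypothetical): if every line of params.txt holds a complete command line, the submit file stays homogeneous while the jobs do different things. params.txt might then contain lines like

/home/steffeng/condor/analyse.sh --mass 1.0
/home/steffeng/condor/analyse.sh --mass 2.5
/home/steffeng/condor/plot.sh results1.dat

and the wrapper simply executes the selected line:

line=$(( $1 + 1 ))
commandline=$(sed -ne "${line}p" /home/steffeng/condor/params.txt)
exec $commandline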

Removing jobs from the queue

Perhaps you have accidentally submitted a job cluster and want to get rid of it? It's generally a bad idea to kill running jobs (think of the cleanup part!), so this is better done in steps:

  • put idle jobs (job status 1) on hold: condor_q `whoami` -format '%d.' ClusterId -format '%d\n' ProcId -constraint "JobStatus==1" | xargs -r condor_hold
  • remove held (job status 5) jobs: condor_q `whoami` -format '%d.' ClusterId -format '%d\n' ProcId -constraint "JobStatus==5" | xargs -r condor_rm
  • check the result: condor_q `whoami` (Of course, you may replace `whoami` with your username.)

While there is a -global option, it is better to stick to the submit machine (scheduler) you used to submit the jobs.

Some additional info, for the daring:

If you don't care too much, the command to kill everything that's not running (job status 2) would be condor_rm -constraint "JobStatus =!= 2"

Since the signal will be sent to the executable named in the submit file, you might think about establishing a “trap” to run some cleanups - this won't work if the signal sent is a SIGKILL (9), which cannot be trapped. (There is contradictory information on the 'net as to whether SIGKILL or SIGTERM will be used; you'll have to find out.) If you don't like that, you may add a SIGTERM (15) handler and use the condor_vacate_job command, which will send a SIGTERM to the job before putting it back into the Idle state.
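
If you go that route, a handler along the lines of the wrapper sketch above could look like this (the cleanup commands are again placeholders):

# rescue partial results and clean up when Condor sends SIGTERM (15)
on_sigterm() {
    rsync -a "$workdir" "$target/"      # placeholder: save whatever is already there
    rm -rf "$workdir"
    exit 1
}
trap on_sigterm TERM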

Wrong number of jobs?

If you submitted too few jobs and want to add more, but are bound to keep the $(Process) number as an index: while you cannot extend an existing job cluster, you can create a new one that consists of two parts, one held (possibly forever) and a second that is actually going to run.

Instead of your single queue $number line, add:

hold = True
queue ${number_of_jobs_to_be_skipped}
#
hold = False
queue ${number_of_additional_jobs}

After submitting, the first part will be on hold (and can be removed as described above), while the second part gets run with the right $(Process) arguments! Of course, this mechanism can also be used if you want to run a few processes as “test balloons” with the option of adding more later - simply do the splitting the other way round (set hold to True for the second part). And, of course, you can also submit a whole job cluster into the hold state and then start to release parts of it (if you come up with a cron-based solution, please get it included here!). Update: a method that works without holding (presented on the condor-users mailing list):

noop_job = True
queue ${number_of_jobs_to_be_skipped}
#
noop_job = False
queue ${number_of_jobs_to_be_run}
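
For example (the numbers are purely illustrative): if 400 jobs have already run and you want 100 more whose $(Process) numbers continue at 400, the second submit file would contain

noop_job = True
queue 400
#
noop_job = False
queue 100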

To be extended

Anything you'd like to share which might be important for others?

(last modified by Steffen Grunewald on 16 Oct 2018)
