  universe = vanilla
  executable = /home/steffeng/condor/test.sh
  initialdir = /home/steffeng/condor
  arguments = $(Process)
  # log = $(Process).log
  output = $(Process).out
  notification = Never
  # request_cpus = 1
  queue 400
Don't forget to make your script executable (chmod +x). In general, it's a good idea to use a wrapper script around your executable proper, to create a working directory (see “Using temporary storage” below), copy/symlink input and output data, and clean up when your work is done (this is one of the most famous use cases of the trap statement).
Note that condor_status -t will always show unclaimed slots, even if these are depleted (no CPU available).
While your job may behave nicely when it runs on your personal machine, multiplying it by 100 may result in problems.
Please keep in mind that your home directory resides on a fileserver and is accessed via NFS, which is basically synchronous and adds considerable delays. Also, several dozen write accesses from different machines will keep the disk heads busy, as metadata and data have to be written to various locations on the disk. Not to forget that appending a few bytes to the end of a file requires a full block read/update/writeback cycle. Even a RAID controller has only limited opportunities to balance this kind of load.
The better approach is to write to a local disk (where only a handful of jobs compete for the filesystem), make implicit use of the OS' file buffering mechanisms, and at the end of your job send all the data back to the central storage in one big chunk.
The local scratch space has been named /local.
Although this request sounds a bit like “belt and braces”: please use /local/tmp/$user for your temporary data!
Try to avoid unnecessary copies of files when a symbolic link will do.
Scratch space can be addressed from every other node via /hypatia/${nodename}, which is mapped to ${nodename}:/local by the automounter.
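For example (the node name below is just a placeholder), data a job has written to /local/tmp/$user on one node can be inspected from any other machine of the cluster:

  # hypothetical example: "node17" stands for the actual execute host's name
  ls -l /hypatia/node17/tmp/$USER/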
The general structure of your wrapper script would be:
- Choose a work tree name $worktree (perhaps derived from the command line, and possibly similar to the one you use below your home), or use $$ (the unique process id) to build something that doesn't get harmed by jobs running in parallel on the same local machine.
- rm -rf /local/tmp/$user/$worktree
  (to make sure you don't get stuff from a previous run you don't want)
- mkdir -p /local/tmp/$user/$worktree
  (-p will create the whole directory hierarchy if it doesn't exist yet)
- trap "/bin/rm -rf /local/tmp/$user/$worktree; exit 0" 0 1 2 6 9 15
  where 0 1 2 … 15 are the signals you'd like to catch; "9" can't be trapped :)
- cd …/$worktree
- run your actual payload
- rsync -a …/$worktree /home/…/target/
  (this will create a single-level subdirectory below target/, not the whole tree!)
A complete sketch along these lines is shown below.
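Put together, a minimal wrapper might look like this sketch (the payload, the "myrun" name, and the target directory are placeholders to be adapted; $1 is the $(Process) number passed via arguments):

  #!/bin/bash
  # sketch of a wrapper script; adapt names and payload to your own job
  user=$(whoami)
  worktree=myrun-$1                            # or use $$ for a unique name
  workdir=/local/tmp/$user/$worktree

  rm -rf "$workdir"                            # no leftovers from a previous run
  mkdir -p "$workdir"                          # -p creates the whole hierarchy
  trap "rm -rf $workdir; exit 0" 0 1 2 6 15    # clean up on exit and on catchable signals

  cd "$workdir" || exit 1
  # ... run your actual payload here, writing all output into $workdir ...

  rsync -a "$workdir" /home/$user/target/      # send results home in one big chunk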
In the worst case, some stuff remains on the /local disk - we'll discuss cleanups later.
“The safest way of shutting down a running DAG is using a halt file. If your DAG is foo.dag, create the file foo.dag.halt. This will allow the DAG jobs to drain from the system, and create a rescue DAG at the end. The down side of this is that if you have long-running node jobs the draining can take a long time.
You can always condor_rm any running DAGs, but you'll lose whatever work the already-running node jobs have completed.”
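In practice (foo.dag being just the example name from the quote above) this amounts to:

  # let a running DAG drain gracefully: create the halt file next to the DAG file
  touch foo.dag.halt
  # once the already-running node jobs have finished, DAGMan writes a rescue DAG;
  # removing the halt file before that point lets the DAG resume normally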
To give each job its own piece of work, reference the process number in the
  arguments = "… $(Process) …"
line of your submit file (before the queue line), and let your script pick the matching entry from a list file:
  line=$(( $1 + 1 ))
  commandline=`sed -ne "${line}p" $listfile`
  exec $executable $commandline
(You get the idea? You might even put the executable part into the commandline itself.)
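A self-contained version of that script might look like the following (the list file path and the executable are placeholder assumptions; each line of the list file holds the command-line arguments for one job):

  #!/bin/bash
  # hypothetical sketch: the job number arrives as $1, i.e. $(Process) from the submit file
  listfile=/home/$(whoami)/condor/joblist.txt    # one argument set per line
  executable=/home/$(whoami)/condor/mytool       # the program to run

  line=$(( $1 + 1 ))                             # $(Process) starts at 0, sed counts lines from 1
  commandline=$(sed -ne "${line}p" "$listfile")
  exec $executable $commandline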
Don't forget the queue line at the end of your submit lines.

You may have accidentally submitted a job cluster, and want to get rid of it? It's generally a bad idea to kill running jobs (think of the cleanup part!), so it's better done in steps:
First put all idle jobs on hold (JobStatus 1 means Idle):
  condor_q `whoami` -format '%d.' ClusterId -format '%d\n' ProcId -constraint 'JobStatus==1' | xargs -r condor_hold
then remove the held ones (JobStatus 5 means Held):
  condor_q `whoami` -format '%d.' ClusterId -format '%d\n' ProcId -constraint 'JobStatus==5' | xargs -r condor_rm
and finally check what's left:
  condor_q `whoami`
(Of course, you may replace `whoami` with your username.)
While there is a -global option, you'd better stick to the submit machine (scheduler) you used to submit the jobs.
Some additional info, for the daring:
If you don't care too much, the command to kill everything that's not running (i.e. not in JobStatus 2, Running) would be
  condor_rm -constraint 'JobStatus =!= 2'
Since the signal will be sent to the executable named in the submit file, you might think about establishing a trap to run some cleanups - this won't work if the signal sent is a SIGKILL (9), which cannot be trapped. (There is contradictory information on the 'net as to whether SIGKILL or SIGTERM will be used. You'll have to find out.)
If you don't like that, you may add a SIGTERM (15) handler, and use the condor_vacate_job command, which will send a SIGTERM to the job before putting it back to Idle.
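A sketch of such a handler, reusing the placeholder names from the wrapper script above:

  # inside the wrapper script: rescue partial results before the job gets vacated
  save_and_quit() {
      rsync -a "$workdir" /home/$user/target/   # copy back whatever exists so far
      rm -rf "$workdir"
      exit 1
  }
  trap save_and_quit TERM                       # SIGTERM (15) is what condor_vacate_job sends

The job would then be vacated with something like condor_vacate_job <cluster>.<proc>.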
If you submitted too few jobs and want to add more, but are bound to keep the $(Process) number as an index: while you cannot extend an existing job cluster, you can create one that consists of two parts, one held (possibly forever) and a second that's going to run.
Instead of your single queue $number line, add:
  hold = True
  queue ${number_of_jobs_to_be_skipped}
  #
  hold = False
  queue ${number_of_additional_jobs}
After submitting, the first part will be on hold (and can be removed as described above) while the second part gets run with the right $(Process) arguments! Of course, this mechanism can also be used if you want to run a few processes as “test balloons” with the option of later adding more - you may do the splitting the other way round (set hold to True for the second part). And, of course, you can also submit a whole job cluster into hold state, and then start to release parts of it (if you come up with a cron-based solution, please get it included here!).

Update: a method that works without holding (presented on the condor-users mailing list):
  noop_job = True
  queue ${number_of_jobs_to_be_skipped}
  #
  noop_job = False
  queue ${number_of_jobs_to_be_run}
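For instance, extending the 400-job cluster from the first example by another 100 jobs (the numbers are of course just an illustration) could be submitted as:

  universe = vanilla
  executable = /home/steffeng/condor/test.sh
  initialdir = /home/steffeng/condor
  arguments = $(Process)
  output = $(Process).out
  notification = Never

  # consume the $(Process) numbers 0..399 already used by the first submission
  noop_job = True
  queue 400

  # the additional jobs get $(Process) = 400..499
  noop_job = False
  queue 100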
Anything you'd like to share which might be important for others?
(last modified by Steffen Grunewald on 16 Oct 2018)