condor

How to help condor find the file it should execute in a job?

Submitted by 邮差的信 on 2021-02-05 11:41:31
Question: I am trying to run a job, but condor can't seem to find my file. I've made sure that the file is there (by running ls and cat on its absolute path), that it runs from a condor interactive session, and that it has the right permissions so that it executes. Despite all that I get this error: (automl-meta-learning) miranda9~/automl-meta-learning/automl-proj/experiments/meta_learning $ cat condor_job_log_69.out 000 (069.000.000) 10/21 11:06:06 Job submitted from host: <130.126.112.32:9618?addrs=130.126.112.32
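A minimal submit description along these lines usually resolves "file not found" problems, assuming a vanilla-universe job; the /home/miranda9 prefix, the script name run.sh, and the output file names are placeholders built from the prompt shown in the question:

    universe                = vanilla
    # Give the absolute path so the schedd can locate the executable at submit time
    executable              = /home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/run.sh
    # Let HTCondor copy the executable to the execute node and bring outputs back
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    output                  = condor_job_$(Cluster).out
    error                   = condor_job_$(Cluster).err
    log                     = condor_job_$(Cluster).log
    queue

Transferring the executable also sidesteps the case where the submit-side path does not exist on the execute nodes.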

Dask with HTCondor scheduler

Submitted by 落花浮王杯 on 2020-06-01 09:20:48
Question: Background: I have an image analysis pipeline with parallelised steps. The pipeline is in python and the parallelisation is controlled by dask.distributed. The minimum processing set-up has 1 scheduler + 3 workers with 15 processes each. In the first short step of the analysis I use 1 process/worker but all the RAM of the node; in all other analysis steps all nodes and processes are used. Issue: The admin will install HTCondor as a scheduler for the cluster. Thought: In order to have my
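One common way to bridge the two (only a sketch; the scheduler address, the dask-worker path, and the resource figures are assumptions, and dask must already be installed on the execute nodes) is to run the dask scheduler somewhere reachable and submit each dask worker as an ordinary vanilla-universe job:

    universe       = vanilla
    # dask-worker must exist at this path on the execute nodes (placeholder path)
    executable     = /usr/bin/dask-worker
    # Point the worker at the externally running dask scheduler (placeholder address);
    # worker process/thread counts can be passed as further dask-worker arguments
    arguments      = tcp://scheduler.example.org:8786
    request_cpus   = 15
    request_memory = 64GB
    output         = dask_worker_$(Cluster).$(Process).out
    error          = dask_worker_$(Cluster).$(Process).err
    log            = dask_worker_$(Cluster).log
    # One HTCondor job per worker node
    queue 3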

Condor job using DAG with some jobs needing to run on the same host

Submitted by こ雲淡風輕ζ on 2020-01-15 04:24:06
Question: I have a computation task which is split into several individual program executions, with dependencies. I'm using Condor 7 as task scheduler (with the Vanilla Universe, due to constraints on the programs beyond my reach, so no checkpointing is involved), so a DAG looks like a natural solution. However some of the programs need to run on the same host. I could not find a reference on how to do this in the Condor manuals. Example DAG file:
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
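There is no DAG-level directive for host affinity as far as I know, but one workaround (a sketch; node01.example.org is a placeholder for a machine in the pool) is to pin the submit files of the affected nodes to the same machine:

    # Added to each submit file (A.condor, B.condor, ...) whose job must run on the same host
    requirements = (Machine == "node01.example.org")

Another option is to collapse those DAG nodes into a single node whose executable is a wrapper script running the programs in sequence, so HTCondor only ever schedules one slot for them.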

Condor Timeout for idle jobs

Submitted by 随声附和 on 2019-12-22 09:23:40
Question: I'm running jobs on a condor cluster, but some get hung in an idle state and never seem to start, let alone finish. Short of manually doing condor_wait -wait n logfile and then condor_rm, is there a more graceful (and automatic, built-in) way of terminating a hung job? Conversely, since these jobs are in a DAGMan, is there a way to time out a job in a DAGMan so that the later jobs can run? Answer 1: Here are two ways to cause a job to be automatically removed after being idle for too long (24 hours
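The usual submit-file form of that approach looks roughly like this (a sketch; 86400 seconds is 24 hours, and JobStatus == 1 is the Idle state):

    # Automatically remove the job once it has been Idle for more than 24 hours
    periodic_remove = (JobStatus == 1) && ((time() - EnteredCurrentStatus) > 86400)

The same expression can be applied pool-wide by an administrator via SYSTEM_PERIODIC_REMOVE in the HTCondor configuration.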

How can I check the status of a specific job that was sent to HTCondor?

Submitted by 拈花ヽ惹草 on 2019-12-12 01:03:38
Question: Is there a way to check the status of a specific job (e.g. by cluster/process id), and how do I retrieve those ids when the job is submitted? Answer 1: For further reference, I solved this with Condor's ClassAd mechanism. I inserted a custom ClassAd attribute in my condor.submit file: +customAttribute = myID; Then I can check, for example, the JobStatus for this job with: condor_q -constraint 'customAttribute == myID' -f "%s" JobStatus Answer 2: This is possible without requiring a custom ClassAd, as per micco's
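The cluster id can also be captured directly at submit time and queried without any custom attribute (a sketch; my_job.sub and the id 123.0 are placeholders):

    # -terse prints only the ClusterId.ProcId range of the submitted job(s)
    condor_submit -terse my_job.sub
    # e.g. prints: 123.0 - 123.0

    # Query that specific job's status (1 = Idle, 2 = Running, 4 = Completed, 5 = Held)
    condor_q 123.0 -af JobStatus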

Limiting number of concurrent processes scheduled by condor

Submitted by 蓝咒 on 2019-12-11 10:29:18
Question: I'm using condor to do batches of ~100 processes for a few hours. After these processes are finished, I need to start the next batch of runs with results from the first batch, and this process is repeated tens of times. My condor pool is >100 cores, and I'd like to limit my condor cluster to only run 100 processes at a time, so that condor only starts working on the next process after one of the first processes is finished. Is this possible? Answer 1: This sounds like you're just running a job that
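One built-in way to cap concurrency (a sketch; "mybatch" is an arbitrary limit name, and the configuration line needs admin access on the central manager) is HTCondor's concurrency limits:

    # Pool configuration (central manager): at most 100 "mybatch" jobs running at once
    MYBATCH_LIMIT = 100

    # In the submit file of every job in the batch
    concurrency_limits = mybatch

If the batches are driven by DAGMan anyway, condor_submit_dag -maxjobs 100 caps the number of node jobs submitted at a time without any configuration change.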