Python 3.6+ toolbox for submitting jobs to Slurm

Overview


Submit it!

What is submitit?

Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It basically wraps submission and provides access to results, logs and more. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Submitit allows you to switch seamlessly between executing on Slurm or locally.

An example is worth a thousand words: performing an addition

From inside an environment with submitit installed:

import submitit

def add(a, b):
    return a + b

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(job.job_id)  # ID of your job

output = job.result()  # waits for completion and returns output
assert output == 12  # 5 + 7 = 12...  your addition was computed in the cluster

The Job class also provides tools for reading the log files (job.stdout() and job.stderr()).

If what you want to run is a command, turn it into a Python function using submitit.helpers.CommandFunction, then submit it. By default stdout is silenced in CommandFunction, but it can be unsilenced with verbose=True.
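
For instance, a minimal sketch (the command, folder and partition below are placeholders mirroring the example above):

import submitit

# wrap a command line; calling the resulting function returns the command's stdout as a string
function = submitit.helpers.CommandFunction(["which", "python"], verbose=True)

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(function)
print(job.result())  # stdout produced by the command on the cluster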

Find more examples here!!!

Submitit is a Python 3.6+ toolbox for submitting jobs to Slurm. It aims at running Python functions from Python code.

Install

Quick install, in a virtualenv/conda environment where pip is installed (check which pip):

  • stable release:
    pip install submitit
    
  • stable release using conda:
    conda install -c conda-forge submitit
    
  • master branch:
    pip install git+https://github.com/facebookincubator/submitit@master#egg=submitit
    

You can try running the MNIST example to check that everything is working as expected (requires sklearn).

Documentation

See the following pages for more detailed information:

  • Examples: for a bunch of examples dealing with errors, concurrency, multi-tasking etc...
  • Structure and main objects: to get a better understanding of how submitit works, which files are created for each job, and the main objects you will interact with.
  • Checkpointing: to understand how you can configure your job to get checkpointed when preempted and/or timed-out.
  • Tips and caveats: for a bunch of information that can be handy when working with submitit.
  • Hyperparameter search with nevergrad: basic example of nevergrad usage and how it interfaces with submitit.

Goals

The aim of this Python3 package is to be able to launch jobs on Slurm painlessly from inside Python, using the same submission and job patterns as the standard library package concurrent.futures.

Here are a few benefits of using this lightweight package:

  • submit any function, even lambda and script-defined functions.
  • raises an error with stack trace if the job failed.
  • requeue preempted jobs (Slurm only)
  • swap between the submitit executor and one of the concurrent.futures executors in one line, so that it is easy to run your code either on Slurm, or locally with multithreading for instance (see the sketch after this list).
  • checkpoints stateful callables when preempted or timed-out and requeue from current state (advanced feature).
  • easy access to task local/global rank for multi-nodes/tasks jobs.
  • same code can work for different clusters thanks to a plugin system.
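
As a minimal sketch of the executor swap mentioned in the list above (make_executor is a hypothetical helper, the folder name is a placeholder, and running on Slurm would additionally need a partition/timeout via update_parameters):

import concurrent.futures
import submitit

def make_executor(kind):
    # hypothetical helper: same submit/result pattern, different backend
    if kind == "threads":
        return concurrent.futures.ThreadPoolExecutor(max_workers=4)
    # cluster="local" runs jobs in local processes, cluster="slurm" goes through sbatch
    return submitit.AutoExecutor(folder="log_test", cluster=kind)

executor = make_executor("threads")  # or "local" / "slurm"
task = executor.submit(sum, [5, 7])
print(task.result())  # 12, whichever executor computed it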

Submitit is used by FAIR researchers on the FAIR cluster. The defaults are chosen to make their life easier, and might not be ideal for every cluster.

Non-goals

  • a commandline tool for running slurm jobs. Here, everything happens inside Python. To this end, you can however use Hydra's submitit plugin (version >= 1.0.0).
  • a task queue, this only implements the ability to launch tasks, but does not schedule them in any way.
  • being used in Python2! This is a Python3.6+ only package :)

Comparison with dask.distributed

dask is a nice framework for distributed computing. dask.distributed provides the same concurrent.futures executor API as submitit:

from distributed import Client
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(processes=1, cores=2, memory="2GB")
cluster.scale(2)  # this may take a few seconds to launch
executor = Client(cluster)
executor.submit(...)

The key difference with submitit is that dask.distributed distributes the jobs to a pool of workers (see the cluster variable above) while submitit jobs are directly jobs on the cluster. In that sense submitit is a lower level interface than dask.distributed and you get more direct control over your jobs, including individual stdout and stderr, and possibly checkpointing in case of preemption and timeout. On the other hand, you should avoid submitting multiple small tasks with submitit, which would create many independent jobs and possibly overload the cluster, while you can do it without any problem through dask.distributed.

Contributors

By chronological order: Jérémy Rapin, Louis Martin, Lowik Chanussot, Lucas Hosseini, Fabio Petroni, Francisco Massa, Guillaume Wenzek, Thibaut Lavril, Vinayak Tantia, Andrea Vedaldi, Max Nickel, Quentin Duval (feel free to contribute and add your name ;) )

License

Submitit is released under the MIT License.

Comments
  • Import error


    This bug is baffling me. I'm sure this is user error because I normally have no issues with your code. I have not figured out what is different from my other submissions, but maybe you've seen this before?

    .../python3.8/site-packages/submitit/core/_submit.py", line 7, in <module>
        from .submission import submitit_main
    ImportError: attempted relative import with no known parent package
    
    bug 
    opened by jgbos 26
  • Fixing deadlock when command prints a lot to stderr


    Currently, only stdout is read on the fly. If the stderr pipe fills up, the subprocess will deadlock when trying to write to stderr. As the parent process only reads stdout and waits for the process to finish, this will never resolve. This change instead uses the select function to find which file descriptors can be read from.

    This also adds a unit test for this specific case.
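
    A minimal sketch of this approach (not the actual patch; "some_command" is a placeholder): read whichever pipe has data ready, so the child never blocks on a full stderr buffer.

    import os
    import select
    import subprocess

    # "some_command" is a placeholder for the wrapped command
    proc = subprocess.Popen(["some_command"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    buffers = {proc.stdout.fileno(): [], proc.stderr.fileno(): []}
    open_fds = set(buffers)
    while open_fds:
        # block until at least one of the pipes has data (or has reached EOF)
        readable, _, _ = select.select(list(open_fds), [], [])
        for fd in readable:
            chunk = os.read(fd, 4096)
            if chunk:
                buffers[fd].append(chunk)
            else:  # an empty read means this pipe is closed
                open_fds.remove(fd)
    proc.wait()
    stdout = b"".join(buffers[proc.stdout.fileno()])
    stderr = b"".join(buffers[proc.stderr.fileno()])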

    CLA Signed 
    opened by adefossez 9
  • Job is not ending - Bypassing signal SIGTERM


    Hello,

    I am sending a job to my Slurm cluster with submitit. The job runs as it is supposed to (you can see the Finished script log), but the slurm job itself does not finish. Instead, I get these Bypassing signal messages. Because I need this job to finish before moving on to other jobs, I am in a deadlock. I am really not sure what I should do and would appreciate the help.

    Here are logs from my neverending job :(

    03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,093) - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    

    To provide more information, this happens when I run job arrays:

    jobs = executor.map_array(fn, configs)
    job2cfg = {job: cfg for job, cfg in zip(jobs, configs)}
    list(tqdm(submitit.helpers.as_completed(jobs), total=len(jobs)))
    

    Some of them finish successfully, and others get stuck (until I had to clear the queue and scancel them):

    ❯ tail /home/olab/kirstain/masking/log_test/71604_*/71604_*_0_log.err
    ==> /home/olab/kirstain/masking/log_test/71604_0/71604_0_0_log.err <==
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Flattening
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 292793 blocks to write
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Batching
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 3 shards to write
    100%|██████████| 3/3 [00:00<00:00, 3093.14it/s]
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Writing 3 shards to /home/olab/kirstain/masking/data/5/1024/shards/ArXiv
    100%|██████████| 3/3 [00:08<00:00,  2.78s/it]
    03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished script
    03/31/2022 20:53:18 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_10/71604_10_0_log.err <==
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,179) - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_11/71604_11_0_log.err <==
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,093) - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:00:50 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_12/71604_12_0_log.err <==
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_13/71604_13_0_log.err <==
    100%|██████████| 65/65 [00:00<00:00, 10539.67it/s]
    03/31/2022 21:06:43 - INFO - masking.scripts.shard_corpus - Writing 65 shards to /home/olab/kirstain/masking/data/5/1024/shards/Wikipedia_en
    100%|██████████| 65/65 [01:37<00:00,  1.50s/it]
    03/31/2022 21:08:22 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 21:08:23 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 21:08:23,557) - Bypassing signal SIGTERM
    03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:08:23,561) - Bypassing signal SIGTERM
    03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:08:31 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_1/71604_1_0_log.err <==
    03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 20:59:04,562) - Bypassing signal SIGTERM
    03/31/2022 20:59:04 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,084) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71606.0 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***
    submitit WARNING (2022-03-31 21:25:40,085) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    slurmstepd: error: *** JOB 71606 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_2/71604_2_0_log.err <==
    03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Batching
    03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - We have 11 shards to write
    100%|██████████| 11/11 [00:00<00:00, 3551.76it/s]
    03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Writing 11 shards to /home/olab/kirstain/masking/data/5/1024/shards/Books3
    100%|██████████| 11/11 [00:33<00:00,  3.09s/it]
    03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 21:20:33,823) - Bypassing signal SIGTERM
    03/31/2022 21:20:33 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:20:34 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_3/71604_3_0_log.err <==
    submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_4/71604_4_0_log.err <==
    03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,096) - Bypassing signal SIGCONT
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 71609.0 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***
    submitit WARNING (2022-03-31 21:25:40,106) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    slurmstepd: error: *** JOB 71609 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_5/71604_5_0_log.err <==
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,092) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71610.0 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    submitit WARNING (2022-03-31 21:25:40,093) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    slurmstepd: error: *** JOB 71610 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_6/71604_6_0_log.err <==
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,089) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71611.0 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    submitit WARNING (2022-03-31 21:25:40,091) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    slurmstepd: error: *** JOB 71611 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_7/71604_7_0_log.err <==
    100%|██████████| 13/13 [00:00<00:00, 5749.26it/s]
    03/31/2022 20:53:43 - INFO - masking.scripts.shard_corpus - Writing 13 shards to /home/olab/kirstain/masking/data/5/1024/shards/OpenWebText2
    100%|██████████| 13/13 [00:27<00:00,  2.08s/it]
    03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 20:54:11,320) - Bypassing signal SIGTERM
    03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:54:11,321) - Bypassing signal SIGTERM
    03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:54:14 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_8/71604_8_0_log.err <==
    submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
    03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,086) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71613.0 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***
    submitit WARNING (2022-03-31 21:25:40,088) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    slurmstepd: error: *** JOB 71613 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_9/71604_9_0_log.err <==
    100%|██████████| 21/21 [00:00<00:00, 4838.52it/s]
    03/31/2022 20:56:39 - INFO - masking.scripts.shard_corpus - Writing 21 shards to /home/olab/kirstain/masking/data/5/1024/shards/Pile-CC
    100%|██████████| 21/21 [00:42<00:00,  2.01s/it]
    03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 20:57:21,672) - Bypassing signal SIGTERM
    03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:57:21,674) - Bypassing signal SIGTERM
    03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:57:24 - INFO - submitit - Job completed successfully
    
    opened by yuvalkirstain 7
  • How to comment a slurm variable?


    Hi All,

    I observe the following error, which is due to the added "#SBATCH --gpus-per-node=4" line in the generated slurm script.

    Error : submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification

    Can developers/users of submitit guide me on where to comment/delete the above line in the slurm script before it is submitted by the sbatch command?

    Thanks, Amit Ruhela

    opened by aruhela 7
  • Asyncio methods for job


    asyncio has a lot of prebuilt tools for dealing with asynchronous execution (like submitit jobs). gather allows dealing with parts of the jobs failing, as_completed is available out of the box, timeouts can be added, and we can transparently combine jobs with other async stuff easily.

    async also sounds cooler than blocking :D

    I added tests for the single task job cases as I didn't see other tests for the multi task code. But it might be worth adding these too.
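
    For illustration, a minimal sketch of combining submitit jobs with asyncio without any dedicated API (jobs is assumed to be a list of already-submitted submitit jobs; the blocking job.result() calls are pushed to threads so they can be gathered and given a timeout):

    import asyncio

    async def wait_all(jobs, timeout=None):
        loop = asyncio.get_running_loop()
        # run each blocking job.result() call in the default thread pool
        futures = [loop.run_in_executor(None, job.result) for job in jobs]
        return await asyncio.wait_for(asyncio.gather(*futures), timeout=timeout)

    # results = asyncio.run(wait_all(jobs, timeout=3600))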

    CLA Signed 
    opened by Mortimerp9 7
  • Temporary saved file already exists


    Hi,

    Thank you for this amazing tool! I just started using it recently. I'm encountering some weird error and I was hoping you could help me fix it. Here is the error log:

    submitit WARNING (2021-03-28 01:13:17,420) - Caught signal 15 on learnfair0463: this job is preempted.
    slurmstepd: error: *** STEP 38544509.0 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
    slurmstepd: error: *** JOB 38544509 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
    submitit WARNING (2021-03-28 01:13:17,482) - Bypassing signal 18
    submitit WARNING (2021-03-28 01:13:17,483) - Caught signal 15 on learnfair0463: this job is preempted.
    38544484_16: Job is pending execution
    submitit ERROR (2021-03-28 01:13:17,535) - Could not dump error:
    Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
    
    because of A temporary saved file already exists.
    submitit ERROR (2021-03-28 01:13:17,535) - Submitted job triggered an exception
    Traceback (most recent call last):
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
        submitit_main()
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
        process_job(args.folder)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
        raise error
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 55, in process_job
        utils.cloudpickle_dump(("success", result), tmppath)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 238, in cloudpickle_dump
        cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/job_environment.py", line 209, in checkpoint_and_try_requeue
        self.env._requeue(countdown)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/slurm/slurm.py", line 193, in _requeue
        subprocess.check_call(["scontrol", "requeue", jid])
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
    /bin/bash: /public/apps/anaconda3/2020.11/lib/libtinfo.so.6: no version information available (required by /bin/bash)
    submitit ERROR (2021-03-28 01:35:36,155) - Could not dump error:
    A temporary saved file already exists.
    
    because of A temporary saved file already exists.
    submitit ERROR (2021-03-28 01:35:36,156) - Submitted job triggered an exception
    Traceback (most recent call last):
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
        submitit_main()
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
        process_job(args.folder)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
        raise error
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
        with utils.temporary_save_path(paths.result_pickle) as tmppath:  # save somewhere else, and move
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/contextlib.py", line 113, in __enter__
        return next(self.gen)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 171, in temporary_save_path
        assert not tmppath.exists(), "A temporary saved file already exists."
    AssertionError: A temporary saved file already exists.
    srun: error: learnfair0292: task 0: Exited with exit code 1
    srun: launch/slurm: _step_signal: Terminating StepId=38544509.1
    

    My analysis of the error is as follows. The temporary save file error is thrown in process_job here. One possible reason why this could happen is if the tmppath was created previously in the try block, but there was a failure before the context ended.

    This could happen either in the utils.cloudpickle_dump() call or in logger.info(). However, I can see a temporary save path 38544484_16_0_result.pkl.save_tmp that contains the following information ('success', None). So is the error with logger? Or am I completely off here?

    I'm running a job array with 1024 jobs and 128 slurm_array_parallelism. The code run by the jobs actually completed and the results were saved. So I don't think this is an error in the python function I ran.

    opened by srama2512 7
  • Adding SnapshotManager


    This allows users to create a snapshot of the current git repository and launch the job from this snapshot. This can prevent jobs that are slow to start or re-queued from picking up local changes

    CLA Signed 
    opened by lematt1991 7
  • [To be discussed] Add option to submit within a batch context


    The aim is to automatically batch jobs which can be batched together, but submit whenever we need information. E.g. in nevergrad we send 40 jobs for evaluation, which could be packed together, and then whenever a job is finished we reschedule a new evaluation. Currently this is impossible with a batch context (or any other option), but with this change it would be possible, by running the optimization within a batch context. This way initial submissions are packed, and sent whenever we start checking their status.
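
    For context, a minimal sketch of how a batch context groups submissions into one array (evaluate, candidates, folder and partition are placeholders):

    import submitit

    executor = submitit.AutoExecutor(folder="log_batch")
    executor.update_parameters(timeout_min=60, slurm_partition="dev")
    with executor.batch():
        # nothing is submitted yet, submissions are only collected...
        jobs = [executor.submit(evaluate, cfg) for cfg in candidates]  # evaluate/candidates are placeholders
    # ...and sent together as a single job array when the context exits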

    CLA Signed 
    opened by jrapin 6
  • TypeError: an integer is required (got type bytes)


    Since upgrading to python 3.8 I can't access my old jobs' submission pickle (error below).

    The problem might be related to this issue or this one but I have no clue what it means.

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-17-0248393e65dd> in <module>
         30         if state != "COMPLETED":
         31             continue
    ---> 32         row = job.submission().kwargs
         33         row["scores"] = job.result()
         34         row["exp_name"] = exp_name
    
    ~/dev/ext/submitit/submitit/core/core.py in submission(self)
        206             self.paths.submitted_pickle.exists()
        207         ), f"Cannot find job submission pickle: {self.paths.submitted_pickle}"
    --> 208         return utils.DelayedSubmission.load(self.paths.submitted_pickle)
        209 
        210     def cancel_at_deletion(self, value: bool = True) -> "Job[R]":
    
    ~/dev/ext/submitit/submitit/core/utils.py in load(cls, filepath)
        133     @classmethod
        134     def load(cls: Type["DelayedSubmission"], filepath: Union[str, Path]) -> "DelayedSubmission":
    --> 135         obj = pickle_load(filepath)
        136         # following assertion is relaxed compared to isinstance, to allow flexibility
        137         # (Eg: copying this class in a project to be able to have checkpointable jobs without adding submitit as dependency)
    
    ~/dev/ext/submitit/submitit/core/utils.py in pickle_load(filename)
        271     # this is used by cloudpickle as well
        272     with open(filename, "rb") as ifile:
    --> 273         return pickle.load(ifile)
        274 
        275 
    
    TypeError: an integer is required (got type bytes)
    

    Repro: Start a job with python 3.7 and then try to access it in python 3.8. In python 3.7

    Python 3.7.4 (default, Aug 13 2019, 20:35:49)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import submitit
    >>>
    >>> def add(a, b):
    ...     return a + b
    ...
    >>> executor = submitit.AutoExecutor(folder="log_test")
    >>> executor.update_parameters(timeout_min=1, slurm_partition="dev")
    >>> job = executor.submit(add, 5, 7)
    >>> print(job.job_id)
    33389760
    >>> job.submission()
    <submitit.core.utils.DelayedSubmission object at 0x7f42f5952bd0>
    

    In python 3.8

    Python 3.8.5 | packaged by conda-forge | (default, Jul 24 2020, 01:25:15)
    [GCC 7.5.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import submitit
    >>> job = submitit.SlurmJob("log_test", "33389760")
    >>> job.submission()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/core.py", line 208, in submission
        return utils.DelayedSubmission.load(self.paths.submitted_pickle)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 135, in load
        obj = pickle_load(filepath)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 273, in pickle_load
        return pickle.load(ifile)
    TypeError: an integer is required (got type bytes)
    >>> job.submission()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/core.py", line 208, in submission
        return utils.DelayedSubmission.load(self.paths.submitted_pickle)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 135, in load
        obj = pickle_load(filepath)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 273, in pickle_load
        return pickle.load(ifile)
    TypeError: an integer is required (got type bytes)
    
    opened by louismartin 6
  • Set additional slurm parameters


    Hello,

    I would like to know if it's possible to set additional slurm parameters (and how to set them), because I couldn't find this information in the documentation.

    For example, I have a few arguments that I usually set using srun, such as --account=myaccount --hint=nomultithread --distribution=block:block --exclusive, but I have no idea how to set them in submitit.

    Thank you in advance for your answer!
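
    For reference, one way this is typically done, assuming the slurm_additional_parameters option of AutoExecutor.update_parameters is available in your version (the values below are the ones from the question):

    import submitit

    executor = submitit.AutoExecutor(folder="log_test")
    executor.update_parameters(
        slurm_partition="dev",
        # extra #SBATCH lines, passed through as key/value pairs
        slurm_additional_parameters={
            "account": "myaccount",
            "hint": "nomultithread",
            "distribution": "block:block",
            "exclusive": True,
        },
    )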

    opened by netw0rkf10w 6
  • [BUG] `Scontrol` Error when checkpointing / preemption on slurm


    Hi,

    For me, submitit works great when there is no need for checkpointing / preemption, but I have the following error when I need to checkpoint: FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'

    Specifically, I can reproduce this error by running docs/mnist.py, I ran the following three version of the mnist example to understand the issue:

    • Running docs/mnist.py on slurm as is, I get the previous error. Full logs: stderr , stdout
    • If I ssh into some slurm node that I get allocated to and run docs/mnist.py on the local executor (cluster="local"), everything works as it should: so submitit + checkpointing works fine.
    • Running docs/mnist.py but without preemption ( removing timeout_min and job._interrupt()) everything works fine: so slurm + submitit work fine.

    Also scontrol seems to work fine on my login node, so I don't understand why the check_call(["scontrol", "requeue", jid]) does not work. That being said, Scontrol does not work on the nodes I get allocated to (it only works from the login nodes) but from my understanding check_call(["scontrol", "requeue", jid]) is called from where I call submitit and thus not having scontrol on the allocated nodes shouldn't be an issue, am I correct?

    Thank you !

    opened by YannDubs 5
  • Can submitit manage chain dependencies?


    Hi, thanks for this awesome project!

    I started to write something similar but then realized that submitit exists and is much more advanced than my small project!

    However, I realized that chain dependencies (as implemented in dask.distributed) seem to be missing from submitit.

    Would it make sense to implement it? Or maybe it's already there?

    It should be quite easy to implement by using the sbatch option --dependency.
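
    For illustration, a hedged sketch of chaining through the --dependency flag, assuming a slurm_additional_parameters passthrough is available (first_step, second_step and the folder are placeholders):

    import submitit

    executor = submitit.AutoExecutor(folder="log_chain")
    executor.update_parameters(timeout_min=60, slurm_partition="dev")
    first = executor.submit(first_step)  # first_step is a placeholder function

    # ask Slurm to start the second job only after the first one completed successfully
    executor.update_parameters(
        slurm_additional_parameters={"dependency": f"afterok:{first.job_id}"}
    )
    second = executor.submit(second_step)  # second_step is a placeholder function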

    opened by eserie 0
  • Should we submit job on login node?


    Hi, I'm trying to use submitit to submit a job to my slurm cluster on gcp. In this case, does it make sense to run a submitit script from the login node? When I run the example script I see it execute on the local machine rather than on the 'compute' instances of the cluster. It does not seem to allocate an instance from the partition that I give either.

    opened by surajmenon72 0
  • No user code logging output is shown in logs


    Summary: I am not seeing expected logging info in SLURM log files.

    Given this submitit script:

    import logging
    import logging.config
    
    import submitit
    import yaml
    
    from src.the_module import the_func
    
    with open("log.yml", "rt", encoding="utf-8") as logconfig:
        config = yaml.load(logconfig.read(), Loader=yaml.FullLoader)
    logging.config.dictConfig(config)
    
    executor = submitit.AutoExecutor(folder="log_test")
    
    executor.update_parameters(timeout_min=1, slurm_partition="dev")
    job = executor.submit(the_func, the, args)
    
    output = job.result()
    

    the logging in src.the_module:

    LOGGER = logging.getLogger(__name__)
    ...
    LOGGER.info(...)
    ...
    

    the logging config in "log.yml":

    version: 1
    ...
    loggers:
      src.the_module:
        level: INFO
        handlers: [console, file]
      the_module:
        level: INFO
        handlers: [console, file]
    

    I do not see the_module INFO lines in "log_test/JOBID_0_log.out", only the default submitit INFO log lines and the job stdout. Is this even supposed to work that way or does logging have to be configured some other way in submitit?

    opened by fleimgruber 0
  • be tolerating about sacct error?


    On my slurm cluster I haven't set up accounting yet. Is the following error message related to that? Maybe the accounting option can be turned off to avoid this error message?

    I was running it with hydra.

    [2022-11-08 14:00:40,915][HYDRA] Call #2 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    Slurm accounting storage is disabled
    submitit WARNING (2022-11-08 14:00:43,921) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    submitit WARNING (2022-11-08 14:00:43,921) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    [2022-11-08 14:00:43,921][HYDRA] Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    
    opened by min-xu-ai 0
  • array_parallelism for LocalExecutor


    When using job arrays, SlurmExecutor may limit the number of concurrently running jobs via the array_parallelism parameter. However, it seems to me that LocalExecutor does not have the corresponding functionality. Would it be meaningful to add an option to LocalExecutor which limits the number of concurrently running jobs? Would it be too cumbersome to make PicklingExecutor._internal_process_submissions limit the concurrent jobs without problems?

    Use case

    • A big job is partitioned into many smaller jobs
    • The slurm queue is full with many pending jobs
    • Would like to use the same codebase using submitit in a separate compute environment without slurm, with minimal code changes

    Current solution

    • Use ThreadPoolExecutor, where the workers run a function which creates its own LocalExecutor, submits, and waits until it finishes (see the sketch after this list).
    • And the pool controls the number of concurrent jobs
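
    A minimal sketch of that workaround (fn and configs are placeholders, and the folder name is an assumption):

    from concurrent.futures import ThreadPoolExecutor

    import submitit

    def run_locally(cfg):
        # each worker thread gets its own LocalExecutor and blocks until its job finishes
        local = submitit.LocalExecutor(folder="log_local")
        job = local.submit(fn, cfg)  # fn is the placeholder function from above
        return job.result()

    # the thread pool caps concurrency at 5, mimicking array_parallelism=5
    with ThreadPoolExecutor(max_workers=5) as pool:
        results = list(pool.map(run_locally, configs))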

    Suggestion

    • The following code to execute without running more than 5 jobs concurrently
    executor = LocalExecutor(folder=somewhere)
    executor.update_parameters(array_parallelism=5)
    
    with executor.batch():
        ...
    
    opened by se-ok 0
Releases(1.2.0)
  • 1.2.0(Feb 1, 2021)

    • #1604 Load numpy first if available
    • #1603 Don't rely on Slurm for detecting timeout vs preemption, due to a regression in Slurm between 19.04 and 20.02
    • #1602 Fix quoting of paths in various places
    • #1598 Snapshot manager to copy code before starting job

    Source code(tar.gz)
    Source code(zip)
  • 1.1.3(Oct 22, 2020)

Owner
Facebook Incubator
We work hard to contribute our work back to the web, mobile, big data, & infrastructure communities. NB: members must have two-factor auth.