Python 3.6+ toolbox for submitting jobs to Slurm

Overview


Submit it!

What is submitit?

Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It basically wraps submission and provides access to results, logs and more. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Submitit allows you to switch seamlessly between executing on Slurm or locally.

An example is worth a thousand words: performing an addition

From inside an environment with submitit installed:

import submitit

def add(a, b):
    return a + b

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(job.job_id)  # ID of your job

output = job.result()  # waits for completion and returns output
assert output == 12  # 5 + 7 = 12...  your addition was computed in the cluster

The Job class also provides tools for reading the log files (job.stdout() and job.stderr()).

If what you want to run is a command, turn it into a Python function using submitit.helpers.CommandFunction, then submit it. By default stdout is silenced in CommandFunction, but it can be unsilenced with verbose=True.
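
For instance, a minimal sketch (the command, folder and partition below are placeholders mirroring the example above):

import submitit

# wrap a command line; calling the resulting function returns the command's stdout as a string
function = submitit.helpers.CommandFunction(["which", "python"], verbose=True)

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(function)
print(job.result())  # stdout produced by the command on the cluster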

Find more examples here!!!

Submitit is a Python 3.6+ toolbox for submitting jobs to Slurm. It aims at running Python functions from Python code.

Install

Quick install, in a virtualenv/conda environment where pip is installed (check which pip):

  • stable release:
    pip install submitit
    
  • stable release using conda:
    conda install -c conda-forge submitit
    
  • master branch:
    pip install git+https://github.com/facebookincubator/submitit@master#egg=submitit
    

You can try running the MNIST example to check that everything is working as expected (requires sklearn).

Documentation

See the following pages for more detailed information:

  • Examples: for a bunch of examples dealing with errors, concurrency, multi-tasking etc...
  • Structure and main objects: to get a better understanding of how submitit works, which files are created for each job, and the main objects you will interact with.
  • Checkpointing: to understand how you can configure your job to get checkpointed when preempted and/or timed-out.
  • Tips and caveats: for a bunch of information that can be handy when working with submitit.
  • Hyperparameter search with nevergrad: basic example of nevergrad usage and how it interfaces with submitit.

Goals

The aim of this Python3 package is to be able to launch jobs on Slurm painlessly from inside Python, using the same submission and job patterns as the standard library package concurrent.futures.

Here are a few benefits of using this lightweight package:

  • submit any function, even lambda and script-defined functions.
  • raises an error with stack trace if the job failed.
  • requeue preempted jobs (Slurm only)
  • swap between the submitit executor and one of the concurrent.futures executors in one line, so that it is easy to run your code either on Slurm, or locally with multithreading for instance (see the sketch after this list).
  • checkpoints stateful callables when preempted or timed-out and requeue from current state (advanced feature).
  • easy access to task local/global rank for multi-nodes/tasks jobs.
  • same code can work for different clusters thanks to a plugin system.
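
As a minimal sketch of the executor swap mentioned in the list above (make_executor is a hypothetical helper, the folder name is a placeholder, and running on Slurm would additionally need a partition/timeout via update_parameters):

import concurrent.futures
import submitit

def make_executor(kind):
    # hypothetical helper: same submit/result pattern, different backend
    if kind == "threads":
        return concurrent.futures.ThreadPoolExecutor(max_workers=4)
    # cluster="local" runs jobs in local processes, cluster="slurm" goes through sbatch
    return submitit.AutoExecutor(folder="log_test", cluster=kind)

executor = make_executor("threads")  # or "local" / "slurm"
task = executor.submit(sum, [5, 7])
print(task.result())  # 12, whichever executor computed it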

Submitit is used by FAIR researchers on the FAIR cluster. The defaults are chosen to make their life easier, and might not be ideal for every cluster.

Non-goals

  • a commandline tool for running slurm jobs. Here, everything happens inside Python. To this end, you can however use Hydra's submitit plugin (version >= 1.0.0).
  • a task queue, this only implements the ability to launch tasks, but does not schedule them in any way.
  • being used in Python2! This is a Python3.6+ only package :)

Comparison with dask.distributed

dask is a nice framework for distributed computing. dask.distributed provides the same concurrent.futures executor API as submitit:

from distributed import Client
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(processes=1, cores=2, memory="2GB")
cluster.scale(2)  # this may take a few seconds to launch
executor = Client(cluster)
executor.submit(...)

The key difference with submitit is that dask.distributed distributes the jobs to a pool of workers (see the cluster variable above) while submitit jobs are directly jobs on the cluster. In that sense submitit is a lower level interface than dask.distributed and you get more direct control over your jobs, including individual stdout and stderr, and possibly checkpointing in case of preemption and timeout. On the other hand, you should avoid submitting multiple small tasks with submitit, which would create many independent jobs and possibly overload the cluster, while you can do it without any problem through dask.distributed.

Contributors

By chronological order: Jérémy Rapin, Louis Martin, Lowik Chanussot, Lucas Hosseini, Fabio Petroni, Francisco Massa, Guillaume Wenzek, Thibaut Lavril, Vinayak Tantia, Andrea Vedaldi, Max Nickel, Quentin Duval (feel free to contribute and add your name ;) )

License

Submitit is released under the MIT License.

Comments
  • Import error


    This bug is baffling me. I'm sure this is user error because I normally have no issues with your code. I have not figured out what is different from my other submissions, but maybe you've seen this before?

    .../python3.8/site-packages/submitit/core/_submit.py", line 7, in <module>
        from .submission import submitit_main
    ImportError: attempted relative import with no known parent package
    
    bug 
    opened by jgbos 26
  • Fixing deadlock when command prints a lot to stderr


    Currently, only stdout is read on the fly. If the stderr pipe fills up, the subprocess will deadlock when trying to write to stderr. As the parent process only reads stdout and waits for the process to finish, this will never resolve. This change instead uses the select function to find which file descriptors can be read from.

    This also adds a unit test for this specific case.
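
    A minimal sketch of this approach (not the actual patch; "some_command" is a placeholder): read whichever pipe has data ready, so the child never blocks on a full stderr buffer.

    import os
    import select
    import subprocess

    # "some_command" is a placeholder for the wrapped command
    proc = subprocess.Popen(["some_command"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    buffers = {proc.stdout.fileno(): [], proc.stderr.fileno(): []}
    open_fds = set(buffers)
    while open_fds:
        # block until at least one of the pipes has data (or has reached EOF)
        readable, _, _ = select.select(list(open_fds), [], [])
        for fd in readable:
            chunk = os.read(fd, 4096)
            if chunk:
                buffers[fd].append(chunk)
            else:  # an empty read means this pipe is closed
                open_fds.remove(fd)
    proc.wait()
    stdout = b"".join(buffers[proc.stdout.fileno()])
    stderr = b"".join(buffers[proc.stderr.fileno()])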

    CLA Signed 
    opened by adefossez 9
  • Job is not ending - Bypassing signal SIGTERM


    Hello,

    I am sending a job to my Slurm cluster with submitit. The job runs as it is supposed to (you can see the Finished script log), but the slurm job itself does not finish. Instead, I get these Bypassing signal messages. Because I need this job to finish before moving on to other jobs, I am in a deadlock. I am really not sure what I should do and would appreciate the help.

    Here are logs from my neverending job :(

    03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 19:56:03,093) - Bypassing signal SIGTERM
    03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
    

    To provide more information, this happens when I run job arrays:

    jobs = executor.map_array(fn, configs)
    job2cfg = {job: cfg for job, cfg in zip(jobs, configs)}
    list(tqdm(submitit.helpers.as_completed(jobs), total=len(jobs)))
    

    Some of them finish successfully, and others get stuck (until I had to clear the queue and scancel them):

    ❯ tail /home/olab/kirstain/masking/log_test/71604_*/71604_*_0_log.err
    ==> /home/olab/kirstain/masking/log_test/71604_0/71604_0_0_log.err <==
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Flattening
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 292793 blocks to write
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Batching
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 3 shards to write
    100%|██████████| 3/3 [00:00<00:00, 3093.14it/s]
    03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Writing 3 shards to /home/olab/kirstain/masking/data/5/1024/shards/ArXiv
    100%|██████████| 3/3 [00:08<00:00,  2.78s/it]
    03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished script
    03/31/2022 20:53:18 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_10/71604_10_0_log.err <==
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,179) - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:56:55 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_11/71604_11_0_log.err <==
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:00:42,093) - Bypassing signal SIGTERM
    03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:00:50 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_12/71604_12_0_log.err <==
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:51:13 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_13/71604_13_0_log.err <==
    100%|██████████| 65/65 [00:00<00:00, 10539.67it/s]
    03/31/2022 21:06:43 - INFO - masking.scripts.shard_corpus - Writing 65 shards to /home/olab/kirstain/masking/data/5/1024/shards/Wikipedia_en
    100%|██████████| 65/65 [01:37<00:00,  1.50s/it]
    03/31/2022 21:08:22 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 21:08:23 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 21:08:23,557) - Bypassing signal SIGTERM
    03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:08:23,561) - Bypassing signal SIGTERM
    03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:08:31 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_1/71604_1_0_log.err <==
    03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 20:59:04,562) - Bypassing signal SIGTERM
    03/31/2022 20:59:04 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,084) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71606.0 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***
    submitit WARNING (2022-03-31 21:25:40,085) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    slurmstepd: error: *** JOB 71606 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_2/71604_2_0_log.err <==
    03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Batching
    03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - We have 11 shards to write
    100%|██████████| 11/11 [00:00<00:00, 3551.76it/s]
    03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Writing 11 shards to /home/olab/kirstain/masking/data/5/1024/shards/Books3
    100%|██████████| 11/11 [00:33<00:00,  3.09s/it]
    03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 21:20:33,823) - Bypassing signal SIGTERM
    03/31/2022 21:20:33 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:20:34 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_3/71604_3_0_log.err <==
    submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:52:43 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_4/71604_4_0_log.err <==
    03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,096) - Bypassing signal SIGCONT
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 71609.0 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***
    submitit WARNING (2022-03-31 21:25:40,106) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    slurmstepd: error: *** JOB 71609 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_5/71604_5_0_log.err <==
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,092) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71610.0 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    submitit WARNING (2022-03-31 21:25:40,093) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    slurmstepd: error: *** JOB 71610 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_6/71604_6_0_log.err <==
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,089) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71611.0 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    submitit WARNING (2022-03-31 21:25:40,091) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    slurmstepd: error: *** JOB 71611 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_7/71604_7_0_log.err <==
    100%|██████████| 13/13 [00:00<00:00, 5749.26it/s]
    03/31/2022 20:53:43 - INFO - masking.scripts.shard_corpus - Writing 13 shards to /home/olab/kirstain/masking/data/5/1024/shards/OpenWebText2
    100%|██████████| 13/13 [00:27<00:00,  2.08s/it]
    03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 20:54:11,320) - Bypassing signal SIGTERM
    03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:54:11,321) - Bypassing signal SIGTERM
    03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:54:14 - INFO - submitit - Job completed successfully
    
    ==> /home/olab/kirstain/masking/log_test/71604_8/71604_8_0_log.err <==
    submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
    03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 21:25:40,086) - Bypassing signal SIGCONT
    slurmstepd: error: *** STEP 71613.0 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***
    submitit WARNING (2022-03-31 21:25:40,088) - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
    slurmstepd: error: *** JOB 71613 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***
    
    ==> /home/olab/kirstain/masking/log_test/71604_9/71604_9_0_log.err <==
    100%|██████████| 21/21 [00:00<00:00, 4838.52it/s]
    03/31/2022 20:56:39 - INFO - masking.scripts.shard_corpus - Writing 21 shards to /home/olab/kirstain/masking/data/5/1024/shards/Pile-CC
    100%|██████████| 21/21 [00:42<00:00,  2.01s/it]
    03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished Writing
    03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished script
    submitit WARNING (2022-03-31 20:57:21,672) - Bypassing signal SIGTERM
    03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
    submitit WARNING (2022-03-31 20:57:21,674) - Bypassing signal SIGTERM
    03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
    03/31/2022 20:57:24 - INFO - submitit - Job completed successfully
    
    opened by yuvalkirstain 7
  • How to comment a slurm variable?


    Hi All,

    I observe the following error, which is due to the added "#SBATCH --gpus-per-node=4" line in the generated slurm script.

    Error : submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification

    Can developers/users of submitit guide me on where to comment/delete the above line in the slurm script before it is submitted by the sbatch command?

    Thanks, Amit Ruhela

    opened by aruhela 7
  • Asyncio methods for job


    asyncio has a lot of prebuilt tools for dealing with asynchronous execution (like submitit jobs). gather allows dealing with parts of the jobs failing, as_completed is available out of the box, timeouts can be added, and we can transparently combine jobs with other async stuff easily.

    async also sounds cooler than blocking :D

    I added tests for the single task job cases as I didn't see other tests for the multi task code. But it might be worth adding these too.
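
    For illustration, a minimal sketch of combining submitit jobs with asyncio without any dedicated API (jobs is assumed to be a list of already-submitted submitit jobs; the blocking job.result() calls are pushed to threads so they can be gathered and given a timeout):

    import asyncio

    async def wait_all(jobs, timeout=None):
        loop = asyncio.get_running_loop()
        # run each blocking job.result() call in the default thread pool
        futures = [loop.run_in_executor(None, job.result) for job in jobs]
        return await asyncio.wait_for(asyncio.gather(*futures), timeout=timeout)

    # results = asyncio.run(wait_all(jobs, timeout=3600))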

    CLA Signed 
    opened by Mortimerp9 7
  • Temporary saved file already exists


    Hi,

    Thank you for this amazing tool! I just started using it recently. I'm encountering some weird error and I was hoping you could help me fix it. Here is the error log:

    submitit WARNING (2021-03-28 01:13:17,420) - Caught signal 15 on learnfair0463: this job is preempted.
    slurmstepd: error: *** STEP 38544509.0 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
    slurmstepd: error: *** JOB 38544509 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
    submitit WARNING (2021-03-28 01:13:17,482) - Bypassing signal 18
    submitit WARNING (2021-03-28 01:13:17,483) - Caught signal 15 on learnfair0463: this job is preempted.
    38544484_16: Job is pending execution
    submitit ERROR (2021-03-28 01:13:17,535) - Could not dump error:
    Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
    
    because of A temporary saved file already exists.
    submitit ERROR (2021-03-28 01:13:17,535) - Submitted job triggered an exception
    Traceback (most recent call last):
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
        submitit_main()
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
        process_job(args.folder)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
        raise error
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 55, in process_job
        utils.cloudpickle_dump(("success", result), tmppath)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 238, in cloudpickle_dump
        cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/job_environment.py", line 209, in checkpoint_and_try_requeue
        self.env._requeue(countdown)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/slurm/slurm.py", line 193, in _requeue
        subprocess.check_call(["scontrol", "requeue", jid])
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
    /bin/bash: /public/apps/anaconda3/2020.11/lib/libtinfo.so.6: no version information available (required by /bin/bash)
    submitit ERROR (2021-03-28 01:35:36,155) - Could not dump error:
    A temporary saved file already exists.
    
    because of A temporary saved file already exists.
    submitit ERROR (2021-03-28 01:35:36,156) - Submitted job triggered an exception
    Traceback (most recent call last):
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
        submitit_main()
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
        process_job(args.folder)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
        raise error
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
        with utils.temporary_save_path(paths.result_pickle) as tmppath:  # save somewhere else, and move
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/contextlib.py", line 113, in __enter__
        return next(self.gen)
      File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 171, in temporary_save_path
        assert not tmppath.exists(), "A temporary saved file already exists."
    AssertionError: A temporary saved file already exists.
    srun: error: learnfair0292: task 0: Exited with exit code 1
    srun: launch/slurm: _step_signal: Terminating StepId=38544509.1
    

    My analysis of the error is as follows. The temporary save file error is thrown in process_job here. One possible reason why this could happen is if the tmppath was created previously in the try block, but there was a failure before the context ended.

    This could happen either in the utils.cloudpickle_dump() call or in logger.info(). However, I can see a temporary save path 38544484_16_0_result.pkl.save_tmp that contains the following information ('success', None). So is the error with logger? Or am I completely off here?

    I'm running a job array with 1024 jobs and 128 slurm_array_parallelism. The code run by the jobs actually completed and the results were saved. So I don't think this is an error in the python function I ran.

    opened by srama2512 7
  • Adding SnapshotManager


    This allows users to create a snapshot of the current git repository and launch the job from this snapshot. This can prevent jobs that are slow to start or re-queued from picking up local changes

    CLA Signed 
    opened by lematt1991 7
  • [To be discussed] Add option to submit within a batch context


    The aim is to automatically batch jobs which can be batched together, but submit whenever we need information. E.g. in nevergrad we send 40 jobs for evaluation, which could be packed together, and then whenever a job is finished we reschedule a new evaluation. Currently this is impossible with a batch context (or any other option), but with this change it would be possible, by running the optimization within a batch context. This way initial submissions are packed, and sent whenever we start checking their status.
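
    For context, a minimal sketch of how a batch context groups submissions into one array (evaluate, candidates, folder and partition are placeholders):

    import submitit

    executor = submitit.AutoExecutor(folder="log_batch")
    executor.update_parameters(timeout_min=60, slurm_partition="dev")
    with executor.batch():
        # nothing is submitted yet, submissions are only collected...
        jobs = [executor.submit(evaluate, cfg) for cfg in candidates]  # evaluate/candidates are placeholders
    # ...and sent together as a single job array when the context exits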

    CLA Signed 
    opened by jrapin 6
  • TypeError: an integer is required (got type bytes)


    Since upgrading to python 3.8 I can't access my old jobs' submission pickle (error below).

    The problem might be related to this issue or this one but I have no clue what it means.

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-17-0248393e65dd> in <module>
         30         if state != "COMPLETED":
         31             continue
    ---> 32         row = job.submission().kwargs
         33         row["scores"] = job.result()
         34         row["exp_name"] = exp_name
    
    ~/dev/ext/submitit/submitit/core/core.py in submission(self)
        206             self.paths.submitted_pickle.exists()
        207         ), f"Cannot find job submission pickle: {self.paths.submitted_pickle}"
    --> 208         return utils.DelayedSubmission.load(self.paths.submitted_pickle)
        209 
        210     def cancel_at_deletion(self, value: bool = True) -> "Job[R]":
    
    ~/dev/ext/submitit/submitit/core/utils.py in load(cls, filepath)
        133     @classmethod
        134     def load(cls: Type["DelayedSubmission"], filepath: Union[str, Path]) -> "DelayedSubmission":
    --> 135         obj = pickle_load(filepath)
        136         # following assertion is relaxed compared to isinstance, to allow flexibility
        137         # (Eg: copying this class in a project to be able to have checkpointable jobs without adding submitit as dependency)
    
    ~/dev/ext/submitit/submitit/core/utils.py in pickle_load(filename)
        271     # this is used by cloudpickle as well
        272     with open(filename, "rb") as ifile:
    --> 273         return pickle.load(ifile)
        274 
        275 
    
    TypeError: an integer is required (got type bytes)
    

    Repro: Start a job with python 3.7 and then try to access it in python 3.8. In python 3.7

    Python 3.7.4 (default, Aug 13 2019, 20:35:49)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import submitit
    >>>
    >>> def add(a, b):
    ...     return a + b
    ...
    >>> executor = submitit.AutoExecutor(folder="log_test")
    >>> executor.update_parameters(timeout_min=1, slurm_partition="dev")
    >>> job = executor.submit(add, 5, 7)
    >>> print(job.job_id)
    33389760
    >>> job.submission()
    <submitit.core.utils.DelayedSubmission object at 0x7f42f5952bd0>
    

    In python 3.8

    Python 3.8.5 | packaged by conda-forge | (default, Jul 24 2020, 01:25:15)
    [GCC 7.5.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import submitit
    >>> job = submitit.SlurmJob("log_test", "33389760")
    >>> job.submission()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/core.py", line 208, in submission
        return utils.DelayedSubmission.load(self.paths.submitted_pickle)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 135, in load
        obj = pickle_load(filepath)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 273, in pickle_load
        return pickle.load(ifile)
    TypeError: an integer is required (got type bytes)
    >>> job.submission()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/core.py", line 208, in submission
        return utils.DelayedSubmission.load(self.paths.submitted_pickle)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 135, in load
        obj = pickle_load(filepath)
      File "/private/home/louismartin/dev/ext/submitit/submitit/core/utils.py", line 273, in pickle_load
        return pickle.load(ifile)
    TypeError: an integer is required (got type bytes)
    
    opened by louismartin 6
  • Set additional slurm parameters


    Hello,

    I would like to know if it's possible to set additional slurm parameters (and how to set them), because I couldn't find this information in the documentation.

    For example, I have a few arguments that I usually set using srun, such as --account=myaccount --hint=nomultithread --distribution=block:block --exclusive, but I have no idea how to set them in submitit.

    Thank you in advance for your answer!
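
    For reference, one way this is typically done, assuming the slurm_additional_parameters option of AutoExecutor.update_parameters is available in your version (the values below are the ones from the question):

    import submitit

    executor = submitit.AutoExecutor(folder="log_test")
    executor.update_parameters(
        slurm_partition="dev",
        # extra #SBATCH lines, passed through as key/value pairs
        slurm_additional_parameters={
            "account": "myaccount",
            "hint": "nomultithread",
            "distribution": "block:block",
            "exclusive": True,
        },
    )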

    opened by netw0rkf10w 6
  • [BUG] `Scontrol` Error when checkpointing / preemption on slurm


    Hi,

    For me, submitit works great when there is no need for checkpointing / preemption, but I have the following error when I need to checkpoint: FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'

    Specifically, I can reproduce this error by running docs/mnist.py, I ran the following three version of the mnist example to understand the issue:

    • Running docs/mnist.py on slurm as is, I get the previous error. Full logs: stderr , stdout
    • If I ssh into some slurm node that I get allocated to and run docs/mnist.py on the local executor (cluster="local"), everything works as it should: so submitit + checkpointing works fine.
    • Running docs/mnist.py but without preemption ( removing timeout_min and job._interrupt()) everything works fine: so slurm + submitit work fine.

    Also scontrol seems to work fine on my login node, so I don't understand why the check_call(["scontrol", "requeue", jid]) does not work. That being said, Scontrol does not work on the nodes I get allocated to (it only works from the login nodes) but from my understanding check_call(["scontrol", "requeue", jid]) is called from where I call submitit and thus not having scontrol on the allocated nodes shouldn't be an issue, am I correct?

    Thank you !

    opened by YannDubs 5
  • Can submitit manage chain dependencies?


    Hi, thanks for this awesome project!

    I started to write something similar but then realized that submitit exists and is much more advanced than my small project!

    However, I realized that chain dependencies (as implemented in dask.distributed) seem to be missing from submitit.

    Would it make sense to implement it? Or maybe it's already there?

    It should be quite easy to implement by using the sbatch option --dependency.
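
    For illustration, a hedged sketch of chaining through the --dependency flag, assuming a slurm_additional_parameters passthrough is available (first_step, second_step and the folder are placeholders):

    import submitit

    executor = submitit.AutoExecutor(folder="log_chain")
    executor.update_parameters(timeout_min=60, slurm_partition="dev")
    first = executor.submit(first_step)  # first_step is a placeholder function

    # ask Slurm to start the second job only after the first one completed successfully
    executor.update_parameters(
        slurm_additional_parameters={"dependency": f"afterok:{first.job_id}"}
    )
    second = executor.submit(second_step)  # second_step is a placeholder function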

    opened by eserie 0
  • Should we submit job on login node?


    Hi, I'm trying to use submitit to submit a job to my slurm cluster on gcp. In this case, does it make sense to run a submitit script from the login node? When I run the example script I see it execute on the local machine rather than on the 'compute' instances of the cluster. It does not seem to allocate an instance from the partition that I give either.

    opened by surajmenon72 0
  • No user code logging output is shown in logs


    Summary: I am not seeing expected logging info in SLURM log files.

    Given this submitit script:

    import logging
    import logging.config
    
    import submitit
    import yaml
    
    from src.the_module import the_func
    
    with open("log.yml", "rt", encoding="utf-8") as logconfig:
        config = yaml.load(logconfig.read(), Loader=yaml.FullLoader)
    logging.config.dictConfig(config)
    
    executor = submitit.AutoExecutor(folder="log_test")
    
    executor.update_parameters(timeout_min=1, slurm_partition="dev")
    job = executor.submit(the_func, the, args)
    
    output = job.result()
    

    the logging in src.the_module:

    LOGGER = logging.getLogger(__name__)
    ...
    LOGGER.info(...)
    ...
    

    the logging config in "log.yml":

    version: 1
    ...
    loggers:
      src.the_module:
        level: INFO
        handlers: [console, file]
      the_module:
        level: INFO
        handlers: [console, file]
    

    I do not see the_module INFO lines in "log_test/JOBID_0_log.out", only the default submitit INFO log lines and the job stdout. Is this even supposed to work that way or does logging have to be configured some other way in submitit?

    opened by fleimgruber 0
  • be tolerating about sacct error?


    On my slurm cluster I haven't set up accounting yet. Is the following error message related to that? Maybe the accounting option can be turned off to avoid this error message?

    I was running it with hydra.

    [2022-11-08 14:00:40,915][HYDRA] Call #2 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    Slurm accounting storage is disabled
    submitit WARNING (2022-11-08 14:00:43,921) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    submitit WARNING (2022-11-08 14:00:43,921) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    [2022-11-08 14:00:43,921][HYDRA] Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '37']' returned non-zero exit status 1., status may be inaccurate.
    
    opened by min-xu-ai 0
  • array_parallelism for LocalExecutor


    When using job arrays, SlurmExecutor may limit the number of concurrently running jobs via the array_parallelism parameter. However, it seems to me that LocalExecutor does not have the corresponding functionality. Would it be meaningful to add an option to LocalExecutor which limits the number of concurrently running jobs? Would it be too cumbersome to make PicklingExecutor._internal_process_submissions limit the concurrent jobs without problems?

    Use case

    • A big job is partitioned into many smaller jobs
    • The slurm queue is full with many pending jobs
    • Would like to use the same codebase using submitit in a separate compute environment without slurm, with minimal code changes

    Current solution

    • Use ThreadPoolExecutor, where the workers run a function which creates its own LocalExecutor, submits, and waits until it finishes (see the sketch after this list).
    • And the pool controls the number of concurrent jobs
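
    A minimal sketch of that workaround (fn and configs are placeholders, and the folder name is an assumption):

    from concurrent.futures import ThreadPoolExecutor

    import submitit

    def run_locally(cfg):
        # each worker thread gets its own LocalExecutor and blocks until its job finishes
        local = submitit.LocalExecutor(folder="log_local")
        job = local.submit(fn, cfg)  # fn is the placeholder function from above
        return job.result()

    # the thread pool caps concurrency at 5, mimicking array_parallelism=5
    with ThreadPoolExecutor(max_workers=5) as pool:
        results = list(pool.map(run_locally, configs))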

    Suggestion

    • The following code to execute without running more than 5 jobs concurrently
    executor = LocalExecutor(folder=somewhere)
    executor.update_parameters(array_parallelism=5)
    
    with executor.batch():
        ...
    
    opened by se-ok 0
Releases(1.2.0)
  • 1.2.0(Feb 1, 2021)

    • #1604 Load numpy first if available
    • #1603 Don't rely on Slurm for detecting timeout vs preemption, due to a regression in Slurm between 19.04 and 20.02
    • #1602 Fix quoting of paths in various places
    • #1598 Snapshot manager to copy code before starting job

    Source code(tar.gz)
    Source code(zip)
  • 1.1.3(Oct 22, 2020)

Owner
Facebook Incubator
We work hard to contribute our work back to the web, mobile, big data, & infrastructure communities. NB: members must have two-factor auth.