Managing wandb agents on a Slurm cluster

The top-level wandb-slurm command performs various actions on Slurm, such as submitting a job that starts wandb sweep agents or stopping the existing agents for a sweep.

Starting agents for a sweep

Usage: wandb-slurm start-agents [OPTIONS]

  --inform-before-time INTEGER  How many seconds before shutdown the agents
                                should be informed.
  --signals LIST                Signals to handle, separated by '|', for
                                example 'TERM|INT|CONT'  (default:
  --mem TEXT                    #SBATCH --mem=
  --run-count INTEGER           Runs per agent, i.e. `wandb agent --count
                                <run-count> sweep_id` (default: unlimited).
  --cpus-per-task INTEGER       #SBATCH --cpus-per-task=
  --partition TEXT              #SBATCH --partition=
  --num-gpus INTEGER            #SBATCH --gres=gpu:
  --num-agents INTEGER
  --edit / --no-edit            Edit final
  --chain / --no-chain          Insert dependencies between jobs by starting
                                num-agents serially.
  --dependency TEXT             Dependency types:

                                    after:jobid[:jobid...]      job can begin
                                    after the specified jobs have started

                                    afterany:jobid[:jobid...]   job can begin
                                    after the specified jobs have terminated

                                    afternotok:jobid[:jobid...] job can begin
                                    after the specified jobs have failed

                                    afterok:jobid[:jobid...]    job can begin
                                    after the specified jobs have run to
                                    completion with an exit code of zero (see
                                    the user guide for caveats).

                                    singleton   jobs can begin execution after
                                    all previously launched jobs with the same
                                    name and user have ended. This is useful
                                    to collate results of a swarm or to send a
                                    notification at the end of a swarm.

                                        See `sbatch <
                                        /sbatch.html>`_ doc for details.
  --verbatim-args LIST          Arguments in kw=value form, separated by '|', to
                                drop verbatim in For example
  --dry-run                     Only create files and show command but do not
                                submit jobs.
  --confirm                     Whether to ask for confirmation before
  --help                        Show this message and exit.
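
The dependency types listed for --dependency all share the shape type:jobid[:jobid...]. As a minimal sketch, the helper below assembles such a string from a dependency type and a list of job IDs. The function name build_dependency and the job IDs 1001 and 1002 are hypothetical placeholders, not part of wandb-slurm.

```shell
#!/bin/sh
# Hypothetical helper: build a dependency string such as "afterok:1001:1002".
build_dependency() {
  dep="$1"   # dependency type, e.g. after, afterany, afternotok, afterok
  shift
  # Append each remaining argument as ":jobid".
  for job_id in "$@"; do
    dep="$dep:$job_id"
  done
  printf '%s\n' "$dep"
}

# Job IDs 1001 and 1002 are placeholders; substitute IDs of real Slurm jobs.
dependency=$(build_dependency afterok 1001 1002)

# Print the option as it would be appended to a start-agents invocation.
printf '%s\n' "--dependency $dependency"
```

The resulting value can be passed to start-agents through --dependency. A type that takes no job IDs, such as singleton, is returned unchanged.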

The following is an example invocation:

$ wandb-slurm \
--entity wandb_team_or_username \
--project wandb_project_name \
--sweep 216pxkwa \
start-agents \
--mem 10GB \
--run-count 10 \
--cpus-per-task 6 \
--partition titanx-long \
--num-gpus 1 \
--num-agents 2 \
--verbatim-args "exclude=node030,node095,node029"
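
A variant of the invocation above can be previewed with --dry-run, which, according to the option list, only creates files and shows the command without submitting jobs. The sketch below also joins several kw=value pairs with '|', as --verbatim-args expects. The helper join_pipe is hypothetical, and the pair time=04:00:00 is an illustrative assumption; it presumes each pair is forwarded to sbatch unchanged, which the option description suggests but does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with '|'.
join_pipe() {
  joined=""
  for item in "$@"; do
    if [ -z "$joined" ]; then
      joined="$item"
    else
      joined="$joined|$item"
    fi
  done
  printf '%s\n' "$joined"
}

# "time=04:00:00" is an illustrative assumption, not taken from the tool's docs.
verbatim_args=$(join_pipe "exclude=node030,node095,node029" "time=04:00:00")

# Store the full command in the positional parameters so it can be printed and run.
set -- wandb-slurm \
  --entity wandb_team_or_username \
  --project wandb_project_name \
  --sweep 216pxkwa \
  start-agents \
  --mem 10GB \
  --num-gpus 1 \
  --num-agents 2 \
  --verbatim-args "$verbatim_args" \
  --dry-run

# Show the command for review.
printf '%s ' "$@"
printf '\n'

# Run it only if wandb-slurm is installed; with --dry-run nothing is submitted.
if command -v wandb-slurm >/dev/null 2>&1; then
  "$@"
fi
```

Once the generated files look right, dropping --dry-run from the command submits the jobs.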