Resources

SO Campaign Manager models HPC systems as Resource objects that capture node counts, core layout, memory, and SLURM Quality of Service (QoS) policies. The planner uses this information to schedule workflows into the right QoS tier automatically.

Core Concepts

Resource

A Resource describes the physical characteristics of an HPC system:

  • name: Identifier used to select the resource in configuration files

  • nodes: Total number of compute nodes available

  • cores_per_node: CPU cores per node

  • memory_per_node: Memory per node in MB

  • qos: List of QoS policies available on the system

QoS Policy

A QosPolicy maps to a SLURM QoS tier and defines its limits:

  • name: QoS name as known to SLURM (e.g. short, regular)

  • max_walltime: Maximum walltime in minutes (None = unlimited)

  • max_jobs: Maximum number of concurrent jobs (None = unlimited)

  • max_cores: Maximum total cores that can be requested at once (None = unlimited)

The planner evaluates these limits at scheduling time and selects the smallest QoS tier that satisfies each workflow’s walltime and core requirements.

Specifying a Resource in Configuration

TOML campaigns

Set the target resource and the number of nodes to request in the [campaign] and [campaign.resources] sections:

[campaign]
deadline = "2d"
resource = "tiger3"

[campaign.resources]
nodes = 4
cores-per-node = 112

DAG YAML campaigns

campaign:
  deadline: 24h
  resource: tiger3
  requested_resources: 3359   # total cores requested

Supported Resources

Tiger 3 (Princeton)

The primary SO mapmaking resource. Pre-configured as TigerResource.

Property

Value

Unit

Nodes

492

Cores per node

112

Memory per node

1 000 000

MB

QoS tiers:

QoS

Max walltime

Max jobs

Max cores

test

1 h

1

8 000

vshort

5 h

2 000

55 104

short

24 h

50

8 000

medium

3 d

80

4 000

long

6 d

16

1 000

vlong

15 d

8

900

Perlmutter (NERSC)

NERSC Perlmutter system. Pre-configured as PerlmutterResource.

Property

Value

Unit

Nodes

3 072

Cores per node

128

Memory per node

1 000 000

MB

QoS tiers:

QoS

Max walltime

Max jobs

Max cores

regular

48 h

5 000

393 216

interactive

4 h

2

512

shared_interactive

4 h

2

64

debug

30 min

5

1 024

Universe

Princeton Universe cluster. Pre-configured as UniverseResource.

Property

Value

Unit

Nodes

28

Cores per node

224

Memory per node

1 000 000

MB

QoS tiers:

QoS

Max walltime

Max jobs

Max cores

main

30 d

5 000

6 272

Adding a Custom Resource

To define a new HPC system, subclass Resource and provide the QoS policies in __init__:

from socm.core import QosPolicy, Resource

class MyClusterResource(Resource):
    name: str = "mycluster"
    nodes: int = 100
    cores_per_node: int = 64
    memory_per_node: int = 512000  # MB

    def __init__(self, **data):
        super().__init__(**data)
        self.qos = [
            QosPolicy(name="short", max_walltime=1440, max_jobs=100, max_cores=6400),
            QosPolicy(name="long",  max_walltime=10080, max_jobs=20, max_cores=3200),
        ]

Register the resource name so it can be referenced from configuration files by passing the instance directly to Bookkeeper.