Resources¶
SO Campaign Manager models HPC systems as Resource objects that capture node counts,
core layout, memory, and SLURM Quality of Service (QoS) policies. The planner uses this
information to schedule workflows into the right QoS tier automatically.
Core Concepts¶
Resource¶
A Resource describes the physical characteristics of an HPC system:
- name: Identifier used to select the resource in configuration files
- nodes: Total number of compute nodes available
- cores_per_node: CPU cores per node
- memory_per_node: Memory per node in MB
- qos: List of QoS policies available on the system
QoS Policy¶
A QosPolicy maps to a SLURM QoS tier and defines its limits:
- name: QoS name as known to SLURM (e.g. short, regular)
- max_walltime: Maximum walltime in minutes (None = unlimited)
- max_jobs: Maximum number of concurrent jobs (None = unlimited)
- max_cores: Maximum total cores that can be requested at once (None = unlimited)
The planner evaluates these limits at scheduling time and selects the smallest QoS tier that satisfies each workflow’s walltime and core requirements.
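The selection rule can be sketched as follows. This is a minimal illustration, not the planner's actual code: `QosSketch`, `fits`, and `select_qos` are hypothetical names, "smallest" is assumed to mean the shortest maximum walltime, and `max_jobs` is not checked here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QosSketch:
    """Hypothetical stand-in for QosPolicy (not the socm class)."""
    name: str
    max_walltime: Optional[int]  # minutes, None = unlimited
    max_jobs: Optional[int]      # None = unlimited
    max_cores: Optional[int]     # None = unlimited


def fits(limit: Optional[int], needed: int) -> bool:
    """A limit of None means unlimited, so anything fits."""
    return limit is None or needed <= limit


def select_qos(policies, walltime_min: int, cores: int):
    """Return the tier with the shortest max walltime that fits, or None."""
    candidates = [
        p for p in policies
        if fits(p.max_walltime, walltime_min) and fits(p.max_cores, cores)
    ]
    if not candidates:
        return None
    # Assumption: unlimited tiers sort last, finite tiers by max walltime.
    return min(candidates, key=lambda p: (p.max_walltime is None, p.max_walltime or 0))


# Illustrative limits only, chosen for this sketch.
POLICIES = [
    QosSketch("short", max_walltime=1440, max_jobs=100, max_cores=6400),
    QosSketch("long", max_walltime=10080, max_jobs=20, max_cores=3200),
]

print(select_qos(POLICIES, 600, 1000).name)   # short
print(select_qos(POLICIES, 3000, 1000).name)  # long
print(select_qos(POLICIES, 3000, 5000))       # None
```

In the last call no tier fits: the job is too long for the short tier and needs more cores than the long tier allows.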
Specifying a Resource in Configuration¶
TOML campaigns¶
Set the target resource and the number of nodes to request in the [campaign] and
[campaign.resources] sections:
```toml
[campaign]
deadline = "2d"
resource = "tiger3"

[campaign.resources]
nodes = 4
cores-per-node = 112
```
DAG YAML campaigns¶
```yaml
campaign:
  deadline: 24h
  resource: tiger3
  requested_resources: 3359  # total cores requested
```
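The two formats express the request in different units: the TOML form gives nodes and cores per node, while the DAG YAML form gives a total core count. The arithmetic relating them can be sketched as below. The helper names are hypothetical, and rounding a core count up to whole nodes is an assumption about how such a request would map onto nodes, not documented planner behaviour.

```python
def total_cores(nodes: int, cores_per_node: int) -> int:
    """Total cores implied by a nodes / cores-per-node request."""
    return nodes * cores_per_node


def nodes_needed(requested_cores: int, cores_per_node: int) -> int:
    """Whole nodes needed to cover a core count (integer ceiling division)."""
    return (requested_cores + cores_per_node - 1) // cores_per_node


# Values from the TOML example: 4 nodes of 112 cores each.
print(total_cores(4, 112))      # 448

# Value from the DAG YAML example, on 112-core nodes.
print(nodes_needed(3359, 112))  # 30
```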
Supported Resources¶
Tiger 3 (Princeton)¶
The primary SO mapmaking resource. Pre-configured as TigerResource.
| Property | Value | Unit |
|---|---|---|
| Nodes | 492 | |
| Cores per node | 112 | |
| Memory per node | 1 000 000 | MB |
QoS tiers:
| QoS | Max walltime | Max jobs | Max cores |
|---|---|---|---|
| | 1 h | 1 | 8 000 |
| | 5 h | 2 000 | 55 104 |
| | 24 h | 50 | 8 000 |
| | 3 d | 80 | 4 000 |
| | 6 d | 16 | 1 000 |
| | 15 d | 8 | 900 |
Perlmutter (NERSC)¶
NERSC Perlmutter system. Pre-configured as PerlmutterResource.
| Property | Value | Unit |
|---|---|---|
| Nodes | 3 072 | |
| Cores per node | 128 | |
| Memory per node | 1 000 000 | MB |
QoS tiers:
| QoS | Max walltime | Max jobs | Max cores |
|---|---|---|---|
| | 48 h | 5 000 | 393 216 |
| | 4 h | 2 | 512 |
| | 4 h | 2 | 64 |
| | 30 min | 5 | 1 024 |
Universe¶
Princeton Universe cluster. Pre-configured as UniverseResource.
| Property | Value | Unit |
|---|---|---|
| Nodes | 28 | |
| Cores per node | 224 | |
| Memory per node | 1 000 000 | MB |
QoS tiers:
| QoS | Max walltime | Max jobs | Max cores |
|---|---|---|---|
| | 30 d | 5 000 | 6 272 |
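For each system above, the largest Max cores value in the QoS table equals nodes × cores per node, so that tier can span the whole machine. The short check below reproduces this arithmetic from the tabulated values. The `SYSTEMS` dictionary is a hand-copied illustration, not something read from the package.

```python
# Values copied by hand from the tables above.
SYSTEMS = {
    "Tiger 3": {"nodes": 492, "cores_per_node": 112, "largest_max_cores": 55_104},
    "Perlmutter": {"nodes": 3_072, "cores_per_node": 128, "largest_max_cores": 393_216},
    "Universe": {"nodes": 28, "cores_per_node": 224, "largest_max_cores": 6_272},
}

for system, spec in SYSTEMS.items():
    total = spec["nodes"] * spec["cores_per_node"]
    matches = total == spec["largest_max_cores"]
    print(f"{system}: {total} total cores, matches largest QoS limit: {matches}")
```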
Adding a Custom Resource¶
To define a new HPC system, subclass Resource and provide the QoS policies in
__init__:
```python
from socm.core import QosPolicy, Resource


class MyClusterResource(Resource):
    # Physical characteristics of the system
    name: str = "mycluster"
    nodes: int = 100
    cores_per_node: int = 64
    memory_per_node: int = 512000  # MB

    def __init__(self, **data):
        super().__init__(**data)
        # QoS tiers; max_walltime is in minutes (1440 = 24 h, 10080 = 7 d)
        self.qos = [
            QosPolicy(name="short", max_walltime=1440, max_jobs=100, max_cores=6400),
            QosPolicy(name="long", max_walltime=10080, max_jobs=20, max_cores=3200),
        ]
```
To make the resource name usable in configuration files, register it by passing an
instance directly to Bookkeeper.
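Because max_walltime is given in minutes while the tables under Supported Resources list walltimes in minutes, hours, or days, a small conversion helper can reduce mistakes when writing QoS policies for a new system. The sketch below is a hypothetical convenience function, not part of the package, and it assumes the "value unit" format used in those tables.

```python
# Minutes per unit, for the units that appear in the QoS tables.
UNIT_MINUTES = {"min": 1, "h": 60, "d": 24 * 60}


def walltime_to_minutes(text: str) -> int:
    """Convert a string such as '24 h' or '3 d' to whole minutes."""
    value, unit = text.split()
    return int(value) * UNIT_MINUTES[unit]


print(walltime_to_minutes("24 h"))    # 1440
print(walltime_to_minutes("7 d"))     # 10080
print(walltime_to_minutes("30 min"))  # 30
```

The first two results match the max_walltime values used for the short and long tiers in the custom resource example.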