Tutorial

This tutorial will guide you through using SO Campaign Manager step-by-step, from basic usage to advanced configurations.

Tutorial Overview

In this tutorial, you will learn:

  1. How to set up your first campaign

  2. How to configure workflows and resources

  3. How to run and monitor campaigns

  4. How to work with null tests

  5. How to optimize resource usage

  6. How to troubleshoot common issues

Prerequisites

Before starting, ensure you have:

  • SO Campaign Manager installed (see Installation)

  • Access to an HPC system (e.g., Tiger 3)

  • Basic understanding of SLURM job scheduling

  • Sample data for SO mapmaking (context files, area definitions, etc.)

Tutorial 1: Your First Campaign

Let’s create a simple campaign with a single ML mapmaking workflow.

Step 1: Understand the Campaign Structure

A campaign consists of:

  • Campaign configuration - Deadline and global settings

  • Workflows - Analysis tasks to execute

  • Resources - HPC resource requirements per workflow

  • Environment - Environment variables for execution

Step 2: Create a Basic Configuration File

Create a file named my_first_campaign.toml:

# Campaign-level configuration
[campaign]
deadline = "4h"  # Complete within 4 hours
resource = "tiger3"  # Target HPC resource
execution_schema = "remote"  # Execute on remote HPC

# ML Mapmaking workflow configuration
[campaign.ml-mapmaking]
# Input files (use absolute paths)
context = "file:///path/to/your/context.yaml"
area = "file:///path/to/your/area.fits"
output_dir = "/path/to/output"

# Analysis parameters
bands = "f090"          # Frequency band to process
maxiter = "100"         # Maximum iterations
query = "obs_id='your_observation_id'"  # Data selection query
tiled = 0               # Don't use tiled processing
site = "so_lat"         # Observatory site

# Resource requirements for this workflow
[campaign.ml-mapmaking.resources]
ranks = 1               # Number of MPI ranks
threads = 32            # Number of OpenMP threads
memory = "80000"        # Memory in MB (80 GB)
runtime = "2h"          # Expected runtime

# Environment variables
[campaign.ml-mapmaking.environment]
SOTODLIB_SITECONFIG = "/path/to/site.yaml"

Note

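Important: All file paths in the configuration must be absolute. Input files such as context and area are given as file:/// URIs, while output_dir and environment variable values are plain absolute paths, as in the example above.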
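
A quick way to catch a malformed input path before submitting is to check it with the Python standard library. The sketch below is only an illustration: is_absolute_file_uri is a hypothetical helper and not part of SO Campaign Manager, and it assumes that a valid input is a file:/// URI with an empty host and an absolute path.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_absolute_file_uri(value: str) -> bool:
    """Return True for URIs of the form file:///absolute/path (hypothetical helper)."""
    parsed = urlparse(value)
    return (
        parsed.scheme == "file"
        and parsed.netloc == ""  # "file://relative/x" would put "relative" here
        and PurePosixPath(parsed.path).is_absolute()
    )


print(is_absolute_file_uri("file:///path/to/your/context.yaml"))  # True
print(is_absolute_file_uri("/path/to/your/context.yaml"))         # False
```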

Step 3: Validate Your Configuration

Before running, let’s understand what each parameter means:

Campaign Parameters:

  • deadline: Maximum time for the entire campaign to complete

    • Format: "2h" (2 hours), "30m" (30 minutes), "3d" (3 days)

    • The planner will optimize workflow scheduling to meet this deadline

  • resource: Target HPC system (currently only "tiger3" is supported)

  • execution_schema: How to execute ("remote" for HPC, "local" for testing)
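
To make the deadline format concrete, here is a minimal sketch of how strings such as "2h", "30m", and "3d" map to minutes. The function parse_duration_minutes is a hypothetical helper written for this tutorial; the parser inside SO Campaign Manager may accept other forms or behave differently.

```python
import re

# Minutes per unit for the three suffixes shown in the tutorial.
_UNIT_MINUTES = {"m": 1, "h": 60, "d": 24 * 60}


def parse_duration_minutes(text: str) -> int:
    """Convert a duration like "4h" into minutes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*([mhd])\s*", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_MINUTES[unit]


print(parse_duration_minutes("4h"))  # 240
```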

Workflow Parameters:

  • context: Path to SOTODLIB context YAML file defining data structure

  • area: Path to FITS file defining the sky area to map

  • output_dir: Directory where output maps will be written

  • bands: Frequency band(s) to process ("f090", "f150", or "f090,f150")

  • maxiter: Maximum iterations for convergence (typically 100-200)

  • query: SQL-like query to select observations from context

  • tiled: Whether to use tiled processing (0 = no, 1 = yes)

  • site: Observatory site identifier ("so_lat" or "so_sat")

Resource Parameters:

  • ranks: Number of MPI processes

  • threads: Number of OpenMP threads per process

  • memory: Total memory requirement in MB

  • runtime: Expected execution time (used for QoS selection)
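
The resource parameters combine into a rough cost estimate. The sketch below assumes that every OpenMP thread occupies one core for the whole runtime, which is a simplification; core_hours is a hypothetical helper and not part of SO Campaign Manager.

```python
def core_hours(ranks: int, threads: int, runtime_hours: float) -> float:
    """Estimate core-hours, assuming one core per thread for the full runtime."""
    return ranks * threads * runtime_hours


# Resources from my_first_campaign.toml: 1 rank, 32 threads, 2 hours.
print(core_hours(ranks=1, threads=32, runtime_hours=2))  # 64
```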

Step 4: Run Your Campaign

Execute the campaign using the command-line interface:

socm -t my_first_campaign.toml

You should see output similar to:

[INFO] Loading campaign configuration from my_first_campaign.toml
[INFO] Validating configuration...
[INFO] Creating 1 workflow(s)
[INFO] Planning campaign with deadline: 240 minutes
[INFO] Selected QoS: medium (max walltime: 4320 minutes)
[INFO] Submitting workflows to SLURM...
[INFO] Campaign execution started
[INFO] Monitoring workflow progress...

Step 5: Monitor Execution

The campaign manager will:

  1. Parse your configuration

  2. Create workflow instances

  3. Plan the execution schedule

  4. Submit jobs to SLURM

  5. Monitor execution and report progress

You can monitor SLURM jobs directly:

# Check your SLURM queue
squeue -u $USER

# Monitor specific job
scontrol show job <job_id>

Step 6: Check Output

Once complete, check your output directory:

ls -lh /path/to/output/

You should find:

  • Output maps (FITS files)

  • Log files

  • Metadata about the run

Congratulations! You’ve run your first campaign.

Tutorial 2: Multiple Workflows in Parallel

Now let’s create a campaign with multiple workflows that run in parallel.

Step 1: Create Multi-Workflow Configuration

Create multi_workflow_campaign.toml:

[campaign]
deadline = "8h"
resource = "tiger3"
execution_schema = "remote"
requested_resources = 1000  # Total core-hours budget

# First workflow: f090 band
[campaign.ml-mapmaking]
name = "mapmaking_f090"
context = "file:///path/to/context.yaml"
area = "file:///path/to/area.fits"
output_dir = "/path/to/output/f090"
bands = "f090"
maxiter = "100"
query = "obs_id='observation_1'"
site = "so_lat"

[campaign.ml-mapmaking.resources]
ranks = 32
threads = 8
memory = "120000"
runtime = "3h"

TOML does not allow two tables with the same name, so a second [campaign.ml-mapmaking] table cannot be added to this file; you would need arrays or separate workflow types instead. A better approach is to use subcampaigns, which start from a common configuration:

[campaign]
deadline = "8h"
resource = "tiger3"
execution_schema = "remote"

# Common configuration for all mapmaking workflows
[campaign.ml-mapmaking]
context = "file:///path/to/context.yaml"
area = "file:///path/to/area.fits"
output_dir = "/path/to/output"
maxiter = "100"
site = "so_lat"

[campaign.ml-mapmaking.resources]
ranks = 32
threads = 8
memory = "120000"
runtime = "3h"

# You can then modify the query or bands in the code
# or use different workflow types

Step 2: Understanding Parallel Execution

The campaign manager automatically:

  • Detects workflows that can run in parallel

  • Schedules them based on resource availability

  • Optimizes to meet the deadline

  • Manages dependencies between workflows

Tutorial 3: Null Test Campaigns

Null tests are crucial for validating mapmaking results. Let’s create a comprehensive null test campaign.

Step 1: Understanding Null Tests

Null tests validate your mapmaking by creating maps from data splits:

  • Mission Tests: Split by time

  • Wafer Tests: Split by detector

  • Direction Tests: Split by scan direction

  • PWV Tests: Split by precipitable water vapor

  • Day/Night Tests: Split by time of day

  • Elevation Tests: Split by telescope elevation

Step 2: Configure Null Tests

Create null_test_campaign.toml:

[campaign]
deadline = "12h"
resource = "tiger3"
execution_schema = "remote"

# Common configuration for all null tests
[campaign.ml-null-tests]
context = "file:///path/to/context.yaml"
area = "file:///path/to/area.fits"
output_dir = "/path/to/output/null_tests"
bands = "f090"
maxiter = "100,100"  # Two iteration stages
downsample = "4,2"   # Downsample factors for each stage
query = "file:///path/to/query.txt"
tiled = 1
site = "so_lat"

# Mission tests: time-based splits
[campaign.ml-null-tests.mission-tests]
chunk_nobs = 10      # Chunk size in days
nsplits = 4          # Number of splits (must be power of 2)

[campaign.ml-null-tests.mission-tests.resources]
ranks = 35
threads = 8
memory = "2400000"   # 2.4 TB
runtime = "4h"

# Wafer tests: detector-based splits
[campaign.ml-null-tests.wafer-tests]
chunk_nobs = 10
nsplits = 4

[campaign.ml-null-tests.wafer-tests.resources]
ranks = 12
threads = 8
memory = "80000"
runtime = "4h"

# Direction tests: scan direction splits
[campaign.ml-null-tests.direction-tests]
chunk_nobs = 10
nsplits = 4

[campaign.ml-null-tests.direction-tests.resources]
ranks = 17
threads = 8
memory = "80000"
runtime = "4h"
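
The comment on nsplits says the value must be a power of 2. A small check like the following can catch a bad value before submission; is_power_of_two is a hypothetical helper, and the campaign manager may perform its own validation.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is 1, 2, 4, 8, ... (a positive integer with a single set bit)."""
    return n > 0 and (n & (n - 1)) == 0


print(is_power_of_two(4))  # True
print(is_power_of_two(6))  # False
```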

Step 3: Understanding Subcampaign Inheritance

Notice how the null test workflows inherit common configuration:

  • context, area, bands, etc. are defined once in [campaign.ml-null-tests]

  • Each specific test (mission-tests, wafer-tests, etc.) inherits these

  • Specific tests only need to define their unique parameters

This follows the DRY principle (Don’t Repeat Yourself).
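
The inheritance idea can be pictured with plain dictionaries. This is only an illustration of the concept, not the manager's actual implementation: expand_subcampaigns is a hypothetical function, and it makes the simplifying assumption that every nested table under the parent is a specific test.

```python
def expand_subcampaigns(parent: dict) -> dict:
    """Copy the parent's scalar settings into each nested table (illustrative only)."""
    common = {key: value for key, value in parent.items() if not isinstance(value, dict)}
    return {
        name: {**common, **section}  # values in the specific test win over common ones
        for name, section in parent.items()
        if isinstance(section, dict)
    }


# A trimmed-down mirror of [campaign.ml-null-tests] from null_test_campaign.toml.
null_tests = {
    "context": "file:///path/to/context.yaml",
    "bands": "f090",
    "mission-tests": {"chunk_nobs": 10, "nsplits": 4, "resources": {"ranks": 35}},
    "wafer-tests": {"chunk_nobs": 10, "nsplits": 4, "resources": {"ranks": 12}},
}

expanded = expand_subcampaigns(null_tests)
print(expanded["mission-tests"]["bands"])  # f090
```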

Step 4: Run Null Test Campaign

socm -t null_test_campaign.toml

The campaign manager will:

  1. Create 3 null test workflows (mission, wafer, direction)

  2. Schedule them for parallel execution

  3. Monitor all workflows

  4. Report completion status

Tutorial 4: Resource Optimization

Learn how to optimize resource usage for cost-effective campaigns.

Step 1: Understanding Resource Parameters

The key resource parameters affect both performance and cost:

Ranks (MPI Processes):

  • More ranks → faster for I/O-heavy tasks

  • Diminishing returns beyond data parallelism limit

  • Rule of thumb: 1 rank per ~2-4 GB of data

Threads (OpenMP):

  • More threads → faster for compute-heavy tasks

  • Limited by memory bandwidth

  • Rule of thumb: 4-16 threads per rank

Memory:

  • Must accommodate: data + working set + overhead

  • Rule of thumb: 2x data size for mapmaking

  • Monitor actual usage and adjust

Runtime:

  • Affects QoS selection (queue priority)

  • Overestimate moderately to avoid timeouts

  • Underestimating risks a timeout; overestimating too much wastes resources and can select a longer QoS tier than needed
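
The rules of thumb above can be turned into a first guess for a resources section. The sketch below is a starting point only: estimate_resources is a hypothetical helper, it picks 3 GB per rank from the 2-4 GB range, uses 8 threads from the 4-16 range, and converts GB to MB with a factor of 1000 as the tutorial does (80000 MB for 80 GB). Measure a real run and adjust.

```python
import math


def estimate_resources(data_gb: float, gb_per_rank: float = 3.0, threads: int = 8) -> dict:
    """First-guess resources from the data size, using the rules of thumb above."""
    ranks = max(1, math.ceil(data_gb / gb_per_rank))  # about 1 rank per 2-4 GB
    memory_mb = math.ceil(2 * data_gb * 1000)         # about 2x the data size
    return {"ranks": ranks, "threads": threads, "memory": str(memory_mb)}


print(estimate_resources(60))
# {'ranks': 20, 'threads': 8, 'memory': '120000'}
```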

Step 2: QoS Tiers on Tiger

Tiger has several QoS tiers with different limits:

QoS Tier    Max Walltime    Max Jobs    Best For
test        60 minutes      Limited     Quick tests, debugging
vshort      5 hours         Many        Small workflows
short       24 hours        Many        Standard workflows
medium      3 days          Moderate    Large workflows
long        6 days          Few         Very large workflows
vlong       15 days         Very few    Extremely large workflows

The campaign manager automatically selects the appropriate QoS based on your runtime estimate.
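
One plausible selection rule is to pick the shortest tier whose walltime limit still covers the runtime estimate. The sketch below encodes the walltime limits listed for Tiger in minutes; select_qos is a hypothetical function, and the manager's real selection logic may take other factors into account.

```python
# (tier name, maximum walltime in minutes), ordered from shortest to longest.
QOS_TIERS = [
    ("test", 60),
    ("vshort", 5 * 60),
    ("short", 24 * 60),
    ("medium", 3 * 24 * 60),
    ("long", 6 * 24 * 60),
    ("vlong", 15 * 24 * 60),
]


def select_qos(runtime_minutes: int) -> str:
    """Return the first tier whose walltime limit covers the runtime (assumed rule)."""
    for name, max_minutes in QOS_TIERS:
        if runtime_minutes <= max_minutes:
            return name
    raise ValueError(f"runtime of {runtime_minutes} minutes exceeds every QoS tier")


print(select_qos(120))  # vshort
```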

Step 3: Right-Sizing Example

Let’s optimize a workflow:

Initial (over-allocated) configuration:

[campaign.ml-mapmaking.resources]
ranks = 100           # Too many?
threads = 32          # Too many?
memory = "500000"     # 500 GB - too much?
runtime = "10h"       # Too long?

Steps to optimize:

  1. Start with a small test:

    ranks = 10
    threads = 8
    memory = "100000"
    runtime = "1h"
    
  2. Run and monitor:

    # During execution, monitor memory usage
    ssh tiger-node-xx  # SSH to compute node
    top -u $USER
    
  3. Check logs for actual usage:

    • Memory high-water mark

    • Actual walltime

    • CPU utilization

  4. Adjust based on findings:

    • If memory maxed out → increase memory

    • If completed in 30min with 1h limit → reduce runtime

    • If CPU idle → reduce ranks or threads

    • If walltime nearly exceeded → increase runtime

Optimized configuration:

[campaign.ml-mapmaking.resources]
ranks = 35            # Sufficient for data size
threads = 8           # Good balance
memory = "2400000"    # 20% overhead over observed
runtime = "4h"        # 50% buffer over observed
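
The buffers in the comments (20% on memory, 50% on runtime) are simple arithmetic on what a test run reports. In the sketch below, with_margin is a hypothetical helper, and the observed values (2,000,000 MB and 160 minutes) are made-up inputs chosen for illustration, not measurements from a real run.

```python
def with_margin(observed: int, margin_percent: int) -> int:
    """Add a percentage margin to an observed value, rounding up to an integer."""
    return (observed * (100 + margin_percent) + 99) // 100


observed_memory_mb = 2_000_000   # illustrative high-water mark
observed_runtime_min = 160       # illustrative walltime

print(with_margin(observed_memory_mb, 20))    # 2400000
print(with_margin(observed_runtime_min, 50))  # 240
```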

Step 4: Scaling Rules

Weak Scaling (more data, same time):

  • Double data size → double ranks

  • Keep threads constant

  • Double memory

Strong Scaling (same data, less time):

  • Limited by Amdahl’s Law

  • Doubling ranks doesn’t halve time

  • Test to find optimal parallelism
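
Amdahl's Law makes the strong-scaling limit concrete: if a fraction p of the work can be parallelized, the speedup on n workers is 1 / ((1 - p) + p / n). The sketch below evaluates this formula; the value p = 0.9 is an assumed example, not a measured property of any mapmaking workflow.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


# With 90% parallel work, the speedup approaches but never reaches 10x.
for ranks in (1, 2, 4, 8, 16, 32):
    print(ranks, round(amdahl_speedup(0.9, ranks), 2))
```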

Tutorial 5: Advanced Configuration Patterns

Step 1: Using Environment Variables

Many workflows require specific environment variables:

[campaign.ml-mapmaking.environment]
# SOTODLIB configuration
SOTODLIB_SITECONFIG = "/path/to/site.yaml"

# Temporary storage
TMPDIR = "/scratch/network/$USER/tmp"

# Performance tuning
OMP_NUM_THREADS = "8"
OMP_PROC_BIND = "true"
OMP_PLACES = "cores"

# Debugging
SOTODLIB_DEBUG = "1"

Step 2: Working with Query Files

For complex observation selections, use query files:

query.txt:

obs_id IN (
    'obs_1234567890.1234567900.ar5_1',
    'obs_1234567901.1234567911.ar5_1',
    'obs_1234567912.1234567922.ar5_1'
)

Configuration:

[campaign.ml-mapmaking]
query = "file:///path/to/query.txt"

Step 3: Multi-Stage Processing

Use multi-stage parameters for progressive refinement:

[campaign.ml-mapmaking]
maxiter = "200,200"      # 200 iterations in each of 2 stages
downsample = "4,2"       # Downsample by 4x, then 2x

# Stage 1: Coarse (4x downsampled), 200 iterations
# Stage 2: Fine (2x downsampled), 200 iterations

This approach offers:

  • Faster initial convergence at coarse resolution

  • Refinement at higher resolution

  • Better overall performance than single-stage processing
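
The two comma-separated strings pair up position by position, one entry per stage. The sketch below shows that pairing; parse_stages is a hypothetical helper written for illustration, and the mapmaker itself may interpret the values differently.

```python
from typing import List, Tuple


def parse_stages(maxiter: str, downsample: str) -> List[Tuple[int, int]]:
    """Pair each downsample factor with its iteration count, one tuple per stage."""
    iterations = [int(item) for item in maxiter.split(",")]
    factors = [int(item) for item in downsample.split(",")]
    if len(iterations) != len(factors):
        raise ValueError("maxiter and downsample must list the same number of stages")
    return list(zip(factors, iterations))


for stage, (factor, iters) in enumerate(parse_stages("200,200", "4,2"), start=1):
    print(f"Stage {stage}: downsample {factor}x, {iters} iterations")
```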

Tutorial 6: Monitoring and Debugging

Step 1: Understanding Log Output

The campaign manager produces several types of log messages:

[INFO] Normal informational messages
[WARNING] Potential issues (non-fatal)
[ERROR] Errors that stop execution
[DEBUG] Detailed debugging information

Step 2: Checking Workflow Status

Monitor SLURM jobs:

# List your jobs
squeue -u $USER

# Detailed job info
scontrol show job <job_id>

# Job accounting info
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS
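
The MaxRSS column is usually printed with a unit suffix (for example 2048K), which is awkward to compare with the memory setting in MB. The sketch below converts such a value to mebibytes. It assumes the suffixes K, M, G, and T with 1024-based factors; maxrss_to_mb is a hypothetical helper, and values without a recognized suffix are rejected rather than guessed.

```python
# Multipliers that convert a suffixed value to mebibytes (assumed 1024-based units).
_FACTORS_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 * 1024.0}


def maxrss_to_mb(value: str) -> float:
    """Convert a MaxRSS string such as "2048K" to mebibytes; empty means no data."""
    value = value.strip()
    if not value:
        return 0.0
    suffix = value[-1].upper()
    if suffix not in _FACTORS_MB:
        raise ValueError(f"unrecognized MaxRSS value: {value!r}")
    return float(value[:-1]) * _FACTORS_MB[suffix]


print(maxrss_to_mb("2048K"))  # 2.0
```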

Step 3: Accessing Job Logs

RADICAL-Pilot creates detailed logs:

# Find RADICAL-Pilot session directory
ls -lrt ~/radical.pilot.sandbox/

# Check pilot logs
cat ~/radical.pilot.sandbox/rp.session.*/pilot.*/pilot.log

# Check task logs
cat ~/radical.pilot.sandbox/rp.session.*/pilot.*/task.*/task.log

Step 4: Common Issues and Solutions

Issue: Configuration validation error

[ERROR] ValidationError: field required: context

Solution: Check TOML syntax and ensure all required fields are present.

Issue: Out of memory error

[ERROR] Workflow failed: MemoryError

Solution: Increase memory parameter in resources section.

Issue: Walltime exceeded

[ERROR] Job terminated: TIMEOUT

Solution: Increase runtime parameter or optimize workflow.

Issue: File not found

[ERROR] FileNotFoundError: /path/to/context.yaml

Solution:

  • Verify file paths are absolute

  • Use file:/// prefix for file URIs

  • Ensure files are accessible from compute nodes

Tutorial 7: Testing Before Production

Step 1: Dry Run Mode

Test your configuration without submitting jobs:

socm -t campaign.toml --dry-run

This will:

  • Validate configuration

  • Create workflow objects

  • Run planning

  • Show what would be executed

  • Not submit any jobs

Step 2: Small-Scale Test

Before running on the full dataset:

  1. Create a test configuration with a subset of the data

  2. Use shorter runtime limits

  3. Use the test QoS tier

  4. Verify outputs are correct

Test configuration:

[campaign]
deadline = "1h"
resource = "tiger3"

[campaign.ml-mapmaking]
# ... same configuration but with:
query = "obs_id='single_test_observation'"  # Just one obs
maxiter = "10"  # Fewer iterations

[campaign.ml-mapmaking.resources]
runtime = "30m"  # Short runtime for test QoS

Step 3: Validation Checklist

Before submitting production campaigns:

☐ Configuration validates without errors

☐ File paths are correct and accessible

☐ Environment variables are set correctly

☐ Resource estimates are reasonable

☐ Output directories exist and are writable

☐ Test run completed successfully

☐ Output files are in expected format

☐ Resource usage matches estimates
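
Some of the checklist items can be automated with the standard library. The sketch below checks that input files exist and that the output directory is writable. preflight is a hypothetical helper, it expects plain filesystem paths (strip any file:// prefix first), and it only sees the node it runs on, so run it somewhere that shares the compute nodes' filesystem.

```python
import os
from typing import Iterable, List


def preflight(input_paths: Iterable[str], output_dir: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for path in input_paths:
        if not os.path.isabs(path):
            problems.append(f"not an absolute path: {path}")
        elif not os.path.isfile(path):
            problems.append(f"missing input file: {path}")
    if not os.path.isdir(output_dir):
        problems.append(f"output directory does not exist: {output_dir}")
    elif not os.access(output_dir, os.W_OK):
        problems.append(f"output directory is not writable: {output_dir}")
    return problems


for problem in preflight(["/path/to/context.yaml", "/path/to/area.fits"], "/path/to/output"):
    print(problem)
```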

Tutorial 8: Programmatic Usage

For advanced use cases, use the Python API directly.

Step 1: Basic Python Script

from socm.bookkeeper import Bookkeeper
from socm.core import Campaign
from socm.resources import TigerResource
from socm.workflows import MLMapmakingWorkflow

# Create workflow
workflow = MLMapmakingWorkflow(
    name="test_mapmaking",
    executable="so-site-pipeline",
    subcommand="make-filterbin-map",
    context="/path/to/context.yaml",
    area="/path/to/area.fits",
    output_dir="/path/to/output",
    bands="f090",
    maxiter="100",
    query="obs_id='test'",
    site="so_lat",
    resources={
        "ranks": 32,
        "threads": 8,
        "memory": 120000,
        "runtime": "3h"
    }
)

# Create campaign
campaign = Campaign(
    id=1,
    workflows=[workflow],
    campaign_policy="time",
    deadline=240  # minutes
)

# Create resource
resource = TigerResource()

# Execute
bookkeeper = Bookkeeper(
    campaign=campaign,
    resources={"tiger3": resource},
    policy="time",
    target_resource="tiger3",
    deadline=240
)

bookkeeper.run()

Step 2: Dynamic Workflow Generation

Generate workflows programmatically:

from socm.workflows import MLMapmakingWorkflow

# List of bands to process
bands_list = ["f090", "f150", "f220"]

# Create workflow for each band
workflows = []
for band in bands_list:
    wf = MLMapmakingWorkflow(
        name=f"mapmaking_{band}",
        executable="so-site-pipeline",
        subcommand="make-filterbin-map",
        context="/path/to/context.yaml",
        area="/path/to/area.fits",
        output_dir=f"/path/to/output/{band}",
        bands=band,
        maxiter="100",
        query="obs_id='test'",
        site="so_lat",
        resources={
            "ranks": 32,
            "threads": 8,
            "memory": 120000,
            "runtime": "3h"
        }
    )
    workflows.append(wf)

# Create campaign with all workflows
campaign = Campaign(
    id=1,
    workflows=workflows,
    campaign_policy="time"
)

Next Steps

Congratulations on completing the tutorial! You’re now ready to run production campaigns with SO Campaign Manager.