Tutorial¶

This tutorial guides you through SO Campaign Manager step by step, from basic usage to advanced configurations.

Tutorial Overview¶

In this tutorial, you will learn:

How to set up your first campaign
How to configure workflows and resources
How to run and monitor campaigns
How to work with null tests
How to optimize resource usage
How to troubleshoot common issues

Prerequisites¶

Before starting, ensure you have:

SO Campaign Manager installed (see Installation)
Access to an HPC system (e.g., Tiger 3)
Basic understanding of SLURM job scheduling
Sample data for SO mapmaking (context files, area definitions, etc.)

Tutorial 1: Your First Campaign¶

Let’s create a simple campaign with a single ML mapmaking workflow.

Step 1: Understand the Campaign Structure¶

A campaign consists of:

Campaign configuration - Deadline and global settings
Workflows - Analysis tasks to execute
Resources - HPC resource requirements per workflow
Environment - Environment variables for execution

Step 2: Create a Basic Configuration File¶

Create a file named my_first_campaign.toml:

# Campaign-level configuration
[campaign]
deadline = "4h"  # Complete within 4 hours
resource = "tiger3"  # Target HPC resource
execution_schema = "remote"  # Execute on remote HPC

# ML Mapmaking workflow configuration
[campaign.ml-mapmaking]
# Input files (use absolute paths)
context = "file:///path/to/your/context.yaml"
area = "file:///path/to/your/area.fits"
output_dir = "/path/to/output"

# Analysis parameters
bands = "f090"          # Frequency band to process
maxiter = "100"         # Maximum iterations
query = "obs_id='your_observation_id'"  # Data selection query
tiled = 0               # Don't use tiled processing
site = "so_lat"         # Observatory site

# Resource requirements for this workflow
[campaign.ml-mapmaking.resources]
ranks = 1               # Number of MPI ranks
threads = 32            # Number of OpenMP threads
memory = "80000"        # Memory in MB (80 GB)
runtime = "2h"          # Expected runtime

# Environment variables
[campaign.ml-mapmaking.environment]
SOTODLIB_SITECONFIG = "/path/to/site.yaml"

Note

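Important: Input files such as context and area must be given as absolute paths in file:/// URI form. In the example above, output_dir and the environment variables are written as plain absolute paths.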
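
To avoid typing URIs by hand, the entries that use the file:/// form can be generated from ordinary absolute paths. The following is a minimal sketch using Python's standard pathlib module on a POSIX system; to_file_uri is a hypothetical helper for illustration, not part of SO Campaign Manager.

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Return a file:// URI for an absolute path."""
    p = Path(path)
    if not p.is_absolute():
        raise ValueError(f"not an absolute path: {path!r}")
    return p.as_uri()


print(to_file_uri("/path/to/your/context.yaml"))
# file:///path/to/your/context.yaml
```

Relative paths are rejected rather than resolved, so a mistake shows up before the campaign is submitted.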

Step 3: Validate Your Configuration¶

Before running, let’s understand what each parameter means:

Campaign Parameters:

deadline: Maximum time for the entire campaign to complete
- Format: "2h" (2 hours), "30m" (30 minutes), "3d" (3 days)
- The planner will optimize workflow scheduling to meet this deadline
resource: Target HPC system (currently only "tiger3" is supported)
execution_schema: How to execute ("remote" for HPC, "local" for testing)
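
The duration format above can be illustrated with a short conversion routine. This is only a sketch of how strings such as "30m", "2h", and "3d" map to minutes; duration_to_minutes is a hypothetical helper, and the campaign manager's own parser may accept other forms.

```python
import re

# Minutes per unit for the three suffixes shown in the parameter list.
_UNIT_MINUTES = {"m": 1, "h": 60, "d": 24 * 60}


def duration_to_minutes(text: str) -> int:
    """Convert strings such as '30m', '2h' or '3d' to whole minutes."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_MINUTES[unit]


print(duration_to_minutes("4h"))  # 240
```

A deadline of "4h" therefore corresponds to 240 minutes.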

Workflow Parameters:

context: Path to SOTODLIB context YAML file defining data structure
area: Path to FITS file defining the sky area to map
output_dir: Directory where output maps will be written
bands: Frequency band(s) to process ("f090", "f150", or "f090,f150")
maxiter: Maximum iterations for convergence (typically 100-200)
query: SQL-like query to select observations from context
tiled: Whether to use tiled processing (0 = no, 1 = yes)
site: Observatory site identifier ("so_lat" or "so_sat")

Resource Parameters:

ranks: Number of MPI processes
threads: Number of OpenMP threads per process
memory: Total memory requirement in MB
runtime: Expected execution time (used for QoS selection)

Step 4: Run Your Campaign¶

Execute the campaign using the command-line interface:

socm -t my_first_campaign.toml

You should see output similar to:

[INFO] Loading campaign configuration from my_first_campaign.toml
[INFO] Validating configuration...
[INFO] Creating 1 workflow(s)
[INFO] Planning campaign with deadline: 240 minutes
[INFO] Selected QoS: medium (max walltime: 4320 minutes)
[INFO] Submitting workflows to SLURM...
[INFO] Campaign execution started
[INFO] Monitoring workflow progress...

Step 5: Monitor Execution¶

The campaign manager will:

Parse your configuration
Create workflow instances
Plan the execution schedule
Submit jobs to SLURM
Monitor execution and report progress

You can monitor SLURM jobs directly:

# Check your SLURM queue
squeue -u $USER

# Monitor specific job
scontrol show job <job_id>

Step 6: Check Output¶

Once complete, check your output directory:

ls -lh /path/to/output/

You should find:

Output maps (FITS files)
Log files
Metadata about the run

Congratulations! You’ve run your first campaign.

Tutorial 2: Multiple Workflows in Parallel¶

Now let’s create a campaign with multiple workflows that run in parallel.

Step 1: Create Multi-Workflow Configuration¶

Create multi_workflow_campaign.toml:

[campaign]
deadline = "8h"
resource = "tiger3"
execution_schema = "remote"
requested_resources = 1000  # Total core-hours budget

# First workflow: f090 band
[campaign.ml-mapmaking]
name = "mapmaking_f090"
context = "file:///path/to/context.yaml"
area = "file:///path/to/area.fits"
output_dir = "/path/to/output/f090"
bands = "f090"
maxiter = "100"
query = "obs_id='observation_1'"
site = "so_lat"

[campaign.ml-mapmaking.resources]
ranks = 32
threads = 8
memory = "120000"
runtime = "3h"

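TOML does not allow the same table name to be defined twice, so a second [campaign.ml-mapmaking] section cannot be added for another band; you would need to use arrays or separate workflow types instead. A better approach is to define the common configuration once, in the style of a subcampaign: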

[campaign]
deadline = "8h"
resource = "tiger3"
execution_schema = "remote"

# Common configuration for all mapmaking workflows
[campaign.ml-mapmaking]
context = "file:///path/to/context.yaml"
area = "file:///path/to/area.fits"
output_dir = "/path/to/output"
maxiter = "100"
site = "so_lat"

[campaign.ml-mapmaking.resources]
ranks = 32
threads = 8
memory = "120000"
runtime = "3h"

# You can then modify the query or bands in the code
# or use different workflow types

Step 2: Understanding Parallel Execution¶

The campaign manager automatically:

Detects workflows that can run in parallel
Schedules them based on resource availability
Optimizes to meet the deadline
Manages dependencies between workflows

Tutorial 3: Null Test Campaigns¶

Null tests are crucial for validating mapmaking results. Let’s create a comprehensive null test campaign.

Step 1: Understanding Null Tests¶

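Null tests validate your mapmaking by creating maps from complementary splits of the data. If the maps are free of systematic effects, the differences between splits should be consistent with noise. The available splits are: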

Mission Tests: Split by time
Wafer Tests: Split by detector
Direction Tests: Split by scan direction
PWV Tests: Split by precipitable water vapor
Day/Night Tests: Split by time of day
Elevation Tests: Split by telescope elevation

Step 2: Configure Null Tests¶

Create null_test_campaign.toml:

[campaign]
deadline = "12h"
resource = "tiger3"
execution_schema = "remote"

# Common configuration for all null tests
[campaign.ml-null-tests]
context = "file:///path/to/context.yaml"
area = "file:///path/to/area.fits"
output_dir = "/path/to/output/null_tests"
bands = "f090"
maxiter = "100,100"  # Two iteration stages
downsample = "4,2"   # Downsample factors for each stage
query = "file:///path/to/query.txt"
tiled = 1
site = "so_lat"

# Mission tests: time-based splits
[campaign.ml-null-tests.mission-tests]
chunk_nobs = 10      # Chunk size in days
nsplits = 4          # Number of splits (must be power of 2)

[campaign.ml-null-tests.mission-tests.resources]
ranks = 35
threads = 8
memory = "2400000"   # 2.4 TB
runtime = "4h"

# Wafer tests: detector-based splits
[campaign.ml-null-tests.wafer-tests]
chunk_nobs = 10
nsplits = 4

[campaign.ml-null-tests.wafer-tests.resources]
ranks = 12
threads = 8
memory = "80000"
runtime = "4h"

# Direction tests: scan direction splits
[campaign.ml-null-tests.direction-tests]
chunk_nobs = 10
nsplits = 4

[campaign.ml-null-tests.direction-tests.resources]
ranks = 17
threads = 8
memory = "80000"
runtime = "4h"
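
The comment on nsplits states that the value must be a power of 2. A quick way to check a candidate value is the bit trick sketched below. is_power_of_two is a hypothetical helper, and whether the campaign manager accepts nsplits = 1 (which is also a power of two) is not stated here.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... and False for everything else."""
    # A power of two has exactly one bit set, so n & (n - 1) clears it to 0.
    return n > 0 and (n & (n - 1)) == 0


for nsplits in (2, 4, 6, 8):
    print(nsplits, is_power_of_two(nsplits))
```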

Step 3: Understanding Subcampaign Inheritance¶

Notice how the null test workflows inherit common configuration:

context, area, bands, etc. are defined once in [campaign.ml-null-tests]
Each specific test (mission-tests, wafer-tests, etc.) inherits these
Specific tests only need to define their unique parameters

This follows the DRY principle (Don’t Repeat Yourself).
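
The inheritance described above can be pictured as a dictionary merge. The sketch below is an assumption about the general idea (shared scalar settings copied into each nested test, with test-specific values taking precedence), not the campaign manager's actual implementation. resolve_subcampaigns and the f150 override are hypothetical.

```python
def resolve_subcampaigns(parent: dict) -> dict:
    """Merge shared scalar settings into each nested sub-table."""
    shared = {k: v for k, v in parent.items() if not isinstance(v, dict)}
    resolved = {}
    for name, table in parent.items():
        if isinstance(table, dict):
            # Values set in the specific test win over the shared ones.
            resolved[name] = {**shared, **table}
    return resolved


null_tests = {
    "context": "file:///path/to/context.yaml",
    "bands": "f090",
    "mission-tests": {"chunk_nobs": 10, "nsplits": 4},
    # Hypothetical override, added only to show precedence.
    "wafer-tests": {"chunk_nobs": 10, "nsplits": 4, "bands": "f150"},
}

resolved = resolve_subcampaigns(null_tests)
print(resolved["mission-tests"]["bands"])  # f090
print(resolved["wafer-tests"]["bands"])    # f150
```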

Step 4: Run Null Test Campaign¶

socm -t null_test_campaign.toml

The campaign manager will:

Create 3 null test workflows (mission, wafer, direction)
Schedule them for parallel execution
Monitor all workflows
Report completion status

Tutorial 4: Resource Optimization¶

Learn how to optimize resource usage for cost-effective campaigns.

Step 1: Understanding Resource Parameters¶

The key resource parameters affect both performance and cost:

Ranks (MPI Processes):

More ranks → faster for I/O-heavy tasks
Diminishing returns beyond data parallelism limit
Rule of thumb: 1 rank per ~2-4 GB of data

Threads (OpenMP):

More threads → faster for compute-heavy tasks
Limited by memory bandwidth
Rule of thumb: 4-16 threads per rank

Memory:

Must accommodate: data + working set + overhead
Rule of thumb: 2x data size for mapmaking
Monitor actual usage and adjust

Runtime:

Affects QoS selection (queue priority)
Overestimate slightly to avoid timeouts
An overly conservative (too long) estimate wastes resources and may place the job in a longer QoS tier

Step 2: QoS Tiers on Tiger¶

Tiger has several QoS tiers with different limits:

QoS Tier	Max Walltime	Max Jobs	Best For
test	60 minutes	Limited	Quick tests, debugging
vshort	5 hours	Many	Small workflows
short	24 hours	Many	Standard workflows
medium	3 days	Moderate	Large workflows
long	6 days	Few	Very large workflows
vlong	15 days	Very few	Extremely large workflows

The campaign manager automatically selects the appropriate QoS based on your runtime estimate.
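
One plausible reading of this selection is "pick the shortest tier whose walltime limit covers the runtime estimate". The sketch below implements that rule with the limits from the table above. It is an assumption for illustration, and the actual planner may weigh other factors such as the campaign deadline or job limits.

```python
# (tier, maximum walltime in minutes), ordered from shortest to longest,
# copied from the QoS table.
QOS_TIERS = [
    ("test", 60),
    ("vshort", 5 * 60),
    ("short", 24 * 60),
    ("medium", 3 * 24 * 60),
    ("long", 6 * 24 * 60),
    ("vlong", 15 * 24 * 60),
]


def pick_qos(runtime_minutes: int) -> str:
    """Return the first tier whose walltime limit covers the runtime."""
    for name, max_minutes in QOS_TIERS:
        if runtime_minutes <= max_minutes:
            return name
    raise ValueError(f"no QoS tier allows {runtime_minutes} minutes")


print(pick_qos(30))   # test
print(pick_qos(240))  # vshort
print(pick_qos(600))  # short
```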

Step 3: Right-Sizing Example¶

Let’s optimize a workflow:

Initial (over-allocated) configuration:

[campaign.ml-mapmaking.resources]
ranks = 100           # Too many?
threads = 32          # Too many?
memory = "500000"     # 500 GB - too much?
runtime = "10h"       # Too long?

Steps to optimize:

Start with a small test:

ranks = 10
threads = 8
memory = "100000"
runtime = "1h"

Run and monitor:

# During execution, monitor memory usage
ssh tiger-node-xx  # SSH to compute node
top -u $USER

Check logs for actual usage:
- Memory high-water mark
- Actual walltime
- CPU utilization

Adjust based on findings:
- If memory maxed out → increase memory
- If completed in 30min with 1h limit → reduce runtime
- If CPU idle → reduce ranks or threads
- If walltime nearly exceeded → increase runtime

Optimized configuration:

[campaign.ml-mapmaking.resources]
ranks = 35            # Sufficient for data size
threads = 8           # Good balance
memory = "2400000"    # 2.4 TB: 20% overhead over observed (right-sizing can also raise a value above the first guess)
runtime = "4h"        # 50% buffer over observed
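
The buffers mentioned in the comments (20% over observed memory, 50% over observed walltime) are simple to compute from a test run. The sketch below uses integer arithmetic to round up. with_headroom and the observed values are hypothetical and chosen only to show the calculation.

```python
def with_headroom(observed: int, percent: int) -> int:
    """Increase an observed value by `percent`, rounding up."""
    return (observed * (100 + percent) + 99) // 100


observed_peak_mb = 100_000  # hypothetical memory high-water mark
observed_minutes = 40       # hypothetical walltime of the test run

print(with_headroom(observed_peak_mb, 20))  # 120000
print(with_headroom(observed_minutes, 50))  # 60
```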

Step 4: Scaling Rules¶

Weak Scaling (more data, same time):

Double data size → double ranks
Keep threads constant
Double memory

Strong Scaling (same data, less time):

Limited by Amdahl’s Law
Doubling ranks doesn’t halve time
Test to find optimal parallelism
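
Amdahl's Law makes the strong-scaling limit concrete: if a fraction p of the runtime parallelizes across n workers, the speedup is 1 / ((1 - p) + p / n). The sketch below evaluates it for a hypothetical workload in which 90% of the runtime is parallel. The fraction is illustrative, not a measurement of any mapmaking workflow.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law: 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical workload in which 90% of the runtime parallelizes.
for ranks in (8, 16, 32, 64):
    print(ranks, round(amdahl_speedup(0.9, ranks), 2))
```

With these numbers, doubling from 32 to 64 ranks raises the predicted speedup only from about 7.8 to about 8.8, and no number of ranks can exceed a factor of 10.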

Tutorial 5: Advanced Configuration Patterns¶

Step 1: Using Environment Variables¶

Many workflows require specific environment variables:

[campaign.ml-mapmaking.environment]
# SOTODLIB configuration
SOTODLIB_SITECONFIG = "/path/to/site.yaml"

# Temporary storage
TMPDIR = "/scratch/network/$USER/tmp"

# Performance tuning
OMP_NUM_THREADS = "8"
OMP_PROC_BIND = "true"
OMP_PLACES = "cores"

# Debugging
SOTODLIB_DEBUG = "1"

Step 2: Working with Query Files¶

For complex observation selections, use query files:

query.txt:

obs_id IN (
    'obs_1234567890.1234567900.ar5_1',
    'obs_1234567901.1234567911.ar5_1',
    'obs_1234567912.1234567922.ar5_1'
)

Configuration:

[campaign.ml-mapmaking]
query = "file:///path/to/query.txt"
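
When the list of observations comes from another tool, the query text can be generated rather than typed. The sketch below formats a list of IDs in the same layout as the query.txt example. build_obs_query is a hypothetical helper, and the IDs are the placeholder values from above.

```python
def build_obs_query(obs_ids: list[str]) -> str:
    """Format observation IDs as an `obs_id IN (...)` clause."""
    if not obs_ids:
        raise ValueError("at least one observation ID is required")
    quoted = ",\n".join(f"    '{obs_id}'" for obs_id in obs_ids)
    return f"obs_id IN (\n{quoted}\n)\n"


query = build_obs_query([
    "obs_1234567890.1234567900.ar5_1",
    "obs_1234567901.1234567911.ar5_1",
])
print(query, end="")
```

The resulting string can then be written to the file referenced by the query entry.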

Step 3: Multi-Stage Processing¶

Use multi-stage parameters for progressive refinement:

[campaign.ml-mapmaking]
maxiter = "200,200"      # 200 iterations in each of 2 stages
downsample = "4,2"       # Downsample by 4x, then 2x

# Stage 1: Coarse (4x downsampled), 200 iterations
# Stage 2: Fine (2x downsampled), 200 iterations

This approach provides:

Faster initial convergence at coarse resolution
Refinement at higher resolution
Better overall performance than single-stage processing
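
To see how the comma-separated values line up, the sketch below pairs each downsample factor with its iteration limit. parse_stages is a hypothetical helper, and requiring both lists to have the same length is an assumption of the sketch, not a documented rule.

```python
def parse_stages(maxiter: str, downsample: str) -> list[tuple[int, int]]:
    """Pair each stage's downsample factor with its iteration limit."""
    iters = [int(x) for x in maxiter.split(",")]
    factors = [int(x) for x in downsample.split(",")]
    if len(iters) != len(factors):
        raise ValueError("maxiter and downsample must list the same number of stages")
    return list(zip(factors, iters))


stages = parse_stages("200,200", "4,2")
for number, (factor, iterations) in enumerate(stages, start=1):
    print(f"Stage {number}: downsample {factor}x, up to {iterations} iterations")
```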

Tutorial 6: Monitoring and Debugging¶

Step 1: Understanding Log Output¶

The campaign manager produces several types of log messages:

[INFO] Normal informational messages
[WARNING] Potential issues (non-fatal)
[ERROR] Errors that stop execution
[DEBUG] Detailed debugging information

Step 2: Checking Workflow Status¶

Monitor SLURM jobs:

# List your jobs
squeue -u $USER

# Detailed job info
scontrol show job <job_id>

# Job accounting info
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS
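
The MaxRSS column is reported with a unit suffix (for example a trailing K), while the memory parameter in the configuration is given in MB. The sketch below converts one into the other. maxrss_to_mb is a hypothetical helper, and it assumes binary units (1 M = 1024 K) and that every non-empty value carries a suffix.

```python
# Conversion factors to MB, assuming binary units.
_UNIT_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 * 1024.0}


def maxrss_to_mb(value: str) -> float:
    """Convert a MaxRSS string such as '81920000K' to megabytes."""
    value = value.strip()
    if not value:
        return 0.0  # the field can be empty for some job steps
    suffix = value[-1].upper()
    if suffix not in _UNIT_TO_MB:
        raise ValueError(f"no unit suffix in {value!r}")
    return float(value[:-1]) * _UNIT_TO_MB[suffix]


print(maxrss_to_mb("81920000K"))  # 80000.0
```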

Step 3: Accessing Job Logs¶

RADICAL-Pilot creates detailed logs:

# Find RADICAL-Pilot session directory
ls -lrt ~/radical.pilot.sandbox/

# Check pilot logs
cat ~/radical.pilot.sandbox/rp.session.*/pilot.*/pilot.log

# Check task logs
cat ~/radical.pilot.sandbox/rp.session.*/pilot.*/task.*/task.log

Step 4: Common Issues and Solutions¶

Issue: Configuration validation error

[ERROR] ValidationError: field required: context

Solution: Check TOML syntax and ensure all required fields are present.

—

Issue: Out of memory error

[ERROR] Workflow failed: MemoryError

Solution: Increase the memory parameter in the resources section.

—

Issue: Walltime exceeded

[ERROR] Job terminated: TIMEOUT

Solution: Increase the runtime parameter or optimize the workflow.

—

Issue: File not found

[ERROR] FileNotFoundError: /path/to/context.yaml

Solution:

Verify that file paths are absolute
Use the file:/// prefix for input file URIs
Ensure files are accessible from the compute nodes

Tutorial 7: Testing Before Production¶

Step 1: Dry Run Mode¶

Test your configuration without submitting jobs:

socm -t campaign.toml --dry-run

This will:

Validate configuration
Create workflow objects
Run planning
Show what would be executed
Not submit any jobs

Step 2: Small-Scale Test¶

Before running on the full dataset:

Create a test configuration with a subset of the data
Use shorter runtime limits
Use the test QoS tier
Verify that the outputs are correct

Test configuration:

[campaign]
deadline = "1h"
resource = "tiger3"

[campaign.ml-mapmaking]
# ... same configuration but with:
query = "obs_id='single_test_observation'"  # Just one obs
maxiter = "10"  # Fewer iterations

[campaign.ml-mapmaking.resources]
runtime = "30m"  # Short runtime for test QoS

Step 3: Validation Checklist¶

Before submitting production campaigns:

☐ Configuration validates without errors

☐ File paths are correct and accessible

☐ Environment variables are set correctly

☐ Resource estimates are reasonable

☐ Output directories exist and are writable

☐ Test run completed successfully

☐ Output files are in expected format

☐ Resource usage matches estimates
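
Some items on this checklist, such as file accessibility and writable output directories, can be checked with a short script before submission. The sketch below uses only the Python standard library. input_exists and dir_is_writable are hypothetical helpers, and the demonstration runs against a temporary directory rather than real campaign paths.

```python
import os
import tempfile
from urllib.parse import urlparse


def input_exists(uri_or_path: str) -> bool:
    """True if a file:// URI or plain path points at an existing file."""
    parsed = urlparse(uri_or_path)
    path = parsed.path if parsed.scheme == "file" else uri_or_path
    return os.path.isfile(path)


def dir_is_writable(path: str) -> bool:
    """True if the directory exists and the current user can write to it."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


# Demonstration against a temporary directory instead of real campaign paths.
with tempfile.TemporaryDirectory() as tmp:
    context = os.path.join(tmp, "context.yaml")
    open(context, "w").close()
    print(input_exists("file://" + context))          # True
    print(input_exists("file:///no/such/file.yaml"))  # False
    print(dir_is_writable(tmp))                       # True
```

Run the checks from a compute node if possible, since a path visible on the login node is not guaranteed to be mounted there.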

Tutorial 8: Programmatic Usage¶

For advanced use cases, use the Python API directly.

Step 1: Basic Python Script¶

from socm.bookkeeper import Bookkeeper
from socm.core import Campaign
from socm.resources import TigerResource
from socm.workflows import MLMapmakingWorkflow

# Create workflow
workflow = MLMapmakingWorkflow(
    name="test_mapmaking",
    executable="so-site-pipeline",
    subcommand="make-filterbin-map",
    context="/path/to/context.yaml",
    area="/path/to/area.fits",
    output_dir="/path/to/output",
    bands="f090",
    maxiter="100",
    query="obs_id='test'",
    site="so_lat",
    resources={
        "ranks": 32,
        "threads": 8,
        "memory": 120000,
        "runtime": "3h"
    }
)

# Create campaign
campaign = Campaign(
    id=1,
    workflows=[workflow],
    campaign_policy="time",
    deadline=240  # minutes
)

# Create resource
resource = TigerResource()

# Execute
bookkeeper = Bookkeeper(
    campaign=campaign,
    resources={"tiger3": resource},
    policy="time",
    target_resource="tiger3",
    deadline=240
)

bookkeeper.run()

Step 2: Dynamic Workflow Generation¶

Generate workflows programmatically:

from socm.core import Campaign
from socm.workflows import MLMapmakingWorkflow

# List of bands to process
bands_list = ["f090", "f150", "f220"]

# Create workflow for each band
workflows = []
for band in bands_list:
    wf = MLMapmakingWorkflow(
        name=f"mapmaking_{band}",
        executable="so-site-pipeline",
        subcommand="make-filterbin-map",
        context="/path/to/context.yaml",
        area="/path/to/area.fits",
        output_dir=f"/path/to/output/{band}",
        bands=band,
        maxiter="100",
        query="obs_id='test'",
        site="so_lat",
        resources={
            "ranks": 32,
            "threads": 8,
            "memory": 120000,
            "runtime": "3h"
        }
    )
    workflows.append(wf)

# Create campaign with all workflows
campaign = Campaign(
    id=1,
    workflows=workflows,
    campaign_policy="time",
    deadline=240  # minutes, as in Step 1
)

Next Steps¶

Now that you’ve completed the tutorials, you can:

Read the User Guide for comprehensive reference
Explore Workflows for detailed workflow documentation
Check Advanced Topics for advanced features
Review Architecture to understand system internals
Consult FAQ and Troubleshooting for common questions

Congratulations on completing the tutorial! You’re now ready to run production campaigns with SO Campaign Manager.