Tutorial ======== This tutorial will guide you through using SO Campaign Manager step-by-step, from basic usage to advanced configurations. Tutorial Overview ----------------- In this tutorial, you will learn: 1. How to set up your first campaign 2. How to configure workflows and resources 3. How to run and monitor campaigns 4. How to work with null tests 5. How to optimize resource usage 6. How to troubleshoot common issues Prerequisites ------------- Before starting, ensure you have: * SO Campaign Manager installed (see :doc:`installation`) * Access to an HPC system (e.g., Tiger 3) * Basic understanding of SLURM job scheduling * Sample data for SO mapmaking (context files, area definitions, etc.) Tutorial 1: Your First Campaign -------------------------------- Let's create a simple campaign with a single ML mapmaking workflow. Step 1: Understand the Campaign Structure ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A campaign consists of: * **Campaign configuration** - Deadline and global settings * **Workflows** - Analysis tasks to execute * **Resources** - HPC resource requirements per workflow * **Environment** - Environment variables for execution Step 2: Create a Basic Configuration File ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create a file named ``my_first_campaign.toml``: .. code-block:: toml # Campaign-level configuration [campaign] deadline = "4h" # Complete within 4 hours resource = "tiger3" # Target HPC resource execution_schema = "remote" # Execute on remote HPC # ML Mapmaking workflow configuration [campaign.ml-mapmaking] # Input files (use absolute paths) context = "file:///path/to/your/context.yaml" area = "file:///path/to/your/area.fits" output_dir = "/path/to/output" # Analysis parameters bands = "f090" # Frequency band to process maxiter = "100" # Maximum iterations query = "obs_id='your_observation_id'" # Data selection query tiled = 0 # Don't use tiled processing site = "so_lat" # Observatory site # Resource requirements for this workflow [campaign.ml-mapmaking.resources] ranks = 1 # Number of MPI ranks threads = 32 # Number of OpenMP threads memory = "80000" # Memory in MB (80 GB) runtime = "2h" # Expected runtime # Environment variables [campaign.ml-mapmaking.environment] SOTODLIB_SITECONFIG = "/path/to/site.yaml" .. note:: **Important:** All file paths in the configuration must be **absolute paths** starting with ``file:///``. Step 3: Validate Your Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Before running, let's understand what each parameter means: **Campaign Parameters:** * ``deadline``: Maximum time for the entire campaign to complete * Format: ``"2h"`` (2 hours), ``"30m"`` (30 minutes), ``"3d"`` (3 days) * The planner will optimize workflow scheduling to meet this deadline * ``resource``: Target HPC system (currently only ``"tiger3"`` is supported) * ``execution_schema``: How to execute (``"remote"`` for HPC, ``"local"`` for testing) **Workflow Parameters:** * ``context``: Path to SOTODLIB context YAML file defining data structure * ``area``: Path to FITS file defining the sky area to map * ``output_dir``: Directory where output maps will be written * ``bands``: Frequency band(s) to process (``"f090"``, ``"f150"``, or ``"f090,f150"``) * ``maxiter``: Maximum iterations for convergence (typically 100-200) * ``query``: SQL-like query to select observations from context * ``tiled``: Whether to use tiled processing (0 = no, 1 = yes) * ``site``: Observatory site identifier (``"so_lat"`` or ``"so_sat"``) **Resource Parameters:** * ``ranks``: Number of MPI processes * ``threads``: Number of OpenMP threads per process * ``memory``: Total memory requirement in MB * ``runtime``: Expected execution time (used for QoS selection) Step 4: Run Your Campaign ~~~~~~~~~~~~~~~~~~~~~~~~~~ Execute the campaign using the command-line interface: .. code-block:: bash socm -t my_first_campaign.toml You should see output similar to: .. code-block:: text [INFO] Loading campaign configuration from my_first_campaign.toml [INFO] Validating configuration... [INFO] Creating 1 workflow(s) [INFO] Planning campaign with deadline: 240 minutes [INFO] Selected QoS: medium (max walltime: 4320 minutes) [INFO] Submitting workflows to SLURM... [INFO] Campaign execution started [INFO] Monitoring workflow progress... Step 5: Monitor Execution ~~~~~~~~~~~~~~~~~~~~~~~~~~ The campaign manager will: 1. Parse your configuration 2. Create workflow instances 3. Plan the execution schedule 4. Submit jobs to SLURM 5. Monitor execution and report progress You can monitor SLURM jobs directly: .. code-block:: bash # Check your SLURM queue squeue -u $USER # Monitor specific job scontrol show job Step 6: Check Output ~~~~~~~~~~~~~~~~~~~~~ Once complete, check your output directory: .. code-block:: bash ls -lh /path/to/output/ You should find: * Output maps (FITS files) * Log files * Metadata about the run Congratulations! You've run your first campaign. Tutorial 2: Multiple Workflows in Parallel ------------------------------------------- Now let's create a campaign with multiple workflows that run in parallel. Step 1: Create Multi-Workflow Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create ``multi_workflow_campaign.toml``: .. code-block:: toml [campaign] deadline = "8h" resource = "tiger3" execution_schema = "remote" requested_resources = 1000 # Total core-hours budget # First workflow: f090 band [campaign.ml-mapmaking] name = "mapmaking_f090" context = "file:///path/to/context.yaml" area = "file:///path/to/area.fits" output_dir = "/path/to/output/f090" bands = "f090" maxiter = "100" query = "obs_id='observation_1'" site = "so_lat" [campaign.ml-mapmaking.resources] ranks = 32 threads = 8 memory = "120000" runtime = "3h" Since TOML doesn't support multiple sections with the same name, you'll need to use arrays or separate workflow types. Let me show you a better approach using subcampaigns: .. code-block:: toml [campaign] deadline = "8h" resource = "tiger3" execution_schema = "remote" # Common configuration for all mapmaking workflows [campaign.ml-mapmaking] context = "file:///path/to/context.yaml" area = "file:///path/to/area.fits" output_dir = "/path/to/output" maxiter = "100" site = "so_lat" [campaign.ml-mapmaking.resources] ranks = 32 threads = 8 memory = "120000" runtime = "3h" # You can then modify the query or bands in the code # or use different workflow types Step 2: Understanding Parallel Execution ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The campaign manager automatically: * Detects workflows that can run in parallel * Schedules them based on resource availability * Optimizes to meet the deadline * Manages dependencies between workflows Tutorial 3: Null Test Campaigns -------------------------------- Null tests are crucial for validating mapmaking results. Let's create a comprehensive null test campaign. Step 1: Understanding Null Tests ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Null tests validate your mapmaking by creating maps from data splits: * **Mission Tests**: Split by time * **Wafer Tests**: Split by detector * **Direction Tests**: Split by scan direction * **PWV Tests**: Split by precipitable water vapor * **Day/Night Tests**: Split by time of day * **Elevation Tests**: Split by telescope elevation Step 2: Configure Null Tests ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create ``null_test_campaign.toml``: .. code-block:: toml [campaign] deadline = "12h" resource = "tiger3" execution_schema = "remote" # Common configuration for all null tests [campaign.ml-null-tests] context = "file:///path/to/context.yaml" area = "file:///path/to/area.fits" output_dir = "/path/to/output/null_tests" bands = "f090" maxiter = "100,100" # Two iteration stages downsample = "4,2" # Downsample factors for each stage query = "file:///path/to/query.txt" tiled = 1 site = "so_lat" # Mission tests: time-based splits [campaign.ml-null-tests.mission-tests] chunk_nobs = 10 # Chunk size in days nsplits = 4 # Number of splits (must be power of 2) [campaign.ml-null-tests.mission-tests.resources] ranks = 35 threads = 8 memory = "2400000" # 2.4 TB runtime = "4h" # Wafer tests: detector-based splits [campaign.ml-null-tests.wafer-tests] chunk_nobs = 10 nsplits = 4 [campaign.ml-null-tests.wafer-tests.resources] ranks = 12 threads = 8 memory = "80000" runtime = "4h" # Direction tests: scan direction splits [campaign.ml-null-tests.direction-tests] chunk_nobs = 10 nsplits = 4 [campaign.ml-null-tests.direction-tests.resources] ranks = 17 threads = 8 memory = "80000" runtime = "4h" Step 3: Understanding Subcampaign Inheritance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Notice how the null test workflows inherit common configuration: * ``context``, ``area``, ``bands``, etc. are defined once in ``[campaign.ml-null-tests]`` * Each specific test (``mission-tests``, ``wafer-tests``, etc.) inherits these * Specific tests only need to define their unique parameters This follows the **DRY principle** (Don't Repeat Yourself). Step 4: Run Null Test Campaign ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash socm -t null_test_campaign.toml The campaign manager will: 1. Create 3 null test workflows (mission, wafer, direction) 2. Schedule them for parallel execution 3. Monitor all workflows 4. Report completion status Tutorial 4: Resource Optimization ---------------------------------- Learn how to optimize resource usage for cost-effective campaigns. Step 1: Understanding Resource Parameters ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The key resource parameters affect both performance and cost: **Ranks (MPI Processes):** * More ranks → faster for I/O-heavy tasks * Diminishing returns beyond data parallelism limit * Rule of thumb: 1 rank per ~2-4 GB of data **Threads (OpenMP):** * More threads → faster for compute-heavy tasks * Limited by memory bandwidth * Rule of thumb: 4-16 threads per rank **Memory:** * Must accommodate: data + working set + overhead * Rule of thumb: 2x data size for mapmaking * Monitor actual usage and adjust **Runtime:** * Affects QoS selection (queue priority) * Overestimate to avoid timeout * Underestimate wastes resources if too conservative Step 2: QoS Tiers on Tiger ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Tiger has several QoS tiers with different limits: .. list-table:: :header-rows: 1 :widths: 20 20 20 40 * - QoS Tier - Max Walltime - Max Jobs - Best For * - test - 60 minutes - Limited - Quick tests, debugging * - vshort - 5 hours - Many - Small workflows * - short - 24 hours - Many - Standard workflows * - medium - 3 days - Moderate - Large workflows * - long - 6 days - Few - Very large workflows * - vlong - 15 days - Very few - Extremely large workflows The campaign manager automatically selects the appropriate QoS based on your ``runtime`` estimate. Step 3: Right-Sizing Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's optimize a workflow: **Initial (over-allocated) configuration:** .. code-block:: toml [campaign.ml-mapmaking.resources] ranks = 100 # Too many? threads = 32 # Too many? memory = "500000" # 500 GB - too much? runtime = "10h" # Too long? **Steps to optimize:** 1. **Start with a small test:** .. code-block:: toml ranks = 10 threads = 8 memory = "100000" runtime = "1h" 2. **Run and monitor:** .. code-block:: bash # During execution, monitor memory usage ssh tiger-node-xx # SSH to compute node top -u $USER 3. **Check logs for actual usage:** * Memory high-water mark * Actual walltime * CPU utilization 4. **Adjust based on findings:** * If memory maxed out → increase memory * If completed in 30min with 1h limit → reduce runtime * If CPU idle → reduce ranks or threads * If walltime nearly exceeded → increase runtime **Optimized configuration:** .. code-block:: toml [campaign.ml-mapmaking.resources] ranks = 35 # Sufficient for data size threads = 8 # Good balance memory = "2400000" # 20% overhead over observed runtime = "4h" # 50% buffer over observed Step 4: Scaling Rules ~~~~~~~~~~~~~~~~~~~~~ **Weak Scaling (more data, same time):** * Double data size → double ranks * Keep threads constant * Double memory **Strong Scaling (same data, less time):** * Limited by Amdahl's Law * Doubling ranks doesn't halve time * Test to find optimal parallelism Tutorial 5: Advanced Configuration Patterns -------------------------------------------- Step 1: Using Environment Variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Many workflows require specific environment variables: .. code-block:: toml [campaign.ml-mapmaking.environment] # SOTODLIB configuration SOTODLIB_SITECONFIG = "/path/to/site.yaml" # Temporary storage TMPDIR = "/scratch/network/$USER/tmp" # Performance tuning OMP_NUM_THREADS = "8" OMP_PROC_BIND = "true" OMP_PLACES = "cores" # Debugging SOTODLIB_DEBUG = "1" Step 2: Working with Query Files ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For complex observation selections, use query files: **query.txt:** .. code-block:: sql obs_id IN ( 'obs_1234567890.1234567900.ar5_1', 'obs_1234567901.1234567911.ar5_1', 'obs_1234567912.1234567922.ar5_1' ) **Configuration:** .. code-block:: toml [campaign.ml-mapmaking] query = "file:///path/to/query.txt" Step 3: Multi-Stage Processing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use multi-stage parameters for progressive refinement: .. code-block:: toml [campaign.ml-mapmaking] maxiter = "200,200" # 200 iterations in each of 2 stages downsample = "4,2" # Downsample by 4x, then 2x # Stage 1: Coarse (4x downsampled), 200 iterations # Stage 2: Fine (2x downsampled), 200 iterations This approach: * Faster initial convergence at coarse resolution * Refinement at higher resolution * Better overall performance than single-stage Tutorial 6: Monitoring and Debugging ------------------------------------- Step 1: Understanding Log Output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The campaign manager produces several types of log messages: .. code-block:: text [INFO] Normal informational messages [WARNING] Potential issues (non-fatal) [ERROR] Errors that stop execution [DEBUG] Detailed debugging information Step 2: Checking Workflow Status ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Monitor SLURM jobs: .. code-block:: bash # List your jobs squeue -u $USER # Detailed job info scontrol show job # Job accounting info sacct -j --format=JobID,JobName,State,Elapsed,MaxRSS Step 3: Accessing Job Logs ~~~~~~~~~~~~~~~~~~~~~~~~~~~ RADICAL-Pilot creates detailed logs: .. code-block:: bash # Find RADICAL-Pilot session directory ls -lrt ~/radical.pilot.sandbox/ # Check pilot logs cat ~/radical.pilot.sandbox/rp.session.*/pilot.*/pilot.log # Check task logs cat ~/radical.pilot.sandbox/rp.session.*/pilot.*/task.*/task.log Step 4: Common Issues and Solutions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Issue: Configuration validation error** .. code-block:: text [ERROR] ValidationError: field required: context **Solution:** Check TOML syntax and ensure all required fields are present. --- **Issue: Out of memory error** .. code-block:: text [ERROR] Workflow failed: MemoryError **Solution:** Increase ``memory`` parameter in resources section. --- **Issue: Walltime exceeded** .. code-block:: text [ERROR] Job terminated: TIMEOUT **Solution:** Increase ``runtime`` parameter or optimize workflow. --- **Issue: File not found** .. code-block:: text [ERROR] FileNotFoundError: /path/to/context.yaml **Solution:** * Verify file paths are absolute * Use ``file:///`` prefix for file URIs * Ensure files are accessible from compute nodes Tutorial 7: Testing Before Production -------------------------------------- Step 1: Dry Run Mode ~~~~~~~~~~~~~~~~~~~~~ Test your configuration without submitting jobs: .. code-block:: bash socm -t campaign.toml --dry-run This will: * Validate configuration * Create workflow objects * Run planning * Show what would be executed * **Not submit any jobs** Step 2: Small-Scale Test ~~~~~~~~~~~~~~~~~~~~~~~~~ Before running on full dataset: 1. Create a test configuration with subset of data 2. Use shorter runtime limits 3. Use test QoS tier 4. Verify outputs are correct **Test configuration:** .. code-block:: toml [campaign] deadline = "1h" resource = "tiger3" [campaign.ml-mapmaking] # ... same configuration but with: query = "obs_id='single_test_observation'" # Just one obs maxiter = "10" # Fewer iterations [campaign.ml-mapmaking.resources] runtime = "30m" # Short runtime for test QoS Step 3: Validation Checklist ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Before submitting production campaigns: ☐ Configuration validates without errors ☐ File paths are correct and accessible ☐ Environment variables are set correctly ☐ Resource estimates are reasonable ☐ Output directories exist and are writable ☐ Test run completed successfully ☐ Output files are in expected format ☐ Resource usage matches estimates Tutorial 8: Programmatic Usage ------------------------------- For advanced use cases, use the Python API directly. Step 1: Basic Python Script ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from socm.bookkeeper import Bookkeeper from socm.core import Campaign from socm.resources import TigerResource from socm.workflows import MLMapmakingWorkflow # Create workflow workflow = MLMapmakingWorkflow( name="test_mapmaking", executable="so-site-pipeline", subcommand="make-filterbin-map", context="/path/to/context.yaml", area="/path/to/area.fits", output_dir="/path/to/output", bands="f090", maxiter="100", query="obs_id='test'", site="so_lat", resources={ "ranks": 32, "threads": 8, "memory": 120000, "runtime": "3h" } ) # Create campaign campaign = Campaign( id=1, workflows=[workflow], campaign_policy="time", deadline=240 # minutes ) # Create resource resource = TigerResource() # Execute bookkeeper = Bookkeeper( campaign=campaign, resources={"tiger3": resource}, policy="time", target_resource="tiger3", deadline=240 ) bookkeeper.run() Step 2: Dynamic Workflow Generation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Generate workflows programmatically: .. code-block:: python from socm.workflows import MLMapmakingWorkflow # List of bands to process bands_list = ["f090", "f150", "f220"] # Create workflow for each band workflows = [] for band in bands_list: wf = MLMapmakingWorkflow( name=f"mapmaking_{band}", executable="so-site-pipeline", subcommand="make-filterbin-map", context="/path/to/context.yaml", area="/path/to/area.fits", output_dir=f"/path/to/output/{band}", bands=band, maxiter="100", query="obs_id='test'", site="so_lat", resources={ "ranks": 32, "threads": 8, "memory": 120000, "runtime": "3h" } ) workflows.append(wf) # Create campaign with all workflows campaign = Campaign( id=1, workflows=workflows, campaign_policy="time" ) Next Steps ---------- Now that you've completed the tutorials, you can: * Read the :doc:`user_guide` for comprehensive reference * Explore :doc:`workflows` for detailed workflow documentation * Check :doc:`advanced_topics` for advanced features * Review :doc:`architecture` to understand system internals * Consult :doc:`faq` for common questions Congratulations on completing the tutorial! You're now ready to run production campaigns with SO Campaign Manager.