FAQ and Troubleshooting
This document provides answers to frequently asked questions and solutions to common problems.
Frequently Asked Questions
General Questions
Q: What is SO Campaign Manager?
A: SO Campaign Manager is a workflow orchestration system designed for running mapmaking campaigns on HPC systems. It handles workflow scheduling, resource allocation, SLURM job submission, and monitoring for Simons Observatory data analysis.
Q: Which HPC systems are supported?
A: Currently, SO Campaign Manager is optimized for Tiger 3 (Princeton’s HPC cluster), but it can be adapted to other SLURM-based HPC systems by creating custom Resource classes.
Q: What is the difference between a workflow and a campaign?
A: A workflow is a single computational task (e.g., one mapmaking job). A campaign is a collection of workflows that are scheduled and executed together to meet a deadline.
Q: Can I run campaigns locally for testing?
A: Yes, use execution_schema = "local" in your configuration, or use the --dry-run flag to test without actual execution.
Configuration Questions
Q: Why do I need to use file:/// prefix for paths?
A: The file:/// prefix marks the value as a URI in the file scheme, which explicitly indicates a local file path. This leaves room for the system to support other URI schemes (e.g., http://, s3://) in the future.
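As an illustration of what the prefix encodes, the sketch below converts a file:/// URI to a plain path with Python's standard library. The helper name file_uri_to_path is hypothetical and is not part of SO Campaign Manager; how the tool itself resolves these values is not described here.

```python
from urllib.parse import unquote, urlparse


def file_uri_to_path(uri: str) -> str:
    """Return the local path named by a file:// URI (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        # Covers both other schemes (s3://, http://) and bare paths with no scheme.
        raise ValueError(f"unsupported URI scheme: {parsed.scheme!r}")
    if parsed.netloc not in ("", "localhost"):
        # file://host/path would name a file on another machine.
        raise ValueError(f"file URI must not name a remote host: {parsed.netloc!r}")
    return unquote(parsed.path)


print(file_uri_to_path("file:///path/to/context.yaml"))  # /path/to/context.yaml
```

The three slashes are the scheme separator (file://) followed by an empty host and the leading slash of an absolute path, which is why relative paths cannot be written this way.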
Q: What time formats are supported for deadline and runtime?
A: Supported formats include:
"30m" - 30 minutes
"2h" - 2 hours
"3d" - 3 days
"1w" - 1 week
Also accepts raw minutes as integer:
deadline = 240 # 240 minutes
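The formats above can be sketched as a small parser that normalizes every accepted value to minutes. This is an illustration only, not the tool's actual parser; the function name parse_minutes and the strict number-plus-one-letter rule are assumptions based on the examples listed.

```python
import re

# Minutes per unit for the four suffixes listed above.
_UNIT_MINUTES = {"m": 1, "h": 60, "d": 24 * 60, "w": 7 * 24 * 60}


def parse_minutes(value) -> int:
    """Convert "30m", "2h", "3d", "1w" or a raw integer to minutes."""
    if isinstance(value, int) and not isinstance(value, bool):
        return value  # raw integers are already minutes
    match = re.fullmatch(r"(\d+)([mhdw])", str(value).strip())
    if match is None:
        raise ValueError(f"Cannot parse time string: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_MINUTES[unit]


for example in ("30m", "2h", "3d", "1w", 240):
    print(example, "->", parse_minutes(example), "minutes")
```

Under this rule a value such as "2hrs" is rejected because the suffix must be exactly one of m, h, d, or w.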
Q: How do I configure workflows that run in multiple stages?
A: Use comma-separated values for multi-stage parameters:
maxiter = "200,200" # 200 iterations per stage
downsample = "4,2" # Downsample factors per stage
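To make the per-stage reading concrete, the sketch below splits both values and pairs them by position. The helper split_stages is hypothetical, and the positional pairing is an assumption drawn from the "per stage" comments above rather than a description of the tool's internals.

```python
def split_stages(value: str, cast=int) -> list:
    """Split a comma-separated multi-stage value into one entry per stage."""
    return [cast(part.strip()) for part in value.split(",")]


maxiter = split_stages("200,200")
downsample = split_stages("4,2")

# Assumption: entries are matched by position, so both lists need the same length.
if len(maxiter) != len(downsample):
    raise ValueError("multi-stage parameters must have the same number of stages")

for stage, (iters, factor) in enumerate(zip(maxiter, downsample), start=1):
    print(f"stage {stage}: maxiter={iters}, downsample={factor}")
# stage 1: maxiter=200, downsample=4
# stage 2: maxiter=200, downsample=2
```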
Q: Can I use environment variables in configuration files?
A: TOML doesn’t natively support environment variable expansion. Use a pre-processing script or template system (like Jinja2) if you need dynamic values.
Q: How do I specify different resources for different workflows?
A: Each workflow section can have its own resources subsection:
[campaign.ml-mapmaking]
# ... workflow config
[campaign.ml-mapmaking.resources]
ranks = 32
memory = "120000"
[campaign.ml-null-tests.mission-tests]
# ... workflow config
[campaign.ml-null-tests.mission-tests.resources]
ranks = 16
memory = "60000"
Resource and Scheduling Questions
Q: How does the system select the QoS tier?
A: The system automatically selects the lowest QoS tier that can accommodate your runtime estimate. For example, if runtime is "3h", it will select short (max 24h) rather than medium (max 3d).
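The selection rule can be sketched as follows. Only the two tiers named above (short, 24h; medium, 3d) are included; the table layout, the function name select_qos, and the inclusive upper bound are assumptions, and real clusters may define more tiers.

```python
# Tier limits in minutes, taken from the example above.
QOS_LIMITS_MINUTES = [
    ("short", 24 * 60),       # max 24h
    ("medium", 3 * 24 * 60),  # max 3d
]


def select_qos(runtime_minutes: int, tiers=QOS_LIMITS_MINUTES) -> str:
    """Return the tier with the smallest limit that still fits the runtime."""
    for name, limit in sorted(tiers, key=lambda tier: tier[1]):
        if runtime_minutes <= limit:  # assumption: the limit itself is allowed
            return name
    raise ValueError("QoS not available: runtime exceeds all tiers")


print(select_qos(3 * 60))   # short
print(select_qos(30 * 60))  # medium
```

A runtime longer than the largest limit raises an error in this sketch.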
Q: What happens if my workflow exceeds the estimated runtime?
A: The SLURM scheduler will terminate the job. Always add a safety buffer (20-50%) to your runtime estimates.
Q: How many cores/memory should I allocate?
A: General guidelines:
Ranks: ~1 rank per 2-4 GB of data
Threads: 4-16 per rank (diminishing returns beyond 16)
Memory: 2x your data size + overhead
Runtime: Actual expected time + 50% buffer
Start conservative and refine based on actual usage.
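The guidelines above reduce to simple arithmetic, sketched here for a first estimate. The function name, the choice of 4 GB per rank (the upper end of the 2-4 GB range), and the zero default overhead are assumptions to adjust for your data.

```python
import math


def estimate_resources(data_gb: float, expected_hours: float,
                       gb_per_rank: float = 4.0, overhead_gb: float = 0.0) -> dict:
    """First-pass resource estimate following the rules of thumb above."""
    return {
        "ranks": max(1, math.ceil(data_gb / gb_per_rank)),  # ~1 rank per 2-4 GB
        "memory_gb": 2 * data_gb + overhead_gb,             # 2x data size + overhead
        "runtime_hours": expected_hours * 1.5,               # expected time + 50%
    }


print(estimate_resources(data_gb=64, expected_hours=4))
# {'ranks': 16, 'memory_gb': 128.0, 'runtime_hours': 6.0}
```

These numbers are starting points; convert them to the units your configuration expects before use.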
Q: Can I limit the total resources used by a campaign?
A: Yes, use the requested_resources parameter:
[campaign]
requested_resources = 3359 # Total core-hours
The planner will optimize scheduling within this budget.
Q: What is the HEFT algorithm?
A: HEFT (Heterogeneous Earliest Finish Time) is a scheduling algorithm that:
Ranks workflows by priority (computation + communication costs)
Assigns each workflow to resources that minimize finish time
Respects dependencies between workflows
Optimizes for minimal total campaign time (makespan)
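The four steps above can be sketched in a few lines. This is a simplified illustration, not the planner's implementation: it uses a single flat communication cost and appends each task to the end of a resource's queue, whereas full HEFT also searches for idle gaps (insertion-based scheduling). All names and the example costs are hypothetical.

```python
def heft(costs, deps, comm=0.0):
    """Simplified HEFT.

    costs: {task: [runtime on resource 0, runtime on resource 1, ...]}
    deps:  {task: [tasks that must finish first]}
    comm:  flat cost added when parent and child run on different resources
    Returns {task: (resource, start, finish)}.
    """
    successors = {task: [] for task in costs}
    for task, parents in deps.items():
        for parent in parents:
            successors[parent].append(task)

    rank = {}

    def upward_rank(task):
        # Mean runtime plus the most expensive path to the end of the DAG.
        if task not in rank:
            mean_cost = sum(costs[task]) / len(costs[task])
            rank[task] = mean_cost + max(
                (comm + upward_rank(s) for s in successors[task]), default=0.0
            )
        return rank[task]

    # With positive runtimes, a parent always outranks its children,
    # so this order respects dependencies.
    order = sorted(costs, key=upward_rank, reverse=True)

    n_resources = len(next(iter(costs.values())))
    free_at = [0.0] * n_resources
    schedule = {}
    for task in order:
        best = None
        for r in range(n_resources):
            ready = max(
                (schedule[p][2] + (comm if schedule[p][0] != r else 0.0)
                 for p in deps.get(task, [])),
                default=0.0,
            )
            start = max(ready, free_at[r])
            finish = start + costs[task][r]
            if best is None or finish < best[2]:
                best = (r, start, finish)  # keep the earliest finish time
        schedule[task] = best
        free_at[best[0]] = best[2]
    return schedule


costs = {"A": [2, 3], "B": [3, 2], "C": [4, 4], "D": [1, 2]}
deps = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
schedule = heft(costs, deps, comm=1)
makespan = max(finish for _, _, finish in schedule.values())
print(schedule, makespan)
```

In this example B is placed on the second resource so it can overlap with C, and the makespan is 7 time units.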
Workflow Questions
Q: What workflows are available?
A: Built-in workflows include:
ml-mapmaking - Maximum likelihood mapmaking
sat-sims - SAT simulations
ml-null-tests.mission-tests - Time-based null tests
ml-null-tests.wafer-tests - Detector-based null tests
ml-null-tests.direction-tests - Scan direction null tests
ml-null-tests.pwv-tests - PWV-based null tests
ml-null-tests.day-night-tests - Day/night null tests
ml-null-tests.elevation-tests - Elevation null tests
ml-null-tests.moon-close-tests - Moon proximity null tests
ml-null-tests.moonrise-set-tests - Moonrise/set null tests
ml-null-tests.sun-close-tests - Sun proximity null tests
Q: How do I create a custom workflow?
A: See Advanced Topics for detailed instructions on creating custom workflows.
Q: What does tiled = 1 do?
A: Tiled processing breaks the sky area into smaller tiles that are processed independently. This:
Reduces memory requirements
Enables parallelization across tiles
May increase total runtime due to overhead
Use tiled processing for very large sky areas.
Q: What are null tests and why are they important?
A: Null tests validate mapmaking by creating maps from data splits (e.g., first half vs. second half of observations). The difference map (null map) should be consistent with noise. A large signal in the null map indicates systematic errors.
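The idea can be shown with a toy one-dimensional "map" of random numbers. This is not the mapmaking pipeline: the pixel count, noise level, and the 20% gain error used to mimic a systematic are all made-up values for illustration.

```python
import random
import statistics


def null_map(map_a, map_b):
    """Half-difference of two split maps; the common sky signal cancels."""
    return [(a - b) / 2 for a, b in zip(map_a, map_b)]


rng = random.Random(0)
noise_sigma = 0.1
sky = [rng.gauss(0, 1) for _ in range(2000)]  # signal shared by both splits

first_half = [s + rng.gauss(0, noise_sigma) for s in sky]
second_half = [s + rng.gauss(0, noise_sigma) for s in sky]
# Toy systematic: the second split is miscalibrated by 20%.
biased_half = [1.2 * s + rng.gauss(0, noise_sigma) for s in sky]

# Independent noise in each split gives a half-difference scatter of sigma / sqrt(2).
expected = noise_sigma / 2 ** 0.5
clean_ratio = statistics.pstdev(null_map(first_half, second_half)) / expected
biased_ratio = statistics.pstdev(null_map(first_half, biased_half)) / expected

print(f"clean split:  null scatter / expected noise = {clean_ratio:.2f}")
print(f"biased split: null scatter / expected noise = {biased_ratio:.2f}")
```

A ratio close to 1 means the null map looks like noise; a ratio well above 1 means sky signal is leaking into the difference, which is the signature of a systematic error.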
Execution and Monitoring Questions
Q: How do I monitor campaign progress?
A: Several methods:
Campaign manager logs to stdout
Check SLURM queue:
squeue -u $USER
Check RADICAL-Pilot logs:
~/radical.pilot.sandbox/
Monitor output directory for completed files
Q: Can I cancel a running campaign?
A: Yes, use Ctrl+C to stop the campaign manager, then cancel SLURM jobs:
# Cancel all your jobs
scancel -u $USER
# Cancel specific job
scancel <job_id>
Q: How do I check if a workflow completed successfully?
A: Check:
Campaign manager logs for completion message
SLURM job status:
sacct -j <job_id>
Output files in the configured output directory
RADICAL-Pilot task logs for errors
Q: Can I resume a failed campaign?
A: Currently, campaigns don’t support automatic resume. You need to:
Identify which workflows completed
Remove completed workflows from configuration
Rerun campaign with remaining workflows
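The manual steps above amount to pruning finished sections from the configuration. The sketch below does this on an already-parsed configuration (a nested dict); the function name, the example keys, and the list of completed workflows are hypothetical, and deciding which workflows actually completed is still up to you.

```python
import copy


def drop_completed(config: dict, completed: list[str]) -> dict:
    """Return a copy of config without the sections named by dotted paths."""
    remaining = copy.deepcopy(config)  # leave the original untouched
    for dotted in completed:
        *parents, leaf = dotted.split(".")
        node = remaining
        for key in parents:
            node = node.get(key, {})
        node.pop(leaf, None)  # ignore names that are not present
    return remaining


config = {
    "campaign": {
        "deadline": "2d",
        "ml-mapmaking": {"maxiter": "200,200"},
        "ml-null-tests": {
            "mission-tests": {"maxiter": "100"},
            "wafer-tests": {"maxiter": "100"},
        },
    }
}
done = ["campaign.ml-mapmaking", "campaign.ml-null-tests.mission-tests"]
print(drop_completed(config, done))
# {'campaign': {'deadline': '2d', 'ml-null-tests': {'wafer-tests': {'maxiter': '100'}}}}
```

Python's standard library can read TOML but not write it, so in practice it is often simpler to comment out the completed sections in the TOML file by hand.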
Error and Debugging Questions
Q: What does “ValidationError: field required” mean?
A: A required parameter is missing from your configuration. Check the error message for the field name and add it to your TOML file.
Q: Why am I getting “FileNotFoundError”?
A: Common causes:
Path is not absolute
Missing file:/// prefix
File doesn’t exist
File not accessible from compute nodes
Typo in path
Q: What does “QoS not available” mean?
A: Your estimated runtime exceeds all available QoS tiers, or the specified QoS doesn’t exist on the target resource. Check your runtime estimate and QoS name.
Q: Why is my job stuck in pending state?
A: Common reasons:
Resource request too large (reduce nodes/cores)
QoS limits reached (too many jobs in queue)
System maintenance
Account limits exceeded
Check with: squeue -j <job_id> --start
Troubleshooting Guide
Configuration Errors
Problem: TOML Syntax Error
Error: Invalid TOML syntax at line 15
Solution:
Validate TOML syntax using an online validator
Check for:
Unmatched quotes
Missing closing brackets
Invalid escape sequences
Duplicate section headers
Problem: Pydantic Validation Error
ValidationError: 1 validation error for Workflow
context
field required (type=value_error.missing)
Solution:
Add the missing field to your configuration:
[campaign.ml-mapmaking]
context = "file:///path/to/context.yaml"
Problem: Invalid Time Format
Error: Cannot parse time string: '2hrs'
Solution:
Use correct format: "2h" (not "2hrs")
Resource Allocation Errors
Problem: Out of Memory (OOM)
[ERROR] Job failed: Out of memory
Solution:
Check actual memory usage from SLURM:
sacct -j <job_id> --format=JobID,MaxRSS,ReqMem
Increase memory allocation:
[campaign.ml-mapmaking.resources]
memory = "240000" # Increase from previous value
Or reduce data chunk size:
chunk_nobs = 5 # Reduce from 10
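When choosing the new memory value, the MaxRSS column from sacct can be turned into a figure with headroom. The sketch below assumes MaxRSS carries a K, M, G, or T suffix (e.g. "104857600K") and that the memory field is in megabytes; neither the unit of the memory field nor whether it is per rank or per job is stated here, so check both before applying the result. The function name and the 20% default buffer are illustrative.

```python
import math

# Conversion factors from sacct-style suffixes to megabytes.
_SUFFIX_TO_MB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def suggest_memory_mb(max_rss: str, buffer: float = 0.20) -> int:
    """Peak usage from MaxRSS plus a safety buffer, rounded up to whole MB."""
    text = max_rss.strip().upper()
    if not text or text[-1] not in _SUFFIX_TO_MB:
        raise ValueError(f"expected a value like '104857600K', got {max_rss!r}")
    used_mb = float(text[:-1]) * _SUFFIX_TO_MB[text[-1]]
    return math.ceil(used_mb * (1 + buffer))


print(suggest_memory_mb("100G", buffer=0.25))  # 128000
```

MaxRSS reports the peak of a single task, so a job with many ranks may need the figure scaled accordingly.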
Problem: Job Timeout
[ERROR] Job terminated: TIMEOUT
Solution:
Check actual runtime from SLURM:
sacct -j <job_id> --format=JobID,Elapsed,Timelimit
Increase runtime estimate:
[campaign.ml-mapmaking.resources]
runtime = "8h" # Increase with buffer
Problem: Node Allocation Failed
[ERROR] SLURM reject: Requested node configuration not available
Solution:
Reduce nodes requested
Check node availability:
sinfo
Use appropriate partition
Check account limits
Execution Errors
Problem: Command Not Found
[ERROR] /bin/bash: so-site-pipeline: command not found
Solution:
Load required modules in environment:
[campaign.ml-mapmaking.environment]
MODULE_LOAD = "module load python/3.11"
Or use full path to executable:
[campaign.ml-mapmaking]
executable = "/full/path/to/so-site-pipeline"
Problem: Permission Denied
[ERROR] Permission denied: /path/to/output
Solution:
Check directory exists and is writable:
ls -ld /path/to/output
Create directory if needed:
mkdir -p /path/to/output
Check file system is mounted on compute nodes
Problem: Import Error
ImportError: No module named 'sotodlib'
Solution:
Ensure the package is importable in the job's Python environment, for example by adding it to PYTHONPATH:
[campaign.ml-mapmaking.environment]
PYTHONPATH = "/path/to/sotodlib:$PYTHONPATH"
Or load module:
[campaign.ml-mapmaking.environment]
MODULE_LOAD = "module load sotodlib"
Data Errors
Problem: Context File Not Found
FileNotFoundError: /path/to/context.yaml
Solution:
Use absolute path with file:/// prefix
Verify file exists on compute nodes
Check file permissions
Problem: Invalid Query
[ERROR] SQL syntax error in query
Solution:
Validate query syntax
Use query file for complex queries:
query = "file:///path/to/query.txt"
Test query against context file manually
Problem: No Data Matches Query
[WARNING] Query returned 0 observations
Solution:
Verify query syntax
Check observation IDs exist in context
Broaden query criteria
RADICAL-Pilot Errors
Problem: Pilot Failed to Start
[ERROR] Pilot submission failed
Solution:
Check the pilot log in the RADICAL-Pilot sandbox:
cat ~/radical.pilot.sandbox/rp.session.*/pilot.*/pilot.log
Verify resource configuration
Check SLURM account is valid
Ensure adequate resources available
Problem: Task Submission Failed
[ERROR] Task submission to pilot failed
Solution:
Check pilot is running:
squeue -u $USER
Verify task description is valid
Check pilot has sufficient resources for task
Performance Issues
Problem: Slow Execution
Diagnosis:
# Check CPU utilization
ssh <compute-node>
top
# Check I/O wait
iostat -x 5
Solutions:
If CPU idle: Increase parallelism (ranks/threads)
If I/O bound: Use faster storage, reduce I/O operations
If memory bandwidth limited: Reduce threads per rank
Problem: Inefficient Scheduling
Diagnosis:
Jobs running sequentially instead of in parallel
Long idle times between jobs
Solution:
Review workflow dependencies
Check deadline is realistic
Consider manual scheduling for small campaigns
Debugging Workflow
Step-by-Step Debugging Process
Validate Configuration
socm -t campaign.toml --dry-run
Check File Paths
ls -l /path/to/context.yaml
ls -ld /path/to/output
Test on Small Dataset
Create minimal configuration with single observation
Monitor SLURM
# Watch job queue
watch -n 5 'squeue -u $USER'
# Check job details
scontrol show job <job_id>
Check Logs
Campaign manager stdout
SLURM job output files
RADICAL-Pilot logs
Application logs in output directory
Verify Environment
SSH to compute node and verify:
Modules loaded
Environment variables set
Executables in PATH
Data files accessible
Common Patterns
Pattern: Incremental Testing
Start small and scale up:
Single observation, minimal iterations
maxiter = "10"
query = "obs_id='single_obs'"
Small dataset, full iterations
maxiter = "100"
query = "file:///path/to/small_query.txt"
Full dataset
maxiter = "200,200"
query = "file:///path/to/full_query.txt"
Pattern: Resource Tuning
Systematically find optimal resources:
Run with conservative estimates
Monitor actual usage
Adjust based on observations:
Memory: Actual max + 20%
Runtime: Actual + 50%
Cores: Test weak scaling
Document findings for future campaigns
Pattern: Error Recovery
When campaigns fail:
Identify failed workflows
Check logs and output directories
Determine cause
Read error messages, check resource usage
Fix configuration
Adjust based on cause (more memory, longer runtime, etc.)
Remove completed workflows
Comment out successful workflows in TOML
Rerun failed workflows
Run campaign with updated configuration
Getting Help
When to Seek Help
Seek help if:
Error messages are unclear
Issue persists after troubleshooting
Suspected bug in SO Campaign Manager
Need feature not available
How to Report Issues
When reporting issues, include:
Minimal reproducible example
Simplified configuration
Sample data if possible
Error messages
Full error output
Relevant log excerpts
Environment information
SO Campaign Manager version
Python version
HPC system details
SLURM version
What you’ve tried
Troubleshooting steps taken
Configuration changes attempted
Where to Get Help
Documentation: Check SO Campaign Manager Documentation for comprehensive guides
GitHub Issues: https://github.com/simonsobs/so_campaign_manager/issues
HPC Support: Contact your HPC center for SLURM/system issues
Community: Simons Observatory Slack or mailing lists
Tips and Best Practices
Configuration Tips
Use version control for configuration files
Document your configurations with comments
Template common patterns for reuse
Test configurations before production runs
Keep configurations DRY using subcampaigns
Resource Management Tips
Start conservative with resource estimates
Monitor actual usage and adjust
Add safety buffers (20% memory, 50% runtime)
Use appropriate QoS for job priority
Consider cost (core-hours) vs. time tradeoff
Workflow Organization Tips
Group related workflows using subcampaigns
Name workflows descriptively
Document workflow purpose in configuration
Test workflows individually before campaigns
Track workflow versions for reproducibility
Debugging Tips
Enable verbose logging during debugging
Use dry-run mode to validate configuration
Test incrementally from simple to complex
Keep logs for successful runs (for comparison)
Document solutions to recurring issues
Additional Resources
Tutorial - Step-by-step tutorials
User Guide - Comprehensive user documentation
Workflows - Workflow-specific documentation
Architecture - System architecture and design
Advanced Topics - Advanced features and customization
Developer Guide - Contributing and development
Glossary
- Campaign
Collection of workflows scheduled together
- Workflow
Single computational task
- QoS (Quality of Service)
SLURM policy defining resource limits
- HEFT
Heterogeneous Earliest Finish Time scheduling algorithm
- Rank
MPI process
- Thread
OpenMP thread within a process
- Makespan
Total time to complete all workflows
- DAG
Directed Acyclic Graph (workflow dependencies)
- Enactor
Execution backend (e.g., RADICAL-Pilot)
- Planner
Scheduling algorithm (e.g., HEFT)
- Bookkeeper
Main orchestration component
- Null Test
Validation test using data splits
- Pilot Job
SLURM allocation managed by RADICAL-Pilot
- Task
RADICAL-Pilot unit of work (workflow instance)