Pipeline Resources
This page explains how to configure computational resources (CPU, memory, time) for pipeline modules. This is only necessary if you are having trouble with the default settings.
In many cases, pipeline failures are caused by insufficient memory in the DADA2 step. Most of the time, it is enough to use the config file with extra memory settings that is already supplied in the repository. For example:
nextflow run main.nf \
    --readDIR tests/example_data/example_fastq \
    --pools D1,R1,R2 \
    -profile sge,apptainer \
    -c conf/custom.config
Read the information below if more advanced optimization is required.
Understanding Resource Tiers
The pipeline organizes modules into resource tiers defined in conf/base.config:
| Tier | Typical Use | Example Modules |
|---|---|---|
| Single | Lightweight modules (file creation, simple processing) | CREATE_PRIMER_FILES, BUILD_ALLELETABLE |
| Low | Moderate processing (filtering, quality reports) | CUTADAPT, FILTER_ASVS, QUALITY_REPORT |
| Medium | Moderate-heavy processing (masking, reference creation) | MASK_SEQUENCES, CREATE_REFERENCE_FROM_GENOMES |
| High | Heavy processing (DADA2, alignment) | DADA2_ANALYSIS, ALIGN_TO_REFERENCE |
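If an entire tier needs more resources, Nextflow's withLabel selector can override every module that carries a given label in one place. The sketch below assumes that the tiers in conf/base.config are implemented as process labels and that the High tier's label is called process_high. Both are assumptions, so check conf/base.config for the actual names before using it.

```groovy
process {
    // 'process_high' is a hypothetical label name used for illustration;
    // look up the real label in conf/base.config.
    withLabel: 'process_high' {
        memory = '32 GB'
        time   = '1200m'   // 20 hours
    }
}
```

Settings under withName take priority over withLabel, so a module that also has its own withName entry keeps those values.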
Customizing Resources
Creating a Custom Config File
Create or edit conf/custom.config to override resource settings:
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EPPIcenter / mad4hatter Nextflow config file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Custom config options for institutions
----------------------------------------------------------------------------------------
*/
process {
    // Example: Customize DADA2_ANALYSIS resources
    withName: 'DADA2_ANALYSIS' {
        time   = '600m'   // 10 hours
        cpus   = 4        // 4 CPU cores
        penv   = 'smp'    // Parallel environment (SGE-specific)
        memory = '16 GB'  // 16 GB RAM
    }
    // Example: Customize CUTADAPT resources
    withName: 'CUTADAPT' {
        time   = '120m'   // 2 hours
        cpus   = 2
        memory = '8 GB'
    }
}
Resource Parameters
| Parameter | Description | Example Values |
|---|---|---|
| time | Maximum execution time | '120m' (2 hours), '600m' (10 hours), '24h' |
| cpus | Number of CPU cores | 2, 4, 8 |
| memory | RAM allocation | '8 GB', '16 GB', '32 GB' |
| penv | Parallel environment (SGE-specific) | 'smp' (usually don't change) |
Threading
Only certain modules benefit from multiple CPUs:
Multithreaded Modules
- DADA2_ANALYSIS - Benefits significantly from multiple cores
- DADA2_POSTPROC - Can use multiple cores
Single-threaded Modules
Most other modules are single-threaded. Increasing cpus for these will not make an individual job faster, although it may help when multiple samples are processed in parallel.
Memory vs. CPUs
Increasing cpus for multithreaded modules also increases memory usage. Monitor your system to find the right balance.
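One way to keep memory in step with the core count is a dynamic directive, which Nextflow evaluates for each task. The sketch below assumes roughly 4 GB per core for DADA2_ANALYSIS; that ratio is illustrative, not a measured requirement of the pipeline.

```groovy
process {
    withName: 'DADA2_ANALYSIS' {
        cpus   = 4
        // The closure is evaluated per task, so memory follows cpus (4 cores -> 16 GB).
        // The 4 GB-per-core ratio is an assumption; tune it from observed usage.
        memory = { 4.GB * task.cpus }
    }
}
```

With this in place, changing cpus alone rescales the memory request, which avoids raising one without the other.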
Executor Settings
Queue Size
Control how many jobs can run simultaneously:
executor {
    $sge {
        queueSize = 1000 // Maximum jobs in queue
    }
}

executor {
    $local {
        queueSize = 1000 // For local execution
    }
}
When to adjust:
- Decrease (for example, to 500) if your scheduler becomes overloaded
- Increase (for example, to 2000) for large datasets with many parallel samples
Queue Size vs. Threading
queueSize controls the number of jobs submitted, not threading within a module. This is useful for managing cluster load.
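If the scheduler itself is the bottleneck, queueSize can be combined with other executor options. This is a minimal sketch, assuming an SGE cluster. submitRateLimit and pollInterval are standard Nextflow executor settings, but the values shown are placeholders rather than recommendations from the pipeline authors.

```groovy
executor {
    $sge {
        queueSize       = 500        // fewer jobs queued at once
        submitRateLimit = '10/1min'  // submit at most 10 jobs per minute
        pollInterval    = '30 sec'   // check job status every 30 seconds
    }
}
```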
Example: Complete Custom Config
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Custom config for large dataset (20+ GB)
----------------------------------------------------------------------------------------
*/
process {
    // High-resource modules
    withName: 'DADA2_ANALYSIS' {
        time   = '1200m' // 20 hours
        cpus   = 8
        memory = '32 GB'
    }
    withName: 'ALIGN_TO_REFERENCE' {
        time   = '600m'
        cpus   = 4
        memory = '16 GB'
    }
    // Medium-resource modules
    withName: 'MASK_SEQUENCES' {
        time   = '300m'
        cpus   = 2
        memory = '8 GB'
    }
    // Low-resource modules
    withName: 'CUTADAPT' {
        time   = '180m'
        cpus   = 2
        memory = '4 GB'
    }
}

executor {
    $sge {
        queueSize = 2000 // Allow more parallel jobs
    }
}
Tips for Resource Management
- Start conservative - Begin with default values, then adjust based on actual usage
- Monitor job logs - Check if jobs are timing out or running out of memory
- Adjust incrementally - Change one parameter at a time to understand its impact
- Consider your cluster - Different HPC systems have different resource limits
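To see what each module actually used, Nextflow can write an execution report and a trace file. The sketch below could be added to conf/custom.config. The output file names are arbitrary choices, and the overwrite option assumes a reasonably recent Nextflow release.

```groovy
report {
    enabled   = true
    file      = 'execution_report.html'   // arbitrary file name
    overwrite = true
}

trace {
    enabled   = true
    file      = 'execution_trace.txt'     // arbitrary file name
    overwrite = true
    // Columns most relevant to resource tuning
    fields    = 'name,status,exit,attempt,cpus,memory,time,realtime,%cpu,peak_rss'
}
```

Comparing peak_rss with memory, and realtime with time, shows which modules are close to their limits and which are over-provisioned.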
Troubleshooting
Jobs Timing Out
Symptom: Jobs fail with timeout errors
Solution: Increase the time parameter for the failing module
Out of Memory Errors
Symptom: Jobs fail with memory errors
Solution: Increase the memory parameter for the failing module
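Instead of raising limits by hand after each failure, resources can be made to grow automatically on retry. This is a minimal sketch using Nextflow's errorStrategy, maxRetries, and task.attempt. The starting values mirror the DADA2_ANALYSIS example earlier on this page, and the retry count is an assumption.

```groovy
process {
    withName: 'DADA2_ANALYSIS' {
        errorStrategy = 'retry'
        maxRetries    = 2                   // assumption: give up after 2 retries
        // task.attempt is 1 on the first run, 2 on the first retry, and so on
        memory = { 16.GB * task.attempt }   // 16 GB, then 32 GB, then 48 GB
        time   = { 10.h * task.attempt }    // 10 h, then 20 h, then 30 h
    }
}
```

Note that errorStrategy = 'retry' retries on any failure, not only timeouts or memory errors, and that the scaled requests must still fit within your cluster's limits.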
Slow Processing
Symptom: Pipeline runs very slowly
Solutions:
- Increase cpus for multithreaded modules (DADA2_ANALYSIS, DADA2_POSTPROC)
- Increase queueSize to allow more parallel jobs
- Check if other users are using cluster resources