Pipeline Resources

This page explains how to configure computational resources (CPU, memory, time) for pipeline modules. This is only necessary if you are having trouble with the default settings.

If the pipeline is failing, it is often due to memory issues in the DADA2 step. In most of these cases, using the extra-memory config file that is already supplied in the repository is sufficient. For example:

nextflow run main.nf \
  --readDIR tests/example_data/example_fastq \
  --pools D1,R1,R2 \
  -profile sge,apptainer \
  -c conf/custom.config

Read the information below if more advanced optimization is required.

Understanding Resource Tiers

The pipeline organizes modules into resource tiers defined in conf/base.config:

| Tier   | Typical Use                                              | Example Modules                                |
|--------|----------------------------------------------------------|------------------------------------------------|
| Single | Lightweight modules (file creation, simple processing)   | CREATE_PRIMER_FILES, BUILD_ALLELETABLE         |
| Low    | Moderate processing (filtering, quality reports)         | CUTADAPT, FILTER_ASVS, QUALITY_REPORT          |
| Medium | Moderate-heavy processing (masking, reference creation)  | MASK_SEQUENCES, CREATE_REFERENCE_FROM_GENOMES  |
| High   | Heavy processing (DADA2, alignment)                      | DADA2_ANALYSIS, ALIGN_TO_REFERENCE             |
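
For orientation, the tier defaults in conf/base.config are process settings applied per tier rather than per module. The sketch below is only illustrative; the label names and values are placeholders, not the repository's actual definitions:

process {
    // Illustrative defaults for lightweight modules (placeholder values)
    withLabel: 'single' {
      cpus   = 1
      memory = '2 GB'
      time   = '30m'
    }

    // Illustrative defaults for heavy modules such as DADA2 (placeholder values)
    withLabel: 'high' {
      cpus   = 4
      memory = '16 GB'
      time   = '600m'
    }
}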

Customizing Resources

Creating a Custom Config File

Create or edit conf/custom.config to override resource settings:

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     EPPIcenter / mad4hatter Nextflow config file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          Custom config options for institutions
----------------------------------------------------------------------------------------
*/

process {
    // Example: Customize DADA2_ANALYSIS resources
    withName: 'DADA2_ANALYSIS' {
      time = '600m'      // 10 hours
      cpus = 4           // 4 CPU cores
      penv = 'smp'       // Parallel environment (SGE-specific)
      memory = '16 GB'   // 16 GB RAM
    }

    // Example: Customize CUTADAPT resources
    withName: 'CUTADAPT' {
      time = '120m'      // 2 hours
      cpus = 2
      memory = '8 GB'
    }
}

Resource Parameters

| Parameter | Description                          | Example Values                             |
|-----------|--------------------------------------|--------------------------------------------|
| time      | Maximum execution time               | '120m' (2 hours), '600m' (10 hours), '24h' |
| cpus      | Number of CPU cores                  | 2, 4, 8                                    |
| memory    | RAM allocation                       | '8 GB', '16 GB', '32 GB'                   |
| penv      | Parallel environment (SGE-specific)  | 'smp' (usually don't change)               |
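
Nextflow also accepts its own duration and memory literals in addition to quoted strings, so the values above can be written either way. A minimal sketch (the numbers are arbitrary):

process {
    withName: 'DADA2_ANALYSIS' {
      time   = 10.h    // equivalent to '600m'
      memory = 16.GB   // equivalent to '16 GB'
      cpus   = 4
    }
}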

Threading

Only certain modules benefit from multiple CPUs:

Multithreaded Modules

  • DADA2_ANALYSIS - Benefits significantly from multiple cores
  • DADA2_POSTPROC - Can use multiple cores

Single-threaded Modules

Most other modules are single-threaded. Increasing cpus for these modules won't make an individual task faster, but reserving extra slots can still help when many samples are processed in parallel on the same node.

Memory vs. CPUs

Increasing cpus for multithreaded modules also increases memory usage. Monitor your system to find the right balance.
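
If you only want to give extra cores to the modules that can actually use them, a single withName selector with a pattern covers both DADA2 steps. A sketch (adjust the numbers to your system; the memory value here is an assumption, not a recommendation):

process {
    // Only the DADA2 steps benefit from additional cores
    withName: 'DADA2_ANALYSIS|DADA2_POSTPROC' {
      cpus   = 8
      memory = '32 GB'   // more threads generally requires more memory
    }
}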


Executor Settings

Queue Size

Control how many jobs can run simultaneously:

executor {
  $sge {
      queueSize = 1000  // Maximum jobs in queue
  }
}

executor {
  $local {
      queueSize = 1000  // For local execution
  }
}

When to adjust:

  • Decrease (e.g. 500) if your scheduler becomes overloaded
  • Increase (e.g. 2000) for large datasets with many parallel samples

Queue Size vs. Threading

queueSize controls the number of jobs submitted, not threading within a module. This is useful for managing cluster load.


Example: Complete Custom Config

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     Custom config for large dataset (20+ GB)
----------------------------------------------------------------------------------------
*/

process {
    // High-resource modules
    withName: 'DADA2_ANALYSIS' {
      time = '1200m'    // 20 hours
      cpus = 8
      memory = '32 GB'
    }

    withName: 'ALIGN_TO_REFERENCE' {
      time = '600m'
      cpus = 4
      memory = '16 GB'
    }

    // Medium-resource modules
    withName: 'MASK_SEQUENCES' {
      time = '300m'
      cpus = 2
      memory = '8 GB'
    }

    // Low-resource modules
    withName: 'CUTADAPT' {
      time = '180m'
      cpus = 2
      memory = '4 GB'
    }
}

executor {
  $sge {
      queueSize = 2000  // Allow more parallel jobs
  }
}

Tips for Resource Management

  1. Start conservative - Begin with default values, then adjust based on actual usage
  2. Monitor job logs - Check if jobs are timing out or running out of memory (see the trace example after this list)
  3. Adjust incrementally - Change one parameter at a time to understand its impact
  4. Consider your cluster - Different HPC systems have different resource limits
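
One way to check actual usage (tip 2) is to ask Nextflow for a trace file and an HTML report, which record runtime, CPU usage, and peak memory per task. For example, reusing the command from the top of this page:

nextflow run main.nf \
  --readDIR tests/example_data/example_fastq \
  --pools D1,R1,R2 \
  -profile sge,apptainer \
  -c conf/custom.config \
  -with-trace \
  -with-report report.html

The trace file lists peak memory (peak_rss) and duration for each task, which makes it easy to see which modules actually need more resources.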

Troubleshooting

Jobs Timing Out

Symptom: Jobs fail with timeout errors

Solution: Increase time parameter for the failing module

Out of Memory Errors

Symptom: Jobs fail with memory errors

Solution: Increase memory parameter for the failing module
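
If memory needs vary a lot between runs, a common Nextflow pattern (not specific to this pipeline) is to retry a failed task with more memory on each attempt. A sketch for the DADA2 step:

process {
    withName: 'DADA2_ANALYSIS' {
      errorStrategy = 'retry'
      maxRetries    = 2
      // Request 16 GB on the first attempt, 32 GB on the second, and so on
      memory        = { 16.GB * task.attempt }
    }
}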

Slow Processing

Symptom: Pipeline runs very slowly

Solutions:

  • Increase cpus for multithreaded modules (DADA2_ANALYSIS, DADA2_POSTPROC)
  • Increase queueSize to allow more parallel jobs
  • Check if other users are using cluster resources