Pipeline Parameters

This page describes all customizable parameters in the MAD4HATTER pipeline, organized by module.

Cutadapt Parameters

Cutadapt handles adapter removal, primer trimming, and quality filtering. These parameters control the demultiplexing and quality control steps.

| Parameter | Description | Default | When to Adjust |
| --- | --- | --- | --- |
| --cutadapt_minlen | Minimum read length after trimming (shorter reads are discarded) | 100 | Decrease if your amplicons are short or to keep more reads; increase to discard more short fragments |
| --quality_score | Quality score threshold for trimming | 20 | Lower (e.g., 15) to keep more reads; raise (e.g., 25) for stricter filtering |
| --gtrim | Enable NextSeq-specific quality trimming | false | Set to true if you see issues with poly-G tails (uses --nextseq-trim instead of -q) |
| --allowed_errors | Number of mismatches allowed in adapter/primer sequences | 0 | Increase (e.g., 1-2) if primers have known mismatches |

Quality Trimming Behavior

  • When --gtrim false (default): uses standard quality trimming with Cutadapt's -q flag
  • When --gtrim true: uses NextSeq-specific trimming with Cutadapt's --nextseq-trim flag

Both use the --quality_score value as the threshold.
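
The flag selection described above can be sketched as a small shell function. This is only an illustration: the function name quality_trim_flag is hypothetical and is not part of the pipeline. Only the Cutadapt options -q and --nextseq-trim and the pipeline parameters --gtrim and --quality_score come from the text above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not pipeline code): map the --gtrim and
# --quality_score settings to the matching Cutadapt trimming option.
quality_trim_flag() {
  local gtrim="$1"          # "true" or "false"
  local quality_score="$2"  # e.g. 20
  if [ "$gtrim" = "true" ]; then
    # NextSeq-specific trimming: quality trimming that also removes trailing Gs
    printf '%s\n' "--nextseq-trim=${quality_score}"
  else
    # Standard quality trimming
    printf '%s\n' "-q ${quality_score}"
  fi
}

quality_trim_flag false 20
quality_trim_flag true 20
```

In both cases the same threshold is passed through; only the Cutadapt option changes.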

Quality Trimming Behavior

--nextseq-trim performs standard quality trimming and also trims trailing poly-G tails from reads. Poly-G tails are a common artifact of two-colour instruments (e.g., NextSeq). However, we found that this filter was removing real biological G bases in MAD4HATTER data. After testing, we also found that poly-G tails were being filtered out elsewhere in the pipeline. If you see issues in your data that could be attributed to poly-G tails, you can apply this filtering with --gtrim true, but check carefully for unintended consequences.

Example: Custom Cutadapt Parameters

nextflow run main.nf \
  --readDIR /path/to/data \
  --pools D1,R1,R2 \
  --sequencer nextseq \
  --cutadapt_minlen 75 \
  --quality_score 15 \
  --gtrim true \
  --allowed_errors 1 \
  -profile docker

For more information about Cutadapt parameters, refer to the Cutadapt documentation.


DADA2 Parameters

DADA2 infers amplicon sequences and can be tuned depending on your needs. These parameters control the sequence inference module.

| Parameter | Description | Default |
| --- | --- | --- |
| --omega_a | Abundance threshold: controls whether a sequence is likely a true variant vs. error | 1e-120 |
| --dada2_pool | Pooling method for information sharing across samples | pseudo |
| --band_size | Alignment heuristic: controls alignment when indels exceed this threshold | 16 |
| --maxEE | Maximum expected errors: reads exceeding this are discarded during filtering | 3 |
| --maxMismatch | Maximum mismatches allowed in overlap region during read merging | 0 |
| --just_concatenate | Concatenate non-overlapping reads instead of discarding | true |
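
To make --maxEE more concrete: DADA2 defines the expected errors of a read as the sum of its per-base error probabilities, EE = sum(10^(-Q/10)), where Q is the Phred quality score of each base. The sketch below computes this for a single quality string. It assumes Phred+33 encoding, and the function names expected_errors and passes_maxee are hypothetical helpers for illustration, not part of the pipeline or of DADA2.

```bash
#!/usr/bin/env bash
# Hypothetical helper: expected errors for one Phred+33 quality string.
expected_errors() {
  printf '%s\n' "$1" | awk '
    # Build a character -> ASCII code lookup for printable characters
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    {
      ee = 0
      n = length($0)
      for (j = 1; j <= n; j++) {
        q = ord[substr($0, j, 1)] - 33   # Phred score of base j
        ee += 10 ^ (-q / 10)             # error probability of base j
      }
      printf "%.4f\n", ee
    }'
}

# Hypothetical helper: exit status 0 if the read would be kept,
# i.e. its expected errors do not exceed the limit given as $2.
passes_maxee() {
  awk -v ee="$(expected_errors "$1")" -v max="$2" 'BEGIN { exit !(ee <= max) }'
}

expected_errors 'IIII5555'   # four Q40 bases plus four Q20 bases
if passes_maxee 'IIII5555' 3; then echo "keep read"; else echo "discard read"; fi
```

Raising --maxEE therefore keeps reads with more low-quality bases, while lowering it makes filtering stricter.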

DADA2 Pooling Methods

  • pseudo (default): Two-round approach that rescues low-abundance alleles
  • true: Full pooling across all samples
  • false: No pooling - each sample analyzed independently

Pseudo Pooling Details

Pseudo pooling involves:

  1. First round: DADA2 clustering on individual samples
  2. Second round: Pooling alleles from all samples as priors, then re-running DADA2

Benefits:

  • Rescues low-abundance alleles that appear in multiple samples
  • Increases sensitivity for variant detection

Trade-offs:

  • Approximately doubles runtime
  • May introduce false positives from PCR/sequencing errors
  • Default omega_a (1e-120) helps mitigate false positives

Example: Custom DADA2 Parameters

nextflow run main.nf \
  --readDIR /path/to/data \
  --pools D1,R1,R2 \
  --sequencer nextseq \
  --omega_a 1e-100 \
  --dada2_pool pseudo \
  --band_size 20 \
  --maxEE 4 \
  --maxMismatch 1 \
  -profile docker

Figure: Schematic of DADA2 with pseudo pooling. Source: Benjamin Callahan, https://benjjneb.github.io/dada2/pseudo.html

For more information about DADA2 parameters, refer to the DADA2 documentation.


Masking Parameters

Control how low-complexity regions (homopolymers and tandem repeats) are masked in sequences.

| Parameter | Description | Default |
| --- | --- | --- |
| --mask_homopolymers | Enable homopolymer masking | true |
| --homopolymer_threshold | Minimum homopolymer length to mask (e.g., 5 = mask runs of 5+ identical bases) | 5 |
| --mask_tandem_repeats | Enable tandem repeat masking | true |
| --trf_min_score | Tandem Repeats Finder minimum alignment score | 25 |
| --trf_max_period | Tandem Repeats Finder maximum pattern size | 3 |
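
As an illustration of what --homopolymer_threshold means, the sketch below masks every run of identical bases whose length is at least the threshold. Two things here are assumptions rather than documented pipeline behavior: masked bases are replaced with N, and the function name mask_homopolymers is a hypothetical helper. The pipeline's own implementation may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: replace homopolymer runs of length >= $2 with N.
# Assumes masking means substituting N for each base in the run.
mask_homopolymers() {
  printf '%s\n' "$1" | awk -v min="$2" '
    {
      out = ""; n = length($0); i = 1
      while (i <= n) {
        base = substr($0, i, 1)
        j = i
        # Extend j to the last position of the current run of identical bases
        while (j < n && substr($0, j + 1, 1) == base) j++
        len = j - i + 1
        # Emit N for long runs, otherwise the original bases
        for (k = 0; k < len; k++) out = out (len >= min ? "N" : base)
        i = j + 1
      }
      print out
    }'
}

mask_homopolymers ACGTTTTTTACG 5
mask_homopolymers ACGTTTTACG 5
```

With a threshold of 5, the run of six Ts in the first sequence is masked, while the run of four Ts in the second is left unchanged. Lowering the threshold masks shorter runs.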

For more information about Tandem Repeats Finder, refer to the Tandem Repeats Finder documentation.

Example: Custom Masking Parameters

nextflow run main.nf \
  --readDIR /path/to/data \
  --pools D1,R1,R2 \
  --sequencer nextseq \
  --mask_homopolymers true \
  --homopolymer_threshold 3 \
  --mask_tandem_repeats true \
  --trf_min_score 30 \
  -profile docker

Alignment Parameters

Control sequence alignment during post-processing.

| Parameter | Description | Default |
| --- | --- | --- |
| --alignment_threshold | Minimum alignment score: sequences below this are filtered out | 60 |

Lower values keep more sequences (including potential off-targets); higher values are more stringent.
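
The effect of the threshold can be sketched with a simple filter. The input format here (tab-separated sequence ID and alignment score) and the function name filter_by_alignment_score are assumptions made for illustration; they do not reflect the pipeline's internal file formats.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only lines whose score (column 2) is at least
# the threshold passed as $1. Input: "sequence_id<TAB>score" on stdin.
filter_by_alignment_score() {
  awk -F '\t' -v threshold="$1" '$2 >= threshold'
}

# Toy data: asv2 scores below 60 and is dropped; asv3 sits exactly on the threshold
printf 'asv1\t95\nasv2\t42\nasv3\t60\n' | filter_by_alignment_score 60
```

Sequences scoring exactly at the threshold are kept in this sketch, since only sequences below it are described as filtered out.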


Reference Sequence Parameters

Control how reference sequences are provided or generated.

| Parameter | Description | Default |
| --- | --- | --- |
| --refseq_fasta | Path to targeted reference sequence file | Auto-generated from pools |
| --genome | Path to full genome file (used to generate reference) | Not used |
| --amplicon_info | Path to custom amplicon info file | Auto-generated from pools |

Reference Priority

If both --refseq_fasta and --genome are provided, --refseq_fasta takes priority. If neither is provided, the pipeline automatically builds a reference from pool configurations.
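
The priority order described above can be sketched as a small decision function. The function name resolve_reference and its messages are hypothetical and only mirror the rule in the text; this is not the pipeline's actual logic or output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide which reference source would be used,
# given the (possibly empty) values of --refseq_fasta and --genome.
resolve_reference() {
  local refseq_fasta="$1"
  local genome="$2"
  if [ -n "$refseq_fasta" ]; then
    # --refseq_fasta takes priority, even if --genome is also given
    echo "use provided reference: $refseq_fasta"
  elif [ -n "$genome" ]; then
    echo "generate reference from genome: $genome"
  else
    echo "build reference from pool configuration"
  fi
}

resolve_reference "/path/to/refseq.fasta" "/path/to/genome.fasta"
resolve_reference "" "/path/to/genome.fasta"
resolve_reference "" ""
```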


Resistance Markers

| Parameter | Description | Default |
| --- | --- | --- |
| --resmarker_info | Path to resistance markers of interest covered by the panel | Auto-generated from pools |

Resistance

A list of resistance markers of interest is stored in the repository here. The pipeline will check which markers are covered by targets in your panel/pools and report on those. Although this list is extensive, if a marker you care about is missing, you can add it to your local copy of panel_information/principal_resistance_marker_info_table.tsv or provide it in a table via --resmarker_info. If you think the marker would be useful to others, please raise a new issue and label it as a feature request.

Getting Help

To see all available parameters with descriptions:

nextflow run main.nf --help