Pipeline Outputs¶
This page describes all output files generated by the MAD4HATTER pipeline, organized by output type.
Output Structure¶
Below is the output folder structure for each workflow option (complete, qc, postprocessing).
Complete Workflow Outputs¶
results/
├── allele_data.txt # Main allele table (unmasked and masked sequences)
├── allele_data_collapsed.txt # Collapsed allele table (masked sequences only)
├── sample_coverage.txt # Sample-level coverage statistics
├── amplicon_coverage.txt # Amplicon-level coverage statistics
├── raw_dada2_output/
│   └── dada2.clusters.txt # Raw DADA2 output (ASV sequences and counts)
├── resistance_marker_module/ # Resistance marker analysis outputs (if enabled)
│   ├── resmarker_table.txt
│   ├── resmarker_table_by_locus.txt
│   ├── resmarker_microhaplotype_table.txt
│   └── all_mutations_table.txt
├── panel_information/ # Panel configuration files used
│   ├── amplicon_info.tsv
│   ├── reference.fasta
│   └── resmarker_info.tsv
└── run/ # Run metadata
    ├── parameters.tsv # Parameters used in this run
    └── runtime.tsv # Runtime information
QC Output Files¶
results/
├── sample_coverage.txt # Sample-level coverage statistics
├── amplicon_coverage.txt # Target-level coverage statistics
├── panel_information/ # Panel configuration files used
│   └── amplicon_info.tsv
└── run/ # Run metadata
    ├── parameters.tsv # Parameters used in this run
    └── runtime.tsv # Runtime information
Postprocessing Output Files¶
results/
├── allele_data.txt # Main allele table (unmasked and masked sequences)
├── allele_data_collapsed.txt # Collapsed allele table (masked sequences only)
├── panel_information/ # Panel configuration files used
│   ├── amplicon_info.tsv
│   └── reference.fasta
└── run/ # Run metadata
    ├── parameters.tsv # Parameters used in this run
    └── runtime.tsv # Runtime information
Coverage Files¶
sample_coverage.txt¶
Description: Sample-level coverage statistics showing how many reads pass through each processing stage.
Columns:
| Column | Description |
|---|---|
| sample_name | Sample identifier |
| stage | Processing stage name |
| reads | Number of reads at this stage |
Stages:
- Input: Starting number of reads from raw FASTQ files
- No Dimers: Reads remaining after removing Illumina adapter dimers
- Amplicons: Reads with expected primers attached (after cutadapt)
The following stages are only included if the complete workflow is run.
- OutputDada2: Number of denoised sequences (ASVs) after DADA2 processing
- OutputPostprocessing: Final number of sequences after alignment filtering
Example:
sample_name stage reads
SRR26819135_S1_L001 Input 50000
SRR26819135_S1_L001 No Dimers 49494
SRR26819135_S1_L001 Amplicons 40259
SRR26819135_S1_L001 OutputDada2 40217
SRR26819135_S1_L001 OutputPostprocessing 40217
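As an illustration of how this table can be read, the following minimal sketch computes the fraction of input reads retained at each stage. It assumes the file is tab-separated with the three columns listed above; the helper names `read_stage_counts` and `retention` are hypothetical and not part of the pipeline.

```python
import csv
import io
from collections import defaultdict

# Inline copy of the example above; a real run would open
# results/sample_coverage.txt instead (assumed to be tab-separated).
EXAMPLE = """sample_name\tstage\treads
SRR26819135_S1_L001\tInput\t50000
SRR26819135_S1_L001\tNo Dimers\t49494
SRR26819135_S1_L001\tAmplicons\t40259
SRR26819135_S1_L001\tOutputDada2\t40217
SRR26819135_S1_L001\tOutputPostprocessing\t40217
"""


def read_stage_counts(handle):
    """Return {sample_name: {stage: reads}} from a coverage table."""
    counts = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["sample_name"]][row["stage"]] = int(row["reads"])
    return dict(counts)


def retention(stages, baseline="Input"):
    """Fraction of the baseline stage's reads remaining at each stage."""
    total = stages[baseline]
    if total == 0:
        return {stage: 0.0 for stage in stages}
    return {stage: reads / total for stage, reads in stages.items()}


counts = read_stage_counts(io.StringIO(EXAMPLE))
for sample, stages in counts.items():
    for stage, fraction in retention(stages).items():
        print(f"{sample}\t{stage}\t{fraction:.3f}")
```

A large drop between Input and No Dimers, or between No Dimers and Amplicons, points to the stage where most reads were lost.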
amplicon_coverage.txt¶
Description: Amplicon-level coverage statistics showing read counts per target for each sample. The columns OutputDada2 and OutputPostprocessing will only be added if the complete workflow has been run.
Columns:
| Column | Description |
|---|---|
| sample_name | Sample identifier |
| target_name | Target identifier (format: chromosome-start-end) |
| reads | Number of reads for this target after cutadapt |
| OutputDada2 | Number of denoised sequences (ASVs) for this target after DADA2 |
| OutputPostprocessing | Final number of sequences for this target after alignment filtering |
Example:
sample_name target_name reads OutputDada2 OutputPostprocessing
SRR26819135_S1_L001 Pf3D7_01_v3-0145420-0145630 54 54 54
SRR26819135_S1_L001 Pf3D7_01_v3-0162888-0163092 400 398 398
Interpreting Coverage
- reads: Raw read count after primer dimer removal and demultiplexing.
- OutputDada2: May be slightly lower than reads if some reads were filtered during DADA2 error correction.
- OutputPostprocessing: May be lower if sequences failed alignment thresholds. This could be due to off-target amplification, poor read quality, or contamination with human reads.
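One way to apply these interpretation notes is to flag targets with few final reads and report how much was lost between the reads and OutputPostprocessing columns. The sketch below assumes a tab-separated file; `MIN_READS` is a hypothetical threshold chosen for illustration and is not a pipeline default.

```python
import csv
import io

# Inline copy of the example above; a real run would open
# results/amplicon_coverage.txt instead (assumed to be tab-separated).
EXAMPLE = """sample_name\ttarget_name\treads\tOutputDada2\tOutputPostprocessing
SRR26819135_S1_L001\tPf3D7_01_v3-0145420-0145630\t54\t54\t54
SRR26819135_S1_L001\tPf3D7_01_v3-0162888-0163092\t400\t398\t398
"""

MIN_READS = 100  # hypothetical cut-off, not a pipeline default


def summarize_targets(handle, min_reads=MIN_READS):
    """Yield (sample, target, final_reads, fraction_lost, is_low) per row."""
    for row in csv.DictReader(handle, delimiter="\t"):
        start = int(row["reads"])
        # The Output* columns are absent when only the QC workflow was run,
        # so fall back to the cutadapt read count in that case.
        final = int(row.get("OutputPostprocessing") or start)
        lost = (start - final) / start if start else 0.0
        yield row["sample_name"], row["target_name"], final, lost, final < min_reads


rows = list(summarize_targets(io.StringIO(EXAMPLE)))
for sample, target, final, lost, is_low in rows:
    flag = "LOW" if is_low else "ok"
    print(f"{sample}\t{target}\t{final}\t{lost:.3%}\t{flag}")
```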
Allele Data Files¶
allele_data.txt¶
Description: The main allele table. It lists every identified allele (ASV) with both unmasked and masked sequences, plus mutation annotations.
Columns:
| Column | Description |
|---|---|
| sample_name | Sample identifier |
| target_name | Target identifier (common format: chromosome-insert_start-insert_end) |
| asv | Original ASV sequence (unmasked, with all bases) |
| pseudocigar_unmasked | PseudoCIGAR string describing mutations in unmasked sequence |
| asv_masked | ASV sequence with masked regions (low-complexity regions replaced with n) |
| pseudocigar_masked | PseudoCIGAR string describing mutations in masked sequence |
| reads | Number of reads supporting this ASV |
| pool | Pool identifier(s) (e.g., D1.1, R1.2) |
Key Features:
- Unmasked columns (asv, pseudocigar_unmasked): Full sequence information before masking
- Masked columns (asv_masked, pseudocigar_masked): Sequences with low-complexity regions masked (used for downstream analysis). Note that asv_masked contains alignment information to allow for masking.
- PseudoCIGAR format: See PseudoCIGAR Format section below
Example:
sample_name target_name asv pseudocigar_unmasked asv_masked pseudocigar_masked reads pool
SRR26819135_S1_L001 Pf3D7_01_v3-0145420-0145630 GATATGTTTAAATATA... 94A GATATGTTTAAATATA... 25+25N94A169+8N188+9N 54 D1.1
allele_data_collapsed.txt¶
Description: Simplified allele table containing only masked sequences and PseudoCIGARs. This is a collapsed version of allele_data.txt: ASVs that differed only by suspected errors in a masked region are merged into a single row.
Columns:
| Column | Description |
|---|---|
| sample_name | Sample identifier |
| target_name | Target identifier |
| asv_masked | ASV sequence with masked regions (low-complexity regions replaced with n) |
| pseudocigar_masked | PseudoCIGAR string describing mutations in masked sequence |
| reads | Number of reads supporting this ASV |
| pool | Pool identifier(s) |
When to use: This file is useful when you only need the masked sequences and mutation annotations, without the full unmasked sequences.
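The relationship between the two allele tables can be sketched as a group-by on the masked columns. This is only an illustration, not the pipeline's implementation: it assumes that the reads of merged rows are summed (this page does not state how reads are combined), and the sample, target, and sequence values below are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up rows: the first two ASVs differ only at position 6, which lies
# inside the masked region 4-10, so their masked columns are identical.
EXAMPLE = """sample_name\ttarget_name\tasv\tpseudocigar_unmasked\tasv_masked\tpseudocigar_masked\treads\tpool
S1\ttargetA\tACTTGATTGCACC\t13C\tACTnnnnnnnACC\t4+7N13C\t30\tD1.1
S1\ttargetA\tACTTGTTTGCACC\t6T13C\tACTnnnnnnnACC\t4+7N13C\t12\tD1.1
S1\ttargetB\tGGGTCTAAA\t5C\tGGGTCTAAA\t5C\t10\tD1.1
"""

# Columns that identify one row of the collapsed table.
KEY = ("sample_name", "target_name", "asv_masked", "pseudocigar_masked", "pool")


def collapse(handle):
    """Merge rows sharing the masked columns, summing reads (an assumption)."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[tuple(row[k] for k in KEY)] += int(row["reads"])
    return [dict(zip(KEY, key), reads=reads) for key, reads in totals.items()]


collapsed = collapse(io.StringIO(EXAMPLE))
for row in collapsed:
    print(row)
```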
Raw DADA2 Output¶
raw_dada2_output/dada2.clusters.txt¶
Description: Raw output from the DADA2 sequence inference step, before alignment and masking.
Columns:
| Column | Description |
|---|---|
| sampleID | Sample identifier (same as sample_name in other files) |
| locus | Target identifier (same as target_name in other files) |
| asv | Denoised ASV sequence (unmasked) |
| reads | Number of reads supporting this ASV |
| allele | Unique allele identifier (format: locus.integer, e.g., Pf3D7_01_v3-0145420-0145630.1) |
| norm.reads.locus | Normalized read count (proportion of reads for this locus) |
| n.alleles | Number of unique alleles found at this locus for this sample |
When to use: This file contains the raw DADA2 output before any alignment or masking steps. Useful for understanding initial sequence inference results. This can also be used as input to the postprocessing workflow if you wish to reanalyze the data with alternate parameters.
Example:
sampleID locus asv reads allele norm.reads.locus n.alleles
SRR26819135_S1_L001 Pf3D7_01_v3-0145420-0145630 GATATGTTTAAATATA... 54 Pf3D7_01_v3-0145420-0145630.1 1 1
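To make the two derived columns concrete, the sketch below recomputes them from the reads column. It assumes that norm.reads.locus is the ASV's reads divided by the total reads for the same sample and locus, which is consistent with the example row above but is not spelled out on this page. The second locus and its counts are hypothetical.

```python
from collections import defaultdict

records = [
    # Row taken from the example above (sequence omitted).
    {"sampleID": "SRR26819135_S1_L001", "locus": "Pf3D7_01_v3-0145420-0145630", "reads": 54},
    # Hypothetical locus with two alleles, for illustration only.
    {"sampleID": "SRR26819135_S1_L001", "locus": "hypothetical-locus", "reads": 300},
    {"sampleID": "SRR26819135_S1_L001", "locus": "hypothetical-locus", "reads": 100},
]


def add_locus_stats(rows):
    """Return copies of rows with norm.reads.locus and n.alleles added."""
    totals = defaultdict(int)
    n_alleles = defaultdict(int)
    for row in rows:
        key = (row["sampleID"], row["locus"])
        totals[key] += row["reads"]
        n_alleles[key] += 1
    out = []
    for row in rows:
        key = (row["sampleID"], row["locus"])
        share = row["reads"] / totals[key] if totals[key] else 0.0
        out.append({**row, "norm.reads.locus": share, "n.alleles": n_alleles[key]})
    return out


annotated = add_locus_stats(records)
for row in annotated:
    print(row["locus"], row["reads"], row["norm.reads.locus"], row["n.alleles"])
```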
PseudoCIGAR Format¶
The PseudoCIGAR string provides a compact representation of mutations and masked regions in ASV sequences relative to the reference.
Mutation Syntax¶
SNPs (Single Nucleotide Polymorphisms)¶
Format: {position}{base}
- position: Position in the reference sequence (1-based)
- base: The base in the ASV at this position (different from reference)
Example: 94A means the ASV has an A at position 94, while the reference has a different base.
Insertions¶
Format: {position}I={base}
- position: Position in the reference where the insertion occurs
- base: The base inserted in the ASV
Example: 10I=G means a G was inserted at position 10 in the ASV.
Deletions¶
Format: {position}D={base}
- position: Position in the reference sequence
- base: The base or bases deleted from the ASV (present in the reference but not in the ASV)
Example: 144D=TCT means the sequence TCT was deleted starting at position 144.
Mask Syntax¶
Format: {start_position}+{length}N
- start_position: Where the mask begins (1-based)
- length: Length of the masked region
- N: Indicates masked (low-complexity) region
Example: 25+25N means positions 25-49 (25 bases) are masked.
Multiple Mutations¶
Multiple mutations and masks are concatenated into a single PseudoCIGAR string. For example:
25+25N94A169+8N188+9N
This means:
- Positions 25-49 are masked (25 bases)
- Position 94 has SNP A
- Positions 169-176 are masked (8 bases)
- Positions 188-196 are masked (9 bases)
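The syntax described in this section can be tokenized with a single regular expression. The following is a minimal sketch that handles only the four token types documented here (SNP, insertion, deletion, mask) and assumes bases are drawn from A, C, G, T, and N. Any other notation the pipeline may emit is rejected with an error. `parse_pseudocigar` is a hypothetical helper, not a pipeline function.

```python
import re

# One token: a position followed by a mask, insertion, deletion, or SNP.
TOKEN = re.compile(
    r"(?P<pos>\d+)"
    r"(?:\+(?P<mask_len>\d+)N"      # mask:      {start}+{length}N
    r"|I=(?P<ins_bases>[ACGTN]+)"   # insertion: {position}I={bases}
    r"|D=(?P<del_bases>[ACGTN]+)"   # deletion:  {position}D={bases}
    r"|(?P<snp>[ACGTN]))"           # SNP:       {position}{base}
)


def parse_pseudocigar(cigar):
    """Return a list of (kind, position, value) tuples in string order."""
    events = []
    end = 0
    for match in TOKEN.finditer(cigar):
        if match.start() != end:  # unparsed characters between tokens
            raise ValueError(f"unexpected text at offset {end}: {cigar!r}")
        end = match.end()
        pos = int(match["pos"])
        if match["mask_len"] is not None:
            events.append(("mask", pos, int(match["mask_len"])))
        elif match["ins_bases"] is not None:
            events.append(("insertion", pos, match["ins_bases"]))
        elif match["del_bases"] is not None:
            events.append(("deletion", pos, match["del_bases"]))
        else:
            events.append(("snp", pos, match["snp"]))
    if end != len(cigar):  # trailing characters that matched no token
        raise ValueError(f"unexpected text at offset {end}: {cigar!r}")
    return events


for event in parse_pseudocigar("25+25N94A169+8N188+9N"):
    print(event)
```

A mask event `("mask", start, length)` covers positions `start` through `start + length - 1`, so `("mask", 25, 25)` corresponds to positions 25-49.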
Interpreting Masked Sequences¶
In asv_masked, masked regions are replaced with lowercase n characters:
Reference: ACTTGATTGCACA
ASV: ACTTGATTGCACA
Masked: ACTnnnnnnnACA (positions 4-10 masked)
PseudoCIGAR: 4+7N
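The coordinate arithmetic behind a mask can be checked with a few lines of code. This sketch assumes the sequence has no insertions or deletions relative to the reference, so sequence positions and reference positions coincide. The real asv_masked column carries alignment information, which this example does not model. `apply_masks` is a hypothetical helper.

```python
def apply_masks(sequence, masks):
    """Replace each (start, length) region with lowercase n.

    start is 1-based, matching the {start_position}+{length}N syntax.
    """
    chars = list(sequence)
    for start, length in masks:
        stop = min(start - 1 + length, len(chars))
        for index in range(start - 1, stop):
            chars[index] = "n"
    return "".join(chars)


# Mask 4+7N applied to the 13-base reference from the example.
print(apply_masks("ACTTGATTGCACA", [(4, 7)]))
```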
Resistance Marker Module Outputs¶
These files are generated when resistance marker analysis is enabled. They provide codon-level and mutation-level information for resistance-associated genes.
resmarker_table.txt¶
Description: Comprehensive resistance marker table showing codon-level information for all resistance-associated genes.
Columns:
| Column | Description |
|---|---|
| sample_name | Sample identifier |
| gene_id | Gene identifier (e.g., 0417200) |
| gene | Gene name (e.g., dhfr, mdr1, crt) |
| aa_position | Codon/amino acid position in the gene |
| ref_codon | Reference codon sequence (3 bases) |
| codon | Observed codon sequence in the sample |
| codon_ref_alt | Whether codon is reference (REF) or alternative (ALT) |
| ref_aa | Reference amino acid |
| aa | Observed amino acid |
| aa_ref_alt | Whether amino acid is reference (REF) or alternative (ALT) |
| follows_indel | Boolean indicating if this codon follows an insertion/deletion |
| codon_masked | Boolean indicating if this codon is in a masked region |
| multiple_loci | Boolean indicating if this codon maps to multiple targets |
| reads | Number of reads supporting this codon call |
Example:
sample_name gene_id gene aa_position ref_codon codon codon_ref_alt ref_aa aa aa_ref_alt follows_indel codon_masked multiple_loci reads
SRR26819135_S1_L001 0417200 dhfr 16 GCA GCA REF A A REF False False False 127
SRR26819135_S1_L001 0417200 dhfr 51 AAT ATT ALT N I ALT False False False 127
Interpreting Resmarker Table
This file is a collapsed version of resmarker_table_by_locus.txt. If a codon is found in multiple targets for a specific marker, the reads are summed and the rows are collapsed into one; in that case multiple_loci is True. If you need to analyze the data per target, for example to filter out a poorly performing target or to combine reads in a different way (e.g., mean), use resmarker_table_by_locus.txt.
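The collapsing rule described above can be sketched as follows. This is an illustration of the described behavior rather than the pipeline's own code: the choice of grouping columns is an assumption, and the second target name and all read counts are hypothetical.

```python
from collections import defaultdict

# Per-target rows, as they might appear in resmarker_table_by_locus.txt.
by_locus = [
    {"sample_name": "S1", "gene_id": "0417200", "gene": "dhfr", "aa_position": 51,
     "codon": "ATT", "target_name": "Pf3D7_04_v3-0748128-0748326", "reads": 80},
    {"sample_name": "S1", "gene_id": "0417200", "gene": "dhfr", "aa_position": 51,
     "codon": "ATT", "target_name": "hypothetical-overlapping-target", "reads": 20},
    {"sample_name": "S1", "gene_id": "0417200", "gene": "dhfr", "aa_position": 16,
     "codon": "GCA", "target_name": "Pf3D7_04_v3-0748128-0748326", "reads": 80},
]

# Assumed grouping key: one collapsed row per sample, gene, position, and codon.
CODON_KEY = ("sample_name", "gene_id", "gene", "aa_position", "codon")


def collapse_by_codon(rows):
    """Sum reads across targets and mark codons seen in more than one target."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[tuple(row[k] for k in CODON_KEY)].append(row)
    collapsed = []
    for key, members in grouped.items():
        merged = dict(zip(CODON_KEY, key))
        merged["reads"] = sum(m["reads"] for m in members)
        merged["multiple_loci"] = len({m["target_name"] for m in members}) > 1
        collapsed.append(merged)
    return collapsed


collapsed = collapse_by_codon(by_locus)
for row in collapsed:
    print(row["gene"], row["aa_position"], row["codon"], row["reads"], row["multiple_loci"])
```

Replacing `sum` with `max` or a mean in `collapse_by_codon` gives the alternative ways of combining reads mentioned above.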
resmarker_table_by_locus.txt¶
Description: Resistance marker table organized by target, showing which target each codon call comes from. If you have no overlapping targets then this information will be the same as the above file (resmarker_table.txt).
Columns:
| Column | Description |
|---|---|
| sample_name | Sample identifier |
| gene_id | Gene identifier |
| gene | Gene name |
| target_name | Target identifier where this codon was observed |
| aa_position | Codon/amino acid position in the gene |
| ref_codon | Reference codon sequence |
| codon | Observed codon sequence |
| codon_ref_alt | Whether codon is REF or ALT |
| ref_aa | Reference amino acid |
| aa | Observed amino acid |
| aa_ref_alt | Whether amino acid is REF or ALT |
| follows_indel | Boolean indicating if codon follows an indel |
| codon_masked | Boolean indicating if codon is in a masked region |
| reads | Number of reads supporting this codon call |
Difference from resmarker_table.txt: This file includes target_name to show which target each codon call came from. This may be useful if you want to filter targets or collapse codons covered by multiple targets in a different way than the pipeline does (e.g., max or mean).
resmarker_microhaplotype_table.txt¶
Description: Table showing combinations of codons from the same microhaplotype for resistance markers of interest.
Columns:
| Column | Description |
|---|---|
| sample_name | Sample identifier |
| gene_id | Gene identifier |
| gene | Gene name |
| target_name | Target identifier |
| mhap_aa_positions | List of codon/amino acid positions in this microhaplotype (e.g., 16/51/59) |
| ref_mhap | Reference microhaplotype (amino acids separated by /, e.g., A/N/C) |
| mhap | Observed microhaplotype (amino acids separated by /, e.g., A/I/R) |
| mhap_ref_alt | Whether microhaplotype is REF or ALT |
| reads | Number of reads supporting this microhaplotype |
Example:
sample_name gene_id gene target_name mhap_aa_positions ref_mhap mhap mhap_ref_alt reads
SRR26819135_S1_L001 0417200 dhfr Pf3D7_04_v3-0748128-0748326 16/51/59 A/N/C A/I/R ALT 381
Microhaplotypes
Microhaplotypes represent combinations of mutations that are analyzed together. This is useful for understanding co-occurring mutations that may have different resistance implications than individual mutations.
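The slash-separated columns can be split and compared position by position to list the amino acid changes in a microhaplotype. The helper name `mhap_changes` below is hypothetical; the input values are taken from the example row above.

```python
def mhap_changes(positions, ref_mhap, mhap):
    """Return changes such as 'N51I' where observed differs from reference."""
    pos = positions.split("/")
    ref = ref_mhap.split("/")
    obs = mhap.split("/")
    if not (len(pos) == len(ref) == len(obs)):
        raise ValueError("microhaplotype fields have different lengths")
    return [f"{r}{p}{o}" for p, r, o in zip(pos, ref, obs) if r != o]


# Values from the dhfr example row: positions 16/51/59, A/N/C vs A/I/R.
print(mhap_changes("16/51/59", "A/N/C", "A/I/R"))
```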
all_mutations_table.txt¶
Description: Mutation table showing all single-nucleotide variants (SNVs) in targets that covered at least one marker of interest.
Columns:
| Column | Description |
|---|---|
| sample_name | Sample identifier |
| gene_id | Gene identifier |
| gene | Gene name |
| target_name | Target identifier |
| target_position | Position in the target (not gene position) |
| alt | Alternative (mutant) base |
| ref | Reference base |
| reads | Number of reads supporting this mutation |
When to use: This file provides a simple view of all mutations without codon/amino acid interpretation. It is useful for finding novel mutations outside the markers of interest.
Example:
sample_name gene_id gene target_name target_position alt ref reads
SRR26819135_S1_L001 0417200 dhfr Pf3D7_04_v3-0748128-0748326 110 T A 127
SRR26819135_S1_L001 0417200 dhfr Pf3D7_04_v3-0748128-0748326 133 C T 127
Panel Information Files¶
panel_information/¶
This directory contains copies of the panel configuration files used during the run:
- amplicon_info.tsv: Target definitions (targets, primers, coordinates)
- reference.fasta: Reference sequences used for alignment
- resmarker_info.tsv: Resistance marker definitions
When to use: Useful when using multiple pre-configured pools. The pipeline automatically generates these files from the list of pools. They can be informative for downstream analysis and for seeing which resistance markers were identified as being covered by targets.
Run Metadata Files¶
run/parameters.tsv¶
Description: Tab-separated file containing all parameters used in this pipeline run.
Format: Two columns: parameter name and parameter value.
Example:
pools D1,R1,R2
readDIR tests/example_data/example_fastq
workflow_name complete
cutadapt_minlen 100
quality_score 20
...
When to use: Useful for reproducing runs or understanding what settings were used.
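For scripted comparisons between runs, the two-column layout can be loaded into a dictionary. This sketch assumes the file has no header row, as in the example above, and keeps all values as strings; `read_parameters` is a hypothetical helper.

```python
import csv
import io

# Inline copy of the example above; a real run would open
# results/run/parameters.tsv instead.
EXAMPLE = (
    "pools\tD1,R1,R2\n"
    "readDIR\ttests/example_data/example_fastq\n"
    "workflow_name\tcomplete\n"
    "cutadapt_minlen\t100\n"
    "quality_score\t20\n"
)


def read_parameters(handle):
    """Return {parameter name: value} from a two-column tab-separated file."""
    params = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) >= 2:  # skip blank or malformed lines
            params[fields[0]] = fields[1]
    return params


params = read_parameters(io.StringIO(EXAMPLE))
print(params["workflow_name"])
print(params["pools"].split(","))
```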
run/runtime.tsv¶
Description: Runtime information about the pipeline execution.
Columns:
| Column | Description |
|---|---|
| PipelineVersion | Version of the pipeline |
| ContainerEngine | Container engine used (e.g., docker, apptainer) |
| Duration | Total runtime (e.g., 10m 44s) |
| CommandLine | Full command line used to run the pipeline |
| CommitId | Git commit ID (if available) |
| Complete | Timestamp when pipeline completed |
| ConfigFiles | Path to configuration files used |
| Container | Container image used |
| ExitStatus | Exit status (0 = success) |
| Profile | Nextflow profile used |
| RunName | Nextflow run name |
When to use: Useful for tracking pipeline versions, runtime, and debugging issues.
Understanding Output File Relationships¶
Raw FASTQ files
↓
sample_coverage.txt (tracks reads through stages)
↓
raw_dada2_output/dada2.clusters.txt (DADA2 inference)
↓
allele_data.txt (alignment, masking, mutation calling)
↓
allele_data_collapsed.txt (simplified version)
↓
resistance_marker_module/ (if enabled)
├── resmarker_table.txt (codon-level)
├── resmarker_table_by_locus.txt (with target info)
├── resmarker_microhaplotype_table.txt (haplotypes)
└── all_mutations_table.txt (discovery of novel mutations)
Next Steps¶
- Analyze results: Use the allele tables to identify variants in your samples
- Check quality: Review coverage files to ensure sufficient read depth
- Resistance analysis: If enabled, review resistance marker tables for drug resistance mutations
- Reproducibility: Use run/parameters.tsv to reproduce your analysis