Pipeline Outputs¶

This page describes all output files generated by the MAD4HATTER pipeline, organized by output type.

Output Structure¶

The folder structures below show the outputs of each workflow option (complete, qc, postprocessing).

Complete Workflow Outputs¶

results/
├── allele_data.txt                    # Main allele table (unmasked and masked sequences)
├── allele_data_collapsed.txt          # Collapsed allele table (masked sequences only)
├── sample_coverage.txt                # Sample-level coverage statistics
├── amplicon_coverage.txt              # Amplicon-level coverage statistics
├── raw_dada2_output/
│   └── dada2.clusters.txt             # Raw DADA2 output (ASV sequences and counts)
├── resistance_marker_module/          # Resistance marker analysis outputs (if enabled)
│   ├── resmarker_table.txt
│   ├── resmarker_table_by_locus.txt
│   ├── resmarker_microhaplotype_table.txt
│   └── all_mutations_table.txt
├── panel_information/                 # Panel configuration files used
│   ├── amplicon_info.tsv
│   ├── reference.fasta
│   └── resmarker_info.tsv
└── run/                               # Run metadata
    ├── parameters.tsv                 # Parameters used in this run
    └── runtime.tsv                    # Runtime information

QC Output Files¶

results/
├── sample_coverage.txt                # Sample-level coverage statistics
├── amplicon_coverage.txt              # Target-level coverage statistics
├── panel_information/                 # Panel configuration files used
│   └── amplicon_info.tsv
└── run/                               # Run metadata
    ├── parameters.tsv                 # Parameters used in this run
    └── runtime.tsv                    # Runtime information

Postprocessing Output Files¶

results/
├── allele_data.txt                    # Main allele table (unmasked and masked sequences)
├── allele_data_collapsed.txt          # Collapsed allele table (masked sequences only)
├── panel_information/                 # Panel configuration files used
│   ├── amplicon_info.tsv
│   └── reference.fasta
└── run/                               # Run metadata
    ├── parameters.tsv                 # Parameters used in this run
    └── runtime.tsv                    # Runtime information

Coverage Files¶

sample_coverage.txt¶

Description: Sample-level coverage statistics showing how many reads pass through each processing stage.

Columns:

Column	Description
`sample_name`	Sample identifier
`stage`	Processing stage name
`reads`	Number of reads at this stage

Stages:

Input: Starting number of reads from raw FASTQ files
No Dimers: Reads remaining after removing Illumina adapter dimers
Amplicons: Reads with expected primers attached (after cutadapt)

The following stages are only included if the complete workflow is run.

OutputDada2: Number of denoised sequences (ASVs) after DADA2 processing
OutputPostprocessing: Final number of sequences after alignment filtering

Example:

sample_name          stage              reads
SRR26819135_S1_L001  Input              50000
SRR26819135_S1_L001  No Dimers          49494
SRR26819135_S1_L001  Amplicons          40259
SRR26819135_S1_L001  OutputDada2        40217
SRR26819135_S1_L001  OutputPostprocessing 40217
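As a rough illustration of how this table can be read, the sketch below computes the fraction of Input reads remaining at each stage. It assumes the file is tab-delimited with a header row, which this page does not state explicitly; `stage_retention` is a hypothetical helper name, not part of the pipeline.

```python
import csv
import io


def stage_retention(handle):
    """Return {sample: {stage: fraction of Input reads remaining}}."""
    counts = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        counts.setdefault(row["sample_name"], {})[row["stage"]] = int(row["reads"])
    retention = {}
    for sample, stages in counts.items():
        total = stages.get("Input", 0)
        # Guard against samples with no input reads.
        retention[sample] = {
            stage: (reads / total if total else 0.0)
            for stage, reads in stages.items()
        }
    return retention


# Rows copied from the example above, joined with tabs (assumed delimiter).
EXAMPLE = "\n".join([
    "sample_name\tstage\treads",
    "SRR26819135_S1_L001\tInput\t50000",
    "SRR26819135_S1_L001\tNo Dimers\t49494",
    "SRR26819135_S1_L001\tAmplicons\t40259",
    "SRR26819135_S1_L001\tOutputDada2\t40217",
    "SRR26819135_S1_L001\tOutputPostprocessing\t40217",
])

retention = stage_retention(io.StringIO(EXAMPLE))
for stage, fraction in retention["SRR26819135_S1_L001"].items():
    print(f"{stage}: {fraction:.1%}")
```

For a real run, pass an open file handle for results/sample_coverage.txt instead of the inline string.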

amplicon_coverage.txt¶

Description: Amplicon-level coverage statistics showing read counts per target for each sample. The columns OutputDada2 and OutputPostprocessing will only be added if the complete workflow has been run.

Columns:

Column	Description
`sample_name`	Sample identifier
`target_name`	Target identifier (format: `chromosome-start-end`)
`reads`	Number of reads for this target after cutadapt
`OutputDada2`	Number of denoised sequences (ASVs) for this target after DADA2
`OutputPostprocessing`	Final number of sequences for this target after alignment filtering

Example:

sample_name          target_name                        reads  OutputDada2  OutputPostprocessing
SRR26819135_S1_L001  Pf3D7_01_v3-0145420-0145630       54     54           54
SRR26819135_S1_L001  Pf3D7_01_v3-0162888-0163092       400    398          398

Interpreting Coverage

reads: Raw read count after primer dimer removal and demultiplexing.
OutputDada2: May be slightly lower than reads if some reads were filtered during DADA2 error correction.
OutputPostprocessing: May be lower if sequences failed alignment thresholds. Possible causes include off-target amplification, poor read quality, or contamination with human reads.
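The interpretation notes above could be turned into a simple screen, sketched below. The thresholds (`min_reads`, `min_retained`) and the function name are illustrative assumptions, not pipeline defaults, and the file is assumed to be tab-delimited. The OutputPostprocessing column is treated as optional because it is only present for the complete workflow.

```python
import csv
import io


def flag_targets(handle, min_reads=100, min_retained=0.9):
    """Return (sample, target, reason) for targets that look problematic."""
    flagged = []
    for row in csv.DictReader(handle, delimiter="\t"):
        reads = int(row["reads"])
        final = row.get("OutputPostprocessing")  # absent in QC-only runs
        if reads < min_reads:
            flagged.append((row["sample_name"], row["target_name"], "low_reads"))
        elif final not in (None, "") and reads > 0 and int(final) / reads < min_retained:
            flagged.append((row["sample_name"], row["target_name"], "lost_in_processing"))
    return flagged


# Rows copied from the example above, joined with tabs (assumed delimiter).
EXAMPLE = "\n".join([
    "sample_name\ttarget_name\treads\tOutputDada2\tOutputPostprocessing",
    "SRR26819135_S1_L001\tPf3D7_01_v3-0145420-0145630\t54\t54\t54",
    "SRR26819135_S1_L001\tPf3D7_01_v3-0162888-0163092\t400\t398\t398",
])

flagged = flag_targets(io.StringIO(EXAMPLE))
print(flagged)
```

With these illustrative thresholds, only the first target is flagged, because it has fewer than 100 reads; the second retains 398 of 400 reads and passes.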

Allele Data Files¶

allele_data.txt¶

Description: The main allele table, containing all identified alleles (ASVs) with both unmasked and masked sequences, plus mutation annotations.

Columns:

Column	Description
`sample_name`	Sample identifier
`target_name`	Target identifier (common format: `chromosome-insert_start-insert_end`)
`asv`	Original ASV sequence (unmasked, with all bases)
`pseudocigar_unmasked`	PseudoCIGAR string describing mutations in unmasked sequence
`asv_masked`	ASV sequence with masked regions (low-complexity regions replaced with `n`)
`pseudocigar_masked`	PseudoCIGAR string describing mutations in masked sequence
`reads`	Number of reads supporting this ASV
`pool`	Pool identifier(s) (e.g., `D1.1`, `R1.2`)

Key Features:

Unmasked columns (asv, pseudocigar_unmasked): Full sequence information before masking
Masked columns (asv_masked, pseudocigar_masked): Sequences with low-complexity regions masked (used for downstream analysis). Note that asv_masked contains alignment information to allow for masking.
PseudoCIGAR format: See PseudoCIGAR Format section below

Example:

sample_name          target_name                        asv                    pseudocigar_unmasked  asv_masked              pseudocigar_masked    reads  pool
SRR26819135_S1_L001  Pf3D7_01_v3-0145420-0145630        GATATGTTTAAATATA...    94A                    GATATGTTTAAATATA...    25+25N94A169+8N188+9N  54     D1.1

allele_data_collapsed.txt¶

Description: Simplified allele table containing only masked sequences and PseudoCIGAR strings. This is a collapsed version of allele_data.txt: ASVs that differ only within a masked region, where the differences are suspected errors, are merged into a single row.

Columns:

Column	Description
`sample_name`	Sample identifier
`target_name`	Target identifier
`asv_masked`	ASV sequence with masked regions (low-complexity regions replaced with `n`)
`pseudocigar_masked`	PseudoCIGAR string describing mutations in masked sequence
`reads`	Number of reads supporting this ASV
`pool`	Pool identifier(s)

When to use: This file is useful when you only need the masked sequences and mutation annotations, without the full unmasked sequences.
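To make the relationship between the two allele tables concrete, here is a minimal sketch of the collapsing idea: rows that share the same masked sequence are merged. It assumes the read counts of merged rows are summed, which this page does not state explicitly, and the input rows are invented toy data rather than real pipeline output.

```python
from collections import defaultdict

KEY_FIELDS = ("sample_name", "target_name", "asv_masked", "pseudocigar_masked", "pool")


def collapse_masked(rows):
    """Merge rows with identical masked sequences (assumption: reads are summed)."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[field] for field in KEY_FIELDS)
        totals[key] += int(row["reads"])
    return [dict(zip(KEY_FIELDS, key), reads=reads) for key, reads in totals.items()]


# Hypothetical rows: two ASVs that differ only inside the masked region.
rows = [
    {"sample_name": "sample_A", "target_name": "target_1", "asv": "ACTTGATTGCACA",
     "asv_masked": "ACTnnnnnnnACA", "pseudocigar_masked": "4+7N", "reads": "30", "pool": "D1.1"},
    {"sample_name": "sample_A", "target_name": "target_1", "asv": "ACTTGTTTGCACA",
     "asv_masked": "ACTnnnnnnnACA", "pseudocigar_masked": "4+7N", "reads": "5", "pool": "D1.1"},
]

collapsed = collapse_masked(rows)
print(collapsed)
```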

Raw DADA2 Output¶

raw_dada2_output/dada2.clusters.txt¶

Description: Raw output from the DADA2 sequence inference step, before alignment and masking.

Columns:

Column	Description
`sampleID`	Sample identifier (same as `sample_name` in other files)
`locus`	Target identifier (same as `target_name` in other files)
`asv`	Denoised ASV sequence (unmasked)
`reads`	Number of reads supporting this ASV
`allele`	Unique allele identifier (format: `locus.integer`, e.g., `Pf3D7_01_v3-0145420-0145630.1`)
`norm.reads.locus`	Normalized read count (proportion of reads for this locus)
`n.alleles`	Number of unique alleles found at this locus for this sample

When to use: This file contains the raw DADA2 output before any alignment or masking steps. Useful for understanding initial sequence inference results. This can also be used as input to the postprocessing workflow if you wish to reanalyze the data with alternate parameters.

Example:

sampleID            locus                           asv                    reads  allele                                    norm.reads.locus  n.alleles
SRR26819135_S1_L001 Pf3D7_01_v3-0145420-0145630    GATATGTTTAAATATA...    54     Pf3D7_01_v3-0145420-0145630.1              1                 1
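The two derived columns can be illustrated with a short sketch. It assumes `norm.reads.locus` is the ASV's reads divided by the total reads for that sample and locus, which matches the column description above but is not confirmed against the pipeline code. The second locus in the input is hypothetical.

```python
from collections import defaultdict


def add_locus_proportions(rows):
    """Recompute norm.reads.locus and n.alleles per (sampleID, locus)."""
    rows = list(rows)
    totals = defaultdict(int)
    n_alleles = defaultdict(int)
    for row in rows:
        key = (row["sampleID"], row["locus"])
        totals[key] += int(row["reads"])
        n_alleles[key] += 1
    out = []
    for row in rows:
        key = (row["sampleID"], row["locus"])
        total = totals[key]
        out.append({
            **row,
            "norm.reads.locus": int(row["reads"]) / total if total else 0.0,
            "n.alleles": n_alleles[key],
        })
    return out


rows = [
    # Taken from the example above: one allele with all 54 reads.
    {"sampleID": "SRR26819135_S1_L001", "locus": "Pf3D7_01_v3-0145420-0145630", "reads": "54"},
    # Hypothetical locus with two alleles, for illustration only.
    {"sampleID": "SRR26819135_S1_L001", "locus": "hypothetical_locus", "reads": "300"},
    {"sampleID": "SRR26819135_S1_L001", "locus": "hypothetical_locus", "reads": "100"},
]

annotated = add_locus_proportions(rows)
for row in annotated:
    print(row["locus"], row["norm.reads.locus"], row["n.alleles"])
```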

PseudoCIGAR Format¶

The PseudoCIGAR string provides a compact representation of mutations and masked regions in ASV sequences relative to the reference.

Mutation Syntax¶

SNPs (Single Nucleotide Polymorphisms)¶

Format: {position}{base}

position: Position in the reference sequence (1-based)
base: The base in the ASV at this position (different from reference)

Example: 94A means the ASV has an A at position 94, while the reference has a different base.

Insertions¶

Format: {position}I={base}

position: Position in the reference where the insertion occurs
base: The base inserted in the ASV

Example: 10I=G means a G was inserted at position 10 in the ASV.

Deletions¶

Format: {position}D={base}

position: Position in the reference sequence
base: The base or bases deleted from the ASV (present in the reference but not in the ASV)

Example: 144D=TCT means the sequence TCT was deleted starting at position 144.

Mask Syntax¶

Format: {start_position}+{length}N

start_position: Where the mask begins (1-based)
length: Length of the masked region
N: Indicates masked (low-complexity) region

Example: 25+25N means positions 25-49 (25 bases) are masked.

Multiple Mutations¶

Multiple mutations and masks are concatenated into a single PseudoCIGAR string. For example:

25+25N94A169+8N188+9N

This means:

Positions 25-49 are masked (25 bases)
Position 94 has SNP A
Positions 169-176 are masked (8 bases)
Positions 188-196 are masked (9 bases)
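The syntax described in this section can be sketched as a small parser. This is an illustration built only from the four forms documented on this page (SNP, insertion, deletion, mask); the function name and the tuple layout are assumptions, and any other token raises an error rather than being guessed at.

```python
import re

# One alternative per documented form; order matters so masks are tried first.
TOKEN = re.compile(
    r"(?P<mask_start>\d+)\+(?P<mask_len>\d+)N"
    r"|(?P<ins_pos>\d+)I=(?P<ins_bases>[ACGTN]+)"
    r"|(?P<del_pos>\d+)D=(?P<del_bases>[ACGTN]+)"
    r"|(?P<snp_pos>\d+)(?P<snp_base>[ACGTN])"
)


def parse_pseudocigar(text):
    """Split a PseudoCIGAR string into (kind, position, detail) tuples."""
    events = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unrecognised PseudoCIGAR token at offset {pos}: {text[pos:]!r}")
        if match.group("mask_start"):
            start = int(match.group("mask_start"))
            length = int(match.group("mask_len"))
            # For masks, detail is the last masked position (inclusive).
            events.append(("mask", start, start + length - 1))
        elif match.group("ins_pos"):
            events.append(("insertion", int(match.group("ins_pos")), match.group("ins_bases")))
        elif match.group("del_pos"):
            events.append(("deletion", int(match.group("del_pos")), match.group("del_bases")))
        else:
            events.append(("snp", int(match.group("snp_pos")), match.group("snp_base")))
        pos = match.end()
    return events


events = parse_pseudocigar("25+25N94A169+8N188+9N")
for event in events:
    print(event)
```

Applied to the example string, this yields masks over 25-49, 169-176, and 188-196, plus the SNP at position 94.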

Interpreting Masked Sequences¶

In asv_masked, masked regions are replaced with lowercase n characters:

Reference:  ACTTGATTGCACA
ASV:        ACTTGATTGCACA
Masked:     ACTnnnnnnnACA  (positions 4-10 masked)
PseudoCIGAR: 4+7N
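A minimal sketch of how a single mask maps onto a sequence, assuming the ASV has no insertions or deletions relative to the reference so that 1-based reference positions line up with string indices. `apply_mask` is a hypothetical helper, not a pipeline function.

```python
def apply_mask(sequence, start, length):
    """Replace `length` bases beginning at 1-based `start` with lowercase n."""
    begin = start - 1
    end = min(begin + length, len(sequence))
    return sequence[:begin] + "n" * (end - begin) + sequence[end:]


# The mask 4+7N applied to the 13-base sequence from the example above.
masked = apply_mask("ACTTGATTGCACA", start=4, length=7)
print(masked)
```

The masked sequence keeps the same length as the input: three unmasked bases, seven `n` characters, then the final three bases.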

Resistance Marker Module Outputs¶

These files are generated when resistance marker analysis is enabled. They provide codon-level and mutation-level information for resistance-associated genes.

resmarker_table.txt¶

Description: Comprehensive resistance marker table showing codon-level information for all resistance-associated genes.

Columns:

Column	Description
`sample_name`	Sample identifier
`gene_id`	Gene identifier (e.g., `0417200`)
`gene`	Gene name (e.g., `dhfr`, `mdr1`, `crt`)
`aa_position`	Codon/amino acid position in the gene
`ref_codon`	Reference codon sequence (3 bases)
`codon`	Observed codon sequence in the sample
`codon_ref_alt`	Whether codon is reference (`REF`) or alternative (`ALT`)
`ref_aa`	Reference amino acid
`aa`	Observed amino acid
`aa_ref_alt`	Whether amino acid is reference (`REF`) or alternative (`ALT`)
`follows_indel`	Boolean indicating if this codon follows an insertion/deletion
`codon_masked`	Boolean indicating if this codon is in a masked region
`multiple_loci`	Boolean indicating if this codon maps to multiple targets
`reads`	Number of reads supporting this codon call

Example:

sample_name          gene_id  gene  aa_position  ref_codon  codon  codon_ref_alt  ref_aa  aa  aa_ref_alt  follows_indel  codon_masked  multiple_loci  reads
SRR26819135_S1_L001  0417200  dhfr  16           GCA        GCA   REF            A       A   REF         False           False         False         127
SRR26819135_S1_L001  0417200  dhfr  51           AAT        ATT   ALT            N       I   ALT         False           False         False         127

Interpreting Resmarker Table

This file is a collapsed version of resmarker_table_by_locus.txt. If a codon is covered by multiple targets for a given marker, the reads are summed and the rows are collapsed into one, and multiple_loci is set to True. To analyse the data per target, for example to filter out a poorly performing target or to combine reads in a different way (e.g., mean), use resmarker_table_by_locus.txt.

resmarker_table_by_locus.txt¶

Description: Resistance marker table organized by target, showing which target each codon call comes from. If no targets overlap, this file contains the same information as resmarker_table.txt.

Columns:

Column	Description
`sample_name`	Sample identifier
`gene_id`	Gene identifier
`gene`	Gene name
`target_name`	Target identifier where this codon was observed
`aa_position`	Codon/amino acid position in the gene
`ref_codon`	Reference codon sequence
`codon`	Observed codon sequence
`codon_ref_alt`	Whether codon is `REF` or `ALT`
`ref_aa`	Reference amino acid
`aa`	Observed amino acid
`aa_ref_alt`	Whether amino acid is `REF` or `ALT`
`follows_indel`	Boolean indicating if codon follows an indel
`codon_masked`	Boolean indicating if codon is in a masked region
`reads`	Number of reads supporting this codon call

Difference from resmarker_table.txt: This file includes target_name to show which target each codon call came from. This may be useful if you want to filter targets or collapse codons covered by multiple targets in a different way than the pipeline does (e.g., max or mean).
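Collapsing per-target codon calls in a different way might look like the sketch below. The grouping key, the function name, and the input rows (target names and read counts) are illustrative assumptions; only the idea of summing reads across targets and flagging `multiple_loci` comes from this page.

```python
from collections import defaultdict
from statistics import mean


def collapse_codons(rows, combine=sum):
    """Collapse per-target codon calls into one row per sample, gene, position and codon."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["sample_name"], row["gene_id"], row["gene"], row["aa_position"], row["codon"])
        grouped[key].append((row["target_name"], int(row["reads"])))
    collapsed = []
    for (sample, gene_id, gene, aa_position, codon), hits in grouped.items():
        collapsed.append({
            "sample_name": sample,
            "gene_id": gene_id,
            "gene": gene,
            "aa_position": aa_position,
            "codon": codon,
            "reads": combine([reads for _, reads in hits]),
            # True when more than one target contributed to this codon call.
            "multiple_loci": len({target for target, _ in hits}) > 1,
        })
    return collapsed


# Hypothetical rows: the same codon observed in two overlapping targets.
rows = [
    {"sample_name": "sample_A", "gene_id": "0417200", "gene": "dhfr",
     "target_name": "target_1", "aa_position": "51", "codon": "ATT", "reads": "100"},
    {"sample_name": "sample_A", "gene_id": "0417200", "gene": "dhfr",
     "target_name": "target_2", "aa_position": "51", "codon": "ATT", "reads": "50"},
]

summed = collapse_codons(rows)                # sums reads across targets
averaged = collapse_codons(rows, combine=mean)
print(summed[0]["reads"], averaged[0]["reads"], summed[0]["multiple_loci"])
```

Passing `combine=max` or filtering `rows` by `target_name` before collapsing follows the same pattern.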

resmarker_microhaplotype_table.txt¶

Description: Table showing combinations of codons from the same microhaplotype for resistance markers of interest.

Columns:

Column	Description
`sample_name`	Sample identifier
`gene_id`	Gene identifier
`gene`	Gene name
`target_name`	Target identifier
`mhap_aa_positions`	List of codon/amino acid positions in this microhaplotype (e.g., `16/51/59`)
`ref_mhap`	Reference microhaplotype (amino acids separated by `/`, e.g., `A/N/C`)
`mhap`	Observed microhaplotype (amino acids separated by `/`, e.g., `A/I/R`)
`mhap_ref_alt`	Whether microhaplotype is `REF` or `ALT`
`reads`	Number of reads supporting this microhaplotype

Example:

sample_name          gene_id  gene  target_name                    mhap_aa_positions  ref_mhap  mhap    mhap_ref_alt  reads
SRR26819135_S1_L001  0417200  dhfr  Pf3D7_04_v3-0748128-0748326    16/51/59           A/N/C     A/I/R   ALT            381

Microhaplotypes

Microhaplotypes represent combinations of mutations that are analyzed together. This is useful for understanding co-occurring mutations that may have different resistance implications than individual mutations.
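Because the three microhaplotype columns use the same `/`-separated layout, they can be zipped together to list the amino acid changes. The sketch below uses the example row above; `mhap_changes` and the `N51I`-style labels are illustrative choices rather than pipeline output.

```python
def mhap_changes(mhap_aa_positions, ref_mhap, mhap):
    """List amino acid changes (e.g. N51I) where the observed residue differs from the reference."""
    positions = mhap_aa_positions.split("/")
    ref = ref_mhap.split("/")
    observed = mhap.split("/")
    if not (len(positions) == len(ref) == len(observed)):
        raise ValueError("microhaplotype fields have different lengths")
    return [f"{r}{p}{o}" for p, r, o in zip(positions, ref, observed) if r != o]


changes = mhap_changes("16/51/59", "A/N/C", "A/I/R")
print(changes)
```

Position 16 is unchanged (A in both), so only positions 51 and 59 are reported.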

all_mutations_table.txt¶

Description: Mutation table showing all single-nucleotide variants (SNVs) in targets that covered at least one marker of interest.

Columns:

Column	Description
`sample_name`	Sample identifier
`gene_id`	Gene identifier
`gene`	Gene name
`target_name`	Target identifier
`target_position`	Position in the target (not gene position)
`alt`	Alternative (mutant) base
`ref`	Reference base
`reads`	Number of reads supporting this mutation

When to use: This file provides a simple view of all mutations without codon/amino acid interpretation. It is useful for finding novel mutations outside the markers of interest.

Example:

sample_name          gene_id  gene  target_name                    target_position  alt  ref  reads
SRR26819135_S1_L001  0417200  dhfr  Pf3D7_04_v3-0748128-0748326    110              T    A    127
SRR26819135_S1_L001  0417200  dhfr  Pf3D7_04_v3-0748128-0748326    133              C    T    127
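One way to scan this table for recurring mutations is to count how many samples carry each one, sketched below. The first two rows come from the example above; the third sample is hypothetical, and the `A110T`-style label is an illustrative convention that uses the target position, not the gene position.

```python
from collections import defaultdict


def samples_per_mutation(rows):
    """Count distinct samples carrying each (gene, target, mutation)."""
    carriers = defaultdict(set)
    for row in rows:
        # Label is ref + target_position + alt; position is relative to the target.
        label = f'{row["ref"]}{row["target_position"]}{row["alt"]}'
        carriers[(row["gene"], row["target_name"], label)].add(row["sample_name"])
    return {key: len(samples) for key, samples in carriers.items()}


TARGET = "Pf3D7_04_v3-0748128-0748326"
rows = [
    {"sample_name": "SRR26819135_S1_L001", "gene": "dhfr", "target_name": TARGET,
     "target_position": "110", "alt": "T", "ref": "A"},
    {"sample_name": "SRR26819135_S1_L001", "gene": "dhfr", "target_name": TARGET,
     "target_position": "133", "alt": "C", "ref": "T"},
    # Hypothetical second sample, for illustration only.
    {"sample_name": "hypothetical_sample", "gene": "dhfr", "target_name": TARGET,
     "target_position": "110", "alt": "T", "ref": "A"},
]

counts = samples_per_mutation(rows)
for key, n in counts.items():
    print(key, n)
```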

Panel Information Files¶

panel_information/¶

This directory contains copies of the panel configuration files used during the run:

amplicon_info.tsv: Target definitions (targets, primers, coordinates)
reference.fasta: Reference sequences used for alignment
resmarker_info.tsv: Resistance marker definitions

When to use: Useful when running with multiple pre-configured pools, since the pipeline generates these files automatically from the list of pools. They are also informative for downstream analysis, for example to see which resistance markers were identified as covered by targets.

Run Metadata Files¶

run/parameters.tsv¶

Description: Tab-separated file containing all parameters used in this pipeline run.

Format: Two columns: parameter name and parameter value.

Example:

pools                D1,R1,R2
readDIR              tests/example_data/example_fastq
workflow_name        complete
cutadapt_minlen      100
quality_score        20
...

When to use: Useful for reproducing runs or understanding what settings were used.
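Reading the parameters back into a dictionary is one way to compare runs. The sketch below assumes two tab-separated columns with no header row, as the example above suggests; `read_parameters` is a hypothetical helper, and only the parameters shown in the example are included.

```python
import csv
import io


def read_parameters(handle):
    """Read a two-column, tab-separated parameter file into a dict of strings."""
    params = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) >= 2:  # skip blank or malformed lines
            params[fields[0]] = fields[1]
    return params


# Rows copied from the example above, joined with tabs (assumed delimiter).
EXAMPLE = "\n".join([
    "pools\tD1,R1,R2",
    "readDIR\ttests/example_data/example_fastq",
    "workflow_name\tcomplete",
    "cutadapt_minlen\t100",
    "quality_score\t20",
])

params = read_parameters(io.StringIO(EXAMPLE))
pools = params["pools"].split(",")
print(params["workflow_name"], pools)
```

All values are kept as strings; convert numeric parameters such as `cutadapt_minlen` explicitly where needed.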

run/runtime.tsv¶

Description: Runtime information about the pipeline execution.

Columns:

Column	Description
`PipelineVersion`	Version of the pipeline
`ContainerEngine`	Container engine used (e.g., `docker`, `apptainer`)
`Duration`	Total runtime (e.g., `10m 44s`)
`CommandLine`	Full command line used to run the pipeline
`CommitId`	Git commit ID (if available)
`Complete`	Timestamp when pipeline completed
`ConfigFiles`	Path to configuration files used
`Container`	Container image used
`ExitStatus`	Exit status (0 = success)
`Profile`	Nextflow profile used
`RunName`	Nextflow run name

When to use: Useful for tracking pipeline versions, runtime, and debugging issues.

Understanding Output File Relationships¶

Raw FASTQ files
    ↓
sample_coverage.txt (tracks reads through stages)
    ↓
raw_dada2_output/dada2.clusters.txt (DADA2 inference)
    ↓
allele_data.txt (alignment, masking, mutation calling)
    ↓
allele_data_collapsed.txt (simplified version)
    ↓
resistance_marker_module/ (if enabled)
    ├── resmarker_table.txt (codon-level)
    ├── resmarker_table_by_locus.txt (with target info)
    ├── resmarker_microhaplotype_table.txt (haplotypes)
    └── all_mutations_table.txt (discovery of novel mutations)

Next Steps¶

Analyze results: Use the allele tables to identify variants in your samples
Check quality: Review coverage files to ensure sufficient read depth
Resistance analysis: If enabled, review resistance marker tables for drug resistance mutations
Reproducibility: Use run/parameters.tsv to reproduce your analysis