Pipeline Outputs

This page describes all output files generated by the MAD4HATTER pipeline, organized by output type.

Output Structure

The folder structures below show the outputs of each workflow option (complete, qc, postprocessing).

Complete Workflow Outputs

results/
├── allele_data.txt                    # Main allele table (unmasked and masked sequences)
├── allele_data_collapsed.txt          # Collapsed allele table (masked sequences only)
├── sample_coverage.txt                # Sample-level coverage statistics
├── amplicon_coverage.txt              # Amplicon-level coverage statistics
├── raw_dada2_output/
│   └── dada2.clusters.txt             # Raw DADA2 output (ASV sequences and counts)
├── resistance_marker_module/          # Resistance marker analysis outputs (if enabled)
│   ├── resmarker_table.txt
│   ├── resmarker_table_by_locus.txt
│   ├── resmarker_microhaplotype_table.txt
│   └── all_mutations_table.txt
├── panel_information/                 # Panel configuration files used
│   ├── amplicon_info.tsv
│   ├── reference.fasta
│   └── resmarker_info.tsv
└── run/                               # Run metadata
    ├── parameters.tsv                 # Parameters used in this run
    └── runtime.tsv                    # Runtime information

QC Output Files

results/
├── sample_coverage.txt                # Sample-level coverage statistics
├── amplicon_coverage.txt              # Target-level coverage statistics
├── panel_information/                 # Panel configuration files used
│   └── amplicon_info.tsv
└── run/                               # Run metadata
    ├── parameters.tsv                 # Parameters used in this run
    └── runtime.tsv                    # Runtime information

Postprocessing Output Files

results/
├── allele_data.txt                    # Main allele table (unmasked and masked sequences)
├── allele_data_collapsed.txt          # Collapsed allele table (masked sequences only)
├── panel_information/                 # Panel configuration files used
│   ├── amplicon_info.tsv
│   └── reference.fasta
└── run/                               # Run metadata
    ├── parameters.tsv                 # Parameters used in this run
    └── runtime.tsv                    # Runtime information

Coverage Files

sample_coverage.txt

Description: Sample-level coverage statistics showing how many reads pass through each processing stage.

Columns:

Column       Description
sample_name  Sample identifier
stage        Processing stage name
reads        Number of reads at this stage

Stages:

  • Input: Starting number of reads from raw FASTQ files
  • No Dimers: Reads remaining after removing Illumina adapter dimers
  • Amplicons: Reads with expected primers attached (after cutadapt)

The following stages are only included if the complete workflow is run.

  • OutputDada2: Number of reads assigned to denoised sequences (ASVs) after DADA2 processing
  • OutputPostprocessing: Final number of reads remaining after alignment filtering

Example:

sample_name          stage              reads
SRR26819135_S1_L001  Input              50000
SRR26819135_S1_L001  No Dimers          49494
SRR26819135_S1_L001  Amplicons          40259
SRR26819135_S1_L001  OutputDada2        40217
SRR26819135_S1_L001  OutputPostprocessing 40217
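
The stage counts above can be turned into per-stage retention fractions. The following is a minimal sketch, not part of the pipeline: it assumes the file is tab-delimited with a header row, and it uses an inline copy of the example rows in place of results/sample_coverage.txt. The function name retention_by_stage is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Inline copy of the example rows; a real run would open results/sample_coverage.txt.
# Assumption: the file is tab-delimited with a header row.
EXAMPLE = (
    "sample_name\tstage\treads\n"
    "SRR26819135_S1_L001\tInput\t50000\n"
    "SRR26819135_S1_L001\tNo Dimers\t49494\n"
    "SRR26819135_S1_L001\tAmplicons\t40259\n"
    "SRR26819135_S1_L001\tOutputDada2\t40217\n"
    "SRR26819135_S1_L001\tOutputPostprocessing\t40217\n"
)


def retention_by_stage(handle):
    """Return {sample: {stage: fraction of Input reads remaining}}."""
    counts = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["sample_name"]][row["stage"]] = int(row["reads"])

    result = {}
    for sample, stages in counts.items():
        total = stages.get("Input", 0)
        result[sample] = {
            stage: (reads / total if total else 0.0)
            for stage, reads in stages.items()
        }
    return result


retention = retention_by_stage(io.StringIO(EXAMPLE))
for stage, fraction in retention["SRR26819135_S1_L001"].items():
    print(f"{stage}: {fraction:.1%}")
```

A large drop between two consecutive stages points to the processing step where most reads were lost.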


amplicon_coverage.txt

Description: Amplicon-level coverage statistics showing read counts per target for each sample. The columns OutputDada2 and OutputPostprocessing will only be added if the complete workflow has been run.

Columns:

Column                Description
sample_name           Sample identifier
target_name           Target identifier (format: chromosome-start-end)
reads                 Number of reads for this target after cutadapt
OutputDada2           Number of reads in denoised sequences (ASVs) for this target after DADA2
OutputPostprocessing  Final number of reads for this target after alignment filtering

Example:

sample_name          target_name                        reads  OutputDada2  OutputPostprocessing
SRR26819135_S1_L001  Pf3D7_01_v3-0145420-0145630       54     54           54
SRR26819135_S1_L001  Pf3D7_01_v3-0162888-0163092       400    398          398

Interpreting Coverage

  • reads: Raw read count after primer dimer removal and demultiplexing.
  • OutputDada2: May be slightly lower than reads if some reads were filtered during DADA2 error correction.
  • OutputPostprocessing: May be lower if sequences failed alignment thresholds. This could be due to off-target amplification, poor read quality, or contamination with human reads.
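
One way to act on these columns is to flag targets whose final read count falls below a chosen depth. The sketch below is illustrative only: the threshold of 100 reads and the function name low_coverage_targets are assumptions, and the file is assumed to be tab-delimited with a header row.

```python
import csv
import io

# Inline copy of the example rows; a real run would open results/amplicon_coverage.txt.
EXAMPLE = (
    "sample_name\ttarget_name\treads\tOutputDada2\tOutputPostprocessing\n"
    "SRR26819135_S1_L001\tPf3D7_01_v3-0145420-0145630\t54\t54\t54\n"
    "SRR26819135_S1_L001\tPf3D7_01_v3-0162888-0163092\t400\t398\t398\n"
)


def low_coverage_targets(handle, min_reads=100):
    """Return (sample, target, final_reads) for targets below min_reads.

    min_reads is a hypothetical threshold; choose one that suits your data.
    """
    flagged = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Fall back to the cutadapt count when the OutputPostprocessing
        # column is absent (i.e. the complete workflow was not run).
        final = int(row.get("OutputPostprocessing") or row["reads"])
        if final < min_reads:
            flagged.append((row["sample_name"], row["target_name"], final))
    return flagged


flagged = low_coverage_targets(io.StringIO(EXAMPLE))
for sample, target, final in flagged:
    print(f"{sample}\t{target}\t{final}")
```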

Allele Data Files

allele_data.txt

Description: The main output file, containing all identified alleles (ASVs) with both unmasked and masked sequences, plus mutation annotations.

Columns:

Column                Description
sample_name           Sample identifier
target_name           Target identifier (common format: chromosome-insert_start-insert_end)
asv                   Original ASV sequence (unmasked, with all bases)
pseudocigar_unmasked  PseudoCIGAR string describing mutations in unmasked sequence
asv_masked            ASV sequence with masked regions (low-complexity regions replaced with n)
pseudocigar_masked    PseudoCIGAR string describing mutations in masked sequence
reads                 Number of reads supporting this ASV
pool                  Pool identifier(s) (e.g., D1.1, R1.2)

Key Features:

  • Unmasked columns (asv, pseudocigar_unmasked): Full sequence information before masking
  • Masked columns (asv_masked, pseudocigar_masked): Sequences with low-complexity regions masked (used for downstream analysis). Note that asv_masked contains alignment information to allow for masking.
  • PseudoCIGAR format: See PseudoCIGAR Format section below

Example:

sample_name          target_name                        asv                    pseudocigar_unmasked  asv_masked              pseudocigar_masked    reads  pool
SRR26819135_S1_L001  Pf3D7_01_v3-0145420-0145630        GATATGTTTAAATATA...    94A                    GATATGTTTAAATATA...    25+25N94A169+8N188+9N  54     D1.1


allele_data_collapsed.txt

Description: Simplified allele table containing only masked sequences and PseudoCIGARs. This is a collapsed version of allele_data.txt that merges ASVs which differed only within masked regions, where the differences are suspected errors.

Columns:

Column              Description
sample_name         Sample identifier
target_name         Target identifier
asv_masked          ASV sequence with masked regions (low-complexity regions replaced with n)
pseudocigar_masked  PseudoCIGAR string describing mutations in masked sequence
reads               Number of reads supporting this ASV
pool                Pool identifier(s)

When to use: This file is useful when you only need the masked sequences and mutation annotations, without the full unmasked sequences.
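
The relationship between the two allele tables can be illustrated by grouping rows on their masked sequence and summing reads. This is a sketch of the idea, not the pipeline's own implementation: the rows and sequences are hypothetical and shortened, the pool column is ignored, and collapse_alleles is an illustrative name.

```python
from collections import defaultdict

# Hypothetical rows in the shape of allele_data.txt (sequences shortened).
# The first two ASVs differ only at position 6, inside the masked region 4-10.
rows = [
    {"sample_name": "S1", "target_name": "t1", "asv": "ACTTGATTGCACA",
     "asv_masked": "ACTnnnnnnnACA", "pseudocigar_masked": "4+7N", "reads": 40},
    {"sample_name": "S1", "target_name": "t1", "asv": "ACTTGTTTGCACA",
     "asv_masked": "ACTnnnnnnnACA", "pseudocigar_masked": "4+7N", "reads": 2},
    {"sample_name": "S1", "target_name": "t1", "asv": "ACTTGATTGCACG",
     "asv_masked": "ACTnnnnnnnACG", "pseudocigar_masked": "4+7N13G", "reads": 12},
]


def collapse_alleles(rows):
    """Sum reads over rows sharing sample, target and masked sequence."""
    totals = defaultdict(int)
    for row in rows:
        key = (
            row["sample_name"],
            row["target_name"],
            row["asv_masked"],
            row["pseudocigar_masked"],
        )
        totals[key] += row["reads"]
    return dict(totals)


collapsed = collapse_alleles(rows)
for key, reads in collapsed.items():
    print(*key, reads, sep="\t")
```

In this hypothetical input, three unmasked ASVs become two masked alleles, and the two that differ only inside the mask contribute their reads to one row.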


Raw DADA2 Output

raw_dada2_output/dada2.clusters.txt

Description: Raw output from the DADA2 sequence inference step, before alignment and masking.

Columns:

Column            Description
sampleID          Sample identifier (same as sample_name in other files)
locus             Target identifier (same as target_name in other files)
asv               Denoised ASV sequence (unmasked)
reads             Number of reads supporting this ASV
allele            Unique allele identifier (format: locus.integer, e.g., Pf3D7_01_v3-0145420-0145630.1)
norm.reads.locus  Normalized read count (proportion of reads for this locus)
n.alleles         Number of unique alleles found at this locus for this sample

When to use: This file contains the raw DADA2 output before any alignment or masking steps. Useful for understanding initial sequence inference results. This can also be used as input to the postprocessing workflow if you wish to reanalyze the data with alternate parameters.

Example:

sampleID            locus                           asv                    reads  allele                                    norm.reads.locus  n.alleles
SRR26819135_S1_L001 Pf3D7_01_v3-0145420-0145630    GATATGTTTAAATATA...    54     Pf3D7_01_v3-0145420-0145630.1              1                 1
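
The two derived columns can be recomputed from the read counts as a sanity check. The sketch below assumes that norm.reads.locus is the ASV's reads divided by the total reads for the same sample and locus, which is one reading of the column description above and is not confirmed by the pipeline source. The input rows and the function name annotate are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows in the shape of dada2.clusters.txt (sequences shortened).
rows = [
    {"sampleID": "S1", "locus": "locusA", "asv": "ACGT", "reads": 75},
    {"sampleID": "S1", "locus": "locusA", "asv": "ACGA", "reads": 25},
    {"sampleID": "S1", "locus": "locusB", "asv": "TTGA", "reads": 54},
]


def annotate(rows):
    """Add norm.reads.locus and n.alleles to copies of the input rows."""
    total_reads = defaultdict(int)
    alleles = defaultdict(set)
    for row in rows:
        key = (row["sampleID"], row["locus"])
        total_reads[key] += row["reads"]
        alleles[key].add(row["asv"])

    annotated = []
    for row in rows:
        key = (row["sampleID"], row["locus"])
        annotated.append({
            **row,
            # Assumed definition: share of this locus's reads in this sample.
            "norm.reads.locus": row["reads"] / total_reads[key],
            "n.alleles": len(alleles[key]),
        })
    return annotated


annotated = annotate(rows)
for row in annotated:
    print(row["locus"], row["asv"], row["norm.reads.locus"], row["n.alleles"])
```

Under this assumption, a locus with a single ASV has norm.reads.locus of 1 and n.alleles of 1, which matches the example row above.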


PseudoCIGAR Format

The PseudoCIGAR string provides a compact representation of mutations and masked regions in ASV sequences relative to the reference.

Mutation Syntax

SNPs (Single Nucleotide Polymorphisms)

Format: {position}{base}

  • position: Position in the reference sequence (1-based)
  • base: The base in the ASV at this position (different from reference)

Example: 94A means the ASV has an A at position 94, while the reference has a different base.

Insertions

Format: {position}I={base}

  • position: Position in the reference where the insertion occurs
  • base: The base inserted in the ASV

Example: 10I=G means a G was inserted at position 10 in the ASV.

Deletions

Format: {position}D={base}

  • position: Position in the reference sequence
  • base: The base or bases deleted from the ASV (present in the reference but not in the ASV)

Example: 144D=TCT means the sequence TCT was deleted starting at position 144.

Mask Syntax

Format: {start_position}+{length}N

  • start_position: Where the mask begins (1-based)
  • length: Length of the masked region
  • N: Indicates masked (low-complexity) region

Example: 25+25N means positions 25-49 (25 bases) are masked.

Multiple Mutations

Multiple mutations and masks are concatenated into a single PseudoCIGAR string. For example:

25+25N94A169+8N188+9N

This means:

  • Positions 25-49 are masked (25 bases)
  • Position 94 has SNP A
  • Positions 169-176 are masked (8 bases)
  • Positions 188-196 are masked (9 bases)
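
The syntax described in this section can be tokenized with a regular expression. The sketch below handles only the four token types documented on this page (mask, SNP, insertion, deletion) and assumes bases are limited to A, C, G and T. Any other token raises an error. The function name parse_pseudocigar is hypothetical and is not part of the pipeline.

```python
import re

# One alternative per token type described above. The mask pattern is tried
# first so that "25+25N" is not misread as a SNP.
TOKEN = re.compile(
    r"(?P<mask_start>\d+)\+(?P<mask_len>\d+)N"
    r"|(?P<ins_pos>\d+)I=(?P<ins_seq>[ACGT]+)"
    r"|(?P<del_pos>\d+)D=(?P<del_seq>[ACGT]+)"
    r"|(?P<snp_pos>\d+)(?P<snp_base>[ACGT])"
)


def parse_pseudocigar(text):
    """Split a PseudoCIGAR string into a list of event tuples.

    Masks are returned as ("mask", first_position, last_position);
    the other events as (kind, position, bases).
    """
    events = []
    offset = 0
    while offset < len(text):
        match = TOKEN.match(text, offset)
        if match is None:
            raise ValueError(f"unrecognised token at offset {offset}: {text[offset:]!r}")
        if match.group("mask_start") is not None:
            start = int(match.group("mask_start"))
            length = int(match.group("mask_len"))
            events.append(("mask", start, start + length - 1))
        elif match.group("ins_pos") is not None:
            events.append(("insertion", int(match.group("ins_pos")), match.group("ins_seq")))
        elif match.group("del_pos") is not None:
            events.append(("deletion", int(match.group("del_pos")), match.group("del_seq")))
        else:
            events.append(("snp", int(match.group("snp_pos")), match.group("snp_base")))
        offset = match.end()
    return events


events = parse_pseudocigar("25+25N94A169+8N188+9N")
for event in events:
    print(event)
```

Converting each mask to an inclusive position range makes it straightforward to test whether a SNP position falls inside a masked region.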

Interpreting Masked Sequences

In asv_masked, masked regions are replaced with lowercase n characters:

Reference:  ACTTGATTGCACA
ASV:        ACTTGATTGCACA
Masked:     ACTnnnnnnnACA  (positions 4-10 masked)
PseudoCIGAR: 4+7N
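
Applying a single mask token to a sequence can be sketched as follows. This assumes the sequence is already aligned to the reference with no insertions or deletions, so that 1-based reference positions map directly onto string indices. The input sequence and the function name apply_mask are hypothetical.

```python
def apply_mask(sequence, start, length):
    """Replace `length` bases beginning at 1-based `start` with lowercase n."""
    begin = start - 1  # convert the 1-based position to a 0-based index
    if begin < 0 or length < 0 or begin + length > len(sequence):
        raise ValueError("mask falls outside the sequence")
    return sequence[:begin] + "n" * length + sequence[begin + length:]


# Hypothetical 14-base sequence with the mask token 3+4N (positions 3-6).
original = "GATTACAGATTACA"
masked = apply_mask(original, 3, 4)
print(original)
print(masked)
```

The masked sequence keeps the same length as the input, because each masked base is replaced by exactly one n.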

Resistance Marker Module Outputs

These files are generated when resistance marker analysis is enabled. They provide codon-level and mutation-level information for resistance-associated genes.

resmarker_table.txt

Description: Comprehensive resistance marker table showing codon-level information for all resistance-associated genes.

Columns:

Column         Description
sample_name    Sample identifier
gene_id        Gene identifier (e.g., 0417200)
gene           Gene name (e.g., dhfr, mdr1, crt)
aa_position    Codon/amino acid position in the gene
ref_codon      Reference codon sequence (3 bases)
codon          Observed codon sequence in the sample
codon_ref_alt  Whether codon is reference (REF) or alternative (ALT)
ref_aa         Reference amino acid
aa             Observed amino acid
aa_ref_alt     Whether amino acid is reference (REF) or alternative (ALT)
follows_indel  Boolean indicating if this codon follows an insertion/deletion
codon_masked   Boolean indicating if this codon is in a masked region
multiple_loci  Boolean indicating if this codon maps to multiple targets
reads          Number of reads supporting this codon call

Example:

sample_name          gene_id  gene  aa_position  ref_codon  codon  codon_ref_alt  ref_aa  aa  aa_ref_alt  follows_indel  codon_masked  multiple_loci  reads
SRR26819135_S1_L001  0417200  dhfr  16           GCA        GCA   REF            A       A   REF         False           False         False         127
SRR26819135_S1_L001  0417200  dhfr  51           AAT        ATT   ALT            N       I   ALT         False           False         False         127

Interpreting Resmarker Table

This file is a collapsed version of resmarker_table_by_locus.txt. If a codon is covered by multiple targets for a given marker, the reads are summed and the rows are collapsed into one, and multiple_loci is set to True. To analyze the data per target, for example to filter out a poorly performing target or to combine reads in a different way (e.g., mean), use resmarker_table_by_locus.txt.
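
The collapsing step described above can be sketched by grouping the per-target rows on sample, gene, position and codon. This is an illustration under assumptions, not the pipeline's code: the rows and target names are hypothetical, only a subset of the columns is carried through, and collapse_codons is an illustrative name. The combine argument shows how a different summary, such as the mean, could be substituted.

```python
import statistics
from collections import defaultdict

# Hypothetical rows in the shape of resmarker_table_by_locus.txt.
# Codon 51 is covered by two targets; codon 16 by one.
by_locus = [
    {"sample_name": "S1", "gene": "dhfr", "aa_position": 51, "codon": "ATT",
     "target_name": "targetA", "reads": 127},
    {"sample_name": "S1", "gene": "dhfr", "aa_position": 51, "codon": "ATT",
     "target_name": "targetB", "reads": 90},
    {"sample_name": "S1", "gene": "dhfr", "aa_position": 16, "codon": "GCA",
     "target_name": "targetA", "reads": 127},
]


def collapse_codons(rows, combine=sum):
    """Merge rows for the same codon call across targets."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["sample_name"], row["gene"], row["aa_position"], row["codon"])
        grouped[key].append(row)

    collapsed = {}
    for key, members in grouped.items():
        targets = {member["target_name"] for member in members}
        collapsed[key] = {
            "reads": combine([member["reads"] for member in members]),
            "multiple_loci": len(targets) > 1,
        }
    return collapsed


summed = collapse_codons(by_locus)
averaged = collapse_codons(by_locus, combine=statistics.mean)
for key, value in summed.items():
    print(*key, value["reads"], value["multiple_loci"], sep="\t")
```

Filtering by_locus before calling the function, for example to drop a poorly performing target, changes the collapsed counts without touching the pipeline outputs.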


resmarker_table_by_locus.txt

Description: Resistance marker table organized by target, showing which target each codon call comes from. If you have no overlapping targets then this information will be the same as the above file (resmarker_table.txt).

Columns:

Column         Description
sample_name    Sample identifier
gene_id        Gene identifier
gene           Gene name
target_name    Target identifier where this codon was observed
aa_position    Codon/amino acid position in the gene
ref_codon      Reference codon sequence
codon          Observed codon sequence
codon_ref_alt  Whether codon is REF or ALT
ref_aa         Reference amino acid
aa             Observed amino acid
aa_ref_alt     Whether amino acid is REF or ALT
follows_indel  Boolean indicating if codon follows an indel
codon_masked   Boolean indicating if codon is in a masked region
reads          Number of reads supporting this codon call

Difference from resmarker_table.txt: This file includes target_name to show which target each codon call came from. This may be useful if you want to filter targets or collapse codons covered by multiple targets in a different way than the pipeline does (e.g., max or mean).


resmarker_microhaplotype_table.txt

Description: Table showing combinations of codons from the same microhaplotype for resistance markers of interest.

Columns:

Column             Description
sample_name        Sample identifier
gene_id            Gene identifier
gene               Gene name
target_name        Target identifier
mhap_aa_positions  List of codon/amino acid positions in this microhaplotype (e.g., 16/51/59)
ref_mhap           Reference microhaplotype (amino acids separated by /, e.g., A/N/C)
mhap               Observed microhaplotype (amino acids separated by /, e.g., A/I/R)
mhap_ref_alt       Whether microhaplotype is REF or ALT
reads              Number of reads supporting this microhaplotype

Example:

sample_name          gene_id  gene  target_name                    mhap_aa_positions  ref_mhap  mhap    mhap_ref_alt  reads
SRR26819135_S1_L001  0417200  dhfr  Pf3D7_04_v3-0748128-0748326    16/51/59           A/N/C     A/I/R   ALT            381

Microhaplotypes

Microhaplotypes represent combinations of mutations that are analyzed together. This is useful for understanding co-occurring mutations that may have different resistance implications than individual mutations.
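
A microhaplotype row can be split back into per-position calls by pairing the slash-separated fields. The sketch below uses the values from the example row above. The function name split_microhaplotype and the per-position REF/ALT label are illustrative and not produced by the pipeline.

```python
def split_microhaplotype(positions, ref_mhap, mhap):
    """Pair each amino acid position with its reference and observed residue."""
    position_list = [int(p) for p in positions.split("/")]
    ref_list = ref_mhap.split("/")
    observed_list = mhap.split("/")
    if not (len(position_list) == len(ref_list) == len(observed_list)):
        raise ValueError("fields do not contain the same number of positions")

    calls = []
    for position, ref_aa, aa in zip(position_list, ref_list, observed_list):
        calls.append({
            "aa_position": position,
            "ref_aa": ref_aa,
            "aa": aa,
            "ref_alt": "REF" if ref_aa == aa else "ALT",
        })
    return calls


# Values taken from the dhfr example row: 16/51/59, A/N/C, A/I/R.
calls = split_microhaplotype("16/51/59", "A/N/C", "A/I/R")
for call in calls:
    print(f"{call['ref_aa']}{call['aa_position']}{call['aa']}\t{call['ref_alt']}")
```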


all_mutations_table.txt

Description: Mutation table showing all single-nucleotide variants (SNVs) in targets that covered at least one marker of interest.

Columns:

Column           Description
sample_name      Sample identifier
gene_id          Gene identifier
gene             Gene name
target_name      Target identifier
target_position  Position in the target (not gene position)
alt              Alternative (mutant) base
ref              Reference base
reads            Number of reads supporting this mutation

When to use: This file provides a simple view of all mutations without codon/amino acid interpretation. Useful for finding novel mutations outside the markers of interest.

Example:

sample_name          gene_id  gene  target_name                    target_position  alt  ref  reads
SRR26819135_S1_L001  0417200  dhfr  Pf3D7_04_v3-0748128-0748326    110              T    A    127
SRR26819135_S1_L001  0417200  dhfr  Pf3D7_04_v3-0748128-0748326    133              C    T    127


Panel Information Files

panel_information/

This directory contains copies of the panel configuration files used during the run:

  • amplicon_info.tsv: Target definitions (targets, primers, coordinates)
  • reference.fasta: Reference sequences used for alignment
  • resmarker_info.tsv: Resistance marker definitions

When to use: Useful when running with multiple pre-configured pools, because the pipeline automatically generates these files from the list of pools. They can inform downstream analysis and show which resistance markers were identified as covered by targets.


Run Metadata Files

run/parameters.tsv

Description: Tab-separated file containing all parameters used in this pipeline run.

Format: Two columns: parameter name and parameter value.

Example:

pools                D1,R1,R2
readDIR              tests/example_data/example_fastq
workflow_name        complete
cutadapt_minlen      100
quality_score        20
...

When to use: Useful for reproducing runs or understanding what settings were used.
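
Because the file has two columns, it can be loaded into a dictionary for inspection or for comparing runs. The sketch below assumes tab separation and no header row, as in the example above, and uses an inline copy of those example lines in place of results/run/parameters.tsv. The function name read_parameters is hypothetical.

```python
import csv
import io

# Inline copy of the example lines; a real run would open results/run/parameters.tsv.
# Assumption: tab-separated, no header row.
EXAMPLE = (
    "pools\tD1,R1,R2\n"
    "readDIR\ttests/example_data/example_fastq\n"
    "workflow_name\tcomplete\n"
    "cutadapt_minlen\t100\n"
    "quality_score\t20\n"
)


def read_parameters(handle):
    """Return {parameter name: value as a string}."""
    params = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:  # skip blank or malformed lines
            params[row[0]] = row[1]
    return params


params = read_parameters(io.StringIO(EXAMPLE))
pools = params["pools"].split(",")
print(params["workflow_name"], pools)
```

Values are kept as strings, so numeric parameters such as cutadapt_minlen need an explicit int() conversion before comparison.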


run/runtime.tsv

Description: Runtime information about the pipeline execution.

Columns:

Column           Description
PipelineVersion  Version of the pipeline
ContainerEngine  Container engine used (e.g., docker, apptainer)
Duration         Total runtime (e.g., 10m 44s)
CommandLine      Full command line used to run the pipeline
CommitId         Git commit ID (if available)
Complete         Timestamp when pipeline completed
ConfigFiles      Path to configuration files used
Container        Container image used
ExitStatus       Exit status (0 = success)
Profile          Nextflow profile used
RunName          Nextflow run name

When to use: Useful for tracking pipeline versions, runtime, and debugging issues.


Understanding Output File Relationships

Raw FASTQ files
    ↓
sample_coverage.txt (tracks reads through stages)
    ↓
raw_dada2_output/dada2.clusters.txt (DADA2 inference)
    ↓
allele_data.txt (alignment, masking, mutation calling)
    ↓
allele_data_collapsed.txt (simplified version)
    ↓
resistance_marker_module/ (if enabled)
    ├── resmarker_table.txt (codon-level)
    ├── resmarker_table_by_locus.txt (with target info)
    ├── resmarker_microhaplotype_table.txt (haplotypes)
    └── all_mutations_table.txt (discovery of novel mutations)

Next Steps

  • Analyze results: Use the allele tables to identify variants in your samples
  • Check quality: Review coverage files to ensure sufficient read depth
  • Resistance analysis: If enabled, review resistance marker tables for drug resistance mutations
  • Reproducibility: Use run/parameters.tsv to reproduce your analysis