
Pipeline Outputs

This page describes all output files generated by the MAD4HATTER pipeline, organized by output type.

Output Structure

The folder structures below show the outputs of each workflow option (complete, qc, postprocessing).

Complete Workflow Outputs

results/
├── allele_data.txt                    # Main allele table (unmasked and masked sequences)
├── allele_data_collapsed.txt          # Collapsed allele table (masked sequences only)
├── sample_coverage.txt                # Sample-level coverage statistics
├── amplicon_coverage.txt              # Amplicon-level coverage statistics
├── raw_dada2_output/
│   └── dada2.clusters.txt             # Raw DADA2 output (ASV sequences and counts)
├── resistance_marker_module/          # Resistance marker analysis outputs (if enabled)
│   ├── resmarker_table.txt
│   ├── resmarker_table_by_locus.txt
│   ├── resmarker_microhaplotype_table.txt
│   └── all_mutations_table.txt
├── panel_information/                 # Panel configuration files used
│   ├── amplicon_info.tsv
│   ├── reference.fasta
│   └── resmarker_info.tsv
└── run/                               # Run metadata
    ├── parameters.tsv                 # Parameters used in this run
    └── runtime.tsv                    # Runtime information

QC Output Files

results/
├── sample_coverage.txt                # Sample-level coverage statistics
├── amplicon_coverage.txt              # Target-level coverage statistics
├── panel_information/                 # Panel configuration files used
│   └── amplicon_info.tsv
└── run/                               # Run metadata
    ├── parameters.tsv                 # Parameters used in this run
    └── runtime.tsv                    # Runtime information

Postprocessing Output Files

results/
├── allele_data.txt                    # Main allele table (unmasked and masked sequences)
├── allele_data_collapsed.txt          # Collapsed allele table (masked sequences only)
├── panel_information/                 # Panel configuration files used
│   ├── amplicon_info.tsv
│   └── reference.fasta
└── run/                               # Run metadata
    ├── parameters.tsv                 # Parameters used in this run
    └── runtime.tsv                    # Runtime information

Coverage Files

sample_coverage.txt

Description: Sample-level coverage statistics showing how many reads pass through each processing stage.

Columns:

Column Description
sample_name Sample identifier
stage Processing stage name
reads Number of reads at this stage

Stages:

  • Input: Starting number of reads from raw FASTQ files
  • No Dimers: Reads remaining after removing Illumina adapter dimers
  • Amplicons: Reads with expected primers attached (after cutadapt)

The following stages are only included if the complete workflow is run.

  • OutputDada2: Number of denoised sequences (ASVs) after DADA2 processing
  • OutputPostprocessing: Final number of sequences after alignment filtering

Example:

sample_name          stage                 reads
SRR26819135_S1_L001  Input                 50000
SRR26819135_S1_L001  No Dimers             49494
SRR26819135_S1_L001  Amplicons             40259
SRR26819135_S1_L001  OutputDada2           40217
SRR26819135_S1_L001  OutputPostprocessing  40217
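
As a rough illustration of how this table can be used, the sketch below computes the fraction of input reads that survive each stage. It assumes the file is tab-delimited with the header shown above (the delimiter is an assumption, not stated on this page), and the function name `stage_retention` is hypothetical. The inline text simply repeats the example rows.

```python
import csv
import io

# Inline copy of the example rows; the tab-delimited layout is an assumption.
EXAMPLE = (
    "sample_name\tstage\treads\n"
    "SRR26819135_S1_L001\tInput\t50000\n"
    "SRR26819135_S1_L001\tNo Dimers\t49494\n"
    "SRR26819135_S1_L001\tAmplicons\t40259\n"
    "SRR26819135_S1_L001\tOutputDada2\t40217\n"
    "SRR26819135_S1_L001\tOutputPostprocessing\t40217\n"
)


def stage_retention(handle):
    """Return {sample: {stage: fraction of that sample's Input reads}}."""
    counts = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        counts.setdefault(row["sample_name"], {})[row["stage"]] = int(row["reads"])
    retention = {}
    for sample, stages in counts.items():
        total = stages.get("Input", 0)
        retention[sample] = {
            stage: (reads / total if total else 0.0)
            for stage, reads in stages.items()
        }
    return retention


retention = stage_retention(io.StringIO(EXAMPLE))
for stage, fraction in retention["SRR26819135_S1_L001"].items():
    print(f"{stage}: {fraction:.1%}")
```

For a real run, the same function could be given a file handle, for example `open("results/sample_coverage.txt", newline="")`.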

amplicon_coverage.txt

Description: Amplicon-level coverage statistics showing read counts per target for each sample. The columns OutputDada2 and OutputPostprocessing will only be added if the complete workflow has been run.

Columns:

Column Description
sample_name Sample identifier
target_name Target identifier (format: chromosome-start-end)
reads Number of reads for this target after cutadapt
OutputDada2 Number of denoised sequences (ASVs) for this target after DADA2
OutputPostprocessing Final number of sequences for this target after alignment filtering

Example:

sample_name          target_name                        reads  OutputDada2  OutputPostprocessing
SRR26819135_S1_L001  Pf3D7_01_v3-0145420-0145630       54     54           54
SRR26819135_S1_L001  Pf3D7_01_v3-0162888-0163092       400    398          398

Tip: Interpreting Coverage

  • reads: Raw read count after primer dimer removal and demultiplexing.
  • OutputDada2: May be slightly lower than reads if some reads were filtered during DADA2 error correction.
  • OutputPostprocessing: May be lower if sequences failed alignment thresholds. This could be due to off-target amplification, poor read quality, or contamination with human reads.
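
One possible way to act on these columns is to flag targets with few reads. The sketch below is only an illustration: it assumes a tab-delimited file, the function name `low_coverage_targets` is hypothetical, and the threshold of 100 reads is an arbitrary placeholder rather than a pipeline recommendation. It uses OutputPostprocessing when that column exists and falls back to reads otherwise.

```python
import csv
import io

# Inline copy of the example rows; the tab-delimited layout is an assumption.
EXAMPLE = (
    "sample_name\ttarget_name\treads\tOutputDada2\tOutputPostprocessing\n"
    "SRR26819135_S1_L001\tPf3D7_01_v3-0145420-0145630\t54\t54\t54\n"
    "SRR26819135_S1_L001\tPf3D7_01_v3-0162888-0163092\t400\t398\t398\n"
)


def low_coverage_targets(handle, min_reads=100):
    """List (sample, target, final_reads) tuples with fewer than min_reads reads.

    Uses OutputPostprocessing when present (complete workflow) and falls
    back to the post-cutadapt `reads` column otherwise (QC workflow).
    """
    flagged = []
    for row in csv.DictReader(handle, delimiter="\t"):
        final = int(row.get("OutputPostprocessing") or row["reads"])
        if final < min_reads:
            flagged.append((row["sample_name"], row["target_name"], final))
    return flagged


flagged = low_coverage_targets(io.StringIO(EXAMPLE), min_reads=100)
for sample, target, reads in flagged:
    print(sample, target, reads)
```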


Allele Data Files

allele_data.txt

Description: The main allele table, containing all identified alleles (ASVs) with both unmasked and masked sequences, plus mutation annotations.

Columns:

Column Description
sample_name Sample identifier
target_name Target identifier (common format: chromosome-insert_start-insert_end)
asv Original ASV sequence (unmasked, with all bases)
pseudocigar_unmasked PseudoCIGAR string describing mutations in unmasked sequence
asv_masked ASV sequence with masked regions (low-complexity regions replaced with n)
pseudocigar_masked PseudoCIGAR string describing mutations in masked sequence
reads Number of reads supporting this ASV
pool Pool identifier(s) (e.g., D1.1, R1.2)

Key Features:

  • Unmasked columns (asv, pseudocigar_unmasked): Full sequence information before masking
  • Masked columns (asv_masked, pseudocigar_masked): Sequences with low-complexity regions masked (used for downstream analysis). Note that asv_masked contains alignment information to allow for masking.
  • PseudoCIGAR format: See PseudoCIGAR Format section below

Example:

sample_name          target_name                        asv                    pseudocigar_unmasked  asv_masked              pseudocigar_masked    reads  pool
SRR26819135_S1_L001  Pf3D7_01_v3-0145420-0145630        GATATGTTTAAATATA...    94A                    GATATGTTTAAATATA...    25+25N94A169+8N188+9N  54     D1.1
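
A common downstream step is to turn the reads column into within-sample allele frequencies per target. The sketch below shows one way to do that on rows already loaded as dictionaries (for example with `csv.DictReader`). The function name `allele_frequencies` and the toy rows are hypothetical and are not pipeline output.

```python
from collections import defaultdict


def allele_frequencies(rows):
    """Return copies of rows with a within-sample, per-target `frequency` added."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["sample_name"], row["target_name"])] += int(row["reads"])
    result = []
    for row in rows:
        total = totals[(row["sample_name"], row["target_name"])]
        frequency = int(row["reads"]) / total if total else 0.0
        result.append({**row, "frequency": frequency})
    return result


# Toy rows with made-up values, used only to exercise the function.
toy_rows = [
    {"sample_name": "s1", "target_name": "t1", "pseudocigar_masked": "94A", "reads": "75"},
    {"sample_name": "s1", "target_name": "t1", "pseudocigar_masked": "94T", "reads": "25"},
    {"sample_name": "s1", "target_name": "t2", "pseudocigar_masked": "10G", "reads": "10"},
]

with_freq = allele_frequencies(toy_rows)
for row in with_freq:
    print(row["target_name"], row["pseudocigar_masked"], row["frequency"])
```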

allele_data_collapsed.txt

Description: Simplified allele table containing only masked sequences and PseudoCIGARs. This is a collapsed version of allele_data.txt in which ASVs that differed only within a masked region (where differences are suspected errors) are merged into a single row.

Columns:

Column Description
sample_name Sample identifier
target_name Target identifier
asv_masked ASV sequence with masked regions (low-complexity regions replaced with n)
pseudocigar_masked PseudoCIGAR string describing mutations in masked sequence
reads Number of reads supporting this ASV
pool Pool identifier(s)

When to use: This file is useful when you only need the masked sequences and mutation annotations, without the full unmasked sequences.
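
To make the idea of collapsing concrete, the sketch below merges rows that share the same sample, target, and masked sequence, and sums their reads. Summing the reads and keeping the first pool value are assumptions made for illustration; this page does not state how the pipeline combines the rows. The function name `collapse_masked` and the toy rows are hypothetical.

```python
def collapse_masked(rows):
    """Merge rows with identical sample, target, and asv_masked; sum reads."""
    merged = {}
    for row in rows:
        key = (row["sample_name"], row["target_name"], row["asv_masked"])
        if key in merged:
            merged[key]["reads"] += int(row["reads"])
        else:
            merged[key] = {
                "sample_name": row["sample_name"],
                "target_name": row["target_name"],
                "asv_masked": row["asv_masked"],
                "pseudocigar_masked": row["pseudocigar_masked"],
                "reads": int(row["reads"]),
                "pool": row["pool"],  # first value seen is kept (assumption)
            }
    return list(merged.values())


# Toy rows: two ASVs that differ only inside the masked positions 4-10.
toy_rows = [
    {"sample_name": "s1", "target_name": "t1", "asv": "ACTTGATTGCACA",
     "asv_masked": "ACTnnnnnnnACA", "pseudocigar_masked": "4+7N",
     "reads": "40", "pool": "D1.1"},
    {"sample_name": "s1", "target_name": "t1", "asv": "ACTTGTTTGCACA",
     "asv_masked": "ACTnnnnnnnACA", "pseudocigar_masked": "4+7N",
     "reads": "2", "pool": "D1.1"},
]

collapsed = collapse_masked(toy_rows)
print(collapsed)
```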


Raw DADA2 Output

raw_dada2_output/dada2.clusters.txt

Description: Raw output from the DADA2 sequence inference step, before alignment and masking.

Columns:

Column Description
sampleID Sample identifier (same as sample_name in other files)
locus Target identifier (same as target_name in other files)
asv Denoised ASV sequence (unmasked)
reads Number of reads supporting this ASV
allele Unique allele identifier (format: locus.integer, e.g., Pf3D7_01_v3-0145420-0145630.1)
norm.reads.locus Normalized read count (proportion of reads for this locus)
n.alleles Number of unique alleles found at this locus for this sample

When to use: This file contains the raw DADA2 output before any alignment or masking steps. Useful for understanding initial sequence inference results. This can also be used as input to the postprocessing workflow if you wish to reanalyze the data with alternate parameters.

Example:

sampleID            locus                           asv                    reads  allele                                    norm.reads.locus  n.alleles
SRR26819135_S1_L001 Pf3D7_01_v3-0145420-0145630    GATATGTTTAAATATA...    54     Pf3D7_01_v3-0145420-0145630.1              1                 1

PseudoCIGAR Format

The PseudoCIGAR string provides a compact representation of mutations and masked regions in ASV sequences relative to the reference.

Mutation Syntax

SNPs (Single Nucleotide Polymorphisms)

Format: {position}{base}

  • position: Position in the reference sequence (1-based)
  • base: The base in the ASV at this position (different from reference)

Example: 94A means the ASV has an A at position 94, while the reference has a different base.

Insertions

Format: {position}I={base}

  • position: Position in the reference where the insertion occurs
  • base: The base inserted in the ASV

Example: 10I=G means a G was inserted at position 10 in the ASV.

Deletions

Format: {position}D={base}

  • position: Position in the reference sequence
  • base: The base or bases deleted from the ASV (present in the reference but not in the ASV)

Example: 144D=TCT means the sequence TCT was deleted starting at position 144.

Mask Syntax

Format: {start_position}+{length}N

  • start_position: Where the mask begins (1-based)
  • length: Length of the masked region
  • N: Indicates masked (low-complexity) region

Example: 25+25N means positions 25-49 (25 bases) are masked.

Multiple Mutations

Multiple mutations and masked regions are concatenated into a single PseudoCIGAR string. For example:

25+25N94A169+8N188+9N

This means:

  • Positions 25-49 are masked (25 bases)
  • Position 94 has SNP A
  • Positions 169-176 are masked (8 bases)
  • Positions 188-196 are masked (9 bases)
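
The syntax described above can be tokenized with a single regular expression. The following is a minimal sketch based only on the four forms documented on this page (SNP, insertion, deletion, mask); the function name `parse_pseudocigar` and the output tuple layout are hypothetical, and any other notation the pipeline may emit is not handled and raises an error.

```python
import re

# One token: a position followed by a mask, insertion, deletion, or SNP.
TOKEN = re.compile(r"(\d+)(?:\+(\d+)N|I=([A-Z]+)|D=([A-Z]+)|([A-Z]))")


def parse_pseudocigar(cigar):
    """Split a PseudoCIGAR string into (kind, position, value) tuples."""
    events = []
    offset = 0
    while offset < len(cigar):
        match = TOKEN.match(cigar, offset)
        if match is None:
            raise ValueError(
                f"unrecognised PseudoCIGAR text at offset {offset}: {cigar[offset:]!r}"
            )
        position = int(match.group(1))
        if match.group(2) is not None:
            events.append(("mask", position, int(match.group(2))))  # value = length
        elif match.group(3) is not None:
            events.append(("insertion", position, match.group(3)))
        elif match.group(4) is not None:
            events.append(("deletion", position, match.group(4)))
        else:
            events.append(("snp", position, match.group(5)))
        offset = match.end()
    return events


events = parse_pseudocigar("25+25N94A169+8N188+9N")
for event in events:
    print(event)
```

Separating masks from mutations this way makes it straightforward to, for example, ignore mask tokens when comparing alleles.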

Interpreting Masked Sequences

In asv_masked, masked regions are replaced with lowercase n characters:

Reference:  ACTTGATTGCACA
ASV:        ACTTGATTGCACA
Masked:     ACTnnnnnnnACA  (positions 4-10 masked)
PseudoCIGAR: 4+7N
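
The relationship between a mask token and the masked sequence can be sketched as a simple slice replacement. This assumes the sequence is already in reference coordinates (no alignment gaps to account for), and the function name `apply_mask` is hypothetical. Note the conversion from the 1-based start position to Python's 0-based slicing; the masked sequence always has the same length as the input.

```python
def apply_mask(sequence, start, length):
    """Replace `length` bases beginning at 1-based `start` with lowercase 'n'."""
    begin = start - 1                             # 1-based position to 0-based index
    end = min(begin + length, len(sequence))      # do not run past the sequence end
    return sequence[:begin] + "n" * (end - begin) + sequence[end:]


# Mask token 4+7N applied to the 13-base sequence from the example.
masked = apply_mask("ACTTGATTGCACA", start=4, length=7)
print(masked)
```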

Resistance Marker Module Outputs

These files are generated when resistance marker analysis is enabled. They provide codon-level and mutation-level information for resistance-associated genes.

resmarker_table.txt

Description: Comprehensive resistance marker table showing codon-level information for all resistance-associated genes.

Columns:

Column Description
sample_name Sample identifier
GeneID Gene identifier (e.g., 0417200)
Gene Gene name (e.g., dhfr, mdr1, crt)
CodonID Codon position in the gene
RefCodon Reference codon sequence (3 bases)
Codon Observed codon sequence in the sample
CodonRefAlt Whether codon is reference (REF) or alternative (ALT)
RefAA Reference amino acid
AA Observed amino acid
AARefAlt Whether amino acid is reference (REF) or alternative (ALT)
FollowsIndel Boolean indicating if this codon follows an insertion/deletion
CodonMasked Boolean indicating if this codon is in a masked region
MultipleLoci Boolean indicating if this codon maps to multiple targets
reads Number of reads supporting this codon call

Example:

sample_name          GeneID   Gene  CodonID  RefCodon  Codon  CodonRefAlt  RefAA  AA  AARefAlt  FollowsIndel  CodonMasked  MultipleLoci  reads
SRR26819135_S1_L001  0417200  dhfr  16       GCA       GCA    REF          A      A   REF       False          False        False         127
SRR26819135_S1_L001  0417200  dhfr  51       AAT       ATT    ALT          N      I   ALT       False          False        False         127

Tip: Interpreting Resmarker Table

This file is a collapsed version of resmarker_table_by_locus.txt. If a codon is found in multiple targets for a specific marker, the reads are summed and the rows are collapsed into one; in that case MultipleLoci will be True. If you need to analyse the data per target, for example to filter out a poorly performing target or to collapse reads in a different way (e.g., mean), use resmarker_table_by_locus.txt.
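
As an illustration of reading this table, the sketch below lists amino acid changes (rows where AARefAlt is ALT) using the conventional label of reference amino acid, codon position, observed amino acid. It assumes a tab-delimited file; the function name `amino_acid_changes` is hypothetical. The inline text repeats the example rows above.

```python
import csv
import io

# Inline copy of the example rows; the tab-delimited layout is an assumption.
EXAMPLE = (
    "sample_name\tGeneID\tGene\tCodonID\tRefCodon\tCodon\tCodonRefAlt\tRefAA\tAA\t"
    "AARefAlt\tFollowsIndel\tCodonMasked\tMultipleLoci\treads\n"
    "SRR26819135_S1_L001\t0417200\tdhfr\t16\tGCA\tGCA\tREF\tA\tA\tREF\t"
    "False\tFalse\tFalse\t127\n"
    "SRR26819135_S1_L001\t0417200\tdhfr\t51\tAAT\tATT\tALT\tN\tI\tALT\t"
    "False\tFalse\tFalse\t127\n"
)


def amino_acid_changes(handle):
    """Return (sample, label, reads) for every row whose amino acid is ALT."""
    changes = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["AARefAlt"] == "ALT":
            label = f'{row["Gene"]} {row["RefAA"]}{row["CodonID"]}{row["AA"]}'
            changes.append((row["sample_name"], label, int(row["reads"])))
    return changes


changes = amino_acid_changes(io.StringIO(EXAMPLE))
for sample, label, reads in changes:
    print(sample, label, reads)
```

Depending on the analysis, rows where FollowsIndel or CodonMasked is True may also need separate handling; that choice is left out of this sketch.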


resmarker_table_by_locus.txt

Description: Resistance marker table organized by target, showing which target each codon call comes from. If you have no overlapping targets then this information will be the same as the above file (resmarker_table.txt).

Columns:

Column Description
sample_name Sample identifier
GeneID Gene identifier
Gene Gene name
target_name Target identifier where this codon was observed
CodonID Codon position in the gene
RefCodon Reference codon sequence
Codon Observed codon sequence
CodonRefAlt Whether codon is REF or ALT
RefAA Reference amino acid
AA Observed amino acid
AARefAlt Whether amino acid is REF or ALT
FollowsIndel Boolean indicating if codon follows an indel
CodonMasked Boolean indicating if codon is in a masked region
reads Number of reads supporting this codon call

Difference from resmarker_table.txt: This file includes target_name to show which target each codon call came from. This may be useful if you want to filter targets or collapse codons covered by multiple targets in a different way than the pipeline does (e.g., max or mean).
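
A custom collapse of the per-target rows might look like the sketch below, which groups codon calls by sample, gene, codon position, and observed codon, then applies an aggregation function of your choice. The function name `collapse_codon_calls`, the grouping key, and the toy rows (including the target names) are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def collapse_codon_calls(rows, aggregate=max):
    """Collapse per-target codon calls to one read value per codon call."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["sample_name"], row["GeneID"], row["Gene"],
               row["CodonID"], row["Codon"])
        grouped[key].append(int(row["reads"]))
    return {key: aggregate(reads) for key, reads in grouped.items()}


# Toy rows with made-up values: the same codon call seen in two targets.
toy_rows = [
    {"sample_name": "s1", "GeneID": "0417200", "Gene": "dhfr",
     "target_name": "targetA", "CodonID": "51", "Codon": "ATT", "reads": "100"},
    {"sample_name": "s1", "GeneID": "0417200", "Gene": "dhfr",
     "target_name": "targetB", "CodonID": "51", "Codon": "ATT", "reads": "60"},
]

by_max = collapse_codon_calls(toy_rows, aggregate=max)
by_mean = collapse_codon_calls(toy_rows, aggregate=mean)
print(by_max)
print(by_mean)
```

Passing `aggregate=sum` would reproduce the summing behaviour described for resmarker_table.txt, at least for this simple grouping.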


resmarker_microhaplotype_table.txt

Description: Table showing combinations of codons from the same microhaplotype for resistance markers of interest.

Columns:

Column Description
sample_name Sample identifier
GeneID Gene identifier
Gene Gene name
target_name Target identifier
MicrohaplotypeCodonIDs List of codon positions in this microhaplotype (e.g., 16/51/59)
RefMicrohap Reference microhaplotype (amino acids separated by /, e.g., A/N/C)
Microhaplotype Observed microhaplotype (amino acids separated by /, e.g., A/I/R)
MicrohapRefAlt Whether microhaplotype is REF or ALT
reads Number of reads supporting this microhaplotype

Example:

sample_name          GeneID   Gene  target_name                    MicrohaplotypeCodonIDs  RefMicrohap  Microhaplotype  MicrohapRefAlt  reads
SRR26819135_S1_L001  0417200  dhfr  Pf3D7_04_v3-0748128-0748326    16/51/59                A/N/C        A/I/R           ALT             381

Note: Microhaplotypes

Microhaplotypes represent combinations of mutations that are analyzed together. This is useful for understanding co-occurring mutations that may have different resistance implications than individual mutations.
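
The slash-separated fields can be unpacked into per-codon changes by splitting the three columns in parallel. The sketch below does this for the example row above; the function name `microhaplotype_changes` is hypothetical.

```python
def microhaplotype_changes(codon_ids, ref_microhap, microhap):
    """Return labels such as 'N51I' for codons where observed differs from reference."""
    ids = codon_ids.split("/")
    ref = ref_microhap.split("/")
    obs = microhap.split("/")
    if not (len(ids) == len(ref) == len(obs)):
        raise ValueError("fields must have the same number of '/'-separated parts")
    return [f"{r}{i}{o}" for i, r, o in zip(ids, ref, obs) if r != o]


# Values taken from the example row: codons 16/51/59, A/N/C versus A/I/R.
changes = microhaplotype_changes("16/51/59", "A/N/C", "A/I/R")
print(changes)
```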


all_mutations_table.txt

Description: Mutation table showing all single-nucleotide variants (SNVs) in targets that covered at least one marker of interest.

Columns:

Column Description
sample_name Sample identifier
GeneID Gene identifier
Gene Gene name
target_name Target identifier
LocusPosition Position in the target (not gene position)
Alt Alternative (mutant) base
Ref Reference base
reads Number of reads supporting this mutation

When to use: This file provides a simple view of all mutations without codon or amino acid interpretation. Useful for finding novel mutations outside the markers of interest.

Example:

sample_name          GeneID   Gene  target_name                    LocusPosition  Alt  Ref  reads
SRR26819135_S1_L001  0417200  dhfr  Pf3D7_04_v3-0748128-0748326    110            T    A    127
SRR26819135_S1_L001  0417200  dhfr  Pf3D7_04_v3-0748128-0748326    133            C    T    127

Panel Information Files

panel_information/

This directory contains copies of the panel configuration files used during the run:

  • amplicon_info.tsv: Target definitions (targets, primers, coordinates)
  • reference.fasta: Reference sequences used for alignment
  • resmarker_info.tsv: Resistance marker definitions

When to use: Useful when a run combines multiple pre-configured pools, because the pipeline generates these files automatically from the list of pools. They are also informative for downstream analysis, for example to see which resistance markers were identified as being covered by targets.


Run Metadata Files

run/parameters.tsv

Description: Tab-separated file containing all parameters used in this pipeline run.

Format: Two columns: parameter name and parameter value.

Example:

pools                D1,R1,R2
readDIR              tests/example_data/example_fastq
workflow_name        complete
cutadapt_minlen      100
quality_score        20
...

When to use: Useful for reproducing runs or understanding what settings were used.
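
Because the file is a two-column, tab-separated list, it can be read into a dictionary with the standard library. The sketch below assumes there is no header row (the example above shows none, but this is not stated explicitly), and the function name `read_parameters` is hypothetical. Values are kept as strings.

```python
import csv
import io

# Inline copy of a few example lines from the listing above.
EXAMPLE = (
    "pools\tD1,R1,R2\n"
    "readDIR\ttests/example_data/example_fastq\n"
    "workflow_name\tcomplete\n"
    "cutadapt_minlen\t100\n"
    "quality_score\t20\n"
)


def read_parameters(handle):
    """Read name/value pairs from a two-column tab-separated file into a dict."""
    params = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) >= 2:  # skip blank or malformed lines
            params[fields[0]] = fields[1]
    return params


params = read_parameters(io.StringIO(EXAMPLE))
print(params["workflow_name"], params["pools"].split(","))
```

For a real run, pass a file handle such as `open("results/run/parameters.tsv", newline="")` instead of the inline text.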


run/runtime.tsv

Description: Runtime information about the pipeline execution.

Columns:

Column Description
PipelineVersion Version of the pipeline
ContainerEngine Container engine used (e.g., docker, apptainer)
Duration Total runtime (e.g., 10m 44s)
CommandLine Full command line used to run the pipeline
CommitId Git commit ID (if available)
Complete Timestamp when pipeline completed
ConfigFiles Path to configuration files used
Container Container image used
ExitStatus Exit status (0 = success)
Profile Nextflow profile used
RunName Nextflow run name

When to use: Useful for tracking pipeline versions, runtime, and debugging issues.


Understanding Output File Relationships

Raw FASTQ files
    ↓
sample_coverage.txt (tracks reads through stages)
    ↓
raw_dada2_output/dada2.clusters.txt (DADA2 inference)
    ↓
allele_data.txt (alignment, masking, mutation calling)
    ↓
allele_data_collapsed.txt (simplified version)
    ↓
resistance_marker_module/ (if enabled)
    ├── resmarker_table.txt (codon-level)
    ├── resmarker_table_by_locus.txt (with target info)
    ├── resmarker_microhaplotype_table.txt (haplotypes)
    └── all_mutations_table.txt (discovery of novel mutations)

Next Steps

  • Analyze results: Use the allele tables to identify variants in your samples
  • Check quality: Review coverage files to ensure sufficient read depth
  • Resistance analysis: If enabled, review resistance marker tables for drug resistance mutations
  • Reproducibility: Use run/parameters.tsv to reproduce your analysis