Post-Processing

This module rearranges the counts matrix produced from sequence inference and filters out sequences that do not map to Plasmodium falciparum. Users have the option to additionally mask regions of low complexity that are known to cause sequencing error, and can cause spurious output reducing precision.

Masking Low Complexity Regions

There are a number of steps in the postprocessing module (DADA2_POSTPROC) to reduce false positives in your final allele table. In particular, tandem repeats and hompolymer regions are masked due to known problems that they cause in Illumina instruments. Masked regions will appear as N’s in your sequences such as the example below.

Below are a table of parameters to control how much masking of low complexity regions should occur. These can be modified in the nextflow.config file.

Parameter	Description
mask_homopolymers	Whether to mask homopolymers (default `true`)
homopolymer_threshold	The number of repeating bases to qualify a sequence as a homopolymer region (default `5`)
mask_tandem_repeats	Whether to mask tandem repeats (default `true`)
trf_min_score	Used by Tandem Repeat Finder. This will control the alignment score required to call a sequence a tandem repeat (default `25`)
trf_max_period	Used by Tandem Repeat Finder. This will limit the range of the pattern size of a tandem repeat (default `3`)

Examples

Here is a sequence from DADA:

TATATATATATATATATATATATATATATATATATATATATATATGTATGTATGTTGATTAATTTGTTTATATATTTATATTTATTTCTTATGACCTTTTTAGGAACGACACCGAAGCTTTAATTTACAATTTTTTGCTATATCCATGTTAGATGCCTGTTCAGTCATTTTGGCCTTCATAGGTCT

And here is it’s masked counterpart using the pipeline defaults. Notice how the homopolymers and tandem repeats are masked by Ns.

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTATGTATGTTGATTAATTTGTTTATATATTTATATTTATTTCTTATGACCTTTTTAGGAACGACACCGAAGCTTTAATTTACANNNNNNNNCTATATCCATGTTAGATGCCTGTTCAGTCATTTTGGCCTTCATAGGTCT

The masking is accomplished with Tandem Repeat Finder. Please refer to their documentation for additional information.

File Outputs

alignments{.RDS, .txt}

The sequences from DADA2 are aligned against reference sequences to detect indels and SNPs. This process also outputs a score is used as a filtration step to remove off target sequences. This file contains the original sequence from DADA2, the aligned version (with alignment gaps), the reference sequence that was aligned to (with alignment gaps), and includes the score of the alignment. This file can be useful if you would like to know:

Where tandem repeat or homopolymer sequencing errors occured (these are masked in the final output). For example, sequences with long tandem repeat regions may have indels that will not be easy to see in the original DADA2 sequences, and are masked to reduce false positives. In this file, long indel gaps in either the hapseq or refseq columns will be much easier to see.
The distribution of alignment scores. This is useful if you find human reads in your allele table and want to adjust your filtration step. Generally, a score of 60-70 will be a good enough threshold, but you may find that this needs to be adjusted.

allele_data{.RDS, .txt}

This file contains the final processed alleles from your demultiplexed sequencing files. The alleles are grouped by locus in each sample and include relative abundance metrics. You will likely see ‘N’ characters in the allele sequences. These are here to mask low complexity regions that are known to cause sequencing errors (ie.homopolymers and tandem repeats). Without masking, there would be more unique alleles in the final output that are actually known false positive. The length or number of these regions is a function of the provided reference sequences, and is not impacted by your sequencing data.