Snakemake pipeline for post-processing, QC, and visualization of ARG tree sequences (.ts, .trees, .tsz).
If you use, please cite:
Ross-Ibarra, J. 2026. ARGtest: tools for QC and validation of ancestral recombination graphs. doi: 10.5281/zenodo.19698118
- Install
- Suggested Workflow
- Snakemake Pipeline
- Scripts
- Auxiliary scripts
- Inputs, formats, defaults & logs
- Repository notes
See NOTES.md for the shared module layout and trim_samples.py sample-ID matching rules. Run any script with --help for its full argument list.
conda env create -f environment.yml
conda activate argtestCore dependencies are in environment.yml (numpy, matplotlib, tskit, tszip, msprime, snakemake, pytest).
One reasonable post-processing workflow for ARG tree sequences in this repo is:
- Find low rec regions Because regions of low recombination are more affected by linked selection, for analyses assuming the neutrality of the ARG it may be a good idea to remove low recombination regions ahead of time. Find windows in the genetic map in the bottom
Xpercentile ofcM/Mbusing scripts/hapmap_low_rec_mask.py. This turns a HapMap-style recombination map plus a.faiinto per-chromosome BED masks for very low-recombination regions. - Find regions of poor alignment Find windows of
sizekb where more thanX% of bp are masked using scripts/find_low_access_regions.py. This inspects the inferred mutation map for a tree sequence and writes low-accessibility windows to a BED file. - Find windows with aberrant mutational load All samples in a tree should have the same number of derived mutations, since all have the same distance from root. In windows of
numberSNPs, identify individuals withX% more or fewer derived mutations than the window median. Use scripts/mutload_summary.py for an interactive HTML summary (ASCII bar charts with outlier individuals highlighted in red, plus a lineage table). Use scripts/mutload_masks.py to write the outlier and mutation-masked BED files needed for the pipeline; this is what the Snakemake workflow calls. - Remove affected regions For each chromosome, combine the BED files from steps 1-3 (.low_rec.mask.bed, low_access.ws.accbp.bed, and <ts_stem>_mutation_masked.bed), then remove those genomic regions from a directory of tree sequences with scripts/trim_regions.py. This script applies a supplied BED mask and writes trimmed tree sequences.
- Remove affected samples In many cases, only a few samples within a window will be problematic. They could have evidence of introgression (identified using e.g. TRACE) or odd patterns of derived mutations (see step 3). Using a bedfile specifying regions and individuals, prune those individuals from the trees with scripts/trim_samples.py.
- Validate Run the validation plots with scripts/validation_plots_from_ts.py to get a sense of the cleaned ARG. This gives a compact set of QC plots for mutational load, diversity, Tajima's D, and the site-frequency spectrum (both folded and unfolded). Compare these to the same plots run on the original tree sequences.
- If satisfied, merge chromosomes for each replicate for downstream analysis using scripts/merge_treefiles_by_replicate.py. This concatenates chromosome-specific tree files into one combined tree sequence per replicate. Mutation-rate ratemaps embedded in each chromosome's metadata are merged and carried forward in the combined output.
- Summarise Generate a self-contained HTML pipeline summary with scripts/pipeline_summary.py. This collects genome retention statistics across all pipeline steps (mean ± SD across replicates for per-replicate steps), per-individual outlier counts from step 5, and embeds the validation plots for every chromosome in a single HTML file.
For a rule-based version of the workflow above, this repository now includes a Snakemake pipeline in Snakefile.
The Snakemake workflow is designed for a directory layout with one subdirectory per chromosome and one treefile per replicate inside each chromosome directory:
<root>/
chr1/
1.tsz
2.tsz
...
chr2/
1.tsz
2.tsz
...
The Snakemake workflow discovers treefiles with suffixes .ts, .trees, and .tsz. Replicate IDs are taken from the filename stem, so chr1/1.tsz is replicate 1 for chromosome chr1.
The Snakemake config is in config/snakemake.yaml. Edit it for your project and supply it with --configfile. The file has an inline comment for every option.
Required keys:
root_dir: path to the chromosome-subdirectory roothapmap: HapMap recombination map used for step 1fai: FASTA index used for chromosome lengthsrec_fraction: fraction of recombination-rate intervals (ranked byRate(cM/Mb), lowest first) to include in the low-recombination mask; e.g.0.1masks the bottom 10 % of intervals, while0.0writes empty low-recombination maskslow_access_window_size: window size in bp for step 2low_access_cutoff_bp: minimum accessible bp per window for step 2- exactly one of
mutload_window_sizeormutload_snp_windowfor step 3
Optional keys (all have sensible defaults):
tree_pattern: glob for treefiles within each chromosome directory (default:"*"), for example"*.trees"or"*.tsz"mutload_cutoff: outlier cutoff fraction for step 3 (default:0.25)mutload_fraction: fraction threshold for writing mutation-masked BED rows in step 3suffix_to_strip: suffix removed from sample IDs before matching in step 3 and step 5 (default:"_anchorwave")allow_missing_replicates: set totrueto concatenate partial replicate sets (default:false)burnin: number of leading discovered replicates to discard before concatenation (default:0); must be smaller than the number of replicates discovered after applyingtree_patternbase_name: prefix used for merged outputs (default: name ofroot_dir)merged_out_suffix: force a specific output suffix for merged files (.ts,.trees,.tsz); default is to inherit the suffix of the first inputout_dir: output root for Snakemake products (default:snakemake_out; tilde is expanded)validation_mutation_rate: mutation rate used by step 6 validation plots; omit or set tonullto skip step 6validation_first_chrom_only: run step 6 only on the first chromosome (default:true)validation_sim_branch: simulate site mutations on each ARG replicate with msprime for a posterior-predictive check instead of scaling branch statistics (default:false)
Mutation-rate maps are inferred from the treefile location using the same logic as the scripts, so the usual *.mut_rate.p files should be available near each chromosome directory.
The Snakemake workflow runs the same logical steps as the manual workflow:
- Build per-chromosome low-recombination BED masks
- Build per-chromosome low-accessibility BED masks
- Build per-chromosome, per-replicate mutational-load outlier BEDs and optional mutation-masked BEDs
- Combine the BED masks and trim affected regions from each treefile; the mutation-rate ratemap is embedded in each trimmed treefile's metadata
- Trim the affected samples from each trimmed treefile
- Concatenate chromosomes for each replicate into one combined tree sequence; ratemaps are merged across chromosomes and carried forward
- (optional) Run validation plots on original and cleaned tree sequences and generate a self-contained HTML pipeline summary
The final merged outputs are named:
<base_name>.combined.<replicate>.<suffix>
and are written under the configured out_dir in a combined/ directory.
From the repo root:
module load conda
conda activate argtestEdit config/snakemake.yaml to point at your data (at minimum set root_dir, hapmap, and fai), then run.
The committed config/snakemake.yaml already points at the example dataset at example_data/sim_2chr_5rep_clean and can be used as-is for a test run. It uses rec_fraction: 0.0 to write empty low-recombination masks and burnin: 0 because the bundled example has 5 discovered replicates; burnin must always be smaller than the discovered replicate count.
In sandboxed environments where ~/.cache is read-only, set cache/temp dirs to /tmp when running Snakemake:
XDG_CACHE_HOME=/tmp/argtest-xdg-cache TMPDIR=/tmp/argtest-tmp \
python -m snakemake -n -p --configfile config/snakemake.yamlRun the workflow for real:
XDG_CACHE_HOME=/tmp/argtest-xdg-cache TMPDIR=/tmp/argtest-tmp \
python -m snakemake --cores 16 --rerun-incomplete --keep-going --configfile config/snakemake.yamlBy default, Snakemake writes outputs beneath out_dir with subdirectories for each stage:
<out_dir>/
step1_low_rec/
step2_low_access/
step3_mutload/
step4_masks/
step4_trimmed_regions/
step5_trimmed_samples/
combined/
step6_validation/ # step 6 validation plots (original and cleaned), if configured
pipeline_summary.html
logs/
Intermediate filenames include both chromosome and replicate information so they stay unique across the full workflow.
Pipeline scripts (called by the Snakefile). Run any with --help for arguments, defaults, and examples.
hapmap_low_rec_mask.py— per-chromosome BED of the bottom--rec-fractionof HapMap recombination-rate intervals.find_low_access_regions.py— BED of low-accessibility windows, computed from a tree sequence's inferred mutation map.mutload_summary.py— interactive HTML summary of per-individual mutational-load outliers (ASCII bar charts, outlier histogram, lineage table).mutload_masks.py— outlier and mutation-masked BED files for one tree sequence (pipeline step 3).combine_remove_masks.py— merge the step 1–3 BED masks into a single combined BED per chromosome.trim_regions.py/trim_regions_single.py— apply a BED mask to a directory (or single file) of tree sequences and write trimmed outputs with compacted coordinates.trim_samples.py— remove individuals genome-wide (--individuals) or over BED intervals (--remove). See NOTES.md for the exact sample-ID matching rules.validation_plots_from_ts.py— SINGER-style QC plots (mutational load, diversity, Tajima's D, folded/unfolded SFS) across TS replicates; optional observed-vs-simulated overlays.merge_treefiles_by_replicate.py— concatenate chromosome-specific tree-sequence files by replicate; embedded mutation-rate ratemaps are merged and carried forward.pipeline_summary.py— self-contained HTML report of genome retention, per-individual outlier counts, and embedded validation plots.
Diversity (π), Tajima's D, and SFS in this pipeline are computed with tskit's built-in methods (ts.diversity, ts.Tajimas_D, ts.allele_frequency_spectrum). These normalize by a constant n · (n − 1) / 2 based on the sample set passed in, so when the sample size varies across regions — as it does after trim_samples.py removes individuals over BED intervals — the per-window statistics are not correctly normalized for the locally retained sample count. Treat post-pruning stats with caution, and prefer comparisons on replicates where sample membership is uniform across the genome.
Scripts not called by the Snakemake pipeline.
coalescence_ne_plots_from_ts.py— pair-coalescence and Ne plots from TS replicates with explicit time bins; optional Demes-based coalescent simulations (--sim N) that produce window-stat and SFS TSVs for observed-vs-sim density plots invalidation_plots_from_ts.py.compare_trees_html.py— render one tree index from each of two tree sequences side-by-side into a single HTML file.trees_gallery_html.py— scrollable HTML gallery of all trees from two tree sequences, useful for quick before/after inspection.
TODO: Running
coalescence_ne_plots_from_ts.py --sim Nbefore and after the pipeline (on the raw input tree sequences and on the step 5 output) and comparing the resulting Ne trajectories and simulation TSVs would be a useful formal QC step. Consider making this a standard part of the pipeline alongsidevalidation_plots_from_ts.py --compare.
- Tree-sequence files: scripts accept
.ts,.trees, and.tszfiles. Loading/writing.tszrequirestszipto be installed; scripts will raise a clear error iftszipis missing when a.tszis used. - BED files: expected as whitespace-separated lines
chrom start end [name].startandendare numeric (half-open intervals[start, end)). If a fourth columnnameis present it may list one or more comma-separated sample IDs; if omitted the BED filename stem is used as the sample name. Lines starting with#and blank lines are ignored. - HapMap recombination maps: when required (e.g.
hapmap_low_rec_mask.py), the script expects the HapMap format used bymsprime.RateMap.read_hapmap. - Glob
--pattern: arguments named--patternaccept shell-style glob patterns (for example "*.tsz") and are matched against filenames in the supplied directory. - Defaults & output locations: many scripts write to a
results/directory or to an output directory under the input tree-directory when--out/--out-dirare not provided. Examples:trim_samples.py: default output isresults/<ts_stem>_trimmed.tszwhen--outis not given.trim_regions.py: default--out-diris<ts-dir>/trimmedand default log is<out-dir>/logs/trim_regions.log.mutload_summary.pywritesresults/<name>.htmlandlogs/<name>.log; no BED files are written (usemutload_masks.pyfor BED output).- Several plotting scripts write PNG files into
results/by default; most have--outor--out-dirflags to override this.
- Logging & errors: scripts write summary logs into
logs/or into the chosen--out-dir(see each script's--log/--out-diroptions). Common failure modes include missingtszipfor.tszI/O, mismatched sequence lengths across chromosome files (checked bytrim_regions.py), and invalid BED line formats (the loader will raiseValueErrorwhen a BED line has fewer than 3 columns). When a script prints anERROR:or raises an exception, check the correspondinglogs/or<out-dir>/log file for the detailed run record.
- Generated
logs/andresults/are git-ignored. .DS_Storeis git-ignored.
None of this would be possible without the patient help and advice of Nate Pope. Any errors, bad code, or poor interpretations, however, are my responsbility alone. This repo also uses code from Nate Pope's singer-snakemake.