Integrating spacegraphcats into a snakemake workflow

This page is under construction; please check back as we add more documentation. Snakemake is a workflow manager that automates and scales bioinformatics pipelines and makes them portable across computing environments.

A sample snakemake workflow

See a sample snakemake workflow for a spacegraphcats pipeline here.

The spacegraphcats rule is copied below:

rule spacegraphcats_one_arg:
    input: 
        query = "outputs/arg90_matches/cfxA4_AY769933.fna", 
        conf = "outputs/sgc_conf/{sample}_r{radius}_conf.yml",
        reads = "outputs/abundtrim/{sample}.abundtrim.fq.gz"
    output:
        "outputs/sgc_arg_queries_r{radius}/{sample}_k31_r{radius}_search_oh0/cfxA4_AY769933.fna.cdbg_ids.reads.gz",
        "outputs/sgc_arg_queries_r{radius}/{sample}_k31_r{radius}_search_oh0/cfxA4_AY769933.fna.contigs.sig"
    params: outdir = lambda wildcards: "outputs/sgc_arg_queries_r" + wildcards.radius
    conda: "envs/spacegraphcats.yml"
    resources:
        mem_mb = 64000
    threads: 1
    shell:'''
    python -m spacegraphcats run {input.conf} extract_contigs extract_reads --nolock --outdir={params.outdir} --rerun-incomplete 
    '''
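
To see which files the rule above expects for a given sample and radius, the wildcard expansion can be mimicked in plain Python. This is only a sketch: `expected_outputs` is a hypothetical helper, not part of snakemake or spacegraphcats, and the path patterns are copied from the rule's `params` and `output` sections.

```python
def expected_outputs(sample, radius, query="cfxA4_AY769933.fna"):
    """Return (outdir, output paths) as the sample rule would resolve them.

    Hypothetical helper for illustration only; the patterns mirror the
    `params` and `output` sections of the rule above.
    """
    radius = str(radius)  # snakemake wildcard values are strings
    outdir = "outputs/sgc_arg_queries_r" + radius
    search_dir = f"{outdir}/{sample}_k31_r{radius}_search_oh0"
    outputs = [
        f"{search_dir}/{query}.cdbg_ids.reads.gz",
        f"{search_dir}/{query}.contigs.sig",
    ]
    return outdir, outputs


# "sampleA" is a made-up sample name.
outdir, outputs = expected_outputs("sampleA", 1)
print(outdir)
for path in outputs:
    print(path)
```

Both output files sit under the directory passed to `--outdir`, which is why the rule derives `params.outdir` from the same `radius` wildcard that appears in its output paths.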

Configuration files

Spacegraphcats requires a configuration file that specifies run parameters, such as k-mer size and radius, and base file names. This file can be written by hand or generated by a script before the workflow runs, or it can be generated automatically by the workflow itself.
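
As a rough illustration of what such a file contains, the sketch below checks a configuration, represented as a Python dictionary, for the keys used in the example configuration files on this page. `check_sgc_conf` and the list of expected keys are assumptions made for illustration; spacegraphcats itself may accept or require other keys.

```python
# Keys taken from the example configuration on this page (an assumption,
# not an authoritative list of what spacegraphcats requires).
EXPECTED_KEYS = {
    "catlas_base": str,
    "input_sequences": list,
    "ksize": int,
    "radius": int,
    "search": list,
}


def check_sgc_conf(conf):
    """Return a list of problems found in a configuration dictionary."""
    problems = []
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in conf:
            problems.append(f"missing key: {key}")
        elif not isinstance(conf[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    return problems


conf = {
    "catlas_base": "sampleA",  # made-up sample name
    "input_sequences": ["outputs/abundtrim/sampleA.abundtrim.fq.gz"],
    "ksize": 31,
    "radius": 1,
    "search": ["inputs/queries/ERS235530_10.fna.gz"],
}
print(check_sgc_conf(conf))  # complete configuration
print(check_sgc_conf({"catlas_base": "sampleA"}))  # incomplete configuration
```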

Generating configuration files prior to running snakemake

If the spacegraphcats queries are known before the workflow runs, the configuration files can be generated ahead of time. The Python code below generates one configuration file per sample. It assumes that sample names are stored in the sample_name column of a metadata csv file and that abundance-trimmed reads are stored in a directory called outputs/abundtrim. It writes configuration files to inputs/sgc_conf, a directory the user needs to create before running the code. Note that the sample rule above reads its configuration from outputs/sgc_conf, so one of the two paths must be adjusted if the rule and this script are used together.

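import io

import pandas as pd
import yaml

# Read the sample metadata; the table must contain a 'sample_name' column.
# pd.read_csv splits on commas by default, so pass sep="\t" if the file is tab-delimited.
m = pd.read_csv("metadata.tsv", header=0)
SAMPLES = m['sample_name'].unique().tolist()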

genome_queries = genomes = ["ERS235530_10.fna.gz", "ERS235531_43.fna.gz", "ERS235603_16.fna.gz"]
genome_query_paths = ["inputs/queries/" + genome for genome in genome_queries]

# Write one configuration file per sample, all with k-mer size 31 and radius 1.
for sample in SAMPLES:
    yml = {'catlas_base': sample,
           'input_sequences': ['outputs/abundtrim/' + sample + '.abundtrim.fq.gz'],
           'ksize': 31, 
           'radius': 1,
           'search': genome_query_paths}
    with io.open("inputs/sgc_conf/" + sample + '_r1_conf.yml', 'w', encoding='utf8') as outfile:
        yaml.dump(yml, outfile, default_flow_style=False, allow_unicode=True, sort_keys=False)
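
Because the script above fails if inputs/sgc_conf does not exist, the directory can also be created from Python before any file is written. The following is a minimal sketch using only the standard library; `conf_path` is a hypothetical helper, and its file name pattern follows the one used above. The example writes into a temporary directory so that it can be run without touching a real project.

```python
import tempfile
from pathlib import Path


def conf_path(sample, radius=1, conf_dir="inputs/sgc_conf"):
    """Create conf_dir if needed and return the config path for one sample."""
    directory = Path(conf_dir)
    # parents=True creates intermediate directories; exist_ok=True makes
    # the call safe to repeat for every sample.
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"{sample}_r{radius}_conf.yml"


# "sampleA" is a made-up sample name.
with tempfile.TemporaryDirectory() as tmp:
    path = conf_path("sampleA", conf_dir=Path(tmp) / "inputs" / "sgc_conf")
    created = path.parent.is_dir()

print(path.name, created)
```

In the loop above, `conf_path(sample)` could replace the hand-built file name passed to `io.open`, which also accepts `Path` objects.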

Using snakemake to automatically generate configuration files

In some workflows, the queries are not known in advance and are instead produced by the workflow itself. In this case, snakemake checkpoints can be used to automatically and flexibly generate spacegraphcats configuration files based on the results of the workflow.

The genome-grist pipeline implements this strategy based on the output of sourmash gather. When run against a metagenome, gather compares the metagenome against a database of genomes and returns the minimum set of genomes that will produce the maximum number of mapped reads. One output of gather is a CSV file that lists the genomes contained in the metagenome. These genomes make a great set of spacegraphcats queries, but they are unknown until the workflow has run sourmash gather on the metagenome.
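
The step that a checkpoint makes possible can be sketched in plain Python: once sourmash gather has written its CSV, the genomes listed there are read back and turned into the `search` entry of a spacegraphcats configuration. This is not genome-grist's actual code. It assumes that the gather CSV has a `name` column whose first whitespace-separated token identifies the genome, and that query genomes live under a hypothetical `genomes/<identifier>.fna.gz` layout; `queries_from_gather` and `make_conf` are illustrative names.

```python
import csv
import io


def queries_from_gather(handle, genome_dir="genomes"):
    """Return query genome paths for every match in a gather CSV.

    Assumes a `name` column whose first whitespace-separated token is a
    genome identifier, and a hypothetical genome_dir/<identifier>.fna.gz layout.
    """
    paths = []
    for row in csv.DictReader(handle):
        identifier = row["name"].split()[0]
        paths.append(f"{genome_dir}/{identifier}.fna.gz")
    return paths


def make_conf(sample, query_paths, ksize=31, radius=1):
    """Build a configuration dictionary with the keys used earlier on this page."""
    return {
        "catlas_base": sample,
        "input_sequences": [f"outputs/abundtrim/{sample}.abundtrim.fq.gz"],
        "ksize": ksize,
        "radius": radius,
        "search": query_paths,
    }


# Made-up gather output, trimmed to the one column this sketch reads.
example_csv = io.StringIO(
    "name\n"
    "GENOME_A example organism one\n"
    "GENOME_B example organism two\n"
)
conf = make_conf("sampleA", queries_from_gather(example_csv))
print(conf["search"])
```

Inside a Snakefile, a function like this would typically be called from an input function after `checkpoints.<rule>.get(**wildcards)` has confirmed that the gather output exists. The resulting dictionary can then be written with `yaml.dump`, as in the script earlier on this page.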