
Snakefile config

Francisco Zorrilla edited this page Mar 25, 2021 · 14 revisions

Overview

The config.yaml file contains a list of parameters that are read in by the Snakefile. Instead of editing the Snakefile whenever you want to change a parameter, just create a new copy of the config.yaml file and edit that. Now that's what I call reproducibility.

The config.yaml file looks something like this:

path:
    root: /path/to/your/project/folder/on/the/cluster
    scratch: $SCRATCH_FOLDER_VARIABLE_SPECIFIC_TO_YOUR_CLUSTER
folder:
    data: dataset
    logs: logs
    assemblies: assemblies
    ...
scripts:
    kallisto2concoct: kallisto2concoct.py
    prepRoary: prepareRoaryInput.R
    binFilter: binFilter.py
    ...
cores:
    fastp: 4
    megahit: 48
    crossMap: 24
    ...
params:
    cutfasta: 10000
    assemblyPreset: meta-sensitive
    assemblyMin: 1000
    ...
envs:
    metagem: metagem
    metawrap: metawrap
    prokkaroary: prokkaroary

Let's now look at each category in this config file.

path

root

The root path is set automatically by the metaGEM.sh parser to the folder you are submitting jobs from. This is where folders will be created to store the generated files:

~/cluster_login_home/
|-project_X/
|--root/
|---logs
|---dataset
|---qfiltered
|---assemblies
...

scratch

The scratch path is cluster-specific, and you will likely need to consult the wiki for your institution's cluster to determine how it should be set. Generally there should be some directory for high I/O jobs, usually exposed through a variable called something like $SCRATCHDIR, $TMPDIR, or $TMP. The Snakefile assumes that this variable points to a unique location for each job submission. You should not set the scratch path to a fixed directory if you are submitting jobs in parallel, as multiple jobs would then copy and read files in the same temporary directory, which can result in errors.
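To illustrate why a unique location per job matters, here is a minimal Python sketch. It is not part of metaGEM: the helper name job_scratch_dir and the metagem_job_ prefix are made up for this example, and on most clusters the scheduler already provides a per-job directory so nothing like this is needed.

```python
import os
import tempfile

def job_scratch_dir(base=None):
    """Create and return a fresh, uniquely named directory under the scratch base."""
    # Hypothetical helper: falls back to the system temp dir (which honours $TMPDIR).
    base = base or tempfile.gettempdir()
    return tempfile.mkdtemp(prefix="metagem_job_", dir=base)

# Two "jobs" started side by side each get their own directory,
# so neither can overwrite the other's temporary files.
first = job_scratch_dir()
second = job_scratch_dir()
print(first != second)

# Clean up the empty directories created by this sketch.
os.rmdir(first)
os.rmdir(second)
```

Pointing every job at one fixed directory removes exactly this guarantee, which is how parallel jobs end up clobbering each other's files.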

folder

This is simply a list of all the subfolders that are used/created throughout the metaGEM workflow. You can generate these folders by running:

metaGEM.sh -t createFolders
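Conceptually, this step amounts to creating one subfolder under the root path for every entry in the folder section. The Python sketch below shows that idea only. It is not the actual createFolders implementation: the create_folders helper is hypothetical, the config dictionary is hand-written from the example at the top of this page, and a temporary directory stands in for the project root.

```python
import os
import tempfile

# Stand-in for the path and folder sections of config.yaml; a temporary
# directory plays the role of the project root so the sketch is safe to run.
root = tempfile.mkdtemp(prefix="metagem_root_")
config = {
    "path": {"root": root},
    "folder": {"data": "dataset", "logs": "logs", "assemblies": "assemblies"},
}

def create_folders(config):
    """Create one subfolder under the root for each entry in config['folder']."""
    created = []
    for name in config["folder"].values():
        target = os.path.join(config["path"]["root"], name)
        os.makedirs(target, exist_ok=True)  # no error if it already exists
        created.append(target)
    return created

created = create_folders(config)
print(sorted(os.listdir(root)))
```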

scripts

This contains a list of all the scripts or important files present in the scripts folder that are used throughout the metaGEM workflow.

cores

This lists the number of cores you wish to allocate to each task or job. Note that these values are read by the Snakefile, while the number of cores requested through the metaGEM.sh parser or the cluster_config.json file is read by the cluster workload manager. You should carefully ensure that these values match when submitting jobs.

e.g.

Let's say that we want to submit an assembly job with megahit, and the config.yaml file has 48 cores allocated to megahit by default as shown at the top of this page. If we were to run the following code:

metaGEM.sh -t megahit -j 1 -c 2 -m 12 -h 2

Then the metaGEM.sh parser would request 1 job with 2 cores + 12 GB RAM with a max runtime of 2 hours. Once the job starts, the Snakefile rule megahit will start running. The megahit call is implemented like so:

megahit -t {config[cores][megahit]} \
     --presets {config[params][assemblyPreset]} \
     --verbose \
     --min-contig-len {config[params][assemblyMin]} \
     -1 $(basename {input.R1}) \
     -2 $(basename {input.R2}) \
     -o tmp;

As you can see from the -t argument, the rule looks up the number of megahit cores in the config.yaml file. As mentioned, this is set to 48 cores by default, but the job submission only requested 2 cores. Since the job only has access to the resources it requested, only 2 cores will be used instead of the desired 48.
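The placeholder lookup can be reproduced in plain Python, because the curly-brace fields follow Python's format-string syntax. A minimal sketch, assuming a hand-written config dictionary in place of the one Snakemake builds from config.yaml, and with the command trimmed to the config-driven flags:

```python
# Hand-written stand-in for the dictionary built from config.yaml
# (values copied from the example config at the top of this page).
config = {
    "cores": {"megahit": 48},
    "params": {"assemblyPreset": "meta-sensitive", "assemblyMin": 1000},
}

# Same placeholders as the megahit rule, trimmed to the config-driven flags.
template = (
    "megahit -t {config[cores][megahit]} "
    "--presets {config[params][assemblyPreset]} "
    "--min-contig-len {config[params][assemblyMin]}"
)

# str.format resolves {config[cores][megahit]} as config["cores"]["megahit"].
command = template.format(config=config)
print(command)
```

Nothing in this lookup knows how many cores the scheduler actually granted, which is why the two values have to be kept in sync by hand.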

Even worse would be to request 48 cores for a task via a metaGEM.sh call while the config.yaml file specifies only 2 cores for that task. In this case you would be wasting 46 cores, and your cluster administrators will be very upset with you.

The moral of the story is to make sure that, when you use the metaGEM.sh parser for a particular task, you request the number of cores specified in the config.yaml file.
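If you want to double-check this before submitting, a small comparison is enough. The sketch below is not part of metaGEM: check_cores is a hypothetical helper, and the config dictionary is typed in by hand from the example values at the top of this page.

```python
def check_cores(config, task, requested):
    """Compare cores requested from the scheduler with cores set in config.yaml."""
    configured = config["cores"][task]
    if requested < configured:
        return f"{task}: only {requested} of the {configured} configured cores are available"
    if requested > configured:
        return f"{task}: {requested - configured} requested cores would sit idle"
    return None  # values match, nothing to report

# Hand-written stand-in for the cores section of config.yaml.
config = {"cores": {"fastp": 4, "megahit": 48, "crossMap": 24}}

print(check_cores(config, "megahit", 2))   # the under-requested case from the example
print(check_cores(config, "megahit", 48))  # matching values
```

The first call corresponds to the megahit example above (2 cores requested, 48 configured), and the second returns None because the values agree.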

params

Contains a list of the parameters used by the various individual tools.

envs

This simply contains a list of the conda environment names used within the Snakefile rules. For example, you will notice that most rules in the Snakefile start by activating a certain environment:

set +u;source activate {config[envs][metagem]};set -u;