GitHub - AlgoLab/pangeblocks: pangenome from maximal blocks in an MSA

Customized construction of pangenome graphs via maximal blocks

Pangenome graphs

A pangenome graph is a data structure for a set of genomic sequences, where nodes are labeled by substrings in the alphabet $\lbrace A,C,G,T \rbrace$ in the case of DNA, and $\lbrace A,C,G,U \rbrace$ in the case of RNA (unknown basis are represented by an $N$). Input sequences are represented by paths in the graph, where each input sequence is spell by the concatenation of labels of the nodes in its path.

What is a block?

In the context of an MSA, a block is defined as set of rows that shares the same string in an interval of columns.

Formally, a block composed of the set of rows $K$, in the interval $[b,e]$ is defined as the triplet $(K, b, e)$.

In the image below, each block is highlighted by a different color. For example,

The yellow one in the middle: $(\lbrace \square, \lozenge, \triangle \rbrace, 3, 5)$ spelling A - -
The blue one spanning all rows: $(\lbrace \circ ,\square, \lozenge, \triangle, \triangledown \rbrace, 7, 8)$ spelling G A
The green one in column 5: $(\lbrace \circ,\triangledown \rbrace, 5, 5)$ spelling spelling the single character C

How it works?

pangeblocks creates a variation graph from an MSA by selecting a set of blocks. It creates a search space of blocks from maximal blocks, and then an Integer Linear Programming model selects the best subset of blocks to cover all cells of the MSA.

For each block we create a node with its label.
Consecutive blocks are connected by an arc.
Each input sequence in the MSA is spelled by a path in the graph

Finally, indels are removed from the graph (could be the remotion of an entire node, or the removal of indels in the label of a node), and non-branching paths are collapsed.

Create a virtual environment

mamba create env -n pangeblocks -f envs/snakemake.yml
mamba activate pangeblocks

Create Variation Graphs from MSAs

To construct variation graphs from MSAs, run pangeblocks:

snakemake -s pangeblock.smk -c32 --use-conda # variation graph as GFA

Parameters

Set the parameters in params.yml:

PATH_MSAS: "msas" # folder containing MSAs in .fasta/.fa format
PATH_OUTPUT: "output" # folder where to save the results
OPTIMIZATION:
  OBJECTIVE_FUNCTION:
    - "nodes"    # minimize number of nodes 
    - "strings"  # minimize length of the graph
    - "weighted" # penalize by PENALIZATION blocks shorter that MIN_LEN (other blocks cost=1)
    - "depth"    # penalize by PENALIZATION blocks used by less than MIN_COVERAGE (other blocks cost=1)
  PENALIZATION: # used only with "weighted" and "depth"
    - 3
    - 5
    - 7 
    - 128
  MIN_LEN: # used only with "weighted"
    - 3
    - 5
    - 10   
  MIN_COVERAGE: # used only with "depth"
    - 0.1
    - 0.2
    - 0.3
  THRESHOLD_VERTICAL_BLOCKS: # minimum length for vertical blocks to be fixed in the optimal solution
    - 1
    - 2
    - 8
    - 16
  TIME_LIMIT: 30 # time limit to run each ILP (minutes)
LOG_LEVEL: "INFO"
THREADS: 
  TOTAL: 32
  SUBMSAS: 16
  ILP: 8

Running under docker

To run pangeblocks on a small example (MSA with 10 rows and 200 columns), run the following command, replacing /tmp/pgb with the directory that will contain the results.

docker run -it --user $(id -u):$(id -g) \ 
-v ./test/sars-cov-2-subMSA/:/data \
--mount type=bind,source=/tmp/pgb,target=/results \
algolab/pangeblocks:latest

If you want to run pangeblocks on your data, you also have to provide the directory containing the MSA, replacing ./test/sars-cov-2-subMSA/ with your directory and adding the correct pangeblocks call, specifying the arguments. For example:

docker run -it --user $(id -u):$(id -g) \ 
-v ./DATADIR/:/data \
--mount type=bind,source=/tmp/pgb,target=/results \
algolab/pangeblocks:latest
/app/pangeblocks --path-msa /data/my.msa

Considerations

Maximal blocks are computed with the Wild-PBWT
ILPs are solved using Gurobi
Each MSA must be in the ALPHABET ${A,C,G,T,-,N}$ Not case sensitive. We recommend to map all characters not in the alphabet to N.

[OPTIONAL]

snakemake -s eda.smk -c16         # compute stats for each MSA

The above smk pipeline will analyze the MSAs and output two files in PATH_OUTPUT/analysis-msas:

stats_msas.tsv with basic information about the MSAS: path, number of columns and rows (sequences), number of identical columns, and number of unique sequences
problematic_msas.tsv: contains a list of MSAs that has no information

Name		Name	Last commit message	Last commit date
Latest commit History 246 Commits
envs		envs
img		img
notebooks		notebooks
rules		rules
src		src
test		test
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
eda.smk		eda.smk
pangeblocks		pangeblocks
pangeblocks.smk		pangeblocks.smk
pangeblocks1.smk		pangeblocks1.smk
params.yml		params.yml
requirements.txt		requirements.txt
test-cmd.sh		test-cmd.sh
test_ram.sh		test_ram.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pangenome graphs

What is a block?

How it works?

Create a virtual environment

Create Variation Graphs from MSAs

Parameters

Running under docker

Considerations

About

Releases

Packages

Contributors 2

Languages

License

AlgoLab/pangeblocks

Folders and files

Latest commit

History

Repository files navigation

Pangenome graphs

What is a block?

How it works?

Create a virtual environment

Create Variation Graphs from MSAs

Parameters

Running under docker

Considerations

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages