Skip to content
/ SABAT Public

A toolkit for identifying gene homologs in closely related species.

Notifications You must be signed in to change notification settings

Orpowell/SABAT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

S.A.B.A.T - Semi-Automatic BLAST Annotation Toolkit

SABAT is toolkit for identifying gene homologs in other species using a gene coding sequence (CDS) and BLAST.

Homologs can be identified using the following steps:

  1. Make a BLASTdb of your genome of interest (with parameters: -parse_seqids, -blast_dbversion 5)
  2. BLAST your gene's CDS against the BLASTdb. (we suggest using dc-megaBLAST)
  3. Convert the BLAST output to a BED file and predict potential gene loci (SABAT blast2bed).
  4. Visualise your BED file with the genome of interest using IGV (Optional).
  5. Assemble the CDS and protein sequence from a predicted locus or set of exons (SABAT assemble-locus or assemble-exons).

Dependencies

The following dependencies are required for SABAT:

  • Python (v3.12.1)
  • Biopython (v1.78)
  • Pandas (v2.1.4)
  • BLAST+ (2.12.0+)
  • IGV (optional for visualising BED files)

The following commands can be used to setup an environment with all the dependencies using anaconda:

conda create -n SABAT python=3.12.1
conda install -c bioconda biopython blast
conda install pandas

Running SABAT

All SABAT commands can be run as follows:

source activate SABAT
python3 cli.py <command> ...

Commands

blast2bed

blast2bed analyses a BLAST output to identify potential gene homolog loci using information about the gene provided (see table below). All BLAST results and predicted loci are output as a single BED file that can be visualised with IGV. Manual curation can facilitate the identification of loci that are missed by blast2bed. This occurs when homolog gene architecture varies significantly from your gene of interest. E.g. differences in intron/exon size and structure. The commands can be used as follows:

python3 cli.py blast2bed \
--input results.blastn \
--exons 1 \
--coverage 1.1 \
--locus_size 1000 \
--refseq False \
> output.bed

Parameters

Parameter Description
-i, --input BLAST output in tabular format (outfmt6)
-e, --exons Expected number of exons in the gene. Default: 1
-c, --coverage A float that represents the proportion of gene CDS that must be covered by a group of BLAST hits for a locus to be defined (0.0-1.0). Default: 1.1
-l, --locus_size Expected size of the gene locus in base pairs. Default: 1000
-r, --refseq Boolean values that indicates whether the BLASTDB provided uses a refseq sequence (RefSeq headers are different from the standard headers in a BLASTDB). Default: False

The BED file generated is directed to the stdout.

assemble-locus

Assemble-locus generates a CDS and protein sequence for a gene locus predicted by blast2bed. Assembly will fail if the CDS doesn't end in a stop codon. The flank option can be used to extend the last exon of the CDS and include a stop codon that may not have been captured when BLASTing the CDS of your gene of interest. The command can be used as follows:

python3 cli.py assemble-locus \
--input output.bed \
--database path/to/blastdb/ \
--locus locus_0 \
--output locus_prediction \
--flank 100 

Parameters

Parameter Description
-i, --input Path to a Bed file (generated by blast2bed)
-db, --database Path to BLASTDB used to generate the blast output used by blast2bed
-l, --locus The name of the locus to assemble.
-o, --output Base name for output files
-f, --flank The number of nucleotides to add to the last exon. Default: 0.

assemble-exons

assemble-exons generates a CDS and protein sequence for a set of exons identified by the user from the output of blast2bed. Assembly will fail if the CDS doesn't end in a stop codon. The flank option can be used to extend the last exon of the CDS and include a stop codon that may not have been captured when BLASTing the CDS of your gene of interest. The commands can be used as follows:

python3 cli.py assemble-exons \
--input output.bed \
--database path/to/blastdb/ \
--exons exon_1 exon_2 exon_3 \
--output exon_prediction \
--flank 100 

Parameters

Parameter Description
-i, --input Path to a Bed file (generated by blast2bed)
-db, --database Path to BLASTDB used to generate the blast output used for blast2bed
-e, --exons A list of exons
-o, --output Base name for output files
-f, --flank The number of nucleotides to add to the last exon. Default: 0.

Assumptions made by SABAT during Assembly

SABAT's assemble commands makes the following assumptions when predicting homologs CDS:

  1. The first exon starts with start codon (ATG).
  2. The last exon ends with a stop codon (TGA, TTA or TAG).
  3. Intermediate exons don’t have to start with ATG, or end with TGA, TTA or TAG.
  4. If intermediate exons contain a stop codon that is the end of the exon.
  5. The longest exon is the correct exon (providing all other criteria are satisfied).

About

A toolkit for identifying gene homologs in closely related species.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages