AnnotaPipeline
Integrated tool to annotate hypothetical proteins developed by Laboratório de Bioinformática at Universidade Federal de Santa Catarina (Brazil).
AnnotaPipeline was tested on Unix-based systems. We strongly recommend using Unix-based or macOS systems; AnnotaPipeline has not been tested on Windows.
If you have questions, suggestions or difficulties regarding the pipeline, please do not hesitate to contact our team here on GitHub or by email: [email protected].
`AnnotaPipeline.py` checks all required parameters in `AnnotaPipeline.yaml` before execution.
Runs AUGUSTUS for gene prediction, using these required arguments:
- strand
- genemodel
- species
- protein
- introns
- start
- stop
- cds
You can also use these optional arguments (as described in the example config file), or add other AUGUSTUS arguments under `augustus-optional`:
- hintsfile
- extrinsicCfgFile
- UTR
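A hedged sketch of how these sections might look in `AnnotaPipeline.yaml` (all values below are placeholders, not defaults; check `config_example.yaml` for the exact keys and accepted values):

```yaml
# Hypothetical sketch of the AUGUSTUS sections in AnnotaPipeline.yaml.
# Values are illustrative placeholders only.
augustus:
  strand: both
  genemodel: complete
  species: my_trained_species    # placeholder: your trained AUGUSTUS model
  protein: on
  introns: on
  start: on
  stop: on
  cds: on
augustus-optional:
  hintsfile: hints.gff           # placeholder path
  extrinsicCfgFile: extrinsic.cfg  # placeholder path
  UTR: on
```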
After gene prediction, sequences are "cleaned" based on the minimal sequence size set by `seq-cleaner` in `AnnotaPipeline.yaml`.
`.aa`, `.cdsexon`, and `.codingseq` sequences are extracted from the GFF output using `getAnnoFasta.pl` (an AUGUSTUS script). `.aa` sequences are used for subsequent analysis; `.codingseq` sequences are used for transcriptomics analysis (optional).
Similarity analysis runs through `blastp_parser.py`, which executes `blastp` with tabular output (`-outfmt 6`, parsed after the run) on the cleaned predicted proteins against SwissProt, reporting the following fields:
- qseqid
- sseqid
- sacc
- bitscore
- evalue
- ppos
- pident
- qcovs
- stitle
`evalue` and `max_target_seqs` values are given by the user.
This output is parsed to find annotations. The keyword list in `AnnotaPipeline.yaml` is used to exclude potential hypothetical annotations.
A hit is classified as a potential annotation if: (i) its description does not contain any word from the keyword list, and (ii) it passes the thresholds for identity, positivity, and coverage. All potential annotations (hits that passed all criteria) are written to `BASENAME_SwissProt_annotations.txt` for manual inspection.
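The two criteria can be sketched as follows (a minimal, hypothetical sketch, not the actual `blastp_parser.py` code; the keyword list and threshold values are placeholders, and field names mirror the `-outfmt 6` columns listed above):

```python
# Hypothetical sketch of the keyword/threshold filter applied to parsed blastp hits.
# KEYWORDS and the threshold defaults are illustrative placeholders.
KEYWORDS = ["hypothetical", "uncharacterized", "predicted"]

def is_potential_annotation(hit, min_pident=40.0, min_ppos=60.0, min_qcovs=70.0):
    """Return True if a hit passes both criteria: (i) no keyword appears in its
    description (stitle) and (ii) identity/positivity/coverage pass the thresholds."""
    title = hit["stitle"].lower()
    if any(word in title for word in KEYWORDS):
        return False
    return (hit["pident"] >= min_pident
            and hit["ppos"] >= min_ppos
            and hit["qcovs"] >= min_qcovs)

good = {"stitle": "ATP-binding cassette transporter", "pident": 85.0, "ppos": 90.0, "qcovs": 95.0}
bad = {"stitle": "hypothetical protein, conserved", "pident": 85.0, "ppos": 90.0, "qcovs": 95.0}
print(is_potential_annotation(good), is_potential_annotation(bad))  # True False
```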
Potential annotations are ranked by `bitscore` and the best hit is assigned to the corresponding protein. Proteins that are not annotated based on SwissProt are separated into `BASENAME_BLASTp_AA_SwissProted.fasta`, and `blastp` is rerun against the secondary database (selected in `AnnotaPipeline.yaml`).
Following the same process, `BASENAME_SpecifiedDB_annotations.txt` lists the potential annotations.
Annotation files:
- `BASENAME_annotated_products.txt` contains all annotations
- `BASENAME_hypothetical_products.txt` contains hypothetical proteins (those without a single hit that passes all criteria)
- `BASENAME_no_hit_products.txt` contains proteins with no hits at all (against SwissProt and the secondary database); these are treated as hypothetical in subsequent analysis
Functional analysis runs in two different ways:
- For annotated proteins:
- InterProScan is used (with configured databases) to get ontology and IPR terms
- For hypothetical proteins (which include no_hit_products):
- InterProScan, RPS-BLAST and HMMER are used to find hits of possible functions of predicted proteins
Software arguments:
Optional arguments can be given in `AnnotaPipeline.yaml` for InterProScan, HMMER, and RPS-BLAST (not tested).
- InterProScan (for annotated and hypothetical proteins) uses the `-goterms` and `-iprlookup` arguments.
- `hmmscan` runs with the `--noali` argument and user-provided values for `evalue` and `domE`. It uses the Pfam database.
- RPS-BLAST runs with the `evalue` and `max_target_seqs` arguments given in `AnnotaPipeline.yaml` and `-outfmt 6 "qseqid sseqid sacc bitscore evalue ppos pident qcovs stitle"`. It uses the CDD database.
Parsing:
- `functional_annotation_parser.py` joins the outputs of both InterProScan runs into a single file, `InterProScan_Out_BASENAME.txt`, which summarizes the output for each predicted protein. Coils, Gene3D, and MobiDBLite are structural databases and are excluded from this output.
- RPS-BLAST information for hypothetical proteins is summarized in `BASENAME_Grouped_Hypothetical_Information.txt`, since it provides long descriptions; it may help find functional hints for these proteins.
- `info_parser.py` combines InterProScan results with the parsed BLAST results to generate `All_annotated_products.txt`. This file joins gene annotation (from BLAST) with functional annotation (from InterProScan) using GO and IPR values, and is used to annotate the sequences (amino acid and nucleotide) in the FASTA and GFF files.
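The join performed by `info_parser.py` can be illustrated like this (a hypothetical sketch assuming in-memory dictionaries; the real script reads and writes files, and its field names may differ):

```python
# Hypothetical sketch of joining gene annotation (from BLAST) with functional
# annotation (InterProScan GO/IPR terms) per protein ID. Data below is illustrative.
def join_annotations(blast, ipr):
    """blast: {protein_id: product description}; ipr: {protein_id: {"GO": [...], "IPR": [...]}}."""
    joined = {}
    for prot, product in blast.items():
        terms = ipr.get(prot, {"GO": [], "IPR": []})  # proteins without IPR hits get empty lists
        joined[prot] = {"product": product, "GO": terms["GO"], "IPR": terms["IPR"]}
    return joined

blast = {"g1.t1": "ABC transporter", "g2.t1": "hypothetical protein"}
ipr = {"g1.t1": {"GO": ["GO:0005524"], "IPR": ["IPR003439"]}}
result = join_annotations(blast, ipr)
print(result["g1.t1"])  # product plus its GO and IPR terms
```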
Transcript quantification uses Kallisto on transcripts (`.codingseq` provided by AUGUSTUS) together with the RNA-seq data given by the user.
Analysis starts with `kallisto index` and is followed by `kallisto quant`.
Both methods use the `bootstrap` value provided by the user:
1. paired-end data: `l` (estimated average fragment length) and `s` (estimated standard deviation of fragment length) are optional arguments, used if given by the user in `AnnotaPipeline.yaml`
2. single-end data: `l` and `s` are REQUIRED arguments, given by the user in `AnnotaPipeline.yaml`
Parsing:
`kallisto_parser.py` removes hits whose TPM value falls below the `threshold` parameter from the `proteomics` section. Possible thresholds are:
- Median
- Mean
- A float value (user input)
The parsed output file, `BASENAME_Transcript_Quantification.tsv`, contains a simplified version of the Kallisto output (`abundance.tsv`) with target_id and TPM values.
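The three threshold options can be sketched as follows (a minimal, hypothetical sketch of the TPM filter, not the actual `kallisto_parser.py` code; the input rows stand in for `abundance.tsv` records):

```python
from statistics import mean, median

# Hypothetical sketch of thresholding Kallisto TPM values. Rows are (target_id, TPM)
# pairs standing in for abundance.tsv records; hits below the cutoff are removed.
def filter_by_tpm(rows, threshold):
    """threshold: 'median', 'mean', or a numeric value (user input)."""
    tpms = [tpm for _, tpm in rows]
    if threshold == "median":
        cutoff = median(tpms)
    elif threshold == "mean":
        cutoff = mean(tpms)
    else:
        cutoff = float(threshold)
    return [(tid, tpm) for tid, tpm in rows if tpm >= cutoff]

rows = [("g1.t1", 0.0), ("g2.t1", 5.0), ("g3.t1", 20.0)]
print(filter_by_tpm(rows, "median"))  # keeps hits at or above the median TPM (5.0)
```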
WARNING 1: Failed Comet MS/MS runs do not crash AnnotaPipeline, so check `AnnotaPipeline_Log.log` to ensure that all spectrometry files produced outputs.
WARNING 2: Before running, check that your `comet.params` file is compatible with the installed Comet version.
Proteomics analysis uses Comet MS/MS with the `comet.params` config given by the user. In this file, our script overwrites the values of the following parameters:
decoy_search = 1
output_pepxmlfile = 0
output_percolatorfile = 1
decoy_prefix = DECOY_
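The overwrite step can be sketched like this (a hypothetical illustration of rewriting `comet.params` lines; the actual pipeline script may patch the file differently):

```python
# Hypothetical sketch of forcing the four comet.params values listed above.
# Lines whose key is not in FORCED pass through unchanged.
FORCED = {
    "decoy_search": "1",
    "output_pepxmlfile": "0",
    "output_percolatorfile": "1",
    "decoy_prefix": "DECOY_",
}

def patch_comet_params(text):
    lines = []
    for line in text.splitlines():
        key = line.split("=")[0].strip()
        if key in FORCED:
            line = f"{key} = {FORCED[key]}"
        lines.append(line)
    return "\n".join(lines)

patched = patch_comet_params("decoy_search = 0\nnum_threads = 4")
print(patched)  # decoy_search forced to 1, num_threads left as-is
```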
Comet MS/MS runs with:
- The modified `comet.params`
- The annotated protein file
- The path containing mass spectrometry files (`comet-spectrometry` param) and their extension (`comet-ext` param)
  - The extension can be mzXML, mzML, Thermo raw, mgf, and ms2 variants (cms2, bms2, ms2)
- The optional arguments `first` and `last`, which, if given, overwrite the cutoff values defined in `comet.params`
Comet MS/MS outputs (files with the `.pin` extension) and raw files are moved to the current directory.
Percolator runs for each `.pin` file with default parameters.
Percolator outputs are parsed by `percolator_parser.py` using `percolator-qvalue` (from `AnnotaPipeline.yaml`). Raw files are kept. The `quantitative_proteomics` function uses the parsed Percolator files to create `BASENAME_Total_Proteomics_Quantification.tsv`.
This output quantifies (across all spectrometry outputs):
- Unique Peptide – number of unique peptides found across the entire dataset
- Total Peptide – total number of peptides found across the entire dataset
- Unique Spectrum – number of unique spectra found across the entire dataset
- Total Spectrum – total number of spectra found across the entire dataset
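The four counts above amount to unique versus total tallies over peptide/spectrum pairs (a hypothetical sketch; the real input is the parsed Percolator output, and identifiers below are made up):

```python
# Hypothetical sketch of the four counts in BASENAME_Total_Proteomics_Quantification.tsv.
# Rows are (peptide, spectrum_id) pairs standing in for parsed Percolator records.
def quantify(rows):
    peptides = [p for p, _ in rows]
    spectra = [s for _, s in rows]
    return {
        "Unique Peptide": len(set(peptides)),
        "Total Peptide": len(peptides),
        "Unique Spectrum": len(set(spectra)),
        "Total Spectrum": len(spectra),
    }

rows = [("PEPTIDEA", "scan1"), ("PEPTIDEA", "scan2"), ("PEPTIDEB", "scan2")]
counts = quantify(rows)
print(counts)  # 2 unique / 3 total peptides, 2 unique / 3 total spectra
```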
`summary_parser.py` integrates all outputs into a single file summarizing all annotations found for each protein, from:
- Prediction and similarity analysis
- Functional annotation
- Transcriptomic (if used)
- Peptide identification (if used)
AnnotaPipeline requires the following software to run properly:
- BLAST+ and RPS-BLAST (available at https://ftp.ncbi.nih.gov/blast/executables/blast+/LATEST)
- InterProScan (available at https://interproscan-docs.readthedocs.io/en/latest/HowToDownload.html)
- HMMER (available at http://hmmer.org/download.html)
You will also need to install AUGUSTUS (available at https://github.com/Gaius-Augustus/Augustus) if you want to run this pipeline starting with gene/protein prediction.
Before executing, please modify the necessary fields in the configuration file (AnnotaPipeline.yaml
).
Required:
- SwissProt (available at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz)
- Pfam (available at http://pfam.xfam.org)
- CDD (available at https://www.ncbi.nlm.nih.gov/cdd)
Choose one secondary database:
-
TrEMBL (available at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz)
-
EuPathDB (available at https://veupathdb.org/veupathdb/app):
- AmoebaDB
- CryptoDB
- FungiDB
- GiardiaDB
- HostDB
- MicrosporidiaDB
- PiroplasmaDB
- PlasmoDB
- ToxoDB (tested)
- TrichDB
- TriTrypDB (tested)
-
NCBI | NR Database (available at https://ftp.ncbi.nlm.nih.gov/blast/db)
TIP: You can use a subset of the NR Database
-
Custom Database: you can set a custom database if you provide the pattern used to extract descriptions from the FASTA headers.
Example:
Sequences from the Arabidopsis database (ArabidopsisDB Tair10) are separated by pipes (`|`), with protein descriptions in the 3rd field:
`>AT1G51370.2 | Symbols: | F-box/RNI-like/FBD-like domains-containing protein | chr1:19045615-19046748 FORWARD LENGTH=346`
It is possible to use AnnotaPipeline with Tair10DB by changing the following parameters:
- `secondary-format` set to `custom`
- `customsep` set to `"|"`
- `customcolumn` set to `2`
NOTE: `customcolumn` is set to `2` because indexing starts at 0.
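The `customsep`/`customcolumn` pair works like a simple split-and-index over the header (a hypothetical sketch, not the pipeline's parser; the header is the Tair10 example shown above):

```python
# Hypothetical sketch of extracting a description from a custom-database FASTA
# header using customsep and a 0-based customcolumn, as configured above.
def description_from_header(header, sep="|", column=2):
    fields = [f.strip() for f in header.lstrip(">").split(sep)]
    return fields[column]

header = (">AT1G51370.2 | Symbols: | F-box/RNI-like/FBD-like domains-containing "
          "protein | chr1:19045615-19046748 FORWARD LENGTH=346")
print(description_from_header(header))
# -> F-box/RNI-like/FBD-like domains-containing protein
```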
WARNING: Installation through conda/mamba requires manual download and configuration of InterProScan databases
- CDD
- Gene3D
- Hamap
- Panther
- Pfam
- Pirsf
- Pirsr
- Prints
- PrositePatterns
- PrositeProfiles
- Sfld
- Smart (unlicensed)
- Superfamily
- Tigrfam
- Coils
- MobiDBLite
- Download the `Annota_environment.yaml` file
- Create the environment
  2.1 with default conda
  `conda env create -n <desired_name> -f Annota_environment.yaml`
  2.2 with mamba (speeds up installation)
  `conda update -n base conda`
  `conda install -n base -c conda-forge mamba`
  `mamba env create -n <desired_name> -f Annota_environment.yaml`
- Activate the environment
  `conda activate <desired_name>`
- Configure InterProScan databases and `AnnotaPipeline.yaml`
  - Locate the AnnotaPipeline environment home: `echo $CONDA_PREFIX`
  - Go to `$CONDA_PREFIX/config/species`: `cd $CONDA_PREFIX/config/species`
  - Add your custom species folder (trained results)
- Clone the repository: `git clone https://github.com/bioinformatics-ufsc/AnnotaPipeline.git`
- Run `setup.py` (scripts will be available in `$PATH`): `python3 setup.py`
- Install required software:
  - BLAST+ and RPS-BLAST (available at https://ftp.ncbi.nih.gov/blast/executables/blast+/LATEST)
  - InterProScan (available at https://interproscan-docs.readthedocs.io/en/latest/HowToDownload.html)
  - HMMER (available at http://hmmer.org/download.html)
- Optional software:
- Kallisto (available at https://pachterlab.github.io/kallisto/download.html)
- Comet MS/MS (available at https://github.com/UWPR/Comet/releases/latest)
- Requires Percolator (available at https://github.com/percolator/percolator)
- Configure `AnnotaPipeline.yaml`
TIP: If you already have InterProScan installed and configured locally, use it instead of the conda installation (vanilla InterProScan).
We recommend using our example configuration file as a guide (config_example.yaml
).
AnnotaPipeline.py
can run with three different options:
AnnotaPipeline.py -c AnnotaPipeline.yaml -p protein_sequences.fasta
This is the simplest execution of AnnotaPipeline.
The annotation process will begin with the submitted `protein_sequences.fasta`, and the output will contain a simplified version of the headers.
Also, this run mode will not produce an annotated GFF output.
AnnotaPipeline.py -c AnnotaPipeline.yaml -s genomic_data.fasta
This is the complete execution of AnnotaPipeline.
It will execute gene/protein prediction based on genomic_data.fasta
utilizing AUGUSTUS and the predicted proteins will initiate the annotation process.
Given the prediction process, it is important to use a trained AUGUSTUS model for your species before executing AnnotaPipeline.
AnnotaPipeline.py -c AnnotaPipeline.yaml -p protein_sequences.fasta -gff gff_file.gff
You can execute AnnotaPipeline with this command line if you already have .aa
and .gff
files from previous AUGUSTUS predictions. The submitted .gff
needs to be in GFF3 format.
The annotation process is the same as with genomic data input; the difference is that gene prediction is skipped and the run starts with similarity analysis.
AnnotaPipeline outputs five main files (along with many others in their respective folders):
- `All_Annotated_Products.txt` contains all unique sequence identifiers and their respective annotations (with functional annotations, when present).
- `Annota_BASENAME.fasta` contains all sequences and their annotations (with functional annotations, when present) in FASTA format.
- `BASENAME_Annotated_GFF.gff` contains all sequences and their annotations (with functional annotations, when present) in GFF3 format. This file is absent in the protein-file-as-input run mode.
- `AnnotaPipeline_BASENAME_transcripts.fasta` contains nucleotide sequences for the predicted proteins, with the same features present in the protein file.
- `AnnotaPipeline_BASENAME_Summary.tsv` summarizes hits for each protein in the similarity, functional, transcriptomics (if used), and proteomics (if used) analyses.
Raw outputs are listed inside output folders:
- `1_GenePrediction_BASENAME` – AUGUSTUS files
- `2_SimilarityAnalysis_BASENAME` – BLASTp analysis
- `3_FunctionalAnnotation_BASENAME` – InterProScan/HMMER/RPS-BLAST analysis
- `4_TranscriptQuantification_BASENAME` – Kallisto analysis
- `5_PeptideIdentification_BASENAME` – Comet MS/MS and Percolator analysis
The output folders and files will be located in the same folder where you executed the pipeline.
- Output files for Organisms present in AnnotaPipeline publication are available at: http://150.162.6.129/Annotafiles/
If you used AnnotaPipeline in your research, please cite us:
AnnotaPipeline: An integrated tool to annotate eukaryotic proteins using multi-omics data
...And the following papers:
- AUGUSTUS: Stanke M. et al., 2003
- BLAST+: Camacho C. et al., 2008
- HMMER: follow HMMER user guide
- InterProScan: Jones et al., 2014
If you used Transcriptomics module, please also cite:
- Kallisto: Bray, N. L. et al., 2016
If you used Proteomics module, please also cite:
- COMET MS/MS: Eng, J. K., et al., 2012
- Percolator: The, M. et al., 2016