Structural interpretation of Killer cell immunoglobulin-like Receptors (KIR) haplotypes from raw short or long read sequences. It predicts the presence/absence of 16 KIR genes and then uses that to predict pairs of structural (gene-content and order) haplotypes.

KPI

main.nf makes the predictions.

Dependencies

Install Java, Groovy, Nextflow, Docker, and Git. Create accounts on GitHub and Docker Hub. Add 'docker.enabled = true' and 'docker.fixOwnership = true' to your Nextflow configuration (e.g., $HOME/.nextflow/config). Make sure Docker is running and you are logged in to Docker Hub.
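
As a minimal sketch of the configuration step above, the following shell function appends the two Docker settings to a Nextflow configuration file only if they are not already there. The function name enable_nextflow_docker and its optional path argument are hypothetical helpers for illustration, not part of KPI; the two setting lines are the ones quoted above.

```shell
# Append the two Docker settings to a Nextflow config file, skipping any
# setting that is already present so the function can be re-run safely.
# The function name and its argument are illustrative, not part of KPI.
enable_nextflow_docker() {
  nf_config="${1:-$HOME/.nextflow/config}"
  mkdir -p "$(dirname "$nf_config")"
  touch "$nf_config"
  # If the file does not end with a newline, add one before appending.
  if [ -s "$nf_config" ] && [ -n "$(tail -c 1 "$nf_config")" ]; then
    printf '\n' >> "$nf_config"
  fi
  for nf_setting in 'docker.enabled = true' 'docker.fixOwnership = true'; do
    grep -qxF -- "$nf_setting" "$nf_config" || printf '%s\n' "$nf_setting" >> "$nf_config"
  done
}

# Usage (assumed default location):
#   enable_nextflow_docker "$HOME/.nextflow/config"
```

The check is a whole-line match, so a differently spaced or commented-out copy of a setting would not count as present.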

Running

Input
There are two input options.
1. An ID along with a folder of fasta or fastq files, optionally gzipped. (--raw and --id)
2. A two-column text file, where the first column is an ID and the second column is a path to a fasta or fastq file (--map). Each ID may have multiple rows. The paths to the files may be absolute or relative, but the files must be in the same directory as the map file or under it. If using relative paths, the paths must start with the _parent_ folder of the map file.

Option 1 is more efficient with respect to disk space.
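
The map file for option 2 could be generated along these lines. This is only a sketch: the tab delimiter, the function name write_map_rows, and the file-name patterns are assumptions (the text above says only "two-column text file"), and absolute paths are used to sidestep the relative-path rule.

```shell
# Print one "ID<TAB>absolute path" row for every FASTA/FASTQ file found
# in or under a directory. The tab delimiter, the function name, and the
# file-name patterns are assumptions made for this sketch.
write_map_rows() {
  map_id="$1"
  map_dir=$(cd "$2" >/dev/null && pwd) || return 1   # resolve to an absolute path
  find "$map_dir" -type f \( \
      -name '*.fa' -o -name '*.fa.gz' -o -name '*.fasta' -o -name '*.fasta.gz' -o \
      -name '*.fq' -o -name '*.fq.gz' -o -name '*.fastq' -o -name '*.fastq.gz' \) |
    sort |
    while IFS= read -r map_path; do
      printf '%s\t%s\n' "$map_id" "$map_path"
    done
}

# Usage: write the rows into a map file stored alongside the data, e.g.
#   write_map_rows id1 ~/input > ~/input/idstoRaw.txt
```

Writing the map file into the data directory itself keeps the listed files "in the same directory as the map file or under it". To cover several IDs, append the output of one call per ID to the same file.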

Output
For each input ID, an output text file is created whose name ends in '_prediction.txt'. Each ID's output file contains a header line followed by a second line with the haplotype pair predictions and gene predictions.
The two haplotypes within a pair are separated by '+'. If the prediction is ambiguous, the alternative pairs are separated by '|'. For example,
'cA01~tA01+cA01~tB01|cA01~tA01+cB05~tB01|cA01~tB01+cB05~tA01' means the haplotype pair is
'cA01~tA01 and cA01~tB01' or 'cA01~tA01 and cB05~tB01' or 'cA01~tB01 and cB05~tA01'.
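
The separator rules above can be illustrated with a small shell function that expands a haplotype-pair prediction string into one line per candidate pair. The function name list_haplotype_pairs is hypothetical, and the sketch handles only the haplotype string itself (written here with an ASCII tilde), not the rest of the prediction file, whose column layout is not described here.

```shell
# Split a haplotype-pair prediction into one readable line per candidate
# pair. '|' separates alternative pairs; '+' separates the two haplotypes
# within a pair. The function name is illustrative, not part of KPI.
list_haplotype_pairs() {
  printf '%s\n' "$1" | tr '|' '\n' |
    while IFS='+' read -r hap_a hap_b; do
      printf '%s and %s\n' "$hap_a" "$hap_b"
    done
}

list_haplotype_pairs 'cA01~tA01+cA01~tB01|cA01~tA01+cB05~tB01|cA01~tB01+cB05~tA01'
# prints:
#   cA01~tA01 and cA01~tB01
#   cA01~tA01 and cB05~tB01
#   cA01~tB01 and cB05~tA01
```

An unambiguous prediction contains no '|' and therefore yields a single line.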

The reference haplotypes are defined at https://github.com/droeatumn/kpi/blob/master/input/haps.txt

Running
Use 'raw' to indicate the input directory, and 'output' to indicate the directory to put the output. The defaults are 'raw' and 'output' under the location where KPI was pulled.
Use 'filetype' to indicate the input type; the default is 'fq' (FASTQ).
f<a/q/m/bam/kmc> - input in FASTA format (fa), FASTQ format (fq), multi-FASTA (fm), BAM (fbam), or KMC (fkmc); default: FASTQ

Option 1: Provide an ID (--id) and a folder (--raw) with its raw data
./main.nf --id ID --raw inDir --output outDir --filetype fq
e.g., ./main.nf --id id1 --raw ~/input --output ~/output

Option 2: Provide a file with a map (--map) from IDs to their raw data
./main.nf --map mapFile.txt --output outDir --filetype fq
e.g., ./main.nf --map ~/input/idstoRaw.txt --output ~/output
In this example, the paths to the files listed in idstoRaw.txt are somewhere under ~/input/.

Example using data in the image, so no input is required.
Example 1: cA01~tA01+cB01~tB01 with --raw.
Run the following command for an example of interpreting synthetic reads created from sequences with GenBank IDs KP420439 and KP420440 (https://www.ncbi.nlm.nih.gov/nuccore/KP420439 and https://www.ncbi.nlm.nih.gov/nuccore/KP420440). These two haplotypes contain all the genes except KIR2DS5, so the haplotype predictions are very ambiguous.

./main.nf --id ex1 --raw ~/git/kpi/input/example1 --output ~/output

To run another example, replace 'example1' with 'example2'.

Example 2: cA01~tA01+cA01~tB01 with --map and --id.
Run the following command for an example of interpreting synthetic reads created from sequences with GenBank IDs KP420439 and KU645197 (https://www.ncbi.nlm.nih.gov/nuccore/KP420439 and https://www.ncbi.nlm.nih.gov/nuccore/KU645197).

./main.nf --id ex2 --map ~/git/kpi/input/example2/example2.txt --output ~/output

To run another example, replace 'example2' with 'example1'.

Example 3: combine Example 1 and 2 with --map and --id.
./main.nf --id ex12 --map ~/git/kpi/input/example1-2.txt --output ~/output

Miscellaneous
Hardware
For targeted sequencing, kpi requires approximately 4 CPUs, 8 GB of RAM, and 20 GB of disk space. For WGS, it requires around 13 CPUs, 16 GB of RAM in total, and 100 GB of temporary disk space.

Raw data
The software assumes average coverage for both chromosomes is less than 255. If this is not the case for your data, please downsample before running. Support for high coverage data is a future enhancement.
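
As a rough illustration of the downsampling arithmetic, the fraction of reads to keep is the target coverage divided by the current average coverage. The function below is a sketch under stated assumptions: downsample_fraction is a hypothetical helper, both coverage values are estimates you supply, and the target (any value safely below 255) is your choice rather than a KPI option.

```shell
# Compute the fraction of reads to keep so that the average coverage
# drops to a chosen target. Both arguments are numbers you supply; the
# function name and the example target are assumptions, not KPI options.
downsample_fraction() {
  awk -v cov="$1" -v target="$2" 'BEGIN {
    if (cov <= target) { print "1.00"; exit }
    # Round down to two decimals so the result never overshoots the target.
    printf "%.2f\n", int(target / cov * 100) / 100
  }'
}

# Usage: current coverage 400, target coverage 200
#   downsample_fraction 400 200    # prints 0.50
```

The resulting fraction can then be passed to whichever read-subsampling tool you already use.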

Containers
To run without a container, use the --nocontainer parameter. To use a container other than the default (droeatumn/kpi:latest), use the --container parameter.

To run in a self-contained environment with the --id parameter, use the following command. Replace 'inDir' and 'outDir' with your input and output directories, and supply your ID after --id.
docker run --rm -it -v inDir:/opt/kpi/raw/ -v outDir:/opt/kpi/output/ droeatumn/kpi:latest /opt/kpi/main.nf --id
Or
docker run --rm -it -v $PWD/output:/opt/kpi/output droeatumn/kpi:latest /opt/kpi/main.nf --id ex1 --raw /opt/kpi/input/example1/
Or
docker run --rm -it -v $PWD/output:/opt/kpi/output droeatumn/kpi:latest /opt/kpi/main.nf --map /opt/kpi/input/example1/example1.txt
Or, if your BAM file (for one individual) is stored locally in ~/data
docker run --rm -it -v ~/data:/opt/kpi/raw -v $PWD/output:/opt/kpi/output droeatumn/kpi:latest /opt/kpi/main.nf --filetype fbam --id testid
Or, if a map to the BAM files is stored locally within ~/data
docker run --rm -it -v ~/data:/opt/kpi/raw -v $PWD/output:/opt/kpi/output droeatumn/kpi:latest /opt/kpi/main.nf --filetype fbam --map /opt/kpi/raw/map.txt

Citation
Roe D, Kuang R. Accurate and Efficient KIR Gene and Haplotype Inference From Genome Sequencing Reads With Novel K-mer Signatures. Front Immunol (2020) 11:583013. (https://doi.org/10.3389/fimmu.2020.583013)
