BaseVarC - SNPs Calling From Low-Pass (<1.0x) WGS Data

Current Version: 1.0.0

BaseVarC is implemented in C++ to speed up variant calling from large-scale population data. It was used in the CMDB project to call variants from one million samples.

Installation

git clone --recursive https://github.com/Zilong-Li/BaseVarC.git
cd BaseVarC
./configure
make

If the build succeeds, you can find the BaseVarC program in the src directory.

Command-Line

BaseVarC
Contact: Zilong Li [[email protected]]
Usage  : BaseVarC <command> [options]

Commands:
         basetype       Variants Caller
         popmatrix      Create population matrix
         concat         Concat popmatrix

Variants Calling

Commands: BaseVarC basetype
Usage   : BaseVarC basetype [options]

Options :
  --input,      -i         BAM/CRAM file list, one file per row
  --output,     -o         Output file prefix
  --reference,  -r         Reference file
  --region,     -s         Samtools-like region <chr:start-end>
  --group,      -g         Population group information <SampleID Group>
  --mapq,       -q <INT>   Mapping quality >= INT [10]
  --thread,     -t <INT>   Number of threads
  --batch,      -b <INT>   Number of samples each batch
  --maf,        -a <FLOAT> Minimum allele count frequency [min(0.001, 100/N, maf)]
  --load,                  Load data only
  --rerun,                 Read previous loaded data and rerun
  --keep_tmp,              Don't remove tmp files when basetype finished
  --verbose,    -v         Set verbose output
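As a minimal sketch of how the options above fit together, the helper functions below build the file list that --input expects and print a basetype call. The function names, the directory layout, ref.fa, the chr20 region and the thread and batch values are all hypothetical placeholders, and the help text does not say which options are mandatory, so treat this as a starting point rather than a reference invocation.

```shell
#!/bin/sh
# Sketch only: function names, paths, region and numeric values are
# hypothetical placeholders, not part of BaseVarC itself.

# Write one BAM path per row, the layout that --input expects.
make_bam_list() {
    # $1 = directory holding the alignments, $2 = list file to create
    ls "$1"/*.bam > "$2"
}

# Print a basetype call assembled from options listed in the help text.
basetype_cmd() {
    # $1 = file list, $2 = reference, $3 = region <chr:start-end>, $4 = output prefix
    echo "BaseVarC basetype --input $1 --reference $2 --region $3 --output $4 --thread 4 --batch 100"
}

# Show the command for one hypothetical region without running it.
basetype_cmd bam.list ref.fa chr20:1-1000000 out/chr20_1_1000000
```

Printing the command first makes it easy to inspect before anything is run. Once the paths point at real data, the printed line can be pasted into a terminal or piped to sh.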

Testing

The test directory contains a script that runs an example on the test data.

cd test/
sh test.sh

Note on performance

RAM usage, run time and I/O depend mainly on three parameters: --region, --thread and --batch. Tune them to make the best use of your HPC servers.

--batch : BaseVarC converts reads from the BAM files into an internal temporary format. This parameter controls how many samples are bundled into one batch. RAM usage grows linearly with it: a larger value means more RAM but fewer file pointers (I/O).
--region: The longer the genomic region, the more RAM is used. Reading the BAM files repeatedly adds overhead, so split each chromosome into regions that are as long as possible.
--thread: The number of threads to use. RAM usage and I/O grow linearly with the number of threads. More threads make BaseVarC faster.
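The trade-offs above can be explored with a little arithmetic before launching a job. The sketch below splits a chromosome into fixed-size regions in the <chr:start-end> form that --region takes, and counts how many batches a given --batch value produces. The helper names, the chromosome length and the window size are hypothetical; pick values that fit your own reference and memory budget.

```shell
#!/bin/sh
# Sketch only: helper names and the example numbers are hypothetical.

# Print consecutive regions of at most $3 bases covering 1..$2 on chromosome $1.
split_regions() {
    sr_start=1
    while [ "$sr_start" -le "$2" ]; do
        sr_end=$((sr_start + $3 - 1))
        # Clip the last window to the chromosome length.
        if [ "$sr_end" -gt "$2" ]; then
            sr_end=$2
        fi
        echo "$1:$sr_start-$sr_end"
        sr_start=$((sr_end + 1))
    done
}

# Number of batches for $1 samples at $2 samples per batch (ceiling division).
count_batches() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

split_regions chr20 2500000 1000000
# prints:
# chr20:1-1000000
# chr20:1000001-2000000
# chr20:2000001-2500000

count_batches 1000 300
# prints: 4
```

Longer windows mean fewer passes over the BAM files at the cost of more RAM per region, and a larger --batch value means fewer batches at the cost of more RAM per batch, so both numbers are worth checking against the memory available on each node.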

License

BaseVarC and the code in this repo are available under the GPL3 license. For more information, please see the LICENSE file.

Citation

TBD
