jorgeavilacartes/basecalling

RNA Basecaller for ONT data

Create and activate environment

micromamba env create -n basecalling-cuda117 -f envs/basecalling_cuda11.7_pytorch2.yml
micromamba activate basecalling-cuda117

Any of conda, miniconda, mamba, or micromamba can be used.

Training

for testing with small datasets

python feito/train.py --path-train data/subsample_train.hdf5 --path-val data/subsample_val.hdf5 --model Rodan --epochs 5 --batch-size 16
python3 feito/train.py --path-train data/RODAN/train/rna-train.hdf5 --path-val data/RODAN/train/rna-valid.hdf5 --epochs 30 --batch-size 16 --num-workers 4 --model SimpleNet --device cuda

with RODAN's dataset

python feito/train.py --path-train data/RODAN/train/rna-train.hdf5 --path-val data/RODAN/train/rna-valid.hdf5 --model Rodan --epochs 20 --batch-size 64 --device cuda

Testing

  • This test assumes that the testing dataset is in the same format as the training and validation datasets (`hdf5` format), i.e. you have split reads with their ground truths.
  • For experimental purposes use /extdata/RODAN/train/rna-test.hdf5.

RODAN with small dataset

python feito/test.py --path-test data/subsample_val.hdf5 --batch-size 16 --model Rodan --device cpu --path-checkpoint output/training/checkpoints/Rodan-epoch5.pt --path-fasta output/test/basecalled_signals.fa --rna true --use-viterbi true

SimpleNet with small dataset

python feito/test.py --path-test data/subsample_val.hdf5 --batch-size 16 --model SimpleNet --device cpu --path-checkpoint output/training/checkpoints/SimpleNet-epoch1.pt --path-fasta output/test/basecalled_signals_SimpleNet.fa --rna true --use-viterbi true

Basecalling

  • This assumes you have a trained model, and a set of reads in fast5 format.
  • Reads are split by the dataloader into non-overlapping signals whose length equals the input length of the model (currently this must be provided as a parameter, although it should not be needed; FIXME), and an index is created that maps each portion of basecalled signal back to its portion of the read.

python feito/basecall.py --path-fast5 data/RODAN/test/mouse-dataset/0 --len-subsignals 4096 --path-index output/basecalling/simplenet-index.csv --batch-size 16 --model SimpleNet --device cpu --path-checkpoint output/training/checkpoints/SimpleNet-epoch30.pt --path-fasta output/basecalling/simplenet-basecalled_reads.fa --path-reads output/basecalling/simplenet-basecalled_reads.fa
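
The splitting and indexing described above can be sketched as follows. This is only an illustration, not the repository's dataloader: the function name `split_signal`, the index columns, and right-padding the last chunk with zeros are all assumptions.

```python
def split_signal(read_id, signal, len_subsignal=4096, pad_value=0.0):
    """Split one raw signal into non-overlapping chunks of fixed length.

    Returns (chunks, index). Each index row is
    (read_id, chunk_number, start, end), so a basecalled chunk can be
    traced back to its portion of the read.
    Assumption: the last chunk is right-padded with pad_value.
    """
    chunks, index = [], []
    for chunk_number, start in enumerate(range(0, len(signal), len_subsignal)):
        end = min(start + len_subsignal, len(signal))
        chunk = list(signal[start:end])
        # pad the final chunk so every chunk has the model's input length
        chunk.extend([pad_value] * (len_subsignal - len(chunk)))
        chunks.append(chunk)
        index.append((read_id, chunk_number, start, end))
    return chunks, index


# hypothetical read with 10000 samples
chunks, index = split_signal("read_0", [0.1] * 10000)
for row in index:
    print(row)
```

Storing `start` and `end` in the index keeps the true length of the last chunk, so the padding can be ignored later if needed.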

Reconstruct full reads from basecalled signals

Since raw signals need to be split into chunks of a fixed length, each read is basecalled in portions. For this reason, an index of the basecalled portions is built during the previous step. Now we take those portions plus the index and reconstruct each read by concatenating the portions in the right order.
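
A minimal sketch of this reconstruction step, assuming the index provides a read identifier and a chunk number for every basecalled portion. The function name `reconstruct_reads` and the index layout are hypothetical; the actual CSV columns written by the basecalling step may differ.

```python
from collections import defaultdict


def reconstruct_reads(index_rows, basecalled_portions):
    """Concatenate basecalled portions into full reads.

    index_rows: list of (read_id, chunk_number), one per portion,
                in the same order as basecalled_portions (assumed layout).
    basecalled_portions: list of basecalled strings.
    """
    portions = defaultdict(list)
    for (read_id, chunk_number), sequence in zip(index_rows, basecalled_portions):
        portions[read_id].append((chunk_number, sequence))
    # sort the portions of each read by chunk number, then join them
    return {
        read_id: "".join(seq for _, seq in sorted(parts, key=lambda p: p[0]))
        for read_id, parts in portions.items()
    }


# portions may arrive in any order, e.g. after batching
index_rows = [("r1", 1), ("r2", 0), ("r1", 0)]
basecalled_portions = ["GGU", "ACG", "AC"]
reads = reconstruct_reads(index_rows, basecalled_portions)
print(reads)
```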


Mapping reads with minimap2

OPTIONAL

Install minimap2 in a conda environment

micromamba env create -n map-reads -f envs/minimap2.yml
micromamba activate map-reads

map reads to transcriptome

transcriptome="/projects5/basecalling-jorge/basecalling/data/RODAN/test/transcriptomes/mouse_reference.fasta"
reads="/projects5/basecalling-jorge/basecalling/output-old/basecalling/simplenet-basecalled_reads.fa"
samfile="output-old/basecalling/mapped_reads.sam"
minimap2 --secondary=no -ax map-ont -t 32 --cs $transcriptome $reads > $samfile

samtools

sort mapped reads

bamfile="output-old/basecalling/mapped_reads.bam"
samtools view -bS $samfile | samtools sort > $bamfile

indexing

samtools index $bamfile 

visualize alignment

samtools view $bamfile | less -S

check statistics of mapped/unmapped reads

samtools flagstat $bamfile

TODO list

  • Callbacks:
    • Checkpoint: save best model
    • Early stopping
  • Test model: compute accuracy of basecalled reads
    • use Viterbi (and/or beam search) to generate reads from the model output
    • align each basecalled read against its ground truth with Smith-Waterman
  • Create own datasets from raw signals and a reference
  • New architecture for RNA, consider sampling rate

Info

Basecalling

To map the output of the model to an RNA sequence, decode the output of the neural network with beam search: https://github.com/nanoporetech/fast-ctc-decode
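
To illustrate what decoding does, here is a sketch of greedy (best-path) CTC decoding in plain Python. It is a simpler technique than beam search and is not the fast-ctc-decode implementation. The alphabet `"NACGU"` with the blank symbol at index 0 is an assumption.

```python
def greedy_ctc_decode(posteriors, alphabet="NACGU"):
    """Greedy CTC decoding: pick the best label per time step,
    collapse consecutive repeats, then drop blanks.

    posteriors: list of per-time-step score lists, one score per symbol.
    Assumption: alphabet[0] is the CTC blank symbol.
    """
    best_path = [max(range(len(step)), key=step.__getitem__) for step in posteriors]
    decoded = []
    previous = None
    for label in best_path:
        # emit a base only when the label changes and is not the blank
        if label != previous and label != 0:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)


def one_hot(label, size=5):
    """Helper for the toy example: a score vector with a single 1.0."""
    return [1.0 if i == label else 0.0 for i in range(size)]


# best path: A A blank A C C blank
toy_posteriors = [one_hot(k) for k in (1, 1, 0, 1, 2, 2, 0)]
print(greedy_ctc_decode(toy_posteriors))
```

The blank between the second and third `A` is what lets CTC output two consecutive identical bases. Beam search keeps several candidate sequences per step instead of a single best path.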

Computation of accuracy

To compare the basecalled read against the ground truth read, use the Smith-Waterman local alignment algorithm.
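
A minimal pure-Python sketch of Smith-Waterman with a linear gap penalty, to show how an accuracy value could be derived from the alignment. The scoring values and the definition of accuracy as matches divided by alignment length are assumptions, not necessarily what the repository uses.

```python
def smith_waterman(query, reference, match=2, mismatch=-1, gap=-2):
    """Return (best_score, matches, alignment_length) for the best
    local alignment between query and reference (linear gap penalty)."""
    rows, cols = len(query) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i, j):
        return match if query[i - 1] == reference[j - 1] else mismatch

    # fill the dynamic programming matrix
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # trace back from the best cell until a zero is reached
    i, j = best_pos
    matches = length = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            matches += int(query[i - 1] == reference[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return best, matches, length


best_score, matches, length = smith_waterman("ACGU", "ACCU")
accuracy = matches / length if length else 0.0
print(best_score, matches, length, accuracy)
```

For full-length reads this quadratic implementation is slow; it is meant only to make the computation concrete.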

Connect to a GPU in the server

qrsh -l gpu_mem=8G

Steps to basecall ONT signals

  1. Generate dataset for training a supervised model
    • Split raw signals in chunks of a fixed size (RODAN uses 4096-long signals)
    • Basecall them to obtain a ground truth (RODAN basecaller)

TO CONSIDER

How do they influence the architectures?

       sampling rate [samples/sec]   [bp/sec]   [samples/bp]
DNA    4000                          450        8.89
RNA    3012                          70         43.03
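
The last column of the table above is the sampling rate divided by the speed in bases per second. The short sketch below reproduces it and also estimates how many bases a 4096-sample chunk covers on average. The function name is illustrative, and the chunk estimate assumes a constant translocation speed.

```python
def samples_per_base(sampling_rate, bases_per_second):
    """Average number of signal samples recorded per base."""
    return sampling_rate / bases_per_second


CHUNK_LENGTH = 4096  # chunk size used by RODAN, as noted above

for molecule, rate, speed in [("DNA", 4000, 450), ("RNA", 3012, 70)]:
    spb = samples_per_base(rate, speed)
    bases_per_chunk = CHUNK_LENGTH / spb
    print(f"{molecule}: {spb:.2f} samples/bp, about {bases_per_chunk:.0f} bases per chunk")
```

Under these assumptions a chunk of the same length spans far fewer bases for RNA than for DNA, which is one way the sampling rate could influence the architecture.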

Paths to RODAN's dataset on the compbio server

  • /extdata/RODAN/train/rna-train.hdf5
  • /extdata/RODAN/train/rna-test.hdf5
  • /extdata/RODAN/test

Directory where I am working on compbio:/projects5/basecalling-jorge/basecalling


Folder structure

basecalling
├── feito: source code
├── envs: yaml files with different environments that can be installed with conda/miniconda/mamba/micromamba (different versions of pytorch and cuda)
├── data: store data here
├── notebooks: jupyter notebooks to test code
├── output: results of training
├── params.yml: input parameters for training with DVC
└── README.md

Source code

feito
├── api: trainer, tester, basecaller APIs
├── callbacks: functions to be run after each epoch in the training
├── dataloaders: classes to be used with DataLoader from pytorch
├── loss_functions: variants of CTCLoss
├── models: architectures and custom layers
├── utils: accuracy and others
├── feito.py: custom pipeline to basecall reads from fast5 files
├── trainer.py: custom pipeline to train a basecaller
└── tester.py: custom pipeline to test a basecaller

About

An attempt at a basecaller for ONT RNA signals
