-
Notifications
You must be signed in to change notification settings - Fork 0
/
02_quantification.Rmd
1258 lines (967 loc) · 51.3 KB
/
02_quantification.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "E07 - RNA-seq / 02 - quantification (Analysis of Gene Expression @ UCT Prague)"
author:
- Jiri Novotny [email protected]
- Studuj bioinformatiku! http://studuj.bioinformatiku.cz
institute: "Laboratory of Genomics and Bioinformatics @ Institute of Molecular Genetics of the ASCR"
output:
rmdformats::readthedown:
highlight: "kate"
lightbox: true
thumbnails: true
gallery: true
toc_depth: 4
self_contained: true
number_sections: false
toc_collapsed: false
df_print: "paged"
date: "`r Sys.Date()`"
---
```{r, child = here::here("_assets/custom.Rmd"), eval = TRUE}
```
***
If you have created `tmux` session (named `rnaseq`) in the previous exercise, you can attach to it with
```{bash}
tmux attach -t rnaseq
```
Copy, please, these files and directories to your personal directory:
```{bash}
cp -r ~/shared/AGE_current/Exercises/E07-RNA_seq/02_quantification ~/AGE/Exercises/E07-RNA_seq
cp -r ~/shared/AGE_current/Exercises/E07-RNA_seq/data/{sample_sheet_airway.csv,experiment_trimmed_filtered,genome,gsnap_index_hg37_chr20,salmon_index_hg37_chr20,rna_sequences,transcriptome} ~/AGE/Exercises/E07-RNA_seq/data
cp -r ~/shared/AGE_current/Exercises/E07-RNA_seq/bin ~
chmod +x ~/bin/*
```
And change the working directory in RStudio terminal to the current exercise's directory:
```{bash}
cd ~/AGE/Exercises/E07-RNA_seq/02_quantification
```
We also need to install additional software.
We reuse the `conda` environment `tools` created in the previous exercise:
```{bash}
conda deactivate
conda activate tools
conda install -c bioconda preseq subread salmon rseqc
```
***
# Assignment
Your assignment will be quite similar to E06-IGV:
- Using the `dds` object, find at least three interesting genes,
i.e. they should be expressed at all (a mean number of counts across samples at least greater than 100)
and possibly their expression should differ between conditions (but you don't have to calculate the differential expression).
- Filter the BAM files so they contain reads assigned to those genes.
- Also filter the GTF file, so only features (rows) associated with those genes will remain.
- Import the files to IGV, so there will be:
- A reference sequence. To make it easier, use the one IGV is offering.
- A track with annotation.
- Coverage tracks for the samples.
- Make some screenshots of IGV `r emo::ji("slightly_smiling_face")`
- Save the IGV session.
Make sure you have checked the option "Use relative paths in session files" in `View -> Preferences -> General`.
This way I can open your IGV session and all needed files located in the same directory will be loaded.
Please, return the assignment in this form:
- Copy the files to a directory named `E07-RNA_seq_02_quantification_assignment_<your_name>`:
`bash` script for filtering the BAM and GTF files (also place here the chosen gene names as a comment),
filtered BAM and GTF files, IGV session and screenshots.
Put this directory to a ZIP archive named `E07-RNA_seq_02_quantification_assignment_<your_name>.zip`.
- Upload the ZIP file to the appropriate assignment in MS Teams.
***
# Introduction
In this tutorial we will show how to:
- Filter out tRNA and rRNA reads using `SortMeRNA` tool.
- Quantify gene expression to obtain a count matrix, including quality control of this process.
- Import the resulting count matrix to R.
Generally, there are two ways how to do quantification (see the figure below):
- **Aligner pipeline**: Align reads to reference genome to obtain BAM file and count reads overlapping a feature (e.g. exon).
- We will use [GSNAP](http://research-pub.gene.com/gmap/) for alignment and `featureCounts` tool from
[subread](http://subread.sourceforge.net/) suite for counting.
We will use two versions of `featureCounts`: command-line and from within R through
[Rsubread](http://bioconductor.org/packages/release/bioc/html/Rsubread.html) package.
- **Pseudoaligner pipeline**: Map reads to reference transcriptome (pseudoalignment) and directly obtain a transcript-level count matrix (transcript abundance).
- We will use [Salmon](https://combine-lab.github.io/salmon/) for this and later import the count matrix to R.
![Two different pipelines for quantification. Both finish with a count matrix.](`r here("E07-RNA_seq/02_quantification/_rmd_images/quantification_pipelines.png")`)
We will continue with trimmed data from the previous exercise.
***
# Config
Here we will set paths to files and directories.
This is copypasted from the previous exercise.
```{bash}
# This is a root directory for data.
BASE_DATA_DIR="../data"
# Here you will find experiment data.
EXPERIMENT_DATA_DIR="$BASE_DATA_DIR/experiment"
# A directory to store trimmed data.
TRIMMED_DATA_DIR="$BASE_DATA_DIR/experiment_trimmed"
# A directory for MultiQC output. Will be created automatically by MultiQC.
MULTIQC_DIR="multiqc"
# A directory for temporary files.
TMP_DIR="tmp"
mkdir $TMP_DIR
SAMPLE_NAMES=("SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513")
```
But here are new variables with paths to reference files and directories for `SortMeRNA`.
```{bash}
N_THREADS=2
### Reference genome and its annotation files.
# Reference sequence of human chromosome 20 in gzipped/decompressed FASTA format.
HG37_GENOME="$BASE_DATA_DIR/genome/Homo_sapiens.GRCh37.75.dna.chromosome.20.fa.gz"
HG37_GENOME_D="$BASE_DATA_DIR/genome/Homo_sapiens.GRCh37.75.dna.chromosome.20.fa"
# Annotation for human reference genome in gzipped GTF format.
HG37_ANN_GZ="$BASE_DATA_DIR/genome/Homo_sapiens.GRCh37.75.gtf.gz"
HG37_ANN="$BASE_DATA_DIR/genome/Homo_sapiens.GRCh37.75.gtf"
# Human reference transcriptome.
HG37_TRANS="$BASE_DATA_DIR/transcriptome/Homo_sapiens.GRCh37.75.cdna.all.fa.gz"
### SortMeRNA
# Reference sequences of human rRNA and tRNA.
SORTMERNA_REFERENCE_DIR="$BASE_DATA_DIR/rna_sequences"
HUMAN_RRNA_FASTA="human_rRNA.fa"
HUMAN_TRNA_FASTA="human_tRNA.fa"
# Directory for SortMeRNA index of our reference sequences.
SORTMERNA_INDEX_DIR="sortmerna_index"
mkdir $SORTMERNA_INDEX_DIR
# Directory for logs from SortMeRNA which we later parse with MultiQC.
SORTMERNA_LOG_DIR="sortmerna_log"
mkdir $SORTMERNA_LOG_DIR
# Directory for temporary files which will be later removed.
SORTMERNA_TMP_DIR="sortmerna_tmp"
mkdir $SORTMERNA_TMP_DIR
# Directory to store trimmed and rRNA/tRNA filtered data.
TRIMMED_FILTERED_DATA_DIR="$BASE_DATA_DIR/experiment_trimmed_filtered"
mkdir $TRIMMED_FILTERED_DATA_DIR
### Paths for GSNAP
# Directory to store index of reference genome.
GSNAP_INDEX_DIR="$BASE_DATA_DIR/gsnap_index_hg37_chr20"
# Directory to store alignments.
GSNAP_OUT_DIR="gsnap_out"
mkdir $GSNAP_OUT_DIR
### RSeQC
HG37_ANN_BED="$BASE_DATA_DIR/genome/hg19_ensembl_gene.bed"
HG37_ANN_BED_FIX="$BASE_DATA_DIR/genome/hg19_ensembl_gene_fix.bed"
RSEQC_OUT_DIR="rseqc_out"
mkdir $RSEQC_OUT_DIR
### preseq
PRESEQ_OUT_DIR="preseq_out"
mkdir $PRESEQ_OUT_DIR
### featureCounts
# Directory to store biotype counts.
FC_OUT="featurecounts_biotypes_out"
mkdir $FC_OUT
### Salmon
SALMON_INDEX_DIR="$BASE_DATA_DIR/salmon_index_hg37_chr20"
SALMON_OUT_DIR="salmon_out"
```
***
# Reference files
Before you do any NGS analysis, you need to get reference files, such as genome, transcriptome,
annotation, etc. The main (but not only) sources for most of the organisms are:
- [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) (USA - NCBI) -
you can read more about its annotation pipeline [here](https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/).
- [UCSC](https://hgdownload.soe.ucsc.edu/downloads.html)
- Based on [GENCODE](https://www.gencodegenes.org/) (which is essentially identical to the Ensembl annotation),
[UniProt](https://www.uniprot.org/), and mRNA data from [GenBank](https://www.ncbi.nlm.nih.gov/genbank/).
You can find FAQ about the annotation [here](https://genome.ucsc.edu/FAQ/FAQgenes.html).
- [Ensembl](http://ftp.ensembl.org/pub/) (EU - EMBL) - FTP site
- Based on [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/), [UniProt](https://www.uniprot.org/),
and [Havana](https://www.ensembl.org/info/genome/genebuild/manual_havana.html) manual annotation.
You can read more about Ensembl annotation [here](http://www.ensembl.org/info/genome/genebuild/index.html)
and IDs [here](https://uswest.ensembl.org/Help/Faq?id=488).
Ensembl annotation is practically identical to [GENCODE](https://www.gencodegenes.org/).
All of them are using the same genome assemblies (at least for human and mouse, as far as I know),
but different pipelines for annotation, so, for example, common genes could have different loci, exons etc.,
or some genes are present only in one of the annotations.
Their impact on RNA-seq data analysis was evaluated in [this paper](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8).
So which one to use? There was a similar question
[here](https://bioinformatics.stackexchange.com/questions/21/feature-annotation-refseq-vs-ensembl-vs-gencode-whats-the-difference.).
Here is one of the answers:
> In a practical sense, I think the biggest difference between RefSeq and Ensembl/GENCODE
is in the sensitivity/specificity trade off.
> Ensembl aims more towards the inclusive end, including a far larger number of transcript variants,
many of which are only weakly supported.
> RefSeq trades some of this sensitivity for specificity - you can be more confident that
a RefSeq transcript exists, but less confident that the ReqSeq annotation
includes all of the real transcripts for a gene.
So Ensembl should be better for exploratory analysis (in terms of detected genes including false positives),
while RefSeq is more practical for a more narrowed set of validated genes.
Also, it is a matter of preference `r emo::ji("slightly_smiling_face")`
For example, in our lab we are almost in all cases using Ensembl.
Because this exercise is largely based on the [F1000 tutorial](https://f1000research.com/articles/4-1070/v2)
in which Ensembl reference was used, we will use the same one -
human reference genome (chromosome 20) and transcriptome, both in the older version 37.75.
We have already downloaded all the files for you `r emo::ji("sunglasses")`
Anyway, it is good to know how to navigate in [Ensembl FTP site](http://ftp.ensembl.org/pub/).
For each release (such as 75 in our case) there is a following directory structure:
`<file type>/<organism>/<files>`. For example:
- Human genome and transcriptome references in FASTA format are located in
`release-75/fasta/homo_sapiens/dna` and `release-75/fasta/homo_sapiens/cdna` directory, respectively.
- Human genome annotation is located in `release-75/gtf/homo_sapiens`.
- There are also `<current_*>/<organism>` (e.g. `current_gtf/homo_sapiens`) directories pointing to the latest Ensembl release's files.
In each of these directories there is also a `README` file describing a content of files in that directory
([example](http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/dna/README) for human reference genome files).
If you are working on a remote server, `wget` tool provides the easiest way to download the files.
The basic command is:
```{bash}
wget -O output_file url
```
For example:
```{bash}
wget -O transcriptomes/human_transcriptome.fa.gz ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz
```
You can omit the `-O` parameter and `wget` will download the file to the current directory and with the original filename.
***
# Filtering out rRNA and tRNA - `SortMeRNA`
Our library could be contaminated by ribosomal (rRNA) and transfer RNA (tRNA).
Although this contamination won't affect a subsequent differential expression analysis,
it can significantly slow-down the alignment, because those RNAs contain long repeating sequences.
[SortMeRNA](https://bioinfo.lifl.fr/RNA/sortmerna/) is a tool for filtering, mapping and OTU-picking NGS reads
in metatranscriptomic and metagenomic data.
`SortMeRNA` takes as input a file with reads (FASTA or FASTQ format) and one or multiple rRNA/tRNA database file(s) (FASTA),
and sorts apart rRNA/tRNA and rejected reads into two files specified by the user.
Although its primary purpose is filtering metagenomic data, it can be easily used for classic RNA-seq data.
At the end we will use `MultiQC` to report `SortMeRNA` log - a percentage of rRNA/tRNA in our samples.
The simplified `SortMeRNA` pipeline is depicted here:
![`SortMeRNA` pipeline. First, `SortMeRNA` creates index from our reference FASTA files (rRNA, tRNA, ...).
Then we merge FASTQ files to a single one and use `SortMeRNA` to sort reads to those aligning to reference and those which don't.
Finally, we split the latter to forward and reverse reads.](`r here("E07-RNA_seq/02_quantification/_rmd_images/sortmerna.png")`)
> In the beggining you have copied scripts for `SortMeRNA` to `~/bin`
```{bash}
# First we copy FASTA files with reference sequences to directory for SortMeRNA index.
cp $SORTMERNA_REFERENCE_DIR/$HUMAN_RRNA_FASTA $SORTMERNA_REFERENCE_DIR/$HUMAN_TRNA_FASTA $SORTMERNA_INDEX_DIR
# Create FASTA index.
samtools faidx $SORTMERNA_INDEX_DIR/$HUMAN_RRNA_FASTA
samtools faidx $SORTMERNA_INDEX_DIR/$HUMAN_TRNA_FASTA
# Create SortMeRNA index.
cd $SORTMERNA_INDEX_DIR
indexdb_rna --ref $HUMAN_RRNA_FASTA,$HUMAN_RRNA_FASTA.fai:$HUMAN_TRNA_FASTA,$HUMAN_TRNA_FASTA.fai
cd -
for sample_id in ${SAMPLE_NAMES[@]}; do
echo "Processing sample $sample_id ..."
# SortMeRNA cannot work directly with paired-end data - its input is a single FASTQ file.
# However, authors provide two scripts to merge and unmerge paired-end data.
merge-paired-reads.sh \
$TRIMMED_DATA_DIR/${sample_id}*_1P.fastq \
$TRIMMED_DATA_DIR/${sample_id}*_2P.fastq \
$SORTMERNA_TMP_DIR/${sample_id}.fastq
sortmerna \
--reads $SORTMERNA_TMP_DIR/${sample_id}.fastq \
--ref $SORTMERNA_INDEX_DIR/$HUMAN_RRNA_FASTA,$SORTMERNA_INDEX_DIR/$HUMAN_RRNA_FASTA.fai:$SORTMERNA_INDEX_DIR/$HUMAN_TRNA_FASTA,$SORTMERNA_INDEX_DIR/$HUMAN_TRNA_FASTA.fai \
--aligned $SORTMERNA_TMP_DIR/${sample_id}_rRNA_tRNA \
--other $SORTMERNA_TMP_DIR/${sample_id}_filtered \
--paired_in \
--fastx \
-a $N_THREADS \
--log \
-v
unmerge-paired-reads.sh \
$SORTMERNA_TMP_DIR/${sample_id}_filtered.fastq \
$TRIMMED_FILTERED_DATA_DIR/${sample_id}_trimmomatic_filtered_1.fastq \
$TRIMMED_FILTERED_DATA_DIR/${sample_id}_trimmomatic_filtered_2.fastq
mv $SORTMERNA_TMP_DIR/${sample_id}*.log $SORTMERNA_LOG_DIR
done
```
<p>
<button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_sortmerna" aria-expanded="false" aria-controls="collapseExample">
Show parameters info
</button>
</p>
<div class="collapse div-collapse-info" id="div_info_sortmerna">
- `samtools faidx $SORTMERNA_INDEX_DIR/$HUMAN_RRNA_FASTA`{.bash}
- Create an index (`.fai` extension) for FASTA file.
- `indexdb_rna --ref $HUMAN_RRNA_FASTA,$HUMAN_RRNA_FASTA.fai:$HUMAN_TRNA_FASTA,$HUMAN_TRNA_FASTA.fai`{.bash}
- Create an index for SortMeRNA. The format is `ref1.fasta,ref1.fasta.fai:ref2.fasta,ref2.fasta.fai`.
- `merge-paired-reads.sh forward-reads.fastq reverse-reads.fastq merged-reads.fastq`{.bash}
- Merge paired-end reads. That is, put paired reads side by side into one FASTQ file.
- `unmerge-paired-reads.sh merged-reads.fastq forward-reads.fastq reverse-reads.fastq`{.bash}
- Unmerge paired-end reads. That is, split paired reads to two FASTQ files.
***
`sortmerna`:
- `--reads $TMP_DIR/${sample_id}_merged.fastq`{.bash}
- Input FASTQ file. Paired-end reads must be merge before with `merge-paired-reads.sh` script.
- `--ref`{.bash}
- Reference FASTA. Same parameter format as in `indexdb_rna`.
- `--aligned $TMP_DIR/${sample_id}_rRNA`{.bash}
- Put aligned reads to this file. `.fastq` extension is added automatically.
- `--other $TMP_DIR/${sample_id}_filtered`{.bash}
- Put non-aligned reads to this file - these are reads we are interested in. `.fastq` extension is added automatically.
- `--paired_in`{.bash}
- SortMeRNA assigns each read individually.
In case one read from the pair aligns to reference, whole pair will be put to FASTQ with aligned reads.
Another option is `--paired_out`, which do the opposite.
So these parameters ensure that read pairs won't be split to aligned and other FASTQ files.
- `--fastx`{.bash}
- Output is FASTQ file.
- `-a $N_THREADS`{.bash}
- Use this number of threads.
- `-log`{.bash}
- Create a log file. It will be named `--reads <file>.log`.
- `-v`{.bash}
- Be verbose.
</div>
***
# Alignment to genome
At this moment, we have our reads ready for the alignment.
We will align reads to the human reference genome chromosome 20 (-> BAM file) and then count reads
overlapping exons/genes to obtain a gene-level count matrix (-> delimited text file or R object).
## Aligner - `GSNAP`
Our aligner will be [GSNAP](http://research-pub.gene.com/gmap/).
To effectively align reads, `GSNAP` needs to make an index of the genome. Let's do it:
```{bash}
gmap_build \
--dir=$GSNAP_INDEX_DIR \
--genomedb=$(basename $GSNAP_INDEX_DIR) \
--gunzip \
--sort="chrom" \
--circular="MT" \
$HG37_GENOME
```
<p>
<button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_gsnap_index" aria-expanded="false" aria-controls="collapseExample">
Show parameters info
</button>
</p>
<div class="collapse div-collapse-info" id="div_info_gsnap_index">
- `--dir=$GSNAP_INDEX_DIR`{.bash}
- Save index into this directory.
- `--genomedb=$(basename $GSNAP_INDEX_DIR)`{.bash}
- Name of index (subdirectory in `--dir` will be created).
`basename` will output the part behind the last `/` in `$GSNAP_INDEX_DIR`.
- `--gunzip`{.bash}
- Input files are compressed with gzip.
- `--sort="chrom"`{.bash}
- Sort index by chromosomes.
- `--circular="MT"`{.bash}
- Which chromosomes are circular.
- `$HG37_GENOME`{.bash}
- Path to genome FASTA file.
</div>
***
Now we create information about splice sites. That's because we are mapping reads coming from mRNA to genome which includes introns.
So without the splicing information, "classic aligner" will put a long insert to alignment of reads which span introns.
With splicing information, "splice-aware aligner" (such as `GSNAP`) knows the location of exons,
so reads can be aligned to two exons without a long insert inside the intron.
![](`r here("E07-RNA_seq/02_quantification/_rmd_images/alignment.png")`)
`GSNAP` has a tool for making such a file with splice sites information.
First we decompress the genome annotation GTF file and then make a special compressed GTF with splice sites:
```{bash}
zcat $HG37_ANN_GZ > $HG37_ANN
cat $HG37_ANN | gtf_splicesites | iit_store -o $GSNAP_INDEX_DIR/splice_sites
```
<p>
<button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_gsnap_splicesites" aria-expanded="false" aria-controls="collapseExample">
Show parameters info
</button>
</p>
<div class="collapse div-collapse-info" id="div_info_gsnap_splicesites">
- `gtf_splicesites`{.bash}
- Extract splice sites (exons) and send it to stdout in GTF format.
- `iit_store -o $GSNAP_INDEX_DIR/splice_sites`{.bash}
- Compress GTF and save it to the file `$GSNAP_INDEX_DIR/splice_sites.iit`.
</div>
***
All the files required for alignment are ready, so we can start it.
Later we will use `MultiQC` to report log files from `GSNAP`.
First we decompress the gzipped FASTA with reference genome.
After alignment we use this file to reconstruct the missing header in the SAM files produced by `GSNAP`.
We loop through the FASTQ files and for each we perform the alignment and subsequent processing with `samtools` -
converting to the BAM format (compressed SAM), sorting alignments by position and creating BAM index.
```{bash}
gzip -k -d $HG37_GENOME
for sample_id in ${SAMPLE_NAMES[@]}; do
echo "Processing sample $sample_id ..."
out_file_path=$GSNAP_OUT_DIR/$sample_id
gsnap \
--nthreads=$N_THREADS \
--dir=$GSNAP_INDEX_DIR \
--db=$(basename $GSNAP_INDEX_DIR) \
--use-splicing=$GSNAP_INDEX_DIR/splice_sites \
--gunzip \
--format="sam" \
--output-file=${out_file_path}.sam \
${TRIMMED_FILTERED_DATA_DIR}/${sample_id}*_1.fastq \
${TRIMMED_FILTERED_DATA_DIR}/${sample_id}*_2.fastq
samtools view -Sb -T $HG37_GENOME_D ${out_file_path}.sam | samtools sort -@$N_THREADS -o $GSNAP_OUT_DIR/${sample_id}.bam -
samtools index $GSNAP_OUT_DIR/${sample_id}.bam
done
```
<p>
<button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_gsnap_alignment" aria-expanded="false" aria-controls="collapseExample">
Show parameters info
</button>
</p>
<div class="collapse div-collapse-info" id="div_info_gsnap_alignment">
- `--nthreads=$N_THREADS`{.bash}
- Use this number of threads.
- `--dir=$GSNAP_INDEX_DIR`{.bash}
- Index directory.
- `--db=$(basename $GSNAP_INDEX_DIR)`{.bash}
- Name of genome index.
- `--use-splicing=$GSNAP_INDEX_DIR/splice_sites`{.bash}
- Path to file with splicing information.
- `--gunzip`{.bash}
- Input files are compressed with gzip.
- `--format="sam"`{.bash}
- Output in SAM format.
- `--output-file=${out_file_path}.sam`{.bash}
- Path to output file.
- `${TRIMMED_FILTERED_DATA_DIR}/${sample_id}*_1.fastq`{.bash}
- Forward reads of a sample.
- `${TRIMMED_FILTERED_DATA_DIR}/${sample_id}*_2.fastq`{.bash}
- Reverse reads of a sample.
***
- `samtools view -Sb -T $HG37_GENOME_D ${out_file_path}.sam |`{.bash}
- Convert to BAM format using this number of threads (`-@$N_THREADS`),
using this reference (`-T $HG37_GENOME_D`) FASTA to make a header
(BAM files need information about the length of each reference, and this info is stored in the header),
and pipe the output (stdout).
- `samtools sort -@$N_THREADS -o $GSNAP_OUT_DIR/${sample_id}.bam -`{.bash}
- Read input (stdin), sort it by position and save it to `$GSNAP_OUT_DIR/${sample_id}.bam`.
`samtools` are designed to read stdin when `-` is supplied as the input file.
- `samtools index $GSNAP_OUT_DIR/${sample_id}.bam`{.bash}
- Create index of a BAM file.
</div>
***
## Quality control of the alignment
### `RSeQC`
Taken from the [RSeQC](http://rseqc.sourceforge.net) homepage:
> RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput
sequence data especially RNA-seq data. Some basic modules quickly inspect sequence quality,
nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation,
mapped reads distribution, coverage uniformity, strand specificity, transcript level RNA integrity etc.
`RSeQC` is composed of several command-line tools. Probably the most useful ones are [supported by MultiQC](https://multiqc.info/docs/#rseqc).
We run few of them and later include them in `MultiQC` report.
For some of its tools, `RSeQC` needs an annotation in BED format, which is different from GTF.
Fortunately, authors also [provide](https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg19_Ensembl_gene.bed.gz/download) these annotations
and we will use the annotation of [Ensembl GRCh37/hg19](https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg19_Ensembl_gene.bed.gz/download)
genome to which we aligned our reads (you should always use annotation for the same version of genome).
You can make your own annotation [here](https://genome.ucsc.edu/cgi-bin/hgTables?command=start).
But there is one problem: in our reference genome, chromosomes are named just with their numbers (e.g. `1`, `2`).
In the annotation BED file, chromosome names (first column) contain `chr` prefix (e.g. `chr1`, `chr2`).
This is one of the main discrepancies between Ensembl and RefSeq reference genomes.
So we need to change the BED file a little bit: to remove all `chr` prefixes.
We can do it easily with `awk`:
```{bash}
awk '{ gsub("^chr", "", $1); print $0 }' $HG37_ANN_BED > $HG37_ANN_BED_FIX
```
<p>
<button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_bed_awk" aria-expanded="false" aria-controls="collapseExample">
Show `awk` explanation
</button>
</p>
<div class="collapse div-collapse-info" id="div_info_bed_awk">
- `gsub("^chr", "", $1)`
- In the first column, remove the leading `chr` string.
- `print $0`
- Print all columns.
</div>
Now we can run some `RSeQC` tools on each of our samples:
```{bash}
for sample_id in ${SAMPLE_NAMES[@]}; do
echo "Processing sample $sample_id ..."
in_file_path=$GSNAP_OUT_DIR/${sample_id}.bam
bam_stat.py -i $in_file_path > $RSEQC_OUT_DIR/${sample_id}.bam_stat.txt
read_distribution.py -i $in_file_path -r $HG37_ANN_BED_FIX > $RSEQC_OUT_DIR/${sample_id}.read_distribution.txt
infer_experiment.py -i $in_file_path -r $HG37_ANN_BED_FIX > $RSEQC_OUT_DIR/${sample_id}.infer_experiment.txt
inner_distance.py -i $in_file_path -r $HG37_ANN_BED_FIX -o $RSEQC_OUT_DIR/${sample_id}.rseqc
done
```
<p>
<button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_rseqc" aria-expanded="false" aria-controls="collapseExample">
Show RSeQC tools info
</button>
</p>
<div class="collapse div-collapse-info" id="div_info_rseqc">
- [bam_stat.py](http://rseqc.sourceforge.net/#bam-stat-py)
- Summarizing mapping statistics of a BAM or SAM file.
- [read_distribution.py](http://rseqc.sourceforge.net/#read-distribution-py)
- Provided a BAM/SAM file and reference gene model, this module will calculate how mapped reads were distributed
over genome feature (like CDS exon, 5'UTR exon, 3' UTR exon, Intron, Intergenic regions).
- [infer_experiment.py](http://rseqc.sourceforge.net/#infer-experiment-py)
- This program is used to "guess" how RNA-seq sequencing were configured, particulary how reads were stranded for
strand-specific RNA-seq data, through comparing the "strandness of reads" with the "strandness of transcripts".
- [inner_distance.py](http://rseqc.sourceforge.net/#inner-distance-py)
- This module is used to calculate the inner distance (or insert size) between two paired RNA reads.
The distance is the mRNA length between two paired reads.
</div>
### `preseq` - checking library complexity
From the [preseq](http://smithlabresearch.org/software/preseq/) homepage:
> The `preseq` package is aimed at predicting and estimating the complexity of a genomic sequencing library,
equivalent to predicting and estimating the number of redundant reads from a given sequencing depth and
how many will be expected from additional sequencing using an initial sequencing experiment.
The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth,
or to screen multiple libraries to avoid low complexity samples.
Good reading regarding `preseq` is this blog post called
["How deep should we sequence?"](https://davetang.org/muse/2013/07/10/how-deep-should-we-sequence/).
We will use `lc_extrap` function, which computes the expected future yield of distinct reads and bounds on the number of
total distinct reads in the library and the associated confidence intervals.
Later we will report the output with `MultiQC`.
So let's run `preseq` on each of our samples:
```{bash}
for sample_id in ${SAMPLE_NAMES[@]}; do
echo "Processing sample $sample_id ..."
in_file_path=$GSNAP_OUT_DIR/${sample_id}.bam
preseq lc_extrap -B $in_file_path -o $PRESEQ_OUT_DIR/${sample_id}.ccurve.txt
done
```
***
## `featureCounts` - quantification
Now comes the quantification itself.
The tool [featureCounts](http://bioinf.wehi.edu.au/featureCounts/) from the suite [subread](http://subread.sourceforge.net/)
is a popular quantification software.
`featureCounts` takes as input SAM/BAM files and an annotation file including chromosomal coordinates of features.
It outputs numbers of reads assigned to features (or meta-features).
It also outputs stat info for the overall summrization results, including number of successfully
assigned reads and number of reads that failed to be assigned due to various reasons.
Each entry in the provided annotation file is taken as a **feature** (e.g. an exon).
A **meta-feature** is the aggregation of a set of features (e.g. a gene).
The `featureCounts` program uses the gene_id attribute available in the GTF format annotation
(or the GeneID column in the SAF format annotation) to group features into meta-features,
i.e. features belonging to the same meta-feature have the same gene identifier.
When summarizing reads at meta-feature level, read counts obtained for features included in the
same meta-feature will be added up to yield the read count for the corresponding meta-feature.
A read is said to overlap a feature if at least one read base is found to overlap the feature.
For paired-end data, a fragment (or template) is said to overlap a feature if any of the
two reads from that fragment is found to overlap the feature.
By default, `featureCounts` does not count reads overlapping with more than one feature
(or more than one meta-feature when summarizing at the meta-feature level).
Users can use the `-O` option to instruct `featureCounts` to count such reads
(they will be assigned to all their overlapping features or meta-features).
Note that, when counting at the meta-feature level, reads that overlap multiple features of
the same meta-feature are always counted exactly once for that meta-feature, provided there is no
overlap with any other meta-feature. For example, an exon-spanning read will be counted only once
for the corresponding gene even if it overlaps with more than one exon.
![`featureCounts` counting.](`r here("E07-RNA_seq/02_quantification/_rmd_images/featurecounts.png")`)
First of all, we will use the command-line version of `featureCounts` to get a number
of biotypes (`gene_biotype` in GTF annotation), such as protein coding, rRNA, miRNA, etc.
Let's examine in our GTF annotation what is actually `featureCounts` using as features and meta-features:
```{bash}
head $HG37_ANN | cut -f 3,9 | tail -n +6 | column -t
```
You should see this:
```
gene gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
transcript gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
exon gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002234944";
exon gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00003582793";
exon gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002312635";
```
In the third field in GTF there are features and we will use exons for counting.
In the nineth field there are possible meta-features used for grouping the counts.
To count biotypes we will use the meta-feature `gene_biotype`.
```{bash}
featureCounts -a $HG37_ANN -t exon -g gene_biotype -o $FC_OUT/biotype_counts.tsv -p -C -T $N_THREADS $GSNAP_OUT_DIR/*.bam
```
<p>
<button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_featurecounts" aria-expanded="false" aria-controls="collapseExample">
Show `featureCounts` parameters info
</button>
</p>
<div class="collapse div-collapse-info" id="div_info_featurecounts">
- `-a $HG37_ANN`{.bash}
- Annotation file in GTF format.
- `-t exon`{.bash}
- Feature to count reads is `exon`.
- `-g gene_biotype`{.bash}
- Meta-feature to group counts is `gene_biotype`.
- `-o $FC_OUT/biotype_counts.tsv`{.bash}
- Output counts to this file.
- `-p`{.bash}
- Input files contain paired-end reads.
- `-C`{.bash}
- Do not count read pairs that have their two ends mapping to
different chromosomes or mapping to same chromosome but on different strands.
These reads are called chimeric.
- `-T $N_THREADS`{.bash}
- Use this number of threads.
- `$GSNAP_OUT_DIR/*.bam`{.bash}
- Input files.
</div>
Let's look at the number of distinct biotypes:
```{bash}
cut -f1,7,8,9,10 $FC_OUT/biotype_counts.tsv | tail -n +2 | column -t
```
```
Geneid gsnap_out/SRR1039508.bam gsnap_out/SRR1039509.bam gsnap_out/SRR1039512.bam gsnap_out/SRR1039513.bam
pseudogene 47866 44224 40431 33452
lincRNA 589 568 596 582
protein_coding 26995 26508 28067 27411
antisense 284 351 235 142
processed_transcript 44 52 54 59
snRNA 9 8 9 10
sense_intronic 0 0 1 0
miRNA 24 35 32 35
misc_RNA 61 66 34 27
snoRNA 5 4 4 2
rRNA 0 6 2 7
3prime_overlapping_ncrna 0 0 0 0
polymorphic_pseudogene 0 0 0 0
sense_overlapping 0 0 0 0
IG_V_gene 0 0 0 0
IG_C_gene 0 0 0 0
IG_J_gene 0 0 0 0
IG_V_pseudogene 0 0 0 0
TR_C_gene 0 0 0 0
TR_J_gene 0 0 0 0
TR_V_gene 0 0 0 0
TR_V_pseudogene 0 0 0 0
IG_C_pseudogene 0 0 0 0
TR_D_gene 0 0 0 0
TR_J_pseudogene 0 0 0 0
IG_J_pseudogene 0 0 0 0
IG_D_gene 0 0 0 0
processed_pseudogene 0 0 0 0
Mt_tRNA 0 0 0 0
Mt_rRNA 0 0 0 0
```
Chromosome 20 probably contains a lot of pseudogenes `r emo::ji("monocle")`
***
Now let's move into R and do the gene-level quantification using the `featureCounts()`
function from [Rsubread](https://bioconductor.org/packages/release/bioc/html/Rsubread.html) package.
We will use the meta-feature `gene_id`.
```{r, eval = TRUE}
library(Rsubread)
library(here)
library(glue)
# pattern = ".bam$" will match all files ending with ".bam", so it won't include ".bam.bai" files.
bam_files <- list.files(here("E07-RNA_seq/02_quantification/gsnap_out"), pattern = ".bam$", full.names = TRUE)
annot_file <- here("E07-RNA_seq/data/genome/Homo_sapiens.GRCh37.75.gtf")
res <- featureCounts(
files = bam_files,
annot.ext = annot_file,
isGTFAnnotationFile = TRUE,
GTF.featureType = "exon",
useMetaFeatures = TRUE,
GTF.attrType = "gene_id",
isPairedEnd = TRUE,
# We want to discard reads mapping to multiple locations (i.e. reads having multiple alignments in BAM file).
countMultiMappingReads = FALSE,
# We want to discard reads which have multiple features or meta-features assigned.
allowMultiOverlap = FALSE,
# We want to discard fragments which has its two reads mapped to different chromosomes.
countChimericFragments = FALSE,
nthreads = 2
)
```
`res` is a `list` object with several values:
```{r, eval = TRUE}
names(res)
```
- `counts`: the count matrix, in which rows are genes and columns are samples.
```{r, eval = TRUE}
res$counts[1:5, ]
```
- `stat`: counting summary.
```{r, eval = TRUE}
res$stat
```
We can calculate the percentage of assigned and unassigned reads for each sample:
```{r, eval = TRUE}
rownames(res$stat) <- res$stat$Status
assigned_percent <- apply(dplyr::select(res$stat, -Status), MARGIN = 2, FUN = function(x) {
round((x["Assigned"] / sum(x)) * 100, 2)
})
unassigned_percent <- 100 - assigned_percent
data.frame(
Sample = names(assigned_percent),
Assigned = glue("{assigned_percent}%"),
Unassigned = glue("{unassigned_percent}%")
)
```
***
## Importing count matrix to `DESeq2` - a differential expression analysis package
At this point, we have the gene-level count matrix.
This is the branching point after which we can use a variety of
tools for (but not only) differential expression analysis.
We have chosen the popular package
[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html).
This package also offers a great [vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html),
which also contains a description of theory behind `DESeq2`.
To store a count matrix and additional information, `DESeq2` is using object of class `DESeqDataSet`,
which is derived from [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html).
You can think of this object as a container for your experiment, where each sample has its metadata and counts.
Later we will show you how is this object organized and how to access its data.
In fact, it is very similar to `ExpressionSet` we were using for microarray data.
We load a sample sheet and then create a `DESeqDataSet` object:
```{r, eval = TRUE}
library(DESeq2)
library(stringr)
# Load the sample sheet.
sample_sheet_file <- here("E07-RNA_seq/data/sample_sheet_airway.csv")
sample_sheet <- read.csv(sample_sheet_file)
sample_sheet$Sample_Group <- factor(sample_sheet$Sample_Group, levels = c("untreated", "dex"))
rownames(sample_sheet) <- sample_sheet$Sample_ID
sample_sheet
# Keep only the sample IDs in column names of the count matrix.
colnames(res$counts) <- str_remove(colnames(res$counts), ".bam")
# Make sure the order of columns in the count matrix is the same as order of rows in the sample sheet.
# That is, each row in the sample sheet corresponds to column in the count matrix.
res$counts <- res$counts[, rownames(sample_sheet)]
rownames(sample_sheet)
colnames(res$counts)
# Create a DESeqDataSet object.
# Design is simple: counts of each gene depends on Sample_Group ("untreated" or "dex" in our case).
# Design is further used for modeling the differential expression and in normalization.
dds <- DESeqDataSetFromMatrix(countData = res$counts, colData = sample_sheet, design = ~ Sample_Group)
dds
```
***
# Mapping to transcriptome (pseudoalignment) - `Salmon`
So how does mapping to transcriptome differs from alignment to genome? Well, it is **mapping**.
That means you don't obtain an exact alignment of each read, but only a region in reference to which maps into.
So it is much, much faster than alignment, but also sufficient if you just want
to do differential expression analysis on a **known** transcriptome and you don't care about finding novel splice variants or transcripts.
Anyway, `Salmon` is also able to output SAM files with mappings.
We will use a tool called [Salmon](https://combine-lab.github.io/salmon/),
which is doing both mapping and transcript-level quantification (transcript abundance).
In the recent versions it is also possible to use an
[alignment-based mode](https://salmon.readthedocs.io/en/latest/salmon.html#quantifying-in-alignment-based-mode),
that is, to use BAM files with alignments produced by an another aligner.
In that case you haven't to create a transcriptome index (see below).
First, as for `GSNAP`, we need to create an index (in this case from transcriptome FASTA):
```{bash}
salmon index --threads $N_THREADS --transcripts $HG37_TRANS --index $SALMON_INDEX_DIR
```
Now we are ready for mapping and quantification:
```{bash}
for sample_id in ${SAMPLE_NAMES[@]}; do
echo "Processing sample $sample_id ..."
salmon quant \
-i $SALMON_INDEX_DIR \
--libType A \
--threads $N_THREADS \
-1 $TRIMMED_FILTERED_DATA_DIR/${sample_id}*_1.fastq \
-2 $TRIMMED_FILTERED_DATA_DIR/${sample_id}*_2.fastq \
-o $SALMON_OUT_DIR/$sample_id
done
```
<p>
<button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_salmon" aria-expanded="false" aria-controls="collapseExample">
Show `Salmon` parameters info
</button>
</p>
<div class="collapse div-collapse-info" id="div_info_salmon">
- `-i $SALMON_INDEX_DIR`{.bash}
- Directory with `Salmon` index.
- `--libType A`{.bash}
- Library type. `A` is for automatic determination by `Salmon`.
For more complicated library types see [this](https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype).
- `--threads $N_THREADS`{.bash}
- Use this number of threads.
- `-1`{.bash}
- Forward reads.
- `-2`{.bash}
- Reverse reads.
- `-o $SALMON_OUT_DIR/$sample_id`{.bash}
- Quantification of each sample will be saved here.
</div>
Output from `Salmon` looks like this. There are some mapping metrics and logs, but mainly a file named `quant.sf`
containing a transcript-level count matrix.
```{bash}
ls $SALMON_OUT_DIR/*
```
```
salmon_out/SRR1039508:
aux_info cmd_info.json lib_format_counts.json libParams logs quant.sf
salmon_out/SRR1039509:
aux_info cmd_info.json lib_format_counts.json libParams logs quant.sf
salmon_out/SRR1039512:
aux_info cmd_info.json lib_format_counts.json libParams logs quant.sf
salmon_out/SRR1039513:
aux_info cmd_info.json lib_format_counts.json libParams logs quant.sf
```
`quant.sf` is just a TAB delimited file:
```{bash}
column -t -s $'\t' $SALMON_OUT_DIR/${SAMPLE_NAMES[1]}/quant.sf | head
```
```
Name Length EffectiveLength TPM NumReads