Exercises/E07-RNA_seq/02_quantification/02_quantification.Rmd

---
title: "E07 - RNA-seq / 02 - quantification (Analysis of Gene Expression @ UCT Prague)"
author:
  - Jiri Novotny jiri.novotny@img.cas.cz
  - Studuj bioinformatiku! http://studuj.bioinformatiku.cz
institute: "Laboratory of Genomics and Bioinformatics @ Institute of Molecular Genetics of the ASCR"
output:
  rmdformats::readthedown:
    highlight: "kate"
    lightbox: true
    thumbnails: true
    gallery: true
    toc_depth: 4
    self_contained: true
    number_sections: false
    toc_collapsed: false
    df_print: "paged"
date: "`r Sys.Date()`"
---

```{r, child = here::here("_assets/custom.Rmd"), eval = TRUE}
```

***

If you have created `tmux` session (named `rnaseq`) in the previous exercise, you can attach to it with

```{bash}
tmux attach -t rnaseq
```

Copy, please, these files and directories to your personal directory:

```{bash}
cp -r ~/shared/AGE_current/Exercises/E07-RNA_seq/02_quantification ~/AGE/Exercises/E07-RNA_seq
cp -r ~/shared/AGE_current/Exercises/E07-RNA_seq/data/{sample_sheet_airway.csv,experiment_trimmed_filtered,genome,gsnap_index_hg37_chr20,salmon_index_hg37_chr20,rna_sequences,transcriptome} ~/AGE/Exercises/E07-RNA_seq/data
cp -r ~/shared/AGE_current/Exercises/E07-RNA_seq/bin ~
chmod +x ~/bin/*
```

And change the working directory in RStudio terminal to the current exercise's directory:

```{bash}
cd ~/AGE/Exercises/E07-RNA_seq/02_quantification
```

We also need to install additional software.
We reuse the `conda` environment `tools` created in the previous exercise:

```{bash}
conda deactivate
conda activate tools
conda install -c bioconda preseq subread salmon rseqc
```

***

# Assignment

Your assignment will be quite similar to E06-IGV:

- Using the `dds` object, find at least three interesting genes,
  i.e. they should be expressed at all (a mean number of counts across samples at least greater than 100)
  and possibly their expression should differ between conditions (but you don't have to calculate the differential expression).
- Filter the BAM files so they contain reads assigned to those genes.
- Also filter the GTF file, so only features (rows) associated with those genes will remain.
- Import the files to IGV, so there will be:
  - A reference sequence. To make it easier, use the one IGV is offering.
  - A track with annotation.
  - Coverage tracks for the samples.
- Make some screenshots of IGV `r emo::ji("slightly_smiling_face")`
- Save the IGV session.
  Make sure you have checked the option "Use relative paths in session files" in `View -> Preferences -> General`.
  This way I can open your IGV session and all needed files located in the same directory will be loaded.

Please, return the assignment in this form: 

- Copy the files to a directory named `E07-RNA_seq_02_quantification_assignment_<your_name>`:
  `bash` script for filtering the BAM and GTF files (also place here the chosen gene names as a comment),
  filtered BAM and GTF files, IGV session and screenshots.
  Put this directory to a ZIP archive named `E07-RNA_seq_02_quantification_assignment_<your_name>.zip`.
- Upload the ZIP file to the appropriate assignment in MS Teams.

***

# Introduction

In this tutorial we will show how to:

- Filter out tRNA and rRNA reads using `SortMeRNA` tool.
- Quantify gene expression to obtain a count matrix, including quality control of this process.
- Import the resulting count matrix to R.

Generally, there are two ways how to do quantification (see the figure below):

- **Aligner pipeline**: Align reads to reference genome to obtain BAM file and count reads overlapping a feature (e.g. exon).
  - We will use [GSNAP](http://research-pub.gene.com/gmap/) for alignment and `featureCounts` tool from
    [subread](http://subread.sourceforge.net/) suite for counting.
    We will use two versions of `featureCounts`: command-line and from within R through
    [Rsubread](http://bioconductor.org/packages/release/bioc/html/Rsubread.html) package.
- **Pseudoaligner pipeline**: Map reads to reference transcriptome (pseudoalignment) and directly obtain a transcript-level count matrix (transcript abundance).
  - We will use [Salmon](https://combine-lab.github.io/salmon/) for this and later import the count matrix to R.

![Two different pipelines for quantification. Both finish with a count matrix.](`r here("E07-RNA_seq/02_quantification/_rmd_images/quantification_pipelines.png")`)

We will continue with trimmed data from the previous exercise.

***

# Config

Here we will set paths to files and directories.
This is copypasted from the previous exercise.

```{bash}
# This is a root directory for data.
BASE_DATA_DIR="../data"

# Here you will find experiment data.
EXPERIMENT_DATA_DIR="$BASE_DATA_DIR/experiment"

# A directory to store trimmed data.
TRIMMED_DATA_DIR="$BASE_DATA_DIR/experiment_trimmed"

# A directory for MultiQC output. Will be created automatically by MultiQC.
MULTIQC_DIR="multiqc"

# A directory for temporary files.
TMP_DIR="tmp"
mkdir $TMP_DIR

SAMPLE_NAMES=("SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513")
```

But here are new variables with paths to reference files and directories for `SortMeRNA`.

```{bash}
N_THREADS=2

### Reference genome and its annotation files.
# Reference sequence of human chromosome 20 in gzipped/decompressed FASTA format.
HG37_GENOME="$BASE_DATA_DIR/genome/Homo_sapiens.GRCh37.75.dna.chromosome.20.fa.gz"
HG37_GENOME_D="$BASE_DATA_DIR/genome/Homo_sapiens.GRCh37.75.dna.chromosome.20.fa"
# Annotation for human reference genome in gzipped GTF format.
HG37_ANN_GZ="$BASE_DATA_DIR/genome/Homo_sapiens.GRCh37.75.gtf.gz"
HG37_ANN="$BASE_DATA_DIR/genome/Homo_sapiens.GRCh37.75.gtf"
# Human reference transcriptome.
HG37_TRANS="$BASE_DATA_DIR/transcriptome/Homo_sapiens.GRCh37.75.cdna.all.fa.gz"

### SortMeRNA
# Reference sequences of human rRNA and tRNA.
SORTMERNA_REFERENCE_DIR="$BASE_DATA_DIR/rna_sequences"
HUMAN_RRNA_FASTA="human_rRNA.fa"
HUMAN_TRNA_FASTA="human_tRNA.fa"

# Directory for SortMeRNA index of our reference sequences.
SORTMERNA_INDEX_DIR="sortmerna_index"
mkdir $SORTMERNA_INDEX_DIR

# Directory for logs from SortMeRNA which we later parse with MultiQC.
SORTMERNA_LOG_DIR="sortmerna_log"
mkdir $SORTMERNA_LOG_DIR

# Directory for temporary files which will be later removed.
SORTMERNA_TMP_DIR="sortmerna_tmp"
mkdir $SORTMERNA_TMP_DIR

# Directory to store trimmed and rRNA/tRNA filtered data.
TRIMMED_FILTERED_DATA_DIR="$BASE_DATA_DIR/experiment_trimmed_filtered"
mkdir $TRIMMED_FILTERED_DATA_DIR

### Paths for GSNAP
# Directory to store index of reference genome.
GSNAP_INDEX_DIR="$BASE_DATA_DIR/gsnap_index_hg37_chr20"

# Directory to store alignments.
GSNAP_OUT_DIR="gsnap_out"
mkdir $GSNAP_OUT_DIR

### RSeQC
HG37_ANN_BED="$BASE_DATA_DIR/genome/hg19_ensembl_gene.bed"
HG37_ANN_BED_FIX="$BASE_DATA_DIR/genome/hg19_ensembl_gene_fix.bed"

RSEQC_OUT_DIR="rseqc_out"
mkdir $RSEQC_OUT_DIR

### preseq
PRESEQ_OUT_DIR="preseq_out"
mkdir $PRESEQ_OUT_DIR

### featureCounts
# Directory to store biotype counts.
FC_OUT="featurecounts_biotypes_out"
mkdir $FC_OUT

### Salmon
SALMON_INDEX_DIR="$BASE_DATA_DIR/salmon_index_hg37_chr20"
SALMON_OUT_DIR="salmon_out"
```

***

# Reference files

Before you do any NGS analysis, you need to get reference files, such as genome, transcriptome,
annotation, etc. The main (but not only) sources for most of the organisms are:

- [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) (USA - NCBI) -
  you can read more about its annotation pipeline [here](https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/).
- [UCSC](https://hgdownload.soe.ucsc.edu/downloads.html)
  - Based on [GENCODE](https://www.gencodegenes.org/) (which is essentially identical to the Ensembl annotation),
    [UniProt](https://www.uniprot.org/), and mRNA data from [GenBank](https://www.ncbi.nlm.nih.gov/genbank/).
    You can find FAQ about the annotation [here](https://genome.ucsc.edu/FAQ/FAQgenes.html).
- [Ensembl](http://ftp.ensembl.org/pub/) (EU - EMBL) - FTP site
  - Based on [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/), [UniProt](https://www.uniprot.org/),
    and [Havana](https://www.ensembl.org/info/genome/genebuild/manual_havana.html) manual annotation.
    You can read more about Ensembl annotation [here](http://www.ensembl.org/info/genome/genebuild/index.html)
    and IDs [here](https://uswest.ensembl.org/Help/Faq?id=488).
    Ensembl annotation is practically identical to [GENCODE](https://www.gencodegenes.org/).

All of them are using the same genome assemblies (at least for human and mouse, as far as I know),
but different pipelines for annotation, so, for example, common genes could have different loci, exons etc.,
or some genes are present only in one of the annotations.
Their impact on RNA-seq data analysis was evaluated in [this paper](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8).

So which one to use? There was a similar question
[here](https://bioinformatics.stackexchange.com/questions/21/feature-annotation-refseq-vs-ensembl-vs-gencode-whats-the-difference.).
Here is one of the answers:

> In a practical sense, I think the biggest difference between RefSeq and Ensembl/GENCODE
  is in the sensitivity/specificity trade off.

> Ensembl aims more towards the inclusive end, including a far larger number of transcript variants,
  many of which are only weakly supported.

> RefSeq trades some of this sensitivity for specificity - you can be more confident that
  a RefSeq transcript exists, but less confident that the ReqSeq annotation
  includes all of the real transcripts for a gene.

So Ensembl should be better for exploratory analysis (in terms of detected genes including false positives),
while RefSeq is more practical for a more narrowed set of validated genes.

Also, it is a matter of preference `r emo::ji("slightly_smiling_face")`
For example, in our lab we are almost in all cases using Ensembl.

Because this exercise is largely based on the [F1000 tutorial](https://f1000research.com/articles/4-1070/v2)
in which Ensembl reference was used, we will use the same one -
human reference genome (chromosome 20) and transcriptome, both in the older version 37.75.
We have already downloaded all the files for you `r emo::ji("sunglasses")`

Anyway, it is good to know how to navigate in [Ensembl FTP site](http://ftp.ensembl.org/pub/).
For each release (such as 75 in our case) there is a following directory structure:
`<file type>/<organism>/<files>`. For example:

- Human genome and transcriptome references in FASTA format are located in
  `release-75/fasta/homo_sapiens/dna` and `release-75/fasta/homo_sapiens/cdna` directory, respectively.
- Human genome annotation is located in `release-75/gtf/homo_sapiens`.
- There are also `<current_*>/<organism>` (e.g. `current_gtf/homo_sapiens`) directories pointing to the latest Ensembl release's files.

In each of these directories there is also a `README` file describing a content of files in that directory
([example](http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/dna/README) for human reference genome files).

If you are working on a remote server, `wget` tool provides the easiest way to download the files.
The basic command is:

```{bash}
wget -O output_file url
```

For example:

```{bash}
wget -O transcriptomes/human_transcriptome.fa.gz ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz
```

You can omit the `-O` parameter and `wget` will download the file to the current directory and with the original filename.

***

# Filtering out rRNA and tRNA - `SortMeRNA`

Our library could be contaminated by ribosomal (rRNA) and transfer RNA (tRNA).
Although this contamination won't affect a subsequent differential expression analysis,
it can significantly slow-down the alignment, because those RNAs contain long repeating sequences.

[SortMeRNA](https://bioinfo.lifl.fr/RNA/sortmerna/) is a tool for filtering, mapping and OTU-picking NGS reads
in metatranscriptomic and metagenomic data.
`SortMeRNA` takes as input a file with reads (FASTA or FASTQ format) and one or multiple rRNA/tRNA database file(s) (FASTA),
and sorts apart rRNA/tRNA and rejected reads into two files specified by the user.
Although its primary purpose is filtering metagenomic data, it can be easily used for classic RNA-seq data.
At the end we will use `MultiQC` to report `SortMeRNA` log - a percentage of rRNA/tRNA in our samples.

The simplified `SortMeRNA` pipeline is depicted here:

![`SortMeRNA` pipeline. First, `SortMeRNA` creates index from our reference FASTA files (rRNA, tRNA, ...).
Then we merge FASTQ files to a single one and use `SortMeRNA` to sort reads to those aligning to reference and those which don't.
Finally, we split the latter to forward and reverse reads.](`r here("E07-RNA_seq/02_quantification/_rmd_images/sortmerna.png")`)

> In the beggining you have copied scripts for `SortMeRNA` to `~/bin`

```{bash}
# First we copy FASTA files with reference sequences to directory for SortMeRNA index.
cp $SORTMERNA_REFERENCE_DIR/$HUMAN_RRNA_FASTA $SORTMERNA_REFERENCE_DIR/$HUMAN_TRNA_FASTA $SORTMERNA_INDEX_DIR

# Create FASTA index.
samtools faidx $SORTMERNA_INDEX_DIR/$HUMAN_RRNA_FASTA
samtools faidx $SORTMERNA_INDEX_DIR/$HUMAN_TRNA_FASTA

# Create SortMeRNA index.
cd $SORTMERNA_INDEX_DIR
indexdb_rna --ref $HUMAN_RRNA_FASTA,$HUMAN_RRNA_FASTA.fai:$HUMAN_TRNA_FASTA,$HUMAN_TRNA_FASTA.fai
cd -

for sample_id in ${SAMPLE_NAMES[@]}; do
  echo "Processing sample $sample_id ..."

  # SortMeRNA cannot work directly with paired-end data - its input is a single FASTQ file.
  # However, authors provide two scripts to merge and unmerge paired-end data.
  merge-paired-reads.sh \
    $TRIMMED_DATA_DIR/${sample_id}*_1P.fastq \
    $TRIMMED_DATA_DIR/${sample_id}*_2P.fastq \
    $SORTMERNA_TMP_DIR/${sample_id}.fastq
  
  sortmerna \
    --reads $SORTMERNA_TMP_DIR/${sample_id}.fastq \
    --ref $SORTMERNA_INDEX_DIR/$HUMAN_RRNA_FASTA,$SORTMERNA_INDEX_DIR/$HUMAN_RRNA_FASTA.fai:$SORTMERNA_INDEX_DIR/$HUMAN_TRNA_FASTA,$SORTMERNA_INDEX_DIR/$HUMAN_TRNA_FASTA.fai \
    --aligned $SORTMERNA_TMP_DIR/${sample_id}_rRNA_tRNA \
    --other $SORTMERNA_TMP_DIR/${sample_id}_filtered \
    --paired_in \
    --fastx \
    -a $N_THREADS \
    --log \
    -v
  
  unmerge-paired-reads.sh \
    $SORTMERNA_TMP_DIR/${sample_id}_filtered.fastq \
    $TRIMMED_FILTERED_DATA_DIR/${sample_id}_trimmomatic_filtered_1.fastq \
    $TRIMMED_FILTERED_DATA_DIR/${sample_id}_trimmomatic_filtered_2.fastq
  
  mv $SORTMERNA_TMP_DIR/${sample_id}*.log $SORTMERNA_LOG_DIR
done
```

<p>
  <button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_sortmerna" aria-expanded="false" aria-controls="collapseExample">
    Show parameters info
  </button>
</p>

<div class="collapse div-collapse-info" id="div_info_sortmerna">

- `samtools faidx $SORTMERNA_INDEX_DIR/$HUMAN_RRNA_FASTA`{.bash}
  - Create an index (`.fai` extension) for FASTA file.
- `indexdb_rna --ref $HUMAN_RRNA_FASTA,$HUMAN_RRNA_FASTA.fai:$HUMAN_TRNA_FASTA,$HUMAN_TRNA_FASTA.fai`{.bash}
  - Create an index for SortMeRNA. The format is `ref1.fasta,ref1.fasta.fai:ref2.fasta,ref2.fasta.fai`.
- `merge-paired-reads.sh forward-reads.fastq reverse-reads.fastq merged-reads.fastq`{.bash}
  - Merge paired-end reads. That is, put paired reads side by side into one FASTQ file.
- `unmerge-paired-reads.sh merged-reads.fastq forward-reads.fastq reverse-reads.fastq`{.bash}
  - Unmerge paired-end reads. That is, split paired reads to two FASTQ files.

***

`sortmerna`:

- `--reads $TMP_DIR/${sample_id}_merged.fastq`{.bash}
  - Input FASTQ file. Paired-end reads must be merge before with `merge-paired-reads.sh` script.
- `--ref`{.bash}
  - Reference FASTA. Same parameter format as in `indexdb_rna`.
- `--aligned $TMP_DIR/${sample_id}_rRNA`{.bash}
  - Put aligned reads to this file. `.fastq` extension is added automatically.
- `--other $TMP_DIR/${sample_id}_filtered`{.bash}
  - Put non-aligned reads to this file - these are reads we are interested in. `.fastq` extension is added automatically.
- `--paired_in`{.bash}
  - SortMeRNA assigns each read individually.
    In case one read from the pair aligns to reference, whole pair will be put to FASTQ with aligned reads.
    Another option is `--paired_out`, which do the opposite.
    So these parameters ensure that read pairs won't be split to aligned and other FASTQ files.
- `--fastx`{.bash}
  - Output is FASTQ file.
- `-a $N_THREADS`{.bash}
  - Use this number of threads.
- `-log`{.bash}
  - Create a log file. It will be named `--reads <file>.log`.
- `-v`{.bash}
  - Be verbose.

</div>

***

# Alignment to genome

At this moment, we have our reads ready for the alignment.
We will align reads to the human reference genome chromosome 20 (-> BAM file) and then count reads
overlapping exons/genes to obtain a gene-level count matrix (-> delimited text file or R object).

## Aligner - `GSNAP`

Our aligner will be [GSNAP](http://research-pub.gene.com/gmap/).
To effectively align reads, `GSNAP` needs to make an index of the genome. Let's do it:

```{bash}
gmap_build \
  --dir=$GSNAP_INDEX_DIR \
  --genomedb=$(basename $GSNAP_INDEX_DIR) \
  --gunzip \
  --sort="chrom" \
  --circular="MT" \
  $HG37_GENOME
```

<p>
  <button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_gsnap_index" aria-expanded="false" aria-controls="collapseExample">
    Show parameters info
  </button>
</p>

<div class="collapse div-collapse-info" id="div_info_gsnap_index">

- `--dir=$GSNAP_INDEX_DIR`{.bash}
  - Save index into this directory.
- `--genomedb=$(basename $GSNAP_INDEX_DIR)`{.bash}
  - Name of index (subdirectory in `--dir` will be created).
    `basename` will output the part behind the last `/` in `$GSNAP_INDEX_DIR`.
- `--gunzip`{.bash}
  - Input files are compressed with gzip.
- `--sort="chrom"`{.bash}
  - Sort index by chromosomes.
- `--circular="MT"`{.bash}
  - Which chromosomes are circular.
- `$HG37_GENOME`{.bash}
  - Path to genome FASTA file.

</div>

***

Now we create information about splice sites. That's because we are mapping reads coming from mRNA to genome which includes introns.
So without the splicing information, "classic aligner" will put a long insert to alignment of reads which span introns.
With splicing information, "splice-aware aligner" (such as `GSNAP`) knows the location of exons,
so reads can be aligned to two exons without a long insert inside the intron.

![](`r here("E07-RNA_seq/02_quantification/_rmd_images/alignment.png")`)

`GSNAP` has a tool for making such a file with splice sites information.
First we decompress the genome annotation GTF file and then make a special compressed GTF with splice sites:

```{bash}
zcat $HG37_ANN_GZ > $HG37_ANN
cat $HG37_ANN | gtf_splicesites | iit_store -o $GSNAP_INDEX_DIR/splice_sites
```

<p>
  <button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_gsnap_splicesites" aria-expanded="false" aria-controls="collapseExample">
    Show parameters info
  </button>
</p>

<div class="collapse div-collapse-info" id="div_info_gsnap_splicesites">

- `gtf_splicesites`{.bash}
  - Extract splice sites (exons) and send it to stdout in GTF format.
- `iit_store -o $GSNAP_INDEX_DIR/splice_sites`{.bash}
  - Compress GTF and save it to the file `$GSNAP_INDEX_DIR/splice_sites.iit`.

</div>

***

All the files required for alignment are ready, so we can start it.
Later we will use `MultiQC` to report log files from `GSNAP`.
First we decompress the gzipped FASTA with reference genome.
After alignment we use this file to reconstruct the missing header in the SAM files produced by `GSNAP`.

We loop through the FASTQ files and for each we perform the alignment and subsequent processing with `samtools` -
converting to the BAM format (compressed SAM), sorting alignments by position and creating BAM index.

```{bash}
gzip -k -d $HG37_GENOME

for sample_id in ${SAMPLE_NAMES[@]}; do
  echo "Processing sample $sample_id ..."

  out_file_path=$GSNAP_OUT_DIR/$sample_id

  gsnap \
  --nthreads=$N_THREADS \
  --dir=$GSNAP_INDEX_DIR \
  --db=$(basename $GSNAP_INDEX_DIR) \
  --use-splicing=$GSNAP_INDEX_DIR/splice_sites \
  --gunzip \
  --format="sam" \
  --output-file=${out_file_path}.sam \
  ${TRIMMED_FILTERED_DATA_DIR}/${sample_id}*_1.fastq \
  ${TRIMMED_FILTERED_DATA_DIR}/${sample_id}*_2.fastq
  
  samtools view -Sb -T $HG37_GENOME_D ${out_file_path}.sam | samtools sort -@$N_THREADS -o $GSNAP_OUT_DIR/${sample_id}.bam -
  samtools index $GSNAP_OUT_DIR/${sample_id}.bam
done
```

<p>
  <button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_gsnap_alignment" aria-expanded="false" aria-controls="collapseExample">
    Show parameters info
  </button>
</p>

<div class="collapse div-collapse-info" id="div_info_gsnap_alignment">

- `--nthreads=$N_THREADS`{.bash}
  - Use this number of threads.
- `--dir=$GSNAP_INDEX_DIR`{.bash}
  - Index directory.
- `--db=$(basename $GSNAP_INDEX_DIR)`{.bash}
  - Name of genome index.
- `--use-splicing=$GSNAP_INDEX_DIR/splice_sites`{.bash}
  - Path to file with splicing information.
- `--gunzip`{.bash}
  - Input files are compressed with gzip.
- `--format="sam"`{.bash}
  - Output in SAM format.
- `--output-file=${out_file_path}.sam`{.bash}
  - Path to output file.
- `${TRIMMED_FILTERED_DATA_DIR}/${sample_id}*_1.fastq`{.bash}
  - Forward reads of a sample.
- `${TRIMMED_FILTERED_DATA_DIR}/${sample_id}*_2.fastq`{.bash}
  - Reverse reads of a sample.

***

- `samtools view -Sb -T $HG37_GENOME_D ${out_file_path}.sam |`{.bash}
  - Convert to BAM format using this number of threads (`-@$N_THREADS`),
    using this reference (`-T $HG37_GENOME_D`) FASTA to make a header
    (BAM files need information about the length of each reference, and this info is stored in the header),
    and pipe the output (stdout).
- `samtools sort -@$N_THREADS -o $GSNAP_OUT_DIR/${sample_id}.bam -`{.bash}
  - Read input (stdin), sort it by position and save it to `$GSNAP_OUT_DIR/${sample_id}.bam`.
    `samtools` are designed to read stdin when `-` is supplied as the input file.
- `samtools index $GSNAP_OUT_DIR/${sample_id}.bam`{.bash}
  - Create index of a BAM file.

</div>

***

## Quality control of the alignment

### `RSeQC`

Taken from the [RSeQC](http://rseqc.sourceforge.net) homepage:

> RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput
sequence data especially RNA-seq data. Some basic modules quickly inspect sequence quality,
nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation,
mapped reads distribution, coverage uniformity, strand specificity, transcript level RNA integrity etc.

`RSeQC` is composed of several command-line tools. Probably the most useful ones are [supported by MultiQC](https://multiqc.info/docs/#rseqc).
We run few of them and later include them in `MultiQC` report.
For some of its tools, `RSeQC` needs an annotation in BED format, which is different from GTF.
Fortunately, authors also [provide](https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg19_Ensembl_gene.bed.gz/download) these annotations
and we will use the annotation of [Ensembl GRCh37/hg19](https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg19_Ensembl_gene.bed.gz/download)
genome to which we aligned our reads (you should always use annotation for the same version of genome).
You can make your own annotation [here](https://genome.ucsc.edu/cgi-bin/hgTables?command=start).

But there is one problem: in our reference genome, chromosomes are named just with their numbers (e.g. `1`, `2`).
In the annotation BED file, chromosome names (first column) contain `chr` prefix (e.g. `chr1`, `chr2`).
This is one of the main discrepancies between Ensembl and RefSeq reference genomes.
So we need to change the BED file a little bit: to remove all `chr` prefixes.
We can do it easily with `awk`:

```{bash}
awk '{ gsub("^chr", "", $1); print $0 }' $HG37_ANN_BED > $HG37_ANN_BED_FIX
```

<p>
  <button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_bed_awk" aria-expanded="false" aria-controls="collapseExample">
    Show `awk` explanation
  </button>
</p>

<div class="collapse div-collapse-info" id="div_info_bed_awk">

- `gsub("^chr", "", $1)`
  - In the first column, remove the leading `chr` string.
- `print $0`
  - Print all columns.

</div>

Now we can run some `RSeQC` tools on each of our samples:

```{bash}
for sample_id in ${SAMPLE_NAMES[@]}; do
  echo "Processing sample $sample_id ..."

  in_file_path=$GSNAP_OUT_DIR/${sample_id}.bam

  bam_stat.py -i $in_file_path > $RSEQC_OUT_DIR/${sample_id}.bam_stat.txt
  read_distribution.py -i $in_file_path -r $HG37_ANN_BED_FIX > $RSEQC_OUT_DIR/${sample_id}.read_distribution.txt
  infer_experiment.py -i $in_file_path -r $HG37_ANN_BED_FIX > $RSEQC_OUT_DIR/${sample_id}.infer_experiment.txt
  inner_distance.py -i $in_file_path -r $HG37_ANN_BED_FIX -o $RSEQC_OUT_DIR/${sample_id}.rseqc
done
```

<p>
  <button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_rseqc" aria-expanded="false" aria-controls="collapseExample">
    Show RSeQC tools info
  </button>
</p>

<div class="collapse div-collapse-info" id="div_info_rseqc">

- [bam_stat.py](http://rseqc.sourceforge.net/#bam-stat-py)
  - Summarizing mapping statistics of a BAM or SAM file.
- [read_distribution.py](http://rseqc.sourceforge.net/#read-distribution-py)
  - Provided a BAM/SAM file and reference gene model, this module will calculate how mapped reads were distributed
    over genome feature (like CDS exon, 5'UTR exon, 3' UTR exon, Intron, Intergenic regions).
- [infer_experiment.py](http://rseqc.sourceforge.net/#infer-experiment-py)
  - This program is used to "guess" how RNA-seq sequencing were configured, particulary how reads were stranded for
    strand-specific RNA-seq data, through comparing the "strandness of reads" with the "strandness of transcripts".
- [inner_distance.py](http://rseqc.sourceforge.net/#inner-distance-py)
  - This module is used to calculate the inner distance (or insert size) between two paired RNA reads.
    The distance is the mRNA length between two paired reads.

</div>

### `preseq` - checking library complexity

From the [preseq](http://smithlabresearch.org/software/preseq/) homepage:

> The `preseq` package is aimed at predicting and estimating the complexity of a genomic sequencing library,
equivalent to predicting and estimating the number of redundant reads from a given sequencing depth and
how many will be expected from additional sequencing using an initial sequencing experiment.
The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth,
or to screen multiple libraries to avoid low complexity samples.

Good reading regarding `preseq` is this blog post called
["How deep should we sequence?"](https://davetang.org/muse/2013/07/10/how-deep-should-we-sequence/).

We will use `lc_extrap` function, which computes the expected future yield of distinct reads and bounds on the number of
total distinct reads in the library and the associated confidence intervals.
Later we will report the output with `MultiQC`.

So let's run `preseq` on each of our samples:

```{bash}
for sample_id in ${SAMPLE_NAMES[@]}; do
  echo "Processing sample $sample_id ..."

  in_file_path=$GSNAP_OUT_DIR/${sample_id}.bam
  
  preseq lc_extrap -B $in_file_path -o $PRESEQ_OUT_DIR/${sample_id}.ccurve.txt
done
```

***

## `featureCounts` - quantification

Now comes the quantification itself.
The tool [featureCounts](http://bioinf.wehi.edu.au/featureCounts/) from the suite [subread](http://subread.sourceforge.net/)
is a popular quantification software.

`featureCounts` takes as input SAM/BAM files and an annotation file including chromosomal coordinates of features.
It outputs numbers of reads assigned to features (or meta-features).
It also outputs stat info for the overall summrization results, including number of successfully
assigned reads and number of reads that failed to be assigned due to various reasons.

Each entry in the provided annotation file is taken as a **feature** (e.g. an exon).
A **meta-feature** is the aggregation of a set of features (e.g. a gene).
The `featureCounts` program uses the gene_id attribute available in the GTF format annotation
(or the GeneID column in the SAF format annotation) to group features into meta-features,
i.e. features belonging to the same meta-feature have the same gene identifier.

When summarizing reads at meta-feature level, read counts obtained for features included in the
same meta-feature will be added up to yield the read count for the corresponding meta-feature.

A read is said to overlap a feature if at least one read base is found to overlap the feature.
For paired-end data, a fragment (or template) is said to overlap a feature if any of the
two reads from that fragment is found to overlap the feature.

By default, `featureCounts` does not count reads overlapping with more than one feature
(or more than one meta-feature when summarizing at the meta-feature level).
Users can use the `-O` option to instruct `featureCounts` to count such reads
(they will be assigned to all their overlapping features or meta-features).

Note that, when counting at the meta-feature level, reads that overlap multiple features of
the same meta-feature are always counted exactly once for that meta-feature, provided there is no
overlap with any other meta-feature. For example, an exon-spanning read will be counted only once
for the corresponding gene even if it overlaps with more than one exon.

![`featureCounts` counting.](`r here("E07-RNA_seq/02_quantification/_rmd_images/featurecounts.png")`)

First of all, we will use the command-line version of `featureCounts` to get a number
of biotypes (`gene_biotype` in GTF annotation), such as protein coding, rRNA, miRNA, etc.

Let's examine in our GTF annotation what is actually `featureCounts` using as features and meta-features:

```{bash}
head $HG37_ANN | cut -f 3,9 | tail -n +6 | column -t
```

You should see this:

```
gene        gene_id  "ENSG00000223972";  gene_name      "DDX11L1";          gene_source  "ensembl_havana";  gene_biotype  "pseudogene";
transcript  gene_id  "ENSG00000223972";  transcript_id  "ENST00000456328";  gene_name    "DDX11L1";         gene_source   "ensembl_havana";  gene_biotype  "pseudogene";      transcript_name  "DDX11L1-002";  transcript_source  "havana";
exon        gene_id  "ENSG00000223972";  transcript_id  "ENST00000456328";  exon_number  "1";               gene_name     "DDX11L1";         gene_source   "ensembl_havana";  gene_biotype     "pseudogene";   transcript_name    "DDX11L1-002";  transcript_source  "havana";  exon_id  "ENSE00002234944";
exon        gene_id  "ENSG00000223972";  transcript_id  "ENST00000456328";  exon_number  "2";               gene_name     "DDX11L1";         gene_source   "ensembl_havana";  gene_biotype     "pseudogene";   transcript_name    "DDX11L1-002";  transcript_source  "havana";  exon_id  "ENSE00003582793";
exon        gene_id  "ENSG00000223972";  transcript_id  "ENST00000456328";  exon_number  "3";               gene_name     "DDX11L1";         gene_source   "ensembl_havana";  gene_biotype     "pseudogene";   transcript_name    "DDX11L1-002";  transcript_source  "havana";  exon_id  "ENSE00002312635";
```

In the third field in GTF there are features and we will use exons for counting.
In the nineth field there are possible meta-features used for grouping the counts.
To count biotypes we will use the meta-feature `gene_biotype`.

```{bash}
featureCounts -a $HG37_ANN -t exon -g gene_biotype -o $FC_OUT/biotype_counts.tsv -p -C -T $N_THREADS $GSNAP_OUT_DIR/*.bam
```

<p>
  <button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_featurecounts" aria-expanded="false" aria-controls="collapseExample">
    Show `featureCounts` parameters info
  </button>
</p>

<div class="collapse div-collapse-info" id="div_info_featurecounts">

- `-a $HG37_ANN`{.bash}
  - Annotation file in GTF format.
- `-t exon`{.bash}
  - Feature to count reads is `exon`.
- `-g gene_biotype`{.bash}
  - Meta-feature to group counts is `gene_biotype`.
- `-o $FC_OUT/biotype_counts.tsv`{.bash}
  - Output counts to this file.
- `-p`{.bash}
  - Input files contain paired-end reads.
- `-C`{.bash}
  - Do not count read pairs that have their two ends mapping to
    different chromosomes or mapping to same chromosome but on different strands.
    These reads are called chimeric.
- `-T $N_THREADS`{.bash}
  - Use this number of threads.
- `$GSNAP_OUT_DIR/*.bam`{.bash}
  - Input files.

</div>

Let's look at the number of distinct biotypes:

```{bash}
cut -f1,7,8,9,10 $FC_OUT/biotype_counts.tsv | tail -n +2 | column -t
```

```
Geneid                    gsnap_out/SRR1039508.bam  gsnap_out/SRR1039509.bam  gsnap_out/SRR1039512.bam  gsnap_out/SRR1039513.bam
pseudogene                47866                     44224                     40431                     33452
lincRNA                   589                       568                       596                       582
protein_coding            26995                     26508                     28067                     27411
antisense                 284                       351                       235                       142
processed_transcript      44                        52                        54                        59
snRNA                     9                         8                         9                         10
sense_intronic            0                         0                         1                         0
miRNA                     24                        35                        32                        35
misc_RNA                  61                        66                        34                        27
snoRNA                    5                         4                         4                         2
rRNA                      0                         6                         2                         7
3prime_overlapping_ncrna  0                         0                         0                         0
polymorphic_pseudogene    0                         0                         0                         0
sense_overlapping         0                         0                         0                         0
IG_V_gene                 0                         0                         0                         0
IG_C_gene                 0                         0                         0                         0
IG_J_gene                 0                         0                         0                         0
IG_V_pseudogene           0                         0                         0                         0
TR_C_gene                 0                         0                         0                         0
TR_J_gene                 0                         0                         0                         0
TR_V_gene                 0                         0                         0                         0
TR_V_pseudogene           0                         0                         0                         0
IG_C_pseudogene           0                         0                         0                         0
TR_D_gene                 0                         0                         0                         0
TR_J_pseudogene           0                         0                         0                         0
IG_J_pseudogene           0                         0                         0                         0
IG_D_gene                 0                         0                         0                         0
processed_pseudogene      0                         0                         0                         0
Mt_tRNA                   0                         0                         0                         0
Mt_rRNA                   0                         0                         0                         0
```

Chromosome 20 probably contains a lot of pseudogenes `r emo::ji("monocle")`

***

Now let's move into R and do the gene-level quantification using the `featureCounts()`
function from [Rsubread](https://bioconductor.org/packages/release/bioc/html/Rsubread.html) package.
We will use the meta-feature `gene_id`.

```{r, eval = TRUE}
library(Rsubread)
library(here)
library(glue)

# pattern = ".bam$" will match all files ending with ".bam", so it won't include ".bam.bai" files.
bam_files <- list.files(here("E07-RNA_seq/02_quantification/gsnap_out"), pattern = ".bam$", full.names = TRUE)
annot_file <- here("E07-RNA_seq/data/genome/Homo_sapiens.GRCh37.75.gtf")

res <- featureCounts(
  files = bam_files,
  annot.ext = annot_file,
  isGTFAnnotationFile = TRUE,
  GTF.featureType = "exon",
  useMetaFeatures = TRUE,
  GTF.attrType = "gene_id",
  isPairedEnd = TRUE,
  # We want to discard reads mapping to multiple locations (i.e. reads having multiple alignments in BAM file).
  countMultiMappingReads = FALSE,
  # We want to discard reads which have multiple features or meta-features assigned.
  allowMultiOverlap = FALSE,
  # We want to discard fragments which has its two reads mapped to different chromosomes.
  countChimericFragments = FALSE,
  nthreads = 2
)
```

`res` is a `list` object with several values:

```{r, eval = TRUE}
names(res)
```

- `counts`: the count matrix, in which rows are genes and columns are samples.

```{r, eval = TRUE}
res$counts[1:5, ]
```

- `stat`: counting summary.

```{r, eval = TRUE}
res$stat
```

We can calculate the percentage of assigned and unassigned reads for each sample:

```{r, eval = TRUE}
rownames(res$stat) <- res$stat$Status

assigned_percent <- apply(dplyr::select(res$stat, -Status), MARGIN = 2, FUN = function(x) {
  round((x["Assigned"] / sum(x)) * 100, 2)
})

unassigned_percent <- 100 - assigned_percent
data.frame(
  Sample = names(assigned_percent),
  Assigned = glue("{assigned_percent}%"),
  Unassigned = glue("{unassigned_percent}%")
)
```

***

## Importing count matrix to `DESeq2` - a differential expression analysis package

At this point, we have the gene-level count matrix.
This is the branching point after which we can use a variety of
tools for (but not only) differential expression analysis.
We have chosen the popular package
[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html).
This package also offers a great [vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html),
which also contains a description of theory behind `DESeq2`.

To store a count matrix and additional information, `DESeq2` is using object of class `DESeqDataSet`,
which is derived from [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html).
You can think of this object as a container for your experiment, where each sample has its metadata and counts.
Later we will show you how is this object organized and how to access its data.
In fact, it is very similar to `ExpressionSet` we were using for microarray data.

We load a sample sheet and then create a `DESeqDataSet` object:

```{r, eval = TRUE}
library(DESeq2)
library(stringr)

# Load the sample sheet.
sample_sheet_file <- here("E07-RNA_seq/data/sample_sheet_airway.csv")
sample_sheet <- read.csv(sample_sheet_file)
sample_sheet$Sample_Group <- factor(sample_sheet$Sample_Group, levels = c("untreated", "dex"))
rownames(sample_sheet) <- sample_sheet$Sample_ID

sample_sheet

# Keep only the sample IDs in column names of the count matrix.
colnames(res$counts) <- str_remove(colnames(res$counts), ".bam")
# Make sure the order of columns in the count matrix is the same as order of rows in the sample sheet.
# That is, each row in the sample sheet corresponds to column in the count matrix.
res$counts <- res$counts[, rownames(sample_sheet)]

rownames(sample_sheet)
colnames(res$counts)

# Create a DESeqDataSet object.
# Design is simple: counts of each gene depends on Sample_Group ("untreated" or "dex" in our case).
# Design is further used for modeling the differential expression and in normalization.
dds <- DESeqDataSetFromMatrix(countData = res$counts, colData = sample_sheet, design = ~ Sample_Group)
dds
```

***

# Mapping to transcriptome (pseudoalignment) - `Salmon`

So how does mapping to transcriptome differs from alignment to genome? Well, it is **mapping**.
That means you don't obtain an exact alignment of each read, but only a region in reference to which maps into.
So it is much, much faster than alignment, but also sufficient if you just want
to do differential expression analysis on a **known** transcriptome and you don't care about finding novel splice variants or transcripts.
Anyway, `Salmon` is also able to output SAM files with mappings.

We will use a tool called [Salmon](https://combine-lab.github.io/salmon/),
which is doing both mapping and transcript-level quantification (transcript abundance).

In the recent versions it is also possible to use an
[alignment-based mode](https://salmon.readthedocs.io/en/latest/salmon.html#quantifying-in-alignment-based-mode),
that is, to use BAM files with alignments produced by an another aligner.
In that case you haven't to create a transcriptome index (see below).

First, as for `GSNAP`, we need to create an index (in this case from transcriptome FASTA):

```{bash}
salmon index --threads $N_THREADS --transcripts $HG37_TRANS --index $SALMON_INDEX_DIR
```

Now we are ready for mapping and quantification:

```{bash}
for sample_id in ${SAMPLE_NAMES[@]}; do
  echo "Processing sample $sample_id ..."

  salmon quant \
    -i $SALMON_INDEX_DIR \
    --libType A \
    --threads $N_THREADS \
    -1 $TRIMMED_FILTERED_DATA_DIR/${sample_id}*_1.fastq \
    -2 $TRIMMED_FILTERED_DATA_DIR/${sample_id}*_2.fastq \
    -o $SALMON_OUT_DIR/$sample_id
done
```

<p>
  <button class="btn btn-primary btn-sm" type="button" data-toggle="collapse" data-target="#div_info_salmon" aria-expanded="false" aria-controls="collapseExample">
    Show `Salmon` parameters info
  </button>
</p>

<div class="collapse div-collapse-info" id="div_info_salmon">

- `-i $SALMON_INDEX_DIR`{.bash}
  - Directory with `Salmon` index.
- `--libType A`{.bash}
  - Library type. `A` is for automatic determination by `Salmon`.
    For more complicated library types see [this](https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype).
- `--threads $N_THREADS`{.bash}
  - Use this number of threads.
- `-1`{.bash}
  - Forward reads.
- `-2`{.bash}
  - Reverse reads.
- `-o $SALMON_OUT_DIR/$sample_id`{.bash}
  - Quantification of each sample will be saved here.

</div>

Output from `Salmon` looks like this. There are some mapping metrics and logs, but mainly a file named `quant.sf`
containing a transcript-level count matrix.

```{bash}
ls $SALMON_OUT_DIR/*
```

```
salmon_out/SRR1039508:
aux_info  cmd_info.json  lib_format_counts.json  libParams  logs  quant.sf

salmon_out/SRR1039509:
aux_info  cmd_info.json  lib_format_counts.json  libParams  logs  quant.sf

salmon_out/SRR1039512:
aux_info  cmd_info.json  lib_format_counts.json  libParams  logs  quant.sf

salmon_out/SRR1039513:
aux_info  cmd_info.json  lib_format_counts.json  libParams  logs  quant.sf
```

`quant.sf` is just a TAB delimited file:

```{bash}
column -t -s $'\t' $SALMON_OUT_DIR/${SAMPLE_NAMES[1]}/quant.sf | head
```

```
Name             Length  EffectiveLength  TPM           NumReads
ENST00000415118  8       9.000            0.000000      0.000
ENST00000434970  9       10.000           0.000000      0.000
ENST00000448914  13      14.000           0.000000      0.000
ENST00000604642  23      24.000           0.000000      0.000
ENST00000603326  19      20.000           0.000000      0.000
ENST00000604950  31      32.000           0.000000      0.000
ENST00000603077  31      32.000           0.000000      0.000
ENST00000605284  17      18.000           0.000000      0.000
ENST00000390591  31      32.000           0.000000      0.000
```

- **Name**: Ensembl ID of transcript
- **Length**: Transcript length (sum of its exons) in base pairs.
- **EffectiveLength**: This is the computed effective length of the target transcript.
  It takes into account all factors being modeled that will effect the probability of
  sampling fragments from this transcript, including the fragment length distribution
  and sequence-specific and GC-fragment bias (if they are being modeled).
- **TPM**: This is `Salmon`'s estimate of the relative abundance of this transcript in units
  of Transcripts Per Million (TPM). TPM is the recommended relative abundance measure to use for downstream analysis
  (but not for differential expression analysis using `DESeq2`!).
- **NumReads**: This is `Salmon`'s estimate of the number of reads mapping to each transcript
  that was quantified. It is an "estimate" insofar as it is the expected number of reads
  that have originated from each transcript given the structure of the uniquely mapping
  and multi-mapping reads and the relative abundance estimates for each transcript.

## Importing `Salmon` output to R (`DESeq2`)

Thanks to the package [tximport](http://bioconductor.org/packages/release/bioc/html/tximport.html)
we can easily import a transcript abundance matrix from `Salmon` to R.
`tximport` calculates gene-level counts, but it needs to know how transcripts belong to genes.
To get this information, we use package
[EnsDb.Hsapiens.v75](https://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v75.html)
containing the annotation of genomic features.

In case the annotation package is not available for your reference version or organism,
you can create `EnsDB` from GTF file (it will be converted to SQLite database file):

```{r, eval = FALSE}
ensembldb::ensDbFromGtf(gtf_file, outfile = "ensdb.sqlite")
ensdb <- ensembldb::EnsDb("ensdb.sqlite)
```

We retrieve IDs of transcripts and corresponding IDs of genes as dataframe:

```{r, eval = TRUE}
library(EnsDb.Hsapiens.v75)

tx2gene <- transcripts(EnsDb.Hsapiens.v75, return.type = "data.frame")[, c("tx_id", "gene_id")]
head(tx2gene)
```

Now we can import data from `Salmon` and create a `DESeqDataSet` object:

```{r, eval = TRUE}
library(tximport)

salmon_files <- here("E07-RNA_seq/02_quantification/salmon_out", sample_sheet$Sample_ID, "quant.sf")
salmon_files
salmon_counts <- tximport(salmon_files, type = "salmon", tx2gene = tx2gene)
dds_salmon <- DESeqDataSetFromTximport(salmon_counts, colData = sample_sheet, design = ~ Sample_Group)
dds_salmon
```

***

# `MultiQC`

We have generated several files and reports from different tools.
Let's put it all together with `MultiQC`:

```{bash}
multiqc -o multiqc_quantification $SORTMERNA_LOG_DIR $RSEQC_OUT_DIR $PRESEQ_OUT_DIR $SALMON_OUT_DIR
```

Now you can explore the `MultiQC` report.

***

# Using `DESeqDataSet` (`SummarizedExperiment`) object

As we said before, you can think of `DESeqDataSet` as a container for your experiment.
Now we examine how to work with this container.
Don't be confused when we sometimes talk about `SummarizedExperiment` - `DESeqDataSet`
is just adding some extended functionality to this object, but the interface
remains the same.

![`SummarizedExperiment` overview.
[Source](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)](`r here("E07-RNA_seq/02_quantification/_rmd_images/summarized_experiment.png")`)

## Column (sample) data

`colData()` function is used to retrieve the column data (sample sheet).
It is similar to `pData()`/`phenoData()` for `ExpressionSet` object:

```{r, eval = TRUE}
colData(dds)
```

To only get the columns names of the count matrix,
you can simply call `colnames()`:

```{r, eval = TRUE}
colnames(dds)
```

Columns of sample data can be also directly accessed from `DESeqDataSet` object:

```{r, eval = TRUE}
dds$Sample_Group
```

## Row (feature) data

Each feature (gene in our case) is described in feature dataframe.
Again, this is similar to `fData()`/`featureData()` for `ExpressionSet` object.

We don't set this dataframe yet, so now it's empty, with rownames being the same as in the count matrix.
The IDs of features can also be retrieved with `rownames()`.

```{r, eval = TRUE}
rowData(dds) %>% head()
rownames(dds) %>% head()
all(rowData(dds) %>% rownames() == rownames(dds))
```

It would be nice to have some information about genes there,
so let's make a more detailed feature dataframe using the annotation
package [org.Hs.eg.db](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html).
This package is updated biannual and contains mapping of various gene identifiers.
Data from this package are retrieved through [AnnotationDbi](https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html)
interface.

And what's actually inside?

```{r, eval = TRUE}
library(org.Hs.eg.db)
columns(org.Hs.eg.db)
```

All this columns are connected by various keys:

```{r, eval = TRUE}
keytypes(org.Hs.eg.db)
```

Gene IDs (rownames) in our `DESeqDataSet` are of type `ENSEMBL`.
We use these gene IDs as keys to retrieve corresponding `SYMBOL` and `GENENAME`:

```{r, eval = TRUE}
# From annotation, we want SYMBOL and GENENAME for each of the specified keys.
# We specify package (AnnotationDbi::) from which select() function is used, because
# this function name is common to several packages.
ann <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(dds), keytype = "ENSEMBL", columns = c("SYMBOL", "GENENAME"))
head(ann)
```

Some ENSEMBL IDs have multiple `SYMBOL` assigned:

```{r, eval = TRUE}
ann[duplicated(ann$ENSEMBL) & duplicated(ann$ENSEMBL, fromLast = TRUE), ] %>% head(10)
```

To keep things easier, we only keep the first occurence of `ENSEMBL`:

```{r, eval = TRUE}
ann <- ann[!duplicated(ann$ENSEMBL), ]
rownames(ann) <- ann$ENSEMBL
```

Now we join the feature data in `dds` object and our new dataframe.

```{r, eval = TRUE}
# We will select only ENSEMBL IDs which are in our DESeqDataSet and we also get the same order.
ann <- ann[rownames(dds), ]

# Make sure ENSEMBL IDs match.
# stopifnot() will throw an error if condition inside is FALSE.
# You can try stopifnot(1 == 2) :)
stopifnot(all(rownames(ann) == rownames(dds)))

rowData(dds) <- cbind(rowData(dds), ann)
rowData(dds) %>% head()
```

## Count data

`counts()` function is used to retrieve the count matrix.
Later we will normalize the count matrix, and to retrieve it, you
have to specify the parameter `normalized = TRUE`.

> This is the one of the few differences of `DESeqDataSet`:
  in case of `SummarizedExperiment`, `assays()` function is used
  to retrieve the count matrix.

```{r, eval = TRUE}
counts(dds)[1:5, ]
```

## Subsetting

`DESeqDataSet` (`SummarizedExperiment`) has one fine feature:
you can subset it and it will be applied to every data slot (just like `ExpressionSet` was doing so).
So for example, you only want the first five features:

```{r, eval = TRUE}
dds_subset <- dds[1:5, ]
dds_subset
rowData(dds_subset)
counts(dds_subset)
```

Or you can subset both features and sample data:

```{r, eval = TRUE}
dds_subset <- dds[1:5, dds$Sample_Group == "dex"]
dds_subset
rowData(dds_subset)
colData(dds_subset)
counts(dds_subset)
```

***

# Cleanup

```{r, warning = TRUE, message = TRUE, eval = TRUE}
save(list = ls(all.names = TRUE), file = here("E07-RNA_seq/02_quantification/02_quantification.RData"), envir = environment())

warnings()
traceback()
sessioninfo::session_info()
```

***
***

# HTML rendering

This chunk is not evaluated (`eval = FALSE`). Otherwise you will probably end up in recursive hell `r emo::ji("exploding_head")`

```{r, eval = FALSE, message = FALSE, warning = FALSE}
library(conflicted)
library(knitr)
library(here)

if (!require(rmdformats)) {
  BiocManager::install("rmdformats")
}

# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE, eval = FALSE)
rmarkdown::render(
  here("E07-RNA_seq/02_quantification/02_quantification.Rmd"),
  output_file = here("E07-RNA_seq/02_quantification/02_quantification.html"),
  envir = new.env(),
  knit_root_dir = here()
)
```