Skip to content

Detecting chromosomal rearrangements from targeted proximity-ligation data

Notifications You must be signed in to change notification settings

deLaatLab/PLIER

Repository files navigation

PLIER

PLIER is a simple yet powerful framework to identify regions in the genome with more than expected coverage. This coverage can be defined as a typical "depth of mapped reads", or number of independent ligation-products captured in a proximity-ligation dataset (such as FFPE-TLC).

DOI

PLIER's working structure

PLIER pipeline is structured in several steps to facilitate distributed computation (e.g., in a High Performance Computing environment). Each step is implemented in a specific file that are numbered from 01 (e.g., 01_demux-by-probs.py). Before explaining each step, we will explain the pre-requisites.

PLIER's package requirements:

The PLIER pipeline requires the following tools:

  • A Unix like shell (e.g. Bash v3.2+)
  • Samtools v1.9+
  • Bwa v0.7.17+
  • Python v3.5+ and the following Python packages:
    • numpy v1.13.3+
    • pandas v0.23.4+
    • h5py v2.7.1+
    • matplotlib v2.1.2+ (only for producing summary statistics)
    • pysam v0.15.1+
    • scikit-learn v0.23.2+
    • python-Levenshtein v0.12.0+

The user can run the following command to make sure all required packages are installed:

pip3 install numpy pandas h5py matplotlib pysam scikit-learn python-Levenshtein

PLIER's paths

There are several paths that need to be defined before running PLIER. For example, these paths include (for example) the path to the FASTA file of your reference genome (reference_path) or path to its corresponding BWA index (bwa_index).

These paths can be defined by opening ./configs.json in a text editor and describing the proper paths in your system.
Please adjust the exemplar paths that are already defined in ./configs.json to point to the proper files in your system before running PLIER.

Defining probe sets (or viewpoints) coordinates

FFPE-TLC data is generated by pooling reads that have at least a single fragment mapping to probed-areas. The corresponding probs coordinates can be defined in a tab-delimited file (for example see ./probs/lymphoma.probs.tsv). The content of such a tab-delimited file describes the probe coordinates as well as a given experiment name. For example:

Table.1. An example of probs details.

chr pos_begin pos_end target_locus
chr3 187429165 187760000 BCL6
chr8 128448316 129453680 MYC
chr14 106024796 106044796 IGH
chr14 106156687 106176687 IGH
chr14 106310062 106385377 IGH
chr14 107329540 107349540 IGH
chr18 60736416 60996600 BCL2

As you can see, multiple (linearly adjacent) regions can have the same name. These will be merged into a single viewpoint.

During the de-multiplexing step (see step 2.ii), each row in the prob information file (i.e. Table.1) produces an experiment. (also known as "viewpoint", e.g. the target gene such as MYC, BCL2, etc.). Under the hood, PLIER assigns reads to their related viewpoint based on their fragment’s overlap with the viewpoint’s coordinates. A mapped read is discarded if it did not overlap with any viewpoint that is defined in the *.probs.tsv file. Each collection of mapped-reads related to a viewpoint is stored in a dataset. Therefore, each combination of sample and viewpoint produce an independent FFPE-TLC experiment. The related meta-data for each experiment will be stored (by default) in <SAMPLE ID>.viewpoints.tsv.

How to run PLIER (an example "test" case)

  1. Start by defining configurations (in configs.json) and probe details (e.g., in ./probs/<PROJECT ID>.probs.tsv, see above).

  2. For FFPE-TLC experiments: 0. Aligning sequenced reads: We recommend to use BWA-MEM and the following arguments (hg19 in this case):

    bwa mem -SP -t6 -k12 -A2 -B3 reference_index_hg19_bwa \
          ./fastqs/lymphoma/F217-test_R1.fastq.gz \
          ./fastqs/lymphoma/F217-test_R2.fastq.gz | \
          samtools sort -n | \
          samtools view -h -q1 -b - > ./bams/combined/lymphoma/F217-test.bam

    Notes: - Due to privacy requirements of the patients used in this study, we can not share the FASTQ files that was used in this example. Please contact the authors in case you would like to have these exemplar FASTQ files. You would need an official approval from the corresponding ethics committee to receive the FASTQ files. - The alignments must be are sorted by name (i.e. samtools sorting with -n argument). Such sorting facilitates collection of fragments that are produced by an individual paired-end read (i.e. split-mapping) when traversing the BAM file.

    1. De-multiplexing reads into experiments by:

      python3 ./01_demux-by-probs.py --sample_id=F217-test --input_bam=./bams/combined/lymphoma/F217-test.bam --probs=./probs/lymphoma.probs.tsv --genome=hg19 

      For each defined viewpoint (i.e. in this case, each row in ./probs/lymphoma.probs.tsv file), this script produces an independent alignment file (.BAM) which contains reads that enclosed at least a fragment mapping to that particular viewpoint. In addition, .viewpoints.tsv file is created that stores the meta-data related to each experiment (in this case ./viewpoints/F217-test.viewpoints.tsv). We will frequently use this file in the next steps.

    2. Storing the de-multiplexed reads into an HDF5 container by:

      python3 ./02_make_dataset.py --viewpoints=./viewpoints/F217-test.viewpoints.tsv

      By this script, PLIER maps the fragments into an in-silico digested reference genome. This procedure requires positions of restriction enzyme's recognition sites of a given reference genome. These coordinates are stored in ./renzs/ folder. Such files are tab-delimited compressed files and are automatically created by PLIER (if not found), but they can be generated manually as well. Automatic generation is done by scanning the corresponding reference genome stored in FASTA format (reference_genome path is defined in configs.json) and storing the coordinates of restriction-enzyme cut sites in the relevant tab-delimited and compressed file. For example ./renzs/hg19_NlaIII.npz refers to the hg19 genome that is in-silico digested using NlaIII restriction enzyme (i.e. CATG sequence).

    3. Computing enrichment scores using PLIER by:

      python3 ./03_estimate_proximity-enrichments.py --viewpoints=./viewpoints/F217-test.viewpoints.tsv --expr_index=0 --bin_width=75e3

      This command computes the enrichment scores for the first (--expr_index=0) experiment defined in the given viewpoint file (in this case ./viewpoints/F217-test.viewpoints.tsv file. The user can change the experiment's index --expr_index to compute enrichment scores for another experiment (i.e. another row in ./viewpoints/F217-test.viewpoints.tsv file). Therefore, each application of PLIER is executed independently. This facilitates implementation of PLIER in a high-performance computing environment. For each experiment, the enrichment scores are stored in the --output_dir path, using an appropriate naming convention that specify the parameters used for the computation (e.g., bin_width). If a --output_id is given, the output will be stored according to the given name. The most important argument is bin_width that needs to be defined. Other arguments can be left as default.

      In order to compute enrichments across all experiments, one need to execute PLIER multiple times. Note that PLIER uses a multi-scale enrichment score and therefore, PLIER should be run with multiple bin_widths as follows:

      for ei in `seq 0 3`; do
         for bw in 5e3 75e3; do
            python3 ./03_estimate_proximity-enrichments.py --viewpoints=./viewpoints/F217-test.viewpoints.tsv --expr_index=${ei} --bin_width=${bw} --n_epoch=10;
         done; 
      done
    4. PLIER computes an enrichment for each (say 75kb) bin in the genome. In order to identify rearrangement events, adjacently enriched bins need to be merged into a enriched regions. By default, any cluster of adjacent bins that show enrichment above 5.0 are merged. This threshold can be adjusted using --min_enrichment.
      This is done by the following command:

      python3 ./04_merge_neighbor_enrichments.py --viewpoints=./viewpoints/F217-test.viewpoints.tsv

      In case you used the default naming convention, the corresponding enrichments (across bin_widths) will be found for each experiment. Otherwise, you need to adjust the --search_pattern argument accordingly.

    5. After merging the enrichment, significant rearrangements can be identified by selecting enriched regions that are found across scales (i.e., bin_widths=5kb as well 75kb) and also show merged enrichment above 8.0. This can be done by:

      python3 ./05_select_significant_enrichments_as_calls.py --viewpoints=./viewpoints/F217-test.viewpoints.tsv

Contact & questions

For any inquiry please contact Amin Allahyar at a{DOT}allahyar{AT}hubrecht{DOT}eu.

Citation and DOI:

In case you used an analysis or result from this work, please cite: https://doi.org/10.1038/s41467-021-23695-8

License for research or commercial use

This method is protected by a patent application. Please contact Cergentis [email protected] to discuss a research or commercial license if you want to make use of the method.