Parse CNV:TR alleles from VCF #1561

Merged: 7 commits, Dec 14, 2023

@nuno-agostinho nuno-agostinho commented Nov 21, 2023

ENSVAR-5875: parse CNV:TR alleles (tandem repeats) from VCF

Consequences for tandem_repeat should be the same as for tandem_duplication; this requires Ensembl/ensembl-variation#1057.

Motivation

VCF v4.4 (section 5.7) details how to represent tandem repeats using the symbolic allele <CNV:TR>. For instance,

chr1 100 cnv_notation T <CNV:TR>,<CNV:TR> . . END=130;SVLEN=30,30;CN=3,0.9666;RUS=CAG,CAG,CA,CAG;RN=1,3;RB=90,15,2,12  GT:PS:CN 1|2:100:3.9666

This is equivalent to having the following alternative allele string, as illustrated in the image below:

  • CAGCAGCAGCAGTTGTTG,CACACA
  • Alternative representation: $(CAG)_4(TTG)_2,(CA)_3$

[Image: Screenshot 2023-11-21 at 13 22 04]

The following INFO fields are used to describe tandem repeats (not all are needed simultaneously, as some are redundant):

| INFO field | Description |
|------------|-------------|
| RN | Total number of repeat sequences in this allele |
| RUS | Repeat unit sequence of the corresponding repeat sequence |
| RUC | Repeat unit count of the corresponding repeat sequence |
| RB | Total number of bases in the corresponding repeat sequence |
| RUB | Number of bases in each individual repeat unit |
| RUL | Repeat unit length of the corresponding repeat sequence |
| CIRB | Confidence interval around RB |
| CIRUC | Confidence interval around RUC |
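
As a rough illustration of how these fields appear in practice, the sketch below splits a VCF INFO string into per-key value lists. It is written in Python for brevity and is not the code from this PR (which lives in the Perl module `VCF.pm`); the function name `parse_info` is hypothetical, and values are kept as strings so that the caller decides how to convert them.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into {key: [values]}.

    Hypothetical helper for illustration only. Keys without '=' are
    treated as flags (True); '.' marks a missing value and becomes None.
    """
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate empty entries such as a trailing ';'
        key, sep, value = entry.partition("=")
        if not sep:
            fields[key] = True
            continue
        fields[key] = [None if v == "." else v for v in value.split(",")]
    return fields


# INFO column of the cnv1 test record from the Testing section
info = parse_info("RN=3;RUS=CAG,TG,CAGG;RUL=3,2,4;RUC=10,7,3;RB=30,14,12")
print(info["RUS"])  # ['CAG', 'TG', 'CAGG']
```

Per-repeat fields such as RUS, RUC and RB are flattened across all alleles, so RN is what tells a consumer how many consecutive entries belong to each `<CNV:TR>` allele.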

Logic

We first check whether the RUS field provides any sequence with which to parse the <CNV:TR> alleles. If not, we simply annotate the variant as a structural variant whose class_SO_term is tandem_repeat (which should return the same consequences as tandem_duplication; see Ensembl/ensembl-variation#1057).

If the RUS field is defined, we recreate the alternative sequence by checking the total number of repeat sequences for each allele (RN; see footnote 1) and then appending each repeat unit sequence as many times as required, using either:

  • The RUC field for that repeat unit as the number of times to append the repeat sequence
  • The RB field for that repeat unit divided by the length of its sequence (when both fields are defined, the resulting value is the same as the RUC field)
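
The reconstruction described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Perl implementation merged in this PR: the name `build_tr_alleles` is hypothetical, inputs are assumed to be already parsed into lists, RUC is preferred over RB when both are given, and raising an error when neither is available is a choice of this sketch only.

```python
def build_tr_alleles(rus, rn=None, ruc=None, rb=None):
    """Rebuild one ALT sequence per <CNV:TR> allele (illustrative sketch).

    rus: repeat unit sequences, flattened across all alleles
    rn:  number of repeat sequences per allele
    ruc: repeat unit counts, parallel to rus (optional)
    rb:  total bases per repeat sequence, parallel to rus (optional)
    """
    if rn is None:
        # Footnote 1: without RN, assume one repeat sequence per allele
        rn = [1] * len(rus)

    alleles = []
    i = 0  # position in the flattened per-repeat lists
    for n_repeats in rn:
        seq = ""
        for _ in range(n_repeats):
            unit = rus[i]
            if ruc is not None:
                count = ruc[i]
            elif rb is not None:
                # RB divided by the unit length gives the unit count
                count = rb[i] // len(unit)
            else:
                # Assumption of this sketch: refuse to guess a count
                raise ValueError("need RUC or RB to expand repeat units")
            seq += unit * count
            i += 1
        alleles.append(seq)
    return alleles


# RN=2,1;RUS=CAG,TTG,CA;RB=12,6,6
print(build_tr_alleles(["CAG", "TTG", "CA"], rn=[2, 1], rb=[12, 6, 6]))
# ['CAGCAGCAGCAGTTGTTG', 'CACACA']
```

With these inputs the result matches the allele string `CAGCAGCAGCAGTTGTTG,CACACA` from the Motivation section: the first allele consumes two entries of RUS and RB, the second consumes the remaining one.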

To take the sequence of the alternative allele into consideration while calculating variant consequences, we return a VariationFeature object instead of an SV (see footnote 2).
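
The branching between the two representations can be summarised in a short Python sketch. The function name `choose_representation` and its return labels are hypothetical, and falling back to a structural variant as soon as any reconstructed allele exceeds `--max_sv_size` is an assumption of this sketch based on footnote 2, not a confirmed detail of the implementation.

```python
def choose_representation(rus, alt_sequences, max_sv_size):
    """Decide how a <CNV:TR> record would be annotated (illustrative).

    rus:           RUS values, with None for missing ('.') entries
    alt_sequences: ALT sequences rebuilt from the repeat fields
    max_sv_size:   value of --max_sv_size, passed in by the caller
    """
    # No usable RUS sequence: annotate as a tandem_repeat structural variant
    if not any(rus or []):
        return "structural_variant"
    # Assumed reading of footnote 2: overly long alleles stay structural
    if any(len(seq) > max_sv_size for seq in alt_sequences):
        return "structural_variant"
    # Otherwise the sequence is known and can drive consequence calculation
    return "sequence_variant"


print(choose_representation(["CAG"], ["CAG" * 65], max_sv_size=1000))
# sequence_variant
```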

Testing

Run VEP with CNV:TR examples, such as:

chr7    140721574	cnv0    T	<CNV:TR>        .	.	SVLEN=30;CN=6.5;RUS=CAG;RUC=65;CIRUC=-15,.	GT	./.
chr7    140721574	cnv1    T	<CNV:TR>        .	.	RN=3;RUS=CAG,TG,CAGG;RUL=3,2,4;RUC=10,7,3;RB=30,14,12
chr7    140721574	cnv2    T	<CNV:TR>,<CNV:TR>	.	.       RN=2,1;RUS=CAG,TTG,CA;RUL=3,3,2;RB=12,6,6;RUC=4,2,4;RUB=3,3,3,3,3,3,2,2,2	.
chr7    140721574	cnv3    T	<CNV:TR>,<CNV:TR>	.	.       SVLEN=30,30;CN=3,0.9666;RUS=CAG,CAG,CA,CAG;RN=1,.;RB=90,15,2,12 GT:PS:CN        1|2:100:3.9666
chr7    140721574	cnv4    T	<CNV:TR>,<CNV:TR>	.	.       SVLEN=30,30;CN=3,0.9666;RUS=CAG,CAG;RUC=2,4;RN=1,1	GT:PS:CN        1|2:100:3.9666
chr7    140721574	cnv4    T	<CNV:TR>,<CNV:TR>	.	.       SVLEN=30,30;CN=3,0.9666;RUS=CAG,CAG;RUC=2,4;missRN=1,1  GT:PS:CN        1|2:100:3.9666
chr7    140721574	cnv5    T	<CNV:TR>,<CNV:TR>	.	.       RN=2,1;RUS=.,CA,.;RUL=3,3,2;RB=12,6,6;RUC=4,2,4;RUB=3,3,3,3,3,3,2,2,2   .
chr7    140721574	cnv6    T	<CNV:TR>,<CNV:TR>	.	.       SVLEN=30,30;CN=3,0.9666;RUS=CAG,CAG,CA,CAG;RN=1,3;RB=90,15,2,12 GT:PS:CN        1|2:100:3.9666
chr7    140721574	cnv7    T	<CNV:TR>        .	.	SVLEN=20000;CN=1.25;RUL=10000;RUC=5;RUB=10000,10500,11000,11500,12000   GT	./.

Currently, if CIRB and CIRUC fields are defined in the VCF, a warning will be thrown stating that these fields are ignored. In the future, we could calculate all possible alternative alleles based on CIRB and CIRUC.

Footnotes

  1. If RN is omitted, each <CNV:TR> allele should have its respective unit repeated only 1 time.

  2. Unless the alternative allele length is higher than --max_sv_size.

modules/Bio/EnsEMBL/VEP/Parser/VCF.pm: three outdated review threads, all resolved

@nakib103 nakib103 left a comment

LGTM!

@nakib103 nakib103 merged commit 59a6cbd into Ensembl:postreleasefix/112 Dec 14, 2023
1 check passed
@nakib103
merged with release/112 and main

@nuno-agostinho nuno-agostinho deleted the CNV-TR branch January 3, 2024 10:08