ValueError: vcf is not a valid file or directory. Please provide a valid file or directory. #71

Open
RosaDeSa opened this issue May 9, 2023 · 8 comments

Comments

@RosaDeSa

RosaDeSa commented May 9, 2023

Hi Kevin, I'm trying this script, but I'm running into this error during prediction (the VCF file was annotated with VEP):

DEBUG | ezancestry.process:process_user_input:214 - list index out of range
Traceback (most recent call last):
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 217, in process_user_input
snpsdf = pd.read_csv(
File "/usr/local/lib/python3.9/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 678, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 581, in _read
return parser.read(nrows)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 1253, in read
index, columns, col_dict = self._engine.read(nrows)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 270, in read
alldata = self._rows_to_cols(content)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 1013, in _rows_to_cols
self._alert_malformed(msg, row_num + 1)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 739, in _alert_malformed
raise ParserError(msg)
pandas.errors.ParserError: Expected 3 fields in line 7, saw 4

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/tigem/r.desantis/.local/bin/ezancestry", line 8, in <module>
sys.exit(app())
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 532, in wrapper
return callback(**use_params) # type: ignore
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/commands.py", line 286, in predict
snpsdf = process_user_input(input_data, aisnps_directory, aisnps_set)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 232, in process_user_input
raise ValueError(
ValueError: a1.VEP.ann.vcf is not a valid file or directory. Please provide a valid file or directory.

@RosaDeSa RosaDeSa closed this as completed May 9, 2023
@arvkevi
Owner

arvkevi commented May 10, 2023

Hi @RosaDeSa 👋🏼, were you able to figure out what the issue was? If so, it could be helpful for others if you share your solution. I'm unsure how ezancestry handles VEP annotations, though the parser from snps might be robust enough to handle them.

@RosaDeSa RosaDeSa reopened this May 10, 2023
@RosaDeSa
Author

RosaDeSa commented May 10, 2023

Hi @arvkevi, I obtained the prediction.csv file and plotted it. The problem was probably due to a malformed file; I regenerated the VCF file with some additional parameters in VEP.
That said, I'm still evaluating the results: I used two different VCFs (from two different samples), but the prediction results are exactly the same, which seems odd. I'll try snps, as you suggested. If I get consistent results, I'll gladly share the solution here!
Thanks
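
For anyone hitting the same ParserError, here is a minimal sketch of one way to spot malformed lines in a VCF before running a prediction. It only checks that every data line has as many tab-separated fields as the `#CHROM` header line; the function name `find_malformed_vcf_lines` and the sample lines are made up for illustration, and this is not part of ezancestry or snps.

```python
def find_malformed_vcf_lines(lines):
    """Return (line_number, field_count) for data lines whose number of
    tab-separated fields differs from the #CHROM header line."""
    expected = None
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Only the #CHROM line defines the column count; other
            # '#' lines are meta-information and are skipped.
            if line.startswith("#CHROM"):
                expected = len(line.split("\t"))
            continue
        fields = line.split("\t")
        if expected is not None and len(fields) != expected:
            bad.append((number, len(fields)))
    return bad


# Made-up example: the last line has one field too many.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID",
    "chr1\t13813\t.",
    "chr1\t13838\trs200683566\textra",
]
malformed = find_malformed_vcf_lines(example_lines)
```

With a real file, the lines could come from `open(path)`; a compressed VCF would need `gzip.open(path, "rt")` instead.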

@arvkevi
Owner

arvkevi commented May 11, 2023

Ezancestry uses snps to read vcfs in process.py. Are the two samples related? Do they have the exact same set of AISNPs?

@RosaDeSa
Author

I noticed that; using snps directly, I get the same results. The samples are not related; they belong to two different people.
And yes, they have the same AISNPs. It's weird, isn't it?

In a while I'll analyze WGS data from two other samples, and I'll test the script on those as well.

#pca,kidd,/home/r.desantis/.ezancestry/data/models,/home/r.desantis/.ezancestry/data/aisnps
,component1,component2,component3,predicted_population_population,ACB,ASW,BEB,CDX,CEU,CHB,CHS,CLM,ESN,FIN,GBR,GIH,GWD,IBS,ITU,JPT,KHV,LWK,MSL,MXL,PEL,PJL,PUR,STU,TSI,YRI,predicted_population_superpopulation,AFR,AMR,EAS,EUR,SAS,population_description,superpopulation_name
LV_vep.vcf,0.11874386857468588,0.15300045809781831,0.3265148978535419,ITU,0.0,0.0,0.08919748915377203,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09703463769218275,0.0,0.0,0.29927578644262454,0.0,0.0,0.0,0.0,0.0,0.0,0.08710151819096609,0.22274821473011025,0.20464235379034443,0.0,0.0,SAS,0.0,0.17202243612400409,0.0,0.0,0.827977563875996,Indian Telugu in the UK,South Asian Ancestry


#pca,kidd,/home/r.desantis/.ezancestry/data/models,/home/r.desantis/.ezancestry/data/aisnps
,component1,component2,component3,predicted_population_population,ACB,ASW,BEB,CDX,CEU,CHB,CHS,CLM,ESN,FIN,GBR,GIH,GWD,IBS,ITU,JPT,KHV,LWK,MSL,MXL,PEL,PJL,PUR,STU,TSI,YRI,predicted_population_superpopulation,AFR,AMR,EAS,EUR,SAS,population_description,superpopulation_name
out.vcf,0.11874386857468588,0.15300045809781831,0.3265148978535419,ITU,0.0,0.0,0.08919748915377203,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09703463769218275,0.0,0.0,0.29927578644262454,0.0,0.0,0.0,0.0,0.0,0.0,0.08710151819096609,0.22274821473011025,0.20464235379034443,0.0,0.0,SAS,0.0,0.17202243612400409,0.0,0.0,0.827977563875996,Indian Telugu in the UK,South Asian Ancestry

@RosaDeSa
Author

RosaDeSa commented May 24, 2023

Hi @arvkevi, I have the same problem with the other two samples.

Below is the head of the VCF (with SNPs) that I give as input. Is this correct for ezancestry?

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a2
chr1    13813   .       T       G       67.64   MQ_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=-1.645;DP=5;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=24.33;MQRankSum=-1.282;QD=13.53;ReadPosRankSum=1.036;SOR=1.609     GT:AD:DP:FT:GQ:PL       0/1:3,2:5:DP_filter:75:75,0,120
chr1    13838   rs200683566     C       T       64.64   MQ_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DB;DP=6;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=25.17;MQRankSum=-1.501;QD=10.77;ReadPosRankSum=0.431;SOR=1.179   GT:AD:DP:FT:GQ:PL       0/1:4,2:6:DP_filter:72:72,0,142
chr1    13868   .       A       G       32.65   MQ_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.967;DP=3;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=26.87;MQRankSum=0.967;QD=10.88;ReadPosRankSum=0.967;SOR=0.223      GT:AD:DP:FT:GQ:PL       0/1:1,2:3:DP_filter:18:40,0,18
chr1    16288   rs200736374     C       G       42.64   QD_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=1.889;DB;DP=36;ExcessHet=0.0000;FS=1.817;MLEAC=1;MLEAF=0.500;MQ=42.58;MQRankSum=-2.014;QD=1.22;ReadPosRankSum=1.022;SOR=0.939   GT:AD:DP:GQ:PL  0/1:30,5:35:50:50,0,968
chr1    16298   rs200451305     C       T       311.64  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=1.497;DB;DP=30;ExcessHet=0.0000;FS=3.682;MLEAC=1;MLEAF=0.500;MQ=42.47;MQRankSum=-4.337;QD=12.47;ReadPosRankSum=2.029;SOR=1.388  GT:AD:DP:GQ:PL  0/1:13,12:25:99:319,0,385
chr1    16378   rs148220436     T       C       293.64  MQ_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=-2.461;DB;DP=38;ExcessHet=0.0000;FS=5.153;MLEAC=1;MLEAF=0.500;MQ=36.39;MQRankSum=-3.036;QD=8.16;ReadPosRankSum=-0.747;SOR=1.190 GT:AD:DP:GQ:PL  0/1:22,14:36:99:301,0,599


@arvkevi
Owner

arvkevi commented May 24, 2023

Hey @RosaDeSa, one other thing that could be contributing to this is having too many missing AISNPs in the vcf. When you call predict, it should log a message indicating how many AISNPs were present in your vcf for a sample. It looks like this (from cell 23 of this notebook).

2021-09-20 06:25:34.289 | INFO     | ezancestry.process:_input_to_dataframe:276 - Sample has a valid genotype for 44 out of a possible 55 (80.0%)

Do you know how many AISNPs were in your input samples?
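
If the log message is hard to find, a rough count can also be made by hand. The sketch below assumes you have the AISNP rsids as a plain list and checks how many of them appear in the ID column of a VCF. The helper name `count_aisnps_in_vcf` and the rsid `rs0000001` are made up for illustration. It only looks at rsids, so it is looser than the number ezancestry logs, which also depends on a valid genotype.

```python
def count_aisnps_in_vcf(vcf_lines, aisnp_rsids):
    """Return (found, total): how many of the given rsids appear in the
    ID column (third column) of the VCF data lines."""
    wanted = set(aisnp_rsids)
    seen = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a usable data line
        # The ID column may hold several ids separated by ';'
        for rsid in fields[2].split(";"):
            if rsid in wanted:
                seen.add(rsid)
    return len(seen), len(wanted)


# Shortened example lines; the rsid list is a hypothetical stand-in
# for a real AISNP set.
vcf_example = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t13813\t.\tT\tG",
    "chr1\t13838\trs200683566\tC\tT",
]
found, total = count_aisnps_in_vcf(vcf_example, ["rs200683566", "rs0000001"])
```

A result of `found == 0` would suggest the VCF has no rsid annotations for the AISNPs at all, independent of any position mismatch.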

@RosaDeSa
Author

Yes, you're right! I have 0 out of a possible 55 using the Kidd set and 1 out of 127 using the Seldin set.
Do you think the problem is the reference I used to align the data (hg38)? Prediction looks up the AISNPs by rsid and not by position, right?

@arvkevi
Owner

arvkevi commented May 25, 2023

Hmm, the merge is on both rsid AND position. Unfortunately, this requires a vcf that is annotated with rsids and whose positions match the hg19 positions from the .aisnps files.

You could try commenting out "chr" and "position_hg19" in this line, but I haven't looked at the hg19->hg38 liftover in about a year. So if you do this, you should see if any alleles changed.

I'll have to think about how ezancestry could support hg38. The easiest would probably be a --hg38 flag that uses new versions of the aisnps files. But I won't have time to get to this work for a little while.
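
To illustrate why a join on rsid AND position drops variants called against a different reference build, here is a small stdlib-only sketch. It is not ezancestry's merge code; the helper `match_aisnps`, the rsids, and the coordinates are all made-up placeholders.

```python
def match_aisnps(sample_rows, aisnp_rows, keys):
    """Keep the sample rows whose values for `keys` match some AISNP row
    (an inner join on the given key columns)."""
    index = {tuple(row[k] for k in keys) for row in aisnp_rows}
    return [row for row in sample_rows if tuple(row[k] for k in keys) in index]


# Hypothetical AISNP table with positions from one reference build.
aisnps = [
    {"rsid": "rsA", "chr": "1", "pos": 1000},
    {"rsid": "rsB", "chr": "2", "pos": 2000},
]
# Hypothetical sample called against another build: rsA has shifted.
sample = [
    {"rsid": "rsA", "chr": "1", "pos": 1050},
    {"rsid": "rsB", "chr": "2", "pos": 2000},
]

strict = match_aisnps(sample, aisnps, ("rsid", "chr", "pos"))  # rsid AND position
loose = match_aisnps(sample, aisnps, ("rsid",))                # rsid only
```

Here `strict` keeps only the variant whose coordinates happen to agree, while `loose` keeps both. Matching on rsid alone finds more variants but skips the position check, which is why the alleles are worth verifying afterwards.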
