Merge pull request #39 from mancusolab/dev

Fix typos and update documentation, update vcf reading function
mancusolab · Apr 24, 2024 · 9622576
2 parents a266425 + 03fba91
Showing 4 changed files with 49 additions and 14 deletions.
diff --git a/README.md b/README.md
@@ -13,12 +13,11 @@ etc.) across multiple ancestries.
 ``` diff
 - We detest usage of our software or scientific outcome to promote racial discrimination.
 ```
-SuShiE is described in,
+
+SuShiE is described in
 >  [Improved multi-ancestry fine-mapping identifies cis-regulatory variants underlying molecular traits and disease risk](https://www.medrxiv.org/content/10.1101/2024.04.15.24305836v1).
 >
 > Zeyun Lu,  Xinran Wang,  Matthew Carr,  Artem Kim,  Steven Gazal,  Pejman Mohammadi,  Lang Wu,  Alexander Gusev,  James Pirruccello,  Linda Kachuri,  Nicholas Mancuso.
-> medRxiv 2024.04.15.24305836; doi: https://doi.org/10.1101/2024.04.15.24305836.
-
 
 Check [here](https://mancusolab.github.io/sushie/) for full
 documentation.
@@ -89,10 +88,8 @@ You can play it with your own ideas!
 
 ## Notes
 
--   SuShiE currently only supports **continuous** phenotype
-    fine-mapping.
--   SuShiE currently only supports fine-mapping on
-    [autosomes]~~(https://en.wikipedia.org/wiki/Autosome)~~.
+-   SuShiE currently only supports **continuous** phenotype fine-mapping.
+-   SuShiE currently only supports fine-mapping on autosomes.
 -   SuShiE uses [JAX](https://github.com/google/jax) with [Just In
     Time](https://jax.readthedocs.io/en/latest/jax-101/02-jitting.html)
     compilation to achieve high-speed computation. However, there are
@@ -103,13 +100,14 @@ You can play it with your own ideas!
 
 ## Version History
 
-| Version | Description |
-| --------- | --------- |
-| 0.1  |     Initial Release |
-| 0.11 |     Fix the bug for OLS to compute adjusted r squared. |
-| 0.12 |    Update io.corr function so that report all the correlation results no matter cs is pruned or not. |
-| 0.13  |   Add `--keep` command to enable user to specify a file that contains the subjects ID SuShiE will perform on. Add `--ancestry_index` command to enable user to specify a file that contains the ancestry index for fine-mapping. With this, user can input single phenotype, genotype, and covariate file that contains all the subjects across ancestries. Implement padding to increase inference time. Record elbo at each iteration and can access it in the `infer.SuShiEResult` object. The alphas table now outputs the average purity and KL divergence for each `L`. Change `--kl_threshold` to `--divergence`. Add `--maf` command to remove SNPs that less than minor allele frequency threshold within each ancestry. Add `--max_select` command to randomly select maximum number of SNPs to compute purity to avoid unnecessary memory spending. Add a QC function to remove duplicated SNPs. |
-| 0.14  | Remove KL-Divergence pruning. Enhance command line appearance and improve the output files contents. Fix small bugs on multivariate KL. |
+| Version | Description |
+|---------|-------------|
+| 0.1     | Initial release. |
+| 0.11    | Fix a bug in OLS when computing adjusted r-squared. |
+| 0.12    | Update the `io.corr` function to report all correlation results regardless of whether the cs is pruned. |
+| 0.13    | Add `--keep` option to let users specify a file containing the IDs of the subjects SuShiE will run on. Add `--ancestry_index` option to let users specify a file containing the ancestry index for fine-mapping; with this, users can supply a single phenotype, genotype, and covariate file that contains all subjects across ancestries. Implement padding to speed up inference. Record the ELBO at each iteration; it can be accessed in the `infer.SuShiEResult` object. The alphas table now outputs the average purity and KL divergence for each `L`. Rename `--kl_threshold` to `--divergence`. Add `--maf` option to remove SNPs below the minor allele frequency threshold within each ancestry. Add `--max_select` option to randomly select a maximum number of SNPs for computing purity, avoiding unnecessary memory use. Add a QC function to remove duplicated SNPs. |
+| 0.14    | Remove KL-divergence pruning. Improve the command-line appearance and the contents of the output files. Fix small bugs in multivariate KL. |
+| 0.15    | Fix several typos; add a sanity check when reading vcf genotype data that codes gt_types==Unknown as NA; add preprint information. |
 
 ## Support
 
@@ -139,6 +137,7 @@ Lab](https://www.mancusolab.com/):
 -   [HAMSTA](https://github.com/tszfungc/hamsta): a Python software to
     estimate heritability explained by local ancestry data from
     admixture mapping summary statistics.
+-   [Traceax](https://github.com/tszfungc/traceax): a Python library to perform stochastic trace estimation for linear operators.
 
 ------------------------------------------------------------------------
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -4,6 +4,19 @@ SuShiE🍣
 
 SuShiE (Sum of Shared Single Effect) is a Python software to fine-map causal SNPs, compute prediction weights, and infer effect size correlation for molecular data (e.g., mRNA levels and protein levels etc.) across multiple ancestries. **The manuscript is in progress.**
 
+.. code:: diff
+
+    - We detest usage of our software or scientific outcome to promote racial discrimination.
+
+
+SuShiE is described in
+
+.. code::
+
+    `Improved multi-ancestry fine-mapping identifies cis-regulatory variants underlying molecular traits and disease risk <https://www.medrxiv.org/content/10.1101/2024.04.15.24305836v1>`_
+
+    Zeyun Lu, Xinran Wang, Matthew Carr, Artem Kim, Steven Gazal, Pejman Mohammadi, Lang Wu, Alexander Gusev, James Pirruccello, Linda Kachuri, Nicholas Mancuso
+
 Contents
 ========
 
@@ -35,3 +48,17 @@ Contents
    Version History <version>
    Authors <authors>
    License <license>
+
+Other Software
+==============
+
+Feel free to use other software developed by `Mancuso
+Lab <https://www.mancusolab.com/>`_:
+
+* `MA-FOCUS <https://github.com/mancusolab/ma-focus>`_: a Bayesian fine-mapping framework using statistics across multiple ancestries to identify the causal genes for complex traits.
+* `SuSiE-PCA <https://github.com/mancusolab/susiepca>`_: a scalable Bayesian variable selection technique for sparse principal component analysis.
+* `twas_sim <https://github.com/mancusolab/twas_sim>`_: a Python software to simulate statistics.
+* `FactorGo <https://github.com/mancusolab/factorgo>`_: a scalable variational factor analysis model that learns pleiotropic factors from GWAS summary statistics.
+* `HAMSTA <https://github.com/tszfungc/hamsta>`_: a Python software to estimate heritability explained by local ancestry data from admixture mapping summary statistics.
+* `Traceax <https://github.com/tszfungc/traceax>`_: a Python library to perform stochastic trace estimation for linear operators.
diff --git a/sushie/cli.py b/sushie/cli.py
@@ -1101,6 +1101,7 @@ def build_finemap_parser(subp):
             " Use 'space' to separate ancestries if more than two.",
             " Keep the same ancestry order as phenotype's.",
             " SuShiE currently does not take plink 2 format.",
+            " Data must contain only biallelic variants.",
         ),
     )
 
@@ -1112,6 +1113,8 @@ def build_finemap_parser(subp):
         help=(
             "Genotype data in vcf format. Use 'space' to separate ancestries if more than two.",
             " Keep the same ancestry order as phenotype's. The software will count REF allele.",
+            " If gt_types is UNKNOWN, it is coded as NA and imputed by allele frequency.",
+            " Data must contain only biallelic variants.",
         ),
     )
 
@@ -1123,6 +1126,7 @@ def build_finemap_parser(subp):
         help=(
             "Genotype data in bgen 1.3 format. Use 'space' to separate ancestries if more than two.",
             " Keep the same ancestry order as phenotype's.",
+            " Data must contain only biallelic variants.",
         ),
     )
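
The vcf help text above says that UNKNOWN genotypes are coded as NA and then imputed by allele frequency, but the imputation code itself is not part of this diff. A minimal sketch of one common reading, mean imputation (each missing value is filled with the SNP's mean observed dosage, which equals twice the counted allele's frequency), might look like the following. The function name `impute_by_allele_frequency` and the plain-list representation are illustrative assumptions, not SuShiE's API.

```python
import math


def impute_by_allele_frequency(dosages):
    """Fill NaN dosages of one SNP with the mean observed dosage.

    Hypothetical helper: the mean dosage over observed samples equals
    2 * (frequency of the counted allele), hence "by allele frequency".
    """
    observed = [d for d in dosages if not math.isnan(d)]
    if not observed:
        raise ValueError("cannot impute a SNP with no observed genotypes")
    mean_dosage = sum(observed) / len(observed)
    return [mean_dosage if math.isnan(d) else d for d in dosages]


# One SNP across five samples; the third genotype is missing (NA).
snp = [0.0, 1.0, math.nan, 2.0, 1.0]
print(impute_by_allele_frequency(snp))  # [0.0, 1.0, 1.0, 2.0, 1.0]
```

Imputing per SNP within each ancestry, rather than across pooled samples, would keep ancestry-specific allele frequencies intact; whether SuShiE does so is not shown here.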
 

diff --git a/sushie/io.py b/sushie/io.py
@@ -174,6 +174,7 @@ def read_data(
 
 def read_triplet(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
     """Read in genotype data in `plink 1 <https://www.cog-genomics.org/plink/1.9/input#bed>`_ format.
+    The `pandas_plink <https://pandas-plink.readthedocs.io/>`_ package is used to read in the plink file.
 
     Args:
         path: The path for plink genotype data (suffix only).
@@ -196,6 +197,8 @@ def read_triplet(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
 
 def read_vcf(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
     """Read in genotype data in `vcf <https://en.wikipedia.org/wiki/Variant_Call_Format>`_ format.
+    The `cyvcf2 <https://brentp.github.io/cyvcf2/>`_ package is used to read in the vcf file.
+    gt_types are used to determine the genotype matrix. If a genotype is UNKNOWN, it will be coded as NA.
 
     Args:
         path: The path for vcf genotype data (full file name). It will count REF allele.
@@ -215,6 +218,7 @@ def read_vcf(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
     for var in vcf:
         # var.ALT is a list of alternative allele
         bim_list.append([var.CHROM, var.ID, var.POS, var.ALT[0], var.REF])
+        var.gt_types = jnp.where(var.gt_types == 3, jnp.nan, var.gt_types)
         tmp_bed = 2 - var.gt_types
         bed_list.append(tmp_bed)
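
The added `jnp.where` line masks unknown calls before `2 - var.gt_types` turns the codes into REF-allele counts; without the mask, a code of 3 would yield `2 - 3 = -1`, which is not a valid dosage. The sketch below shows the same mapping in plain Python. It assumes the file is opened with cyvcf2's `gts012=True`, under which the codes are 0 (HOM_REF), 1 (HET), 2 (HOM_ALT), and 3 (UNKNOWN); under cyvcf2's default setting UNKNOWN is 2 and HOM_ALT is 3, so the check against 3 would not be appropriate. The helper name `ref_allele_counts` is hypothetical, and it writes to a new list instead of reassigning the attribute on the variant record.

```python
import math

# cyvcf2 gt_types codes, assuming the VCF was opened with gts012=True.
HOM_REF, HET, HOM_ALT, UNKNOWN = 0, 1, 2, 3


def ref_allele_counts(gt_types, unknown_code=UNKNOWN):
    """Map gt_types codes to REF-allele counts, using NaN for unknown calls."""
    return [math.nan if g == unknown_code else float(2 - g) for g in gt_types]


counts = ref_allele_counts([HOM_REF, HET, HOM_ALT, UNKNOWN])
print(counts)  # [2.0, 1.0, 0.0, nan]
```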
 
@@ -226,6 +230,7 @@ def read_vcf(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
 
 def read_bgen(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
     """Read in genotype data in `bgen <https://www.well.ox.ac.uk/~gav/bgen_format/>`_ 1.3 format.
+    The `bgen-reader <https://pypi.org/project/bgen-reader/>`_ package is used to read in the bgen file.
 
     Args:
         path: The path for bgen genotype data (full file name).