
Commit

Merge pull request #39 from mancusolab/dev
Fix typos and update documentation, update vcf reading function
zeyunlu committed Apr 24, 2024
2 parents a266425 + 03fba91 commit 9622576
Showing 4 changed files with 49 additions and 14 deletions.
27 changes: 13 additions & 14 deletions README.md
@@ -13,12 +13,11 @@ etc.) across multiple ancestries.
``` diff
- We detest usage of our software or scientific outcome to promote racial discrimination.
```
SuShiE is described in,

SuShiE is described in
> [Improved multi-ancestry fine-mapping identifies cis-regulatory variants underlying molecular traits and disease risk](https://www.medrxiv.org/content/10.1101/2024.04.15.24305836v1).
>
> Zeyun Lu, Xinran Wang, Matthew Carr, Artem Kim, Steven Gazal, Pejman Mohammadi, Lang Wu, Alexander Gusev, James Pirruccello, Linda Kachuri, Nicholas Mancuso.
> medRxiv 2024.04.15.24305836; doi: https://doi.org/10.1101/2024.04.15.24305836.
Check [here](https://mancusolab.github.io/sushie/) for full
documentation.
@@ -89,10 +88,8 @@ You can play it with your own ideas!

## Notes

- SuShiE currently only supports **continuous** phenotype
fine-mapping.
- SuShiE currently only supports fine-mapping on
[autosomes](https://en.wikipedia.org/wiki/Autosome).
- SuShiE currently only supports **continuous** phenotype fine-mapping.
- SuShiE currently only supports fine-mapping on autosomes.
- SuShiE uses [JAX](https://github.com/google/jax) with [Just In
Time](https://jax.readthedocs.io/en/latest/jax-101/02-jitting.html)
compilation to achieve high-speed computation. However, there are
@@ -103,13 +100,14 @@ You can play it with your own ideas!

## Version History

| Version | Description |
| --------- | --------- |
| 0.1 | Initial Release |
| 0.11 | Fix the bug for OLS to compute adjusted r squared. |
| 0.12 | Update io.corr function so that report all the correlation results no matter cs is pruned or not. |
| 0.13 | Add `--keep` command to enable user to specify a file that contains the subjects ID SuShiE will perform on. Add `--ancestry_index` command to enable user to specify a file that contains the ancestry index for fine-mapping. With this, user can input single phenotype, genotype, and covariate file that contains all the subjects across ancestries. Implement padding to increase inference time. Record elbo at each iteration and can access it in the `infer.SuShiEResult` object. The alphas table now outputs the average purity and KL divergence for each `L`. Change `--kl_threshold` to `--divergence`. Add `--maf` command to remove SNPs that less than minor allele frequency threshold within each ancestry. Add `--max_select` command to randomly select maximum number of SNPs to compute purity to avoid unnecessary memory spending. Add a QC function to remove duplicated SNPs. |
| 0.14 | Remove KL-Divergence pruning. Enhance command line appearance and improve the output files contents. Fix small bugs on multivariate KL. |
| Version | Description |
|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0.1 | Initial Release |
| 0.11 | Fix the bug for OLS to compute adjusted r squared. |
| 0.12 | Update io.corr function so that report all the correlation results no matter cs is pruned or not. |
| 0.13 | Add `--keep` command to enable user to specify a file that contains the subjects ID SuShiE will perform on. Add `--ancestry_index` command to enable user to specify a file that contains the ancestry index for fine-mapping. With this, user can input single phenotype, genotype, and covariate file that contains all the subjects across ancestries. Implement padding to increase inference time. Record elbo at each iteration and can access it in the `infer.SuShiEResult` object. The alphas table now outputs the average purity and KL divergence for each `L`. Change `--kl_threshold` to `--divergence`. Add `--maf` command to remove SNPs that less than minor allele frequency threshold within each ancestry. Add `--max_select` command to randomly select maximum number of SNPs to compute purity to avoid unnecessary memory spending. Add a QC function to remove duplicated SNPs. |
| 0.14 | Remove KL-Divergence pruning. Enhance command line appearance and improve the output files contents. Fix small bugs on multivariate KL. |
| 0.15 | Fix several typos; add a sanity check on reading vcf genotype data by assigning gt_types==Unknown as NA; Add preprint information. |

## Support

@@ -139,6 +137,7 @@ Lab](https://www.mancusolab.com/):
- [HAMSTA](https://github.com/tszfungc/hamsta): a Python software to
estimate heritability explained by local ancestry data from
admixture mapping summary statistics.
- [Traceax](https://github.com/tszfungc/traceax): a Python library to perform stochastic trace estimation for linear operators.

------------------------------------------------------------------------

27 changes: 27 additions & 0 deletions docs/index.rst
@@ -4,6 +4,19 @@ SuShiE🍣

SuShiE (Sum of Shared Single Effect) is a Python software to fine-map causal SNPs, compute prediction weights, and infer effect size correlation for molecular data (e.g., mRNA levels and protein levels etc.) across multiple ancestries. **The manuscript is in progress.**

.. code:: diff
- We detest usage of our software or scientific outcome to promote racial discrimination.
SuShiE is described in

.. code::
`Improved multi-ancestry fine-mapping identifies cis-regulatory variants underlying molecular traits and disease risk <https://www.medrxiv.org/content/10.1101/2024.04.15.24305836v1>`_
Zeyun Lu, Xinran Wang, Matthew Carr, Artem Kim, Steven Gazal, Pejman Mohammadi, Lang Wu, Alexander Gusev, James Pirruccello, Linda Kachuri, Nicholas Mancuso
Contents
========

@@ -35,3 +48,17 @@ Contents
Version History <version>
Authors <authors>
License <license>

Other Software
==============

Feel free to use other software developed by `Mancuso
Lab <https://www.mancusolab.com/>`_:

* `MA-FOCUS <https://github.com/mancusolab/ma-focus>`_: a Bayesian
fine-mapping framework using statistics across multiple ancestries to identify the causal genes for complex traits.
* `SuSiE-PCA <https://github.com/mancusolab/susiepca>`_: a scalable Bayesian variable selection technique for sparse principal component analysis
* `twas_sim <https://github.com/mancusolab/twas_sim>`_: a Python software to simulate statistics.
* `FactorGo <https://github.com/mancusolab/factorgo>`_: a scalable variational factor analysis model that learns pleiotropic factors from GWAS summary statistics.
* `HAMSTA <https://github.com/tszfungc/hamsta>`_: a Python software to estimate heritability explained by local ancestry data from admixture mapping summary statistics.
* `Traceax <https://github.com/tszfungc/traceax>`_: a Python library to perform stochastic trace estimation for linear operators.
4 changes: 4 additions & 0 deletions sushie/cli.py
@@ -1101,6 +1101,7 @@ def build_finemap_parser(subp):
" Use 'space' to separate ancestries if more than two.",
" Keep the same ancestry order as phenotype's.",
" SuShiE currently does not take plink 2 format.",
" Data must contain only biallelic variants.",
),
)

@@ -1112,6 +1113,8 @@ def build_finemap_parser(subp):
help=(
"Genotype data in vcf format. Use 'space' to separate ancestries if more than two.",
" Keep the same ancestry order as phenotype's. The software will count the REF allele.",
" If gt_types is UNKNOWN, it will be coded as NA and imputed by allele frequency.",
" Data must contain only biallelic variants.",
),
)

@@ -1123,6 +1126,7 @@ def build_finemap_parser(subp):
help=(
"Genotype data in bgen 1.3 format. Use 'space' to separate ancestries if more than two.",
" Keep the same ancestry order as phenotype's.",
" Data must contain only biallelic variants.",
),
)
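
The help strings above require biallelic data but the hunks shown do not include a check for it. A minimal pre-check sketch follows, assuming each variant exposes its alternative alleles as a list (as `var.ALT` does in the `read_vcf` loop). `Variant`, `is_biallelic`, and `keep_biallelic` are hypothetical names for illustration and are not part of SuShiE.

```python
from typing import List, NamedTuple, Sequence


class Variant(NamedTuple):
    """Hypothetical stand-in for a parsed variant record."""

    ref: str
    alt: List[str]


def is_biallelic(variant: Variant) -> bool:
    # A biallelic variant has exactly one alternative allele.
    return len(variant.alt) == 1


def keep_biallelic(variants: Sequence[Variant]) -> List[Variant]:
    # Drop multiallelic records and records with no ALT allele.
    return [v for v in variants if is_biallelic(v)]


variants = [Variant("A", ["G"]), Variant("C", ["T", "G"]), Variant("T", [])]
kept = keep_biallelic(variants)
```

Filtering before calling the reader is one option; raising an error on the first multiallelic record would be an equally reasonable design.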

5 changes: 5 additions & 0 deletions sushie/io.py
@@ -174,6 +174,7 @@ def read_data(

def read_triplet(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
"""Read in genotype data in `plink 1 <https://www.cog-genomics.org/plink/1.9/input#bed>`_ format.
`pandas_plink <https://pandas-plink.readthedocs.io/>`_ package is used to read in the plink file.
Args:
path: The path for plink genotype data (suffix only).
@@ -196,6 +197,8 @@ def read_triplet(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:

def read_vcf(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
"""Read in genotype data in `vcf <https://en.wikipedia.org/wiki/Variant_Call_Format>`_ format.
`cyvcf2 <https://brentp.github.io/cyvcf2/>`_ package is used to read in the vcf file.
gt_types are used to determine the genotype matrix. If a gt_types value is UNKNOWN, it will be coded as NA.
Args:
path: The path for vcf genotype data (full file name). It will count REF allele.
@@ -215,6 +218,7 @@ def read_vcf(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
for var in vcf:
# var.ALT is a list of alternative allele
bim_list.append([var.CHROM, var.ID, var.POS, var.ALT[0], var.REF])
var.gt_types = jnp.where(var.gt_types == 3, jnp.nan, var.gt_types)
tmp_bed = 2 - var.gt_types
bed_list.append(tmp_bed)
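
The `gt_types` handling in `read_vcf` can be illustrated with a small plain-Python sketch. It assumes the cyvcf2 coding used when a file is opened with `gts012=True` (0 = HOM_REF, 1 = HET, 2 = HOM_ALT, 3 = UNKNOWN), which matches the `== 3` comparison in the hunk. It also assumes that "imputed by allele frequency" means replacing a missing value with the mean REF-allele count observed for that variant. The helper names are hypothetical and this is not the SuShiE implementation.

```python
import math
from typing import List, Sequence

# Assumed code for a missing genotype (cyvcf2 with gts012=True).
UNKNOWN = 3


def ref_allele_counts(gt_types: Sequence[int]) -> List[float]:
    """Convert gt_types codes to REF-allele counts, using NaN for UNKNOWN."""
    return [math.nan if g == UNKNOWN else float(2 - g) for g in gt_types]


def impute_by_mean(counts: Sequence[float]) -> List[float]:
    """Replace NaN entries with the mean of the observed counts for one variant."""
    observed = [c for c in counts if not math.isnan(c)]
    if not observed:
        raise ValueError("no observed genotypes for this variant")
    mean = sum(observed) / len(observed)  # equals 2 * REF allele frequency
    return [mean if math.isnan(c) else c for c in counts]


counts = ref_allele_counts([0, 1, 2, 3])  # last sample is UNKNOWN
imputed = impute_by_mean(counts)
```

Marking UNKNOWN as NaN before computing `2 - gt_types` matters because otherwise a missing call would be turned into a count of `-1`.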

@@ -226,6 +230,7 @@ def read_vcf(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:

def read_bgen(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, Array]:
"""Read in genotype data in `bgen <https://www.well.ox.ac.uk/~gav/bgen_format/>`_ 1.3 format.
`bgen-reader <https://pypi.org/project/bgen-reader/>`_ package is used to read in the bgen file.
Args:
path: The path for bgen genotype data (full file name).