To identify the desired gene modules in WGCNA, we propose gmcNet. gmcNet is a GNN-based clustering algorithm that clusters genes according to both the co-expression topology (genes in the same module should be strongly connected) and the single-level expression (genes in the same module should have similar expression patterns). The key innovation of gmcNet is that it combines the single-expression of each gene with the co-expression of its neighboring genes.
gmcNet requires four inputs to perform unsupervised clustering. Their sizes depend on the number of genes and the number of expression samples.
- Single-expression features of the genes.
- Whole topological overlap matrix, which is created using the topological overlap measure between genes.
- Positive topological overlap matrix, which is created only from gene pairs with a positive correlation coefficient.
- Negative topological overlap matrix, which is created only from gene pairs with a negative correlation coefficient.
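To make the three matrices concrete, here is a minimal pure-Python sketch of how whole, positive, and negative topological overlap matrices could be derived from expression profiles. It uses the unsigned WGCNA-style topological overlap formula. Everything else is an assumption for illustration: the helper names (`pearson`, `adjacency`, `tom`) are hypothetical, `beta` is assumed to act like a soft-thresholding power, and the sign filter is only one plausible reading of "gene pairs with a positive (or negative) correlation coefficient". It is not gmcNet's actual implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def adjacency(expr, beta, sign=None):
    """Adjacency |cor|^beta with a zero diagonal.

    sign=None keeps every pair; sign=+1 or sign=-1 keeps only positively or
    negatively correlated pairs (an assumed filtering rule).
    """
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(expr[i], expr[j])
            if sign is None or r * sign > 0:
                adj[i][j] = adj[j][i] = abs(r) ** beta
    return adj

def tom(adj):
    """Unsigned topological overlap measure computed from an adjacency matrix."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    out = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j] for u in range(n))
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            out[i][j] = out[j][i] = value
    return out

# Three toy genes: the first two rise together, the third falls.
expr = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
whole = tom(adjacency(expr, beta=1))
positive = tom(adjacency(expr, beta=1, sign=+1))
negative = tom(adjacency(expr, beta=1, sign=-1))
```

In this toy case the first two genes share no negative edge with each other, but they still receive a non-zero overlap in the negative matrix because both are anti-correlated with the third gene.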
gmcNet consists of a co-expression pattern recognizer (CEPR) and a module classifier.
- CEPR : Using message passing operations, CEPR generates an embedding feature for each gene, which accounts for its single-expression and the two different co-expressions. The embedding dimension is set in the configuration (see 'CEPR_features').
- Module classifier : Given the CEPR embedding features, the module classifier computes module-assignment probabilities using a multi-layer perceptron (MLP), with one probability for each of the k modules. Each row of the resulting probability matrix holds the module-assignment probabilities of one gene. In other words, a gene belongs to the module whose probability is the maximum value in that gene's row.
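The two stages can be illustrated with a toy pure-Python sketch. It shows one generic message-passing step (each gene's feature plus the TOM-weighted average of its neighbors' features) and the row-wise argmax that turns a probability matrix into module labels. The function names and the aggregation rule are assumptions chosen for illustration. The actual CEPR layers and the MLP are learned models and are not reproduced here.

```python
def message_pass(features, tom):
    """One illustrative message-passing step.

    Each gene keeps its own feature vector and adds the TOM-weighted average
    of the other genes' feature vectors (assumed aggregation rule).
    """
    n = len(features)
    updated = []
    for i in range(n):
        total = sum(tom[i][j] for j in range(n) if j != i)
        agg = [0.0] * len(features[i])
        if total > 0:
            for j in range(n):
                if j == i:
                    continue
                for d, value in enumerate(features[j]):
                    agg[d] += tom[i][j] * value / total
        updated.append([own + nb for own, nb in zip(features[i], agg)])
    return updated

def assign_modules(prob):
    """Return, for each gene (row), the index of its most probable module."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob]

# Toy data: 3 genes with 2 features each, and a 3 x 3 overlap matrix.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tom = [[1.0, 0.5, 0.5],
       [0.5, 1.0, 0.0],
       [0.5, 0.0, 1.0]]
embedded = message_pass(features, tom)

# Toy probability matrix: 2 genes, k = 3 modules.
labels = assign_modules([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
```

Here `labels` holds one module index per gene, which mirrors the rule that a gene belongs to the module with the largest probability in its row.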
Our models were implemented with TensorFlow 2.3 in Python 3.8.6.
The requirements can be installed by running the following command in your shell.
pip install -r [CODE PATH]/requirements.txt
expr : Gene expression data. A text file with a header line, followed by one line per gene, each with one more column than the number of samples. The first column is the gene name and the remaining columns are expression values. An example file is provided in the data folder as sample.txt.
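As a rough check of this layout, the following sketch parses such a file into gene names, sample names, and a list of expression rows. It assumes whitespace-delimited fields and a header whose first cell labels the gene-name column. Neither detail is stated above, so adjust them to match data/sample.txt. The function name `parse_expr` is hypothetical and not part of gmcNet.

```python
def parse_expr(lines):
    """Parse expr text: a header line, then one line per gene (name + values)."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    n_samples = len(header) - 1  # assumes the first header cell labels the gene column
    genes, values = [], []
    for fields in body:
        if len(fields) != n_samples + 1:
            raise ValueError(
                f"gene {fields[0]!r}: expected {n_samples + 1} columns, got {len(fields)}"
            )
        genes.append(fields[0])
        values.append([float(v) for v in fields[1:]])
    return genes, header[1:], values

# Toy input with 2 genes and 3 samples.
text = "gene\ts1\ts2\ts3\nA\t1.0\t2.0\t3.0\nB\t0.5\t0.1\t0.2\n"
genes, samples, values = parse_expr(text.splitlines())
```

For a real file, pass the open file object instead of `text.splitlines()`, for example `parse_expr(open("data/sample.txt"))`.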
TOM (optional) : If you have already created TOMs with the R library WGCNA, you can use them for gmcNet. The three TOMs (whole, positive, negative) required to run gmcNet must be located in one folder and named whole.txt, positive.txt, and negative.txt, respectively. Each TOM file must have one row and one column per gene, and the entry in a given row and column is the topological overlap measure between the corresponding pair of genes. You can find example files in the out/TOMs folder.
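Before passing precomputed TOMs to gmcNet, it can help to confirm that each file is square and symmetric. The sketch below assumes the files contain only whitespace-delimited numbers, with no header or row names. That is a guess about the format, so compare it with the files in out/TOMs first. The function name `read_tom` is hypothetical.

```python
def read_tom(lines, n_genes, tol=1e-8):
    """Read a TOM from text lines and check that it is square and symmetric."""
    matrix = [[float(v) for v in line.split()] for line in lines if line.strip()]
    if len(matrix) != n_genes or any(len(row) != n_genes for row in matrix):
        raise ValueError(f"expected a {n_genes} x {n_genes} matrix")
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
    return matrix

# Toy 2 x 2 matrix standing in for the contents of whole.txt.
whole = read_tom(["1 0.2", "0.2 1"], n_genes=2)
```

The same check can be applied to whole.txt, positive.txt, and negative.txt in turn, using the number of genes in the expr file as `n_genes`.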
Before executing gmcNet, you should set the configuration in main.py.
'betas' : smoothing parameter for (whole, positive, negative) networks
'save_TOM' : save TOM or not in output path
'save_embed' : save embedding features or not in output path
'n_cluster' : number of clusters (k)
'epochs' : training epochs
'lr' : training learning rate
'mp_layers' : number of message passing layers
'CEPR_features' : CEPR embedding dimensions
'lambda' : balancing hyper-parameter
'Lo_thr' : orthogonal threshold
'tune_epoch' : number of initial tuning epochs, which prevent empty modules
'tune_lr' : learning rate for the initial tuning
'device' : GPU device to use. If you do not use a GPU, set this to False
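For orientation, the options above could be collected in a Python dictionary like the one below. This is only a sketch: it assumes the configuration is a plain dict, and every value is a hypothetical placeholder for illustration, not a default or recommended setting from the repository. Check main.py for the actual structure and values.

```python
# Hypothetical placeholder values; these are NOT the defaults shipped in main.py.
config = {
    "betas": (6, 6, 6),    # smoothing parameters for (whole, positive, negative) networks
    "save_TOM": True,      # save the TOMs in the output path
    "save_embed": False,   # save the embedding features in the output path
    "n_cluster": 10,       # number of clusters (k)
    "epochs": 100,         # training epochs
    "lr": 1e-3,            # training learning rate
    "mp_layers": 2,        # number of message passing layers
    "CEPR_features": 32,   # CEPR embedding dimensions
    "lambda": 1.0,         # balancing hyper-parameter
    "Lo_thr": 0.1,         # orthogonal threshold
    "tune_epoch": 10,      # initial tuning epochs, which prevent empty modules
    "tune_lr": 1e-2,       # learning rate for the initial tuning
    "device": False,       # GPU device to use, or False when no GPU is used
}

# The option names listed in this README.
REQUIRED_KEYS = {
    "betas", "save_TOM", "save_embed", "n_cluster", "epochs", "lr", "mp_layers",
    "CEPR_features", "lambda", "Lo_thr", "tune_epoch", "tune_lr", "device",
}

# Simple sanity checks: every option is present and there is one beta per network.
missing = REQUIRED_KEYS - config.keys()
assert not missing, f"missing configuration keys: {sorted(missing)}"
assert len(config["betas"]) == 3
```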
python main.py --expr [expr] --out [out]
- [expr] : expr file path.
- [out] : Path for saving the results.
python main.py --expr [expr] --TOM [TOM] --out [out]
- [expr] : expr file path.
- [TOM] : Path to the TOM folder containing the three different TOM files (whole.txt, positive.txt, negative.txt).
- [out] : Path for saving the results.
python main.py --expr data/sample.txt --out out
python main.py --expr data/sample.txt --TOM out/TOMs --out out