A Unified Python Library for Graph Prompting

sheldonresearch/ProG

🌟ProG: A Unified Python Library for Graph Prompting🌟

🌟ProG🌟 (Prompt Graph) is a library built on PyTorch for single- and multi-task prompting of pre-trained Graph Neural Networks (GNNs). You can use it to run various graph workflows, such as supervised learning, pre-training and prompting, and pre-training and fine-tuning, for node-level and graph-level tasks. The starting point of this library is our KDD23 paper All in One, which won the Best Research Paper Award, a first for Hong Kong and Mainland China.

The ori branch of this repository contains the source code of All in One, with which you can conduct even more kinds of tasks with more flexible graph prompts. Beyond All in One, the main branch of this library now supports more than 5 graph prompt models (e.g. All-in-One, GPPT, GPF Plus, GPF, GraphPrompt) and more than 6 pre-training strategies (e.g. DGI, GraphMAE, EdgePreGPPT, EdgePreGprompt, GraphCL, SimGRACE), and has been tested on more than 15 graph datasets, covering both homophilic and heterophilic graphs from various domains and at different scales. Click here to see the full and latest list of supported components (backbones, pre-training strategies, graph prompts, and datasets).

🌟Acknowledgement


  • 2024/06/08: We used ProG to extensively evaluate various graph prompts and released our analysis report.
  • 2024/05/28: We are happy to announce that we have finished most of the update work for ProG! (This is the main branch of this repository. If you are looking for the original ProG package, go to the ori branch.)
  • 2024/01/01: A major updated version was released!
  • 2023/11/28: We released a comprehensive survey on graph prompt!
  • 2023/11/15: We released a 🦀repository🦀 for a comprehensive collection of research papers, datasets, and readily accessible code implementations.

Installation

PyPI

From ProG 1.0 onwards, you can install ProG from PyPI by running

pip install prompt-graph

Or you can git clone our repository directly.

Environment Setup

Before you begin, please make sure that you have Anaconda or Miniconda installed on your system. This guide assumes that you have a CUDA-enabled GPU.

# Create and activate a new Conda environment named 'ProG'
conda create -n ProG
conda activate ProG

# Install PyTorch with CUDA 11.7 support
# If you use a different CUDA version, please refer to the PyTorch website for the appropriate versions.
conda install numpy
conda install pytorch==2.0.1 pytorch-cuda=11.7 -c pytorch -c nvidia

# Install additional dependencies
pip install torch_geometric pandas torchmetrics Deprecated 

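In addition, you can use our pre-trained GNNs directly, or use our pretrain module to pre-train the GNN you want. To do so, also install torch_cluster:

pip install torch_cluster -f https://data.pyg.org/whl/torch-2.3.0+cu121.html

Replace the torch and CUDA versions in the URL with the ones installed in your environment; the available combinations are listed at https://data.pyg.org/whl/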
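
As a rough aid for picking the right wheel index, the sketch below builds a URL in the same `torch-<version>+<tag>.html` form as the command above. The helper name `pyg_wheel_url` is hypothetical and not part of ProG; the URL pattern is inferred from the example, so please confirm that the resulting page actually exists on https://data.pyg.org/whl/ before using it.

```python
from typing import Optional


def pyg_wheel_url(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build a PyG wheel index URL for a torch/CUDA pair (hypothetical helper).

    torch_version: e.g. "2.3.0" or "2.3.0+cu121" (any local suffix is dropped).
    cuda_version:  e.g. "12.1", or None for a CPU-only build.
    """
    base_version = torch_version.split("+")[0]
    if cuda_version is None:
        tag = "cpu"
    else:
        tag = "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{base_version}+{tag}.html"


if __name__ == "__main__":
    try:
        import torch  # only needed to read the installed versions

        print(pyg_wheel_url(torch.__version__, torch.version.cuda))
    except ImportError:
        # torch is not installed here, so fall back to the README's example pair.
        print(pyg_wheel_url("2.3.0", "12.1"))
```

The printed URL can then be passed to `pip install torch_cluster -f <url>`.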

Quick Start

The Architecture of ProG is shown as follows:

We provide scripts with hyper-parameter settings to reproduce the experimental results.

In the pre-training phase, run pre_train.py with the parameters you want, for example:

python pre_train.py --task Edgepred_Gprompt --dataset_name 'PubMed' --gnn_type 'GCN' --hid_dim 128 --num_layer 2 --epochs 1000 --seed 42 --device 0
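
The downstream examples in this README load checkpoints from paths such as './Experiment/pre_trained_model/Cora/Edgepred_Gprompt.GCN.128hidden_dim.pth'. The sketch below rebuilds such a path from the pre-training arguments. The helper `pretrained_model_path` is hypothetical, and the naming pattern is inferred from those example paths only, so check the files that pre_train.py actually writes in your version.

```python
def pretrained_model_path(dataset_name: str, task: str, gnn_type: str, hid_dim: int,
                          root: str = "./Experiment/pre_trained_model") -> str:
    """Compose a checkpoint path following the pattern seen in this README's examples.

    Assumed pattern: <root>/<dataset>/<task>.<gnn_type>.<hid_dim>hidden_dim.pth
    """
    return f"{root}/{dataset_name}/{task}.{gnn_type}.{hid_dim}hidden_dim.pth"


if __name__ == "__main__":
    # Path that the PubMed pre-training command above would be expected to produce,
    # if the pattern assumption holds.
    print(pretrained_model_path("PubMed", "Edgepred_Gprompt", "GCN", 128))
```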

With Customized Hyperparameters

For downstream tasks, run downstream_task.py with the parameters you want, for example:

python downstream_task.py --pre_train_model_path './Experiment/pre_trained_model/Cora/Edgepred_Gprompt.GCN.128hidden_dim.pth' --task NodeTask --dataset_name 'Cora' --gnn_type 'GCN' --prompt_type 'GPF-plus' --shot_num 1 --hid_dim 128 --num_layer 2  --lr 0.02 --decay 2e-6 --seed 42 --device 0
python downstream_task.py --pre_train_model_path './Experiment/pre_trained_model/BZR/DGI.GCN.128hidden_dim.pth' --task GraphTask --dataset_name 'BZR' --gnn_type 'GCN' --prompt_type 'All-in-one' --shot_num 1 --hid_dim 128 --num_layer 2  --lr 0.02 --decay 2e-6 --seed 42 --device 1
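
The commands above pass --shot_num 1. Assuming this flag means the number of labeled training examples per class (the usual k-shot convention, which this README does not spell out), a few-shot split could be sketched as follows. The function `sample_k_shot` is hypothetical and is not ProG's own sampling code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def sample_k_shot(labels: Sequence[int], shot_num: int, seed: int = 42) -> List[int]:
    """Pick `shot_num` example indices per class, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen: List[int] = []
    for label in sorted(by_class):
        candidates = by_class[label]
        if len(candidates) < shot_num:
            raise ValueError(f"class {label} has fewer than {shot_num} examples")
        chosen.extend(rng.sample(candidates, shot_num))
    return sorted(chosen)


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1, 2, 2, 2]
    print(sample_k_shot(toy_labels, shot_num=1))
```

Fixing the seed keeps the sampled split stable across runs, which matters when only one example per class is available.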

With Optimal Hyperparameters through Random Search

Perform a random search of hyperparameters for the GCN model on the Cora dataset. (NodeTask)

python bench.py --pre_train_model_path './Experiment/pre_trained_model/Cora/Edgepred_Gprompt.GCN.128hidden_dim.pth' --task NodeTask --dataset_name 'Cora' --gnn_type 'GCN' --prompt_type 'GPF-plus' --shot_num 1 --hid_dim 128 --num_layer 2 --seed 42 --device 0
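
To illustrate what a random search over hyperparameters involves, here is a generic sketch. It is not the implementation in bench.py: the search ranges, the parameter names `lr`, `decay`, and `batch_size`, and the `evaluate` callback are all assumptions chosen for the example.

```python
import math
import random
from typing import Callable, Dict, Tuple


def random_search(evaluate: Callable[[Dict[str, float]], float],
                  num_trials: int = 20, seed: int = 42) -> Tuple[Dict[str, float], float]:
    """Sample hyperparameters at random and keep the best-scoring configuration."""
    if num_trials < 1:
        raise ValueError("num_trials must be at least 1")
    rng = random.Random(seed)
    best_params: Dict[str, float] = {}
    best_score = -math.inf
    for _ in range(num_trials):
        params = {
            "lr": 10 ** rng.uniform(-4, -1),      # log-uniform learning rate
            "decay": 10 ** rng.uniform(-6, -3),   # log-uniform weight decay
            "batch_size": rng.choice([32, 64, 128]),
        }
        score = evaluate(params)  # higher is better, e.g. validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params: Dict[str, float]) -> float:
    """Stand-in for a real training run: peaks when lr is close to 1e-2."""
    return -abs(math.log10(params["lr"]) + 2.0)


if __name__ == "__main__":
    print(random_search(toy_objective, num_trials=50))
```

In practice `evaluate` would train and validate a model with the sampled settings, so each trial is as expensive as one downstream run.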
Table of The Following Contents
  1. Supportive List
  2. Pre-train your GNN model
  3. Downstream Tasks
  4. Datasets
  5. Prompt Class
  6. Environment Setup
  7. TODO List

Supportive List

Currently supported graph prompt approaches (continuously updated):

  • [All in One] X. Sun, H. Cheng, J. Li, B. Liu, and J. Guan, “All in One: Multi-Task Prompting for Graph Neural Networks,” KDD, 2023.
  • [GPF Plus] T. Fang, Y. Zhang, Y. Yang, C. Wang, and L. Chen, “Universal Prompt Tuning for Graph Neural Networks,” NeurIPS, 2023.
  • [GraphPrompt] Z. Liu, X. Yu, Y. Fang, et al., “GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks,” The Web Conference, 2023.
  • [GPPT] M. Sun, K. Zhou, X. He, Y. Wang, and X. Wang, “GPPT: Graph Pre-Training and Prompt Tuning to Generalize Graph Neural Networks,” KDD, 2022.
  • [GPF] T. Fang, Y. Zhang, Y. Yang, and C. Wang, “Prompt Tuning for Graph Neural Networks,” arXiv preprint, 2022.

Currently supported graph pre-training strategies (continuously updated):

  • For node-level, we consider DGI and GraphMAE. DGI maximizes the mutual information between node and graph representations to obtain informative embeddings, and GraphMAE learns deep node representations by reconstructing masked features.
  • For edge-level, we introduce EdgePreGPPT and EdgePreGprompt. EdgePreGPPT calculates the dot product as the link probability of node pairs, and EdgePreGprompt samples triplets from label-free graphs to increase the similarity between the contextual subgraphs of linked pairs while decreasing the similarity of unlinked pairs.
  • For graph-level, we include GraphCL and SimGRACE. GraphCL maximizes agreement between different graph augmentations to leverage structural information, and SimGRACE perturbs the graph model's parameter space and narrows the gap between different perturbations of the same graph.
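
As a small illustration of the edge-level idea of scoring a node pair by the dot product of its embeddings, the sketch below turns that score into a probability. Squashing the score with a sigmoid is an assumption made here for the example, and `link_probability` is a hypothetical helper, not the EdgePreGPPT code in ProG.

```python
import math
from typing import Sequence


def link_probability(h_u: Sequence[float], h_v: Sequence[float]) -> float:
    """Score a node pair by the dot product of its embeddings, mapped into (0, 1)."""
    if len(h_u) != len(h_v):
        raise ValueError("embeddings must have the same dimension")
    score = sum(a * b for a, b in zip(h_u, h_v))
    # Numerically stable sigmoid: avoid exponentiating a large positive number.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


if __name__ == "__main__":
    print(link_probability([1.0, 0.5], [0.8, 0.4]))    # similar embeddings
    print(link_probability([1.0, 0.5], [-0.8, -0.4]))  # opposed embeddings
```

Pairs with aligned embeddings score above 0.5 and opposed ones below it, which is the signal an edge-prediction pre-training objective can learn from.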

Currently supported graph backbone models (continuously updated):

  • Graph Convolutional Network (GCN), GraphSAGE, GAT, and Graph Transformer (GT).

Beyond the above graph backbones, you can also seamlessly integrate nearly all graph models implemented by PyG.

Click here to see more detailed information on these graph prompts, pre-training strategies, and graph backbones.

Pre-train your GNN model

We have designed several pre-training classes (e.g. Edgepred_GPPT, Edgepred_Gprompt, GraphCL, SimGRACE) in the pretrain module. You can pre-train a model by running pre_train.py with the parameters you want, or simply unzip Experiment.zip to get models that we have already pre-trained on our datasets.

unzip Experiment.zip

import prompt_graph as ProG
# `ProG` above is only an alias, so submodules must be imported from `prompt_graph`.
from prompt_graph.pretrain import Edgepred_GPPT, Edgepred_Gprompt, GraphCL, SimGRACE, NodePrePrompt, GraphPrePrompt, DGI, GraphMAE
from prompt_graph.utils import seed_everything
from prompt_graph.utils import mkdir, get_args
from prompt_graph.data import load4node, load4graph
args = get_args()
seed_everything(args.seed)


if args.task == 'SimGRACE':
    pt = SimGRACE(dataset_name = args.dataset_name, gnn_type = args.gnn_type, hid_dim = args.hid_dim, gln = args.num_layer, num_epoch=args.epochs, device=args.device)
if args.task == 'GraphCL':
    pt = GraphCL(dataset_name = args.dataset_name, gnn_type = args.gnn_type, hid_dim = args.hid_dim, gln = args.num_layer, num_epoch=args.epochs, device=args.device)
if args.task == 'Edgepred_GPPT':
    pt = Edgepred_GPPT(dataset_name = args.dataset_name, gnn_type = args.gnn_type, hid_dim = args.hid_dim, gln = args.num_layer, num_epoch=args.epochs, device=args.device)
if args.task == 'Edgepred_Gprompt':
    pt = Edgepred_Gprompt(dataset_name = args.dataset_name, gnn_type = args.gnn_type, hid_dim = args.hid_dim, gln = args.num_layer, num_epoch=args.epochs, device=args.device)
if args.task == 'DGI':
    pt = DGI(dataset_name = args.dataset_name, gnn_type = args.gnn_type, hid_dim = args.hid_dim, gln = args.num_layer, num_epoch=args.epochs, device=args.device)
if args.task == 'NodeMultiGprompt':
    nonlinearity = 'prelu'
    pt = NodePrePrompt(args.dataset_name, args.hid_dim, nonlinearity, 0.9, 0.9, 0.1, 0.001, 1, 0.3, args.device)
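if args.task == 'GraphMultiGprompt':
    nonlinearity = 'prelu'
    # graph_list, input_dim, and out_dim are not defined in this snippet; load them for your dataset before this call.
    pt = GraphPrePrompt(graph_list, input_dim, out_dim, args.dataset_name, args.hid_dim, nonlinearity, 0.9, 0.9, 0.1, 1, 0.3, 0.1, args.device)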
if args.task == 'GraphMAE':
    pt = GraphMAE(dataset_name = args.dataset_name, gnn_type = args.gnn_type, hid_dim = args.hid_dim, gln = args.num_layer, num_epoch=args.epochs, device=args.device,
                  mask_rate=0.75, drop_edge_rate=0.0, replace_rate=0.1, loss_fn='sce', alpha_l=2)
pt.pretrain()
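
The GraphMAE call above passes loss_fn='sce' and alpha_l=2. In the GraphMAE paper, SCE is the scaled cosine error, (1 - cos(x, z))^γ averaged over the masked nodes. The sketch below computes that quantity in plain Python, assuming that `alpha_l` plays the role of the exponent γ. It is an illustration of the formula, not ProG's implementation.

```python
import math
from typing import Sequence


def sce_loss(original: Sequence[Sequence[float]],
             reconstructed: Sequence[Sequence[float]],
             alpha: float = 2.0, eps: float = 1e-12) -> float:
    """Scaled cosine error: mean of (1 - cosine similarity) ** alpha over rows."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and have the same number of rows")
    total = 0.0
    for x, z in zip(original, reconstructed):
        dot = sum(a * b for a, b in zip(x, z))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in z))
        # Clamp to [-1, 1] so rounding error cannot push (1 - cos) below zero.
        cos = max(-1.0, min(1.0, dot / max(norm, eps)))
        total += (1.0 - cos) ** alpha
    return total / len(original)


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0]]
    print(sce_loss(features, [[2.0, 0.0], [0.0, 3.0]]))  # same directions
    print(sce_loss(features, [[0.0, 1.0], [1.0, 0.0]]))  # orthogonal directions
```

Because the error depends only on direction, rescaling a reconstruction does not change the loss, and an exponent above 1 down-weights rows that are already well reconstructed.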

Load Data

Before running a downstream task, we need to load the necessary data. For some prompt types, we also need to call load_induced_graph and pass its output to our tasker as input.

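import os
import pickle

# split_induced_graphs is assumed to come from the ProG library;
# import it from the module where your installed version exposes it.

def load_induced_graph(dataset_name, data, device):
    """Load the cached induced graphs for a dataset, generating them on first use."""
    folder_path = './Experiment/induced_graph/' + dataset_name
    os.makedirs(folder_path, exist_ok=True)

    file_path = folder_path + '/induced_graph_min100_max300.pkl'
    if os.path.exists(file_path):
        # Reuse the induced graphs cached by a previous run.
        with open(file_path, 'rb') as f:
            print('loading induced graph...')
            graphs_list = pickle.load(f)
            print('Done!!!')
    else:
        # First run: generate the induced graphs, which writes the cache file, then load it.
        print('Begin split_induced_graphs.')
        split_induced_graphs(data, folder_path, device, smallest_size=100, largest_size=300)
        with open(file_path, 'rb') as f:
            graphs_list = pickle.load(f)
    # Move every induced graph to the target device.
    graphs_list = [graph.to(device) for graph in graphs_list]
    return graphs_list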


args = get_args()
seed_everything(args.seed)

print('dataset_name', args.dataset_name)
if args.task == 'NodeTask':
    data, input_dim, output_dim = load4node(args.dataset_name)   
    data = data.to(args.device)
    if args.prompt_type in ['Gprompt', 'All-in-one', 'GPF', 'GPF-plus']:
        graphs_list = load_induced_graph(args.dataset_name, data, args.device) 
    else:
        graphs_list = None 
         

if args.task == 'GraphTask':
    input_dim, output_dim, dataset = load4graph(args.dataset_name)
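
The load_induced_graph function above follows a common cache-or-compute pattern: load a pickle if it exists, otherwise build the data and store it. A generic, self-contained sketch of that pattern is shown below. `load_or_build` and the toy `expensive_build` function are hypothetical names for illustration and are not part of ProG.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def load_or_build(file_path: str, build: Callable[[], Any]) -> Any:
    """Return the object cached at file_path, building and caching it if missing."""
    if os.path.exists(file_path):
        with open(file_path, "rb") as f:
            return pickle.load(f)
    result = build()
    # Create the parent folder on first use, then write the cache file.
    os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
    with open(file_path, "wb") as f:
        pickle.dump(result, f)
    return result


def expensive_build() -> list:
    """Stand-in for a slow preprocessing step such as splitting induced graphs."""
    print("building...")
    return [n * n for n in range(5)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        cache_file = os.path.join(tmp_dir, "induced_graph", "toy.pkl")
        first = load_or_build(cache_file, expensive_build)   # builds and caches
        second = load_or_build(cache_file, expensive_build)  # loads from disk
        print(first == second)
```

Only unpickle files that you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.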

Downstream Tasks

In downstream_task.py, we designed two tasks (node classification and graph classification). Here are some examples.

import prompt_graph as ProG
# `ProG` above is only an alias, so submodules must be imported from `prompt_graph`.
from prompt_graph.tasker import NodeTask, LinkTask, GraphTask

if args.task == 'GraphTask':
    input_dim, output_dim, dataset = load4graph(args.dataset_name)

if args.task == 'NodeTask':
    tasker = NodeTask(pre_train_model_path = args.pre_train_model_path, 
                    dataset_name = args.dataset_name, num_layer = args.num_layer,
                    gnn_type = args.gnn_type, hid_dim = args.hid_dim, prompt_type = args.prompt_type,
                    epochs = args.epochs, shot_num = args.shot_num, device=args.device, lr = args.lr, wd = args.decay,
                    batch_size = args.batch_size, data = data, input_dim = input_dim, output_dim = output_dim, graphs_list = graphs_list)


if args.task == 'GraphTask':
    tasker = GraphTask(pre_train_model_path = args.pre_train_model_path, 
                    dataset_name = args.dataset_name, num_layer = args.num_layer, gnn_type = args.gnn_type, hid_dim = args.hid_dim, prompt_type = args.prompt_type, epochs = args.epochs,
                    shot_num = args.shot_num, device=args.device, lr = args.lr, wd = args.decay,
                    batch_size = args.batch_size, dataset = dataset, input_dim = input_dim, output_dim = output_dim)

_, test_acc, std_test_acc, f1, std_f1, roc, std_roc, _, _= tasker.run()
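
The call above unpacks nine values from tasker.run(). To make the results easier to report, the sketch below names the six values that the example uses and formats them as mean and standard deviation. The helpers `summarize_run` and `format_summary` are hypothetical, the tuple order is taken from the unpacking line above, and the three ignored positions are left uninterpreted.

```python
from typing import Dict, Sequence, Tuple


def summarize_run(results: Sequence[float]) -> Dict[str, Tuple[float, float]]:
    """Map a 9-value result tuple to named (mean, std) pairs."""
    if len(results) != 9:
        raise ValueError(f"expected 9 values, got {len(results)}")
    _, test_acc, std_test_acc, f1, std_f1, roc, std_roc, _, _ = results
    return {
        "test_acc": (test_acc, std_test_acc),
        "f1": (f1, std_f1),
        "roc": (roc, std_roc),
    }


def format_summary(summary: Dict[str, Tuple[float, float]]) -> str:
    """Render each metric as 'name: mean +/- std' with four decimals."""
    return ", ".join(f"{name}: {mean:.4f} +/- {std:.4f}"
                     for name, (mean, std) in summary.items())


if __name__ == "__main__":
    # Made-up numbers purely to show the output format, not real results.
    fake_results = (0.0, 0.8, 0.01, 0.75, 0.02, 0.9, 0.03, 0.0, 0.0)
    print(format_summary(summarize_run(fake_results)))
```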

Kindly note that the comparison uses the same pre-trained .pth checkpoint. The absolute performance values do not mean much, because the final results may vary with the pre-training state. It is more informative to look at the relative performance compared with other pre-training paradigms.

Bench Random Search

In our bench

Datasets

| Dataset | Graphs | Avg. nodes | Avg. edges | Features | Node classes | Task (N / G) | Category |
|---|---|---|---|---|---|---|---|
| Cora | 1 | 2,708 | 5,429 | 1,433 | 7 | N | Homophilic |
| Pubmed | 1 | 19,717 | 88,648 | 500 | 3 | N | Homophilic |
| CiteSeer | 1 | 3,327 | 9,104 | 3,703 | 6 | N | Homophilic |
| Actor | 1 | 7,600 | 30,019 | 932 | 5 | N | Heterophilic |
| Wisconsin | 1 | 251 | 515 | 1,703 | 5 | N | Heterophilic |
| Texas | 1 | 183 | 325 | 1,703 | 5 | N | Heterophilic |
| ogbn-arxiv | 1 | 169,343 | 1,166,243 | 128 | 40 | N | Homophilic & Large scale |

| Dataset | Graphs | Avg. nodes | Avg. edges | Features | Graph classes | Task (N / G) | Domain |
|---|---|---|---|---|---|---|---|
| MUTAG | 188 | 17.9 | 19.8 | 7 | 2 | G | small molecule |
| IMDB-BINARY | 1,000 | 19.8 | 96.53 | 0 | 2 | G | social network |
| COLLAB | 5,000 | 74.5 | 2457.8 | 0 | 3 | G | social network |
| PROTEINS | 1,113 | 39.1 | 72.8 | 3 | 2 | G | proteins |
| ENZYMES | 600 | 32.6 | 62.1 | 18 | 6 | G | proteins |
| DD | 1,178 | 284.1 | 715.7 | 89 | 2 | G | proteins |
| COX2 | 467 | 41.2 | 43.5 | 3 | 2 | G | small molecule |
| BZR | 405 | 35.8 | 38.4 | 3 | 2 | G | small molecule |

TODO List

Note: Current experimental datasets: Node/Edge: Cora/Citeseer/Pubmed; Graph: MUTAG

  • Write a comprehensive usage document (refer to PyG).
  • Write a tutorial and polish the data code so that readers can more easily work with their own data. That is: (1) provide a demo/tutorial showing how to handle data; (2) polish the data code to make it more robust, reliable, and readable.
  • Pre_train: implement InfoGraph, ContextPred, AttrMasking, GraphLoG, and JOAO.
  • Add Prompt: prodigy (NeurIPS'2023 Spotlight).
  • Induced graph: (1) find a better way to generate induced graphs; (2) simplify the 3 types of generate-func.
  • Support deep GNN layers by adding the DeepGCNLayer feature.

🌹Please Cite Our Work If Helpful:

Thanks! / 谢谢! / ありがとう! / merci! / 감사! / Danke! / спасибо! / gracias! ...

@inproceedings{sun2023all,
  title={All in One: Multi-Task Prompting for Graph Neural Networks},
  author={Sun, Xiangguo and Cheng, Hong and Li, Jia and Liu, Bo and Guan, Jihong},
  booktitle={Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery \& data mining (KDD'23)},
  year={2023},
  pages = {2120--2131},
  location = {Long Beach, CA, USA},
  isbn = {9798400701030},
  url = {https://doi.org/10.1145/3580305.3599256},
  doi = {10.1145/3580305.3599256}
}

@article{zi2024prog,
      title={ProG: A Graph Prompt Learning Benchmark}, 
      author={Chenyi Zi and Haihong Zhao and Xiangguo Sun and Yiqing Lin and Hong Cheng and Jia Li},
      year={2024},
      journal = {arXiv:2406.05346},
      eprint={2406.05346},
      archivePrefix={arXiv}
}


@article{sun2023graph,
  title = {Graph Prompt Learning: A Comprehensive Survey and Beyond},
  author = {Sun, Xiangguo and Zhang, Jiawen and Wu, Xixi and Cheng, Hong and Xiong, Yun and Li, Jia},
  year = {2023},
  journal = {arXiv:2311.16534},
  eprint = {2311.16534},
  archiveprefix = {arxiv}
}


@inproceedings{zhao2024all,
      title={All in One and One for All: A Simple yet Effective Method towards Cross-domain Graph Pretraining}, 
      author={Haihong Zhao and Aochuan Chen and Xiangguo Sun and Hong Cheng and Jia Li},
      year={2024},
      booktitle={Proceedings of the 27th ACM SIGKDD international conference on knowledge discovery \& data mining (KDD'24)}
}


@inproceedings{gao2024protein,
  title={Protein Multimer Structure Prediction via {PPI}-guided Prompt Learning},
  author={Ziqi Gao and Xiangguo Sun and Zijing Liu and Yu Li and Hong Cheng and Jia Li},
  booktitle={The Twelfth International Conference on Learning Representations (ICLR)},
  year={2024},
  url={https://openreview.net/forum?id=OHpvivXrQr}
}


@article{chen2024prompt,
      title={Prompt Learning on Temporal Interaction Graphs}, 
      author={Xi Chen and Siwei Zhang and Yun Xiong and Xixi Wu and Jiawei Zhang and Xiangguo Sun and Yao Zhang and Yinglong Zhao and Yulin Kang},
      year={2024},
      eprint={2402.06326},
      archivePrefix={arXiv},
      journal = {arXiv:2402.06326}
}

@article{li2024survey,
      title={A Survey of Graph Meets Large Language Model: Progress and Future Directions}, 
      author={Yuhan Li and Zhixun Li and Peisong Wang and Jia Li and Xiangguo Sun and Hong Cheng and Jeffrey Xu Yu},
      year={2024},
      eprint={2311.12399},
      archivePrefix={arXiv},
      journal = {arXiv:2311.12399}
}

@article{wang2024ddiprompt,
  title={Advanced Drug Interaction Event Prediction},
  author={Wang, Yingying and Xiong, Yun and Wu, Xixi and Sun, Xiangguo and Zhang, Jiawei},
  journal={arXiv preprint arXiv:2402.11472},
  year={2024}
}

Media Coverage

Media Reports

Online Discussion

Other research papers released by us


Call for Contributors!

Once you are invited as a contributor, you will be asked to follow these steps:

  • step 1. Create a temp branch (e.g. xgTemp) from the main branch (the latest branch).
  • step 2. Fetch origin/xgTemp to your local xgTemp, and make your changes (e.g. via PyCharm).
  • step 3. Push your changes from your local xgTemp to your GitHub remote branch origin/xgTemp.
  • step 4. Open a pull request to merge your branch into main.

When you have finished these steps, I will get a notification and approve merging your branch into main. After the merge, I will delete your branch, so next time you will repeat the steps above.

A widely tested main branch will then be merged into the stable branch, and a new version will be released based on the stable branch.