Zhuofeng-Li/TEG-Benchmark


Why TEGs instead of TAGs?

Textual-Edge Graphs (TEGs) incorporate textual content on both nodes and edges, unlike Text-Attributed Graphs (TAGs), which carry textual information only on nodes. Edge texts are crucial for understanding document meanings and semantic relationships. For instance, as shown below, to understand the knowledge "Planck endorsed the uncertainty and probabilistic nature of quantum mechanics," the text on the citation edge (Book D - Paper E) is essential. It reveals the comprehensive connections and influences among scholarly works, enabling a deeper analysis of document semantics and knowledge networks.

[Figure: an example textual-edge graph in which the citation edge text between Book D and Paper E conveys Planck's endorsement of quantum mechanics]

Overview

Textual-Edge Graphs Datasets and Benchmark (TEG-DB) is a comprehensive and diverse collection of benchmark textual-edge datasets with rich textual descriptions on both nodes and edges, together with data loaders and performance benchmarks for various baseline models, including pre-trained language models (PLMs), graph neural networks (GNNs), and their combinations. This repository aims to facilitate research on textual-edge graphs by providing standardized data formats and easy-to-use tools for model evaluation and comparison.

Features

  • Unified Data Representation: All TEG datasets are represented in a unified format, which makes it easy to extend the benchmark with new datasets (see the sketch after this list).
  • Highly Efficient Pipeline: TEG-Benchmark is tightly integrated with PyTorch Geometric (PyG), leveraging its tools and functionality, so the code stays concise. Specifically, for each paradigm we provide a small .py file that gathers all relevant models and a .sh script that runs all baselines in one command.
  • Comprehensive Benchmark and Analysis: We conduct extensive benchmark experiments and a comprehensive analysis of TEG-based methods, examining the impact of different models, the effect of embeddings generated by Pre-trained Language Models (PLMs) of various scales, and the influence of datasets from different domains. Dataset statistics are reported in our paper.
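
As a quick illustration, such a unified TEG can be held in a single PyG Data object that carries both the raw texts and their PLM embeddings. This is a minimal sketch; the field names beyond x, edge_index, and edge_attr are assumptions, not the benchmark's exact schema.

# Illustrative unified TEG representation as a PyG Data object
# (node_text/edge_text field names are assumptions, not the exact schema).
import torch
from torch_geometric.data import Data

data = Data(
    x=torch.randn(4, 768),                            # PLM embeddings of node texts
    edge_index=torch.tensor([[0, 1, 2], [1, 2, 3]]),  # source/target node indices
    edge_attr=torch.randn(3, 768),                    # PLM embeddings of edge texts
)
data.node_text = ["Book A", "Paper B", "Book D", "Paper E"]     # raw node texts
data.edge_text = ["cites ...", "reviews ...", "endorses ..."]   # raw edge texts
print(data)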

Datasets

Please visit Huggingface TEG-Benchmark to find the TEG datasets we have uploaded!

We have constructed 9 comprehensive and representative TEG datasets (and we will continue to expand the collection). These datasets cover domains including Book Recommendation, E-commerce, Academic, and Social networks, and they range in size from small to large. Each dataset contains rich raw text on both nodes and edges, providing a diverse range of information for analysis and modeling.

TEG-DB is an ongoing effort, and we are planning to increase our coverage in the future.

Our experiments

Please see the experimental results and analysis in our paper.

Star and Cite

Please star our repo 🌟 and cite our paper if you find them useful. Feel free to email us ([email protected]) if you have any questions.

@misc{li2024tegdb,
      title={TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs},
      author={Zhuofeng Li and Zixing Gou and Xiangnan Zhang and Zhongyuan Liu and Sirui Li and Yuntong Hu and Chen Ling and Zheng Zhang and Liang Zhao},
      year={2024},
      eprint={2406.10310},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Package Usage

Requirements

  • pyg=2.5.2

You can install the dependencies in one step:

conda env create -f environment.yml

More details about folders

The TEG folder stores the data-preprocessing code, which outputs data in PyG Data format. The example folder houses the training code for all methods. Within it, the linkproppred and nodeproppred subfolders hold the edge-level and node-level tasks, respectively. At the next level of directories, the training code is organized into folders named after the different domain datasets.

Below, we take the children dataset in the goodreads folder as an example to show how to use our benchmark.

Datasets setup

You can go to the Huggingface TEG-Benchmark page to find the datasets we upload. In each dataset folder, you will find the .json files in the raw folder and the .npy files (text embeddings we extracted with PLMs) in the emb folder. Please copy these files directly into the goodreads/children folder:

cd example/linkproppred/goodreads/children

cd raw

# copy `.json` files to `raw`

cd ../emb

# copy `.npy` files to `emb`
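
Alternatively, the files can be fetched programmatically. Below is a minimal sketch using huggingface_hub's snapshot_download; the repo_id and the on-disk layout here are assumptions, so adapt them to the actual Huggingface TEG-Benchmark page.

# Sketch: download a dataset snapshot from the Huggingface Hub
# (repo_id and paths are assumptions; check the TEG-Benchmark page).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ZhuofengLi/TEG-Datasets",   # hypothetical repo id
    repo_type="dataset",
    allow_patterns=["goodreads_children/*"],
    local_dir="example/linkproppred/goodreads/children",
)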

GNN for link prediction

cd example/linkproppred/goodreads

# Run the edge_aware_gnn.py script
python edge_aware_gnn.py --data_type children --emb_type Bert --model_type GraphTransformer

# Run all baseline methods
# bash run_all.sh
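
For reference, here is a minimal sketch of how an edge-aware GNN can consume precomputed node and edge text embeddings for link prediction, using PyG's TransformerConv (which injects edge features via edge_dim). It is an illustration only, not the repository's exact model; the dimensions, dot-product decoder, and random stand-in tensors are assumptions.

# Illustrative edge-aware link prediction (not the repo's exact edge_aware_gnn.py).
import torch
import torch.nn.functional as F
from torch_geometric.nn import TransformerConv

class EdgeAwareGNN(torch.nn.Module):
    def __init__(self, in_dim, edge_dim, hidden_dim):
        super().__init__()
        # edge_dim lets the conv mix edge text embeddings into message passing.
        self.conv1 = TransformerConv(in_dim, hidden_dim, edge_dim=edge_dim)
        self.conv2 = TransformerConv(hidden_dim, hidden_dim, edge_dim=edge_dim)

    def forward(self, x, edge_index, edge_attr):
        h = F.relu(self.conv1(x, edge_index, edge_attr))
        return self.conv2(h, edge_index, edge_attr)

    def score(self, h, pairs):
        # Dot-product decoder over candidate (src, dst) pairs.
        return (h[pairs[0]] * h[pairs[1]]).sum(dim=-1)

x = torch.randn(100, 768)               # node text embeddings (e.g., from BERT)
edge_index = torch.randint(0, 100, (2, 400))
edge_attr = torch.randn(400, 768)       # edge text embeddings
model = EdgeAwareGNN(768, 768, 128)
h = model(x, edge_index, edge_attr)
loss = F.binary_cross_entropy_with_logits(
    model.score(h, edge_index), torch.ones(edge_index.size(1))
)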

GNN for node classification

Copy the children dataset and embeddings into the example/nodeproppred/goodreads/children directory, as we did before (the same dataset and embeddings used for link prediction are also used for node classification).

cd example/nodeproppred/goodreads

# Run the edge_aware_gnn.py script
python edge_aware_gnn.py --data_type children --emb_type Bert --model_type GraphTransformer

# Run all baseline methods
# bash run_all.sh
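
Analogously for node classification, a linear head over the edge-aware node representations yields class logits. A minimal sketch follows; the class count and hidden size are assumptions.

# Illustrative node-classification head over edge-aware node embeddings.
import torch
import torch.nn.functional as F

num_classes = 10                         # assumed label count for the dataset
classifier = torch.nn.Linear(128, num_classes)

h = torch.randn(100, 128)                # node embeddings from an edge-aware GNN
logits = classifier(h)
labels = torch.randint(0, num_classes, (100,))
loss = F.cross_entropy(logits, labels)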

Here are explanations of some important args:

--data_type: the name of the dataset (e.g., children)
--emb_type: the type of PLM embedding to use (e.g., Bert)
--model_type: the GNN model to train (e.g., GraphTransformer)

LLM for link prediction

cd example/linkproppred/llm

# Run the gpt.py script
python gpt.py --data_type children --GPT_type gpt-4 
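
For intuition, here is a minimal sketch of what an LLM link-prediction query could look like with the OpenAI Python client; the prompt wording and the helper function are assumptions, not the repository's actual gpt.py.

# Illustrative LLM link-prediction query (not the repository's actual gpt.py).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_link(src_text: str, dst_text: str) -> str:
    # Ask the model whether an edge is likely between two described nodes.
    prompt = (
        "Given the descriptions of two books, answer 'yes' or 'no': "
        "is a reader of the first likely to also read the second?\n"
        f"Book A: {src_text}\nBook B: {dst_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content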

LLM for node classification

cd example/nodeproppred/llm

# Run the gpt.py script
python gpt.py --data_type children --GPT_type gpt-4  --predict_node_number 11

Reference

Please read the following materials carefully to set up your dataset!
