A project implementing more realistic evaluation scenarios for community models on malicious content detection, and meta-learning GNNs for better downstream adaptation.


A (More) Realistic Evaluation Setup for Generalisation of Community Models on Malicious Content Detection

Ivo Verhoeven†, Pushkar Mishra‡, Rahel Beloch†, Helen Yannakoudakis§ and Ekaterina Shutova†

† ILLC, University of Amsterdam, ‡ Meta AI, London, § Dept. of Informatics, King’s College London

arXiv Link

This is an anonymized version of the codebase accompanying our paper, without any references to users or content from the datasets used. As we are bound by the Twitter API terms-of-service, we cannot re-release these datasets.

The environment files used can be found under ./env_cpu.yaml and ./env_gpu.yaml. GPU acceleration is strongly recommended.
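
Assuming these *.yaml files are conda environment specifications (the package manager is not stated here), an environment can be created as follows:

```bash
# Create the GPU environment; swap in env_cpu.yaml on CPU-only machines.
conda env create -f env_gpu.yaml

# The environment name below is a placeholder; check the `name:` field
# at the top of the yaml file for the actual value.
conda activate meta-learning-gnns
```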

Four Python scripts can be found in ./main (namely, preprocess.py, train.py, evaluate.py and transfer.py). Together they cover the full modelling pipeline: preprocessing the data, training models, evaluating those models on the training datasets, and finally evaluating them under transfer to a new dataset. Each script has an accompanying Hydra config file; see ./main/config/*.yaml.
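
As a rough sketch of how the four scripts chain together (the `dataset` override below is an assumed key; check ./main/config/*.yaml for the actual option names):

```bash
cd main

# 1. Preprocess raw adjacency lists into meta-learning episode graphs.
#    (The `dataset` key is an assumption; see the preprocess config.)
python -u preprocess.py dataset=gossipcop

# 2. Pre-train a model on the processed episodes.
python -u train.py structure=episodic_khop learning_algorithm=protomaml

# 3. Evaluate the pre-trained checkpoint on the held-out splits.
python -u evaluate.py checkpoint_dir="meta-gnn" checkpoint_name=protomaml

# 4. Transfer the checkpoint to an auxiliary dataset.
python -u transfer.py checkpoint_dir="meta-gnn" checkpoint_name=protomaml
```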

Three example SLURM job files can be found under ./process_datasets.job, ./train_protomaml.job and ./transfer_twitter_sweep.job. These illustrate how one might interface with the Python scripts mentioned above; they are the files actually used to run our experiments on Snellius, the Dutch national supercomputer.
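
On a SLURM cluster, these job files would be submitted with sbatch, for instance:

```bash
# Submit the jobs in pipeline order; the resource requests inside the
# .job files target Snellius and will likely need adjusting elsewhere.
sbatch process_datasets.job
sbatch train_protomaml.job
sbatch transfer_twitter_sweep.job
```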

Please direct your questions to: [email protected]

Structure

```txt
/data/
    various empty folders where the raw data is supposed to be stored
/job_parameters/
    some text files with hyperparameters for the SLURM jobs
/main/
    ├── /config/
    │       config files needed for Hydra processing, replacing the CLI
    ├── /data/
    │       various empty folders where the processed data should be stored
    ├── /data_loading/
    │       code for sampling subgraphs and episodes
    ├── /data_prep/
    │       code taking raw datasets and forming social media graphs
    ├── /models/
    │       PyTorch models
    ├── /utils/
    │       various utility functions
    ├── preprocess.py
    │       script taking adjacency lists to meta-learning episode graphs
    ├── train.py
    │       script for pre-training models
    ├── evaluate.py
    │       script for evaluating models after pre-training
    └── transfer.py
            script for transferring to auxiliary datasets
/results/
    results stored for various model runs
```

Data

Again, we cannot release the datasets. Instead, we point the user to the repositories of the original datasets, and describe the structure necessary to reconstruct the graphs.

Since we avoid user modelling, it should be possible to reconstruct the graphs using only the edge lists and article content, neither of which is confidential. Note, however, that some users and articles have been moderated out, making exact replication of our results impossible.

GossipCop

Citation: Shu, K., Mahudeswaran, D., Wang, S., Lee, D., & Liu, H. (2020). FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data, 8(3), 171-188.

GitHub Repo: FakeNewsNet

Dataset structure

```txt
/fake/
    └── /gossipcop-$DOCID/'news content.json'
/real/
    └── /gossipcop-$DOCID/'news content.json'
/retweets/
    └── gossipcop-$USERID.csv
/tweets/
    └── gossipcop-$USERID.csv
/user_followers/
    └── $USERID.json
/user_following/
    └── $USERID.json
```

twitterHateSpeech

Citation: Waseem, Z., & Hovy, D. (2016, June). Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop (pp. 88-93).

GitHub Repo: Hate Speech Twitter annotations

Dataset structure

```txt
├── authors.txt
└── twitter_data_waseem_hovy.csv
```

CoAID

Citation: Cui, L., & Lee, D. (2020). CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.

GitHub Repo: CoAID

Dataset structure

```txt
/main/
    ├── /05-01-2020/
    │       ├── NewsFakeCOVID-19.csv
    │       ├── NewsFakeCOVID-19_tweets.csv
    │       ├── NewsFakeCOVID-19_replies.csv
    │       ├── NewsRealCOVID-19.csv
    │       ├── NewsRealCOVID-19_tweets.csv
    │       └── NewsRealCOVID-19_replies.csv
    ├── /07-01-2020/
    │       └── Idem
    ├── /09-01-2020/
    │       └── Idem
    └── /11-01-2020/
            └── Idem
/retweets/
    └── $USERID.csv
/tweets/
    └── $USERID.csv
/user_followers/
    └── $USERID.json
/user_following/
    └── $USERID.json
```

Running Code

We use Hydra as a configuration system. All scripts in /main/ can be run from the command line using Hydra's override syntax. For example,

python -u evaluate.py \
    fold=$FOLD \
    data.processed_data_dir=$DATA_DIR \
    structure=episodic_khop \
    learning_algorithm=protomaml \
    ++learning_algorithm.n_inner_updates=10 \
    ++learning_algorithm.lr_inner=5.0e-3 \
    ++learning_algorithm.head_lr_inner=1.0e-2 \
    ++learning_algorithm.reset_classifier=true \
    ++optimizer.lr=5.0e-4 \
    ++optimizer.weight_decay=5.0e-2 \
    ++model.hid_dim=256 \
    ++model.fc_dim=64 \
    ++model.n_heads=3 \
    ++model.node_mask_p=0.10 \
    ++model.dropout=0.50 \
    ++model.attn_dropout=0.10 \
    ++callbacks.early_stopping.metric='val/mcc' \
    ++callbacks.early_stopping.mode=max \
    use_train=false \
    use_val=true \
    use_test=true \
    checkpoint_dir="meta-gnn" \
    checkpoint_name=protomaml

would evaluate a pre-trained ProtoMAML model on an accompanying dataset. See the three *.job files for example SLURM bash scripts.
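
Hydra's multirun mode can express hyperparameter sweeps like the one in ./transfer_twitter_sweep.job. A minimal sketch is given below; the swept values are illustrative, not those used in the paper:

```bash
# --multirun (or -m) launches one run per combination of the
# comma-separated values below: 2 x 2 = 4 runs in total.
python -u train.py --multirun \
    learning_algorithm=protomaml \
    ++learning_algorithm.n_inner_updates=5,10 \
    ++optimizer.lr=5.0e-4,1.0e-3
```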

Results

To reproduce (most of) the figures and tables in the paper, we have included code that parses the files in the /results/ folder. It can be run interactively via ./results_parser.ipynb. Be warned: this code is extremely poorly formatted.
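
To execute the notebook non-interactively, for example on a headless machine, one option is jupyter nbconvert:

```bash
# Runs every cell and writes an executed copy alongside the original.
jupyter nbconvert --to notebook --execute \
    --output results_parser_executed.ipynb results_parser.ipynb
```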

Citation

@inproceedings{verhoeven-etal-2024-generalisation,
    title = "A (More) Realistic Evaluation Setup for Generalisation of Community Models on Malicious Content Detection",
    author = "Verhoeven, Ivo and
        Mishra, Pushkar and
        Beloch, Rahel and
        Yannakoudakis, Helen and
        Shutova, Ekaterina",
    booktitle = "Findings of NAACL 2024",
    year = "2024",
    url = "",
}
