
Code repository for "Fine-tuning GPT-3 for Synthetic Danish News Generation" (Almasi & Schiønning, 2023) @MinaAlmasi @drasbaek





Fine-tuning GPT-3 for Synthetic Danish News Generation

This repository contains the code written for the paper "Fine-tuning GPT-3 for Synthetic Danish News Generation" (Almasi & Schiønning, 2023).

The project involved fine-tuning GPT-3 to produce synthetic news articles in Danish and evaluating the model in binary classification tasks. The evaluation relied on both human participants (A) and machine classifiers (B).

For the details of this evaluation, please refer to Almasi & Schiønning (2023).


Due to copyright and GDPR constraints, only the test data and the synthetic data generated by GPT-3 are uploaded to this GitHub repository. For all other purposes, dummy data is provided to reproduce the pipelines (see also Project Structure). To run any of the pipelines, follow the instructions in the Pipeline section.

For any other questions regarding the project, please contact the authors.

Project Structure

The repository is structured as follows:

dummy_data: Dummy data for running the GPT-3 pipeline and for reproducing the plots from Experiment A (human participants) and the technical pipelines from Experiment B (machine classifiers). Created to mimic the actual data as far as possible.
dummy_results: Files produced by running the dummy scripts in src. Due to the limited dummy data, these may not contain any intelligible information.
data: Contains the 96 test articles used in both Experiment A and B (i.e., for evaluating both human participants and machine detectors) and the 609 articles generated by GPT-3 for fine-tuning BERT.
plots: Plots used in (Almasi & Schiønning, 2023).
results: Results from the machine classifiers presented in (Almasi & Schiønning, 2023).
src: All code, organised in the folders process_articles, gpt3 and classifiers.
tokens: Empty folder in which to place openai_token.txt (for the GPT-3 pipeline) and hf_token.txt (optional, only needed to push the model to the HF Hub).

The repository also provides scripts to install the general requirements and packages in a virtual environment (additional setup may be required for the individual pipelines), to reproduce the classifier pipelines, and to reproduce the BERT pipeline.

Please note that the files in results, plots and data contain actual data pertaining to (Almasi & Schiønning, 2023) while the files in dummy_data and dummy_results do not.
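
The tokens folder described above holds plain-text token files. As a minimal sketch of how a script might read them, assuming each file contains nothing but the token itself (the helper name read_token is hypothetical and not taken from the repository's code):

```python
from pathlib import Path
from typing import Optional


def read_token(token_dir: Path, filename: str) -> Optional[str]:
    """Return the token stored in token_dir/filename, or None if the file is missing or empty.

    Hypothetical helper for illustration; the repository's scripts may load tokens differently.
    """
    token_path = token_dir / filename
    if not token_path.is_file():
        return None
    # Strip surrounding whitespace such as a trailing newline added by an editor.
    token = token_path.read_text(encoding="utf-8").strip()
    return token or None


if __name__ == "__main__":
    tokens_dir = Path("tokens")
    # openai_token.txt is needed for the GPT-3 pipeline; hf_token.txt is optional.
    print("OpenAI token found:", read_token(tokens_dir, "openai_token.txt") is not None)
    print("HF token found (optional):", read_token(tokens_dir, "hf_token.txt") is not None)
```

Returning None for a missing file lets the optional hf_token.txt be skipped without raising an error.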


Pipeline

For this project, Python (version 3.10) and R were used. Python's venv module needs to be installed for the setup to work.

General setup

To install the necessary requirements in a virtual environment (env), please run the setup script in the terminal.


The individual technical pipelines may require extra setup. These steps are explained in their respective READMEs.
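
For readers who want to see what a general setup of this kind typically involves, the following is a rough Python sketch that creates a virtual environment and installs requirements into it. It is not the repository's setup script: the function names are hypothetical, and the requirements file name requirements.txt is an assumption.

```python
import subprocess
import sys
import venv
from pathlib import Path
from typing import List


def pip_install_command(env_dir: Path, requirements: Path) -> List[str]:
    """Build the pip command that uses the environment's own Python interpreter."""
    if sys.platform == "win32":
        python = env_dir / "Scripts" / "python.exe"
    else:
        python = env_dir / "bin" / "python"
    return [str(python), "-m", "pip", "install", "-r", str(requirements)]


def setup_env(env_dir: Path = Path("env"),
              requirements: Path = Path("requirements.txt")) -> None:
    """Create the virtual environment and install the listed packages into it.

    'requirements.txt' is an assumed file name, not confirmed by the README.
    """
    venv.create(env_dir, with_pip=True)
    subprocess.run(pip_install_command(env_dir, requirements), check=True)


# Usage (creates ./env and installs packages, so it is left commented out):
# setup_env()
```

Calling pip through the environment's interpreter avoids having to activate the environment first.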

[1] Article Preprocessing

Refer to the README located in src/process_articles to reproduce the article preprocessing.

[2] Fine-Tuning and Text Generation with GPT-3

To fine-tune and/or generate text with GPT-3 using dummy data, refer to the README located in src/gpt3.

⚠️ NOTE! The current script fine-tunes "text-davinci", but this will be deprecated on the 4th of January 2024.

[3] Experiment A: Analysis of Human Participants

To run the analysis, please refer to the R Markdown file exp-a-analysis.Rmd in the src folder.

[4] Experiment B: Constructing Machine Classifiers

To construct the machine classifiers (BOW, TF-IDF, fine-tuned BERT), follow the instructions in the README located in src/classifiers.
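
To illustrate the difference between the two simpler feature representations named above, here is a small standard-library sketch of bag-of-words (BOW) counts and TF-IDF weights. It is not the repository's implementation: the whitespace tokenisation, the idf formula ln(N / df), and the three example headlines are all assumptions made for illustration, and library implementations often use smoothed variants.

```python
import math
from collections import Counter
from typing import Dict, List


def bow(doc: str) -> Counter:
    """Bag-of-words: raw count of each lowercased, whitespace-separated token."""
    return Counter(doc.lower().split())


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """TF-IDF with tf = raw count and idf = ln(N / df), one common unsmoothed variant."""
    counts = [bow(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs at least once.
    df = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]


# Made-up example headlines, not taken from the project's data.
docs = [
    "regeringen fremlægger ny plan",
    "regeringen afviser ny kritik",
    "landsholdet vinder kampen",
]
weights = tfidf(docs)
# "plan" occurs in one document, "regeringen" in two, so "plan" gets the larger weight.
print(weights[0]["plan"] > weights[0]["regeringen"])
```

BOW treats every occurrence equally, whereas TF-IDF down-weights terms that appear in many documents; either representation can then be fed to a standard classifier.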

⚠️ NOTE! While the fine-tuning of NbAiLab/nb-bert-large is done on dummy data, the inference is done with the actual fine-tuned classifier on the real test data.

The fine-tuned BERT can be accessed from the Hugging Face Hub.



For any questions regarding the paper or the reproducibility of the project, you can contact the authors.










