Skip to content

Commit

Permalink
Working on JOSS submission
Browse files Browse the repository at this point in the history
  • Loading branch information
john-hawkins committed Sep 28, 2024
1 parent f17ca6f commit e03cebb
Show file tree
Hide file tree
Showing 5 changed files with 161 additions and 8 deletions.
File renamed without changes.
122 changes: 122 additions & 0 deletions docs/paper/joss.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
---
title: 'Projit: A Python Package and CLI for Data Science Project Management'
tags:
- Python
- data science
- machine learning
- statistics
- open science
authors:
- name: John Hawkins
orcid: 0000-0001-6507-3671
equal-contrib: true
affiliation: "1" # (Multiple affiliations must be quoted)
affiliations:
- name: Transitional AI Research Group, Australia
index: 1
date: 27 Sep 2024
bibliography: paper.bib

---

# Summary

Data science projects occupy an unsual space between fast scripting, software development,
and methodologically rigorous experimentation. They require careful discipline to
prevent subtle problems like target leakage, over-fitting or p-hacking. At the same time
they cannot deliver results if they are locked down by rigid frameworks. Typically,
data scientists use custom workflows, or proprietary cloud systems to automate and
standardise certain elements like management of data sets, scripts, model artefacts
and experimental results. The general absence of standardisation means that we cannot
easily migrate projects or audit them without significant investment in understanding
a codebase. Nor can we easily repeat experiments or conduct meta-analysis across
multiple projects.


# Statement of need

`Projit` is a Python package for managing data science project meta-data
inside a simple local JSON store. It also provides a CLI tool for
interogating this data so that the current state of a project can easily
be assessed and understood. The API for `projit` was
designed so that it can be included in arbitrary python scripts to access
locate datasets, register experiments and store results along
with hyper-parameters.

The `projit` datastore is light-weight enough that it can easily be stored
with code inside a source code repository. Meaning that future users can
interogate the experiment history of the project. This is useful for both
project continuation, auditing/repeatability and opening the possibility
of scripted meta-data analysis. It has already been
used in a number of scientific publications to manage the results of
machine learning experiments into systematic reviews for biomedical
projects [@Hawkins+Tivey:2024] and the analysis of text features derived
from URLS [@Hawkins:2023]. In addition, `projit` has been used by the author
inside multiple industry based proprietary machine learning projects.

# Methodology

The core design principle of projit is that data science projects should
be structured as loosely coupled components. Meaning, dependency is inevitable,
but it should be kept to an absolute minimum.
For example, experiments depend on a data processing
pipeline, but do not need to depend on anything but the output of that process.
Experiments should be able to be executed in parallel, so that they can be
re-run as required. They do not need to be aware of each other, but they should
generate standardised result sets for comparison.

To facilitate loose coupling between stages of the project the `projit` utility
imposes a simple schema for components of a data science project. These consist
of:
* Datasets
* Experiments
* Results
All of these entities can be added, removed or modified using either the CLI tool
or the Python package within scripts. The relation of these components is depicted
in Figure \autoref{fig:projit}

![Projit Application Entities.\label{fig:projit}](images/Projit_decoupled_process.drawio.png)

In the development of `projit` we have drawn on additional design principles from
other open source projects.

## Project Structure

There is an optional setting that allows users to determine a standard project structure.
This option will initialise any project with a predetermined set of directories and
files. We draw upon the principle used in the Cookie Cutter Data Science project when
implementing these project structures [@cookiecutter].

## Natural Language Sub Command CLI

In order to make the CLI interface easy to use we borrowed multiple ideas from the
design of the Git CLI [@git]. Firstly, any command will recursively search from the
current directory to discover the current project. This means users can run commands
from anywhere inside the project without tracking the location of the root directory.
Secondly, we develop a sub-command structure that provides
something close to a natural language interface. For example, the primary commands
`list` or `rm` can be applied to any of the `projit` entities, as shown in the code
listing below:

```
projit list datasets
projit list experiments
```

# Research Applications

The fundamental research application of `projit` is in managing the project lifecycle
and efficiency of development. Results to all experiments can be tracked and interrogated
to easily produce tables of data. An additional level of application comes with a focus
on open science, allowing other teams to review and audit experiment history, then
easily repeat or extend experiments. Finally, there is a research application in meta-analysis.
Projects in which the projit meta-data are stored along with open source code can
be interoggated to look at the performance of certain techniques or algorithms across
multiple projects.

# Acknowledgements

We acknowledge contributions from Jesse Wu and Priyabrata Karmakar
in testing or reviewing the functionality and codebase of projit.

# References
2 changes: 1 addition & 1 deletion docs/paper/paper.tex
Original file line number Diff line number Diff line change
Expand Up @@ -223,7 +223,7 @@ \section{Introduction}
and comparison of these experiments.

\begin{figure*}
\includegraphics[scale=0.6]{./Projit_decoupled_process.drawio.png}
\includegraphics[scale=0.6]{./images/Projit_decoupled_process.drawio.png}
\caption{Projit Process for Decoupled Data Science}
\label{fig:projit}
\end{figure*}
Expand Down
45 changes: 38 additions & 7 deletions docs/paper/refs.bib
Original file line number Diff line number Diff line change
Expand Up @@ -149,13 +149,25 @@ @techreport{Schmitt2015
title = {Scientific Discovery in the Era of Big Data: More than the Scientific Method}
}

@article{HawkinsTivey2023,
author = {Hawkins, J. and Tivey, D.},
year = {2023},
month = {09},
pages = {},
title = {Efficient Systematic Reviews: Literature Filtering with Transformers and Transfer Learning},
journal = {Submitted}
@inproceedings{Hawkins+Tivey:2024,
author = {John Hawkins and David Tivey},
year = {2024},
month = {06},
title = {Literature Filtering for Systematic Reviews with Transformers},
booktitle = {2nd International Conference on Communications, Computing and Artificial Intelligence (CCCAI 2024)},
address = {Jeju, Korea},
editor = {},
ISBN = {},
doi = {https://doi.org/10.1145/3676581.3676582}
}

@inproceedings{Hawkins:2023,
author = {John Hawkins},
year = {2023},
title = {What's in a Domain? Anaylsis of URL Features},
booktitle = {4th International Conference on Data Science and Cloud Computing (DSCC 2023)},
month = {March},
pages = {85-93}
}

@article{Reznik2022,
Expand All @@ -168,3 +180,22 @@ @article{Reznik2022
journal = {Computers \& Geosciences},
doi = {10.1016/j.cageo.2022.105194}
}

@misc{cookiecutter,
author = {Carmine Paolino},
title = {Cookiecutter Modern Data Science},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
url = {https://github.com/crmne/cookiecutter-modern-datascience}
}

@misc{git,
author = {Linus Torvalds},
title = {Git},
year = {2005},
publisher = {Software Freedom Conservancy},
journal = {Git repository},
url = {https://git.kernel.org/pub/scm/git/git.git/}
}

0 comments on commit e03cebb

Please sign in to comment.