Commit

Working on JOSS paper draft
john-hawkins committed Sep 29, 2024
1 parent d061ab1 commit 41be5a2
Showing 3 changed files with 42 additions and 13 deletions.
39 changes: 27 additions & 12 deletions docs/paper/joss.md
@@ -1,5 +1,5 @@
---
title: 'Projit: A Python Package and CLI for Data Science Project Management'
title: 'Projit: An Open Source tool for Decoupled Data Science'
tags:
- Python
- data science
@@ -29,25 +29,34 @@ data scientists use custom workflows, or proprietary cloud systems to automate a
standardise certain elements like management of data sets, scripts, model artefacts
and experimental results. The general absence of standardisation means that we cannot
easily migrate projects or audit them without significant investment in understanding
a codebase. Nor can we easily repeat experiments or conduct meta-analysis across
multiple projects.
a codebase, nor can we easily repeat experiments or conduct meta-analysis across
multiple projects. We present `projit` -- a simple open source package and CLI
for maintaining data science project meta-data and interoperability between stages
and processes.


# Statement of need

Software approaches to managing scientific data, processes and meta-data are
typically either built as front-ends for specific
scientific domains [@Howe2008; @Pettit:2010] (leveraging known analytical practices in
the given domain) or they are designed to facilitate interoperability between different
technology stacks [@Subramanian2013]. Machine learning focused frameworks tend to
focus on solving problems of model training and deployment for specific technologies [@Alberti:2018; @MolnerDomenech:2020], and hence have limited generality.

`Projit` is a Python package for managing data science project meta-data
inside a simple local JSON store. It also provides a CLI tool for
interrogating this data so that the current state of a project can easily
be assessed and understood. The API for `projit` was
designed so that it can be included in arbitrary python scripts to access
designed so that it can be included in arbitrary python scripts to
locate datasets, register experiments and store results along
with hyper-parameters.

The `projit` datastore is light-weight enough that it can easily be stored
with code inside a source code repository, meaning that future users can
interrogate the experiment history of the project. This is useful for
project continuation, auditing/repeatability and opening the possibility
of scripted meta-data analysis. It has already been
of scripted meta-data analysis. The package has been
used in a number of scientific publications to manage the results of
machine learning experiments into systematic reviews for biomedical
projects [@Hawkins+Tivey:2024] and the analysis of text features derived
@@ -68,9 +77,10 @@ generate standardised result sets for comparison.
To facilitate loose coupling between stages of the project the `projit` utility
imposes a simple schema for components of a data science project. These consist
of:
* Datasets
* Experiments
* Results
- Datasets
- Experiments
- Results

All of these entities can be added, removed or modified using either the CLI tool
or the Python package within scripts. The relation of these components is depicted
in Figure \autoref{fig:projit}
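
The three-entity schema described above can be illustrated with a small JSON-backed store. This is a minimal sketch under stated assumptions, not the `projit` API or its on-disk format: every function name, key name and the file layout below are hypothetical and chosen only to show how datasets, experiments and results might relate to one another.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

# Hypothetical in-memory layout: one top-level section per entity type.
Store = Dict[str, Dict[str, Any]]


def empty_store() -> Store:
    """Create a store with one section each for datasets, experiments and results."""
    return {"datasets": {}, "experiments": {}, "results": {}}


def add_dataset(store: Store, name: str, path: str) -> None:
    """Register a dataset under a logical name so scripts need not hard-code paths."""
    store["datasets"][name] = path


def add_experiment(store: Store, name: str, script: str) -> None:
    """Register an experiment and the script that runs it."""
    store["experiments"][name] = {"script": script}


def add_result(store: Store, experiment: str, metric: str, value: float,
               params: Optional[Dict[str, Any]] = None) -> None:
    """Attach a metric value (and optional hyper-parameters) to a known experiment."""
    if experiment not in store["experiments"]:
        raise KeyError(f"unknown experiment: {experiment}")
    entry = {"metric": metric, "value": value, "params": params or {}}
    store["results"].setdefault(experiment, []).append(entry)


def save(store: Store, path: Path) -> None:
    """Write the store to a local JSON file."""
    path.write_text(json.dumps(store, indent=2))


def load(path: Path) -> Store:
    """Read the store back from a local JSON file."""
    return json.loads(path.read_text())
```

Because results refer to experiments only by name, a training script and an evaluation script could each update the same file without importing one another, which is the kind of loose coupling the paper describes.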
@@ -93,16 +103,21 @@ In order to make the CLI interface easy to use we borrowed multiple ideas from t
design of the Git CLI [@git]. Firstly, any command will recursively search from the
current directory to discover the current project. This means users can run commands
from anywhere inside the project without tracking the location of the root directory.
Secondly, we develop a sub-command structure that provides
something close to a natural language interface. For example, the primary commands
`list` or `rm` can be applied to any of the `projit` entities, as shown in the code
listing below:
Secondly, we develop a sub-command structure that allows the `projit` CLI to be
a versatile tool with something close to a natural language interface.
For example, the primary command `list` can be applied to any of the `projit`
entities, as shown in the code listing below:

```
projit list datasets
projit list experiments
projit list results
```

The same principle applies to the remove and add commands, which naturally require
additional parameters to specify what is being added or removed. The design goal
of the CLI is to make project management intuitive without imposing arbitrary constraints.
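
The Git-style project discovery mentioned above can be sketched as a walk from the starting directory up through its parents until a marker is found. This is an illustrative sketch only: the marker name `.projit` and the function `find_project_root` are assumptions made for the example, and the paper does not state how `projit` itself locates the project root.

```python
from pathlib import Path
from typing import Optional

# Hypothetical marker name, chosen for illustration only; the file or
# directory that projit actually looks for is not stated in the paper.
MARKER = ".projit"


def find_project_root(start: Path, marker: str = MARKER) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Returns None when no enclosing directory contains the marker.
    """
    current = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    return None
```

A CLI entry point could call `find_project_root(Path.cwd())` before dispatching any sub-command, so that the same command works from any directory inside the project.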

# Research Applications

The fundamental research application of `projit` is in managing the project lifecycle
7 changes: 6 additions & 1 deletion docs/paper/paper.tex
@@ -176,6 +176,11 @@ \section{Introduction}
and that the core scientific tasks of conceiving,
designing and running experiments needs to be managed at an ever increasing scale.

Software approaches to managing data, processes and meta-data are typically referred to
as e-science platforms. In general they are either built for specific scientific domains
\cite{Howe2008} or they are designed for very specific technology stacks (common in a given domain)
\cite{Subramanian2013}.

Platforms for eScience offer a variety of solutions for these problems including
tracking the lineage and management of data, referred to as
the provenance problem \cite{Sahoo:2008,Conquest:2021}.
@@ -204,7 +209,7 @@ \section{Introduction}
Frameworks for eScience will typically need to take a position on the extent to which
they are domain specific versus general purpose. A domain specific approach can
integrate multiple data sources in a domain aware fashion that can facilitate
automated or assisted scientific discover\cite{Howe2008}. On the
automated or assisted scientific discovery\cite{Howe2008,Pettit:2010}. On the
other hand a general purpose framework facilitates multi-disciplinary collaboration
and permits meta-analysis that transcends the boundaries of disciplines.
The other key dimension for a decision is the extent to which an eScience
9 changes: 9 additions & 0 deletions docs/paper/refs.bib
@@ -199,3 +199,12 @@ @misc{git
url = {https://git.kernel.org/pub/scm/git/git.git/}
}

@inproceedings{Pettit:2010,
author = {Pettit, Christopher and Russel, A.B.M. and Michael, Anthony and Aurambout, Jean-Philippe and Sharma, Subhash and Williams, Stephen and Hunter, David and Chan, Pang and Borda, Ann and Bishop, Ian and Abramson, David},
year = {2010},
month = {12},
pages = {73-80},
title = {Realising an eScience Platform to Support Climate Change Adaptation in Victoria},
booktitle = {Proceedings - 2010 6th IEEE International Conference on e-Science, eScience 2010},
doi = {10.1109/eScience.2010.32}
}
