diff --git a/docs/paper/Projit_decoupled_process.drawio b/docs/paper/images/Projit_decoupled_process.drawio
similarity index 100%
rename from docs/paper/Projit_decoupled_process.drawio
rename to docs/paper/images/Projit_decoupled_process.drawio
diff --git a/docs/paper/Projit_decoupled_process.drawio.png b/docs/paper/images/Projit_decoupled_process.drawio.png
similarity index 100%
rename from docs/paper/Projit_decoupled_process.drawio.png
rename to docs/paper/images/Projit_decoupled_process.drawio.png
diff --git a/docs/paper/joss.md b/docs/paper/joss.md
new file mode 100644
index 0000000..c044b2e
--- /dev/null
+++ b/docs/paper/joss.md
@@ -0,0 +1,122 @@
+---
+title: 'Projit: A Python Package and CLI for Data Science Project Management'
+tags:
+  - Python
+  - data science
+  - machine learning
+  - statistics
+  - open science
+authors:
+  - name: John Hawkins
+    orcid: 0000-0001-6507-3671
+    equal-contrib: true
+    affiliation: "1" # (Multiple affiliations must be quoted)
+affiliations:
+  - name: Transitional AI Research Group, Australia
+    index: 1
+date: 27 Sep 2024
+bibliography: paper.bib
+
+---
+
+# Summary
+
+Data science projects occupy an unusual space between fast scripting, software development,
+and methodologically rigorous experimentation. They require careful discipline to
+prevent subtle problems like target leakage, over-fitting or p-hacking. At the same time
+they cannot deliver results if they are locked down by rigid frameworks. Typically,
+data scientists use custom workflows or proprietary cloud systems to automate and
+standardise certain elements like management of data sets, scripts, model artefacts
+and experimental results. The general absence of standardisation means that we cannot
+easily migrate projects or audit them without significant investment in understanding
+a codebase. Nor can we easily repeat experiments or conduct meta-analysis across
+multiple projects.
+
+
+# Statement of need
+
+`Projit` is a Python package for managing data science project meta-data
+inside a simple local JSON store. It also provides a CLI tool for
+interrogating this data so that the current state of a project can easily
+be assessed and understood. The API for `projit` was
+designed so that it can be included in arbitrary Python scripts to
+locate datasets, register experiments and store results along
+with hyper-parameters.
+
+The `projit` datastore is light-weight enough that it can easily be stored
+with code inside a source code repository, meaning that future users can
+interrogate the experiment history of the project. This is useful for
+project continuation and auditing/repeatability, and opens the possibility
+of scripted meta-data analysis. It has already been
+used in a number of scientific publications to manage the results of
+machine learning experiments into systematic reviews for biomedical
+projects [@Hawkins+Tivey:2024] and the analysis of text features derived
+from URLs [@Hawkins:2023]. In addition, `projit` has been used by the author
+inside multiple industry-based proprietary machine learning projects.
+
+# Methodology
+
+The core design principle of `projit` is that data science projects should
+be structured as loosely coupled components. Dependency is inevitable,
+but it should be kept to an absolute minimum.
+For example, experiments depend on a data processing
+pipeline, but do not need to depend on anything but the output of that process.
+Experiments should be able to be executed in parallel, so that they can be
+re-run as required. They do not need to be aware of each other, but they should
+generate standardised result sets for comparison.
+
+To facilitate loose coupling between stages of the project the `projit` utility
+imposes a simple schema for components of a data science project. These consist
+of:
+* Datasets
+* Experiments
+* Results
+All of these entities can be added, removed or modified using either the CLI tool
+or the Python package within scripts. The relationship between these components is depicted
+in \autoref{fig:projit}.
+
+![Projit Application Entities.\label{fig:projit}](images/Projit_decoupled_process.drawio.png)
+
+In the development of `projit` we have drawn on additional design principles from
+other open source projects.
+
+## Project Structure
+
+There is an optional setting that allows users to determine a standard project structure.
+This option will initialise any project with a predetermined set of directories and
+files. We draw upon the principle used in the Cookie Cutter Data Science project when
+implementing these project structures [@cookiecutter].
+
+## Natural Language Sub Command CLI
+
+In order to make the CLI easy to use we borrowed multiple ideas from the
+design of the Git CLI [@git]. Firstly, any command will recursively search from the
+current directory to discover the current project. This means users can run commands
+from anywhere inside the project without tracking the location of the root directory.
+Secondly, we develop a sub-command structure that provides
+something close to a natural language interface. For example, the primary commands
+`list` or `rm` can be applied to any of the `projit` entities, as shown in the code
+listing below:
+
+```
+projit list datasets
+projit list experiments
+```
+
+# Research Applications
+
+The fundamental research application of `projit` is in managing the project lifecycle
+and efficiency of development. Results of all experiments can be tracked and interrogated
+to easily produce tables of data. An additional level of application comes with a focus
+on open science, allowing other teams to review and audit experiment history, then
+easily repeat or extend experiments. Finally, there is a research application in meta-analysis.
+Projects in which the `projit` meta-data are stored along with open source code can
+be interrogated to look at the performance of certain techniques or algorithms across
+multiple projects.
+
+# Acknowledgements
+
+We acknowledge contributions from Jesse Wu and Priyabrata Karmakar
+in testing or reviewing the functionality and codebase of `projit`.
+
+# References
diff --git a/docs/paper/paper.tex b/docs/paper/paper.tex
index fe1ef1a..7fd3e98 100644
--- a/docs/paper/paper.tex
+++ b/docs/paper/paper.tex
@@ -223,7 +223,7 @@ \section{Introduction}
 and comparison of these experiments.
 
 \begin{figure*}
-\includegraphics[scale=0.6]{./Projit_decoupled_process.drawio.png}
+\includegraphics[scale=0.6]{./images/Projit_decoupled_process.drawio.png}
 \caption{Projit Process for Decoupled Data Science}
 \label{fig:projit}
 \end{figure*}
diff --git a/docs/paper/refs.bib b/docs/paper/refs.bib
index f130af8..45765d9 100644
--- a/docs/paper/refs.bib
+++ b/docs/paper/refs.bib
@@ -149,13 +149,25 @@ @techreport{Schmitt2015
   title = {Scientific Discovery in the Era of Big Data: More than the Scientific Method}
 }
 
-@article{HawkinsTivey2023,
-  author = {Hawkins, J. and Tivey, D.},
-  year = {2023},
-  month = {09},
-  pages = {},
-  title = {Efficient Systematic Reviews: Literature Filtering with Transformers and Transfer Learning},
-  journal = {Submitted}
+@inproceedings{Hawkins+Tivey:2024,
+  author = {John Hawkins and David Tivey},
+  year = {2024},
+  month = {06},
+  title = {Literature Filtering for Systematic Reviews with Transformers},
+  booktitle = {2nd International Conference on Communications, Computing and Artificial Intelligence (CCCAI 2024)},
+  address = {Jeju, Korea},
+  editor = {},
+  ISBN = {},
+  doi = {10.1145/3676581.3676582}
+}
+
+@inproceedings{Hawkins:2023,
+  author = {John Hawkins},
+  year = {2023},
+  title = {What's in a Domain? Analysis of URL Features},
+  booktitle = {4th International Conference on Data Science and Cloud Computing (DSCC 2023)},
+  month = {March},
+  pages = {85--93}
 }
 
 @article{Reznik2022,
@@ -168,3 +180,22 @@ @article{Reznik2022
   journal = {Computers \& Geosciences},
   doi = {10.1016/j.cageo.2022.105194}
 }
+
+@misc{cookiecutter,
+  author = {Carmine Paolino},
+  title = {Cookiecutter Modern Data Science},
+  year = {2020},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  url = {https://github.com/crmne/cookiecutter-modern-datascience}
+}
+
+@misc{git,
+  author = {Linus Torvalds},
+  title = {Git},
+  year = {2005},
+  publisher = {Software Freedom Conservancy},
+  journal = {Git repository},
+  url = {https://git.kernel.org/pub/scm/git/git.git/}
+}
+