diff --git a/docs/paper/joss.md b/docs/paper/joss.md
index afd9628..7b826e2 100644
--- a/docs/paper/joss.md
+++ b/docs/paper/joss.md
@@ -41,25 +41,25 @@ and processes.
 Software approaches to managing scientific data, processes and meta-data are typically either built as front-ends for specific
-scientific domains [@Howe2008;Pettit:2010] (leveraging known analytical practices in
-the given domain) or they are designed to faciliate interoperability between different
+scientific domains [@Howe2008;@Pettit:2010]
+or they are designed to faciliate interoperability between different
 technology stacks [@Subramanian2013]. Machine learning focused frameworks tend to focus on solving problems of model training and deployment for specific
-technologies\cite[@Alberti:2018;MolnerDomenech:2020], and hence have limited generality.
+technologies [@Alberti:2018;@MolnerDomenech:2020], and hence have limited generality.
 `Projit` is a Python package for managing data science project meta-data
-inside a simple local JSON store. It also provides a CLI tool for
+inside a simple local JSON store. It provides a CLI tool for
 interrogating this data so that the current state of a project can easily be assessed and understood. The API for `projit` was designed so that it can be included in arbitrary python scripts to locate datasets, register experiments and store results along with hyper-parameters.
-The `projit` datastore is light-weight enough that it can easily be stored
-with code inside a source code repository. Meaning that future users can
-interrogate the experiment history of the project. This is useful for both
+The `projit` datastore is light-weight so it can be saved
+with code inside a source code repository. Allowing future users to
+interrogate the experiment history of project. This is useful for both
 project continuation, auditing/repeatability and opening the possibility
-of scripted meta-data analysis. The package has been
+of scripted meta-data analysis.
The `projit` package has been
 used in a number of scientific publications to manage the results of machine learning experiments into systematic reviews for biomedical projects [@Hawkins+Tivey:2024] and the analysis of text features derived
@@ -80,12 +80,12 @@ generate standardised result sets for comparison.
 To facilitate loose coupling between stages of the project the `projit` utility imposes a simple schema for components of a data science project. These consist of:
-- Datasets
-- Experiments
-- Results
+* Datasets
+* Experiments
+* Results
 All of these entities can be added, removed or modified using either the CLI tool
-or the Python package within scripts. The relation of these components is depicted
+or the Python package within scripts. These entities in a project are depicted
 in Figure \autoref{fig:projit}
 ![Projit Application Entities.\label{fig:projit}](images/Projit_decoupled_process.drawio.png)
@@ -109,29 +109,29 @@ from anywhere inside the project without tracking the location of the root direc
 Secondly, we develop a sub-command structure that allows the `'projit` CLI to be a versatile tool with something close to a natural language interface. For example, the primary command `list` can be applied to any of the `projit`
-entities, as shown in the code listing below:
+entities, as shown in the command below:
 ```
-projit list datasets
-projit list experiments
-projit list results
+> projit list datasets
 ```
 The same principle applies to the remove and add commands, which naturally require additional paramaters to specifiy what is being added or removed. The design goal
-of the CLI is to make project intuitive without imposing arbitrary constraints.
+of the CLI is to make projit intuitive without imposing arbitrary constraints.
 # Research Applications
 The fundamental research application of `projit` is in managing the project lifecycle
-and efficiency of development.
Results to all experiments can be tracked and
-interrogated to easily produce tables of data.
-An additional level of application comes with a focus
+and efficiency of development. Paths to datasets are retrieved from meta-data, not
+hard coded. Experiments are named, with execution times tracked. The Results to
+all experiments can be tracked over each iteration, with hyper-parameters and
+interrogated to easily produce tables of data and analysis.
+Additional application comes with a focus
 on open science, allowing other teams to review and audit experiment history, then easily repeat or extend experiments. Finally, there is a research application in meta-analysis. Projects in which the projit meta-data are stored along with open source code can
-be interrogated to look at the performance of certain techniques or algorithms across
+be analysed to look at the performance of certain techniques or algorithms across
 multiple projects.
 # Acknowledgements