More references and literature review
john-hawkins committed Jun 21, 2023
1 parent b94c363 commit 6bf092b
Showing 2 changed files with 89 additions and 18 deletions.
76 changes: 58 additions & 18 deletions docs/paper/paper.tex
@@ -149,10 +149,23 @@ \section{Introduction}
to enable efficiency, but has the effect of limiting general applicability.
Many other eScience frameworks focus on the lineage and management of data, commonly
referred to as the provenance problem \cite{Sahoo:2008,Conquest:2021}.
The goal of the provenance frameworks is sufficient auditability of data to
render eScience transparent and repeatable. This can be auditing of data from multiple source systems,
or it can be auditing of logs generated during data processing \cite{Ferdous2020}. Regardless of the
specific data to be audited, these frameworks focus on developing unified systems and processes so
that auditing can be performed easily across many projects.

In addition to systems for the storage of data, eScience applications may include facilities for orchestration
of data processes and services \cite{Subramanian2013}, analysis of results, generation of insights,
and documentation.

Frameworks for eScience typically need to take a position on the extent to which they are domain
specific versus general purpose. A domain-specific approach that integrates multiple data sources in
a domain-aware fashion can facilitate automated or assisted scientific discovery \cite{Howe2008}. On the
other hand, a general-purpose framework facilitates multi-disciplinary collaboration and permits meta-analysis
that transcends the boundaries of disciplines.

Other frameworks and approaches in eScience focus on understanding how to do large-scale collaborative science, or
facilitate meta-level learning of various kinds \cite{Hunter:2005,Liu:2023}. The better we track the
process of science as a whole, the better we can both understand how to improve scientific processes
and mine the history of science for phenomena that were previously difficult to detect.
@@ -174,21 +187,31 @@ \section{Methodology}
\item Tracking: Tracking of Experiments and outputs
\item Results: Comparison of Methods and Results
\item Documentation: Generation of Documentation
\item Reproducibility: Facilitate reproduction of results
\item Meta-Analysis: Facilitation of Meta-Analysis
\end{itemize}

The elements in this list are organised in an approximately sequential manner. However, as we discuss
them below, it should be apparent that there are many ways in which these elements support each other.

First and foremost, data-driven projects require access to the required \textbf{source} data
and need to maintain records of this data provenance for \textbf{reproducibility}.
There will typically be \textbf{processing} applied to these datasets to
render them applicable to experimentation and analysis. An ideal tool will track the sequential
nature of this \textbf{processing} as well as store information about the location of each resulting dataset.
The data processed in this way is then available for \textbf{reuse} across experiments and analysis,
making \textbf{results} comparable and facilitating \textbf{meta-analysis}.

The centralised storage of data in a unified format allows for scripted generation of \textbf{documentation},
and facilitates easy \textbf{meta-analysis}. If the project metadata is stored in a public or open-source
repository then it is possible to build tools that extract and process the data from multiple projects. This
will permit the emergence of an ecosystem of tools that mine the history of experiments conducted on the
same or similar source data, evaluate experimental protocols or algorithms across projects, and potentially
automate some forms of \textbf{meta-analysis}.

To achieve these advantages we require a uniform system for storing all necessary data that are inputs and outputs
for each stage of a data science experiment. The central store permits decoupling of processes by allowing each
element of the process to be implemented and executed independently of the others.
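The decoupling afforded by a central store can be sketched as follows. This is a minimal illustration of the idea using only the standard library; the class name, file layout and methods are assumptions for the sketch, not the projit implementation.

```python
import json
from pathlib import Path


class MetadataStore:
    """Minimal central store: every script reads and writes one JSON file,
    so data processing, experiments and analysis never import each other."""

    def __init__(self, root="."):
        self.path = Path(root) / "project.json"
        if self.path.exists():
            self.data = json.loads(self.path.read_text())
        else:
            self.data = {"datasets": {}, "experiments": {}, "results": []}

    def save(self):
        self.path.write_text(json.dumps(self.data, indent=2))

    def add_dataset(self, name, location):
        self.data["datasets"][name] = location
        self.save()

    def get_dataset(self, name):
        return self.data["datasets"][name]

    def add_result(self, experiment, metric, value):
        self.data["results"].append(
            {"experiment": experiment, "metric": metric, "value": value})
        self.save()


# One script registers the processed data...
store = MetadataStore()
store.add_dataset("train", "data/train.csv")

# ...and a separate experiment script, knowing only the store,
# can locate the dataset and record its result.
store2 = MetadataStore()
print(store2.get_dataset("train"))
store2.add_result("baseline", "accuracy", 0.92)
```

Because each script touches only the shared file, any stage can be rewritten or re-run without the others being aware of it.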

\subsection{Projit Process}

@@ -208,12 +231,29 @@ \subsection{Projit Process}
and analysis of results all happen independently. Each of them accesses the projit store for
the information it needs, and stores the information it produces.


\section{Results}

\subsection{Implementation}

Projit has been implemented as a Python package that functions as both a command line application
and a library that can be included inside other scripts and applications. The command line application
can be used to query the project metadata in much the same way that the git application can be used.
A user can add, modify and list the collection of data assets in the project: datasets, experiments
and results are all accessible from the command line application.

The Python package can be included in a script so that the script can access the project metadata store.
This allows the script to find the location of common datasets, register itself as an experiment
and store results once the script is complete. Programmatic interaction with the project data through
the projit API is what permits the scripts of a project to be decoupled and to contribute to the project
without being aware of how any other element is structured or implemented.
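The lifecycle of such a script, find a dataset, register as an experiment, run, then store results, follows the pattern below. The store location, field names and default contents are invented for illustration; they are not the actual projit API.

```python
import json
import time
from pathlib import Path

STORE = Path("projit_meta.json")   # assumed store location, for illustration


def load_store():
    """Read the shared metadata file, seeding a default if it is absent."""
    if STORE.exists():
        return json.loads(STORE.read_text())
    return {"datasets": {"train": "data/train.csv"}, "experiments": {}}


def save_store(meta):
    STORE.write_text(json.dumps(meta, indent=2))


# The experiment script knows only the store, not the other scripts.
meta = load_store()
data_path = meta["datasets"]["train"]        # find a common dataset
meta["experiments"]["baseline"] = {          # register itself as an experiment
    "script": "baseline.py", "started": time.time()}
save_store(meta)

# ... load data_path, train and evaluate here ...

meta = load_store()
meta["experiments"]["baseline"]["result"] = {"accuracy": 0.91}  # store results
save_store(meta)
print(meta["experiments"]["baseline"]["result"])
```

The re-load before writing the result mimics the decoupling: by the time the experiment finishes, other scripts may have updated the store.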

\section{Case Study}

We have utilised the projit application across multiple data science projects to store reusable datasets
and the results of all experiments. Additionally, the metadata store contains information about the number
of times each experiment has been executed, and the execution time used on each run. This allows us
to generate an ad hoc script that can compare projects in terms of the data used, the number of experiments
conducted and the total execution time. This script is constructed for illustrative purposes to show that
the projit tool permits arbitrary meta-analysis of projects through the standardised metadata stored across
git repositories.
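A cross-project comparison of this kind reduces to aggregating a few fields from each project's metadata. The project names, field names and figures below are invented for illustration; a real script would read them from each repository's metadata store.

```python
# Hypothetical per-project metadata, as might be extracted from several
# repositories' metadata stores (all names and numbers are illustrative).
projects = {
    "churn_model": {"datasets": 3, "experiments": 12, "exec_seconds": 5400},
    "forecasting": {"datasets": 2, "experiments": 7,  "exec_seconds": 2100},
}


def summarise(projects):
    """Return one row per project: name, dataset count, experiment count,
    and total execution time converted to hours."""
    rows = []
    for name, meta in sorted(projects.items()):
        rows.append((name, meta["datasets"], meta["experiments"],
                     meta["exec_seconds"] / 3600.0))
    return rows


for name, n_data, n_exp, hours in summarise(projects):
    print(f"{name:<14} datasets={n_data} experiments={n_exp} hours={hours:.2f}")
```

Because every project exposes the same fields, the comparison script needs no knowledge of any individual project's internals.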


\section{Conclusion}
31 changes: 31 additions & 0 deletions docs/paper/refs.bib
@@ -71,4 +71,35 @@ @article{Liu:2023
doi = {10.1038/s41562-023-01562-4}
}

@inbook{Ferdous2020,
  author    = {Ferdous, Rayhan and Roy, Banani and Roy, Chanchal and Schneider, Kevin},
  year      = {2020},
  month     = {01},
  pages     = {185--200},
  title     = {Workflow Provenance for Big Data: From Modelling to Reporting},
  booktitle = {Data Management and Analysis},
  isbn      = {978-3-030-32586-2},
  doi       = {10.1007/978-3-030-32587-9_11}
}

@inproceedings{Howe2008,
  author    = {Howe, Bill and Lawson, Peter and Bellinger, M. Renee and Anderson, Erik and Santos, Emanuele and Freire, Juliana and Scheidegger, Carlos and Baptista, Antonio and Silva, Claudio},
  year      = {2008},
  month     = {12},
  pages     = {127--134},
  title     = {End-to-End eScience: Integrating Workflow, Query, Visualization, and Provenance at an Ocean Observatory},
  booktitle = {Proceedings of the 4th IEEE International Conference on eScience},
  doi       = {10.1109/eScience.2008.67}
}

@article{Subramanian2013,
  author  = {Subramanian, Sattanathan and Sztromwasser, Pawel and Puntervoll, Pål and Petersen, Kjell},
  year    = {2013},
  month   = {08},
  title   = {Pipelined data-flow delegated orchestration for data-intensive eScience workflows},
  volume  = {9},
  journal = {International Journal of Web Information Systems},
  doi     = {10.1108/IJWIS-05-2013-0012}
}
