vkuznet/MLaaS4HEP

Machine Learning as a Service for HEP

License: MIT

MLaaS for HEP is a set of Python-based modules that read HEP data and stream them to the ML framework of the user's choice for training. It consists of three independent layers:

  • data streaming layer to handle remote data, see reader.py
  • data training layer to train ML model for given HEP data, see workflow.py
  • data inference layer, see tfaas_client.py

The general architecture of MLaaS4HEP looks like this:

[Figure: MLaaS4HEP architecture]

Although this architecture was originally developed for HEP ROOT files, we have extended it to other data formats. The following data formats are currently supported: JSON, CSV, Parquet, and ROOT. JSON, CSV, and Parquet files can be read from the local file system or HDFS, while ROOT files can be read from the local file system or remotely via the xrootd protocol.
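
The format and location rules above could be checked up front with a small helper. The following is a minimal sketch, not part of MLaaS4HEP: the function name `classify`, the extension table, and the handling of a `.gz` suffix are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Formats named in the text. The compression suffix is an assumption
# based on the .json.gz example used elsewhere in this README.
FORMATS = {".json": "JSON", ".csv": "CSV", ".parquet": "Parquet", ".root": "ROOT"}
COMPRESSION = (".gz",)
STORAGE = {"": "local", "file": "local", "hdfs": "HDFS", "root": "xrootd"}


def classify(path):
    """Return (storage, data_format) for a file name or URL.

    Hypothetical helper for illustration; not the MLaaS4HEP API.
    """
    storage = STORAGE.get(urlparse(path).scheme)
    if storage is None:
        raise ValueError(f"unsupported location: {path}")

    # Strip a compression suffix before looking at the format extension.
    name = path.lower()
    for suffix in COMPRESSION:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    for ext, fmt in FORMATS.items():
        if name.endswith(ext):
            break
    else:
        raise ValueError(f"unsupported data format: {path}")

    # Per the text: ROOT is read locally or via xrootd,
    # the other formats locally or from HDFS.
    allowed = {"local", "xrootd"} if fmt == "ROOT" else {"local", "HDFS"}
    if storage not in allowed:
        raise ValueError(f"{fmt} files cannot be read from {storage}")
    return storage, fmt
```

For example, `classify("hdfs:///path/file1.json.gz")` yields `("HDFS", "JSON")`, while a ROOT file on HDFS is rejected with a `ValueError`.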

Pre-trained models can be uploaded to the TFaaS inference server, which serves them to clients.

Dependencies

MLaaS4HEP relies on third-party libraries to read the different data formats. The main ones are:

  • pyarrow for reading data from HDFS file system
  • uproot for reading ROOT files
  • numpy, pandas for data representation
  • modin for fast pandas support
  • numba for speeding up individual functions

For ML modeling you may use your favorite framework, e.g. Keras, TensorFlow, scikit-learn, PyTorch, etc. We suggest using Anaconda to install the dependencies:

# to install pyarrow, uproot
conda install -c conda-forge pyarrow uproot numba scikit-learn
# to install pytorch
conda install -c pytorch pytorch
# to install TensorFlow, Keras, NumPy, Pandas
conda install keras numpy pandas
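
After installation, a quick way to see which of the libraries above are importable is to probe for them from Python. This is only a sketch: the lists below use the usual top-level import names (for example `sklearn` for scikit-learn and `torch` for PyTorch), and the split into required and optional is an assumption, not something MLaaS4HEP enforces.

```python
from importlib.util import find_spec

# Import names of the libraries listed above. The ML frameworks are
# treated as optional here, which is an assumption for illustration.
REQUIRED = ["pyarrow", "uproot", "numpy", "pandas", "modin", "numba"]
OPTIONAL = ["keras", "tensorflow", "sklearn", "torch"]


def missing(modules):
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing(REQUIRED) or "none")
    print("missing optional:", missing(OPTIONAL) or "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.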

Installation

The easiest way to install and run MLaaS4HEP and TFaaS is to use the pre-built Docker images:

# run MLaaS4HEP docker container
docker run veknet/mlaas4hep
# run TFaaS docker container
docker run veknet/tfaas

Reading ROOT files

The MLaaS4HEP Python repository provides two base modules to read and manipulate HEP ROOT files. The reader.py module defines a DataReader class which can read either local or remote ROOT files (via xrootd). The workflow.py module provides a basic DataGenerator class which can be used with any ML framework to read HEP ROOT data in chunks. Both modules are based on the uproot framework.
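
The idea of reading data in chunks can be illustrated without any HEP-specific code. The sketch below is not the DataGenerator class; `read_in_chunks` and `train_incrementally` are hypothetical names, and `update` stands in for a single training step of whatever ML framework is used.

```python
from itertools import islice


def read_in_chunks(events, chunk_size):
    """Yield successive lists of at most chunk_size events.

    `events` can be any iterable, so the full data set never has to be
    held in memory at once. Hypothetical helper, not the MLaaS4HEP API.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    iterator = iter(events)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def train_incrementally(events, chunk_size, update):
    """Feed each chunk to `update` and return the number of events seen."""
    seen = 0
    for chunk in read_in_chunks(events, chunk_size):
        update(chunk)
        seen += len(chunk)
    return seen
```

Because the generator pulls events lazily, memory use is bounded by the chunk size rather than by the size of the input files.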

Basic usage

# setup the proper environment, e.g. 
# export PYTHONPATH=/path/src/python # path to MLaaS4HEP python framework
# export PATH=/path/bin:$PATH # path to MLaaS4HEP binaries

# get help and option description
reader --help

# here is a concrete example of reading a local ROOT file:
reader --fin=/opt/cms/data/Tau_Run2017F-31Mar2018-v1_NANOAOD.root --info --verbose=1 --nevts=2000

# here is an example of reading a remote ROOT file:
reader --fin=root://cms-xrd-global.cern.ch//store/data/Run2017F/Tau/NANOAOD/31Mar2018-v1/20000/6C6F7EAE-7880-E811-82C1-008CFA165F28.root --verbose=1 --nevts=2000 --info

# both of the aforementioned commands produce the following output
First pass: 2000 events, 35.4363200665 sec, shape (2316,) 648 branches: flat 232 jagged
VMEM used: 960.479232 (MB) SWAP used: 0.0 (MB)
Number of events  : 1131872
# flat branches   : 648
...  # followed by a long list of ROOT branches found along with their dimensionality
TrigObj_pt values in [5.03515625, 1999.75] range, dim=21

More examples about using uproot may be found here and here
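
The per-branch summary in the output above reports a value range and a `dim` for a jagged branch. The sketch below shows one way such a summary could be computed and used, under two assumptions that the output itself does not confirm: that `dim` is the largest number of entries any event has in that branch, and that fixed-size inputs are obtained by padding shorter events. The function names and the fill value are hypothetical.

```python
def summarize(branch):
    """Return (min, max, dim) for a jagged branch given as a list of lists."""
    values = [value for event in branch for value in event]
    if not values:
        raise ValueError("branch has no values")
    dim = max(len(event) for event in branch)
    return min(values), max(values), dim


def pad(branch, dim, fill=0.0):
    """Pad or truncate every event to exactly `dim` entries."""
    padded = []
    for event in branch:
        kept = list(event[:dim])
        padded.append(kept + [fill] * (dim - len(kept)))
    return padded
```

With a summary like this, every event of a jagged branch can be turned into a vector of the same length before it is passed to an ML model.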

How to train ML model on HEP ROOT data

HEP data are stored in the ROOT data format. The DataReader class provides access to ROOT files and various APIs to access the HEP data.

A simple example can be found in workflow.py, which executes a full HEP ML workflow: it can read remote files and train ML models on HEP ROOT files.

If you clone the repo and set up your PYTHONPATH, you should be able to run it as simply as:

# setup the proper environment, e.g. 
# export PYTHONPATH=/path/src/python # path to MLaaS4HEP python framework
# export PATH=/path/bin:$PATH # path to MLaaS4HEP binaries

workflow --help

# run the code with a list of LFNs from files.txt and the labels file labels.txt
workflow --files=files.txt --labels=labels.txt

# run pytorch example
workflow --files=files.txt --labels=labels.txt --model=ex_pytorch.py

# run keras example
workflow --files=files.txt --labels=labels.txt --model=ex_keras.py

# cat files.txt
#dasgoclient -query="file dataset=/Tau/Run2018C-14Sep2018_ver3-v1/NANOAOD"
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/069A01AD-A9D0-7C4E-8940-FA5990EDFFCE.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/577AF166-478C-1F40-8E10-044AA4BC0576.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/9A661A77-58AC-0245-A442-8093D48A6551.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/C226A004-077B-7E41-AFB3-6AFB38D1A63B.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/D1E05C97-DB14-3941-86E8-C510D602C0B9.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/6FA4CC7C-8982-DE4C-BEED-C90413312B35.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/282E0083-6B41-1F42-B665-973DF8805DE3.root

# cat labels.txt
1
0
1
0
1
1
1

# run keras example and save the model to an external file
workflow --files=files.txt --labels=labels.txt --model=ex_keras.py --fout=model.pb

workflow.py relies on two JSON files: one contains the parameters for reading ROOT files and the other the specification of the ROOT branches. The latter is generated by reading the ROOT file itself.
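
The files.txt and labels.txt examples above have the same number of entries, which suggests one label per file. The following sketch pairs them up and checks that the counts agree. It is an illustration only: the helper names are hypothetical, skipping `#` lines is an assumption based on the commented dasgoclient line in files.txt, and the optional redirector prefix mirrors the xrootd URL used in the reader example.

```python
def read_lines(text):
    """Return the non-empty lines of `text`, skipping '#' comment lines."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def pair_files_with_labels(files_text, labels_text, redirector=""):
    """Pair each file name with its label, assuming one label per file.

    `redirector` is an optional xrootd prefix placed in front of logical
    file names that start with '/store/'. Hypothetical helper.
    """
    files = read_lines(files_text)
    labels = [int(value) for value in read_lines(labels_text)]
    if len(files) != len(labels):
        raise ValueError(f"{len(files)} files but {len(labels)} labels")
    if redirector:
        files = [redirector + f if f.startswith("/store/") else f for f in files]
    return list(zip(files, labels))
```

Failing early on a count mismatch is cheaper than discovering it after remote files have already been opened.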

How to train a model using other data formats

You may also use workflow.py to train your model on other data formats, e.g. CSV, JSON, or Parquet. The procedure is identical to the one for HEP ROOT files.

# prepare your files.txt and labels.txt files; this example
# uses gzipped JSON files located on HDFS
cat files.txt
hdfs:///path/file1.json.gz
hdfs:///path/file2.json.gz

# optionally define your preprocessing function, see example in ex_preproc.py

# run workflow with your set of files, labels, model and preprocessing function
# and save it into model.pb file
workflow --files=files.txt --labels=labels.txt --model=ex_keras.py --preproc=ex_preproc.py --fout=model.pb

We provide a more comprehensive example here.
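
To give a feel for what a preprocessing function might do, here is a minimal sketch. The interface expected by the `--preproc` option is defined by ex_preproc.py and is not shown in this README, so the function name, the record-as-dict shape, and the `id` attribute below are all assumptions.

```python
import math


def preprocessing(record, drop=("id",), default=0.0):
    """Return a cleaned copy of one data record (a flat dict).

    Drops the attributes named in `drop`, converts the rest to float,
    and replaces missing or non-finite values with `default`.
    Hypothetical example; check ex_preproc.py for the real interface.
    """
    cleaned = {}
    for key, value in record.items():
        if key in drop:
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            number = default
        cleaned[key] = number if math.isfinite(number) else default
    return cleaned
```

Returning a new dict rather than modifying the input keeps the function safe to apply to records that are reused elsewhere.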

HEP resnet

We provide the full code of hep_resnet.py, a basic model based on a ResNet implementation. It can classify images from HEP events, e.g.

hep_resnet.py --fdir=/path/hep_images --flabels=labels.csv --epochs=200 --mdir=models

Here we supply the input directory /path/hep_images, which contains HEP images in a train folder along with a labels.csv file that provides the labels. The model runs for 200 epochs and saves the Keras/TF model into the models output directory.

TFaaS inference server

We provide the inference server in a separate TFaaS repository. It contains a full set of instructions on how to build and set it up.

TFaaS client

To access your ML model in the TFaaS inference server you only need the HTTP protocol. Please see the TFaaS repository for more information.

For convenience we also provide a pure Python client to perform all necessary actions against the TFaaS server. Here is a short description of the available APIs:

# setup url to point to your TFaaS server
url=http://localhost:8083

# create upload json file, which should include
# fully qualified model file name
# fully qualified labels file name
# model name you want to assign to your model file
# fully qualified parameters json file name
# For example, here is a sample of upload json file
{
    "model": "/path/model_0228.pb",
    "labels": "/path/labels.txt",
    "name": "model_name",
    "params":"/path/params.json"
}

# upload given model to the server
tfaas_client.py --url=$url --upload=upload.json

# list existing models in TFaaS server
tfaas_client.py --url=$url --models

# delete given model in TFaaS server
tfaas_client.py --url=$url --delete=model_name

# prepare input json file for querying model predictions
# here is an example of such file
{"keys": ["attribute1", "attribute2"], "values": [1.0, -2.0]}

# get predictions from TFaaS server
tfaas_client.py --url=$url --predict=input.json

# get image predictions from TFaaS server
# here we refer to the ImageModel model previously uploaded to TFaaS
tfaas_client.py --url=$url --image=/path/file.png --model=ImageModel
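
The upload and prediction JSON files used above can also be generated from Python instead of being written by hand, which rules out quoting mistakes. This is a sketch with hypothetical helper names; it only builds the documents with the fields shown in the examples and does not talk to the TFaaS server.

```python
import json


def upload_descriptor(model, labels, name, params):
    """Return the upload JSON document with the four fields listed above."""
    doc = {"model": model, "labels": labels, "name": name, "params": params}
    return json.dumps(doc, indent=4)


def prediction_input(attributes):
    """Return the prediction JSON for a mapping of attribute name to value."""
    keys = list(attributes)
    values = [float(attributes[key]) for key in keys]
    return json.dumps({"keys": keys, "values": values})


if __name__ == "__main__":
    # Write the two files consumed by the --upload and --predict options.
    with open("upload.json", "w") as stream:
        stream.write(upload_descriptor(
            "/path/model_0228.pb", "/path/labels.txt", "model_name", "/path/params.json"))
    with open("input.json", "w") as stream:
        stream.write(prediction_input({"attribute1": 1.0, "attribute2": -2.0}))
```

Building `keys` and `values` from a single mapping guarantees that the two lists have the same length and order.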

Citation

Please use this publication for further citation: DOI: 10.1007/s41781-021-00061-3