Introduction

This repository contains the data sets, code and results reported in the article "Effect of missing data on multitask prediction methods". This material is made available to aid reproduction of our work.

training file: a file URL
testing file: a file URL or a number. If it is a number, it is used as the seed for splitting the data set given in training file into train and test sets.
output_file: a file URL
prediction_mode: whether classification or regression is performed

For a list of most parameters see DNN.py and MacauRegression.py: at the top of each file is a DEFAULT_MAP dictionary that contains all parameters and their default values, with comments explaining their meaning. An example JSON file for DNN and Macau is available in the code folder. If the value of a parameter is a list, and the option to read parameters from a file is not selected, the program will generate random values based on the list at each iteration.
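The handling of list-valued parameters can be pictured with a small sketch. It is not the code from DNN.py or MacauRegression.py: the parameter names `learning_rate` and `hidden_units` and the helper `sample_parameters` are hypothetical, and the sketch assumes that a list is a set of candidate values from which one is picked per iteration. The actual programs may interpret a list differently (for example as a range), so check the comments in DEFAULT_MAP.

```python
import random

# Hypothetical parameter map; the real names and defaults are in DEFAULT_MAP.
params = {
    "prediction_mode": "regression",      # scalar: used as given
    "learning_rate": [0.001, 0.01, 0.1],  # list: assumed candidate values
    "hidden_units": [256, 512, 1024],     # list: assumed candidate values
}


def sample_parameters(param_map, rng):
    """Return a copy of param_map with every list replaced by one random element."""
    return {
        name: rng.choice(value) if isinstance(value, list) else value
        for name, value in param_map.items()
    }


rng = random.Random(0)  # fixed seed so the sampled sets are reproducible
for iteration in range(3):
    # Each iteration trains one model with a freshly sampled parameter set.
    print(iteration, sample_parameters(params, rng))
```

Scalar values pass through unchanged, so only the parameters given as lists vary between iterations.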

Random Forest

Generates random forest models for both the PKIS and HTSFP subsets.

Example usage:

$ python -m missing_data_multitask_methods.RandomForest examples/RF_example.json

Deep Neural Networks

Generates DNNs using TensorFlow. It was used for both the PKIS and HTSFP subsets.

Example usage:

$ python -m missing_data_multitask_methods.DNN examples/DNN_examples.args

Macau Regression

MacauRegression.py: generates Macau models and provides regression results (used on PKIS data)

Macau Classification

MacauClassification.py: generates Macau models simulating a classification procedure (used on HTSFP subsets)

Results

This folder contains the result files that were output by the programs and used in our analysis. It is organized first by dataset and then by technique. Most folders contain 36 files:

10 correspond to the label removal model run on each of the sets of hyperparameters
10 correspond to the compound removal model, again one per set of hyperparameters
16 correspond to the seed variation test, where the first number is the seed of the train/test split and the second number is the seed of the label removal process

The folders for Random Forest contain only 10 files, those of the compound removal model. The DNN and Macau folders for PKIS contain 10 additional files, those of the assay removal model.
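The role of the two seeds in the seed variation test can be sketched as follows. This is an illustration only, not the repository's implementation: the function names `split_train_test` and `remove_labels`, the 20% test fraction, the 50% removal fraction and the toy label matrix are all assumptions made for the example.

```python
import random


def split_train_test(n_compounds, split_seed, test_fraction=0.2):
    """Shuffle compound indices with split_seed and cut off a test set."""
    order = list(range(n_compounds))
    random.Random(split_seed).shuffle(order)
    n_test = round(n_compounds * test_fraction)
    return sorted(order[n_test:]), sorted(order[:n_test])


def remove_labels(labels, removal_seed, fraction):
    """Return a copy of labels with a fraction of the observed values set to None."""
    masked = [row[:] for row in labels]
    observed = [
        (i, j)
        for i, row in enumerate(masked)
        for j, value in enumerate(row)
        if value is not None
    ]
    n_remove = round(len(observed) * fraction)
    for i, j in random.Random(removal_seed).sample(observed, n_remove):
        masked[i][j] = None
    return masked


# Toy label matrix: 10 compounds (rows) x 5 targets (columns).
labels = [[float(5 * i + j) for j in range(5)] for i in range(10)]

# First seed: train/test split. Second seed: label removal on the training set.
train_idx, test_idx = split_train_test(len(labels), split_seed=1)
train_labels = remove_labels([labels[i] for i in train_idx], removal_seed=2, fraction=0.5)
n_missing = sum(value is None for row in train_labels for value in row)
print(len(train_idx), len(test_idx), n_missing)
```

Keeping the two random generators separate means one seed can be varied while the other is held fixed, which is what a grid of split seeds by removal seeds requires.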

Each output file has the following structure. Each line represents a model (DNN or Macau). The first columns hold all parameters of the program. The next column is the time taken to train and evaluate the model. The results of the model are then given per target: for each target, measure values are given for the training and test sets. Regression measures provided: R2 (coefficient of determination), r2 (square of the correlation coefficient), RMSD, MAE. Classification measures provided: precision, recall, fscore (F1-score), mcc.
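For readers unfamiliar with the measures, the sketch below computes them from their textbook definitions. It is not taken from the repository, whose programs may compute or aggregate the values differently. The function names and the toy data are assumptions, and binary labels are assumed to be coded as 0 and 1.

```python
import math


def regression_measures(y_true, y_pred):
    """R2, r2, RMSD and MAE from their standard definitions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return {
        "R2": 1 - ss_res / ss_tot,         # coefficient of determination
        "r2": cov ** 2 / (ss_tot * var_p),  # squared Pearson correlation
        "RMSD": math.sqrt(ss_res / n),
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


def classification_measures(y_true, y_pred):
    """Precision, recall, F1-score and MCC for labels coded as 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fscore": fscore, "mcc": mcc}


# Predictions shifted by a constant: perfectly correlated but systematically off.
reg = regression_measures([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
cls = classification_measures([1, 1, 0, 0], [1, 0, 0, 0])
print(reg)
print(cls)
```

The regression example shows why both R2 and r2 are worth reporting: a constant offset leaves the squared correlation untouched while lowering the coefficient of determination.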
