icghita edited this page Aug 24, 2022 · 20 revisions

Documentation

Contents

  1. Overview
  2. What does this software do?
  3. Input and output files
  4. UI layout
  5. How to use this software

1. Overview

This software uses artificial neural networks to predict the efficiency of antibodies against different strains of viruses by analyzing the structure of their envelope proteins.

2. What does this software do?

Artificial neural networks are statistical models that learn relationships from established patterns and then use those relationships to predict new patterns. This software uses two types of neural networks: feedforward neural networks and self organizing maps. A feedforward neural network is trained in a supervised manner and requires two sets of data: an input dataset and a target dataset, with each entry of the input dataset having a corresponding target entry. Once trained, the network can receive a new entry and generate an output based on the relationships it learned between the input and target sets. A self organizing map requires only an input dataset. It clusters the data by associating each entry with one of its neurons based on common features in the data, so some neurons hold more entries than others. Once trained, it can accept a new input entry and assign it to a neuron, which indicates the data cluster into which the entry fits.

The software can be run through Matlab's GUIDE environment or compiled into an executable.

3. Input and output files

The input file is a .FASTA file and is used for both feedforward neural networks and self organizing maps. This file should store the envelope proteins of the virus strains to be analyzed. FASTA files store each protein as a name followed by a character string that encodes the protein's constituent amino acids. Due to the nature of neural networks, all entries in the FASTA file must have the same length; otherwise, the length of the first entry is taken as the reference and all entries with a different length are ignored.

The target file is a .CSV file and is used only for feedforward neural networks. This file should store values describing the efficiency of antibodies against each virus strain, typically the I50 value (half maximal inhibitory concentration). The first column should contain the names of the virus strains. Each further column should consist of a header with the antibody name and, underneath, the I50 values for each virus strain. Any number of antibody columns is allowed. The software searches for virus strains by name in both the .FASTA and the .CSV file and, when a match is found, associates the two entries with each other for analysis. The names must therefore be the same in both files; entries without a match are ignored.
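
The matching rules above can be sketched in a few lines. This is only an illustration in Python, not the software's Matlab code: the function names are hypothetical, and skipping empty I50 cells is an assumption, since the behaviour for missing values is not described here.

```python
import csv
import io

def read_fasta(text):
    """Parse FASTA text into an ordered list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def keep_reference_length(records):
    """Keep only entries whose length equals the length of the first entry."""
    if not records:
        return []
    reference = len(records[0][1])
    return [(n, s) for n, s in records if len(s) == reference]

def read_targets(text):
    """Read the target CSV: first column is the strain name, the rest are antibodies."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    antibodies = rows[0][1:]
    return {row[0]: dict(zip(antibodies, row[1:])) for row in rows[1:]}

def match(records, targets, antibody):
    """Pair each sequence with its I50 value; strains without a match are skipped."""
    pairs = []
    for name, sequence in records:
        value = targets.get(name, {}).get(antibody, "")
        if value != "":  # assumption: empty cells are treated as missing
            pairs.append((name, sequence, float(value)))
    return pairs

fasta_text = ">s1\nACDE\n>s2\nACD\n>s3\nAC\nDF\n"
csv_text = "strain,ab1,ab2\ns1,0.5,1.0\ns3,2.0,3.0\ns4,9,9\n"
usable = keep_reference_length(read_fasta(fasta_text))
print(match(usable, read_targets(csv_text), "ab1"))
```

In this toy data, `s2` is dropped because its length differs from the first entry, and `s4` is dropped because it has no sequence in the FASTA text.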

Neural networks are stored in a .mat file. A path to a .mat file is required to create neural networks. If the file at the path location does not exist, it will be created. A .mat file can store any number of neural networks of any type (feedforward, SOM, different codifications, different input sizes, etc.).

4. UI layout

The UI has four sections: a "General Parameters" panel, which contains the parameters required for any neural network; a "Feedforward Neural Network Parameters" panel, with the parameters needed only for feedforward networks; a "Self Organizing Map Parameters" panel, with the parameters needed by self organizing maps; and a "Commands and Output" panel, which contains the buttons for the various functions of the software and the output values.

5. How to use this software

a) "General Parameters" & "Self Organizing Map Parameters" panels / Creating a Self Organizing Map

To load protein sequences, go to the General Parameters panel. At the top of this panel is the "Viruses" subpanel, which holds the input data for the virus strains. The first field is the path of the FASTA file containing the viruses, with a browse button for locating the file in the file explorer. After a file is chosen, the list underneath is populated with the names of the virus strains. The second field is a filter for the virus list: only names containing the search string are shown. After selecting an entry from the list, use the "Details" button to open a new window showing the protein sequence in single-letter format, its length and, if the "Show Glycosylation Sites" checkbox is selected, the indexes of its glycosylation sites. The filter only helps locate particular viruses for viewing; the software always uses all the data in the FASTA file.

At the bottom of the panel are the parameters relating to the neural network. The field "ANN Storage Filepath" is the path to the .mat file to store the neural networks. The field "Network Type" selects between "Feedforward Neural Network" and "Self Organizing Map". "Protein Codification" selects between codifications:

  • "A (Numerical)" turns the letter-based representation of the proteins in the FASTA file into an integer representation (1 to 25) and feeds that to the neural network;
  • "B (Raw Properties)" turns each amino acid letter in the FASTA file into 6 numbers representing the properties Volume, Bulkiness, Flexibility, Polarity, Aromaticity and Charge, as used in "Medicinal Protein Engineering" by Yury E. Khudyakov, Chapter 2, https://www.crcpress.com/Medicinal-Protein-Engineering/Khudyakov/p/book/9780849373688. This creates a much more complicated neural network than codification A, with 6 layers of inputs, one for each property, and takes more time to train;
  • "A-6 (Properties codification)" behaves the same as codification B, but replaces the property values with integers from 0 to 10 that approximate them; the network will have 6 layers of inputs;
  • "A-9 (Properties codification)" codifies the six properties into 9 binary values, generating a network with 9 layers of inputs.

When a codification is selected, a multiplier appears next to the "No. of hidden neurons" field in the "Feedforward Neural Network Parameters" panel. It indicates the number of input layers, which is multiplied by the hidden neurons value to give the actual total number of hidden neurons.

The field "Name" gives the network a name for easy recognition. If it is left empty, the current date and time are used as the name. On the bottom left there is a list of numbers, each number being the index of a neural network in the .mat file. Selecting a number displays the details of that network in the table to the right.
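
As a rough illustration of codification A and of the hidden-neuron multiplier, the Python sketch below maps letters to integers from 1 to 25 and multiplies the per-layer hidden neuron count by the number of input layers. The 25-symbol alphabet and its order are assumptions made for the example (the letter order the software actually uses is not documented here), and the multiplier of 1 for codification A is inferred from the fact that it has a single input layer.

```python
# Assumed 25-symbol alphabet: 20 standard amino acids plus B, Z, X, "*" and "-".
# The order used by the software may differ.
ALPHABET = "ARNDCQEGHILKMFPSTWYVBZX*-"
CODE_A = {letter: index + 1 for index, letter in enumerate(ALPHABET)}

# Number of input layers per codification, as listed above
# (codification A is assumed to count as a single input layer).
INPUT_LAYERS = {"A": 1, "B": 6, "A-6": 6, "A-9": 9}

def encode_a(sequence):
    """Turn a single-letter protein sequence into integers from 1 to 25."""
    return [CODE_A[letter] for letter in sequence.upper()]

def total_hidden_neurons(codification, hidden_per_layer):
    """Total hidden neurons = value typed in the field x the multiplier."""
    return INPUT_LAYERS[codification] * hidden_per_layer

print(encode_a("ARN-"))
print(total_hidden_neurons("A-9", 10))
```

With these assumptions, entering 10 hidden neurons with codification A-9 gives 90 hidden neurons in total, while codification A keeps the value unchanged.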

Next, go to the Self Organizing Map Parameters panel. "Map Topology" chooses the way the neurons of the map are arranged: hexagonal, rectangular or random. "Neuron distance fcn." is the function used to calculate the distance between neurons; values it can take and references with details:

"Map Width" and "Map Height" are the width and height of the map counted in adjacent neurons. "No. of training steps" is the number of times the training algorithm is run (100 is a good starting value). "Initial neighborhood size" is the initial number of neighboring neurons each neuron has. This neighborhood means that when the weight of a neuron is changed, so are the weights of its neighbors. As the algorithm progresses, the neighborhood size shrinks.

Afterwards, click "Create ANN" in the "Commands and Output" panel.

New windows will open with relevant information: one with a diagram of the neurons and their cluster sizes (how many entries are associated with each neuron), and one with a table in which each column lists the virus strains associated with a neuron, with the neuron's index as the column header and the columns sorted by cluster size. The newly created network is saved in the .mat file and can be seen in the table. The table contains the network type, its input size (the length of the virus proteins it was trained on), its performance (mean squared error; a lower value means better performance), its codification, the antibody used to train it, its class values (if available) and the I50 limits (the minimum and maximum of the I50 values used to train the network). The performance and the last three fields apply only to feedforward networks.
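
The layout of the cluster table can be sketched as follows. This is a minimal Python illustration, not the software's code: `cluster_columns` is a hypothetical name, and breaking ties between equally sized clusters by neuron index is an assumption.

```python
from collections import defaultdict

def cluster_columns(assignments):
    """Group strain names by neuron index and sort the columns by cluster size.

    assignments: iterable of (strain_name, neuron_index) pairs.
    Returns a list of (neuron_index, [strain names]), largest cluster first.
    """
    clusters = defaultdict(list)
    for strain, neuron in assignments:
        clusters[neuron].append(strain)
    # Largest cluster first; ties broken by neuron index (an assumption).
    return sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))

assignments = [("a", 3), ("b", 1), ("c", 3), ("d", 2)]
for neuron, strains in cluster_columns(assignments):
    print(neuron, strains)
```

Each returned pair corresponds to one column of the table: the neuron index is the header and the strain names are the column contents.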

b) "Feedforward Neural Network Parameters" panel / Creating a Feedforward Neural Network

After filling in the parameters in the first panel, the second panel must also be filled in to create a feedforward network. The top subpanel, "Antibodies", is similar to "Viruses": the filepath field receives the path of the Excel file containing the antibody neutralization data, and the values of each antibody can be viewed by selecting its name and clicking the "Details" button. "Coverage Plot" shows the cumulative frequencies for all the antibodies on a semilogarithmic scale. Only the antibody selected in the list is fed into the network as the target.

The "No. of ANN iterations" field accepts an integer which specifies how many times the neural network should be trained. Out of all the training attempts, the one with the best performance (the minimum mean squared error) is saved in the storage file. The "No. of hidden neurons" field represents:

  • for codification A (Numerical) - the total number of neurons in the hidden layer; a higher number usually means better network performance;
  • for other codifications - the number of neurons in the hidden layer of each input layer; the total number of hidden neurons will be the input value times the multiplier shown next to the field, which is determined by the selected codification.

The "Training function" chooses the function used to update the weights of the network. For details see https://www.mathworks.com/help/nnet/ug/train-and-apply-multilayer-neural-networks.html. "Use Parallel Computing" allows the use of multiple CPU cores, if available. "Use GPU" allows the use of the graphics card, if available; in that case, however, the training function will be set to Scaled Conjugate Gradient.

"Data Division" specifies how the data will be split, in percentages, for use in the algorithm. The data from 0% to the first input value will be used for training, the data from the first input value to the second will be used for validation, and the rest for testing.

The "I50 Classes" subpanel allows the I50 values in the Excel file to be converted into three custom classes, which are then fed into the network. The first class consists of the values less than the value entered in the first field and is denoted by "0", the second class consists of the values between the first and second values and is denoted by "0.5", and the last class consists of the remaining values and is denoted by "1".

After the network is created, new windows will open containing plots with the regression analysis of the virus and antibody data.
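
The "I50 Classes" conversion and the "Data Division" boundaries can be illustrated with a short Python sketch. It rests on two assumptions that the text above does not settle: a value exactly equal to a limit is placed in the higher class, and the data is split sequentially in its existing order. The function names are hypothetical.

```python
def i50_class(value, first_limit, second_limit):
    """Map an I50 value to one of the three classes 0, 0.5 or 1."""
    if value < first_limit:
        return 0.0
    if value < second_limit:  # assumption: a value equal to a limit goes to the higher class
        return 0.5
    return 1.0

def divide(entries, train_end_pct, validation_end_pct):
    """Split entries at two cumulative percentage boundaries.

    0% .. train_end_pct            -> training
    train_end_pct .. validation_end_pct -> validation
    validation_end_pct .. 100%     -> testing
    Assumption: the split is sequential, in the order the entries are given.
    """
    count = len(entries)
    train_end = count * train_end_pct // 100
    validation_end = count * validation_end_pct // 100
    return entries[:train_end], entries[train_end:validation_end], entries[validation_end:]

print([i50_class(v, 1.0, 10.0) for v in (0.2, 1.0, 5.0, 50.0)])
train, validation, test = divide(list(range(20)), 70, 85)
print(len(train), len(validation), len(test))
```

With boundaries of 70 and 85, twenty entries are split into 14 for training, 3 for validation and 3 for testing.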

c) "Commands and Output" panel / Analysis tools

The last panel contains tools to be used on already trained networks in the mat file. "View ANN" shows the structure of the network. "Use ANN" uses the selected network with the selected virus strain as input (from the "Viruses" subpanel in "General Parameters" panel) and generates the output in the filepath specified above. The length of the input virus strain protein must be the same as the input size of the network (which is the length of the protein data that was used to train it originally). The output will be:

  • for feedforward networks; the neuron index for self organizing maps;
  • on the right - only for feedforward networks: the renormalized value for networks without classes (the input data is normalized when fed to the network and so is the output), the class for networks with classes.

For feedforward networks the following commands can also be used: "Regression Plot" shows the regression for the selected network using the original data used to train it.

"Sensitivity Analysis" can be used to analyse the impact of each input neuron on the performance. It will set the first input neuron to 0 (if the network uses codification A) or the corresponding input of each of the input layers to 0 (for the other codifications) and use the data in the FASTA and Excel files for the other inputs. The new performance is subtracted from the original performance and the difference is plotted for this neuron. The process is repeated for each of the network's inputs.
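
The procedure can be sketched as below for the single-input-layer case (codification A). This is a Python illustration under stated assumptions, not the software's implementation: `predict` stands in for a trained network, the toy model and data are invented for the example, and performance is taken to be the mean squared error.

```python
def mse(predict, inputs, targets):
    """Mean squared error of a prediction function over a dataset."""
    errors = [(predict(x) - t) ** 2 for x, t in zip(inputs, targets)]
    return sum(errors) / len(errors)

def sensitivity(predict, inputs, targets):
    """For each input position, zero it and return original minus new performance."""
    baseline = mse(predict, inputs, targets)
    differences = []
    for i in range(len(inputs[0])):
        # Copy every entry with input i forced to 0; other inputs keep their data.
        zeroed = [x[:i] + [0.0] + x[i + 1:] for x in inputs]
        differences.append(baseline - mse(predict, zeroed, targets))
    return differences

# Hypothetical stand-in for a trained network: only the first input matters.
def toy_predict(x):
    return 2.0 * x[0] + 0.0 * x[1]

inputs = [[1.0, 5.0], [2.0, 3.0], [3.0, 1.0]]
targets = [2.0, 4.0, 6.0]
print(sensitivity(toy_predict, inputs, targets))
```

In the toy example the difference for the second input is zero because the model ignores it, while zeroing the first input raises the error, so its difference is negative.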

For self organizing maps the following commands can be used: "Plot SOM Hits" shows the map diagram with the size of each neuron (how many inputs have been assigned to it). The neurons are numbered from bottom left to top right, going to the right.
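
The numbering scheme can be made concrete with a small Python sketch. It assumes 1-based neuron indices that run left to right along each row, starting with the bottom row; the function names are hypothetical.

```python
def neuron_position(index, map_width):
    """Return (row, column), both 0-based, where row 0 is the bottom row."""
    return divmod(index - 1, map_width)

def grid_rows(map_width, map_height):
    """Return the neuron indices row by row, top row first, as they appear in the plot."""
    rows = []
    for row in range(map_height - 1, -1, -1):
        start = row * map_width + 1
        rows.append(list(range(start, start + map_width)))
    return rows

for row in grid_rows(3, 2):
    print(row)
print(neuron_position(4, 3))
```

For a map with width 3 and height 2, neurons 1 to 3 form the bottom row and neurons 4 to 6 the top row.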

"View Clusters" shows a table where each column represents a neuron, with the header being the neuron's index (as arranged in "Plot SOM Hits") and the contents of the column being the virus strains assigned to that neuron. The neuron columns are sorted in descending order by the number of strains assigned.