This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.
This dataset contains data about direct marketing campaigns (from May 2008 to November 2010) of a Portuguese banking institution. The goal is to predict whether the client will subscribe to a term deposit, indicated in the variable 'y' (Yes = 1, No = 0).
The best performing model was created by the Voting Ensemble algorithm (PreFittedSoftVotingClassifier) generated by AutoML, with an accuracy score of 0.9177975287231737. This model showed a 0.82% improvement over the model created by the HyperDrive method (Accuracy: 0.9103692463328276, Regularization Strength: 0.9072469401405283, Max iterations: 150).
Pipeline Architecture
The pipeline consists of data preparation, training, and test stages.
Data Preparation In this stage, the CSV file was downloaded as a dataset and converted to a Pandas DataFrame. The data was then cleaned, one-hot encoded, and split into two DataFrames: the feature variables and the target variable.
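As a rough sketch of this stage (the columns and values below are a tiny illustrative stand-in, not the real dataset schema), the one-hot encoding and feature/target split could look like:

```python
import pandas as pd

# Illustrative miniature of the bank marketing frame; the real columns differ.
df = pd.DataFrame({
    'age': [34, 41, 29],
    'job': ['admin.', 'technician', 'admin.'],
    'marital': ['married', 'single', 'single'],
    'y': ['yes', 'no', 'no'],
})

# One-hot encode the categorical columns.
df = pd.get_dummies(df, columns=['job', 'marital'])

# Split into the target variable (Yes = 1, No = 0) and the feature variables.
y = df.pop('y').map({'yes': 1, 'no': 0})
x = df
```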
Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.
Classifier
Logistic Regression from the Scikit-Learn library was used to demonstrate the HyperDrive approach.
Training configuration using HyperDrive Package
Hyperparameters to optimise:
C - regularisation strength
max_iter - maximum number of iterations required for the classifier to converge
The parameter search space:
'C': uniform(0.1, 1)
'max_iter': choice(50, 100, 150, 200)
Sampling method: RandomParameterSampling - random search strategy to find the values
Primary metric to optimise: Accuracy
Early termination policy: BanditPolicy(slack_factor=0.1, evaluation_interval=1, delay_evaluation=5)
Primary metric goal: PrimaryMetricGoal.MAXIMIZE
Max total runs: 100
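Under the azureml-sdk v1 API, the configuration above can be sketched as follows; this is an illustrative fragment, and the `estimator` wrapping the training script is assumed to exist already:

```python
from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig,
                                      PrimaryMetricGoal,
                                      RandomParameterSampling, choice, uniform)

# Random search over the two hyperparameters passed to the training script.
param_sampling = RandomParameterSampling({
    '--C': uniform(0.1, 1),
    '--max_iter': choice(50, 100, 150, 200),
})

# Terminate runs that fall outside the 10% slack of the best run so far.
policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1,
                      delay_evaluation=5)

hyperdrive_config = HyperDriveConfig(
    run_config=estimator,  # assumed: an estimator/ScriptRunConfig for train.py
    hyperparameter_sampling=param_sampling,
    policy=policy,
    primary_metric_name='Accuracy',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=100,
)
```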
Training and Test
The data was split into train (70%) and test (30%) datasets. We optimised hyperparameters by fitting multiple models with different hyperparameters on the train set and validating the models using the test set. The best run was selected and saved.
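In sketch form, one HyperDrive trial amounts to a split, a fit with the sampled hyperparameters, and a validation score; the data below is a synthetic stand-in, since the real features come from the cleaned dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features/target of the bank dataset.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

# 70/30 train/test split, as in the pipeline.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)

# One trial: fit with a sampled (C, max_iter) pair and validate on the test set.
model = LogisticRegression(C=0.9, max_iter=150).fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
```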
What are the benefits of the parameter sampler you chose? RandomParameterSampling is faster than grid sampling because it randomly picks hyperparameter values from the defined search space. It also lets users refine the search later based on the initial results.
What are the benefits of the early stopping policy you chose? Bandit stops runs whose primary metric is not within the slack amount of the best performing run. For example, in this experiment, after interval 5 any run whose best metric is less than 1/(1+0.1), or about 91%, of the best performing run will be terminated.
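The cutoff arithmetic can be checked directly, using the (rounded) best accuracy reported above:

```python
# BanditPolicy terminates a run once its best metric drops below
# best_metric / (1 + slack_factor), i.e. roughly 91% of the best run here.
slack_factor = 0.1
best_accuracy = 0.9178  # best run so far, rounded from the results above

cutoff = best_accuracy / (1 + slack_factor)
print(round(cutoff, 4))  # → 0.8344
```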
There are two other stopping policies: Median stopping policy and Truncation selection policy.
Median stopping is based on running averages of the primary metric reported by the runs. With delay_evaluation=5, after interval 5 this policy terminates any run whose best metric is worse than the median of the running averages over intervals 1:5 across all training runs. In my opinion, this policy is slower because it needs to compute running averages across all training runs, although it can be used for less aggressive savings without terminating promising jobs.
Truncation selection terminates the lowest-performing percentage of runs at each evaluation interval. For example, with truncation_percentage=10 a run is terminated if it is in the lowest 10% of performance of the previous runs. This policy becomes more aggressive the greater the value you choose for truncation_percentage.
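For comparison, both alternative policies live in the same azureml-sdk v1 package and could be sketched as drop-in replacements for BanditPolicy in the HyperDrive configuration (an illustrative fragment, not the configuration actually used):

```python
from azureml.train.hyperdrive import (MedianStoppingPolicy,
                                      TruncationSelectionPolicy)

# Stop runs whose best metric falls below the median of the running averages.
median_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

# Cancel the lowest 10% of runs at each evaluation interval.
truncation_policy = TruncationSelectionPolicy(
    truncation_percentage=10, evaluation_interval=1, delay_evaluation=5)
```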
Reference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters
The best performing model generated by AutoML was using the Voting Ensemble Classifier algorithm (PreFittedSoftVotingClassifier), which combines multiple models to produce a better result compared to a single model.
The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Reference: https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier
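A minimal scikit-learn illustration of soft voting (using arbitrary base estimators and synthetic data, not the models AutoML actually selected):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# voting='soft' averages the predicted class probabilities of the members
# instead of taking a majority vote on their hard predictions.
ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=500)),
                ('dt', DecisionTreeClassifier(max_depth=3, random_state=0)),
                ('nb', GaussianNB())],
    voting='soft',
)
ensemble.fit(X, y)
accuracy = ensemble.score(X, y)
```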
| Voting Ensemble (AutoML) | LogisticRegressionCV (HyperDrive) |
|---|---|
| Accuracy: 0.9177975287231737 | Accuracy: 0.9103692463328276 |
| boosting_type='gbdt' | Cs=0.9072469401405283 |
| class_weight=None | Max iterations=150 |
| colsample_bytree=1.0 | |
| importance_type='split' | |
| learning_rate=0.1 | |
| max_depth=-1 | |
| min_child_samples=20 | |
| min_child_weight=0.001 | |
| min_samples_leaf=0.01 | |
| min_samples_split=0.01 | |
| min_weight_fraction_leaf=0.0 | |
| n_estimators=25 | |
The HyperDrive and AutoML approaches produced similar results (accuracies of 0.9103692463328276 and 0.9177975287231737 respectively); the improvement from using AutoML was only 0.82%, but I would still recommend it. In the HyperDrive method the user needs to develop the data preparation, training, and validation stages, including specifying the ranges of hyperparameters to be used in the experiment. This can delay delivery of the final model, since the user needs to test different ranges if the result is not satisfactory. AutoML selects estimators, performs feature engineering, and chooses hyperparameters, saving development time. With AutoML the user can start the search by identifying the best models and then focus on the best metrics for the use case.
For future work it will be necessary to test the methods using other metrics to get more reliable predictions, for example Recall, F1 Score, or weighted AUC; depending on the use case and how balanced (or imbalanced) the data is, accuracy may not be the best metric. It is also possible to try other algorithms with HyperDrive, such as the Voting Ensemble Classifier that AutoML identified as the best performer, to verify whether further improvements can be achieved; it is possible AutoML did not identify the best hyperparameters in the time it was given.