Automate Hyperparameter Tuning and Experiment Tracking for Cloud-Based Training
If you are a machine learning practitioner, you probably know how tedious and time-consuming it can be to find the optimal set of hyperparameters for your model.
Hyperparameters are the configuration variables that affect the performance and behavior of your model, such as learning rate, batch size, number of layers, etc. Tuning them manually can be a trial-and-error process that requires a lot of experimentation and intuition.
Fortunately, there are tools and techniques that can help you automate this process and save you time and resources.
In this article, I will show you how to use Azure Machine Learning (Azure ML) to perform hyperparameter tuning and experiment tracking for cloud-based training.
While Azure ML offers many features, I will focus on two of them: HyperDrive and MLflow.
HyperDrive is a service that allows you to run hyperparameter tuning experiments on Azure ML compute clusters. It supports various sampling methods, such as random, grid, or Bayesian sampling, and various early termination policies, such as median stopping or bandit.
MLflow is an open-source platform that enables you to track, compare, and manage your machine learning experiments. It integrates seamlessly with Azure ML and allows you to log metrics, parameters, artifacts, and models from your experiments.
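As a quick taste of what MLflow tracking looks like, here is a minimal, illustrative sketch (the parameter and metric names and values are made up for this example; later in the tutorial we will rely on MLflow's autologging instead):
# Minimal MLflow tracking sketch (illustrative values only)
import mlflow
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)  # log a hyperparameter
    mlflow.log_metric("accuracy", 0.97)      # log a result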
To demonstrate how to use these tools, I will use a simple example of training a convolutional neural network (CNN) to classify images of handwritten digits from the MNIST dataset. The MNIST dataset is a classic benchmark for image recognition and consists of 60,000 training images and 10,000 test images of digits from 0 to 9. The goal is to find the best combination of hyperparameters that can achieve the highest accuracy on the test set.
The steps involved in this tutorial are:
- Create an Azure ML workspace and a compute cluster
- Upload the MNIST dataset to Azure ML datastore
- Define the CNN model and the training script
- Configure the HyperDrive experiment
- Run the HyperDrive experiment and monitor the results
- Analyze the best run and register the best model
Let's get started!
Step 1: Create an Azure ML workspace and a compute cluster
Before we can use Azure ML services, we need to create an Azure ML workspace. A workspace is a cloud resource that contains all the assets and configurations related to your machine learning projects, such as datasets, models, experiments, compute targets, etc. You can create a workspace either through the Azure portal or using Python SDK.
For this tutorial, I will use Python SDK to create a workspace. To do so, you need to have an Azure subscription and install Azure ML SDK on your local machine or notebook environment.
The following code snippet shows how to create a workspace using Python SDK:
# Import azureml.core module
from azureml.core import Workspace
# Specify subscription ID, resource group name, workspace name, and region
subscription_id = "<your-subscription-id>"
resource_group = "<your-resource-group-name>"
workspace_name = "<your-workspace-name>"
workspace_region = "<your-workspace-region>"
# Create workspace object
ws = Workspace.create(name=workspace_name,
                      subscription_id=subscription_id,
                      resource_group=resource_group,
                      location=workspace_region,
                      exist_ok=True)
# Print workspace details
ws.get_details()
The exist_ok parameter in the Workspace.create method allows you to reuse an existing workspace if it has the same name as the one you specified; otherwise, a new one is created. The ws.get_details() method prints out information about your workspace, such as its ID, location, and type.
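If you plan to return to this workspace in later sessions, a convenient (optional) pattern is to save its configuration locally and reload it with Workspace.from_config(), so you do not have to re-enter the subscription details each time:
# Save the workspace configuration to a local .azureml/config.json
ws.write_config(path=".azureml")
# Later, reconnect to the same workspace in a new session
from azureml.core import Workspace
ws = Workspace.from_config(path=".azureml")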
After creating a workspace, we need to create a compute cluster that we will use to run our hyperparameter tuning experiments. A compute cluster is a scalable set of virtual machines that can execute your machine learning tasks. You can create a compute cluster either through the Azure portal or using Python SDK.
I will use Python SDK to create a compute cluster.
# Import the compute modules
from azureml.core.compute import ComputeTarget, AmlCompute
# Specify cluster name, VM size, and maximum number of nodes
cluster_name = "<your-cluster-name>"
vm_size = "STANDARD_D2_V2"
max_nodes = 4
# Create compute configuration object
compute_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                       max_nodes=max_nodes)
# Create compute target object
compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
# Wait for the cluster to be provisioned
compute_target.wait_for_completion(show_output=True)
The AmlCompute.provisioning_configuration method defines the VM size and the maximum number of nodes that the cluster can scale up to. The ComputeTarget.create method creates a new compute cluster in your workspace with the specified name and configuration, and the compute_target.wait_for_completion method waits for the cluster to be provisioned and prints out its status.
You can check the details of your compute cluster by using the compute_target.get_status() method or by visiting the Azure portal.
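In practice, a common pattern is to reuse an existing cluster rather than always provisioning a new one; a minimal sketch:
# Reuse the cluster if it already exists, otherwise provision it
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, reusing it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size=vm_size, max_nodes=max_nodes)
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)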
Now that we have a workspace and a compute cluster, we are ready to upload our dataset and define our model.
Step 2: Upload the MNIST dataset to Azure ML datastore
The MNIST dataset is a public dataset that can be downloaded from various sources, such as Kaggle or TensorFlow. However, to use it with Azure ML, we need to upload it to an Azure ML datastore. A datastore is a cloud storage service that can store and access data for your machine learning projects. Azure ML supports various types of datastores, such as Azure Blob Storage, Azure File Share, Azure Data Lake Storage, etc.
For this tutorial, I will use the default datastore that is automatically created when you create a workspace. The default datastore is an Azure Blob Storage account associated with your workspace. You can access it by using the ws.get_default_datastore() method.
To upload the MNIST dataset to the default datastore, we first need to download it and save it as a zip file on our local machine or notebook environment.
# Import numpy, tensorflow, and zipfile modules
import numpy as np
import tensorflow as tf
import zipfile
# Load the MNIST dataset from tensorflow
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Save the training and test data as numpy arrays
np.save("x_train.npy", x_train)
np.save("y_train.npy", y_train)
np.save("x_test.npy", x_test)
np.save("y_test.npy", y_test)
# Create a zip file containing the data files
with zipfile.ZipFile("mnist.zip", "w") as zf:
    zf.write("x_train.npy")
    zf.write("y_train.npy")
    zf.write("x_test.npy")
    zf.write("y_test.npy")
The tf.keras.datasets.mnist.load_data() method downloads and loads the MNIST dataset as numpy arrays. The np.save method saves the arrays as binary files with the .npy extension, and the zipfile.ZipFile class creates a zip file object that can write files to a compressed archive. Conveniently, a zip archive of .npy files is exactly numpy's .npz format, so np.load will be able to read it back directly in the training script.
After creating the zip file, we can upload it to the default datastore by using the datastore.upload_files method. The following code snippet shows how to do so:
# Import azureml.core module
from azureml.core import Datastore
# Get the default datastore object
datastore = ws.get_default_datastore()
# Upload the zip file to the default datastore
datastore.upload_files(files=["mnist.zip"],
                       target_path="mnist",
                       overwrite=True,
                       show_progress=True)
The datastore.upload_files method uploads one or more files from your local machine or notebook environment to the specified path in the datastore. The overwrite parameter allows you to overwrite any existing files with the same name, and the show_progress parameter prints out the upload progress.
You can check the uploaded files by using the datastore.path method or by visiting the Azure portal.
Now that we have uploaded our dataset to the datastore, we need to register it as a dataset in our workspace. A dataset is an abstraction that represents a pointer to your data source. It allows you to access and manipulate your data without having to download it or load it into memory. You can create a dataset either from files or from tabular data.
For this tutorial, I will create a dataset from files by using the Dataset.File.from_files method. The following code snippet shows how to do so:
# Import azureml.core module
from azureml.core import Dataset
# Create a dataset from files in the datastore
dataset = Dataset.File.from_files(path=(datastore, "mnist/mnist.zip"))
# Register the dataset in the workspace
dataset = dataset.register(workspace=ws,
                           name="mnist",
                           description="MNIST dataset",
                           create_new_version=True)
# Print the file paths of the dataset
dataset.to_path()
The Dataset.File.from_files method creates a file dataset object from one or more file paths in a datastore or a public URL. The dataset.register method registers the dataset in your workspace with the specified name and description; the create_new_version parameter creates a new version of the dataset if one with that name already exists. The dataset.to_path method returns the file paths of the dataset.
You can check the registered datasets by using the Dataset.get_all method or by visiting the Azure portal.
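In a later session or script, you can retrieve the registered dataset by name, for example:
# Retrieve the registered dataset (latest version by default)
from azureml.core import Dataset
mnist_dataset = Dataset.get_by_name(ws, name="mnist")
print(mnist_dataset.name, mnist_dataset.version)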
We have now uploaded and registered our MNIST dataset in our workspace. We are ready to define our CNN model and our training script.
Step 3: Define the CNN model and the training script
The CNN model that we will use for this tutorial is a simple one that consists of two convolutional layers, two max pooling layers, a flatten layer, and a dense layer. The convolutional layers apply filters to the input images and produce feature maps that capture the spatial patterns in the images. The max pooling layers reduce the size of the feature maps and extract the most important features. The flatten layer reshapes the feature maps into a one-dimensional vector that can be fed to the dense layer. The dense layer is a fully connected layer that outputs the class probabilities for each digit.
The following code snippet shows how to define the CNN model using TensorFlow and Keras:
# Import tensorflow and keras modules
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Define the CNN model
def create_model():
    # Create a sequential model
    model = keras.Sequential()
    # Add a convolutional layer with 32 filters, 3x3 kernel size, ReLU activation, and input shape of 28x28x1
    model.add(layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)))
    # Add a max pooling layer with 2x2 pool size
    model.add(layers.MaxPooling2D((2, 2)))
    # Add another convolutional layer with 64 filters, 3x3 kernel size, and ReLU activation
    model.add(layers.Conv2D(64, (3, 3), activation="relu"))
    # Add another max pooling layer with 2x2 pool size
    model.add(layers.MaxPooling2D((2, 2)))
    # Add a flatten layer
    model.add(layers.Flatten())
    # Add a dense layer with 10 units and softmax activation
    model.add(layers.Dense(10, activation="softmax"))
    # Return the model
    return model
The keras.Sequential class creates a sequential model that stacks layers one after another. The layers.Conv2D class creates a convolutional layer with the specified number of filters, kernel size, activation function, and input shape. The layers.MaxPooling2D class creates a max pooling layer with the specified pool size. The layers.Flatten class creates a flatten layer that reshapes the input into a one-dimensional vector, and the layers.Dense class creates a dense layer with the specified number of units and activation function.
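To sanity-check the architecture, you can instantiate the model and print its layer-by-layer summary:
# Inspect the layer output shapes and parameter counts
model = create_model()
model.summary()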
After defining the model, we need to write a training script, train.py, that will use the model to train on the MNIST dataset. The training script will also use MLflow to log metrics and parameters from the training process.
# Import modules
import os
import argparse
import numpy as np
import tensorflow as tf
import mlflow
import mlflow.tensorflow
# Enable autologging for mlflow
mlflow.tensorflow.autolog()
# Parse arguments (hyperdrive appends the sampled hyperparameter values as arguments)
parser = argparse.ArgumentParser()
parser.add_argument("--data_path", type=str, default="mnist", help="Folder containing mnist.zip")
parser.add_argument("--learning_rate", type=float, default=0.01, help="Learning rate")
parser.add_argument("--batch_size", type=int, default=32, help="Batch size")
parser.add_argument("--epochs", type=int, default=10, help="Number of epochs")
args = parser.parse_args()
# Set hyperparameters
learning_rate = args.learning_rate
batch_size = args.batch_size
epochs = args.epochs
# Load the MNIST dataset from the zip file (a zip of .npy files is numpy's .npz format)
with np.load(os.path.join(args.data_path, "mnist.zip")) as data:
    x_train = data["x_train"]
    y_train = data["y_train"]
    x_test = data["x_test"]
    y_test = data["y_test"]
# Normalize and reshape the data
x_train = x_train / 255.0
x_test = x_test / 255.0
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)
# Create the CNN model (create_model() from Step 3 is defined earlier in this train.py file)
model = create_model()
# Compile the model with Adam optimizer, sparse categorical crossentropy loss, and accuracy metric
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])
# Train the model on the training data and validate on the test data
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))
# Evaluate the model on the test data and print the results
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test loss: {test_loss}")
print(f"Test accuracy: {test_acc}")
# Save the model as a TensorFlow SavedModel
model.save("outputs/model")
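Before submitting the script to the cloud, it is worth running a quick local smoke test, for example python train.py --data_path . --learning_rate 0.01 --batch_size 32 --epochs 1 (assuming mnist.zip sits in the current directory), to catch import and path errors early.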
We have now defined our CNN model and our training script. We are ready to configure our HyperDrive experiment.
Step 4: Configure the HyperDrive experiment
A HyperDrive experiment is an experiment that runs multiple child runs with different hyperparameter values and finds the best run based on a primary metric. To configure a HyperDrive experiment, we need to specify four components:
- A hyperparameter sampling method
- A primary metric and a goal
- An early termination policy
- A run configuration
The hyperparameter sampling method defines how to sample the hyperparameter values from a predefined search space. Azure ML supports various sampling methods, such as random sampling, grid sampling, or Bayesian sampling. For this tutorial, I will use random sampling, which randomly selects the hyperparameter values from a uniform or discrete distribution.
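For reference, the other two built-in strategies are configured analogously; a minimal sketch (grid sampling requires a purely discrete search space, and note that Bayesian sampling does not support early termination policies):
from azureml.train.hyperdrive import GridParameterSampling, BayesianParameterSampling, choice
# Grid sampling: exhaustively tries every combination of discrete values
grid_sampling = GridParameterSampling({"--batch_size": choice(16, 32, 64)})
# Bayesian sampling: picks new values based on how previous runs performed
bayesian_sampling = BayesianParameterSampling({"--batch_size": choice(16, 32, 64)})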
The primary metric and the goal define how to evaluate and compare the child runs. The primary metric is the name of a metric logged by the training script, such as accuracy or loss, and the goal is either to maximize or minimize it. Since our script validates on the test set, I will use val_accuracy (the validation accuracy logged by MLflow autologging) as the primary metric and maximize as the goal.
The early termination policy defines how to stop the child runs that are not promising and save resources. Azure ML supports various early termination policies, such as median stopping policy or bandit policy. For this tutorial, I will use median stopping policy, which stops a run if its primary metric is worse than the median of the running averages of all runs.
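As an alternative, a bandit policy stops any run whose primary metric falls outside a slack ratio of the best run so far; a minimal sketch:
from azureml.train.hyperdrive import BanditPolicy
# Stop runs more than 10% worse than the current best, checked at every metric report
bandit_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1)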
The run configuration defines how to execute the child runs on the compute target. It specifies the script name, the arguments, the environment, and the compute target for each run. For this tutorial, I will use a TensorFlow environment that has TensorFlow, MLflow, and the azureml-mlflow integration package installed, and the compute cluster that we created earlier.
The following code snippet shows how to configure a HyperDrive experiment using Python SDK:
# Import modules
from azureml.core import ScriptRunConfig, Environment
from azureml.core.runconfig import DEFAULT_CPU_IMAGE
from azureml.train.hyperdrive import HyperDriveConfig, RandomParameterSampling, PrimaryMetricGoal, MedianStoppingPolicy, uniform, choice
# Define the hyperparameter search space
param_sampling = RandomParameterSampling({
    "--learning_rate": uniform(0.001, 0.1),
    "--batch_size": choice(16, 32, 64),
    "--epochs": choice(5, 10, 15)
})
# Define the primary metric and the goal
primary_metric_name = "val_accuracy"
primary_metric_goal = PrimaryMetricGoal.MAXIMIZE
# Define the early termination policy
early_termination_policy = MedianStoppingPolicy(evaluation_interval=1)
# Create a TensorFlow environment
tf_env = Environment(name="tf_env")
tf_env.python.user_managed_dependencies = False
tf_env.docker.enabled = True
tf_env.docker.base_image = DEFAULT_CPU_IMAGE
tf_env.python.conda_dependencies.add_pip_package("tensorflow==2.4.1")
tf_env.python.conda_dependencies.add_pip_package("mlflow")
# Create a script run configuration; hyperdrive appends the sampled hyperparameter arguments automatically
src = ScriptRunConfig(source_directory=".",
                      script="train.py",
                      arguments=["--data_path", dataset.as_mount()],  # mounted folder containing mnist.zip
                      compute_target=compute_target,
                      environment=tf_env)
# Create a hyperdrive configuration
hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling,
                                     primary_metric_name=primary_metric_name,
                                     primary_metric_goal=primary_metric_goal,
                                     max_total_runs=20,
                                     max_concurrent_runs=4,
                                     policy=early_termination_policy)
We have now configured our HyperDrive experiment. We are ready to run it and monitor the results.
Step 5: Run the HyperDrive experiment and monitor the results
To run the HyperDrive experiment, we need to create an experiment object and submit the HyperDrive configuration to it. The experiment object represents a named collection of runs in your workspace. The following code snippet shows how to do so:
# Import azureml.core module
from azureml.core import Experiment
# Create an experiment object
experiment = Experiment(workspace=ws, name="hyperdrive_mnist")
# Submit the hyperdrive configuration to the experiment
hyperdrive_run = experiment.submit(hyperdrive_config)
The Experiment class creates an experiment object with the specified workspace and name. The experiment.submit method submits the HyperDrive configuration to the experiment and returns a run object that represents the parent run of the HyperDrive experiment.
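If you prefer to block the notebook until all child runs finish, you can wait on the parent run:
# Block until the hyperdrive parent run (and all its child runs) completes
hyperdrive_run.wait_for_completion(show_output=True)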
To monitor the results of the HyperDrive experiment, we can use various methods and tools, such as:
- The RunDetails widget
- The Azure portal
- The MLflow UI
The RunDetails widget is a Jupyter widget that can display the details and status of a run in your notebook. It can show various information, such as:
- The run ID, status, start time, end time, and duration
- The hyperparameter values and metrics of each child run
- The best run and its details
- The logs and outputs of each child run
- The charts and graphs of the metrics and parameters
The following code snippet shows how to use the RunDetails widget to monitor the HyperDrive experiment:
# Import azureml.widgets module
from azureml.widgets import RunDetails
# Create a RunDetails widget object
widget = RunDetails(hyperdrive_run)
# Display the widget in the notebook
widget.show()
The RunDetails class creates a widget object for the specified run, and the widget.show method displays it in the notebook.
You can also use the MLflow UI and the Azure portal to monitor the HyperDrive experiment as it progresses, and to compare and analyze the results of the child runs to find the best one.
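Because Azure ML exposes an MLflow-compatible tracking endpoint, you can also point the MLflow client at the workspace and query the experiment from there; a minimal sketch (it requires the azureml-mlflow package in your local environment):
import mlflow
# Point the MLflow client at the workspace's tracking endpoint
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("hyperdrive_mnist")
# List the runs logged under this experiment as a pandas DataFrame
runs_df = mlflow.search_runs()
print(runs_df.head())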
Step 6: Analyze the best run and register the best model
After the HyperDrive experiment completes, we can analyze the results and find the best run based on the primary metric. We can also register the best model in our workspace for future use. The following code snippet shows how to do so:
# Import azureml.core module
from azureml.core import Model
# Get the best run, its logged metrics, and its arguments
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
best_run_arguments = best_run.get_details()["runDefinition"]["arguments"]
# Helper to look up the value that follows a flag in the argument list
def get_arg(flag):
    return best_run_arguments[best_run_arguments.index(flag) + 1]
# Print the best run ID, metrics, and parameters
print(f"Best run ID: {best_run.id}")
print(f"Best run validation accuracy: {best_run_metrics['val_accuracy']}")
print(f"Best run learning rate: {get_arg('--learning_rate')}")
print(f"Best run batch size: {get_arg('--batch_size')}")
print(f"Best run epochs: {get_arg('--epochs')}")
# Register the best model in the workspace
model = best_run.register_model(model_name="hyperdrive_mnist",
                                model_path="outputs/model",
                                description="CNN model for MNIST classification using hyperdrive",
                                tags={"type": "CNN", "dataset": "MNIST", "method": "hyperdrive"},
                                model_framework=Model.Framework.TENSORFLOW,
                                model_framework_version="2.4.1")
The hyperdrive_run.get_best_run_by_primary_metric method returns the best run object based on the primary metric. The best_run.get_metrics method returns a dictionary of metrics logged by the best run, and the best_run.get_details method returns a dictionary of details about the best run, such as its ID, status, and arguments; the small get_arg helper then extracts each hyperparameter value from the argument list.
The best_run.register_model method registers the model from the best run in the workspace with the specified name, path, description, tags, framework, and version. The Model class defines constants for common frameworks, such as TensorFlow and PyTorch.
You can check the registered models by using the Model.list method or by visiting the Azure portal.
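For example, to list everything registered in the workspace:
# List all registered models in the workspace
from azureml.core import Model
for m in Model.list(ws):
    print(m.name, m.version, m.tags)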
We have now completed our HyperDrive experiment and registered our best model. We have learned how to use Azure ML to perform hyperparameter tuning and experiment tracking for cloud-based training, and we have seen how to use HyperDrive, MLflow, and TensorFlow to build, train, and evaluate a CNN model for MNIST classification. You can check the full code here.
Please feel free to leave a comment below if you have any questions or feedback.