How to Reduce Inference Costs for ML Models

Amazon SageMaker is a fully managed service that enables data scientists and developers to build, train, and deploy machine learning (ML) models at 50% lower cost than self-managed deployments on Amazon Elastic Compute Cloud (Amazon EC2).

Elastic Inference is a capability of SageMaker that delivers 20% better performance for model inference than AWS Deep Learning Containers on EC2 by accelerating inference through model compilation, model server tuning, and underlying hardware and software acceleration technologies.

Inference is the process of making predictions using a trained ML model. For production ML applications, inference accounts for up to 90% of total compute costs.

Hence, when deploying an ML model for inference, accelerating inference performance on low-cost instance types is an effective way to reduce overall compute costs while meeting performance requirements such as latency and throughput.

For example, running ML models on GPU-based instances provides good inference performance; however, selecting the right instance size and optimizing GPU utilization is challenging because different ML models require different amounts of compute and memory resources.

Elastic Inference Accelerators (EIA) solve this problem by enabling you to attach the right amount of GPU-powered inference acceleration to any Amazon SageMaker ML instance. You can choose any CPU instance type that best suits your application’s overall compute and memory needs, and separately attach the right amount of GPU-powered inference acceleration needed to satisfy your performance requirements.

This allows you to reduce inference costs by using compute resources more efficiently. Along with hardware acceleration, Elastic Inference offers software acceleration through SageMaker Neo, a capability of SageMaker that automatically compiles ML models from any supported ML framework to any supported target hardware. With SageMaker Neo, you don’t need to set up third-party or framework-specific compiler software or tune the model manually to optimize inference performance.

With Elastic Inference, you can combine software and hardware acceleration to get the best inference performance on SageMaker.

In this article, we show how to compile a pre-trained TensorFlow ResNet-50 model using SageMaker Neo and how to deploy it to a SageMaker endpoint with an Elastic Inference accelerator attached.

Setup

First, we need to make sure the SageMaker Python SDK is version >= 2.32.1 and import the necessary Python packages.

import numpy as np
import time
import json
import requests
import boto3
import os
import sagemaker
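
A quick way to confirm that the installed SDK meets the version requirement above is to print it from the notebook and, if it is older, upgrade it from a notebook cell; a minimal check:

# Print the installed SageMaker Python SDK version; if it is older than
# 2.32.1, upgrade it from a notebook cell with:
#   !pip install -U "sagemaker>=2.32.1"
print(sagemaker.__version__)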

Next, we get the IAM execution role and a few other SageMaker-specific variables from our notebook environment so that SageMaker can access resources in our AWS account.

from sagemaker import get_execution_role
from sagemaker.session import Session 

role = get_execution_role()
sess = Session()

region = sess.boto_region_name
bucket = sess.default_bucket()

Import ResNet50 model from Keras

import tensorflow as tf
import tarfile 

tf.keras.backend.set_image_data_format('channels_last')

pretrained_model = tf.keras.applications.resnet.ResNet50()
saved_model_dir = '1'

tf.saved_model.save(pretrained_model, saved_model_dir) 

with tarfile.open('model.tar.gz', 'w:gz') as tar: 
  tar.add(saved_model_dir)

Upload model artifact to S3

SageMaker Neo expects a path to the model artifact in Amazon S3, so we will upload the model artifact to an S3 bucket.

from sagemaker.utils import name_from_base

prefix = name_from_base('ResNet50')

input_model_path = sess.upload_data(
        path='model.tar.gz',
        bucket=bucket,
        key_prefix=prefix)

print('S3 path for input model: {}'.format(input_model_path))

Compile model for EI Accelerator using SageMaker Neo

Now the model is ready to be compiled by SageMaker Neo. Note that the target_instance_family field must be set to ml_eia2 for the model to be optimized for EI accelerator deployment.

To compile the model, you also need to provide its input_shape and, optionally, compiler_options. Note that 32-bit floating point (FP32) is the default precision mode.

from sagemaker.tensorflow import TensorFlowModel

# Compile the model for EI accelerator in SageMaker Neo

tensorflow_model = TensorFlowModel(
      model_data=input_model_path,
      role=role, 
      framework_version='2.3') 

output_path = '/'.join(input_model_path.split('/')[:-1])
compilation_job_name = prefix + "-fp32"

compiled_model_fp32 = tensorflow_model.compile(
     target_instance_family='ml_eia2',
     input_shape={"input_1": [1, 224, 224, 3]}, 
     output_path=output_path,
     role=role,
     job_name=compilation_job_name,
     framework='tensorflow',
     compiler_options={"precision_mode": "fp32"}
)

Deploy compiled model to an Endpoint with EI Accelerator attached

Deploying a model to a SageMaker endpoint uses the same deploy function whether or not the model was compiled with SageMaker Neo. The only change required to use an EI accelerator is to provide the accelerator_type parameter, which determines the type of EI accelerator attached to your endpoint.

predictor_compiled_fp32 = compiled_model_fp32.deploy(
      initial_instance_count=1,
      instance_type='ml.m5.xlarge', 
      accelerator_type='ml.eia2.large'
)
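
The latency comparison in the next section also refers to a model compiled with the FP16 precision mode. That compilation is not shown above; a minimal sketch of producing it, re-creating the TensorFlowModel so the second compilation starts from the original artifact and changing only the precision_mode compiler option (the fp16-suffixed names are illustrative), would look like this:

# Re-create the model object from the original artifact and compile with FP16
tensorflow_model = TensorFlowModel(
      model_data=input_model_path,
      role=role,
      framework_version='2.3')

compiled_model_fp16 = tensorflow_model.compile(
     target_instance_family='ml_eia2',
     input_shape={"input_1": [1, 224, 224, 3]},
     output_path=output_path,
     role=role,
     job_name=prefix + "-fp16",
     framework='tensorflow',
     compiler_options={"precision_mode": "fp16"}
)

The FP16 variant can then be deployed and benchmarked in the same way as the FP32 variant above.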

Compare latency with unoptimized model on EIA

The model compiled with the FP16 precision mode is faster than the model compiled with FP32. Now let's measure the latency of an uncompiled model as well.
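
The benchmark calls below rely on a benchmark_sm_endpoint helper and a data payload that this walkthrough does not define. The following is a minimal sketch, assuming random input data in the TensorFlow Serving request format and client-side latencies reported in milliseconds; the helper name, warmup and iteration counts, and output format are taken from the calls and results shown below.

# Random input matching ResNet-50's expected shape, wrapped in the
# TensorFlow Serving request format
random_input = np.random.rand(1, 224, 224, 3)
data = {'instances': np.asarray(random_input).astype(float).tolist()}

def benchmark_sm_endpoint(predictor, input_data):
    # Warm up the endpoint so one-time initialization is not counted
    print('Doing warmup round of 100 inferences (not counted)')
    for _ in range(100):
        predictor.predict(input_data)
    time.sleep(3)

    # Measure client-side end-to-end latency in milliseconds
    print('Running 1000 inferences')
    client_times = []
    for _ in range(1000):
        client_start = time.time()
        predictor.predict(input_data)
        client_end = time.time()
        client_times.append((client_end - client_start) * 1000)

    print('Client end-to-end latency percentiles:')
    print('Avg | P50 | P90 | P99')
    print('{:.4f} | {:.4f} | {:.4f} | {:.4f}'.format(
        np.mean(client_times),
        np.percentile(client_times, 50),
        np.percentile(client_times, 90),
        np.percentile(client_times, 99)))

With the helper in place, we benchmark the endpoint serving the compiled model:

benchmark_sm_endpoint(predictor_compiled_fp32, data)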

# Create a TensorFlow SageMaker model

tensorflow_model = TensorFlowModel( 
      model_data=input_model_path,
      role=role, 
      framework_version='2.3') 

# Deploy the uncompiled model to SM endpoint with EI attached

predictor_uncompiled = tensorflow_model.deploy(
   initial_instance_count=1,
   instance_type='ml.m5.xlarge', 
   accelerator_type='ml.eia2.large'
) 

# Benchmark the endpoint serving the uncompiled model

benchmark_sm_endpoint(predictor_uncompiled, data)

From the benchmarks above, the output will be similar to the following (latencies in milliseconds).

Benchmark of the optimized model:

Doing warmup round of 100 inferences (not counted)
Running 1000 inferences
Client end-to-end latency percentiles:
Avg | P50 | P90 | P99
103.2129 | 124.4727 | 129.1123 | 133.2371

Benchmark of the unoptimized model:

Doing warmup round of 100 inferences (not counted)
Running 1000 inferences
Client end-to-end latency percentiles:
Avg | P50 | P90 | P99
117.1654 | 137.9665 | 143.5326 | 150.2070

Clean up endpoints

Running endpoints incur costs, so we delete both endpoints to release their resources once we're done with this example.

sess.delete_endpoint(predictor_compiled_fp32.endpoint_name)

sess.delete_endpoint(predictor_uncompiled.endpoint_name)

Conclusion

SageMaker Elastic Inference is an easy-to-use solution for adding model optimizations to improve inference performance on Amazon SageMaker. With Elastic Inference accelerators, you can get GPU inference acceleration and remain more cost-effective than standalone SageMaker GPU instances.
