Overview

  1. Intro to Amazon SageMaker: what it is, what problems it solves
  2. Setting up a SageMaker Studio development environment: create a SageMaker Studio instance, learn its GUI, create Jupyter notebooks, and connect to git repos.
  3. Interactive model training in SageMaker Studio: train models using built-in algorithms and custom scripts, and build Docker images for custom solutions. Interact with the SageMaker API using its Python SDK.
  4. Experiment management using Experiments: use both TensorFlow and PyTorch to train CV models, and run experiments to find the best model
  5. Deploy trained models: run batch offline inference, and deploy models as APIs for production online inference. We will also show how to scale production resources based on demand.
  6. Monitoring deployed models: use SageMaker Model Monitor to monitor the model deployed in step 5, and compare incoming data against the training data to detect data drift

Challenges of running production ML systems

  • Experimentation tracking: selecting the best model from many iterations is hard
  • Debugging: debugging in ML is hard, debugging an automated ML job is harder
  • Deployment: deployment is not well understood since most data scientists lack DevOps and software engineering skills
  • Scaling: instances need to automatically scale based on demand
  • Model monitoring: models need to be monitored to identify regressions in quality and to provide signals for actions such as retraining models, auditing upstream systems, and fixing data quality issues.

We will solve these challenges with Amazon SageMaker! We will be able to

  • train models using built-in algorithms and custom scripts
  • run large scale experiments in an autoscaling distributed environment
  • collect and analyze data from experiments
  • perform offline inference on batch data
  • deploy models as persistent endpoints that scale with demand
  • monitor deployed endpoints for data drift

1. Intro to Amazon SageMaker

SageMaker is an ML platform. That means it lets you quickly train and deploy ML models to production.

It takes care of a lot of the “plumbing” for ML systems. These tasks are NOT feature engineering or algorithmic work; they are more mundane ones such as managing infrastructure, logging, and monitoring inputs and outputs. Having them handled for us makes the algorithmic work much easier.

SageMaker features:

  • Studio - integrated ML environment where you can build, train, deploy, and analyze your models in one place.
  • Studio notebooks - Jupyter notebook interface with Single Sign-On integration, fast startup time, and single click sharing
  • Preprocessing - preprocess and analyze data over a distributed cluster
  • Experiments - experiment management and tracking
  • Debugger - inspect data and errors in automated training processes
  • Batch transform - perform batch inference
  • Hosted endpoints - deploy models as APIs
  • Model monitor - monitor model quality

1.1 SageMaker architecture overview

SageMaker is built on top of several other services and technologies.

  • Docker: usage depends on your use case. To use built-in models, there is no need to work with Docker. For customized models and solutions, you need to build and publish your own Docker images.
  • Elastic Container Registry (ECR): an AWS service. SageMaker pulls the desired Docker image from ECR to run the job. Amazon has predefined Docker images for SageMaker to do basic things; we will use them later.
  • S3: SageMaker reads its input data from and stores its outputs in S3.
  • EC2: SageMaker launches EC2 instances to do the work and terminates them when the job is complete, so you only pay for what you use. If you choose to deploy a model as an API, SageMaker launches the number of instances you specify and does not terminate them. You can let them scale as needed.

SageMaker architecture

There are also two background services that do not appear in the architecture diagram but are still important:

  • CloudWatch: monitors the resources and stores log files
  • IAM: AWS uses Identity and Access Management to manage your access to resources.

2. SageMaker Studio

Features

  • Notebooks
  • Experiments
  • Autopilot (enables developers with no ML expertise to train models)
  • Debugger
  • Model Monitor

2.1 Create a Studio instance

In the Quick start flow, set your username,

Studio create 1

then create a new IAM role, leaving the S3 settings at their defaults unless you have special naming requirements.

Studio create 2

After creating the new IAM role, hit Submit to create the SageMaker Studio instance. You’ll see the following page; provisioning takes a couple of minutes.

Studio create 3

2.2 Create a notebook

Once the Studio instance is created, click File -> New -> Notebook and select a kernel to start a new notebook. At the bottom left, a green message should say “kernel starting”, and at the top right “unknown” means the EC2 resource underlying the kernel is not ready yet. You can’t run any notebook code during this starting phase; just wait a while. Once the kernel is ready, the “unknown” turns into something like “2 vCPU + 4 GB”.

The kernel tab on the left sidebar shows the active kernels (Docker images) you are running.

Kernel

The “Python3 (Data Science)” kernel has the usual data science libraries pre-installed.

Kernel DS

IMPORTANT: Don’t forget to stop the kernel when you are done! You get billed by the hour for active kernels based on the EC2 instance type. Find the pricing details here.

2.3 Connect to a git repo in Studio

Click File -> New -> Terminal to open a bash terminal, then git clone your desired repo from GitHub.
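For example (the URL is a placeholder; substitute your own repo):

git clone https://github.com/your-user/your-repo.git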

Now you should see your repo’s directory in the file browser in the left sidebar. You can also check the git history and status in the git tab in the sidebar.

2.4 Install libraries

To install additional libraries into the kernel, run pip from the notebook like this:

import sys

!{sys.executable} -m pip install sagemaker-experiments
!{sys.executable} -m pip install torch
!{sys.executable} -m pip install torchvision

2.5 Demo in Notebook: Create S3 bucket

Run the following code in the notebook (us-east-1 is special-cased because create_bucket does not accept a LocationConstraint for that region):

import boto3

BUCKET = 'sagemaker-course-20200804'

boto_session = boto3.Session()

try:
    if boto_session.region_name == "us-east-1":
        boto_session.client('s3').create_bucket(Bucket=BUCKET)
    else:
        boto_session.client('s3').create_bucket(
            Bucket=BUCKET,
            CreateBucketConfiguration={'LocationConstraint': boto_session.region_name})
except Exception as e:
    print(e)

It should run without any error. I went to the S3 console and verified that the bucket was created.
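You can also verify from the notebook itself; head_bucket raises an error if the bucket does not exist:

boto_session.client('s3').head_bucket(Bucket=BUCKET)  # raises ClientError if the bucket is missing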

3. Training models in SageMaker

We can either

  • use built-in models, or
  • write custom code and run it on a predefined Docker image, or
  • build our own Docker image with our own code to train a fully customized solution (requires Docker knowledge; more advanced, not covered here)

The first two approaches are demonstrated below. Refer to the code in the notebook here.

3.1 Train an XGBoost model with the built-in algorithm

We can use the built-in XGBoost algorithm to train our model. To train, we first need to upload the training data to S3. We can use the SageMaker Python SDK to create a SageMaker Session object and a boto3 session for the S3 upload.

After uploading the data, we need to create pointers to the uploaded data to feed into the training call later, as sketched below.
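Here is a minimal sketch of that setup, assuming SageMaker Python SDK v1 (which this guide uses throughout); the local file names and the PREFIX value are placeholders:

import boto3
import sagemaker
from sagemaker import get_execution_role

boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)
role = get_execution_role()  # the Studio execution role created earlier

PREFIX = 'xgboost-churn'  # placeholder S3 key prefix

# Upload local CSVs to S3 (file names are placeholders)
train_uri = sagemaker_session.upload_data('train.csv', bucket=BUCKET, key_prefix=PREFIX)
validation_uri = sagemaker_session.upload_data('validation.csv', bucket=BUCKET, key_prefix=PREFIX)

# Pointers to the data in S3 that we feed into fit() later
s3_input_train = sagemaker.s3_input(s3_data=train_uri, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=validation_uri, content_type='csv')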

Next, to train the model, we need to create a training job. It needs some information: the S3 location of the training data (the pointers above), the compute resources for training, the S3 location to store the output, the Docker image for the predefined algorithm, etc.

sagemaker xgboost

Be sure to familiarize yourself with the XGBoost algorithm, because we set the hyperparameters assuming you know how it works.

Note that get_image_uri with 'xgboost' and a repo_version fetches the correct Docker image for the built-in XGBoost algorithm. We then create the Estimator object necessary to launch the training job. We specify image_name, role, train_instance_type and train_instance_count, output_path, a custom base_job_name of your choosing (later used as the identifier for the trained model in S3), and sagemaker_session.

from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker import estimator

xgboost_image_name = get_image_uri(boto_session.region_name, 'xgboost', repo_version='0.90-2')

xgb_model = estimator.Estimator(image_name=xgboost_image_name,
                                role=role,
                                train_instance_count=1,
                                train_instance_type='ml.m4.xlarge',
                                output_path=f"s3://{BUCKET}/{PREFIX}",
                                base_job_name="builtin-xgboost",
                                sagemaker_session=sagemaker_session)

xgb_model.set_hyperparameters(max_depth=5,
                              subsample=0.8,
                              num_round=600,
                              eta=0.2,
                              gamma=4,
                              min_child_weight=6,
                              silent=0,
                              objective='binary:logistic')

xgb_model.fit({'train': s3_input_train,
               'validation': s3_input_validation})

"""
2020-08-05 19:20:32 Starting - Starting the training job...
2020-08-05 19:20:34 Starting - Launching requested ML instances......
2020-08-05 19:21:51 Starting - Preparing the instances for training......
2020-08-05 19:23:00 Downloading - Downloading input data......
2020-08-05 19:23:46 Training - Downloading the training image..INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[19:24:09] 2333x69 matrix with 160977 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
INFO:root:Determined delimiter of CSV input is ','
[19:24:09] 666x69 matrix with 45954 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,
INFO:root:Single node training.
INFO:root:Train matrix has 2333 rows
INFO:root:Validation matrix has 666 rows
[0]#011train-error:0.077154#011validation-error:0.099099
[1]#011train-error:0.050579#011validation-error:0.081081
[2]#011train-error:0.048864#011validation-error:0.075075

...
...

[594]#011train-error:0.020574#011validation-error:0.061562
[595]#011train-error:0.020574#011validation-error:0.061562
[596]#011train-error:0.020574#011validation-error:0.061562
[597]#011train-error:0.020574#011validation-error:0.061562
[598]#011train-error:0.020574#011validation-error:0.061562
[599]#011train-error:0.021003#011validation-error:0.061562

2020-08-05 19:24:24 Uploading - Uploading generated training model
2020-08-05 19:24:24 Completed - Training job completed
Training seconds: 84
Billable seconds: 84
"""

AWS bills the training based on the time and compute used. After this step, we have trained our first SageMaker model! Go to S3 and check the bucket we created: there is a directory called /builtin-xgboost-<date-ran>, with an /output directory inside it and the serialized model file model.tar.gz in /output.
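If you prefer to check from the notebook instead of the S3 console, here is a quick sketch with boto3 (PREFIX as defined earlier):

s3 = boto_session.client('s3')
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get('Contents', []):
    print(obj['Key'])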

3.2 Train a scikit-learn model using prebuilt Docker images and custom code

For a complete list of training environment variables used by SageMaker Docker images, check out the SageMaker training toolkit repo.

Here is an example of a training script. It’s a command line program that takes in the SageMaker environment variables and the model hyperparameters it needs.

joblib is used to serialize the model.

import argparse
import os
import pandas as pd

from sklearn import ensemble
from sklearn.externals import joblib


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here. In this simple example we are just including one hyperparameter.
    parser.add_argument('--n_estimators', type=int, default=100)

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    # Take the set of files and read them all into a single pandas dataframe
    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    train_data = pd.concat(raw_data)

    # labels are in the first column (.iloc, since .ix is deprecated in pandas)
    train_y = train_data.iloc[:, 0]
    train_X = train_data.iloc[:, 1:]

    # Here we support a single hyperparameter, 'n_estimators'. Note that you can add as many
    # as your training may require in the ArgumentParser above.
    n_estimators = args.n_estimators

    # Now use scikit-learn's random forest classifier to train the model.
    clf = ensemble.RandomForestClassifier(n_estimators=n_estimators)
    clf = clf.fit(train_X, train_y)

    # Serialize the model.
    joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))


def model_fn(model_dir):
    """Deserialize and return fitted model

    Note that this should have the same name as the serialized model in the main method
    """
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

To use this custom training script, we need an Estimator object similar to the one in the built-in algorithm case.

from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='../scripts/sklearn/sklearn_rf.py',
    code_location=f's3://{BUCKET}/{PREFIX}',
    hyperparameters={'n_estimators': 50},
    role=role,
    train_instance_type='ml.c4.xlarge',
    output_path=f's3://{BUCKET}/{PREFIX}',
    base_job_name="custom-code-sklearn",
    sagemaker_session=sagemaker_session)

Note that entry_point is the local path to the training script, relative to where we are (the current notebook). code_location is the S3 URI to which SageMaker will upload our custom training code.

To train, run

sklearn_estimator.fit({'train': s3_input_train})

In the background, it starts the compute resources, downloads the training script onto them, runs training, stores the resulting artifacts in S3, and shuts down the instances.

This fit call is blocking in the notebook, meaning we can’t do anything else until it’s done. It prints the messages from the training job. To view the details about this training job, we can run the following code and open the output URL in the browser:

f"https://{boto_session.region_name}.console.aws.amazon.com/sagemaker/home?region={boto_session.region_name}#/jobs/{sklearn_estimator.jobs[0].job_name}"

Or we can navigate to this page from the SageMaker home page -> Training -> Training jobs -> the job id.
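As an aside, since fit blocks the notebook, you can launch the job asynchronously instead; fit accepts a wait flag:

# Launch the training job without blocking the notebook (logs won't stream inline)
sklearn_estimator.fit({'train': s3_input_train}, wait=False)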

In S3, we can find the model artifacts under the custom-code-sklearn-<timestamp> directory (because we set base_job_name to custom-code-sklearn in the Estimator constructor). Unlike the previous built-in case, we not only have /output with the serialized model in it, we also have /debug-output and /source.

The powerful thing is that /source contains the exact version of the code we used to train this model, so code and model are versioned together. This is extremely important for reproducibility.

s3 model path

Notice that we didn’t provide a Docker image pointer to the Estimator object. SageMaker picks a default image based on the framework version and Python version arguments.

We can check the image name on the estimator we created above:

print(sklearn_estimator.image_name)
"""
683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3
"""

We see it returns an ECR URI that points to the default sklearn Docker image.

For reference, the sagemaker-scikit-learn-container repo contains the source code, including the Dockerfile, for creating this image.

3.3 Install additional dependencies when using pre-built Docker images

We don’t need to create a new Docker image just to add a few extra dependencies. We can include a requirements.txt file listing those additional packages in the same directory as the training script (entry_point). In this case, that directory is ../scripts/sklearn/.

Say we want the Python package eli5 (which outputs feature importances for a trained model). We just create a requirements.txt file containing the single line eli5 and put it in ../scripts/sklearn/ along with the training script.
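One way to create the file from the notebook, using the same %%writefile magic as in the preprocessing example later:

%%writefile ../scripts/sklearn/requirements.txt
eli5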

The way to let SageMaker install it for the training job is to change the Estimator constructor in the following way.

sklearn_estimator = SKLearn(
    entry_point='sklearn_rf.py',
    source_dir='../scripts/sklearn',
    code_location=f's3://{BUCKET}/{PREFIX}',
    hyperparameters={'n_estimators': 50},
    role=role,
    train_instance_type='ml.c4.xlarge',
    output_path=f's3://{BUCKET}/{PREFIX}',
    base_job_name="install-libs-sklearn",
    sagemaker_session=sagemaker_session)

Notice that we added source_dir as a relative path; SageMaker implicitly installs the dependencies listed in the requirements.txt found in that directory. We also changed base_job_name so this job writes to a different S3 directory, making it easy to compare against the previous job.

3.4 Data preprocessing example

Anyone who works with models knows that data preprocessing and feature engineering are a huge part of building them. Now we’ll demonstrate how to use the SageMaker Python SDK to preprocess the data with sklearn.

First, upload the raw data to S3. Then import SKLearnProcessor from sagemaker.sklearn.processing to create the processor object, as sketched below. If you don’t need sklearn for your preprocessing, you can use the more general sagemaker.processing.ScriptProcessor to run an arbitrary script.
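A minimal sketch of those first steps (the local file name is a placeholder; this defines the s3_raw_data used later):

from sagemaker.sklearn.processing import SKLearnProcessor

# Upload the raw churn data to S3
s3_raw_data = sagemaker_session.upload_data('raw_churn.csv', bucket=BUCKET,
                                            key_prefix=f'{PREFIX}/raw')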

The following is the preprocessing script. The main steps are:

  • read in the data, use the LabelBinarizer class to encode the target column
  • do the train-test split
  • create 2 pipelines: the first imputes missing values in the numerical columns; the second imputes the categorical columns and one-hot encodes them
  • fit the processor on the training data, and transform the training and test datasets
  • serialize the train and test dataframes to csv files

%%writefile ../scripts/sklearn/preprocessing.py
# The %%writefile cell magic above (which must be the first line of the cell)
# writes this cell's contents to ../scripts/sklearn/preprocessing.py.

import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
    args, _ = parser.parse_known_args()

    print('Received arguments {}'.format(args))

    input_data_path = os.path.join('/opt/ml/processing/input', 'raw_churn.csv')

    print('Reading input data from {}'.format(input_data_path))
    df = pd.read_csv(input_data_path)

    # Encode target
    lb = LabelBinarizer()
    label = lb.fit_transform(df['Churn?'])
    df['Churn?'] = label.flatten()

    negative_examples, positive_examples = np.bincount(df['Churn?'])
    print('Data after cleaning: {}, {} positive examples, {} negative examples'.format(df.shape, positive_examples, negative_examples))

    split_ratio = args.train_test_split_ratio
    print('Splitting data into train and test sets with ratio {}'.format(split_ratio))
    X_train, X_test, y_train, y_test = train_test_split(df.drop('Churn?', axis=1), df['Churn?'], test_size=split_ratio, random_state=0)


    numerical_cols = ['Account Length', 'VMail Message', 'Day Mins', 'Day Calls', 'Eve Mins',
                      'Eve Calls', 'Night Mins', 'Night Calls', 'Intl Mins', 'Intl Calls',
                      'CustServ Calls']
    categorical_cols = ["State", "Int'l Plan", "VMail Plan"]

    num_proc = make_pipeline(SimpleImputer(strategy='median'))
    cat_proc = make_pipeline(
        SimpleImputer(strategy='constant', fill_value='missing'),
        OneHotEncoder(handle_unknown='ignore', sparse=False))
    preprocessor = make_column_transformer((numerical_cols, num_proc),
                                           (categorical_cols, cat_proc))
    print('Running preprocessing and feature engineering transformations')
    train_features = preprocessor.fit_transform(X_train)
    test_features = preprocessor.transform(X_test)

    print('Train data shape after preprocessing: {}'.format(train_features.shape))
    print('Test data shape after preprocessing: {}'.format(test_features.shape))

    one_hot_encoder = preprocessor.named_transformers_['pipeline-2'].named_steps['onehotencoder']
    encoded_cat_cols = one_hot_encoder.get_feature_names(input_features=categorical_cols).tolist()
    processed_cols = numerical_cols + encoded_cat_cols

    # reset_index aligns the labels with the re-indexed feature rows;
    # DataFrame.insert would otherwise align on the original (shuffled) index
    train_df = pd.DataFrame(train_features, columns=processed_cols)
    train_df.insert(0, 'churn', y_train.reset_index(drop=True))

    test_df = pd.DataFrame(test_features, columns=processed_cols)
    test_df.insert(0, 'churn', y_test.reset_index(drop=True))

    train_output_path = os.path.join('/opt/ml/processing/train', 'train.csv')
    test_output_path = os.path.join('/opt/ml/processing/test', 'test.csv')

    print('Saving training features to {}'.format(train_output_path))
    train_df.to_csv(train_output_path, header=True, index=False)

    print('Saving test features to {}'.format(test_output_path))
    test_df.to_csv(test_output_path, header=True, index=False)

We need to create ProcessingInput and ProcessingOutput objects to set the paths for the input and output files. They map the S3 locations to file paths in the filesystem of the worker instances.

from sagemaker.processing import ProcessingInput, ProcessingOutput

processing_input = ProcessingInput(source=s3_raw_data, destination='/opt/ml/processing/input')

processing_output_train = ProcessingOutput(output_name='train.csv', source='/opt/ml/processing/train',
                                           destination=f's3://{BUCKET}/{PREFIX}/processing/')
processing_output_test = ProcessingOutput(output_name='test.csv', source='/opt/ml/processing/test',
                                          destination=f's3://{BUCKET}/{PREFIX}/processing/')

Next, we can create and run the processing job as follows.

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m4.xlarge',
                                     instance_count=1,
                                     sagemaker_session=sagemaker_session)

sklearn_processor.run(code='../scripts/sklearn/preprocessing.py',
                      inputs=[processing_input],
                      outputs=[processing_output_train, processing_output_test],
                      arguments=['--train-test-split-ratio', '0.2'])

We can inspect the inputs and outputs of the job with the following code. It shows their S3 metadata.

preprocessing_job_description = sklearn_processor.jobs[-1].describe()

preprocessing_job_description['ProcessingInputs']

preprocessing_job_description['ProcessingOutputConfig']
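For example, to print just the S3 locations of the outputs (the keys follow the DescribeProcessingJob response shape):

for output in preprocessing_job_description['ProcessingOutputConfig']['Outputs']:
    print(output['OutputName'], '->', output['S3Output']['S3Uri'])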

Reference

  • Luigi’s SageMaker course