AWS SageMaker: Text Classification Using Keras & TensorFlow

Overview

In this tutorial, we will use Amazon SageMaker to build, train, and deploy an NLP model using a custom-built TensorFlow Docker container. The model will classify news articles into the appropriate news category. To train it, we will use the UCI News Aggregator dataset, which contains roughly 422K article headlines together with their categories (labels). There are four categories:

  • Business (b)
  • Science & Technology (t)
  • Entertainment (e)
  • Health & Medicine (m)
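
For reference, the one-letter codes map to readable names as follows; the small Python dict below is purely a display convenience (the raw CSV stores only the codes) and is handy when inspecting predictions later:

# Illustrative mapping from the dataset's category codes to readable labels
CATEGORY_NAMES = {
    'b': 'Business',
    't': 'Science & Technology',
    'e': 'Entertainment',
    'm': 'Health & Medicine',
}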

Prerequisites:

  1. Create an Amazon SageMaker Notebook

  2. Open JupyterLab

Local Test:

Clone the sample code

  • In JupyterLab, click the Terminal icon to open a new terminal
  • To clone the tutorial repository, run the following commands
  git clone https://github.com/lbnl-science-it/aws-sagemaker-keras-text-classification.git
  cd aws-sagemaker-keras-text-classification

Training and hosting the text classifier locally (on the JupyterLab notebook instance)

  cd container
  sh build_docker_local_test.sh

Prediction

  • Open another terminal in JupyterLab
  cd container/local_test
  ./predict.sh input.json application/json
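
predict.sh is essentially a wrapper around an HTTP request to the container's /invocations route, which is the standard SageMaker serving contract. A minimal Python equivalent, assuming the locally hosted container is listening on the default port 8080:

import requests

# Read the sample payload shipped with the repository
with open('input.json') as f:
    payload = f.read()

# POST to the container's inference route (the SageMaker serving contract)
response = requests.post(
    'http://localhost:8080/invocations',
    data=payload,
    headers={'Content-Type': 'application/json'},
)
print(response.text)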

Data Exploration

  • Navigate to aws-sagemaker-keras-text-classification and open sagemaker_keras_text_classification.ipynb
  • Execute the code in the JupyterLab notebook
import pandas as pd
import tensorflow as tf
import re
import numpy as np
import os

from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from tensorflow.python.keras.utils import to_categorical
WARNING:tensorflow:From /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/__init__.py:1467: The name tf.estimator.inputs is deprecated. Please use tf.compat.v1.estimator.inputs instead.
column_names = ["TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
news_dataset = pd.read_csv(os.path.join('./data', 'newsCorpora.csv'), names=column_names, header=None, delimiter='\t')
news_dataset.head()
  TITLE                                              URL                                                 PUBLISHER          CATEGORY  STORY                          HOSTNAME         TIMESTAMP
1 Fed official says weak data caused by weather,...  http://www.latimes.com/business/money/la-fi-mo...  Los Angeles Times  b         ddUyU0VZz0BRneMioxUPQVP6sIxvM  www.latimes.com  1394470370698
(first row shown; head() prints five)
news_dataset.groupby(['CATEGORY']).size()
CATEGORY
b    115967
e    152469
m     45639
t    108344
dtype: int64
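
The imports above hint at the preprocessing the training script performs: headlines are tokenized, padded to fixed-length sequences, and labels one-hot encoded. Below is a sketch consistent with the shapes reported later in the training log; the vocabulary cap of 10,000 is an assumption inferred from the embedding layer's 1,000,000 parameters, and the authoritative values live in the repository's training code:

MAX_NUM_WORDS = 10000   # assumed: 10,000 words x 100 dims = 1,000,000 embedding params
MAX_SEQ_LENGTH = 100    # matches "Shape of data tensor: (422417, 100)" in the log

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(news_dataset['TITLE'])
sequences = tokenizer.texts_to_sequences(news_dataset['TITLE'])

# Pad or truncate every headline to exactly MAX_SEQ_LENGTH tokens
data = pad_sequences(sequences, maxlen=MAX_SEQ_LENGTH)

# Map the one-letter categories to integer indices, then one-hot encode
labels_index = {label: i for i, label in enumerate(news_dataset['CATEGORY'].unique())}
labels = to_categorical(news_dataset['CATEGORY'].map(labels_index).values)

print('Shape of data tensor:', data.shape)    # (422417, 100)
print('Shape of label tensor:', labels.shape) # (422417, 4)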

Training and Hosting your Algorithm in Amazon SageMaker

Building and registering the container

The following shell code shows how to build the container image using docker build and push the container image to ECR using docker push.

This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this will be the region where the notebook instance was created). If the repository doesn't exist, the script will create it.
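
For illustration, the repository check can be expressed with boto3 as follows; this is a sketch of the logic only, not the script's actual implementation:

import boto3

ecr = boto3.client('ecr')
repo_name = 'sagemaker-keras-text-classification'

# Create the ECR repository only if it does not already exist
try:
    ecr.describe_repositories(repositoryNames=[repo_name])
except ecr.exceptions.RepositoryNotFoundException:
    ecr.create_repository(repositoryName=repo_name)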

%%sh
cd container
sh build_docker.sh
Login Succeeded
Stopping docker: [  OK  ]
Starting docker:    .[  OK  ]
Sending build context to Docker daemon  456.3MB
Step 1/9 : FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.14.0-cpu-py36-ubuntu16.04
 ---> e6a210ff54e4
Step 2/9 : RUN apt-get update &&     apt-get install -y nginx imagemagick graphviz
 ---> Using cache
 ---> 39d868b5172a
Step 3/9 : RUN pip install --upgrade pip
 ---> Using cache
 ---> 83610cd980fc
Step 4/9 : RUN pip install gevent gunicorn flask tensorflow_hub seqeval graphviz nltk spacy tqdm
 ---> Using cache
 ---> 090ab4fa935c
Step 5/9 : RUN python -m spacy download en_core_web_sm
 ---> Using cache
 ---> cf5c0166e852
Step 6/9 : RUN python -m spacy download en
 ---> Using cache
 ---> d39488c451a6
Step 7/9 : ENV PATH="/opt/program:${PATH}"
 ---> Using cache
 ---> 1b3759031fe0
Step 8/9 : COPY sagemaker_keras_text_classification /opt/program
 ---> Using cache
 ---> 092b378f8446
Step 9/9 : WORKDIR /opt/program
 ---> Using cache
 ---> b468053ed126
Successfully built b468053ed126
Successfully tagged sagemaker-keras-text-classification:latest
The push refers to repository [485444084140.dkr.ecr.us-west-2.amazonaws.com/sagemaker-keras-text-classification]
6738a352b9b5: Pushed
1084a39dd41b: Pushed
394cd2fefa75: Pushed
a8288ebab0e0: Pushed
a71b5abf35ed: Pushed
0ba58b74173a: Pushed
[... per-layer progress omitted; the remaining 18 layers already existed in the repository ...]
latest: digest: sha256:c684d06918d8e4c9ca2db9c35a663d044c89816a01df6fc34ddfd09b449c65ab size: 5344


WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Once you have your container packaged, you can use it to train and serve models. Let's do that with the algorithm we made above.

Set up the environment

Here we specify a bucket to use and the role that will be used for working with SageMaker.

### S3 prefix
prefix = 'sagemaker-keras-text-classification'

### Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3.

We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

WORK_DIRECTORY = 'data'

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)
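
upload_data returns the S3 URI the training job will read from; printing it is a quick sanity check (the bucket name will vary with your account and region):

print(data_location)
# e.g. s3://sagemaker-us-west-2-<account-id>/sagemaker-keras-text-classification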

Create an estimator and fit the model

In order to use SageMaker to fit our algorithm, we'll create an Estimator that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

  • The container name. This is constructed as in the shell commands above.
  • The role. As defined above.
  • The instance count, which is the number of machines to use for training.
  • The instance type, which is the type of machine to use for training.
  • The output path determines where the model artifact will be written.
  • The session is the SageMaker session object that we defined above.

Then we use fit() on the estimator to train against the data that we uploaded above.

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-keras-text-classification'.format(account, region)

tree = sage.estimator.Estimator(image,
                                role, 1, 'ml.c5.2xlarge',
                                output_path="s3://{}/output".format(sess.default_bucket()),
                                sagemaker_session=sess)

To view the progress of training, navigate to Amazon SageMaker Studio > Training > Training Jobs.

tree.fit(data_location)
2020-04-22 01:18:26 Starting - Starting the training job...
2020-04-22 01:18:27 Starting - Launching requested ML instances......
2020-04-22 01:19:56 Starting - Preparing the instances for training......
2020-04-22 01:20:56 Downloading - Downloading input data
2020-04-22 01:20:56 Training - Downloading the training image......
2020-04-22 01:21:50 Training - Training image download completed. Training in progress.
[... repeated numpy FutureWarning messages from tensorflow and tensorboard dtypes.py omitted ...]
Starting the training.
                                               TITLE  ...      TIMESTAMP
1  Fed official says weak data caused by weather,...  ...  1394470370698
2  Fed's Charles Plosser sees high bar for change...  ...  1394470371207
3  US open: Stocks fall after Fed official hints ...  ...  1394470371550
4  Fed risks falling 'behind the curve', Charles ...  ...  1394470371793
5  Fed's Plosser: Nasty Weather Has Curbed Job Gr...  ...  1394470372027

[5 rows x 7 columns]
Found 65990 unique tokens.
Shape of data tensor: (422417, 100)
Shape of label tensor: (422417, 4)
x_train shape:  (337933, 100)
[... TensorFlow initializer deprecation warnings omitted ...]
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 100, 100)          1000000   
_________________________________________________________________
flatten (Flatten)            (None, 10000)             0         
_________________________________________________________________
dense (Dense)                (None, 2)                 20002     
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 12        
=================================================================
Total params: 1,020,014
Trainable params: 1,020,014
Non-trainable params: 0
_________________________________________________________________
2020-04-22 01:22:22.217855: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX512F
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-04-22 01:22:22.259738: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2020-04-22 01:22:22.260199: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x23b2eb0 executing computations on platform Host. Devices:
2020-04-22 01:22:22.260224: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-04-22 01:22:22.261048: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2020-04-22 01:22:22.291480: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Train on 337933 samples, validate on 84484 samples
Epoch 1/6
337933/337933 - 29s - loss: 0.6303 - acc: 0.7653 - val_loss: 0.5580 - val_acc: 0.7959
Epoch 2/6
337933/337933 - 29s - loss: 0.5502 - acc: 0.7977 - val_loss: 0.5558 - val_acc: 0.7955
Epoch 3/6
337933/337933 - 29s - loss: 0.5427 - acc: 0.8000 - val_loss: 0.5450 - val_acc: 0.8004
Epoch 4/6
337933/337933 - 29s - loss: 0.5391 - acc: 0.8010 - val_loss: 0.5407 - val_acc: 0.8018
Epoch 5/6
337933/337933 - 29s - loss: 0.5363 - acc: 0.8022 - val_loss: 0.5403 - val_acc: 0.8035
Epoch 6/6

2020-04-22 01:25:23 Uploading - Uploading generated training model
2020-04-22 01:25:23 Completed - Training job completed
337933/337933 - 29s - loss: 0.5342 - acc: 0.8032 - val_loss: 0.5424 - val_acc: 0.8014
Training complete. Now saving model to:  /opt/ml/model
[... further TensorFlow initializer deprecation warnings omitted ...]
Test headline:  What Improved Tech Means for Electric, Self-Driving and Flying Cars
Predicted category:  b
Training seconds: 283
Billable seconds: 283
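
Besides watching the console, you can poll the job programmatically; a minimal sketch using the session's boto3 client (it relies on the estimator's latest_training_job attribute, which fit() populates):

sm_client = sess.boto_session.client('sagemaker')
desc = sm_client.describe_training_job(TrainingJobName=tree.latest_training_job.name)
print(desc['TrainingJobStatus'])          # e.g. 'Completed'
print(desc.get('BillableTimeInSeconds'))  # e.g. 283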

Deploy the model

Deploying the model to SageMaker hosting just requires a deploy call on the fitted model. This call takes an instance count, instance type, and optionally serializer and deserializer functions. These are used when the resulting predictor is created on the endpoint.

This step may take about 10 to 20 minutes.

from sagemaker.predictor import json_serializer
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=json_serializer)
-------------!
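
Note that json_serializer is the SageMaker Python SDK v1 interface this tutorial was written against. On SDK v2 the equivalent deploy call looks roughly like the sketch below, where the arguments are keyword-only and serializers live in sagemaker.serializers:

# SageMaker Python SDK v2 equivalent (sketch; not needed on v1)
from sagemaker.serializers import JSONSerializer

predictor = tree.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    serializer=JSONSerializer(),
)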

Prediction

request = { "input": "‘Deadpool 2’ Has More Swearing, Slicing and Dicing from Ryan Reynolds"}
print(predictor.predict(request).decode('utf-8'))
{"result": "Entertainment"}

Endpoint Test

To look up the endpoint_name value used below, navigate to Amazon SageMaker Studio > Inference > Endpoints.

InvokeEndpoint using boto3

import json
import boto3

client = boto3.client('runtime.sagemaker')
endpoint_name = 'sagemaker-keras-text-classification-2020-04-22-01-18-26-512'

payload = { "input": "‘Deadpool 2’ Has More Swearing, Slicing and Dicing from Ryan Reynolds"}

response = client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/json',
                                  Accept='text/plain',
                                  Body=json.dumps(payload))

response_body = response['Body']
prediction = response_body.read().decode('utf-8')
print(prediction)
{"result": "Entertainment"}
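
The response body is a plain JSON string, so extracting the label is a one-liner (json was already imported above):

result = json.loads(prediction)['result']
print(result)  # Entertainment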

InvokeEndpoint using AWS CLI

%%sh 
ENDPOINT_NAME="sagemaker-keras-text-classification-2020-04-22-01-18-26-512"
CONTENT_TYPE='application/json'

aws sagemaker-runtime invoke-endpoint \
--endpoint-name ${ENDPOINT_NAME} \
--content-type ${CONTENT_TYPE} \
--body '{"input": "Why Exercise Alone May Not Be the Key to Weight Loss"}' prediction_response.json

cat prediction_response.json
{
    "ContentType": "application/json",
    "InvokedProductionVariant": "AllTraffic"
}
{"result": "Health & Medicine"}

Optional cleanup

When you're done with the endpoint, you'll want to clean it up.

sess.delete_endpoint(predictor.endpoint)
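
delete_endpoint removes the endpoint itself but leaves the endpoint configuration and the model object behind. If you want those removed as well, the following sketch resolves the names first and then deletes all three (it assumes a single production variant); run it instead of the single call above:

sm_client = sess.boto_session.client('sagemaker')

# Resolve the endpoint config and model behind the endpoint before deleting
config_name = sm_client.describe_endpoint(
    EndpointName=predictor.endpoint)['EndpointConfigName']
model_name = sm_client.describe_endpoint_config(
    EndpointConfigName=config_name)['ProductionVariants'][0]['ModelName']

sm_client.delete_endpoint(EndpointName=predictor.endpoint)
sm_client.delete_endpoint_config(EndpointConfigName=config_name)
sm_client.delete_model(ModelName=model_name)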

References

  • https://github.com/aws-samples/amazon-sagemaker-keras-text-classification
  • https://github.com/lbnl-science-it/aws-sagemaker-keras-text-classification
  • https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks.html