MLOps: Model Monitoring 101

This post was originally published by Matt Mayo, Editor at KDnuggets

Fig 1: ML Workflow, Source: martinfowler.com

Background

ML models are driving some of the most important decisions for businesses. It is therefore important that, once deployed into production, these models remain relevant in the context of the most recent data. A model may go out of context if there is data skew, i.e., the data distribution in production has changed from what was used during training. It may also be that a feature becomes unavailable in production data, that the model is no longer relevant because the real-world environment has changed (e.g., COVID-19), or, more simply, that user behavior has changed. Monitoring changes in the model's behaviour and in the characteristics of the most recent data used at inference is thus of utmost importance. This ensures that the model remains relevant and/or true to the desired performance promised during the model training phase.

An instance of such a model monitoring framework is illustrated in Fig 2 below. The objective is to track models on various metrics, the details of which we will get into in the next sections. But first, let us understand the motivation for a model monitoring framework.

Fig 2: Model Monitoring Framework Illustrated

Motivation

Feedback loops play an important role in all aspects of life, as well as in business. Feedback loops are simple to understand: you produce something, measure information about that production, and use that information to improve it. It is a constant cycle of monitoring and improvement. Anything that has measurable information and room for improvement can incorporate a feedback loop, and ML models can certainly benefit from them.

A typical ML workflow includes steps like data ingestion, pre-processing, model building and evaluation, and finally deployment. However, this lacks one key aspect: feedback. The primary motivation of any "model monitoring" framework is thus to create this all-important feedback loop, post deployment, back to the model building phase (as depicted in Fig 1). This helps ML models constantly improve by informing the decision to either update the model or continue with the existing one. To enable this decision, the framework should track and report various model metrics (details in the "Metrics" section later) under the two possible scenarios described below.

  1. Scenario I: The training data is available; the framework computes the said model metrics on both the training data and the production (inference) data post deployment and compares them to make a decision.
  2. Scenario II: The training data is not available; the framework computes the said model metrics based only on the data that is available post deployment.

The following table lists the inputs required by the model monitoring framework to generate the said metrics, under the two scenarios.

[Table: Inputs required by the model monitoring framework under Scenario I and Scenario II]

Based on which of the two scenarios applies, the metrics highlighted in the next section are computed to decide whether a model in production needs an update or some other intervention.
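
As a rough illustration, the retraining decision under the two scenarios might look like the Python sketch below. The function, argument, and column names (needs_update, metric_fn, timestamp) and the threshold are hypothetical placeholders, not part of any specific library; the data frames are assumed to be pandas DataFrames.

```python
def needs_update(metric_fn, production_df, training_df=None, threshold=0.2):
    """Compare a monitoring metric between a reference dataset and recent
    production data, and flag the model for an update if the metric
    exceeds the chosen threshold.

    Scenario I : training_df is available -> reference = training data.
    Scenario II: training_df is None      -> reference = an older slice
                 of the production data itself.
    """
    if training_df is not None:
        reference_df, current_df = training_df, production_df  # Scenario I
    else:
        # Scenario II: split the production log into an older and a newer
        # time window and compare the two windows against each other.
        cutoff = production_df["timestamp"].median()
        reference_df = production_df[production_df["timestamp"] <= cutoff]
        current_df = production_df[production_df["timestamp"] > cutoff]

    score = metric_fn(reference_df, current_df)
    return score > threshold, score
```

Here metric_fn stands for any of the stability or performance metrics described in the next section; the midpoint split and the 0.2 threshold would be tuned per use case.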

Metrics

A proposed model monitoring metrics stack is given in Fig 3 below. It defines three broad types of metrics, based on whether the metric depends on the data, the ML model, or neither.

Fig 3: Model Monitoring Metrics Stack
  1. Stability Metrics — These metrics help us capture two types of data distribution shift:
    1. Prior Probability Shift — Captures the distribution shift of the predicted outputs and/or the dependent variable, either between the training data and the production data (Scenario I) or between different time frames of the production data (Scenario II). Examples include the Population Stability Index (PSI), the Divergence Index (concept shift), and the Error Statistic (details and definitions to follow in the next article of this series).
    2. Covariate Shift — Captures the distribution shift of each independent variable, either between the training data and the production data (Scenario I) or between different time frames of the production data (Scenario II), as applicable. Examples include the Characteristic Stability Index (CSI) and the Novelty Index (details and definitions to follow in the next article of this series); a minimal PSI/CSI sketch is given after this list.
  2. Performance Metrics — These metrics help us detect a concept shift in the data, i.e., identify whether the relationship between the independent and dependent variables has changed (e.g., post COVID, the way users purchase during festivals may have changed). They do so by examining how well or poorly the deployed model is performing vis-à-vis when it was trained (Scenario I) or vis-à-vis a previous time frame post deployment (Scenario II); see the sketch after this list. Accordingly, a decision can be taken on whether or not to rework the deployed model. Examples of these metrics include:
    1. Project metrics like RMSE, R-squared, etc. for regression and accuracy, AUC-ROC, etc. for classification.
    2. Gini and KS statistics: statistical measures of how well the predicted probabilities/classes are separated (only for classification models).
  3. Operations Metrics — These metrics help us determine how the deployed model is performing from a usage point of view. They are independent of model type and data, and unlike the two metric types above they do not require any inputs. Examples of these metrics include:
    1. Number of times the ML API endpoints were called over a given period
    2. Latency when calling ML API endpoints
    3. IO/memory/CPU usage when performing prediction
    4. System uptime
    5. Disk utilization
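
To make the stability metrics concrete, here is a minimal sketch of how the Population Stability Index could be computed under either scenario; the decile binning, the eps guard, and the function name are simplifying assumptions rather than a reference implementation. Applying the same routine to each independent variable instead of the model's outputs gives the idea behind the Characteristic Stability Index (CSI).

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (training-time scores in Scenario I,
    or an older production window in Scenario II) and a recent sample.

    PSI = sum_i (p_actual_i - p_expected_i) * ln(p_actual_i / p_expected_i),
    with bins derived from the reference sample's quantiles.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Bin edges from the reference distribution (deciles by default);
    # open-ended outer edges catch out-of-range production values.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # eps guards against log(0) and division by zero for empty bins.
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))
```

A common rule of thumb reads a PSI below 0.1 as stable, 0.1–0.25 as a moderate shift, and above 0.25 as a significant shift worth investigating, though such thresholds should be validated for each use case.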

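In the same spirit, a performance check compares the deployed model's metric on recently labelled production data against a baseline: the training-time value in Scenario I, or the value from an earlier production window in Scenario II. The sketch below is illustrative only; the metric choices and the 10% relative tolerance are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

def performance_degraded(y_true, y_pred, baseline, task="regression", tolerance=0.10):
    """Flag the model if its live performance drifts more than `tolerance`
    (relative) away from the baseline value."""
    if task == "regression":
        current = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE: higher is worse
        return current > baseline * (1 + tolerance)
    # Classification: y_pred holds predicted probabilities of the positive class.
    current = roc_auc_score(y_true, y_pred)                     # AUC-ROC: lower is worse
    return current < baseline * (1 - tolerance)
```

The boolean result can feed the same update-or-keep decision sketched at the end of the "Motivation" section above.
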
In the next blogs of this series, we will get into more detail on the various Stability and Performance metrics and see how we can use them to build our model monitoring framework.

Pronojit Saha is an AI practitioner with extensive experience in solving business problems, architecting, and building end-to-end ML driven products & solutions by leading and facilitating cross-functional teams. He is currently the Advanced Analytics Practice Lead at Abzooba, wherein apart from project execution he also engages in leading & growing the Practice by nurturing talent, building thought leadership, and enabling scalable processes. Pronojit has worked in the retail, healthcare, and Industry 4.0 domains. Time series analytics and natural language processing are his expertise and he has applied these along with other AI methodologies for use cases like price optimization, readmission prediction, predictive maintenance, aspect-based sentiment analytics, entity recognition, topic modeling, among others.

Dr. Arnab Bose is Chief Scientific Officer at Abzooba, a data analytics company and an adjunct faculty at the University of Chicago where he teaches Machine Learning and Predictive Analytics, Machine Learning Operations, Time Series Analysis and Forecasting, and Health Analytics in the Master of Science in Analytics program. He is a 20-year predictive analytics industry veteran who enjoys using unstructured and structured data to forecast and influence behavioral outcomes in healthcare, retail, finance, and transportation. His current focus areas include health risk stratification and chronic disease management using machine learning, and production deployment and monitoring of machine learning models.
