Delivering AI/ML without proper DataOps is just wishful thinking!


This post was originally published by Sandeep Uttamchandani at Towards Data Science - Medium.

DataOps processes you need for effective AI/ML

Behind every successful AI/ML product is a fast and reliable data pipeline developed using well-defined DataOps processes!

To level-set, what is DataOps? From Wikipedia: “DataOps incorporates the agile methodology to shorten the cycle time of analytics development in alignment with business goals.”

Teams that have moved past ad-hoc DataOps fall into two maturity levels:
  • Developing: Clear processes defined but accomplished manually by the data team.
  • Optimal: Clear processes with self-service automation for data scientists, analysts, and users.

Similar to software development, DataOps can be visualized as an infinity loop. Each stage of the loop has its own characteristic pain point:
  • Create: “The query joins the tables in the data samples. I didn’t realize the actual data had a billion rows!”
  • Orchestrate: “Pipeline completes but the output table is empty; the scheduler triggered the ETL before the input table was populated”
  • Test & Fix: “Tested in dev using a toy dataset; processing failed in production with OOM (out of memory) errors”
  • Continuous Integration: “Poorly written data pipeline got promoted to production; the team is now firefighting”
  • Deploy: “Did not anticipate the scale and resource contention with other pipelines”
  • Operate & Monitor: “Not sure why the pipeline is running slowly today”
  • Optimize & Feedback: “I tuned the query once; didn’t realize the need to do it continuously to account for data skew, scale, etc.”
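
The Orchestrate failure above (the ETL runs before its input is populated) is often avoided with a readiness gate in front of the job. The following is only a minimal sketch of that idea, using Python's built-in `sqlite3` as a stand-in for a real warehouse. The function names `input_is_ready` and `run_etl_if_ready`, the `min_rows` threshold, and the copy-all-rows "ETL" are hypothetical and chosen for illustration; a production scheduler would typically express the same check as a dependency or sensor.

```python
import sqlite3


def input_is_ready(conn, table, min_rows=1):
    """Return True when `table` exists and holds at least `min_rows` rows."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if exists is None:
        return False
    # Table names cannot be bound as parameters; the name was checked
    # against sqlite_master above before being placed in the query.
    (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return count >= min_rows


def run_etl_if_ready(conn, source, target):
    """Copy rows from `source` to `target` only when `source` is populated.

    Returns the number of rows written, or None when the run was skipped.
    """
    if not input_is_ready(conn, source):
        return None  # skip (or retry later) instead of writing an empty output
    # Create an empty target with the same columns as the source if needed.
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{target}" AS SELECT * FROM "{source}" WHERE 0'
    )
    cur = conn.execute(f'INSERT INTO "{target}" SELECT * FROM "{source}"')
    conn.commit()
    return cur.rowcount
```

Returning `None` for a skipped run lets the caller distinguish "input not ready, try again later" from "ran and wrote zero rows", which is the distinction the quoted failure was missing.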

This blog series will help you go from ad-hoc to well-defined DataOps processes, and shares ideas on how to make them self-service so that data scientists and users are not bottlenecked by data engineers.

For each stage of the DataOps lifecycle, follow the links for the key processes to define and the experiences in making them self-service (some of the links below are still being populated, so bookmark and come back):

  1. Formulating the scope and success criteria of the AI/ML problem
  2. How to select the right data processing technologies (batch, interactive, streaming) based on business needs
  1. How to streamline the data preparation process
  2. How to make behavioral data self-service
  1. Re-using ML Model Features
  2. Scheduling data pipelines
  1. Identify and remove data pipeline bottlenecks
  2. Verify data pipeline results for correctness, quality, performance, and efficiency.
  1. Scheduling window selection for data pipelines
  2. Changes rollback
  1. Managing data incidents in production
  2. Alerting on rogue (resource hogging) jobs
  1. Tracking lineage of data flows
  2. Enforcing data quality with circuit breakers
  1. Alerting on budgets
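
One item in the list above, enforcing data quality with circuit breakers, can be illustrated with a small sketch. The idea assumed here is that a batch is checked before it is handed downstream, and the pipeline stops rather than propagating bad data. Everything in the block is hypothetical and for illustration only: the `DataQualityError` class, the `circuit_breaker` function, and the two checks with their default thresholds (`min_rows`, `max_null_fraction`). It is not taken from the linked post.

```python
class DataQualityError(Exception):
    """Raised when a batch fails a quality check and the pipeline should stop."""


def null_fraction(rows, column):
    """Fraction of rows whose `column` is missing or None (0.0 for an empty batch)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def circuit_breaker(rows, required_columns, min_rows=1, max_null_fraction=0.05):
    """Return `rows` unchanged if the batch passes; otherwise raise DataQualityError."""
    if len(rows) < min_rows:
        raise DataQualityError(f"expected at least {min_rows} rows, got {len(rows)}")
    for column in required_columns:
        fraction = null_fraction(rows, column)
        if fraction > max_null_fraction:
            raise DataQualityError(
                f"column {column!r} is {fraction:.0%} null, limit is {max_null_fraction:.0%}"
            )
    return rows
```

A pipeline step would call `circuit_breaker(batch, ["id", "amount"])` before publishing `batch`. Raising an exception, rather than logging a warning, is what makes it a breaker: downstream consumers never see the failing batch.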

DataOps as a team sport (Image by author)