2021 is the year for feature store maturity
Feature stores are a fairly new product technology domain that supports the development, maintenance, and monitoring of the data features used by machine learning algorithms in the artificial intelligence systems around us. Basically, a feature store is a data management layer for saving and repurposing data features specifically designed for machine learning use cases. What is a feature? It is a measurable data property of an entity or representation of an object (e.g. product, customer, vendor, order, transaction, etc.) that machine learning models can understand.
The core system capabilities of a feature store comprise: feature engineering (feature creation); a storage layer for both online and offline features; a serving layer (via API or SDK); a registry where features can be discovered, with trackable historical lineage; and monitoring (and alerting) of features in use, to understand data drift and detect anomalies.
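As a concrete illustration of those capabilities, here is a minimal toy sketch in pure Python (all class and method names here are hypothetical; a real feature store backs each piece with databases, streaming systems, and serving APIs):

```python
from collections import defaultdict

class MiniFeatureStore:
    """Toy sketch of the core capabilities: a registry, an offline
    (historical) store, an online (latest-value) store, and serving."""

    def __init__(self):
        self.registry = {}                # feature name -> metadata
        self.offline = defaultdict(list)  # entity id -> [(timestamp, features)]
        self.online = {}                  # entity id -> latest feature values

    def register(self, name, description):
        # Registry: make features discoverable with metadata.
        self.registry[name] = {"description": description}

    def ingest(self, entity_id, timestamp, features):
        # Offline store keeps full history; online store keeps latest values.
        self.offline[entity_id].append((timestamp, features))
        self.online[entity_id] = features

    def get_online_features(self, entity_id):
        # Serving layer: low-latency lookup for model inference.
        return self.online.get(entity_id)

store = MiniFeatureStore()
store.register("avg_order_value", "Mean order value over the last 30 days")
store.ingest("customer_42", "2021-01-01", {"avg_order_value": 57.0})
store.ingest("customer_42", "2021-02-01", {"avg_order_value": 63.5})
print(store.get_online_features("customer_42"))  # latest values for serving
```

The offline history is what training pipelines read, while the online map is what a deployed model queries at prediction time.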
The benefits of having a feature store are tenfold. In summary, it accelerates model development in artificial intelligence, increases productivity, and provides point-in-time governance and monitoring of the data being used. It allows for feature versioning and reusability, and decouples the data science process from machine learning engineering.
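To make the point-in-time idea concrete, here is a small sketch (the function name and data are illustrative): when building a training set, each label may only see feature values that existed at the label's timestamp, which prevents data leakage from the future.

```python
def point_in_time_value(history, as_of):
    """Return the latest feature value known at `as_of` (no future leakage).
    `history` is a list of (timestamp, value) pairs sorted ascending;
    ISO date strings compare correctly as plain strings."""
    value = None
    for ts, v in history:
        if ts <= as_of:
            value = v
        else:
            break
    return value

history = [("2021-01-01", 57.0), ("2021-02-01", 63.5), ("2021-03-01", 70.2)]
# A training label dated 2021-02-15 must only see features up to that date:
print(point_in_time_value(history, "2021-02-15"))  # 63.5, not 70.2
```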
But we are still in the early days. This marketplace is young, with many new products entering the domain. Not to mention, the AI data science platforms (end-to-end systems) are beginning to shift, adding new feature store components. We are going to see a lot of maturing in this domain over the next few years.
As of today, consider taking a look at these top 10 feature store applications…
Feast is an open-source feature store (K8s). It offers a fast path to operationalizing your machine learning data for model training and online inference (On-premises/ Managed-cloud). It serves features online only and ingests only ready-made features. It allows team members to register, ingest, serve, and monitor features in production. Feast might be the feature store for your needs if you have an existing Kubernetes cluster and want to deploy Kubeflow with an open-source feature store. It comes with Helm charts for installation (Postgres and Redis). Offline features (BigQuery) can be stored in an object store, a distributed file system, or a data warehouse. Online serving uses BigTable/Redis. Metadata lives in database tables, with feature engineering via Beam and Python.
You can read more in the Feast documentation. Willem Pienaar is the creator of Feast, and today both Tecton and Gojek are core contributors supporting the open-source project. You can find their Slack channel #Feast and GitHub repository feast-dev/feast.
Tecton Enterprise (LinkedIn) was founded by the team that created Uber's Michelangelo platform. Tecton provides a mature, enterprise-ready feature store (Online/ Offline) and is one of the leading companies in the managed-cloud feature space. As mentioned above, Tecton is also a core contributor to Feast. You can learn about the differences between Feast and Tecton capabilities here. Tecton can combine batch, streaming, and real-time data. You can build features using familiar machine learning languages and libraries, including Python, SQL, and PySpark. You can use the Python SDK in your preferred notebook environment to create training datasets. It supports automated feature transformations and monitors features throughout their lifecycle.
Hopsworks is an open-source feature store supported by Logical Clocks. It is a managed feature store platform (Online/ Offline) for scale-out data science (On-premises/ Managed-cloud) that supports both GPUs and big data. Hopsworks is available in both open-source and enterprise versions. The open-source version is fully functional, but the enterprise version adds support for Active Directory and OAuth2 SSO, as well as integration with Kubernetes. On-premises, it can be integrated with Cloudera, providing support for AD/Kerberos SSO. It allows feature engineering in your Cloudera cluster (or via Python Jupyter notebooks), with ingestion handled by Spark jobs running on Cloudera.
It can be used either through its user interface or via a REST API. The platform supports Jupyter, plugins for IDEs (via the REST API), and Conda/Pip; the machine learning frameworks TensorFlow, Keras, PyTorch, and scikit-learn; the data analytics and BI applications SparkSQL and Hive; stream processing with Spark Streaming, Flink, and Kafka; and lastly model serving using Kubernetes/Docker. Offline storage uses Hudi/Hive, with online storage in MySQL Cluster. Metadata is stored in database tables and indexed with Elasticsearch. Hopsworks has a solid security model built around projects, providing strong GDPR-compliant management of sensitive data in a shared data platform.
Iguazio (LinkedIn) provides an integrated central-hub feature store (Online/ Offline) with advanced data transformation. It is a production-ready (On-premises/ Managed-cloud) feature store that is fully integrated into Iguazio's data science platform. It is built on top of Iguazio's real-time data layer, which has been commercially available since the end of 2014.
The Iguazio feature store automates and simplifies the way features are engineered, with a single implementation for both real-time and batch. Iguazio might be a good selection if you are looking for an integrated feature store within a full data science solution (so that model development, deployment and monitoring are all integrated seamlessly), or if you need robust data transformation / real-time capabilities.
Iguazio also maintains MLRun, an open-source MLOps orchestration framework (which tightly integrates with Kubeflow). It enables users to create robust ML pipelines within a single interface, incorporating the feature engineering produced by the feature store as well as other parts of the pipeline such as model development, deployment, and serving. Offline storage uses Parquet; online storage is an in-memory database. Feature engineering is done via Spark, Python, and Nuclio.
The product uses a multi-model database to serve the computed features through many different APIs and formats (such as files, SQL queries, pandas, real-time REST APIs, time series, and streaming). The integration components allow the feature store to act as a centralized, versioned catalog where data team members can engineer and store features along with their metadata and statistics. The feature store is integrated with the model serving layer and therefore provides built-in detection of data drift.
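As an illustration of what such drift detection does under the hood, here is a simple sketch of the population stability index (PSI), a common drift metric. This is a generic example, not Iguazio's implementation; the binning and the sample data are simplified for clarity.

```python
import math

def population_stability_index(expected, actual, bins=5):
    """Simple PSI between a training (expected) and serving (actual)
    sample of one numeric feature; PSI > 0.2 is a common alarm level."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    def frac(sample, b):
        # Fraction of the sample falling in bin b (hi belongs to the last bin).
        count = sum(
            1 for x in sample
            if lo + b * width <= x < lo + (b + 1) * width
            or (b == bins - 1 and x == hi)
        )
        return max(count / len(sample), 1e-6)  # avoid log(0) for empty bins
    return sum(
        (frac(actual, b) - frac(expected, b))
        * math.log(frac(actual, b) / frac(expected, b))
        for b in range(bins)
    )

train = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
serve_ok = [0.15, 0.25, 0.3, 0.35, 0.5, 0.55, 0.65, 0.7]
serve_shifted = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8]

# The shifted serving sample yields a far larger PSI than the similar one:
print(population_stability_index(train, serve_ok))
print(population_stability_index(train, serve_shifted))
```

A feature store that tracks these statistics per feature can alert when serving data no longer resembles the training data.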
Kaskada (LinkedIn) is a recently released, Python-based (AWS-managed) feature store application. Kaskada delivers an end-to-end platform for feature engineering and feature serving, including a collaborative interface for computing, storing, and serving features in production. It allows data scientists to own the end-to-end lifecycle of features without needing help from engineering, and users are automatically served accurate, up-to-date predictions based on their most recent behavior. Data engineers simply call an API to get up-to-date feature vectors for each user.
Molecula (LinkedIn) provides a centralized, cloud-based feature store that gives access to your big data by reducing the dimensionality of the original source data. It supports a REST API that taps into Kafka, MySQL, MS SQL, Snowflake, Cassandra, and Teradata. In addition, it supports Spark, Parquet, S3, and BigQuery. Molecula extracts features, reducing dimensionality, and routes feature changes in real time into a central store. The product is fairly new, recently released, and evolving quickly.
Butterfree is an open-source, Spark-based framework for a feature store platform (Online/ Offline) with S3 and Cassandra. It is being built by Quintoandar (LinkedIn). Butterfree's concept is declarative feature engineering: you focus on what you want to achieve, while all transformations and engineering layers are abstracted away, enabling a trouble-free approach to reading data sources and writing features to both offline and online destinations. Metadata is generated alongside the transformations, so it is easy to export documentation afterward. It promotes team collaboration, with feature storage in a centralized repository for creating data pipelines. It uses Spark-based extract, transform, and load (ETL) modules.
With declarative feature engineering, you describe what you want to compute, not how to code it. The feature store modeling library provides everything needed to process and load data into the feature store. The main features can be checked out in Butterfree's documentation (.pdf) and via Butterfree's notebook examples.
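To illustrate the declarative idea in general terms (this is a generic sketch, not Butterfree's actual API): you declare what each feature is, and a small engine works out how to compute it.

```python
# Declarative spec: describe WHAT to compute; the engine decides HOW.
# All names here (FEATURE_SPECS, compute_features, ...) are illustrative.
FEATURE_SPECS = {
    "total_spend": {"source": "orders", "column": "amount", "agg": "sum"},
    "order_count": {"source": "orders", "column": "amount", "agg": "count"},
    "max_order":   {"source": "orders", "column": "amount", "agg": "max"},
}

AGGREGATIONS = {"sum": sum, "count": len, "max": max}

def compute_features(rows_by_source, specs):
    """Tiny engine: apply each declared aggregation to its source column."""
    out = {}
    for name, spec in specs.items():
        values = [row[spec["column"]] for row in rows_by_source[spec["source"]]]
        out[name] = AGGREGATIONS[spec["agg"]](values)
    return out

orders = [{"amount": 10.0}, {"amount": 25.0}, {"amount": 15.0}]
print(compute_features({"orders": orders}, FEATURE_SPECS))
# {'total_spend': 50.0, 'order_count': 3, 'max_order': 25.0}
```

Because the spec is data rather than code, the same definitions can drive both an offline batch job and an online pipeline, which is the core appeal of the declarative approach.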
Scribble Data (LinkedIn and Facebook) has a recently released feature store that is regularly adding capabilities. It supports tracking feature utilization along with ownership. You can use the SDK and other services to rapidly implement feature engineering modules. You can discover datasets via a marketplace for features, along with a search interface for building cohorts for analysis. Components let you administer versioned, auditable, parameterized pipelines, each generating multiple datasets. The product gives you the ability to audit and check the provenance of datasets by name or other attributes, and to compare runs.
Splice Machine (LinkedIn) is the only feature store (K8s) powered by an ACID-compliant dual OLTP/OLAP RDBMS (On-premises/ Managed-cloud). The product is recently released and scales out via a SQL database (SQL HTAP) with built-in machine learning components. It is focused on reducing the effort of feature engineering and helping address governance issues such as bias, drift, and regulatory oversight. You can read the documentation.
StreamSQL is a feature store (online) for machine learning (GCP- or AWS-managed) built around the concept of event sourcing on Kafka. It generates data model features for serving using declarative definitions, creates training sets using the same feature definitions, and offers capabilities for versioning, monitoring, and managing features. Features can be shared, reused, and discovered across teams and models via an API. The event storage is an immutable ledger of every domain event. It is a young product that still lacks some capabilities, with plans to add new components on its roadmap.
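To illustrate the event-sourcing idea in general (a generic sketch with made-up event names, not StreamSQL's API): features are materialized by replaying the immutable event ledger, so the same definitions can build a full training set (complete replay) or keep online values current (incremental replay).

```python
# Event sourcing: features are derived by folding an immutable event log.
EVENT_LOG = [
    {"type": "page_view",   "user": "u1"},
    {"type": "add_to_cart", "user": "u1"},
    {"type": "page_view",   "user": "u1"},
    {"type": "purchase",    "user": "u1"},
]

def features_from_events(events, user):
    """Replay the ledger to materialize per-user features."""
    views = sum(1 for e in events
                if e["user"] == user and e["type"] == "page_view")
    purchases = sum(1 for e in events
                    if e["user"] == user and e["type"] == "purchase")
    return {"page_views": views, "purchases": purchases}

print(features_from_events(EVENT_LOG, "u1"))
# {'page_views': 2, 'purchases': 1}
```

Because the log is append-only, features for any past moment can be reconstructed exactly by replaying only the events up to that point.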
Again, this product sector is in its early days, and we are going to see many new products entering the domain over the next few years.
Amazon SageMaker (AWS-managed) has only recently added feature store components to the SageMaker Studio IDE. The SageMaker Feature Store is a fully managed, purpose-built repository to store, update, retrieve, and share machine learning (ML) features. It provides a place for data scientists (or other SageMaker users) to name, organize, find, and share the features used in machine learning models. The software does the difficult bit of keeping track of all the ways different features are being used by different groups in different machine learning models.
Featuretools is an open-source Python framework (GitHub) for automated feature engineering, built by Alteryx Innovation Labs. It is still maturing and lacks a storage layer component, so it is not a full feature store application.
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Production (slides 5–8) does a great job of highlighting the differences in approaches to feature stores today.
Jim Dowling, CEO of Logical Clocks, presenting Building a Feature Store around Dataframes and Apache Spark
Daniel Galinkin presenting Building a Real-Time Feature Store at iFood
Nikhil Simha presenting Zipline – A Declarative Feature Engineering Framework