This post was originally published by Ben Weber at Towards Data Science on Medium.
Responding to Prediction Requests in Milliseconds
As an applied data scientist at Zynga, I’ve started getting hands-on with building and deploying data products. As I’ve explored more and more use cases for machine learning, there’s been an increasing need for real-time machine learning (ML) systems, where the system performs feature engineering and model inference to respond to prediction requests within milliseconds. While I’ve previously used tools such as AWS SageMaker to do model inference in near real-time, I only recently explored options for also doing feature engineering on-the-fly for ML systems.
Ad technology is one of the domains where real-time ML is a requirement to build a system that performs well in the advertising marketplace. On the demand side of advertising, a real-time bidder implementing the OpenRTB specification needs to predict which ad impressions are most likely to drive conversion events. On the supply side, an ad mediation platform needs to determine the bid floor for advertising inventory in real-time in order to optimize advertising revenue.
In a real-time ML deployment, the system replies to a request within milliseconds of the request being made. There are two general workflows for making prediction requests with a real-time system:
- Web Requests
- Streaming Workflows
In the first case, the system or client that needs a prediction makes an HTTP request to an endpoint that responds directly to the request with a prediction. Other protocols, such as gRPC, can be used for this type of workflow.
The second workflow can be implemented in a variety of ways. For example, a request can be made to a Kafka topic, where it is processed with Spark Streaming, and the result is published to a separate topic. Other streaming frameworks such as Flink or GCP Dataflow can be used to respond to prediction requests in near real time.
Over the past year, I’ve gotten hands-on with Golang to build real-time ML systems. While Python can be used to implement these types of systems, Golang is typically able to respond to a higher number of requests per second for a fixed number of machines. Additionally, Golang has some elegant features for working with NoSQL data stores when building real-time ML systems. For use cases with extremely large request volumes, I’ve targeted pure Go implementations for efficiency.
In this post, I’ll discuss some of the options available for building ML systems that perform feature engineering and model inference in real-time. I’ll discuss both pure Go approaches and options for hybrid approaches with Python.
Before a model can be applied, it’s usually necessary to translate a prediction request into a feature vector that can be passed to an ML model. For example, with an OpenRTB bid request, the JSON passed in the web request needs to be converted into a feature vector before predicting conversion events. Another use case is retrieving a user profile for a mobile game, and translating the profile into a feature vector for making a lifetime value (LTV) prediction. In these use cases, feature engineering is being performed in real-time based on the information in the request, or using previously stored information about a user in a mobile application.
Moving feature engineering from a batch process, where vectors are pre-computed, to a real-time system, where features are built on the fly for every incoming request, requires a significant shift. In ad tech, a system may receive a bid request and then augment the request with additional data points, such as information from a data-management platform (DMP), before performing feature engineering. All of this needs to be done with minimal latency, which restricts the types of approaches that can be used. Here are some of the approaches I’ve explored:
- Frequently pre-compute features and store in NoSQL
- Use NoSQL to store and update feature vectors
- Use NoSQL to store profiles, and translate with custom application code
- Use NoSQL to store profiles, and translate with a DSL
- Call a remote endpoint for feature engineering
Most of these solutions rely on using a NoSQL data store because it’s necessary to respond to requests quickly. There are several trade-offs to consider for which tool to use, such as the number of data points to store, read vs write requests, and maximum latency. Redis is a great option when getting started, and most cloud platforms have a managed offering.
Pre-Computing Feature Vectors
This approach most closely represents the traditional batch workflow for making predictions, where a prediction is made for a large number of users at the same time, and the results are cached to an application database, such as Redis. This is not actually performing feature engineering in real-time, but can be implemented in a way that the feature vectors are frequently updated for users that are active in an application. For example, you can use a Spark job to query a relational database every 15 minutes, use SQL to translate user activity into feature vectors, and store the results to a NoSQL store. For many use cases, having 15 minutes of latency from a user performing actions in an application, to this activity showing up in a feature vector for model inference may not be a problem. However, it can be quite taxing on the relational database, and is only recommended for situations where the number of users to create vectors for is relatively small. This approach does not work for problems where the system needs to use context from the prediction request in the feature generation process, and should generally only be used as a fallback.
Storing Vectors Directly in NoSQL
A different approach for using NoSQL to perform feature engineering in real-time is to determine the shape of the required feature vectors for your application ahead of time, and to update these values in real time as new tracking events are received by the system. For example, the feature vector may have an attribute that tracks the total number of sessions the user has had, and the system updates this value each time the user starts a new session. For mobile applications, an MMP (mobile measurement partner) with real-time event callbacks can be used to set up this tracking functionality.
For this approach, a key such as a user ID is used as the key for the NoSQL store, and the value is the feature vector, which can be implemented in a variety of ways. In Redis, hashes can be used to update counter values in real time. The vector can also be stored as JSON, which is simple, but typically takes up more space. You can use programming language specific libraries for serialization, such as kryo for Java. The approach I typically use is Google Protocol Buffers for serializing data in a portable way.
This approach is basic, which makes it easier to implement, but it does have some major limitations. The first is that you need a fixed feature vector definition, because the vector is updated directly when new data is received, rather than aggregating data on-the-fly, which we’ll cover in later approaches. The second big limitation is that features must be updatable incrementally, rather than aggregated over past events. This means that features such as counters and flags can be used, but it’s not possible to compute aggregations such as the median, or other calculations that require historic data.
Profile and Custom Application Code
To get more flexibility in the types of operations that can be performed for feature engineering, one option is to store a user profile in NoSQL and to perform feature engineering on-the-fly using the data stored in the profile to create the vector. This approach uses the same process as before where the value in the NoSQL store is updated in real-time as new data is received, but instead of updating a feature vector this approach updates a user profile that summarizes user activity. For example, the profile may store the past 3 game modes played by a user in a game, or track which power-up items the user has recently used.
When a prediction request is received using this approach, the request includes a user ID, the prediction endpoint fetches the corresponding profile from NoSQL, and then custom application code translates the profile into a feature vector. In Golang, this can be implemented as a function that takes a protobuf object as input, and returns a feature vector ([]float64) as a result. This approach is typically the most computationally efficient, but it requires a new deployment of the system each time the definition of the feature vector changes. This works well for systems that only need to serve a small number of models whose definitions change infrequently, but can be problematic for systems that provide predictions for a variety of models.
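As a sketch, the translation function might look like the following; the Profile fields and the encoding choices are hypothetical, and in practice the input would be a protobuf-generated type rather than a plain struct:

```go
package main

import "fmt"

// Profile is a hypothetical user summary; in practice this would be
// a protobuf object fetched from the NoSQL store.
type Profile struct {
	TotalSessions int
	RecentModes   []string
}

// toVector translates a profile into a fixed-length feature vector.
// Changing this function requires redeploying the system.
func toVector(p Profile) []float64 {
	vec := make([]float64, 3)
	vec[0] = float64(p.TotalSessions)
	// 1-hot encode whether specific game modes were recently played.
	for _, m := range p.RecentModes {
		switch m {
		case "battle":
			vec[1] = 1.0
		case "puzzle":
			vec[2] = 1.0
		}
	}
	return vec
}

func main() {
	p := Profile{TotalSessions: 12, RecentModes: []string{"battle"}}
	fmt.Println(toVector(p)) // [12 1 0]
}
```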
Profile and a Domain Specific Language (DSL)
To remove the need to deploy the system each time that the feature vector definition changes, we need a more flexible approach for the system to update feature transforms. One approach is to use a data-driven or configuration-driven approach, where the transformation is defined as a data structure or configuration file that can be loaded by the system each time an update is made. For example, the system could be set up to ingest a JSON file that defines which attributes and values to 1-hot encode for an OpenRTB bid request. This enables data scientists to define the model pipeline in a separate runtime, such as PySpark, while the model is served in Go.
Over time the team may want to define more complex types of operations to perform on the request data and user profile to translate these data points into feature vectors. I’ve explored small domain-specific languages (DSL) for this problem, such as allowing ≥ and ≤ operators when comparing values and adding operators to check for an item in a list. This gets complicated pretty quickly, and becomes a new driver for system deployments, as new operators are defined for your custom DSL. Instead of building this from scratch for each new system, I’ve started using Google’s Common Expression Language (CEL) for translating user profiles into feature vectors. The language supports a variety of operations that can be performed on protobuf objects, and the output of the CEL program running on the input object is a feature vector. When using this approach, the feature engineering translation is defined as a CEL program, which can be passed as a string.
Feature Engineering Endpoint
If latency is not a concern and you want a great deal of flexibility in how to translate a user profile into a feature vector, you can set up a separate endpoint for performing that translation. This can be accomplished with a serverless function such as AWS Lambda or with a Docker image deployed to a Kubernetes cluster. This approach is useful when prototyping, but it results in additional cloud infrastructure and shares the drawbacks of the custom application code approach. The difference is that only the serverless function needs to be redeployed, rather than the whole system, when the feature engineering definition is updated.
Once you have a feature vector for your prediction request, you need to apply an ML model when responding to real-time requests. There’s a variety of options for achieving this functionality, but using Golang does make things more complicated. Here are the approaches I’ve explored.
- Inference Endpoint
- Custom Application Code
- Cross-Runtime Algorithms
- Portable Model Formats
I’ve worked on data products where the training environment is PySpark and the runtime environment is Golang, which makes it tricky to use models trained with the standard data science stack such as sklearn, XGBoost, SparkML, and TensorFlow. I’ll cover some approaches for dealing with these limitations, but pure Go implementations are still somewhat limited.
Inference Endpoint
One of the easiest ways to support a variety of algorithms for model inference is to use existing solutions for this problem, such as AWS SageMaker, BentoML, or fast.ai. These solutions serve models using a Python runtime and support the standard ML workbench of predictive models. When using this approach, the system first performs feature generation in the current process, calls out to an endpoint to perform model inference, and then responds to the prediction request with the value produced by the inference endpoint.
This approach works well for most use cases, but there are some considerations. If the system is responding to a high volume of requests, then managed solutions such as SageMaker can become expensive. Another concern is latency: calling out to a remote process adds overhead and can impact the SLAs of the system, which can be problematic for domains such as advertising technology.
Custom Application Code
If the algorithm that the system needs to run is relatively simple, such as a linear model, then it may be advantageous to write custom application code to implement the model. In Golang, linear and logistic regression are relatively easy to implement, but you’ll need to define a format for how to serialize a model from Python/PySpark to Go, depending on your training environment. For more complex algorithms, this approach is likely to quickly result in tech debt and eventually bugs.
Where this approach is quite useful is when building systems with extremely wide feature vectors, such as hundreds of thousands of features. If the vector is highly sparse, then lookup tables such as hash maps can be used to efficiently perform model inference when working with this scale of data.
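The sparse case reduces to a lookup-table dot product: only the active features are touched, regardless of how wide the full vector is. A minimal sketch, with made-up feature indices:

```go
package main

import "fmt"

// sparseDot scores a sparse feature vector (index -> value) against a
// weight lookup table, touching only the non-zero entries rather than
// iterating over hundreds of thousands of dimensions.
func sparseDot(features map[int]float64, weights map[int]float64) float64 {
	var sum float64
	for idx, v := range features {
		if w, ok := weights[idx]; ok {
			sum += w * v
		}
	}
	return sum
}

func main() {
	// A vector with a huge nominal dimensionality but few active features.
	features := map[int]float64{101: 1.0, 20543: 1.0}
	weights := map[int]float64{101: 0.25, 20543: -0.5, 999: 2.0}
	fmt.Println(sparseDot(features, weights)) // -0.25
}
```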
Cross-Runtime Algorithms
One of the approaches I’ve frequently used for Golang ML systems is to choose algorithms that have implementations in both Python and Golang. LightGBM is a great example of an ML algorithm that works well in Python and in Go with the leaves library. The intersection of libraries with actively supported implementations in both languages is small, but will hopefully grow over time. TensorFlow does have a Golang API, but it’s currently not stable.
Portable Model Formats
Because only a handful of ML libraries have proper support in Golang, another approach that I’ve explored is portable model formats such as PMML and ONNX. PMML was the initial proposal for making ML models portable across different runtimes, but it saw only limited adoption by the key ML frameworks. The premise is that you can train a model in Python, save the result in the PMML format, and then load the model in a different runtime to perform model inference. Golang has PMML implementations for inference, such as goscore, but popular algorithms such as XGBoost are not currently supported.
ONNX is the next-generation proposal for making ML models portable across runtimes. It was designed initially for deep learning, but has been extended to support classic algorithms. In theory, you can train models using SparkML, XGBoost, sklearn, TensorFlow, and others, and run model inference in any runtime with an ONNX implementation. The current state is that many of these frameworks have limited support for exporting to ONNX, and the frontrunner for the Golang implementation has limited support for classic ML algorithms. Support for ONNX exporting and runtimes will hopefully improve over time, making pure Go implementations of ML systems a solid option.
Switching from batch to real-time ML predictions for data products requires new approaches for performing feature engineering and potentially model inference. The method I’ve found most useful for feature engineering in real-time is to store a user-level summary object in a NoSQL data store, and perform feature engineering on the fly for each request, with either custom application code or using Common Expression Language. For model inference, I’ve used algorithms with implementations available for both the model training and model serving environments.
Empowering our teams to build ML pipelines that can respond to prediction requests in milliseconds has opened up a variety of new use cases.