Reinforcement Learning based Recommender Systems

This post was originally published by Debmalya Biswas at Medium [AI]

Design personalized apps using a combination of Reinforcement Learning and NLP/Chatbots

Abstract. We present a Reinforcement Learning (RL) based approach to implement Recommender Systems. The results are based on a real-life Wellness app that is able to provide personalized health / activity related content to users in an interactive fashion. Unfortunately, current recommender systems are unable to adapt to continuously evolving features, e.g. user sentiment, and to scenarios where the RL reward needs to be computed based on multiple and unreliable feedback channels (e.g., sensors, wearables). To overcome this, we propose three constructs: (i) weighted feedback channels, (ii) delayed rewards, and (iii) reward boosting, which we believe are essential for RL to be used in Recommender Systems.

This paper has been presented in the “Advances in Artificial Intelligence for Healthcare” track at the 24th European Conference on Artificial Intelligence (ECAI), Sep 2020. (paper pdf) (ppt)

Reinforcement Learning based Recommender Systems: based on the pic by Andrea Piacquadio from Pexels

1 Introduction

Health / Wellness apps have historically suffered from low adoption rates. Personalized recommendations have the potential of improving adoption, by making increasingly relevant and timely recommendations to users. While recommendation engines (and consequently, the apps based on them) have grown in maturity, they still suffer from the ‘cold start’ problem, and from the fact that they are basically push-based mechanisms lacking the level of interactivity needed to make such apps appealing to millennials.

The core of such chatbots is an intent-recognition Natural Language Understanding (NLU) engine, which is trained with hard-coded examples of question variations. When no intent is matched with a confidence above 30%, the chatbot returns a fallback answer. The user sentiment is computed based on both the (explicit) user response and (implicit) environmental aspects, e.g. location (home, office, market, …), temperature, lighting, time of day, weather, other family members present in the vicinity, and so on, to further adapt the chatbot response.
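The confidence-threshold fallback described above can be sketched as follows. This is a minimal illustration only: `classify_intent` is a toy word-overlap scorer standing in for a real NLU engine, and only the 30% threshold comes from the text.

```python
FALLBACK_ANSWER = "Sorry, I didn't get that. Could you rephrase?"
CONFIDENCE_THRESHOLD = 0.30  # from the text: fallback below 30% confidence

def classify_intent(query, intents):
    """Toy scorer: fraction of an intent's example words found in the query."""
    words = set(query.lower().split())
    scores = {}
    for intent, examples in intents.items():
        example_words = set(" ".join(examples).lower().split())
        scores[intent] = len(words & example_words) / max(len(example_words), 1)
    return scores

def respond(query, intents, answers):
    scores = classify_intent(query, intents)
    best_intent = max(scores, key=scores.get)
    if scores[best_intent] < CONFIDENCE_THRESHOLD:
        return FALLBACK_ANSWER   # no intent matched confidently enough
    return answers[best_intent]
```

A production NLU engine would of course use a trained classifier rather than word overlap; only the thresholded fallback behaviour is the point here.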

RL refers to a branch of Artificial Intelligence (AI), which is able to achieve complex goals by maximizing a reward function in real-time. The reward function works similarly to incentivizing a child with candy and spankings, such that the algorithm is penalized when it makes a wrong decision and rewarded when it makes a right one — this is reinforcement. The reinforcement aspect also allows it to adapt faster to real-time changes in the user sentiment. For a detailed introduction to RL frameworks, the interested reader is referred to [1].
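The reward-and-penalty loop can be illustrated with the simplest possible RL setting, an epsilon-greedy multi-armed bandit. This is our own illustrative example, not the paper's formulation (which follows in Section 2): the running value estimate of each action drifts toward the rewards it actually receives.

```python
import random

def epsilon_greedy(values, epsilon=0.1):
    """Pick a random action with prob. epsilon (explore), else the best one."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=values.__getitem__)

def update(values, counts, action, reward):
    """Incremental-mean update: reward raises the estimate, penalty lowers it."""
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]
```

Because the update is incremental, the agent adapts in real time as rewards shift, which is the property the text highlights.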

Previous works have explored RL in the context of Recommender Systems [2, 3, 4, 5], and enterprise adoption also seems to be gaining momentum with the recent availability of Cloud APIs (e.g. Azure Personalizer [6, 7]) and Google’s RecSim [8]. However, they still work like a typical Recommender System. Given a user profile and categorized recommendations, the system makes a recommendation based on popularity, interests, demographics, frequency and other features. The main novelty of these systems is that they are able to identify the features (or combination of features) of recommendations getting higher rewards for a specific user; which can then be customized for that user to provide better recommendations [9].

The rest of the paper is organized as follows: Section 2 outlines the problem scenario and formulates it as an RL problem. In Section 3, we propose the three RL constructs: weighted feedback channels, delayed rewards, and reward boosting. ‘Delayed Rewards’ in this context is different from the notion of Delayed RL [10], where rewards in the distant future are not considered as valuable as immediate rewards; in our notion of ‘Delayed Rewards’, a received reward is only applied after its consistency has been validated by a subsequent action. Section 4 concludes the paper and provides directions for future research.

2 Problem Scenario

2.1 Wellness App

The app supports both push based notifications, where personalized health, fitness, activity, etc. related recommendations are pushed to the user; as well as interactive chats where the app reacts in response to a user query. We assume the existence of a knowledgebase KB of articles, pictures and videos, with the artifacts ranked according to their relevance to different user profiles / sentiments.

The Wellness app architecture is described in Fig. 1, which shows how the user and environmental conditions (comprising the user feedback) are:

1. gathered using available sensors to compute the ‘current’ feedback, including environmental context (e.g. a webcam pic of the user can be used to infer the user sentiment towards a chatbot response / notification, the room lighting conditions, and other users present in the vicinity);

2. combined with the user conversation history to quantify the user sentiment curve and discount any sudden changes in sentiment due to unrelated factors;

3. aggregated into the reward value corresponding to the last chatbot response / app notification provided to the user.

This reward value is then provided as feedback to the RL agent, to choose the next optimal chatbot response / app notification from the knowledgebase.
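The three-step pipeline above can be sketched as follows. All names (`compute_reward`, `sentiment_from`, the `smoothing` blend) are our own placeholders, not the app's actual API; the smoothing against the recent sentiment curve is one simple way to "discount sudden changes" as the text describes.

```python
def sentiment_from(reading):
    """Placeholder classifier mapping one sensor reading to a sentiment score.
    For this sketch we assume the reading already is such a score."""
    return reading

def compute_reward(sensor_readings, sentiment_history, smoothing=0.5):
    # Step 1: per-sensor sentiment for the 'current' feedback.
    current = sum(sentiment_from(r) for r in sensor_readings) / len(sensor_readings)
    # Step 2: discount sudden jumps by blending with the recent sentiment curve.
    if sentiment_history:
        baseline = sum(sentiment_history) / len(sentiment_history)
        current = smoothing * current + (1 - smoothing) * baseline
    sentiment_history.append(current)
    # Step 3: the smoothed sentiment is the reward for the last action.
    return current
```

An exponential moving average or a learned filter could replace the simple mean baseline; the structure (sense, smooth against history, emit reward) is the point.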

Fig. 1. Wellness app architecture (Image by Author)

2.2 Interactive Chat — RL Formulation

We formulate the RL Engine for the above scenario as follows:

Action (a): An action a in this case corresponds to a KB article which is delivered to the user either as a push notification, or in response to a user query, or as part of an ongoing conversation.

Agent (A): the entity performing actions. In this case, the Agent is the App delivering actions to the users, where an action is selected based on its Policy (described below).

Environment: refers to the world with which the agent interacts, and which responds to the agent’s actions. In our case, the Environment corresponds to the User U interacting with the App. U responds to A’s actions, by providing different types of feedback, both explicit (in the form of a chat response) and implicit (e.g., change in facial expression).

Policy (𝜋): the strategy that the agent employs to select the next best action. Given a user profile Up, (current) sentiment Us, and query Uq, the Policy function computes the product of the article scores returned by the NLP and Recommendation Engines respectively, selecting the article with the highest score as the next best action: (a) The NLP Engine (NE) parses the query and outputs a score for each KB article, based on the “text similarity” of the article to the user query. (b) Similarly, the Recommendation Engine (RE) provides a score for each article based on the reward associated with each article, with respect to the user profile and sentiment. The Policy function can be formalized as follows:

𝜋(Up, Us, Uq) = argmax_{a ∈ KB} [ NE(a, Uq) · RE(a, Up, Us) ]
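A minimal sketch of this Policy, assuming the two engines are supplied as scoring callables (`ne_score`, `re_score` are our placeholder names, not the app's API):

```python
def policy(kb_articles, user_profile, user_sentiment, user_query,
           ne_score, re_score):
    """Return the KB article maximizing the product of the two engine scores."""
    def combined(article):
        return (ne_score(article, user_query) *
                re_score(article, user_profile, user_sentiment))
    return max(kb_articles, key=combined)
```

Taking the product (rather than, say, a weighted sum) means an article must score reasonably on both text similarity and reward history to be selected.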

Reward (r): refers to the feedback by which we measure the success or failure of an agent’s recommended action. The feedback can e.g. refer to the amount of time that a user spends reading a recommended article. We consider a 2-step reward function computation where the feedback fa received with respect to a recommended action is first mapped to a sentiment score, which is then mapped to a reward.

r_a = r( s( f_a ) )

where r and s refer to the reward and sentiment functions, respectively. The RL formulation described above is illustrated in Fig. 2.
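A toy instance of the 2-step computation, using the reading-time feedback mentioned above. The specific linear mappings are our own illustrative assumptions; only the composition reward(sentiment(feedback)) comes from the text.

```python
def sentiment(seconds_read, max_seconds=300):
    """Map reading time to a sentiment score in [1, 10] (illustrative)."""
    return 1 + 9 * min(seconds_read, max_seconds) / max_seconds

def reward(sentiment_score):
    """Rescale a sentiment score in [1, 10] to a reward in [0, 1]."""
    return (sentiment_score - 1) / 9

def two_step_reward(seconds_read):
    # The 2-step composition r(s(f_a)) from the formulation above.
    return reward(sentiment(seconds_read))
```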

Fig. 2. RL Formulation (Image by Author)

3 RL Reward and Policy Extensions

3.1 Weighted (Multiple) Feedback Channels

As described in Fig. 1, we consider multiple feedback channels, with feedback captured from user (edge) devices / sensors, e.g. a webcam, thermostat, or smartwatch, or the camera, microphone and accelerometer embedded within the mobile device hosting the app. For instance, a webcam frame capturing the facial expression of the user and the heart rate provided by the user’s smartwatch can be considered together with the user-provided text response “Thanks for the great suggestion”, in computing the user sentiment towards a recommended action.

Let {fa1, fa2, … fan} denote the feedback received for action a. Recall that s(f) denotes the user sentiment computed independently based on the respective sensory feedback f. The user sentiment computation can be considered as a classifier outputting a value between 1 and 10. The reward can then be computed as a weighted average of the sentiment scores, denoted below:

r_a = ( Σ_{i=1}^{n} w_{ai} · s(f_{ai}) ) / ( Σ_{i=1}^{n} w_{ai} )

where the weights {wa1, wa2, … wan} allow the system to harmonize the received feedback, as some feedback channels may suffer from low reliability. For instance, if fi corresponds to a typed user response and fj to a webcam snapshot, then higher weightage is given to fi. The reasoning is that the user might be ‘smiling’ in the snapshot, but the ‘smile’ may be due to his kid entering the room (also captured in the frame), and not necessarily a response to the received recommendation / action. At the same time, if the sentiment computed from the user’s text response indicates that he is ‘stressed’, then we give higher weightage to the explicit (text response) feedback in this case.
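The weighted average is a one-liner; the interesting part is choosing the per-channel weights to encode reliability (e.g. typed text above webcam snapshots), as the text argues. A minimal sketch:

```python
def weighted_reward(sentiments, weights):
    """Weighted average of per-channel sentiment scores s(f_ai).

    sentiments: one score per feedback channel (e.g. text, webcam, smartwatch)
    weights:    per-channel reliability weights w_ai (higher = more trusted)
    """
    assert len(sentiments) == len(weights) and sum(weights) > 0
    return sum(w * s for w, s in zip(weights, sentiments)) / sum(weights)
```

With sentiments [8, 2] (a positive text reply, an ambiguous webcam frame) and weights [3, 1], the text channel dominates and the reward lands at 6.5 rather than the unweighted 5.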

3.2 Delayed Rewards

To accommodate the ‘delayed rewards’ strategy, the rewards function is extended with a memory buffer that allows the rewards of the last m actions, from time t to (t+m), to be aggregated and applied retroactively at time (t+m). The delayed rewards function dr is denoted as follows:

dr = [ Σ_{i=t}^{t+m} w_i · r_{a_i} ] |_{t+m}

where |_{t+m} implies that the rewards for the actions from time t to (t+m), although computed individually, can only be applied at time (t+m). As before, the respective weights w_i allow us to harmonize the effect of inconsistent feedback, where the reward for an action a_i at time t_i is applied based on the reward computed for a later action at time t_{i+1}.
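A minimal sketch of the memory buffer, assuming rewards are computed individually at each step but only released once the window from t to (t+m) is full. The sliding-window mechanics are our own illustration of the construct, not the app's implementation.

```python
from collections import deque

class DelayedRewards:
    def __init__(self, m, weights):
        self.m = m
        self.weights = weights   # per-step weights w_i over the window
        self.buffer = deque()    # individually computed, not-yet-applied rewards

    def push(self, reward):
        """Record the reward computed at the current step. Returns the
        aggregated delayed reward once m subsequent rewards have validated
        it (i.e. the window t .. t+m is full), else None."""
        self.buffer.append(reward)
        if len(self.buffer) < self.m + 1:
            return None          # still waiting for validating actions
        aggregated = sum(w * r for w, r in zip(self.weights, self.buffer))
        self.buffer.popleft()    # slide the window forward by one step
        return aggregated
```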

To effectively enforce the ‘delayed rewards’ strategy, the Policy 𝜋 is also extended to recommend an action of the same type as the previously recommended action if the delay flag d is set (d = 1). The “delayed” Policy 𝜋_d is denoted below:

𝜋_d(t) = { an action of the same type as a_{t−1},  if d = 1;  𝜋(Up, Us, Uq),  otherwise }

The RL formulation extended with delayed reward / policy is illustrated in Fig. 3.

Fig. 3. Delayed Reward — Policy based RL formulation (Image by Author)
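The delay-flag behaviour of the extended Policy can be sketched as follows. `same_type_as` and `base_policy` are placeholders for app-specific logic (article categories, the 𝜋 defined in Section 2.2), not part of the paper's formulation.

```python
def delayed_policy(d, previous_action, kb_articles, base_policy, same_type_as):
    """If the delay flag is set, repeat the *type* of the previous action so
    the pending reward can be validated; otherwise defer to the base policy."""
    if d == 1 and previous_action is not None:
        candidates = [a for a in kb_articles if same_type_as(a, previous_action)]
        if candidates:
            return candidates[0]
    return base_policy(kb_articles)
```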

3.3 Rewards Boosting

The boosted reward r_b(a_t) for an action a_t at time t is computed as follows:

[Equation: r_b(a_t) — the ‘boost’ function applied to the reward of a_t, based on the deviation of the current sentiment from that of the last action]

We leave it as future work to extend the ‘boost’ function to last n actions (instead of just the last action above). In this extended scenario, the system maintains a sentiment curve of the last n actions, and the deviation is computed with respect to a curve, instead of a discrete value. The expected benefit here is that it should allow the system to react better to user sentiment trends.
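Since the exact boost equation is not recoverable from this copy, the following is only a labeled guess at its shape, consistent with the surrounding text (a boost driven by the deviation of the current sentiment from the previous action's discrete sentiment value). Every name and the multiplicative form are hypothetical.

```python
def boosted_reward(base_reward, current_sentiment, previous_sentiment,
                   boost_factor=0.1):
    """Hypothetical boost: amplify (or dampen) the base reward in proportion
    to the sentiment deviation from the previous action."""
    deviation = current_sentiment - previous_sentiment
    return base_reward * (1 + boost_factor * deviation)
```

The future-work extension described above would replace `previous_sentiment` with a fitted curve over the last n actions, computing the deviation against the curve instead of a single value.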

4 Conclusion

In this work, we considered the implementation of an RL-based Recommender System, in the context of a real-life Wellness app. RL is a powerful primitive for such problems, as it allows the app to learn and adapt to user preferences / sentiment in real-time. However, during the case study, we realized that current RL frameworks lack certain constructs needed for them to be applied to such Recommender Systems. To overcome this limitation, we introduced three RL constructs that we had to implement for our Wellness app. The proposed RL constructs are fundamental in nature, as they impact the interplay between Reward and Policy functions; we hope that their addition to existing RL frameworks will lead to increased enterprise adoption.

