Real World Modeling: Joint Estimation & Optimization

As described in the previous post, one of Onera’s core product offerings involves a problem that, at a 10,000-foot level, can be modeled as a knapsack problem. Each item is a SKU that a retailer has the option of selling. Associated with each SKU is the revenue (or margin) the retailer may generate by offering it, along with some possible risk in offering the item for sale. This risk most typically arises from operational problems in selling the item, which can translate into negative customer satisfaction. The problem, of course, is to achieve maximum revenue while ensuring that customer satisfaction stays above a threshold.

In reality, these problems have several flavors, with models associated with them at each level. As you dig deeper into the problem, each layer reveals an interesting modeling challenge and problem-solving approach, much like peeling the layers of an onion. So this post dives a little deeper into the problem formulation above and the models underneath it, which are typical of algorithmic challenges in the real world. Specifically, we will talk about what we think of as Separate Estimation & Optimization (SEO) methods and Joint Estimation & Optimization (JEO) methods.

At a high level, there are at least three sub-problems:

  1. Given a set of items, a function that maps items to expected revenue, and a function that maps items to customer satisfaction, find a subset of items that maximizes overall revenue while ensuring that overall customer satisfaction stays above a defined threshold (one way to write this down is sketched after this list)
  2. Given historical data on items, the revenue generated by those items, and their customer satisfaction scores, learn functions that map items to expected revenue and items to expected customer satisfaction
  3. If there are items that are ‘new’, ‘unlearnable’ or lacking recent data, devise a scheme that ensures these items are still offered by the retailer, so that opportunities are not missed just because of historical behavior
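For concreteness, one way to write down Problem 1 (a sketch under assumptions, not necessarily the exact production formulation: $r_i$ stands for the expected revenue of SKU $i$, $c_i$ for its customer-dissatisfaction risk, and $B$ for the overall risk budget implied by the satisfaction threshold) is the classic 0-1 knapsack integer program:

$$\max_{x} \; \sum_i r_i x_i \quad \text{subject to} \quad \sum_i c_i x_i \le B, \qquad x_i \in \{0,1\},$$

where $x_i = 1$ means the retailer offers SKU $i$. The LP relaxation replaces $x_i \in \{0,1\}$ with $0 \le x_i \le 1$.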

Each problem above can be modeled as a classical problem in discrete algorithms or machine learning.

  • Modeling Problem 1: This can be solved as a classical mathematical programming problem, for example as an integer linear program (ILP). The history of the knapsack problem is as old as NP-completeness itself, and there is a wide array of literature on solving it in theory and in practice. As illustrated by the previous post, there are ways to solve the problem that provide theoretical approximation guarantees and are remarkably good in practice as well. Waving hands a little: although the problem is NP-complete, it is about as easy to solve as problems in the NP-complete class get. (A minimal sketch of the LP relaxation appears after this list.)
  • Modeling Problem 2: This is a standard problem in machine learning. Pick your favorite learning algorithm; in vogue at the moment are neural networks. You pick a bunch of features for each item, such as price, price status, brand, size, weight, and product ontology, and you use historical data to obtain demand (and revenue) predictions as well as customer satisfaction predictions.
  • Modeling Problem 3: This can be modeled as a multi-armed bandit problem, which was quite popular in the days of ads optimization: very little information is available about a new ad to determine its click-through rate (CTR), so you need to give it some air time to figure out crucial metrics. Each choice comes with a known objective-function cost and a variance in the estimate, and you trade off exploiting known good items against exploring unknown ones.
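To make Modeling Problem 1 concrete, here is a minimal sketch (an illustration, not Onera's production solver) of the classic greedy algorithm for the LP relaxation of the knapsack problem. The item values and the `budget` parameter are hypothetical, standing in for predicted revenue, predicted risk, and the budget implied by the satisfaction threshold:

```python
from typing import List

def fractional_knapsack(revenue: List[float], risk: List[float],
                        budget: float) -> List[float]:
    """Greedy solution to the LP relaxation of the knapsack problem.

    Items are taken in decreasing revenue-to-risk order (risk assumed > 0);
    at most one item ends up fractional, matching the structure noted in
    the previous post.
    """
    order = sorted(range(len(revenue)),
                   key=lambda i: revenue[i] / risk[i], reverse=True)
    fractions = [0.0] * len(revenue)
    remaining = budget
    for i in order:
        if remaining <= 0:
            break
        take = min(1.0, remaining / risk[i])
        fractions[i] = take
        remaining -= take * risk[i]
    return fractions

# Hypothetical example: three SKUs with predicted revenue and risk.
print(fractional_knapsack([5.0, 3.0, 8.0], [4.0, 2.0, 6.0], budget=7.0))
```

Rounding the single fractional variable down (or handling it separately) yields the usual approximation for the integral problem.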

The typical approach to solving the overall problem in practice would be as follows. You solve Problem 2 first: again, pick your favorite learning algorithm and historical data, and obtain the functions you believe fit best. Then you use those functions to solve Problem 1 with an Integer Linear Program (ILP) solver. Finally, you layer on a solution to Problem 3 by trading off between the ILP solution and the unoptimized, unknown items in a manner that gives you a good balance between exploitation and exploration (a small sketch of such an exploration layer follows).
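That exploration layer can be as simple as an epsilon-greedy rule that occasionally forces under-observed items into the assortment so that future estimates improve. This is a hedged illustration of that last step, not the actual scheme; `epsilon`, `min_obs`, and the variable names are hypothetical:

```python
import random

def add_exploration(selected: set, observation_counts: dict,
                    epsilon: float = 0.1, min_obs: int = 30) -> set:
    """Epsilon-greedy layer on top of the optimized assortment.

    With probability `epsilon`, an item lacking enough historical
    observations is forced into the assortment so that the next round
    of estimation has data for it.
    """
    explored = set(selected)
    for item, count in observation_counts.items():
        if count < min_obs and random.random() < epsilon:
            explored.add(item)
    return explored

# Hypothetical usage: the ILP picked SKUs 1 and 3; SKU 7 is new.
print(add_exploration({1, 3}, {1: 120, 3: 95, 7: 2}))
```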

We think of these as SEO solutions: Separate Estimation and Optimization. In these solutions, the parameter estimation is agnostic of the optimization. We can also think of solving these problems with what we call JEO solutions: Joint Estimation and Optimization. This requires re-thinking how you would model these problems.

Notice that a more statistical approach to solving Problem 3 would be to use semi-supervised learning, where the labels from Problem 2 are also used to generate data points and labels for Problem 3. So in practice it is possible to combine Problems 2 and 3 into a single estimation if you do not want to take the multi-armed bandit approach, but that would still leave you with exactly the same SEO-type overall solution. (A sketch of this pseudo-labeling step follows.)
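As a sketch of that combination (an illustration under assumptions, not the exact procedure): fit the Problem 2 model on items with observed outcomes, use its predictions as pseudo-labels for the unobserved items, and refit. The estimator choice and the names `X_observed`, `y_observed`, and `X_unobserved` are hypothetical:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def pseudo_label(X_observed: np.ndarray, y_observed: np.ndarray,
                 X_unobserved: np.ndarray) -> GradientBoostingRegressor:
    """One step of pseudo-labeling: predict labels for unobserved items,
    then refit on the combined (real + synthetic) data set."""
    model = GradientBoostingRegressor()
    model.fit(X_observed, y_observed)
    y_synthetic = model.predict(X_unobserved)   # synthetic labels
    X_all = np.vstack([X_observed, X_unobserved])
    y_all = np.concatenate([y_observed, y_synthetic])
    model.fit(X_all, y_all)                     # refit on everything
    return model

# Hypothetical usage with random features for 50 observed and 10 new items.
rng = np.random.default_rng(0)
model = pseudo_label(rng.normal(size=(50, 4)), rng.normal(size=50),
                     rng.normal(size=(10, 4)))
```

As discussed below, the synthetic labels carry larger errors than the real ones, which is exactly the cost that JEO tries to manage.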

To come up with a JEO solution, we do the following. We begin as before with a solution to Problem 2, using neural-network layers to estimate the revenue and customer satisfaction functions. Now we add a new layer into the neural network to perform the optimization part. Instead of an out-of-the-box activation function such as sigmoid or tanh, which just does a quick evaluation of its input, this next layer solves a constrained optimization problem. It provides estimates of which items a retailer should carry, given the estimates from the previous layers of the demand (revenue) and customer satisfaction functions. To be specific, the forward pass through the neural network finds exactly the same solution as the LP relaxation of the knapsack problem, and, thanks to the previous article, that solution has at most one fractional variable.

In parallel, another neural-network layer estimates functions that can provide synthetic data points and labels for observations that have not been made; in essence, this layer solves one step of semi-supervised learning for Problem 3. The loss function finally combines the outputs of the previous layers to estimate total revenue and to ensure that total customer satisfaction is above the threshold. If it turns out that customer satisfaction is below the threshold, we re-apply the forward pass and continue to do so until we find a solution that is both feasible and optimized at least for a single step. We then, as always, compute the gradient of the loss function: the change in revenue from carrying infinitesimally more (or less) of each item. These gradients are back-propagated in the backward pass to learn updated coefficients. Needless to say, the loss function and the functions used in the forward pass can be shown to be continuous and differentiable. (A minimal sketch of such an optimization layer follows.)
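Below is a minimal sketch of what an optimization layer of this flavor could look like, assuming PyTorch; it is an illustration only, not the actual implementation. The forward pass solves the LP relaxation of the knapsack greedily, and the loss back-propagates through the selected fractions into the upstream revenue estimates. Treating the greedy ordering as constant during backpropagation, and the specific budget and item values, are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class KnapsackLPLayer(nn.Module):
    """Forward pass = greedy solution of the knapsack LP relaxation.

    Gradients flow through the revenue-times-fraction products in the loss;
    the greedy sort order itself is treated as a constant (a simplifying
    assumption of this sketch).
    """
    def __init__(self, budget: float):
        super().__init__()
        self.budget = budget

    def forward(self, revenue: torch.Tensor, risk: torch.Tensor) -> torch.Tensor:
        ratio = revenue / risk.clamp_min(1e-9)
        order = torch.argsort(ratio, descending=True)
        fractions = torch.zeros_like(revenue)
        remaining = torch.tensor(self.budget)
        for i in order:
            take = torch.clamp(remaining / risk[i], max=1.0)
            fractions[i] = take                  # at most one fractional item
            remaining = remaining - take * risk[i]
            if remaining <= 0:
                break
        return fractions

# Hypothetical usage: upstream layers would produce these estimates.
revenue = torch.tensor([5.0, 3.0, 8.0], requires_grad=True)
risk = torch.tensor([4.0, 2.0, 6.0])
fractions = KnapsackLPLayer(budget=7.0)(revenue, risk)
loss = -(fractions * revenue).sum()   # maximize estimated total revenue
loss.backward()                       # gradients reach the revenue estimates
print(revenue.grad)
```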

We can now carry out a second epoch through the neural network and continue running it to our stopping criterion. Now, why would someone want to re-architect a neural network to do this? The issue with standard neural networks is that simple functions can be defined and evaluated very well, but once you throw in constrained optimization, out-of-the-box neural networks are practically out the window. This is not a completely new observation; a paper by Amos and Kolter shows something similar. If you think the solution to Problem 2 is perfect, then JEO methods may not perform much better than SEO methods. However, semi-supervised learning is, needless to say, not magic. Adding extra data points and labels comes with a cost: while the hope is that the method converges to a true underlying distribution that may have been truncated (as is the case with us), the more labels that are added on the synthetic side, the larger the errors compared to supervised learning. These synthetic points can then hurt the distribution of the labelled points. The whole idea of semi-supervised learning is that overall it is more helpful than harmful, but JEO methods try to exploit more of the helpful side while sidestepping as much of the noise as possible.

The idea with JEO is that you only add synthetic points on a need-to-know basis. As the optimization explores areas of interest, you add synthetic points there and let the estimates converge to new distributions. If there are areas that do not need to be simulated, then you do not generate synthetic data there. The optimization therefore guides the parameter estimation, and, as always, the parameter estimation guides the optimization; hence: joint estimation and optimization.

This is joint work with Jeremy Karp and R. Ravi of Carnegie Mellon.

Hey Srinath! Are you describing an actor critic architecture, where the actor has e.g. an OptNet layer?
