Streaming Machine Learning
This article builds on a few of the previous write-ups. It stands on its own and is suitable for a general audience, but you may want to review the article on building simple algorithms as a practical solution to many real-world problems, the article on one of the problems that Onera tackles, and the last write-up on control systems. Read on even if you just skim the details; it will make sense, especially the problem, the streaming learning, the correctness properties, the product knob, and the extensions. It is simple, and in fact that's the whole point.
Problem. We quickly recap the relevant problem that Onera tackles for its retail clients. For each product that a retailer sells, there is an associated customer satisfaction with that product. In this article we use a binary score: each customer who buys a product reports 0 if they are unsatisfied with the purchase and 1 if they are satisfied.
Also, associated with each product that the retailer sells is a control knob that can be tweaked. At one end of the knob (say 0), the retailer earns higher revenues or margins but delivers lower customer satisfaction. As the retailer turns the knob up, revenues or margins drop but customer satisfaction rises. There are several examples of such knobs: the price at which the retailer sells the product, the speed at which the retailer delivers it, how aggressive the promised delivery date is, or how many items are even made available to sell when fulfillment confidence is low. It may help to pause and think through the control knob and the resulting customer satisfaction for each of these examples.
The Control System. Now we will build the absolute simplest pure-software control system, borrowing ideas from real-time control systems. The control knob (like the control surface of the Saturn V or the rotors on a drone) can be moved in one direction or the other, and the response to that movement is the customer satisfaction (analogous to the position of the rocket or the drone). The goal is to maintain the customer satisfaction at a target average rate t, just as the goal is to keep the Saturn V on trajectory or the drone hovering at a point in space by adjusting the controls.
At any point in time, the real (background) customer satisfaction score for each product is between 0 and 1, and the real response function f(x), as the control knob x is rotated, can be modeled as a logistic function f(x) = 1 / (1 + exp(-mx + c)) with m > 0. Formally, the goal is to find x such that f(x) = t, and set the control knob to that x.
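For concreteness, here is a minimal sketch of this response model and its closed-form inverse (the slope m and offset c are illustrative placeholders, not values from the article):

```python
import math

def f(x, m=1.0, c=0.0):
    """Logistic response: customer satisfaction rate as a function of knob x (m > 0)."""
    return 1.0 / (1.0 + math.exp(-m * x + c))

def f_inverse(t, m=1.0, c=0.0):
    """Closed-form solution of f(x) = t for 0 < t < 1."""
    return (c - math.log(1.0 / t - 1.0)) / m
```

If f were known, the whole control problem would reduce to setting the knob to `f_inverse(t)`.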
Needless to say, if the function f is known, this is a simple task: set x = f^-1(t). However, the first problem is that f is unknown; the only feedback we get is each customer's binary score as the retailer sells items. The second, and more realistic, problem is that f can in fact change over time, at any time, behind the scenes. For example, prices need to stay competitive with other retailers' prices, and target delivery date expectations vary wildly with the season and even over the lifetime of a product. This is very similar to winds changing at any moment, with the exact rotor compensation for the drone unknown and impossible to hard-code. Note that the guidance system, as explained in the previous write-up, is attempting to estimate f, and the navigation system is determining the next x_(n+1) to set so that f(x_(n+1)) = t, as shown on pages 5-10 to 5-12 of the Apollo document.
One approach is to estimate f(x0) at time 0. You do this by setting the control to x0 for a period of time and averaging the binary satisfaction scores that customers report. Then you set the control to x1 (theoretically any x1 != x0 works) and estimate f(x1) by averaging customer satisfaction over the next time period. Using these points, you fit f, compute its inverse, and set x = f^-1(t). As you can see toward the end of page 5-11 and on page 5-13, this is very similar to the first approach considered in the Apollo guidance system. They create a matrix of rank 6 (3 variables for 3-d coordinates and 3 variables for 3-d velocities) and invert the matrix to solve for their guidance function. They end up with 6 equations where we have 2, but the idea is the same. They throw that approach out for two standard reasons: finding the inverse can be computationally intensive (no longer the case for these sizes on normal computers), and there may be numerical instabilities. Note that we need to pick x0 and x1 that are reasonable and sufficiently different to avoid near-linear dependence, we need to pick an averaging period long enough for statistical significance, and it is very sloppy to set these to arbitrary values just to estimate the function f.
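Since logit(f(x)) = mx - c is linear in x under the logistic model, two estimated points pin down the line. Here is a hedged sketch of this two-point estimate-and-invert approach (the function names are mine, not from the article):

```python
import math

def logit(p):
    """Inverse of the standard logistic function, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def two_point_invert(x0, p0, x1, p1, t):
    """Fit logit(f(x)) = m*x - c through the two averaged observations
    (x0, p0) and (x1, p1), then solve f(x) = t for x.
    Requires x0 != x1 and p0, p1, t strictly between 0 and 1."""
    m = (logit(p1) - logit(p0)) / (x1 - x0)
    c = m * x0 - logit(p0)
    return (logit(t) + c) / m
```

With noisy averages p0 and p1, the recovered x inherits that noise, which is exactly the significance concern raised above.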
Control System Correctness. Before going further, let us define what it means to build a well-behaved control system. The first condition is that if we are stationary, we remain stationary; the drone does not drop out of the air for no reason when there is no wind. The second condition is that if we are off-course, we provably improve. If we satisfy these two conditions, our control system is correct.
Streaming Machine Learning. We now build a system that satisfies the above properties using our streaming approach. Specifically, given the feedback of the last customer who walked out with the product, update the control x slightly based JUST on whether that person was satisfied or dissatisfied. If they are satisfied, we set x to x - a. If they are unsatisfied, we set x to x + b. The question is: what should a and b be?
- To satisfy the first condition, that we stay in position when there is no change to f, we can do the following. We know that at the previous time f(x) = t. We now want to show that at the next time period, with update x', we stay in position, that is E[f(x')] = t. This is equivalent to saying that we want E[x'] = x. Since a customer is satisfied with probability f(x) = t, that is, t(x - a) + (1 - t)(x + b) = x, which reduces to at = b(1 - t).
- To satisfy the second condition, that if we are off-course we improve our position, we show one side of the inequality; the other side is similar. Say currently f(x) < t; then we show that E[f(x')] > f(x). Since f is monotonically increasing, this is equivalent to showing that E[x'] > x. So we want f(x)(x - a) + (1 - f(x))(x + b) > x, that is, a < b (1 - f(x)) / f(x).
Since f(x) < t, it is easy to verify that (1 - t) / t < (1 - f(x)) / f(x), and therefore if we set a = b (1 - t) / t, we satisfy the inequality for the second condition as well as the equality needed for the first condition.
Under-constrained System. We have infinitely many values of a and b that all produce a correct control system: one that stays in position when there is no wind and returns to position when there is. For simplicity, introduce a free variable y and set a = y and b = yt / (1 - t). Once y is fixed, both a and b are fixed. For example, y = 1 gives a = 1 and b = t / (1 - t), a perfectly valid control system that satisfies both control properties above. Setting y = 10 gives a = 10 and b = 10t / (1 - t).
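Putting this together, the streaming controller reduces to a few lines. A minimal simulation sketch follows; the logistic parameters m and c and the choice of y are illustrative, and the background f is hidden from the controller, which only ever sees each binary score:

```python
import math
import random

def streaming_control(x, t, y, n_customers, m=1.0, c=0.0, seed=0):
    """Nudge the knob after each customer's binary report:
    x -> x - a on a satisfied customer, x -> x + b on an unsatisfied one,
    with a = y and b = y * t / (1 - t)."""
    rng = random.Random(seed)
    a = y
    b = y * t / (1.0 - t)
    for _ in range(n_customers):
        # Hidden background response: the customer is satisfied with probability f(x).
        satisfied = rng.random() < 1.0 / (1.0 + math.exp(-m * x + c))
        x = x - a if satisfied else x + b
    return x
```

With a small y, x drifts toward f^-1(t) and then hovers there; for t = 0.8, m = 1, c = 0 that fixed point is ln(4), roughly 1.386.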
The Product Knob. When you have such a free variable, it gives you a product knob. How does changing y change the system? If y is large, the system is very quick to adjust to changes in f. If there is serious wind, the drone takes aggressive control actions to quickly restore its state; however, it is wobbly even with little to no wind. If y is small, the drone stays stable in still conditions but is slower to recover when wind blows it off-course.
For example, if you are building the fly-by-wire system on the F-16, you may want to expose such a product knob to the pilot. You could argue that the pilot is in the best position to judge whether the plane is currently in a dog-fight or trying to shake off a surface-to-air missile, in which case they turn the knob all the way to the aggressive side, since they want the plane to respond aggressively. On the other hand, if the plane is cruising or trying to fly straight as fast as possible, maybe they turn the knob all the way to the other side.
Extensions
- Model-less: First, note that our streaming learning algorithm above did not even use the fact that the response function is sigmoidal. In fact, the method is nearly model-less: it works as long as the response function is any monotonically increasing function. It is extremely intuitive and insanely simple (arguably the simplest algorithm you can build to solve this problem), while still provably accurate.
- Elimination of Product Knob: It is arguable that the control system should not have a free variable. For example, a drone (though it is built only for stability) does not expose such a variable (RC planes that do stunts may). One way to remove the free variable is to use a batch of data points to determine how far your estimate of f() is from reality, and then make the update step proportional to that difference. If conditions change, your estimate f' drifts from the background f and you switch to a more aggressive mode; if f' stays close to f, you stay in a more conservative mode. If all this sounds familiar, it should: you are getting into ideas similar to gradient descent.
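As a hedged sketch of what replacing the knob could look like, here is one way to derive the step scale from a recent batch; the gain and the clipping bounds are made-up illustrations, not values from the article:

```python
def adaptive_step(recent_scores, t, y_min=0.01, y_max=1.0, gain=5.0):
    """Pick the step scale y from a batch of recent binary scores:
    the further the batch average is from the target t, the more
    aggressive the update, clipped to [y_min, y_max]."""
    error = abs(sum(recent_scores) / len(recent_scores) - t)
    return min(y_max, max(y_min, gain * error))
```

The batch average here is the crude estimate of f at the current knob setting, so the step is proportional to the estimate-versus-reality gap described above.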
- Multi-Product: All of the above concerns a single product, but products may be similar to each other, for example all gift cards, or all scarves. One way to extend this method to the multi-product case is to set up a similarity function s(p1, p2) that measures the similarity between two products, with s(p1, p1) = 1 for identical products and s(p1, p2) = 0 for two completely different products. One way to build such a function is to model the distance between the products in the ontology; an even better one is to use the correlation of their past customer satisfaction. The update for every product p2, on feedback for product p1, then becomes a = y s(p1, p2) and b = y s(p1, p2) t / (1 - t). If the similarity s() is 0 across all distinct products, we get back our old single-product system; as s() grows larger, a single customer's satisfaction score updates more and more products.
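A minimal sketch of that multi-product update (the similarity function is passed in, and all names here are mine, not from the article):

```python
def multi_product_update(knobs, sim, reported, satisfied, t, y):
    """Apply one customer's binary score for product `reported` to every
    product's knob, weighted by the similarity sim(reported, p) in [0, 1]."""
    for p in knobs:
        a = y * sim(reported, p)          # per-product step down on satisfied
        b = a * t / (1.0 - t)             # per-product step up on unsatisfied
        knobs[p] += -a if satisfied else b
    return knobs
```

With a sim() that returns 1 only for the identical product and 0 otherwise, this reduces exactly to the single-product update.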
There are more intricacies and extensions beyond the scope of this article but the primary purpose was to demonstrate two points.
- How amazingly, almost unbelievably, powerful a single update step can be: it looks at one data point at a time in production, needs zero training data, and has provably accurate performance.
- It is a neat and realistic model of how to build a simple (arguably the simplest) control system in pure software.
While we do value the KISS principle and try to keep software and algorithms as simple as possible, we do note that our production systems do take a different and more complicated approach for reasons beyond the scope of this article.