Types of bandits
In my previous post, I tried to give an introduction to the formulation of a multi-armed bandit problem through a fictional story of a gambling-addicted statistician. These problems fall under the umbrella of 'stochastic scheduling', which deals with scheduling several jobs whose attributes, such as processing times or due dates, are stochastic (random) in nature. In the case of bandit machines, the rewards are random but are assumed to be drawn from a well-defined probability distribution, which makes this a stochastic bandit machine problem.
Also, the problem we discussed is a basic scenario with two primary assumptions:
1) we do not have prior information about the reward distributions for any of these machines.
2) rewards from each of these machines are independent and identically distributed (IID).
Independent in the sense that the reward from a machine in a particular round is not influenced by the rewards from the same machine in previous rounds or by the rewards dispensed by other machines. Identically distributed in the sense that, despite the random nature of a machine's rewards, they follow a probability distribution such as a Bernoulli or normal distribution, and the parameters of that distribution (the success rate for a Bernoulli distribution, the mean and standard deviation for a normal distribution) are assumed to remain constant for a machine throughout the experiment.
To understand the above assumptions with an example, let us assume that among the five bandit machines there is one with a 60% probability of dispensing a reward. The first assumption says that we have no clue about this probability, and the second assumption says that this machine will dispense rewards from the same probability distribution (in this case a Bernoulli distribution, since the outcome is either a reward or no reward) with a constant success rate of 60% for the entire duration of the experiment.
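A minimal sketch of this IID setting is shown below. The success rates are hypothetical (only the 0.6 arm mirrors the 60% machine from the example), and they stay fixed for the whole run, which is exactly what the identically-distributed assumption demands.

```python
import random

# Hidden, constant success rates of the five machines (unknown to the player).
TRUE_SUCCESS_RATES = [0.35, 0.45, 0.60, 0.25, 0.50]

def pull(machine: int) -> int:
    """Return 1 (reward) or 0 (no reward) from the chosen machine."""
    return 1 if random.random() < TRUE_SUCCESS_RATES[machine] else 0

# Pulling the same machine repeatedly draws from the same fixed distribution,
# so the empirical average converges toward its true success rate.
rewards = [pull(2) for _ in range(10_000)]
print(sum(rewards) / len(rewards))  # roughly 0.60
```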
Another implicit assumption of this problem is that once a machine is operated in any round of the experiment, we receive feedback only from that machine; the rewards we might have received had we chosen the other machines in that round remain unknown. Tweaking some of these assumptions leads to a few different variants of this basic multi-armed bandit problem. Let us take a brief look at them.
If we tweak the first assumption such that we have some information about the machine before we start experimenting on it, we get a 'contextual bandit' problem. One of the most commonly cited examples of this scenario is a news website that has to choose which of several news articles' headlines to display to a reader visiting the site. The reward in this scenario is the reader clicking on the displayed headline to read the full article. If the user is not a first-time reader, there is a very good chance that the website has some prior information about them, such as their location, interests, and previously viewed articles, collected through cookies from previous sessions.
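The sketch below illustrates the shape of this interaction loop under simple assumptions: the context is reduced to a single interest tag read from cookies, the article names are made up, and the policy is a plain epsilon-greedy rule kept separately per context rather than anything the post prescribes.

```python
import random
from collections import defaultdict

ARTICLES = ["politics", "sports", "tech"]   # hypothetical candidate headlines
clicks = defaultdict(int)                   # (context, article) -> click count
shows = defaultdict(int)                    # (context, article) -> display count

def choose_article(context: str, epsilon: float = 0.1) -> str:
    # Explore occasionally; otherwise exploit the article with the best
    # observed click rate for this particular context.
    if random.random() < epsilon:
        return random.choice(ARTICLES)

    def click_rate(article: str) -> float:
        shown = shows[(context, article)]
        return clicks[(context, article)] / shown if shown else 0.0

    return max(ARTICLES, key=click_rate)

def record_feedback(context: str, article: str, clicked: bool) -> None:
    shows[(context, article)] += 1
    clicks[(context, article)] += int(clicked)

# One round: observe the user's context, display a headline, record the outcome.
article = choose_article(context="sports_fan")
record_feedback("sports_fan", article, clicked=True)
```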
Next, we can modify the assumption that we only receive feedback from the machine we chose to operate in a particular round. This basic setting is termed 'bandit feedback', and two other variants, partial feedback and full feedback, build on it. To understand partial feedback, consider an eCommerce website that tries to sell the same product to different customers at different prices (admittedly, a totally absurd example). The different prices displayed to customers can be thought of as different machines, and the website wants to make each sale at the highest price possible to maximize its profit. Suppose the majority of users did not buy the product at the highest price, but a certain section of users bought it when it was pitched to them at a lower price; then the website has 'partial' feedback at the end of each round. It knows something about the higher prices that did not work, the lower prices that did, and the characteristics of the users who bought the product, but it does not know the exact price each customer was willing to pay in the first place. A customer might have happily bought the item at a price somewhat higher than the lower price that was eventually displayed to them, which makes this partial, but not full, feedback.
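One way to picture this partial feedback is sketched below. It rests on an assumption not spelled out in the example: a customer who buys at a given price would also have bought at any lower price, and one who rejects a price would also have rejected anything more expensive. The price grid is hypothetical.

```python
PRICES = [10, 20, 30, 40, 50]  # the "machines" in this pricing setting

def inferred_feedback(offered, sold):
    """Map each price to True/False if its outcome can be inferred from this
    round, or None if it remains unknown (hence only partial feedback)."""
    feedback = {}
    for p in PRICES:
        if sold and p <= offered:
            feedback[p] = True       # a buyer at `offered` would also buy cheaper
        elif not sold and p >= offered:
            feedback[p] = False      # a rejection at `offered` rules out dearer prices
        else:
            feedback[p] = None       # genuinely unobserved this round
    return feedback

print(inferred_feedback(offered=40, sold=False))
# {10: None, 20: None, 30: None, 40: False, 50: False}
```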
And next, as you might have guessed by now, a full feedback scenario occurs when we have complete information about the rewards dispensed by all the machines in each round of the experiment, even though we get to choose only one of them. An example would be a multi-armed bandit problem formulated in the financial sector for diversifying an investment portfolio to maximize returns. Suppose there is a set of stocks from which we must choose one to invest in each day as part of the experiment; then at the end of each round, i.e., at the end of each day, we have complete information about the closing price of every stock, even though we invested in only one of them.
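A minimal sketch of the full-feedback setting follows, with made-up stock tickers and randomly simulated daily returns: we play one arm, yet the round's feedback arrives for every arm.

```python
import random

STOCKS = ["AAA", "BBB", "CCC"]  # hypothetical tickers

def closing_returns() -> dict:
    """Daily return of every stock, visible at the close regardless of our choice."""
    return {s: random.gauss(0.0, 1.0) for s in STOCKS}

chosen = "BBB"                   # the single arm we actually invested in
all_returns = closing_returns()  # full feedback: rewards of all arms are revealed
our_reward = all_returns[chosen]
print(all_returns, our_reward)
```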
Three more variants of the multi-armed bandit problem can be obtained by adjusting the assumption of IID rewards. Under this assumption, the probability distribution of the rewards dispensed by a bandit machine is supposed to remain constant, but what if this distribution changes over time? Consider the scenario where the problem is formulated in the manufacturing sector to maximize the output of a set of machines. In practice, as we keep exploiting the best machine to maximize the reward, it may wear out and its output may slowly diminish, which is analogous to a bandit machine's success rate decreasing steadily after a certain number of rounds. This scenario is called a 'restless bandit' problem.
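A minimal sketch of such a restless arm is given below; the starting success rate and the decay factor are illustrative numbers chosen only to show the distribution drifting between rounds.

```python
import random

class RestlessMachine:
    """An arm whose success rate decays a little every time it is used."""

    def __init__(self, success_rate: float = 0.6, decay: float = 0.999):
        self.success_rate = success_rate
        self.decay = decay

    def pull(self) -> int:
        reward = 1 if random.random() < self.success_rate else 0
        self.success_rate *= self.decay  # the reward distribution is no longer fixed
        return reward

machine = RestlessMachine()
rewards = [machine.pull() for _ in range(5_000)]
print(machine.success_rate)  # noticeably below the initial 0.6 after heavy use
```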
Till now, we have looked at a machine's distribution fluctuating due to natural causes, but what if the casino owner, having realized that you are exploiting a machine, hides behind it and starts tampering with its distribution intentionally through some internal setting to ensure that you do not get a reward in that round? This is called an 'adversarial' bandit problem, where in addition to cracking the code of the bandit machines, we also need to deal with an adversary trying to thwart our attempts. If the casino owner feels that too much tinkering with the settings might make us suspicious and decides to limit his meddling to only a certain proportion of rounds, then we still have an adversary, but one who is constrained from plotting against us to the full extent; this type of problem is called a 'constrained-adversarial' bandit problem.
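The sketch below illustrates one simple reading of these two settings: an adversary who can zero out the reward of the chosen arm, and a constrained variant where that interference is limited by a budget. The success rates and the budget are made-up numbers for illustration.

```python
import random

TRUE_SUCCESS_RATES = [0.35, 0.45, 0.60, 0.25, 0.50]

def play_round(machine: int, tamper_budget_left: int):
    """Return (reward, remaining budget) after the adversary's intervention."""
    reward = 1 if random.random() < TRUE_SUCCESS_RATES[machine] else 0
    # The adversary spoils a winning round only while its budget lasts
    # (a constrained adversary); an unconstrained one has no such limit.
    if tamper_budget_left > 0 and reward == 1:
        return 0, tamper_budget_left - 1
    return reward, tamper_budget_left

budget = 100  # the adversary may interfere in at most 100 rounds
total = 0
for _ in range(1_000):
    reward, budget = play_round(machine=2, tamper_budget_left=budget)
    total += reward
print(total)
```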
All these alterations of the original bandit problem have led to applications in many diverse fields, which makes us appreciate the simple mathematical formulation of this seemingly crazy problem.