Multi-Armed Bandits: Thompson Sampling Algorithm with Python Code
Learn about the Thompson Sampling (Bayesian Bandits) algorithm.
Introduction
In this series of posts, we experiment with different bandit algorithms to optimise our movie nights: more specifically, how we select movies to watch and restaurants to order food from!
For newcomers, the name bandit comes from slot machines (known as one-armed bandits). You can think of a bandit as something that can reward you (or not) every time you interact with it (pull its arm). The objective, given a set of bandits that give different rewards, is to identify the one that gives the highest rewards, as fast as possible. As we start playing and continuously collect data about each bandit, the bandit algorithm helps us choose between exploiting the one that has given us the highest rewards so far and exploring the others.
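To make the setup concrete, here is a minimal sketch of a Bernoulli bandit, i.e. one that pays a reward of 1 with some hidden probability and 0 otherwise. The class name and the win probabilities are illustrative assumptions, not values from this series:

```python
import random

class BernoulliBandit:
    """One slot machine: pays 1 with a fixed, hidden probability, else 0."""

    def __init__(self, p):
        self.p = p  # true win probability, unknown to the player

    def pull(self):
        # Simulate one arm pull: reward 1 with probability p, otherwise 0.
        return 1 if random.random() < self.p else 0

# Three bandits with different hidden win rates; the player's goal is to
# discover the best one (here, the third) with as few wasted pulls as possible.
bandits = [BernoulliBandit(p) for p in (0.2, 0.5, 0.75)]
```

A bandit algorithm only sees the 0/1 rewards coming back from `pull()`; it never gets to look at `p` directly, which is exactly why it must balance exploration and exploitation.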
You and your friend have been using bandit algorithms to optimise which restaurants and movies to choose for your weekly movie night. So far, you have tried different bandit algorithms like Epsilon-Greedy, Optimistic Initial Values and Upper Confidence Bounds (UCB). You’ve found the UCB1-Tuned algorithm to work slightly better than the rest, for both Bernoulli and Normal rewards, and have ended up using it for the last few months.
Even though your movie nights have been going great with the choices made by UCB1-Tuned, you miss the thrill of trying a new algorithm out.
“Have you heard of Thompson Sampling?” your friend asks.
Excited, you pick up your phone and start reading about it.