zettelkasten

Search IconIcon to open search
Dark ModeDark Mode

The Explore Exploit Problem

The exploit/explore problem is a very straightforward problem: a problem of how to manage the tradeoff between exploring new things and exploiting old discoveries in an optimal way. The problem is simple and might have mathematical solutions under certain assumptions, but things get very complicated in many real-world situations.

Exploring gives a sense of novelty and might lead to discoveries of better things to exploit, yet exploiting existing information allows one to enjoy previous discoveries rather than exploring the unknown that could be worse off. There seems to be some balance point, therefore, between explore and exploit. As a result, the problem is finding out where the balance is.

It can be helpful to set up some axioms for approaching this problem:

  1. The rewards from exploration accrue over time as they are being exploited
  2. The certainty of information about an entity explored increases the more it’s explored
  3. The value of exploration and exploitation are dependent on the quota of exploration and exploitation

Some examples of explore/exploit problems

  • Whether and when to explore xor exploit restaurants in a given city
  • The pharmaceutical industry price control problem (R&D budget and drug price tradeoff)
  • Which slot machine to play

One approach to solving the problem is by using the Gittins index, a number that is assigned to every option at a point in time. The Gittins index changes depending on the information gathered from previous explorations.

The math for that method, however, is too complicated to be useful, so other heuristics are conceived.

Another approach to this problem uses a regret-minimisation framework. It is considered optimal if we do not regret exploring too much or too little, which makes sense. The Upper Confidence Bound algorithm does this. Using some statistics from previous trials, we can get a confidence interval of the value of the options; simply pick the one with the highest upper bound.

In practice, companies usually use B tests to explore and exploit at the same time.

Marvin Zelen’s adaptive trials algorithm can potentially speed up clinical trials and save lives compared to standard exploration.


References:

  1. Christian, Brian, and Tom Griffiths. Algorithms to Live by: The Computer Science of Human Decisions. First U.S. Edition, Henry Holt and Company, 2016.
  2. https://en.wikipedia.org/wiki/Gittins_index