# Bayesian Search, Gaussian Process beliefs, NFL

## The Bayesian Search Game

As an introduction to the idea of Bayesian search, please try the Bayesian Search Game (or download its source code). In this game you are to find the maxima of a smooth 1D function. Humans certainly have some kind of prior about where optima might lie, depending on assumptions about the function's smoothness, and we update this prior to a posterior when we observe new data from the function.

The same approach can be realized algorithmically. The 'opponent' algorithm in the game starts off with a flat Gaussian Process prior over the function (which is actually the correct prior from which the objective function was really sampled). When new data from the function is observed, the prior is updated to a posterior (this is what I mean by *Bayesian* search). This posterior precisely describes not only the expected mean of the unobserved function at any point but also the uncertainty (variance) about this expectation. In some sense (the Bayesian sense) this is optimal: the posterior reflects exactly what one can possibly have learned about the function given the few observations. The posterior is also a very suitable basis for heuristics to choose a new search point: for every point we can, e.g., check whether the expected mean is larger than our best-so-far. Or, also accounting for the uncertainty, we can explicitly compute the probability that the function value at some new point is larger than the best-so-far, or what the expected improvement is. (The improvement is defined as zero if the new function value is less than the best-so-far, and as the difference otherwise.) Maximizing such quantities leads to very efficient search heuristics.
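As a sketch of the two quantities just mentioned: under a Gaussian posterior they have simple closed forms. Assuming maximization and a posterior N(mu, sigma^2) at a candidate point (the function names and interface here are my own illustration), they can be computed from the standard normal pdf and cdf:

```python
import math

def probability_of_improvement(mu, sigma, best):
    """P(f(x) > best) when the posterior at x is N(mu, sigma^2)."""
    if sigma <= 0.0:
        return 1.0 if mu > best else 0.0
    z = (mu - best) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal cdf

def expected_improvement(mu, sigma, best):
    """E[max(f(x) - best, 0)] when the posterior at x is N(mu, sigma^2):
    EI = (mu - best) * Phi(z) + sigma * phi(z),  z = (mu - best) / sigma."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf
```

At a point where the posterior mean equals the best-so-far, the probability of improvement is exactly 1/2 and the expected improvement is sigma/sqrt(2*pi) -- it grows with the remaining uncertainty, which is precisely what makes these quantities useful acquisition heuristics.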

This approach is also closely related to the POMDP framework (Partially Observable Markov Decision Process). Here an agent needs to take actions in a world whose state it cannot fully observe. What it can do, though, is maintain a *belief* about what the state of the world might be (in the form of a distribution over states). Again, given new observations this belief can be updated in a Bayes-optimal way, and in this sense the belief is all the knowledge the agent could possibly have about the current state of the world. POMDP theory shows that optimal behaviors can be based on the belief (the optimal policy is a mapping from belief to action). I guess the analogy to search is clear: the unknown state of the world corresponds to the unknown objective function; the actions correspond to picking a new search point.
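To make the belief update concrete, here is a minimal sketch (my own illustration, using a discrete state space) of the Bayes-optimal update of a belief given the likelihood of an observation under each state:

```python
def bayes_update(belief, likelihood):
    """Posterior belief over states after an observation.

    belief:     dict mapping state -> prior probability
    likelihood: dict mapping state -> P(observation | state)
    """
    unnormalized = {s: belief[s] * likelihood[s] for s in belief}
    z = sum(unnormalized.values())          # P(observation)
    return {s: p / z for s, p in unnormalized.items()}
```

For example, a uniform belief over two states combined with observation likelihoods 0.9 and 0.1 yields the posterior (0.9, 0.1): the belief concentrates on the state that best explains what was observed, which is exactly the mechanism that lets a POMDP agent act on its belief alone.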

This analogy at the same time tells us what we would need for a hypothetical optimal search strategy: long-term reasoning about actions, i.e., choosing the current action not only for its immediate benefit but also for future benefits. In the POMDP case it is very intuitive that the optimal action is often one that gathers information, reduces the uncertainty in the belief, and pays off later. The same is actually true in the search case. An optimal searcher would have to reason whether search points should be positioned to maximize the immediate expected improvement or rather to learn as much as possible about the function to ensure later improvement. Formalizing all this is not the problem; realizing it is notoriously difficult. The computational cost of such lookahead planning of search points explodes already for a 2-step lookahead. What does seem sensible are heuristic approaches that trade off learning about the function against seeking high values -- in the end this is often also the more practical approach to Reinforcement Learning problems, rather than aiming for an optimal trade-off via the POMDP formalism. If you like, try to do some lookahead planning of search points yourself in the Bayesian Search Game! :-)
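One such heuristic trade-off can be sketched in a few lines. The surrogate below is deliberately crude (nearest-neighbour mean, uncertainty growing with distance to the data) rather than a real GP posterior, and the acquisition rule mu + beta * sigma is a simple upper-confidence-bound heuristic; both are my own illustrative choices, not the method described above:

```python
import math

def posterior(x, data, length_scale=0.2):
    """Crude stand-in for a GP posterior on [0, 1]: mean of the nearest
    observed point, uncertainty growing with the distance to it."""
    if not data:
        return 0.0, 1.0
    xn, yn = min(data, key=lambda p: abs(p[0] - x))
    sigma = 1.0 - math.exp(-abs(xn - x) / length_scale)
    return yn, sigma

def search(f, n_steps=20, beta=2.0):
    """Greedy 1-step search: always evaluate the grid point maximizing
    mu + beta * sigma; beta trades exploitation against exploration."""
    grid = [i / 200.0 for i in range(201)]
    data = []
    for _ in range(n_steps):
        def acquisition(x):
            mu, sigma = posterior(x, data)
            return mu + beta * sigma
        x = max(grid, key=acquisition)
        data.append((x, f(x)))
    return max(data, key=lambda p: p[1])    # best point found so far
```

With beta = 0 the search greedily re-samples near the incumbent; with large beta it keeps probing the most uncertain regions. A 2-step lookahead would already require optimizing over all combinations of first point, possible observed value, and second point, which is where the cost explodes.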

## NFL

Why do I mention No Free Lunch in this context? Roughly: if we test an algorithm over the set of all possible functions, then on average the algorithm is as good as random search. Formulated slightly differently, if we assume an 'unstructured prior' over the space of functions and average the algorithm's performance w.r.t. this prior, then the average equals random-search performance. It is actually not that difficult to define what 'unstructured prior' means such that this is a necessary and sufficient condition for No Free Lunch to hold (see Igel & Toussaint, 2004).
Formulated in this way, this nicely relates to the idea of Bayesian
search: If we start with an unstructured prior over the space of
functions, then the Bayesian update of this prior given an observation
is still completely unstructured on the yet unobserved points. That's
why we can't predict anything about yet unobserved points (no
generalization) and any algorithm is as uninformed as random search
when it comes to the yet unobserved sites. Conversely, NFL theorems can be read as saying that one should always have a *specific* class of functions in mind -- or a specific prior over functions -- when designing a search algorithm, because a generically good search algorithm doesn't exist. But then, if you have to start with a prior
over functions anyway, then the most natural thing to do is to update
this prior based on observations and maintain a proper belief over the
objective function. If this is computationally feasible it seems like
a waste not to do this. The belief, including the mean estimates and their uncertainties, should always be a valuable resource when it comes to deciding on new search points.
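The averaging argument behind NFL can be checked exhaustively for a tiny search space. The following sketch (my own toy construction, not from the cited papers) enumerates all functions from a 3-point domain to {0, 1} and shows that two different deterministic search orders have identical performance histograms when averaged over all functions:

```python
from collections import Counter
from itertools import product

domain = range(3)
functions = list(product((0, 1), repeat=len(domain)))   # all 8 functions -> {0, 1}

def best_after(order, f, k=2):
    """Best value found after evaluating the first k points of `order`."""
    return max(f[x] for x in order[:k])

hist_left = Counter(best_after((0, 1, 2), f) for f in functions)
hist_right = Counter(best_after((2, 1, 0), f) for f in functions)
# Both search orders reach value 1 on 6 of the 8 functions and 0 on 2:
assert hist_left == hist_right == Counter({1: 6, 0: 2})
```

Averaged over this uniform ('unstructured') prior, no evaluation order beats any other; only structure in the prior gives a search algorithm something to exploit.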

Since I like to think about NFL and search in terms of distributions, I settled for myself on a slightly different formulation of NFL with a rather simple proof. This is really just a matter of taste. In this formulation the emphasis is not on sets of functions or on closure under permutation; it is solely on distributions over functions. Please note that the notion 'a set of functions is closed under permutation' is very closely related to 'in a distribution over functions, the function values at any two points are mutually independent'. That is important to realize, and I think mutual independence is much more intuitive than closure under permutation. [I have only a first draft of this alternative formulation yet; proving the necessity of the condition is not that easy... contact me if you want to see the draft.]

## Literature

Central to this topic are:

- Jones, D., M. Schonlau, & W. Welch (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, 455-492.
- Gutmann, H. (1999). A radial basis function method for global optimization.
- Jones, D. R. (2001). A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization 21, 345-383.
- Poland, J. (2004). Explicit local models: Towards optimal optimization algorithms. Technical Report No. IDSIA-09-04.
- Ulmer, H., F. Streichert, & A. Zell (2003). Optimization by Gaussian processes assisted evolution strategies. In Selected Papers of the International Conference on Operations Research (OR 2003), pp. 435-442. Springer.

- Rasmussen, C. E. & C. K. I. Williams (2006). Gaussian Processes for Machine Learning. The MIT Press. ISBN 0-262-18253-X.
- Hansen, N. & A. Ostermeier (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9, 159-195.
- Igel, C. & M. Toussaint (2004). A No-Free-Lunch theorem for non-uniform distributions of target functions. Journal of Mathematical Modelling and Algorithms 3, 313-322.