Mar 8, 2010

Bandit Online Convex Optimization

The Setting

An abstraction that has proved very useful in Machine Learning and continues to attract the interest of researchers is the online convex optimization (OCO) framework. In OCO, the interaction between the learner and environment proceeds in rounds (let's say, for simplicity, that the number $T$ of rounds is fixed in adavnce). At round t,
  • Learner makes a move $x_t$
  • Adversary makes a move $f_t$
  • Learner suffers the "cost'' $f_t(x_t)$
The set of moves for the learner is some convex set $K \subset R^d$. The adversary's moves, on the other hand, are convex functions $f_t:K \to R$ in some class $\mathcal{F}$. The goal of the learner is to minimize its regret:

\[ \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x) \ .\]

As is clear from the name "adversary", we do not wish to assume anything about the way the functions $f_t$'s are generated. They are generated in a completely adversarial manner and the learner's task is to behave in a way such that regret is bounded no matter what the adversary does.

Readers of this blog have probably seen this quantity before. I will not try to justify why it is important to study this quantity here. The excellent book of Cesa-Bianchi and Lugosi is a good place to learn more about the motivation behind studying the regret.

The topic of this post is to discuss the "price of bandit information'' (borrowing the title of a paper by Dani et al.) in the OCO setting. In the description of OCO above, I was deliberately vague about what the learner gets to see at each round. After all, OCO is a kind of repeated game between the learner and the adversary and the information that the players receive is key to their being able to play well. In the "full information'' version of OCO, the learner gets to see the entire function $f_t$. In particular, it is OK if the learner's strategy involves accessing the gradient $\nabla f(x)$ or Hessian $\nabla^2f(x)$ of $f_t$ at any point $x \in K$. In the "bandit'' version of OCO, the learner is severely restricted: it only gets to see the cost $f_t(x_t)$. What is the price that the learner has to play for only having access to this restricted information?

The "bandit'' terminology comes from Statistics that has a long tradition of studying multi-armed bandit problems. The Statistics literature usually makes some stochastic assumptions about the environment. The adversarial version of multi-armed bandit problem was first studied by Auer, Cesa-Bianchi, Freund and Schapire. Although the "bandit'' terminology is widely used, people also use other descriptions like "partial feedback'' or "opaque feedback'' (e.g., see Kleinberg's thesis).

The full information OCO setting is relatively well understood. Using Zinkevich's online gradient descent algorithm one can get $O(\sqrt{T})$ regret rate that is known to be optimal. The bandit OCO problem, in contrast, is far from being completely understood. First of all, in the bandit setting, it is essential for the learner to randomize his moves. Therefore, we need to distinguish between oblivious and adaptive adversaries. An oblivious adversary is really just a sequence of functions. An adaptive adversary however changes its moves according to the play of the learner. For technical definitions, the reader can consult Kleinberg's thesis. For this post, we can assume an oblivious adversary. Even then, matching upper and lower bounds for the regret are not known.

Known results

The seminal papers on the bandit OCO problem are those of Flaxman, Kalai and McMahan and the almost concurrent work by Kleinberg. Both obtain $O(T^{3/4})$ expected regret against an adversary that plays bounded and Lipschitz continuous convex functions, i.e.

\[ f_t(x) \in [-M,M] \ ,\]
\[ |f_t(x) - f_t(y)| \le L\|x - y\|_2 \ .\]

Kleinberg's result requires more smoothness assumptions and only holds against oblivious adversaries. Flaxman et al.'s result actually holds against adaptive adversaries as well. To the best of my knowledge, there is no lower bound larger than the obvious $O(\sqrt{T})$ full-information lower bound. So, there's a gap between lower and upper bounds and closing it seems like an fundamental problem.

No further progress: lack of compelling applications?

At this point, I would like to point out that there has been essentially no progress on the general bandit OCO problem after the two seminal papers mentioned above. There was a lot of work on bandit online linear optimization (OLO), i.e. the adversary plays not just convex but linear functions. The surprising answer here was that the learner pays no price for bandit information as far the dependence on $T$ is concerned (of course, the dependence on other parameters, like the dimension $d$, does get worse). A long line of research culminated in the following two results: (i) it is possible to achieve $O(\sqrt{T})$ regret with high probability against an adaptive adversary using an inefficient algorithm, (ii) it is possible to achieve $O(\sqrt{T})$ expected regret against an oblivious adversary using an efficient algorithm (provided a self-concordant barrier for the set $K$ can be efficiently computed). These two results can be found here, and here. Except for the issue of obtaining a high probability result with an efficient algorithm (for partial progress, see the recent work of Abernethy and Rakhlin), the bandit OLO problem can be considered solved.

Why did later research consider the special case of bandit OLO and leave the bandit OCO problem untouched? Is it because bandit OLO has "real world'' applications such as network routing (online shortest paths is a special case of bandit OLO)? Does anyone know of a situation where the bandit OCO setting would make sense? A possible application of bandit OCO would be to repeated convex-concave games. Here, the learner tries to minimize a two-variable function $f(x,y)$ and the adversary tries to maximize it. The function $f(x,y)$ is convex in $x$ and concave in $y$. If we assume that the learner, having played $x_t$, does not get to see the move $y_t$ of the adversary but only the payoff $f(x_t,y_t)$, then this problem can be cast in the bandit OCO framework by setting $f_t(x) = f(x,y_t)$. Is there a "real world'' situation that corresponds to such a partial feedback convex-concave repeated game?

Some parting thoughts

It will be great to see some progress on the bandit OCO problem. Before I close this post, here are three observations about the problem.

First, can we prove, using a very high level argument, that bandit OCO ought to be no harder than bandit OLO? In the full information setting, such an argument is, in fact, possible and so we know that, a priori, the worst case regret in full-info OCO should be no larger than that in full-info OLO. I suspect that such a high level argument is not possible in the bandit setting. But it will be good to understand why the full-information argument breaks down in the bandit setting.

Second, existing bandit OLO algorithms fall roughly into two categories. The first kind of algorithm discretizes the body $K$ and then uses algorithms from Auer et al. treating the problem as a multi-armed bandit problem with a huge number of arms. The second category used projected gradient ideas more akin to Zinkevich's online gradient descent algorithm. Both kinds of algorithms use some technique for estimating the information (such as the gradient) that is needed to run the algorithm but is not revealed by the adversary. How do we tweak the estimation techniques in the bandit OLO papers to generalize them to bandit OCO? Is one kind of algorithm easier to tweak that the other?

Third, what is the connection of bandit OCO to deterministic and stochastic convex optimization? For example, there is the well-known first-order oracle model for studying the complexity of convex optimization. In this model, the optimization algorithm can query the oracle for the values of $f(x)$ and $\nabla f(x)$ at any point $x$ in the domain on the convex funcion $f$ that is being minimized. Are there results for the "zero-order'' oracle model where the algorithm can only query for $f(x)$? Are there other models considered in other fields (Statistics, Operations Research, Economics, etc.) that are related to bandit OCO?

1 comments:

Fares Hedayati said...

hi Ambuj
thanks for the post, I was wondering if you can explain convexication of online shortest path, is there any paper on this,
thanks
fares

Post a Comment