Jul 18, 2010

David Blackwell has died

(July 26, 2010) Update: Eran Shmaya links to a bunch of really nice posts on Blackwell, several of which I really enjoyed.
-------
On July 8th David Blackwell, professor emeritus at UC Berkeley's Dept. of Statistics, passed away. He was 91 years old.

On this blog, I have made my admiration for Prof. Blackwell pretty clear, devoting not only a post on Blackwell's Approachability Theorem, but also another post on the interesting story of his life and career in research. I'm particularly saddened that I never got a chance to meet him, especially so given that I've been a student at his university for four years.

For those interested, The New York Times has a really nice piece on Blackwell's life and accomplishments.

Jul 15, 2010

Markets, Equilibria and Regret Minimization

In a whirlwind post, Noam Nisan ties together a range of topics, including regret minimization, "socially concave" games, and market equilibria. It's quite a discussion and definitely worth a read.

Let me emphasize a subtle but important point that I think often gets lost. A correlated equilibrium (or a coarse correlated equilibrium) in a multiplayer game is most naturally defined via the notion of regret. A joint distribution on the strategy profile is a correlated equilibrium (or coarse correlated equilibrium) if each player has no internal regret (or, respectively, no external regret) when all jointly play according to this distribution. Hence, when all players are playing according to an internal (or external) regret minimizing strategy, the empirical distribution of the strategy profiles must, almost by definition, be a near-equilibrium, since each player has vanishing regret.

When it comes to computing Nash equilibria, however, the same trick doesn't work -- perhaps not surprising since this problem is known to be PPAD-complete in general. What distinguishes Nash equilibria is that each player must have no regret independent of their opponent's randomized actions; that is, a Nash equilibrium is a collection of uncorrelated distributions on each player's strategy. However, Noam points out that sometimes we get lucky and can achieve Nash equilibria via no-regret learning:
Just the fact that we have called what is reached (in the sense of average over time) an equilibrium does not mean that we have to like it from a game theoretic sense. Could we get a “normal” equilibrium from such corse equilibrium? Even-Dar, Mansour, and Nadav point out a family of games, termed socially concave games, where this is the case.

... And We're Back!

Posting has slowed to a crawl over the past few months. My apologies for this -- it's been a crazy period with deadlines, conferences and travel.

But the summer is in full swing, and research is blossoming. There is no shortage of topics I would like to discuss, and I have spoken to many who would like to get a post of their chest. I'll be lining up authors and topics, and hence posting shall commence anew. If any readers would like to contribute, please let me know.

May 24, 2010

Prediction Markets and Learning

Jenn Wortman Vaughan recently visited us here at Berkeley, and she presented some interesting work. In short, she and several coauthors found an interesting connection between the design of prediction markets, a mechanism for incentivizing individuals to reveal their beliefs about the outcome of a future event, and design of online learning algorithms.

Prediction markets have gotten quite popular in recent years, particularly because research has shown that, at least in many circumstances, they are quite accurate at predicting future events. The most well-known example of such "betting markets" is Intrade, where bettors can buy and sell contracts on a range of different events, such as "Barack Obama will win the US Presidential Election in 2012". Two bettors split the cost of a $100 contract and, after the outcome is revealed, this contract pays off $100 to the correct party. How the parties share the cost of the contract is determined by a market, and a bettor willing to pay $60 for the "Obama in 2012" contract presumably believes that Obama has at least a 60% chance of winning, since otherwise this contract would have negative expected payout.

It has been noted frequently that, in the 2004 US presidential election, the contract prices on Intrade correctly predicted the outcome of all 50 state contests. In fact, these markets were considered so accurate that DARPA wanted to try use them to help inform policy makers, setting up what was to be called the Policy Analysis Market. DARPA apparently wanted to use the markets to predict, among many others, future political events in the Middle East. The news headline, of course, read "Pentagon Web Site Promotes Betting on Terror". At this point the US Senate quickly pulled funding for the program.

I'll briefly describe the connection between learning and prediction markets now, but I'll save the more interesting details for a longer post. The question posed by Economists is how to design these markets so that they lead to robust predictions while also providing the right incentives to bettors. Many of the market mechanisms proposed in the Econ community involve a "market maker", a central agent that buys/sells contracts and sets their prices (this is quite different from the Intrade model). For a particular event, let's say an election, the result will be exactly one of $N$ outcomes, i.e. $N$ candidates. The market maker can then sell $N$ separate contracts, where the $i$th contract pays out USD1 to the purchaser of the contract if the $i$th candidate wins the election (and the remaining contracts pay out nothing). The market-maker's job, then, is to determine what the price of a contract should be given the current demand. Of course, the market maker would like to minimize the potential loss it could suffer, which is the difference between the money it earns selling contracts and the money it pays out once the election is over.

The surprising observation made by Jenn and her colleagues is that we can take the goal of the market maker and massage it into a regret-minimization problem in the "expert setting". The market maker must choose a price vector $p$ at every time period, and these prices should roughly behave like a probability distribution, i.e. $\sum_i p_i \approx 1$ -- if this were not the case then bettors could either buy (or sell) one of each contract at a total price greater than (or less than) USD1, guaranteeing an arbitrage opportunity. When contract $i$ is purchased, the market maker earns a "gain" of $p_i$. However, after the election, the market maker might have to pay out a sum equal to the number of contracts sold for the most popular contract type $i$. Notice, by letting the "contracts" become "experts" in a prediction setting, we are now looking at the difference between the "gain of the best expert" and "expected gain of the algorithm" which chooses a sequence of distributions over contracts/experts. The latter problem is exactly that of regret minimization in the expert setting.

This quick description is really just scratching the surface. There's more to discuss here, but I'll save it for a future post. Stay tuned!

Apr 18, 2010

Bandits robbing dark pools

Since Jake has been reminding me of my duties as a good machine learning citizen which including posting on the blog occasionally, I decided to write this post about something I recently worked on. In their UAI 2009 paper, Ganchev et al from Upenn introduced the dark pools problem. Dark pools are stock exchanges created to facilitate trade of large volumes of shares without putting the large institutional trader at a disadvantage. The thing that is dark about dark pools is that the liquidities of a share available are not revealed to the market (or tape to be more precise). The Wikipedia article on this topic provides an excellent overview.
Ganchev et al formulated the problem of a large trader wanting to sell a large volume of shares by trading a fraction at each of $K$ different dark pools. However, he only gets to see how many of the orders he placed at a particular dark pool actually got executed. Thus the true liquidities get revealed only when his order is larger than the true liquidity (and hence the number of shares he can actually sell is the true liquidity). Suppose this trader wants to maximize the total number of shares he can sell over a horizon of several trading periods.
In our recent paper, we showed that we can formulate the problem as an online convex optimization problem. This was interesting due to a variety of reasons. First, the total number of shares we can trade at each round is allowed to change adversarially. Since we have a hard constraint that the net allocation cannot exceed the total available volume, this seems like a problem of adversarially changing constraint sets. Traditional approaches a la Herbster and Warmuth would make it seem that we should incur a penalty based on how often the changes happen. However, we were able to avoid this penalty, and obtain an exponentiated gradient descent style algorithm that gets minimax optimal regret. The idea was simlpe, for each unit to be traded, we maintain a distribution over all the $K$ venues. When we get a volume constraint, we compute the expected number of shares to be traded at each venue according to our distributions on all units up to the constraint level.
So what do bandits have to do with all this? Well it turns out that the strategy I described above that is based on playing expected values at each venue assumes that it is cool to trade 1.278 shares at venue 1, 2.593 shares at venue 2 and so on. Since that would seem a little impractical, we need to restrict ourselves to playing integral values only. Now let us start with the special case where we have to trade only one share at one of $K$ venues at every round, and we found out if the liquidity at that venue was non-zero or not at that round. But this is just the $K$-armed bandit problem restated, which we know how to solve using the Exp3 algorithm. Can we build on this to get a solution for bigger volumes?
We were able to extend Exp3 to the case of arbitrary volumes, by using a probabilistic rounding between the floor and ceiling of the desired allocation level. Unfortunately this approach yields a $O(T^{2/3})$ regret bound. As always, we have a computationally inefficient way of getting to $O(\sqrt{T})$ regret by using Exp4. A lot of hours of thinking from me and some other smart people has definitely convinced me that this is a hard problem, much like the other scenarios where improving $O(T^{2/3})$ to $O(\sqrt{T})$ has proved to be hard.

Apr 14, 2010

MLcomp

Percy Liang and I have been developing a website, http://MLcomp.org, which many readers should find pretty interesting. In short, MLcomp is "for objectively comparing machine learning programs across various datasets", and we're launching it this week. Soon we'll be posting the brief description below on John Langford's blog, but it's up and running now so check it out!

------

Much of the success and popularity of machine learning has been driven by its practical impact. Of course, the evaluation of empirical work is an integral part of the field. But are the existing mechanisms for evaluating algorithms and comparing results good enough? We (Percy and Jake) believe there are currently a number of shortcomings:
  1. Incomplete Disclosure: You read a paper that proposes Algorithm A which is shown to outperform SVMs on two datasets. Great. But what about on other datasets? How sensitive is this result? What about compute time - does the algorithm take two seconds on a laptop or two weeks on a 100-node cluster?
  2. Lack of Standardization: Algorithm A beats Algorithm B on one version of a dataset. Algorithm B beats Algorithm A on another version yet uses slightly different preprocessing. Though doing a head-on comparison would be ideal, it would be tedious since the programs probably use different dataset formats and have a large array of options. And what if we wanted to compare on more than just one dataset and two algorithms?
  3. Incomplete View of State-of-the-Art: Basic question: What's the best algorithm for your favorite dataset? To find out, you could simply plow through fifty papers, get code from any author willing to reply, and reimplement the rest. Easy right? Well maybe not...
We've thought a lot about how to solve these problems. Today, we're launching a new website, MLcomp.org, which we think is a good first step.

What is MLcomp? In short, it's a collaborative website for objectively comparing machine learning programs across various datasets. On the website, a user can do any combination of the following:
  1. Upload a program to our online repository.
  2. Upload a dataset.
  3. Run any user's program on any user's dataset. (MLcomp provides the computation for free using Amazon's EC2.)
  4. For any executed run, view the results (various error metrics and time/memory usage statistics).
  5. Download any dataset, program, or run for further use.
An important aspect of the site is that it's collaborative: by uploading just one program or dataset, a user taps into the entire network of existing programs and datasets for comparison. While data and code repositories do exist (e.g., UCI, mloss.org), MLcomp is unique in that data and code interact to produce analyzable results.

MLcomp is under active development. Currently, seven machine learn task types (classification, regression, collaborative filtering, sequence tagging, etc.) are supported, with hundreds of standard programs and datasets already online. We encourage you to browse the site and hopefully contribute more! Please send comments and feedback to mlcomp.support (AT) gmail.com.

Apr 6, 2010

Publish your work in journals... so it will live forever?

Lance Fortnow believes a major reason to publish papers in journals, rather than just letting them sit on the Arxiv, is because otherwise *gasp* they might someday get lost! An example he gives:
I wanted to read a copy of Karp's NP-completeness paper which only appeared in the proceedings of a one-shot workshop in 1972. I ended up going to the library to dig it up. But library books get lost and many young people today don't even know where the library is. Later I found out Luca had scanned the paper for a course he taught. But how long will Luca's Berkeley pages last and what about all the papers that don't lead to Turing awards.

So publish your papers, best in a journal as well as a conference. Even if you don't think it matters for you in the short run, it can make a big difference for the community long into the future. What good is pushing the boundaries of science if those boundaries get snapped back because work gets lost.

I have heard this concern raised more than once, and you can say I'm a tad skeptical. I will point out, however, that this was brought up in each case by someone who finished their PhD when the word "Google" was spelled "Googol" (and without all the silly colors).

When I read papers now, I download a first copy which is stored in my downloads folder. When I want to look at it again, I download another copy because it's actually faster than finding the first. When I want to cite it correctly, I download it again. Each of these copies gets stored on a number of incremental backups. Let's say I had some reason to destroy all evidence of having ever read this paper, well I'd have a pretty difficult time of this.

I checked the Arxiv, and there are nearly 600,000 total papers stored there. If we conservatively estimated that the average paper is less than 1.5MB, then I could store the entire Arxiv library on my personal machine. So how many copies must be stored on Google, Yahoo, Bing, etc.? And after a year of being posted on the Arxiv, a typical paper might reside on 10,000 different hard drives, if I had to make a conservative estimate.

The fact is, we're in a world where tiny bundles of data (i.e. 1.5MB pdf files) cost effectively nothing to store and, in the aggregate, are very valuable to index and make available. Indeed, a main factor that will hinder the indexing and availability of a particular journal paper is the existence of a paywall. Compared to such pay-to-play journals, for long-term existence I'd trust my research at the Arxiv any day of the week.

There plenty of good reasons to upgrade an Arxiv paper to a journal paper, but is one of these reasons to assure it won't get lost? No way.

(Bonus Prize: Yahoo's homepage in 1996 and Lance's homepage in 1997, does it bring back memories?)

Mar 31, 2010

The fallacy of adaptivity

Due to the recent Spring Break posting has been a bit slow. But now fun has ended and the crank is turning again. I'm hoping to get back to at least $1 + \epsilon$ posts/week. (P.S. In case it's not yet clear, this blog is intended to be primarily populated by guest posts, so please contact me if interested)

I was talking to my good friend Alekh Agarwal before the break, and he made a important point that I'd like to expand a bit.

In adversarial online learning/decision models, you'll often hear researchers say "the bound holds even against an adaptive adversary". What does this mean? Essentially this says that the the sequence of observed data points can be chosen in an adaptive manner. In other words, the learner/algorithm may be presented with data points that depend on the learner's previous actions. We can describe this mathematically as follows. Let's say we're playing an online decision game in which, at each time $t$, the learner choses a $w_{t}$ followed by the adversary's choice $z_{t}$, and ultimate objective is to minimize some objective $\text{Obj}(w_{1},z_{1},\ldots,w_{T},z_{T})$. For a guarantee against an adaptive adversary, you have to bound this expression:

\[
\min_{w_1} \max_{z_1} \ldots \min_{w_T} \max_{z_T} \text{Obj}(w_{1},z_{1},\ldots,w_{T},z_{T})
\]

Against an oblivious adversary, which is a weaker opponent, the relevant quantity is
\[
\min_{\text{alg} A} \max_{z_1, \ldots, z_{T}} \text{Obj}(w_{1},z_{1},\ldots,w_{T},z_{T}).
\]

In the latter case, we can simply select an algorithm $A$ which chooses actions based on the previous data, i.e. $w_{t} \leftarrow A(z_{1}, \ldots, z_{t-1})$. A crucial point is that, if $A$ is a randomized algorithm, then the oblivious adversary cannot select $w_{t}$ with access to $A$'s coin tosses.

For many typical Machine Learning problems the adaptive setting might seem a silly assumption: my predictions affect future data? But there are a number of problems where robustness against adaptive opponents is not only desirable but critical. In the stock market, for example, big investment firms can affect price direction with large trades.
But as Alekh pointed out, these supposed adaptive guarantees are a bit of a lie -- or at least somewhat misleading, and here's why. Generally, the online learning/decision-making problems are analyzed in the "competitive" framework: we measure the performance of an algorithm relative to a comparator which has access to the data in advance. In the simplest "regret setting", this comparator is just the best fixed $w^{*}$ against a cumulative sequence of cost functions, and the objective is
\[
\sum_{t=1}^{T}\ell(w_{t},z_{t}) - \min_{w^{*}} \sum_{t=1}^{T}\ell(w^{*},z_{t}).
\]

So where's the problem? Well, the adversary may be adapting $z_{1}, z_{2}, \ldots, z_{T}$ to our $w_{1}, w_{2}, \ldots, w_{T}$, yes, but it's not adapting to the fixed choice sequence $w^{*}, w^{*}, \ldots$! In a sense, the comparator gets a free ride, since it's evaluated against the fixed (non-adaptive) sequence $z_{1}, z_{2}, \ldots, z_{T}$.
The point here is that, in the competitive framework, adaptive guarantees are only valuable or meaningful to a point, and that we fundamentally assume nature to be oblivious. This is going to be true of any model that relies on hindsight, where we feel free to say "given what I know now, I would have...". Of course, it's silly to consider such counterfactuals when "what I know now" can depend on my choices, but this is really all we can do given only hindsight.
By the way, can anyone think of a good name for this issue? I was thinking of something like The HindsightIsTwentyTwenty Fallacy or, perhaps, The Sports Critic Fallacy. Ideas?

Mar 15, 2010

Querying to Evade a Classifer

In my last post I discussed differential privacy, which is a property of learners that guarantees preservation of training data's privacy. This time I'll overview another direction for research in the security of ML.

In the evasion problem, a fixed classifier filters instances in $$R^d$$ as either positive (bad instances) or negative (good instances). The attacker has a target positive instance of interest that she wishes to approximate by an element of the negative class, and the goodness of an approximation is measured by a cost function which is a metric-like distance to the positive instance. Given access to the classifier's membership oracle, the attacker's goal is to find a near-minimal-cost negative instance. Concretely, what query-complexity is necessary and sufficient for finding a negative instance of cost at most $$\epsilon$$ above minimum?

A motivating example is email spam. A spammer wants to find a non-spam email that is as close as possible (e.g., in hamming distance) to a target spam message within the class of messages not filtered by Gmail. The spam message has some payload like a link to an online pharmacy. Queries to the membership oracle correspond to messages sent by the spammer to her own account on Gmail. The spammer's secondary objective, after finding the modification to her spam that goes unfiltered by Gmail, is to send as few messages as possible. Too many queries may arouse suspicion of malicious intent.

A sufficient task for evasion is the reverse engineering problem, which involves querying to learn the decision boundary. Whether reverse engineering is harder than, or indeed necessary for, evasion is an important question.

In their KDD'05 paper, Daniel Lowd & Christopher Meek posed the evasion problem and derived a reverse-engineering-based approach for linear classifiers and $$L_1$$ cost. Their algorithm's query complexity was linear in the instance dimension $$d$$. They also considered Boolean feature spaces.

In our forthcoming AISTATS'2010 paper, Nelson et al. consider the more general problem of evading classifiers that partition instance space with either a convex positive or convex negative class, for $$L_1$$ cost. In the former case we derive an algorithm that has query complexity linear in $$d$$ (slightly improving on Lowd & Meek's complexity in the linear case) with nearly matching lower bounds; for the convex negative class case we use a randomized query-based algorithm of Bertsimas & Vempala for deciding whether two convex sets intersect, using geometric random walks. The second algorithm has query complexity that is a degree 5 polynomial in $$d$$.

Notably our algorithms do not reverse engineer the classifier, which is known to take exponential in $$d$$ queries in the convex cases. Thus in the convex case, reverse engineering is sufficient but much harder than evasion.

What can be said about evading other families of classifiers? What other problems can you think of that are qualitatively like evasion or reverse engineering?

Mar 8, 2010

Bandit Online Convex Optimization

The Setting

An abstraction that has proved very useful in Machine Learning and continues to attract the interest of researchers is the online convex optimization (OCO) framework. In OCO, the interaction between the learner and environment proceeds in rounds (let's say, for simplicity, that the number $T$ of rounds is fixed in adavnce). At round t,
  • Learner makes a move $x_t$
  • Adversary makes a move $f_t$
  • Learner suffers the "cost'' $f_t(x_t)$
The set of moves for the learner is some convex set $K \subset R^d$. The adversary's moves, on the other hand, are convex functions $f_t:K \to R$ in some class $\mathcal{F}$. The goal of the learner is to minimize its regret:

\[ \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x) \ .\]

As is clear from the name "adversary", we do not wish to assume anything about the way the functions $f_t$'s are generated. They are generated in a completely adversarial manner and the learner's task is to behave in a way such that regret is bounded no matter what the adversary does.

Readers of this blog have probably seen this quantity before. I will not try to justify why it is important to study this quantity here. The excellent book of Cesa-Bianchi and Lugosi is a good place to learn more about the motivation behind studying the regret.

The topic of this post is to discuss the "price of bandit information'' (borrowing the title of a paper by Dani et al.) in the OCO setting. In the description of OCO above, I was deliberately vague about what the learner gets to see at each round. After all, OCO is a kind of repeated game between the learner and the adversary and the information that the players receive is key to their being able to play well. In the "full information'' version of OCO, the learner gets to see the entire function $f_t$. In particular, it is OK if the learner's strategy involves accessing the gradient $\nabla f(x)$ or Hessian $\nabla^2f(x)$ of $f_t$ at any point $x \in K$. In the "bandit'' version of OCO, the learner is severely restricted: it only gets to see the cost $f_t(x_t)$. What is the price that the learner has to play for only having access to this restricted information?

The "bandit'' terminology comes from Statistics that has a long tradition of studying multi-armed bandit problems. The Statistics literature usually makes some stochastic assumptions about the environment. The adversarial version of multi-armed bandit problem was first studied by Auer, Cesa-Bianchi, Freund and Schapire. Although the "bandit'' terminology is widely used, people also use other descriptions like "partial feedback'' or "opaque feedback'' (e.g., see Kleinberg's thesis).

The full information OCO setting is relatively well understood. Using Zinkevich's online gradient descent algorithm one can get $O(\sqrt{T})$ regret rate that is known to be optimal. The bandit OCO problem, in contrast, is far from being completely understood. First of all, in the bandit setting, it is essential for the learner to randomize his moves. Therefore, we need to distinguish between oblivious and adaptive adversaries. An oblivious adversary is really just a sequence of functions. An adaptive adversary however changes its moves according to the play of the learner. For technical definitions, the reader can consult Kleinberg's thesis. For this post, we can assume an oblivious adversary. Even then, matching upper and lower bounds for the regret are not known.

Known results

The seminal papers on the bandit OCO problem are those of Flaxman, Kalai and McMahan and the almost concurrent work by Kleinberg. Both obtain $O(T^{3/4})$ expected regret against an adversary that plays bounded and Lipschitz continuous convex functions, i.e.

\[ f_t(x) \in [-M,M] \ ,\]
\[ |f_t(x) - f_t(y)| \le L\|x - y\|_2 \ .\]

Kleinberg's result requires more smoothness assumptions and only holds against oblivious adversaries. Flaxman et al.'s result actually holds against adaptive adversaries as well. To the best of my knowledge, there is no lower bound larger than the obvious $O(\sqrt{T})$ full-information lower bound. So, there's a gap between lower and upper bounds and closing it seems like an fundamental problem.

No further progress: lack of compelling applications?

At this point, I would like to point out that there has been essentially no progress on the general bandit OCO problem after the two seminal papers mentioned above. There was a lot of work on bandit online linear optimization (OLO), i.e. the adversary plays not just convex but linear functions. The surprising answer here was that the learner pays no price for bandit information as far the dependence on $T$ is concerned (of course, the dependence on other parameters, like the dimension $d$, does get worse). A long line of research culminated in the following two results: (i) it is possible to achieve $O(\sqrt{T})$ regret with high probability against an adaptive adversary using an inefficient algorithm, (ii) it is possible to achieve $O(\sqrt{T})$ expected regret against an oblivious adversary using an efficient algorithm (provided a self-concordant barrier for the set $K$ can be efficiently computed). These two results can be found here, and here. Except for the issue of obtaining a high probability result with an efficient algorithm (for partial progress, see the recent work of Abernethy and Rakhlin), the bandit OLO problem can be considered solved.

Why did later research consider the special case of bandit OLO and leave the bandit OCO problem untouched? Is it because bandit OLO has "real world'' applications such as network routing (online shortest paths is a special case of bandit OLO)? Does anyone know of a situation where the bandit OCO setting would make sense? A possible application of bandit OCO would be to repeated convex-concave games. Here, the learner tries to minimize a two-variable function $f(x,y)$ and the adversary tries to maximize it. The function $f(x,y)$ is convex in $x$ and concave in $y$. If we assume that the learner, having played $x_t$, does not get to see the move $y_t$ of the adversary but only the payoff $f(x_t,y_t)$, then this problem can be cast in the bandit OCO framework by setting $f_t(x) = f(x,y_t)$. Is there a "real world'' situation that corresponds to such a partial feedback convex-concave repeated game?

Some parting thoughts

It will be great to see some progress on the bandit OCO problem. Before I close this post, here are three observations about the problem.

First, can we prove, using a very high level argument, that bandit OCO ought to be no harder than bandit OLO? In the full information setting, such an argument is, in fact, possible and so we know that, a priori, the worst case regret in full-info OCO should be no larger than that in full-info OLO. I suspect that such a high level argument is not possible in the bandit setting. But it will be good to understand why the full-information argument breaks down in the bandit setting.

Second, existing bandit OLO algorithms fall roughly into two categories. The first kind of algorithm discretizes the body $K$ and then uses algorithms from Auer et al. treating the problem as a multi-armed bandit problem with a huge number of arms. The second category used projected gradient ideas more akin to Zinkevich's online gradient descent algorithm. Both kinds of algorithms use some technique for estimating the information (such as the gradient) that is needed to run the algorithm but is not revealed by the adversary. How do we tweak the estimation techniques in the bandit OLO papers to generalize them to bandit OCO? Is one kind of algorithm easier to tweak that the other?

Third, what is the connection of bandit OCO to deterministic and stochastic convex optimization? For example, there is the well-known first-order oracle model for studying the complexity of convex optimization. In this model, the optimization algorithm can query the oracle for the values of $f(x)$ and $\nabla f(x)$ at any point $x$ in the domain on the convex funcion $f$ that is being minimized. Are there results for the "zero-order'' oracle model where the algorithm can only query for $f(x)$? Are there other models considered in other fields (Statistics, Operations Research, Economics, etc.) that are related to bandit OCO?

Feb 23, 2010

Theory vs Practice, Take 2

Suresh Venkatasubramanian has an entertaining post on justifying theoretical results with experiments and, in particular, how this can be quite a pain for theory-oriented researchers:
So you're looking at practical ramifications of some nice theory result, or you're trying to apply formal algorithmic tools to some application problem. If you're lucky, the problem doesn't have an existing base of heuristics to compare against, and so even a straight-up implementation of your ideas is a reasonable contribution.

Of course, you're rarely this lucky, and there's usually some mish-mash of heuristics to compare against. Some of them might be head-scratchingly weird, in the "why on earth should this work" category, and some are possibly more principled. At any rate, you go and implement your idea, and you suddenly realize to your horror that worst-case bounds don't mean s*** when your principled method is ten times slower than the crazy heuristics. So now what ?


It's a nice post for anyone who has done work on "both sides of the aisle" and is definitely worth a read. He goes on to discuss how theoretically-inclined researchers are not trained for experimental work, which is an interesting point. (There's also a link to a nice post by Michael Mitzenmacher)

At a conference like NIPS, where you get a mix of submissions on both theory and applications, you see a common phenomenon: clearly-theoretical papers with a mildly-conspicuous experimental section, and clearly-experimental papers with one or two somewhat awkward theoretical results. The authors' strategy is essentially the same in both cases: this makes the paper "reviewer-proof", since you never know which "type" or researcher will be judging it.

Having spoken to several researchers on both sides, it's apparent that there's pretty strong distrust in each direction. The theoreticians, especially ones who've done a bit of experimental work (and tried to reproduce results e.g.), believe that performance comparisons are so easily manipulated by "fudging" -- tuning your parameters better than your competitor methods, choosing the right datasets, choosing the correct plots, etc. -- that they can't be trusted. On the other hand, experimentalists believe that most theorems are either (a) trivial once you look at what the algorithm is doing or (b) have been endowed with enough assumptions to make the result useless.

Suresh argues that, regarding experimental validation, theoreticians don't "get the right kind of training to do it well", but I would take this even further: theoretically minded folks don't get the right kind of training to even appreciate experimental results or to know how to interpret and judge them. And vice-versa for experimentalists. One occasionally finds crossovers, but I think this is somewhat rare.

In the end, I don't see this as a bad thing. The fact that mathematicians and empiricists can attend the same conferences and have useful discussions about research problems is great. But there's no reason we ought to expect them to do each other's job.

Feb 16, 2010

The last predictor

As most of us reading the blog know, now is deadline season, with ICML, COLT, and I guess UAI coming up around the corner. That's my excuse for the slow posting.

In any event, something I've been pondering a while is the following: why, in every paper written on online learning or stochastic gradient descent, is the convergence in terms of either the average predictor or a weighted version thereof? This has been true since Nemirovski and Yudin's Problem Complexity and Method Efficiency in Optimization, and remains true in Nesterov's work, Cesa-Bianchi et al.'s work, and going back farther to Polyak and Juditsky. Intuitively, the reason one uses the average is to get some concentration of measure effects using something like Azuma's inequality.

Of course, averaging destroys sparsity, which one can get in online learning through composite gradient methods (see Lin Xiao's RDA or Yoram Singer and my forward-backward splitting work). And in practice, everyone doing stochastic gradients or online learning to train a predictor just takes the last weight vector anyway. Does this work? Is it dumb? It seems to work fine in practice.

Nedich and Bertsekas have done some work that shows that the last predictor in stochastic gradient descent converges, but their work requires a step size that decays faster than what one usually uses in online learning. At some level, the faster decay in Nedich and Bertsekas's work seems necessary. Consider the central limit theorem, from which we know that
\[ \sqrt{n} (\frac{1}{n} \sum_{i=1}^n X_i -\mu) \rightarrow N(0, 1) \]
in distribution. Most online methods, like Xiao's RDA, say, have a predictor at round $t$ which scales as something like
\[ w_t = -\frac{1}{\sqrt{t}} \sum_{i=1}^t g_i, \]
the same scaling as in the CLT. Does this mean that the last predictor will never converge to the minimum, but some random variable normally distributed around the minimizer? Or is a more careful analysis possible to justify what we (I at least) always do?