Feb 16, 2010

The last predictor

As most of us reading the blog know, now is deadline season, with ICML, COLT, and I guess UAI coming up around the corner. That's my excuse for the slow posting.

In any event, something I've been pondering a while is the following: why, in every paper written on online learning or stochastic gradient descent, is the convergence in terms of either the average predictor or a weighted version thereof? This has been true since Nemirovski and Yudin's Problem Complexity and Method Efficiency in Optimization, and remains true in Nesterov's work, Cesa-Bianchi et al.'s work, and going back farther to Polyak and Juditsky. Intuitively, the reason one uses the average is to get some concentration of measure effects using something like Azuma's inequality.

Of course, averaging destroys sparsity, which one can get in online learning through composite gradient methods (see Lin Xiao's RDA or Yoram Singer and my forward-backward splitting work). And in practice, everyone doing stochastic gradients or online learning to train a predictor just takes the last weight vector anyway. Does this work? Is it dumb? It seems to work fine in practice.

Nedich and Bertsekas have done some work that shows that the last predictor in stochastic gradient descent converges, but their work requires a step size that decays faster than what one usually uses in online learning. At some level, the faster decay in Nedich and Bertsekas's work seems necessary. Consider the central limit theorem, from which we know that
\[ \sqrt{n} (\frac{1}{n} \sum_{i=1}^n X_i -\mu) \rightarrow N(0, 1) \]
in distribution. Most online methods, like Xiao's RDA, say, have a predictor at round $t$ which scales as something like
\[ w_t = -\frac{1}{\sqrt{t}} \sum_{i=1}^t g_i, \]
the same scaling as in the CLT. Does this mean that the last predictor will never converge to the minimum, but some random variable normally distributed around the minimizer? Or is a more careful analysis possible to justify what we (I at least) always do?

2 comments:

Dave Lewis said...

Using a sparsity favoring algorithm and averaging over only the last few vectors might be able to get the best of both worlds: sparsity (since the weight vectors will have similar sparsity patterns) plus some degree of concentration effect. We're trying that out in a project I'm working on, but I would be very interested in any theoretical results.

Jussi Kujala said...

Pegasos SVM solver appears to have an analysis on using the last weight vector, didn't have much time to check it though, or read your post carefully

Post a Comment