Aug 30, 2010

Calibration and Weather Forecasting

I recently gave a few talks (at Harvard, Cornell and Yahoo! Research NYC) on some recent work with Peter Bartlett and Elad Hazan. The main topic was Blackwell's Approachability Theorem, but I also delved into the subject of "calibrated forecasting". I'll quickly describe this notion.

Imagine a forecaster that makes a sequence of probability predictions $p_1, p_2, p_3, \ldots \in [0,1]$ on a sequence of (as of yet unknown) binary outcomes $z_1, z_2, z_3, \ldots \in \{0,1\}$. How can we judge the accuracy of these predictions? A natural expectation is that, when the forecaster is predicting a probability $p$, the outcome will occur roughly $p\%$ of the time. In other words, if the forecaster is predicting the probability of rain, then we should hope that on the days when the forecaster says "50% chance of rain" that it does, indeed, rain about 50% of those days. This is exactly what we mean by a calibrated forecaster. More precisely, let's define
$$\rho_T^{\epsilon}(p) := \frac{\sum_{t=1}^T y_t \mathbb{I}[p_t \in (p-\epsilon, p+\epsilon)]}{\sum_{t=1}^T \mathbb{I}[p_t \in (p-\epsilon, p+\epsilon)]}.$$
($\mathbb{I}[\cdot]$ is the indicator function.) In words, $\rho_T^\epsilon(p)$ is the empirical frequency of the outcome on just the rounds when the predictions were "roughly" $p$. According to our intuitive notion of calibration, we'd like this frequency to be close to $p$. Thus, we say that our forecaster is calibrated if, for every $p$ and every $\epsilon$,
$$\lim\sup_{T \to \infty} |\rho_T^\epsilon(p) - p| \leq \epsilon .$$

Despite the natural way to pose the concept of calibration, it is actually quite a weak notion of "goodness" of a forecaster. Let's take the weather example again, and imagine a weather forecaster who, despite the available data and the meteorological conditions, always predicts the base probability of rain. If the forecaster happens to know in advance that it rains about 13.2% of the time, then to remain calibrated she might as well predict this value every day. Hence, a well-calibrated but entirely useless forecaster.

It should be noted that, despite this seemingly bad example, there are situations when calibration is more useful; in particular, when the outcome distribution is not stationary. As it turns out, one can achieve calibration even when the sequence $z_1, z_2, \ldots$ is adversarially chosen!

In any case, after giving this talk at Yahoo!, David Pennock mentioned to me that people have looked into weather forecasts to see if they do actually satisfy the calibration condition. Apparently weather forecasters are, in fact, horribly uncalibrated. Here's some data posted to David's blog a couple of years back:

Precip Prediction Actually Rained
0% 7.9% of the time
10% 5.3%
20% 10.8%
30% 19.2%
40% 26.5%
50% 27.8%
60% 46.2%
70% 58.0%
80% 58.1%
90% 63.6%
100% 66.7%


Kind of shocking actually, since maintaining calibration should be relatively simple. Does anyone have a good explanation for why this might be, i.e. in terms of the incentives of TV meteorologists?

(P.S. David also has a really nice post from 2006 on the general question of how to evaluate probabilistic predictions.)

Aug 20, 2010

Bayesian strategy for getting the minimax bound

Consider a prediction game that involves a player, and an adversary. At round $t$ the player reveals her belief about the outcome of variable $y_t$, before adversary unveil its true value. The player reveals her belief in form of a probability distribution $p_t(y)$. The adversary, then oblivious of the player's current move reveals the value of $y_t$. The player suffers a loss which is $\mbox{-}\ln p(y_t)$. The game is played $n$ times and the goal of the player is to minimize the difference between the cumulative loss and the best single cumulative loss given all the values of $y_t$. The player has access to some experts that give their advice in form of probability distributions over sequences of $y$, i.e. $p(y^n|\theta)$ where $y^n = y_1,y_2,...,y_n$. The goal is to minimize the regret

$
R_n(y^n,\pi) = \mbox{-}ln \hat{p}(y^n|\pi) - \min_{\theta}\left[\mbox{-}ln p\left(y^n|\theta\right)\right]
$

Here $\hat{p}(y^n|\pi)$ is the Bayesian strategy under prior $\pi$. Expanding $\hat{p}(y^n|\pi)$ we get the following:

$
R_n(y^n,\pi) = \max_{\theta} ln \frac{p(y^n|\theta)}{\int_{\theta} \pi(\theta)p(y^n|\pi)\,d\theta}
$
Since $y^n$ can be anything, we try to minimize the regret for the worst $y^n$, i.e. $B_n(\Theta) = \min_{\pi} \max_{y_n} R_n(y^n,\pi)$. Now if we consider other strategies too, the minimax regret is defined in the following way:

$
V_n(\Theta) = \min_{\hat{p}_n} \max_{y^n} \max_{\theta} ln \frac{p(y^n|\theta)}{\hat{p}_n(y^n) }
$

Obviously $V_n(\Theta) \leq B_n(\Theta)$. Bianci showed that the the maximum likelihood distribution normalized (MLN) is the strategy that achieves the minimax bound $V_n(\Theta)$. The MLN strategy is defined in the following way:

$
p(y^n) = \frac{\max_{\theta} p(y^n|\theta)}{\sum_{\hat{y}^n} \max_{\theta} p(\hat{y}^n|\theta)\, d\theta}
$

The most natural question is that does $V_n(\Theta)$ equal $B_n(\Theta)$, and which prior makes this gap vanished.
Limiting ourselves to regular parametric experts, we aim for deriving a Bayesian strategy with a regret close to optimal regret, i.e. to that of the minimax strategy. The Bayesian strategy is as simple as averaging experts under a prior. The hard part is finding a prior that can convert the class of experts to a probability distribution not that different from the maximum likelihood distribution normalized. Clarke and Barron showed in [1] that the when the data is i.i.d. under some smoothness assumption of the log likelihood function, the relative entropy between the true density generating the data and the density of the Bayesian mixture density is $ D\left(p_{\theta_0}^n || M_n\right) = \frac {d} {2} \ln n + c$ where $c$ is a constant independent of $n$. The smoothness assumptions include the positivity of the prior in the neighborhood of $\theta_{0}$, definite positivity of the information matrix at $\theta_{0}$, the twice differentiability of the likelihood and boundedness of the expected value of the first and second derivative of the absolute log-likelihood in the neighborhood of $\theta_{0}$, and finally the asymptotic concentration of the posterior distribution at $\theta_{0}$. If the last condition is not satisfied all the equalities become upper bounds.
$
D\left(p_{\tiny{\theta_0}}^n || \hat{p}^n\right) = E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{\theta_0})}{\int_{ {\Theta}} \pi(\theta)p(y^n|\theta) \, d\theta} \right]
$
Relating the true density $p_{\tiny{\theta_0}}$ to the density of the maximum likelihood $p_{\tiny{ \tilde{\theta} }}$ we get:
$
D\left(p_{\tiny{\theta_0}}^n || \hat{p}^n\right) = E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{ \tilde{\theta} (y^n)})}{\int_{ {\Theta}} \pi(\theta)p(y^n|\theta) \, d\theta} \right] +
E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{ \tilde{\theta} (y^n)})} {p(y^n| \tiny{\theta_0})}\right] $
$
= E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{ \tilde{\theta} (y^n)})}{\int_{ {\Theta}} \pi(\theta)p(y^n|\theta) \, d\theta} \right] - \frac{d} {2} \Rightarrow
$
$
E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{ \tilde{\theta} (y^n)})}{\int_{ {\Theta}} \pi(\theta)p(y^n|\theta) \, d\theta} \right] = \frac {d} {2} \log n + c + \frac {d} {2} = \frac {d} {2} \log n + \hat{c}
$
$
= \frac {d} {2} \log n + \frac {d} {2} \log \frac {1}{2\pi e} + \frac {1}{2} \log \det I(\theta_0) + \log \frac {1}{\pi(\theta_0)} + \frac {d} {2}
$
This means that in the online setting if the data were generated in an $i.i.d$ way, then the expected value of the regret would be $ \frac {d} {2} \log n + \hat{c}.$


These bounds are for stochastic online encoding. Rissanen showed that for a parametric class of experts with open and bounded parameter space and under some regularity conditions the minimax bound for encoding is
$
V_n(\Theta) = \frac{d}{2} \ln \frac {n}{2\pi} + \ln \int_{\Theta} \sqrt{\mbox{I}(\theta)}\,d\theta + o(1)
$
This bound could be easily obtained by taking the prior in Clarke and Borron to be proportional to the square root of Fisher information.

After this long introduction, let us consider the following two open problems:
Is it possible to derive an upper bound for regular parametric families in the adversarial setting, similar to that of the stochastic setting? i.e. $R_n(\Theta) = \frac{d}{2} \ln \frac {n}{2\pi} + \ln \int_{\Theta} \sqrt{\mbox{I}(\theta)}\,d\theta + o(1)$

Second question is that for what regular parametric families does there exist a prior that can achieve the minimax regret, in other words given $n>1$ does there exist a prior that averages the parametric family to give us the maximum likelihood normalized? As an example consider the Bernoulli distribution. We are given one bit at a time, 0 or 1, and our experts are $0\leq\theta \leq 1$. Given $n$ bits, our goal is to find a prior that can convert the advice of the experts to the MLN strategy. We use Frakas' lemma to prove the existence of this prior. Farkas' lemma says that for a $n\times m$ matrix $A$, $Ax = b$ has a non-negative solution i.e. $x \succeq 0$,
if and only if $A^{\mathrm{T}} y \succeq 0 \Rightarrow y^{\mathrm{T}}b \succeq 0 $. We use this lemma by discretizing our $\theta$ into $N$ tiny pieces and solving for a prior,
i.e., non-negative $x$ that makes $Ax$ equals to $b$. Here $b$ is the MLN probability with $b_m$ being equal to the following:
$$
{n \choose m} \left(\frac {m}{n}\right)^m \left(1 - \frac{m}{n}\right)^{(n-m)} / \sum_{m=0}^n{n \choose m} \left(\frac {m}{n}\right)^m \left(1 - \frac{m}{n}\right)^{(n-m)}
$$
And for $A_{m,i}$ we have the following:
$$
A_{m,i} = {n \choose m}\left(\frac {i}{N}\right)^m \left(1 - \frac{i}{N}\right)^{(n-m)}
$$
Pushing $N$ toward infinity we get the following problem, that is equivalent of the existence of a prior that can achieve the minimax bound:
For a given $n>0$, and a set of real-valued variables $y_0, y_1, ..., y_n$ the following polynomial is always non-negative for $0 \leq \theta \leq 1$
$$
\sum_{m=0}^n y_m {n\choose m} \left (\theta\right)^m \left(1 - \theta \right)^{(n-m)} \geq 0
$$
Prove the following
$$
\sum_{m=0}^n y_m {n\choose m} \left (\frac {m} {n}\right)^m \left(1 - \frac{m}{n} \right)^{(n-m)} \geq 0
$$

Distributed Learning Workshop at NIPS

Together with John Duchi, Ofer Dekel and John Langford, I will be organizing a workshop on Distributed Machine Learning at NIPS coming winter. The workshop is titled Learning On Cores, Clusters and Clouds.

The one day workshop will feature two keynote talks, 3 to 5 shorter talks along with poster spotlights and posters. We have been quite lucky to snag John Tsitsiklis and Carlos Guestrin as our keynote speakers.

We just posted our Call for papers online here. If you have any relevant work you have done in the area in the past, or have even partial progress on, I will strongly encourage you to submit to our workshop. We're hoping to have this workshop be a good platform for machine learning researchers to learn the tricks and techniques used elsewhere in distributed computation, and also identify key problems in machine learning where distributed learning is greatly needed.

The submission deadline is October 17, so there's plenty of time. We expect 4 page extended abstracts (NIPS format) with an unlimited abstract for your pleasure. So get on your cores, clusters and clouds and get cracking!

Hopefully we will get some high quality submissions, and even if you don't submit you should definitely attend!

Aug 8, 2010

The Journal of the ACM has a P/NP policy

Indeed:
No author may submit more than one paper on the P/NP or related long-standing questions in complexity theory in any 24 month period, except by invitation of the Editor-in-Chief. This applies to resubmissions of previously rejected manuscripts.

I had assumed that the majority of P/NP papers were pranks, but it appears that there are plenty of serious submissions too. The JACM justifies this policy as helping "to mitigate the burden of repeated resubmissions of incremental corrections of errors identified during editorial review".

Huh.

UPDATE (10:41pm PST): Coincidentally, this was posted before I heard the buzz surrounding the purported proof that $P \ne NP$. But I do hope Deolalikar hasn't submitted something similar to JACM in the previous 24 months...

Aug 4, 2010

Section D

I spent a good part of the end of July buried in 8 NIPS papers to review, and I'm sure the same goes for many of you as well. The merits, or lack thereof, regarding the assignment of this many papers per reviewer to finish in one month (or, as it were, one week, at least for procrastinators like me) is a perhaps a topic for another post. But I think it goes without saying that this does have a huge effect on the quality of the review process.

I will say that this round of NIPS reviews did reveal a rather striking trend among the NIPS Theory Papers, which were the majority of my review stack. Nearly all the papers maintained an outline that hewed very close to the following:

A) Introduction
B) Algorithm
C) Corresponding Theoretical Result
D) Experimental Section
E) Conclusions

In general, I have no beef with such an outline... except for the obligatory appearance of Section D. Section D drives me crazy.

Basically everyone who has submitted at least a couple of NIPS papers understands Section D: it's a form of "reviewer insurance". As a result of both the large reviewer base and the diverse selection of submissions at NIPS, theory papers often get matched with reviewers who are mostly unfamiliar with the relevant research area, and these reviewers are quite likely to be on the more applied side of the ML community. Such reviewers understandably skip past the theoretical sections and can only judge the paper by considering an experimental comparison with a benchmark dataset.

So throw in a Section D and no harm no foul, right? In some cases, maybe, but in many instances I think this practice is both disingenuous and misleading; it actually detracts from the goal of the paper. This is for two reasons. The first is that, by including a page or more of experimental results, the authors take away space that might have been used for, perhaps, giving more intuition for the theory. But even more important is that, by including an experimental section, the authors are essentially giving credit to the statement "this work should be judged by how it improves performance on a dataset". But in many cases, this is simply not true. Often a theory result will be an improvement in a single piece of a long algorithmic pipeline. Whether a modification in this piece does or does not improve the performance of the pipeline for some dataset is immaterial, as the change might have been the result of any number of ancillary factors. What's important is whether we have a better modular guarantee for this piece, and that's why we have theorems and proofs.

Jul 18, 2010

David Blackwell has died

(July 26, 2010) Update: Eran Shmaya links to a bunch of really nice posts on Blackwell, several of which I really enjoyed.
-------
On July 8th David Blackwell, professor emeritus at UC Berkeley's Dept. of Statistics, passed away. He was 91 years old.

On this blog, I have made my admiration for Prof. Blackwell pretty clear, devoting not only a post on Blackwell's Approachability Theorem, but also another post on the interesting story of his life and career in research. I'm particularly saddened that I never got a chance to meet him, especially so given that I've been a student at his university for four years.

For those interested, The New York Times has a really nice piece on Blackwell's life and accomplishments.

Jul 15, 2010

Markets, Equilibria and Regret Minimization

In a whirlwind post, Noam Nisan ties together a range of topics, including regret minimization, "socially concave" games, and market equilibria. It's quite a discussion and definitely worth a read.

Let me emphasize a subtle but important point that I think often gets lost. A correlated equilibrium (or a coarse correlated equilibrium) in a multiplayer game is most naturally defined via the notion of regret. A joint distribution on the strategy profile is a correlated equilibrium (or coarse correlated equilibrium) if each player has no internal regret (or, respectively, no external regret) when all jointly play according to this distribution. Hence, when all players are playing according to an internal (or external) regret minimizing strategy, the empirical distribution of the strategy profiles must, almost by definition, be a near-equilibrium, since each player has vanishing regret.

When it comes to computing Nash equilibria, however, the same trick doesn't work -- perhaps not surprising since this problem is known to be PPAD-complete in general. What distinguishes Nash equilibria is that each player must have no regret independent of their opponent's randomized actions; that is, a Nash equilibrium is a collection of uncorrelated distributions on each player's strategy. However, Noam points out that sometimes we get lucky and can achieve Nash equilibria via no-regret learning:
Just the fact that we have called what is reached (in the sense of average over time) an equilibrium does not mean that we have to like it from a game theoretic sense. Could we get a “normal” equilibrium from such corse equilibrium? Even-Dar, Mansour, and Nadav point out a family of games, termed socially concave games, where this is the case.

... And We're Back!

Posting has slowed to a crawl over the past few months. My apologies for this -- it's been a crazy period with deadlines, conferences and travel.

But the summer is in full swing, and research is blossoming. There is no shortage of topics I would like to discuss, and I have spoken to many who would like to get a post of their chest. I'll be lining up authors and topics, and hence posting shall commence anew. If any readers would like to contribute, please let me know.

May 24, 2010

Prediction Markets and Learning

Jenn Wortman Vaughan recently visited us here at Berkeley, and she presented some interesting work. In short, she and several coauthors found an interesting connection between the design of prediction markets, a mechanism for incentivizing individuals to reveal their beliefs about the outcome of a future event, and design of online learning algorithms.

Prediction markets have gotten quite popular in recent years, particularly because research has shown that, at least in many circumstances, they are quite accurate at predicting future events. The most well-known example of such "betting markets" is Intrade, where bettors can buy and sell contracts on a range of different events, such as "Barack Obama will win the US Presidential Election in 2012". Two bettors split the cost of a $100 contract and, after the outcome is revealed, this contract pays off $100 to the correct party. How the parties share the cost of the contract is determined by a market, and a bettor willing to pay $60 for the "Obama in 2012" contract presumably believes that Obama has at least a 60% chance of winning, since otherwise this contract would have negative expected payout.

It has been noted frequently that, in the 2004 US presidential election, the contract prices on Intrade correctly predicted the outcome of all 50 state contests. In fact, these markets were considered so accurate that DARPA wanted to try use them to help inform policy makers, setting up what was to be called the Policy Analysis Market. DARPA apparently wanted to use the markets to predict, among many others, future political events in the Middle East. The news headline, of course, read "Pentagon Web Site Promotes Betting on Terror". At this point the US Senate quickly pulled funding for the program.

I'll briefly describe the connection between learning and prediction markets now, but I'll save the more interesting details for a longer post. The question posed by Economists is how to design these markets so that they lead to robust predictions while also providing the right incentives to bettors. Many of the market mechanisms proposed in the Econ community involve a "market maker", a central agent that buys/sells contracts and sets their prices (this is quite different from the Intrade model). For a particular event, let's say an election, the result will be exactly one of $N$ outcomes, i.e. $N$ candidates. The market maker can then sell $N$ separate contracts, where the $i$th contract pays out USD1 to the purchaser of the contract if the $i$th candidate wins the election (and the remaining contracts pay out nothing). The market-maker's job, then, is to determine what the price of a contract should be given the current demand. Of course, the market maker would like to minimize the potential loss it could suffer, which is the difference between the money it earns selling contracts and the money it pays out once the election is over.

The surprising observation made by Jenn and her colleagues is that we can take the goal of the market maker and massage it into a regret-minimization problem in the "expert setting". The market maker must choose a price vector $p$ at every time period, and these prices should roughly behave like a probability distribution, i.e. $\sum_i p_i \approx 1$ -- if this were not the case then bettors could either buy (or sell) one of each contract at a total price greater than (or less than) USD1, guaranteeing an arbitrage opportunity. When contract $i$ is purchased, the market maker earns a "gain" of $p_i$. However, after the election, the market maker might have to pay out a sum equal to the number of contracts sold for the most popular contract type $i$. Notice, by letting the "contracts" become "experts" in a prediction setting, we are now looking at the difference between the "gain of the best expert" and "expected gain of the algorithm" which chooses a sequence of distributions over contracts/experts. The latter problem is exactly that of regret minimization in the expert setting.

This quick description is really just scratching the surface. There's more to discuss here, but I'll save it for a future post. Stay tuned!

Apr 18, 2010

Bandits robbing dark pools

Since Jake has been reminding me of my duties as a good machine learning citizen which including posting on the blog occasionally, I decided to write this post about something I recently worked on. In their UAI 2009 paper, Ganchev et al from Upenn introduced the dark pools problem. Dark pools are stock exchanges created to facilitate trade of large volumes of shares without putting the large institutional trader at a disadvantage. The thing that is dark about dark pools is that the liquidities of a share available are not revealed to the market (or tape to be more precise). The Wikipedia article on this topic provides an excellent overview.
Ganchev et al formulated the problem of a large trader wanting to sell a large volume of shares by trading a fraction at each of $K$ different dark pools. However, he only gets to see how many of the orders he placed at a particular dark pool actually got executed. Thus the true liquidities get revealed only when his order is larger than the true liquidity (and hence the number of shares he can actually sell is the true liquidity). Suppose this trader wants to maximize the total number of shares he can sell over a horizon of several trading periods.
In our recent paper, we showed that we can formulate the problem as an online convex optimization problem. This was interesting due to a variety of reasons. First, the total number of shares we can trade at each round is allowed to change adversarially. Since we have a hard constraint that the net allocation cannot exceed the total available volume, this seems like a problem of adversarially changing constraint sets. Traditional approaches a la Herbster and Warmuth would make it seem that we should incur a penalty based on how often the changes happen. However, we were able to avoid this penalty, and obtain an exponentiated gradient descent style algorithm that gets minimax optimal regret. The idea was simlpe, for each unit to be traded, we maintain a distribution over all the $K$ venues. When we get a volume constraint, we compute the expected number of shares to be traded at each venue according to our distributions on all units up to the constraint level.
So what do bandits have to do with all this? Well it turns out that the strategy I described above that is based on playing expected values at each venue assumes that it is cool to trade 1.278 shares at venue 1, 2.593 shares at venue 2 and so on. Since that would seem a little impractical, we need to restrict ourselves to playing integral values only. Now let us start with the special case where we have to trade only one share at one of $K$ venues at every round, and we found out if the liquidity at that venue was non-zero or not at that round. But this is just the $K$-armed bandit problem restated, which we know how to solve using the Exp3 algorithm. Can we build on this to get a solution for bigger volumes?
We were able to extend Exp3 to the case of arbitrary volumes, by using a probabilistic rounding between the floor and ceiling of the desired allocation level. Unfortunately this approach yields a $O(T^{2/3})$ regret bound. As always, we have a computationally inefficient way of getting to $O(\sqrt{T})$ regret by using Exp4. A lot of hours of thinking from me and some other smart people has definitely convinced me that this is a hard problem, much like the other scenarios where improving $O(T^{2/3})$ to $O(\sqrt{T})$ has proved to be hard.

Apr 14, 2010

MLcomp

Percy Liang and I have been developing a website, http://MLcomp.org, which many readers should find pretty interesting. In short, MLcomp is "for objectively comparing machine learning programs across various datasets", and we're launching it this week. Soon we'll be posting the brief description below on John Langford's blog, but it's up and running now so check it out!

------

Much of the success and popularity of machine learning has been driven by its practical impact. Of course, the evaluation of empirical work is an integral part of the field. But are the existing mechanisms for evaluating algorithms and comparing results good enough? We (Percy and Jake) believe there are currently a number of shortcomings:
  1. Incomplete Disclosure: You read a paper that proposes Algorithm A which is shown to outperform SVMs on two datasets. Great. But what about on other datasets? How sensitive is this result? What about compute time - does the algorithm take two seconds on a laptop or two weeks on a 100-node cluster?
  2. Lack of Standardization: Algorithm A beats Algorithm B on one version of a dataset. Algorithm B beats Algorithm A on another version yet uses slightly different preprocessing. Though doing a head-on comparison would be ideal, it would be tedious since the programs probably use different dataset formats and have a large array of options. And what if we wanted to compare on more than just one dataset and two algorithms?
  3. Incomplete View of State-of-the-Art: Basic question: What's the best algorithm for your favorite dataset? To find out, you could simply plow through fifty papers, get code from any author willing to reply, and reimplement the rest. Easy right? Well maybe not...
We've thought a lot about how to solve these problems. Today, we're launching a new website, MLcomp.org, which we think is a good first step.

What is MLcomp? In short, it's a collaborative website for objectively comparing machine learning programs across various datasets. On the website, a user can do any combination of the following:
  1. Upload a program to our online repository.
  2. Upload a dataset.
  3. Run any user's program on any user's dataset. (MLcomp provides the computation for free using Amazon's EC2.)
  4. For any executed run, view the results (various error metrics and time/memory usage statistics).
  5. Download any dataset, program, or run for further use.
An important aspect of the site is that it's collaborative: by uploading just one program or dataset, a user taps into the entire network of existing programs and datasets for comparison. While data and code repositories do exist (e.g., UCI, mloss.org), MLcomp is unique in that data and code interact to produce analyzable results.

MLcomp is under active development. Currently, seven machine learn task types (classification, regression, collaborative filtering, sequence tagging, etc.) are supported, with hundreds of standard programs and datasets already online. We encourage you to browse the site and hopefully contribute more! Please send comments and feedback to mlcomp.support (AT) gmail.com.

Apr 6, 2010

Publish your work in journals... so it will live forever?

Lance Fortnow believes a major reason to publish papers in journals, rather than just letting them sit on the Arxiv, is because otherwise *gasp* they might someday get lost! An example he gives:
I wanted to read a copy of Karp's NP-completeness paper which only appeared in the proceedings of a one-shot workshop in 1972. I ended up going to the library to dig it up. But library books get lost and many young people today don't even know where the library is. Later I found out Luca had scanned the paper for a course he taught. But how long will Luca's Berkeley pages last and what about all the papers that don't lead to Turing awards.

So publish your papers, best in a journal as well as a conference. Even if you don't think it matters for you in the short run, it can make a big difference for the community long into the future. What good is pushing the boundaries of science if those boundaries get snapped back because work gets lost.

I have heard this concern raised more than once, and you can say I'm a tad skeptical. I will point out, however, that this was brought up in each case by someone who finished their PhD when the word "Google" was spelled "Googol" (and without all the silly colors).

When I read papers now, I download a first copy which is stored in my downloads folder. When I want to look at it again, I download another copy because it's actually faster than finding the first. When I want to cite it correctly, I download it again. Each of these copies gets stored on a number of incremental backups. Let's say I had some reason to destroy all evidence of having ever read this paper, well I'd have a pretty difficult time of this.

I checked the Arxiv, and there are nearly 600,000 total papers stored there. If we conservatively estimated that the average paper is less than 1.5MB, then I could store the entire Arxiv library on my personal machine. So how many copies must be stored on Google, Yahoo, Bing, etc.? And after a year of being posted on the Arxiv, a typical paper might reside on 10,000 different hard drives, if I had to make a conservative estimate.

The fact is, we're in a world where tiny bundles of data (i.e. 1.5MB pdf files) cost effectively nothing to store and, in the aggregate, are very valuable to index and make available. Indeed, a main factor that will hinder the indexing and availability of a particular journal paper is the existence of a paywall. Compared to such pay-to-play journals, for long-term existence I'd trust my research at the Arxiv any day of the week.

There plenty of good reasons to upgrade an Arxiv paper to a journal paper, but is one of these reasons to assure it won't get lost? No way.

(Bonus Prize: Yahoo's homepage in 1996 and Lance's homepage in 1997, does it bring back memories?)