## Sep 1, 2011

### NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning

Posted by
Alekh Agarwal

Sasha Rakhlin are putting together a workshop at this NIPS on Computational Trade-offs in Statistical Learning. The goal of the workshop is to foster a better understanding of how computational constraints and affect the limits of achievable performance in statistical learning problems. We are inviting submissions to this workshop no later than October 17, with submission details here.

## May 30, 2011

### Calibration and Game Playing

Posted by
Jake Abernethy

I've been at the Technion for the past month visiting Shie Mannor. We've been discussing the notion of calibrate forecasting, which I have written about in a previous post. To briefly recall, we imagine a forecaster making a series of probability forecasts $p_1, p_2, p_3, \ldots \in \Delta_n$ from the set of distributions on, say, $n$ outcomes. After predicting $p_t$ the forecaster then observes $y_t \in [n]$, say the true outcome of the weather. The long term goal of the forecaster is to be "correct" in a kind of average sense. That is, if we consider all the times the forecaster says (roughly) 20% chance of rain, we should expect the rain to occur roughly 2 times out of 10 on these particular days. This is essentially what it means to be "calibrated" in an asymptotic sense. For more details see the older post.

There is much to dislike about this notion of "goodness" of a forecaster. As I said previously:

Despite the natural way to pose the concept of calibration, it is actually quite a weak notion of "goodness" of a forecaster. Let's take the weather example again, and imagine a weather forecaster who, despite the available data and the meteorological conditions, always predicts the base probability of rain. If the forecaster happens to know in advance that it rains about 13.2% of the time, then to remain calibrated she might as well predict this value every day. Hence, a well-calibrated but entirely useless forecaster.

Since being here in Israel I've had a change of heart -- Shie taught me a trick about calibration that rekindled my interest. He says it's "well-known" but it is new to me and so I'd like to share it.

The trick is this: you can use calibration as an optimal way to play repeated zero-sum games. Let's imagine we have a zero sum game with payoff matrix $M \in \mathbb{R}^{n \times m}$, where on each round $t$ player 1 choses a distribution $x_t \in \Delta_n$ over his actions, and player 2 then chooses an action $y_t \in \Delta_m$ (we can actually assume p2 plays deterministically). If the game is played repeatedly then the average payoff after $T$ rounds is $\frac 1 T \sum_{t=1}^T x_t M y_t$.

So let's assume player 1 has a calibrated forecasting algorithm for "guessing" p2's strategy, so let $y^?_t$ be the (randomized) guess of p2's action $y_t$. Then player 1 can "best respond" to this guess, that is by letting $x_t := \arg\max_{x \in \Delta_n} x M y^?_t$. Then under the calibrated forecasting assumption, this strategy is "optimal" in the limit, in that it earns at least the value of this zero-sum game. Why? While p1 might be guessing $y_t$ poorly, we know that because he is calibrated he's right "on average". So while he might be best responding to a bad forecast on each round, he is providing the best response "on average".

They key piece here, of course, is that we can achieve calibrated forecasting even when the outcome sequence is chosen by an adversary. For more details, see this paper by Sham Kakade and Dean Foster.

## Apr 21, 2011

### Inferring beliefs from prediction market data

Posted by
Jake Abernethy

Rajiv Sethi put up a nice post on his blog, asking to what extent we can infer the nature of traders' beliefs about a particular event from the transactions that occur in a prediction market. (Related: my earlier post on prediction markets) In responding to a post by Nate Silver at fivethirtyeight, who used data from prediction markets to discuss the electoral outcomes, Rajiv says:

Rajiv goes on to post an example order book (for the US presidential election outcome for 2012), and he discusses some conclusions that could be drawn from this data. In a followup post, Rajiv notes:

I'd like to make a couple of comments about this line of reasoning. First, I disagree with Rajiv that price of the last trade is a poor indicator of the consensus probability. The statement that the last trade "tells us nothing about the beliefs of traders who are not party to this transaction" is simply not true: the exchange mechanism was designed was to match orders whose desired prices overlap, and up until the last trade we can assume that the mechanism did its job. So the two parties to this transaction were not arbitrary -- each party represents the "marginal" trader on their respective side of the bet. The transaction does not just represent the beliefs of these two individuals, it also tells us that there was a group of less-confident traders waiting behind whose trades were not matched. This is very important information.

This interpretation of prices as probabilities is common and will be repeated frequently over the coming months. But what could the "perceived likelihood according to the market" possibly mean?

Markets don't have perceptions. Traders do, but there is considerable heterogeneity in trader beliefs at any point in time. Prediction market prices contain valuable information about this distribution of beliefs, but there is no basis for the common presumption that the price at last trade represents the beliefs of a hypothetical average trader in any meaningful sense. In fact, to make full use of market data to make inferences about the distribution of beliefs, one needs to look beyond the price at last trade and examine the entire order book...

All that the price at last trade can tell us about is the beliefs of the two parties to this transaction. If both are risk-averse or risk-neutral, they each must believe that entering their respective positions will yield a positive expected return. Hence the buyer must assign probability at least 62.5% to the event that the Democrat is elected, while the seller assigns a likelihood of at most 62.5% to this event.

This tells us nothing about the beliefs of traders who are not party to this transaction.

Rajiv goes on to post an example order book (for the US presidential election outcome for 2012), and he discusses some conclusions that could be drawn from this data. In a followup post, Rajiv notes:

In contrast, the order book, which is the collection of all unexpired bids and offers that cannot currently be matched against each other, contains a wealth of information about the distribution of trader beliefs. Under certain assumptions about the risk preferences of market participants, one can deduce a distribution of trader beliefs from this collection of standing orders. The imputed distribution may then be used to infer what the average trader (in a well-defined sense) perceives the likelihood of the underlying event to be. Furthermore, it can be used to gauge the extent of disagreement about this likelihood within the trading population.

(

**Update:**After emailing a link to this post to Rajiv, he responded and wanted to correct the record:I did not say that the price at last trade "is a poor indicator of the consensus probability," only that "there is no basis for the common presumption that the price at last trade represents the beliefs of a hypothetical average trader in any meaningful sense." It's an important difference. My follow up post shows that the price can be a pretty good approximation if the order book is reasonably symmetric.

Thanks, I stand corrected!)

Second, while the order book provides "promises" by other traders to take a side of the bet, these promises are very weak. For example, when a trader posts an order in the order book, this trader is exposed to the risk that news is released that moves the presumed likelihood of the outcome, and other traders with fast reactions can buy up these still-cheap orders. Hence, if we interpret posted orders as statements about beliefs, then these estimates will be biased according to this potential risk. Rajiv makes this point as well.

It should be noted also that traders can manipulate the order book maliciously. Since posting orders is free, a trader can lay down a large buy order and then, perhaps via a trading bot, pull the order after any change in the price. Depending on Intrade's mechanism, say if the company were to announce the clearing of every individual contract, this would allow traders to post orders in an entirely risk-free manner. (I don't know the precise details of the mechanism, but clearly some risk is required if the market can clear a large order all at once).

(During the health care debate, the market for contracts "Will Obamacare pass?" was likely manipulated, as you can see here. In this case, an individual trader made a huge bet against passage, moving the price more than 40 points. The market quickly recovered.)

This also gets at something that's bothered me for a long time: why is it that the "typical" mechanism for trading such betting contracts is to run an exchange with a posted order book? Despite the fact that this is how the stock market operates, I see no reason that this should be the gold standard. Recent research in prediction markets has led to the design of

*automated market makers*which possess some very nice properties. In particular, they don't involve an order book, and instead require the market maker to simply post the current prices -- the bets placed by traders will move the prices according to an advertised "potential function." Under simple conditions, it can be shown that the market maker has a bounded risk to operate this market, and that this risk scales according to the amount of "liquidty" it provides. For such a market, the liquidity is essentially the speed with which the prices move according to the size of a given trade.With an automated market maker of this type, much of Rajiv's concerns, i.e. those regarding how to translate between trades/orders and beliefs, are no longer relevant. The prices posted by the market maker contain all the relevant information about the status of the contracts.

(Interested individuals can have a look at my upcoming EC paper, with Jenn Wortman Vaughan and Yiling Chen, on designing automated market makers for combinatorial events spaces)

## Apr 1, 2011

### COLT call for open problems

Posted by
Jake Abernethy

I've been informed that the COLT PC this year is looking for more submissions for Open Problems which will appear at this year's conference. The deadline has been extended to May 11, more information is here (the extension is correct, despite the date the website currently states). Personally, I'm a fan of the open problems session -- I've coauthored 1,2,3 of them -- as it provides a venue for suggesting future research directions that does not require having a fully-baked result.

Please consider submitting something!

Call for open problems:

COLT 2011 will include an Open Problems session. A description of the problems posed will also appear on the conference web site (and in the electronic proceedings).

We encourage everyone to propose and present a clearly defined open research problem (or several closely related open problems), relevant to COLT. Authors should carefully prepare a brief writeup including a clear, self-contained description of the problem, their motivation for studying it, and the current state of this problem (including any known partial or conjectured solutions and relevant references). As last year, we encourage submissions of problems not conventionally in the scope of COLT, as long as there is a convincing reason to include it in the conference. You should be able to clearly express the problem in a 5-10 minute talk. A reward for solving an open problem is encouraged but not required.

If you would like to submit a problem, please email it to colt2011open@gmail.com by~~March 27, 2011~~May 11, 2011. Submissions should be at most 2 pages long and be in the COLT format.

## Mar 4, 2011

### SEO is very sophisticated (and.... we're back!)

Posted by
Jake Abernethy

Sorry all for the 2-month posting hiatus. I've been travelling and job searching for much of the past 10 weeks, so decided to put the blog on hold for a bit. Posting shall now recommence, with planned contributions from other authors (if I can convince them).

So I'm both amazed and a bit confused. The site in question advertises online colleges and universities, and it's clearly (mildly) spammy. I've dashed out all of the identifying information in order to avoid giving them what they are fishing for, which is a boost in their search engine rank. (If you'd like to check out the site, it's http colon slash slash onlinecolleges dot net)

To get started, I want to mention a new phenomenon that I've begun to see in my inbox. Every few weeks I get an email that looks like this:

Hi Jake,

My name is -----------, and I'm writing in regards to your site. I'm currently working with -----------.net, and we recently published an article entitled "10 Most --------- ----------- Computer Labs In the Country". You can review the article here: (http://-------------/computer-labs-in-the-country/)

If you find that this resource would be of interest to your audience, please feel free to share it with them at your discretion.

Either way, I'm glad to have come across your blog. If there's anything else on our site that interests you, please feel free to let me know. Thanks again for the great content!

Sincerely,

----------- -----

This is quite plainly the practice of Search Engine Optimization (SEO), known to be a rising trend on the web. The New York Times recently published two articles about SEO, and Slate also put out an entertaining piece. The main idea is that web traffic is driven by search queries, and for a large number of search terms it's relatively easy to game the ranking algorithm of Google, Yahoo, etc. -- as long as one can manage to get several links to the desired site embedded in enough relevant pages scattered around the web. So let us observe: the blog you are currently reading contains the word "computer", "online" and "learning" many many times -- hence it's a prime target boosting one's ranking for queries such as "online university".

So what about this surprises me? Up until now I was under the impression that SEO manifests itself like all other spam we see. A company hoping to boost the ranking of site X pays an SEO firm to spread links to X around the web via content farms, unmoderated blog comments, etc. -- but all of these practices can be done via scripting and web crawling, i.e. it's all

*automated*. I've actually done some work on detecting such "web spam", i.e. hosts and pages generated solely for the purpose of optimizing search rankings. When this research was published nearly 4 years ago I and my coauthors emphasized a key observation: "good" sites rarely link to "spammy" sites. This makes detecting spammy sites relatively easy, since they don't get a lot in-links from trusted sites. That is, sites where SEO companies can easily embed links can be quickly discovered and tagged as untrustworthy sources for ranking.But what we have here is different. It appears that an actual individual, a

*human being*, looked at this blog, determined my email address and emailed me to request a link. The hope here, of course, is that I'd be so flattered by this personal request that I might actually advertise their site. This suggests a rather different economic model from the usual approach to spam -- indeed, should we even think of this as spam? It also hints at the value of an endorsement from a trusted site, as the individual who contacted me had to invest nontrivial human time to find the blog and make the request, with a likely low chance of having me bite. Another email I received 2 months ago was even more explicit:One more thing, are you interested in receiving a payment for adding a link on one of your pages? The link will go to an education site.

My budget is $100 and can pay via PayPal. Let me know if you are interested and I can follow up with an email or a phone call.

(I ignored the request, although for $250 I might have considered it...)

This new approach, actually paying for endorsements, will likely prove very challenging for search engines in the future. Classifying spammy links is easy when they are all coming from a small set of nefarious pages/hosts. It looks like SEO has gotten quite a bit more sophisticated, however, as it now manages to get links and endorsements embedded in otherwise genuine content.

Do any of you out there have similar stories about link requests?

## Dec 26, 2010

### David Blackwell, in video

Posted by
Jake Abernethy

I've written several posts on the life and work on David Blackwell, who passed away this summer. As mentioned, I'm quite disappointed to have never met him since he had been an emeritus professor at UC Berkeley throughout my time at the school. So I was happy to find, on YouTube, a series of clips from an interview with Blackwell that was uploaded by the National Visionary Leadership Project 7 months ago. Most of the clips have been watched... fewer than 100 times (!), a shockingly low number for such a prominent figure.

Here are a few of the clips below, and more can be found on the site above. It's quite moving to see Blackwell get emotional about, among other things, the racial factors involved in his delayed appointment at Berkeley.

## Dec 16, 2010

### Robust statistics

Posted by
Daniel

I attended the Robust Statistical Learning workshop at NIPS this year, and it reminded me of an interesting paper of Olivier Catoni (pointed out on John Langford's blog back in March). There, Olivier shows (among other things) how to estimate the mean of a random variable $X$ from an independent sample $X_1,\ldots,X_n$ so that the deviations of the estimate are controlled only by the variance of $X$---all other moments beyond the first two need not even be finite! Such estimators can be useful when dealing with data contaminated with outliers and other such annoyances.

Roughly speaking, Olivier suggests an estimator $\theta(X_1,\ldots,X_n)$ with the property

\[ |\theta(X_1,\ldots,X_n) - \mathbb{E}(X)| \leq C \sqrt{\frac{\operatorname{var}(X)}{n} \cdot \log(1/\delta)} \]

with probability $\geq 1 - \delta$, for some constant $C > 0$ (the probability of the deviation bound not holding, $\delta$, can be thought of as a level of confidence). Such a bound is what one can expect when using the empirical mean of standard normal random variables, but the empirical mean is generally not so well-behaved with other distributions. The proposed estimator uses a form of soft-truncation, which is similar to some standard methods in robust statistics. This made me wonder (at the time) if a similar result could be proved using more elementary techniques; indeed, such is the case, and since I promised Jake to write something for this blog, I'll share this simple exercise with you now.

As everyone knows, Chebyshev's inequality provides a deviation bound for the empirical average depending only on the variance of $X$; however, the bound has a poor dependence on the level of confidence ($1/\delta$ as opposed to $\log(1/\delta)$):

\[

\left| \frac 1 n \sum_{i=1}^n X_i - \mathbb{E}(X) \right| \leq \sqrt{\frac{\operatorname{var}(X)}{n} \cdot \frac 1 {\delta}}

\]

with probability $\geq 1 - \delta$. This is not so bad if $\delta$ is only moderately small (e.g., $0.05$ or $0.01$), but if the bound is repeatedly applied many times, the failure probabilities could start to add up!

A simple fix to this problem is to exploit the independence of the sample; one easy way is to form several empirical averages (say, $k$ of them, each from $n / k$ of the sample points), and then take the median of these averages. If $\hat{\mu}_i$ is the $i$th such empirical average, then by Chebyshev's inequality (as above, with $\delta = 1/6$),

\[ \left| \hat{\mu}_i - \mathbb{E}(X) \right| \leq \sqrt{\frac{\operatorname{var}(X)}{n/k} \cdot \frac1{1/6}} \]

with probability $\geq 5/6$. Because $\hat{\mu}_1,\ldots,\hat{\mu}_k$ are independent, a simple Chernoff bound then tells us that, with probability $\geq 1 - \exp(-k/4.5)$, at least half of the above deviation bounds will hold (the constant $4.5$ can certainly be improved). Therefore, the median of $\hat{\mu}_1,\ldots,\hat{\mu}_k$ will also satisfy the deviation bound (again, with probability at least $1-\exp(-k/4.5)$). Now choosing $k := 4.5 \log(1/\delta)$ gives

\[ \left| \operatorname{median}(\hat{\mu}_1,\ldots,\hat{\mu}_k) - \mathbb{E}(X) \right| \leq \sqrt{\frac{27 \operatorname{var}(X)}{n} \cdot \log(1/\delta)} \]

with probability $\geq 1-\delta$, which is the type of guarantee we were hoping for.

I later learned that this trick was previously suggested by Nemirovski and Yudin (in fact, with a better estimator and tighter analysis), and is also commonly used in streaming algorithms; but perhaps this is still new to some in the machine learning community. Going forward, it would be interesting to see if similarly elementary techniques be used to create robust estimators for other statistical quantities of interest.

Roughly speaking, Olivier suggests an estimator $\theta(X_1,\ldots,X_n)$ with the property

\[ |\theta(X_1,\ldots,X_n) - \mathbb{E}(X)| \leq C \sqrt{\frac{\operatorname{var}(X)}{n} \cdot \log(1/\delta)} \]

with probability $\geq 1 - \delta$, for some constant $C > 0$ (the probability of the deviation bound not holding, $\delta$, can be thought of as a level of confidence). Such a bound is what one can expect when using the empirical mean of standard normal random variables, but the empirical mean is generally not so well-behaved with other distributions. The proposed estimator uses a form of soft-truncation, which is similar to some standard methods in robust statistics. This made me wonder (at the time) if a similar result could be proved using more elementary techniques; indeed, such is the case, and since I promised Jake to write something for this blog, I'll share this simple exercise with you now.

As everyone knows, Chebyshev's inequality provides a deviation bound for the empirical average depending only on the variance of $X$; however, the bound has a poor dependence on the level of confidence ($1/\delta$ as opposed to $\log(1/\delta)$):

\[

\left| \frac 1 n \sum_{i=1}^n X_i - \mathbb{E}(X) \right| \leq \sqrt{\frac{\operatorname{var}(X)}{n} \cdot \frac 1 {\delta}}

\]

with probability $\geq 1 - \delta$. This is not so bad if $\delta$ is only moderately small (e.g., $0.05$ or $0.01$), but if the bound is repeatedly applied many times, the failure probabilities could start to add up!

A simple fix to this problem is to exploit the independence of the sample; one easy way is to form several empirical averages (say, $k$ of them, each from $n / k$ of the sample points), and then take the median of these averages. If $\hat{\mu}_i$ is the $i$th such empirical average, then by Chebyshev's inequality (as above, with $\delta = 1/6$),

\[ \left| \hat{\mu}_i - \mathbb{E}(X) \right| \leq \sqrt{\frac{\operatorname{var}(X)}{n/k} \cdot \frac1{1/6}} \]

with probability $\geq 5/6$. Because $\hat{\mu}_1,\ldots,\hat{\mu}_k$ are independent, a simple Chernoff bound then tells us that, with probability $\geq 1 - \exp(-k/4.5)$, at least half of the above deviation bounds will hold (the constant $4.5$ can certainly be improved). Therefore, the median of $\hat{\mu}_1,\ldots,\hat{\mu}_k$ will also satisfy the deviation bound (again, with probability at least $1-\exp(-k/4.5)$). Now choosing $k := 4.5 \log(1/\delta)$ gives

\[ \left| \operatorname{median}(\hat{\mu}_1,\ldots,\hat{\mu}_k) - \mathbb{E}(X) \right| \leq \sqrt{\frac{27 \operatorname{var}(X)}{n} \cdot \log(1/\delta)} \]

with probability $\geq 1-\delta$, which is the type of guarantee we were hoping for.

I later learned that this trick was previously suggested by Nemirovski and Yudin (in fact, with a better estimator and tighter analysis), and is also commonly used in streaming algorithms; but perhaps this is still new to some in the machine learning community. Going forward, it would be interesting to see if similarly elementary techniques be used to create robust estimators for other statistical quantities of interest.

## Dec 14, 2010

### Online Dating + Learning

Posted by
Jake Abernethy

I just returned from NIPS 2010, the final year the conference will be held in Vancouver, and I think this was my best experience at the conference so far. But maybe I'm just happy to be done with the Hyatt for a few years.

While in Vancouver, I had an excellent chat with Paul Mineiro (he has a nice blog) who works for eHarmony, the online dating site. This follows another chat I had in August with Tom Quisel who works at okCupid. Both of these guys had really really interesting things to tell me.

For starters, I highly recommend the okTrends blog, published by some data scientists at okCupid. These folks take the treasure trove of data they have access to, and provide analysis on the aggregate. For example, they can tell you what properties of your profile photo lead to greater or fewer messages from other users. It is perhaps surprising that this doesn't present major legal troubles due to the privacy issues, although the fact that they are using "averages" mostly means that individual user information can not be inferred. (When they include photos, for example, they are ask permission.)

Online dating appears to be a growing very quickly; the Boston Globe recently reported that "22 percent of heterosexual couples surveyed met online." And while people tend to believe that Facebook and Google have a vast amount of information on their users, these dating sites know

*way*more about you. Paul mentioned that users of eHarmony fill out a survey that typically takes 4 hours to complete, and that's even before they've invested in the site. And Tom mentioned that many users are investing hours per day on the site.If you play with any of these sites, as I have, you'll encounter a vast number of interesting learning problems. And whereas data labeling and annotation is usually quite expensive, users of dating sites are extremely happy to answer questions and give more info, so long as it leads to better matches. (Users give away lots of information just by the profiles they browse as well as the messages they respond to or ignore.) While dating might seem like a simple problem of ranking the "most desired partners" for a given user, there is actually a challenging allocation problem: most people just want to be shown the "hotties" even though this isn't efficient. And there are learning/decision problems that I would never have even thought about. Paul mentioned that eHarmony actually uses the survey data to do price discrimination, i.e. by setting the plan prices based on the users demographics. This presents a nice online bandit problem, since the user provides only an "accept/reject" response to the offer, rather than "no but I'll pay X".

I think problems in online dating are ripe for machine learning researchers, and I hope this becomes a big application in the future. And think of the benefits to the field: when people ask you "what do you work on?", you can say "I help people find love" rather than "I teach computers to distinguish between handwritten digits."

## Nov 19, 2010

### Universal Portfolio Selection

Posted by
Jake Abernethy

Back in the old days of 1991, long before there was significant research in non-stochastic learning/decision-making models, Thomas Cover, the guy who wrote the book on Information Theory, published a pretty major online learning result. Cover used the term "universal portfolio strategy" to refer to an algorithm for sequentially adjusting a portfolio with the goal of doing as well as the "best" fixed portfolio in hindsight. This is an ambitious goal, but it so happens that this is achievable with a rather simple algorithm. It's simple enough that I can sketch the approach here in a blog post.

Assume we have $n$ stocks (securities) in which we may invest, and on each trading period $t$ we receive a vector of "price relatives" $x_t$, that is the ratios of the price on the previous period to the current period. So if $x_t(i) = 2$ then this implies that the price of stock $i$ doubled on period $t$. Cover considered the notion of a "constant rebalanced portfolio" (CRP) for a distribution $w \in \Delta_n$, which is the investment strategy that ensures that the total money invested is distributed according to exactly $w$ at the beginning of each trading period. This is similar to a basic mutual fund strategy. Because stock prices change between rounds, the wealth distribution changes also and thus rebalancing (i.e. buying and selling shares) is required, hence the name CRP. Given a sequence of price relatives $x_1, \ldots, x_T$, we can compute the multiplicative return of a given CRP $w$ by

$$\text{Mult. wealth ratio of } w \quad := \quad \prod_{t=1}^T (w \cdot x_t)$$

We will generally use the log of this quantity. Of course, if we had the price relatives in advance, we could compute the best log wealth ratio in hindsight:

$$\text{Log-return of opt CRP} := \sup_{w \in \Delta_n} \sum_{t=1}^T \log(w \cdot x_t).$$

We'll use the notation $w^*$ to refer to the optimal $w$ in hindsight.

The algorithm Cover proposed is stunningly simple: invest your money uniformly in

*every*CRP, and rebalance within each CRP accordingly. That is, for every $w \in \Delta_n$, invest an infinitesimal amount in $w$, and continue rebalance this infinitesimal amount to maintain the $w$ distribution. Actually maintaining this infinite array of portfolios is indeed computationally difficult, as it requires a complex integral (as noted by Cover). There's been much followup work on this issue, and I may try to get back to this in a later post.Proving a performance bound for this algorithm is also pretty straightforward, and it's worth noting that it differs from the usual online learning regret bounds. To start, we'll need the notion of an $\epsilon$-optimal CRP: $w$ is $\epsilon$-optimal as long as $w$ contains a $1-\epsilon$ fraction of $w^*$. (Observe: this is not a symmetric relationship). We make two observations. First, the set of $\epsilon$-optimal CRPs contains at least a $\epsilon^{n}$ fraction of our initial wealth. This is easy to see, just notice that the set of $\epsilon$-optimal CRPs can be viewed as a $\epsilon$-sized simplex around $w^*$. Since we invested uniformly over all CRPs, our initial wealth in this set is exactly the volume of this mini-simplex. Second, on any given round the wealth ratio in every $\epsilon$-optimal CRP is at least that of $w^*$, times a discount factor of $1-\epsilon$. This follows by the definition of $\epsilon$-optimal.

With these two facts in mind, we can lower bound the total wealth earned just in this set of $\epsilon$-optimal CRPs. As we have $\epsilon^{n}$ initially invested in this set, and we lose at most $1-\epsilon$ relative to $w^*$ on each round, we have

$$\text{Wealth of alg.} \geq \epsilon^{n}(1-\epsilon)^T \text{Wealth}(w^*)$$

Of course $\epsilon$ is just a parameter of the analysis, so why not set it to $1/T$. Taking the log wealth, and noticing that $(1-T^{-1})^T \to e$ as $T \to \infty$, we have

$$\text{Log wealth of alg} \geq \text{Log-wealth}(w^*) - n \log T - o(1)$$

Now for a big caveat. While this log-T regret bound looks pretty good, it actually says that our wealth will be

*polynomially-worse*in $T$ than the optimal portfolio, and the exponent here is $n$. Given roughly 250 trading days per year, it's doubtful that an investor is willing to gain $250^n$ less money than the best portfolio -- this feels like a pretty poor bound in practice. On the other hand, the wealth of the best CRP should grow exponentially in $T$ (assuming the stock market continues to increase in value), and hence in the long run this ought to be a good strategy. Also, folks who have run experiments using this approach have reported pretty good results. At the same time, I don't think it's made anyone rich.This has become a long post, so I'll save a discussion of the various followup results for another day.

## Nov 7, 2010

### Regret-minimization and Option Pricing

Posted by
Jake Abernethy

One of the most basic financial derivative contracts is the

*option*, which gives the purchaser of the contract the right to buy (or sell) a given security at a predetermined price at some future (usually fixed) date. In essence, an option contract is an insurance policy on the fluctuation on the price, and allows traders to bet on or against the volatility of the particular security. The so-called "European" call option for a stock S works like this: when party X purchases a call option from party Y (for a*premium*) at a*strike price*K with*expiration date*D, party X may*exercise*the option on date D if desired, and shall receive S from party Y at the price of K regardless of the market value of S. Of course, if the market value Value(S;D) is less than K, trader X will clearly not exercise the option, and hence the return from this investment is max(0, Value(S;D) - K).I've always found the concept of an option pretty nice: it's a way to pay a small fee, i.e. the premium, to have a probability-one guarantee on one's risk. This always felt very much like a regret bound, where an algorithm can have a precise loss guarantee for any future sequence of cost functions.

So I was quite glad to happen upon a paper by Peter DeMarzo, Ilan Kremer, and Yishay Mansour called Online Trading Algorithms and Robust Option Pricing which made use of our understanding of no-regret learning for the problem of pricing European call options (those mentioned above). The idea behind the paper is this: rather than pay a premium of P for a stock option with strike price K and expiration date D, why not instead use P towards a strategy to trade the underlying stock? There may exist some risky strategy that does well when the stock price rises above K, but does very poorly otherwise. In particular, imagine that we have a strategy with the guarantee that, given starting capital P, we always earn Value(S;D) - K whenever the Value(S;D) > K. Then we should have no reason to purchase a call option on this stock for a premium P, as we can achieve the same promise that the option provides using our own internal trading strategy. Hence, without even using our learning algorithm, we have a necessary upper bound on the price of this option.

The authors actually provide a trading strategy to achieve this guarantee which is essentially a modified "Hedge Algorithm". There are some caveats here too, such as a fixed a prior bound on the daily volatility of the underlying stock (which is probably necessary) and they do not worry about the unavoidable transaction costs associated with buying/selling stock. But the central idea is really interesting, and I hope to see more work along these lines.

## Oct 17, 2010

### Partha Niyogi has passed away

Posted by
Jake Abernethy

Having just returned from Australia and New Zealand after attending ALT2010, I am extremely sad to learn that Partha Niyogi has just died after a battle with brain cancer at the age of 43.

I spent two years at TTI Chicago, which is located at the University of Chicago where Partha was a professor for many years. It would be very easy to laud his excellent and extensive research record but instead I'd like to praise Partha the man.

While we never had a formal research project going, I must have taken part in at least 10 long visits to his office during my time there, and each time I walked out feeling great. Chatting with Partha was invariably fun, and Partha knew how to

*listen*. He had a way for making you feel comfortable, and he was never judgmental. And he possessed a keen and witty sense of humor, he could shoot the breeze about any topic. He was also very open and had no difficulty discussing his life outside of academia. Partha was just incredibly easy to like.But what I think I'll remember most about Partha is his ability to

*inspire*. I found myself fully engaged every time he gave a talk. He made research feel exciting, and he showed passion about his work. This is an exceptional gift.There are some nice comments about Partha over at Geomblog as well as at Computational Complexity.

## Sep 21, 2010

### Prediction and the Meaning of Probability

Posted by
Jake Abernethy

Early in Bruno de Finetti's book "Theory of Probability" (1974), he makes the intriguing statement that "probability does not exist". This is a philosophical distinction, but it gets at the heart of how he viewed the concept (or existence) of randomness. In the words of Robert Nau, paraphrasing de Finetti, "Probability exists only subjectively within the minds of individuals."

~~Imagine, on each round $t$, a forecaster providing a prediction $p_t$ followed by Nature providing an outcome $z_t$. Assume also that on each round the forecaster will put her money where her mouth is, and offer to buy OR sell a betting contract to Nature, one that pays off 1USD at the price $p_t$ in the event that $z_t = 1$. This seems pretty bad, of course, since Nature is the one choosing $z_t$! Not so -- we will restrict Nature and require that she must commit to a fixed buy/sell function of the following form: buy the contract whenever the price is below some known threshold $\alpha$, and otherwise sell. (Indeed, $\alpha$ can even be chosen in hindsight).~~

~~With this setup in mind, the statement "the forecaster is calibrated" is equivalent to "the forecaster will, in the long run, lose no money on average to Nature". More generally, we can see this as a notion of "goodness" of the forecaster in terms of a gambling strategy. If I wanted to use this forecaster to make bets on $z_t$, we could imagine an opponent gambler thinking "I believe the true probability is 20%, so I'm always going to buy a contract from you whenever you offer a price less than $ 0.20, and otherwise I'll sell you a contract." If our forecaster is actually calibrated, then we are certain to not lose any money (on average) to this opponent.~~

Indeed, one can define the notion of the "probability of an event" without appealing to some underlying assumption randomness. A probability can instead be considered in terms of

*gambling*, where $P(X)$ is the price one would be willing to buy OR sell some hypothetical contract that paid out exactly 1 dollar if the outcome $X$ occurs. From Wikipedia:This brings me back to my previous post on forecasting. Without an underlying assumption on the nature of the world, for example the i.i.d assumption, it becomes difficult to judge "how good is a forecaster?" In the calibration setting, on each round $t$ a forecaster guesses probability values $p_t$ and nature reveals outcomes $z_t \in \{0,1\}$. Of course, for a single pair $(p_t,z_t)$ we have no way to answer the question "was the forecaster right?" The question becomes even more remote we imagine that $z_t$ is also chosen by a potential adversary.Youmust set the price of a promise to pay $1 if there was life on Mars 1 billion years ago, and $0 if there was not, and tomorrow the answer will be revealed. You know thatyour opponentwill be able to choose either to buy such a promise from you at the price you have set, or require you to buy such a promise from your opponent, still at the same price. In other words: you set the odds, but your opponent decides which side of the bet will be yours. The price you set is the "operational subjective probability" that you assign to the proposition on which you are betting. This price has to obey the probability axioms if you are not to face certain loss, as you would if you set a price above $1 (or a negative price). By considering bets on more than one event de Finetti could justify additivity. Prices, or equivalently odds, that do not expose you to certain loss through aDutch bookare calledcoherent.

So let us now return to the notion of "calibration", which is a measure of the performance of a forecaster, from the previous post. The concept of calibration can be posed something like "the probability predictions roughly match the data frequencies". But while this might seem nice, it doesn't give us a way to judge how "good" the forecaster is, so it's somewhat hard to interpret. On the other hand, if we view this through the lens of de Finetti, using the idea of betting rates, we arrive at what I view is a much more natural interpretation. I'll now give a rough sketch of this idea.

Let's say you're a forecaster and a gambler, and on each round $t$ you predict a probability $p_t$ and also promise to buy or sell a contract, at the price of $p_t$, that pays off 1USD if the outcome $z_t = 1$. But you're worried that someone might come along and realize that your predictions have some inherent bias. (As mentioned in the last post, it has apparently been observed that when weather forecasters said a 50% chance of rain, it only rained 27% of the time!) Let's call an opponent a

*threshold bettor*if he plans to buy (or sell) your contracts whenever the price is above (or below) a fixed value $\alpha$.So if we pose the prediction problem in this way, we can say that a forecaster is calibrated if and only if she loses no money, on average and in the long run, to a threshold bettor. So the forecaster's predictions may not be good, but they are at least robust to gamblers seeking to exploit fine-grained biases.

**Older confusing explanation:**

(This idea came out of a discussion I had today with my advisor Peter Bartlett, who I'd like to thank)

Subscribe to:
Posts (Atom)