I attended the Robust Statistical Learning workshop at NIPS this year, and it reminded me of an interesting paper of Olivier Catoni (pointed out on John Langford's blog back in March). There, Olivier shows (among other things) how to estimate the mean of a random variable $X$ from an independent sample $X_1,\ldots,X_n$ so that the deviations of the estimate are controlled only by the variance of $X$---all other moments beyond the first two need not even be finite! Such estimators can be useful when dealing with data contaminated with outliers and other such annoyances.
Roughly speaking, Olivier suggests an estimator $\theta(X_1,\ldots,X_n)$ with the property
\[ |\theta(X_1,\ldots,X_n) - \mathbb{E}(X)| \leq C \sqrt{\frac{\operatorname{var}(X)}{n} \cdot \log(1/\delta)} \]
with probability $\geq 1 - \delta$, for some constant $C > 0$ (the probability of the deviation bound not holding, $\delta$, can be thought of as a level of confidence). Such a bound is what one can expect when using the empirical mean of standard normal random variables, but the empirical mean is generally not so well-behaved with other distributions. The proposed estimator uses a form of soft-truncation, which is similar to some standard methods in robust statistics. This made me wonder (at the time) if a similar result could be proved using more elementary techniques; indeed, such is the case, and since I promised Jake to write something for this blog, I'll share this simple exercise with you now.
As everyone knows, Chebyshev's inequality provides a deviation bound for the empirical average depending only on the variance of $X$; however, the bound has a poor dependence on the level of confidence ($1/\delta$ as opposed to $\log(1/\delta)$):
\[
\left| \frac 1 n \sum_{i=1}^n X_i - \mathbb{E}(X) \right| \leq \sqrt{\frac{\operatorname{var}(X)}{n} \cdot \frac 1 {\delta}}
\]
with probability $\geq 1 - \delta$. This is not so bad if $\delta$ is only moderately small (e.g., $0.05$ or $0.01$), but if the bound is repeatedly applied many times, the failure probabilities could start to add up!
A simple fix to this problem is to exploit the independence of the sample; one easy way is to form several empirical averages (say, $k$ of them, each from $n / k$ of the sample points), and then take the median of these averages. If $\hat{\mu}_i$ is the $i$th such empirical average, then by Chebyshev's inequality (as above, with $\delta = 1/6$),
\[ \left| \hat{\mu}_i - \mathbb{E}(X) \right| \leq \sqrt{\frac{\operatorname{var}(X)}{n/k} \cdot \frac1{1/6}} \]
with probability $\geq 5/6$. Because $\hat{\mu}_1,\ldots,\hat{\mu}_k$ are independent, a simple Chernoff bound then tells us that, with probability $\geq 1 - \exp(-k/4.5)$, at least half of the above deviation bounds will hold (the constant $4.5$ can certainly be improved). Therefore, the median of $\hat{\mu}_1,\ldots,\hat{\mu}_k$ will also satisfy the deviation bound (again, with probability at least $1-\exp(-k/4.5)$). Now choosing $k := 4.5 \log(1/\delta)$ gives
\[ \left| \operatorname{median}(\hat{\mu}_1,\ldots,\hat{\mu}_k) - \mathbb{E}(X) \right| \leq \sqrt{\frac{27 \operatorname{var}(X)}{n} \cdot \log(1/\delta)} \]
with probability $\geq 1-\delta$, which is the type of guarantee we were hoping for.
I later learned that this trick was previously suggested by Nemirovski and Yudin (in fact, with a better estimator and tighter analysis), and is also commonly used in streaming algorithms; but perhaps this is still new to some in the machine learning community. Going forward, it would be interesting to see if similarly elementary techniques be used to create robust estimators for other statistical quantities of interest.
Dec 16, 2010
Subscribe to:
Post Comments (Atom)
0 comments:
Post a Comment