Consider a prediction game that involves a player, and an adversary. At round $t$ the player reveals her belief about the outcome of variable $y_t$, before adversary unveil its true value. The player reveals her belief in form of a probability distribution $p_t(y)$. The adversary, then oblivious of the player's current move reveals the value of $y_t$. The player suffers a loss which is $\mbox{-}\ln p(y_t)$. The game is played $n$ times and the goal of the player is to minimize the difference between the cumulative loss and the best single cumulative loss given all the values of $y_t$. The player has access to some experts that give their advice in form of probability distributions over sequences of $y$, i.e. $p(y^n|\theta)$ where $y^n = y_1,y_2,...,y_n$. The goal is to minimize the regret
$
R_n(y^n,\pi) = \mbox{-}ln \hat{p}(y^n|\pi) - \min_{\theta}\left[\mbox{-}ln p\left(y^n|\theta\right)\right]
$
Here $\hat{p}(y^n|\pi)$ is the Bayesian strategy under prior $\pi$. Expanding $\hat{p}(y^n|\pi)$ we get the following:
$
R_n(y^n,\pi) = \max_{\theta} ln \frac{p(y^n|\theta)}{\int_{\theta} \pi(\theta)p(y^n|\pi)\,d\theta}
$
Since $y^n$ can be anything, we try to minimize the regret for the worst $y^n$, i.e. $B_n(\Theta) = \min_{\pi} \max_{y_n} R_n(y^n,\pi)$. Now if we consider other strategies too, the minimax regret is defined in the following way:
$
V_n(\Theta) = \min_{\hat{p}_n} \max_{y^n} \max_{\theta} ln \frac{p(y^n|\theta)}{\hat{p}_n(y^n) }
$
Obviously $V_n(\Theta) \leq B_n(\Theta)$. Bianci showed that the the maximum likelihood distribution normalized (MLN) is the strategy that achieves the minimax bound $V_n(\Theta)$. The MLN strategy is defined in the following way:
$
p(y^n) = \frac{\max_{\theta} p(y^n|\theta)}{\sum_{\hat{y}^n} \max_{\theta} p(\hat{y}^n|\theta)\, d\theta}
$
The most natural question is that does $V_n(\Theta)$ equal $B_n(\Theta)$, and which prior makes this gap vanished.
Limiting ourselves to regular parametric experts, we aim for deriving a Bayesian strategy with a regret close to optimal regret, i.e. to that of the minimax strategy. The Bayesian strategy is as simple as averaging experts under a prior. The hard part is finding a prior that can convert the class of experts to a probability distribution not that different from the maximum likelihood distribution normalized. Clarke and Barron showed in [1] that the when the data is i.i.d. under some smoothness assumption of the log likelihood function, the relative entropy between the true density generating the data and the density of the Bayesian mixture density is $ D\left(p_{\theta_0}^n || M_n\right) = \frac {d} {2} \ln n + c$ where $c$ is a constant independent of $n$. The smoothness assumptions include the positivity of the prior in the neighborhood of $\theta_{0}$, definite positivity of the information matrix at $\theta_{0}$, the twice differentiability of the likelihood and boundedness of the expected value of the first and second derivative of the absolute log-likelihood in the neighborhood of $\theta_{0}$, and finally the asymptotic concentration of the posterior distribution at $\theta_{0}$. If the last condition is not satisfied all the equalities become upper bounds.
$
D\left(p_{\tiny{\theta_0}}^n || \hat{p}^n\right) = E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{\theta_0})}{\int_{ {\Theta}} \pi(\theta)p(y^n|\theta) \, d\theta} \right]
$
Relating the true density $p_{\tiny{\theta_0}}$ to the density of the maximum likelihood $p_{\tiny{ \tilde{\theta} }}$ we get:
$
D\left(p_{\tiny{\theta_0}}^n || \hat{p}^n\right) = E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{ \tilde{\theta} (y^n)})}{\int_{ {\Theta}} \pi(\theta)p(y^n|\theta) \, d\theta} \right] +
E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{ \tilde{\theta} (y^n)})} {p(y^n| \tiny{\theta_0})}\right] $
$
= E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{ \tilde{\theta} (y^n)})}{\int_{ {\Theta}} \pi(\theta)p(y^n|\theta) \, d\theta} \right] - \frac{d} {2} \Rightarrow
$
$
E_{p_{\tiny{\theta_0}}} \left[ \log \frac {p(y^n| \tiny{ \tilde{\theta} (y^n)})}{\int_{ {\Theta}} \pi(\theta)p(y^n|\theta) \, d\theta} \right] = \frac {d} {2} \log n + c + \frac {d} {2} = \frac {d} {2} \log n + \hat{c}
$
$
= \frac {d} {2} \log n + \frac {d} {2} \log \frac {1}{2\pi e} + \frac {1}{2} \log \det I(\theta_0) + \log \frac {1}{\pi(\theta_0)} + \frac {d} {2}
$
This means that in the online setting if the data were generated in an $i.i.d$ way, then the expected value of the regret would be $ \frac {d} {2} \log n + \hat{c}.$
These bounds are for stochastic online encoding. Rissanen showed that for a parametric class of experts with open and bounded parameter space and under some regularity conditions the minimax bound for encoding is
$
V_n(\Theta) = \frac{d}{2} \ln \frac {n}{2\pi} + \ln \int_{\Theta} \sqrt{\mbox{I}(\theta)}\,d\theta + o(1)
$
This bound could be easily obtained by taking the prior in Clarke and Borron to be proportional to the square root of Fisher information.
After this long introduction, let us consider the following two open problems:Is it possible to derive an upper bound for regular parametric families in the adversarial setting, similar to that of the stochastic setting? i.e. $R_n(\Theta) = \frac{d}{2} \ln \frac {n}{2\pi} + \ln \int_{\Theta} \sqrt{\mbox{I}(\theta)}\,d\theta + o(1)$
Second question is that for what regular parametric families does there exist a prior that can achieve the minimax regret, in other words given $n>1$ does there exist a prior that averages the parametric family to give us the maximum likelihood normalized? As an example consider the Bernoulli distribution. We are given one bit at a time, 0 or 1, and our experts are $0\leq\theta \leq 1$. Given $n$ bits, our goal is to find a prior that can convert the advice of the experts to the MLN strategy. We use Frakas' lemma to prove the existence of this prior. Farkas' lemma says that for a $n\times m$ matrix $A$, $Ax = b$ has a non-negative solution i.e. $x \succeq 0$,
if and only if $A^{\mathrm{T}} y \succeq 0 \Rightarrow y^{\mathrm{T}}b \succeq 0 $. We use this lemma by discretizing our $\theta$ into $N$ tiny pieces and solving for a prior,
i.e., non-negative $x$ that makes $Ax$ equals to $b$. Here $b$ is the MLN probability with $b_m$ being equal to the following:
$$
{n \choose m} \left(\frac {m}{n}\right)^m \left(1 - \frac{m}{n}\right)^{(n-m)} / \sum_{m=0}^n{n \choose m} \left(\frac {m}{n}\right)^m \left(1 - \frac{m}{n}\right)^{(n-m)}
$$
And for $A_{m,i}$ we have the following:
$$
A_{m,i} = {n \choose m}\left(\frac {i}{N}\right)^m \left(1 - \frac{i}{N}\right)^{(n-m)}
$$
Pushing $N$ toward infinity we get the following problem, that is equivalent of the existence of a prior that can achieve the minimax bound:
For a given $n>0$, and a set of real-valued variables $y_0, y_1, ..., y_n$ the following polynomial is always non-negative for $0 \leq \theta \leq 1$
$$
\sum_{m=0}^n y_m {n\choose m} \left (\theta\right)^m \left(1 - \theta \right)^{(n-m)} \geq 0
$$
Prove the following
$$
\sum_{m=0}^n y_m {n\choose m} \left (\frac {m} {n}\right)^m \left(1 - \frac{m}{n} \right)^{(n-m)} \geq 0
$$