# Inner dual methods for structured output learning

One of the big challenges in making structured predictors more practical is that they can be really slow compared to the hack of assuming your variables are independent and using unstructured models.

There’s a line of research that partially sidesteps this expense by avoiding repeated inference during the learning phase. I call it “inner-dual learning,” but it goes by many names.

The basic idea is that we do prediction by solving an optimization problem

$\arg\max_x f(x; \theta)$.

For example, the objective $f$ could be an energy and an entropy, which would make this optimization related to variational marginal inference in probabilistic models, or it could just be a linear objective over a structured output space, which is typical in combinatorial structured prediction. For various reasons, we often solve these inference objectives using a dual version of the inference optimization

$\arg\min_\alpha g(\alpha; \theta)$,

where

$\min_\alpha g(\alpha; \theta) = \max_x f(x; \theta) \quad \text{and} \quad g(\alpha; \theta) \ge f(x; \theta), ~ \forall x, \alpha.$
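To make the bound concrete, here is a toy sketch of one such dual: dual decomposition for a (deliberately trivial) combinatorial maximization. The split of the objective, the variable names, and the step size are all made up for illustration; the point is just that $g(\lambda)$ upper-bounds the primal maximum for *every* $\lambda$ (weak duality), and subgradient steps on $\lambda$ tighten it.

```python
import numpy as np

# Toy dual decomposition: the primal is max_x [theta_a[x] + theta_b[x]]
# over x in {0, 1, 2}. We split the objective into two subproblems that
# each get their own copy of x, coupled by dual variables lam.
theta_a = np.array([1.0, 3.0, 0.5])
theta_b = np.array([2.0, -1.0, 2.5])

primal = np.max(theta_a + theta_b)  # exact combinatorial maximum

def g(lam):
    # Each subproblem is solved independently; lam prices disagreement.
    return np.max(theta_a + lam) + np.max(theta_b - lam)

lam = np.zeros(3)
for _ in range(100):
    # Subgradient of g: difference of the subproblems' argmax indicators.
    xa = int(np.argmax(theta_a + lam))
    xb = int(np.argmax(theta_b - lam))
    lam -= 0.1 * (np.eye(3)[xa] - np.eye(3)[xb])

# Weak duality: the dual value upper-bounds the primal max at every lam.
assert g(lam) >= primal - 1e-9
```

Here the two `max` subproblems play the role of cheap local inference steps, and minimizing over `lam` shrinks the gap between $g(\lambda)$ and the true maximum.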

The learning optimization often takes the form of

$\min_\theta ~ R(\theta) + \max_x f(x; \theta) + \text{something}$

which is a saddle-point problem minimizing over parameters $\theta$ and maximizing over the predictions $x$. The “something” is typically a function of $\theta$ that doesn’t depend on $x$.
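For concreteness, a common instance is the structured SVM (max-margin) objective, where a task loss $\Delta$ is absorbed into the inner maximization and the “something” is the negated score of the true output $x^*$:

$\min_\theta ~ R(\theta) + \max_x \left[ f(x; \theta) + \Delta(x, x^*) \right] - f(x^*; \theta)$.

The loss-augmented term $\Delta(x, x^*)$ rides along inside the max (it can be folded into $f$), while $-f(x^*; \theta)$ depends only on $\theta$.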

The standard way to solve saddle-point-y things like this is to repeatedly

1. solve the inner inference optimization,
2. take the (sub)gradient with respect to the parameters $\theta$, and
3. update the parameters $\theta$ with a (sub)gradient step.
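
This loop can be sketched as follows for a structured SVM with a made-up linear model $f(x; \theta) = \theta \cdot \phi(x)$. The tiny output space, the feature map `phi`, and all the hyperparameters are toys chosen for illustration, not anyone's actual setup.

```python
import numpy as np

# Standard saddle-point learning loop: inference is a subroutine,
# fully re-solved inside every gradient step.
outputs = [(a, b) for a in range(3) for b in range(3)]  # tiny structured space
x_true = (2, 2)

def phi(x):
    # toy features: one-hot encoding of each coordinate plus an agreement bit
    f = np.zeros(7)
    f[x[0]] = 1.0
    f[3 + x[1]] = 1.0
    f[6] = float(x[0] == x[1])
    return f

theta = np.zeros(7)
lr, reg = 0.1, 0.01
for _ in range(200):
    # 1. fully solve the (loss-augmented) inner maximization
    scores = [theta @ phi(x) + float(x != x_true) for x in outputs]
    x_hat = outputs[int(np.argmax(scores))]
    # 2. take the subgradient of the hinge objective w.r.t. theta
    grad = reg * theta + phi(x_hat) - phi(x_true)
    # 3. update the parameters
    theta -= lr * grad

# unaugmented inference with the learned parameters recovers x_true
pred = max(outputs, key=lambda x: theta @ phi(x))
```

Note that step 1 enumerates the whole output space; for real structured problems that inner `argmax` is itself an expensive combinatorial or message-passing computation.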

Having to solve the inference for each gradient step is painful, especially when you’re dealing with complex structured outputs. So the inner-dual idea is to replace that inner maximization with a minimization of its dual. (For real duals of concave functions, this forms an upper bound on the original learning objective!)

$\min_{\theta, \alpha} R(\theta) + g(\alpha; \theta) + \text{something}$.

The problem then becomes a joint minimization over the parameters and the dual variables! Moreover, we often have very fast message-passing style algorithms for solving the dual minimization. That means we can, in a principled way, interleave learning and inference, rather than using one as a subroutine for the other. The learning loop then becomes

1. make a single-pass update to the inner inference dual variables (e.g., pass messages),
2. take the (sub)gradient of the dual objective with respect to the parameters $\theta$, and
3. update the parameters $\theta$ with a (sub)gradient step.
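
Here is a toy sketch of that interleaved loop, reusing a dual-decomposition-style dual for illustration. The model split, the variable names, and the hyperparameters are all assumptions made for the example, not a reference implementation.

```python
import numpy as np

# Inner-dual learning on a toy problem: f(x; theta) = theta[0, x] + theta[1, x]
# for x in {0, 1, 2}, artificially split into two subproblems coupled by dual
# variables lam, so g(lam; theta) = max(theta[0] + lam) + max(theta[1] - lam).
# We jointly minimize over theta and lam, one cheap pass at a time.
rng = np.random.default_rng(1)
theta = rng.normal(size=(2, 3))
lam = np.zeros(3)
x_true = 0
loss = np.array([float(x != x_true) for x in range(3)])  # Hamming-style loss
lr, reg = 0.05, 0.01

for _ in range(300):
    # 1. one subgradient pass on the inner dual variables (cheap "messages")
    xa = int(np.argmax(theta[0] + loss / 2 + lam))
    xb = int(np.argmax(theta[1] + loss / 2 - lam))
    lam -= lr * (np.eye(3)[xa] - np.eye(3)[xb])
    # 2.-3. subgradient step on theta at the *current* dual point
    grad = reg * theta
    grad[0, xa] += 1.0
    grad[1, xb] += 1.0
    grad[0, x_true] -= 1.0
    grad[1, x_true] -= 1.0
    theta -= lr * grad

# decoding with the learned parameters recovers the true output
pred = int(np.argmax(theta[0] + theta[1]))
```

Compared with the standard loop, no iteration ever fully solves inference; the dual variables warm-start from the previous step, which is exactly what makes the interleaving cheap.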