Inner dual methods for structured output learning

One of the big challenges in making structured predictors more practical is that they can be really slow compared to the hack of assuming your variables are independent unstructured models.

There’s a line of research that partially addresses this expensiveness by avoiding repeated inference during the learning phase. I call it “inner-dual learning,” but it goes by many names.

The basic idea is that we do prediction by solving an optimization problem

\arg\max_x f(x; \theta).

For example, the objective f could be an energy and an entropy, which would make this optimization related to variational marginal inference in probabilistic models, or it could just be a linear objective over a structured output space, which is typical in combinatorial structured prediction. For various reasons, we often solve these inference objectives using a dual version of the inference optimization

\arg\min_\alpha g(\alpha; \theta),


\min_\alpha g(\alpha; \theta) = \max_x f(x; \theta)~\text{and}~g(\alpha; \theta) \ge f(x; \theta), \forall x, \alpha


The learning optimization often takes the form of

\min_\theta ~ R(\theta) + \max_x f(x; \theta) + \text{something}

which is a saddle-point problem minimizing over parameters \theta and maximizing over the predictions x. The “something” is typically a function of \theta that doesn’t depend on x.

The standard way to solve saddle-point-y things like this is to repeatedly

  1. solve the inner inference optimization,
  2. take the (sub)gradient with respect to the parameters \theta, and
  3. update the parameters using your favorite gradient-based optimization routine.

Having to solve the inference for each gradient step is painful, especially when you’re dealing with complex structured outputs. So the inner-dual idea is to replace that inner maximization with a minimization of its dual. (For real duals of concave functions, this forms an upper bound on the original learning objective!)

\min_{\theta, \alpha} R(\theta) + g(\alpha; \theta) + \text{something}.

The problem then becomes a joint minimization over the parameters and the dual variables! Moreover, we often have very fast message-passing style algorithms for solving the dual minimization. That means we can, in a principled way, interleave learning and inference, rather than using one as a subroutine for the other. The learning loop then becomes

  1. make a single-pass update to the inner inference dual variables (e.g., pass messages),
  2. take the (sub)gradient of the dual objective with respect to the parameters \theta, and
  3. update the parameters using your favorite gradient-based optimization routine.

As far as I know, the first such inner-dual method was done in the seminal paper by Ben Taskar and folks at ICML ’05 (link). That paper is really well cited and has been read by tons of people, but seemingly few have picked up on this trick. Instead, I usually see it cited for the primal structured-output learning objective they introduced, which has been the foundation of a lot of related work. Part of the reason for the lack of recognition for this idea is that Ben et al. formulated the dual as a quadratic program that had to be passed into a general-purpose quadratic programming tool, which I imagine is super slow.

Years later in 2010, Ofer Meshi and folks (link), and Tamir Hazan and Raquel Urtasun (link) used the same idea but using the dual objectives from fast message-passing algorithms for graphical model inference as the inner dual. Later on in 2012, Alex Schwing and folks (link) used this for latent variable modeling, also using message-passing duals. In 2015, in work with Stephen Bach, Jordan Boyd-Graber, and Lise Getoor, we used the inner dual method twice to even more aggressively dualize expensive inferences during latent variable learning (link). We did this specifically for hinge-loss MRFs and with ADMM inference, but I’m working with my current students on extending this to be more general now. Also last year in 2015, Chen, Schwing et al. revisited the inner dual idea to train deep models with structured predictors attached to them (link).

I’m sure there are other examples of this trick in the literature, though one problem with finding it is that it hasn’t been consistently named. Both the Taskar and Hazan (et al.) papers don’t really give it a name, referring to the idea as just a dual objective; Meshi et al. refer to the idea as using a dual loss; Bach and I referred to it as inner dual (or in our case, since there were two dual inferences for latent variable learning, we called it paired dual learning); and Chen/Schwing et al. called it blending inference and learning. My preference is inner dual, as should be obvious by how I refer to it throughout this post. I think it captures the fact that we’re using the dual objective of the inner optimization. But pay attention for these other names of it!

While inner-dual methods seem to do a great job of alleviating the computational cost of learning, they still train models that are expected to run expensive inference at test time. That’s not great. There are related lines of research on training structured predictors that will do something cheap at test time, like truncate a message passing optimization, but it remains to be seen how to integrate these different approaches without nullifying their benefits.