A Small Puzzle about Saddle-Point Optimization

Here’s a little problem I’ve been wondering about for a while. Suppose you’re trying to find the solution to the saddle-point optimization
\min_{x} ~ \max_{y} ~ f(x) + g(y) + x^\top M y,
where x and y are vectors, functions f and g map from their respective vector spaces to scalar outputs, and M is a matrix. Assume that f is convex and g is concave. Let’s call the objective value of this problem L(x, y) = f(x) + g(y) + x^\top M y.

Suppose that some oracle gives you a function
h(x) = \arg\max_{y} ~ g(y) + x^\top M y,
i.e., the solution to the inner maximization of the original saddle-point problem \arg\max_y L(x, y). We can then consider the optimization
\min_x ~ L(x, h(x)) = \min_x ~ f(x) + g(h(x)) + x^\top M h(x).
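To make this concrete, here's a minimal numerical sketch. The particular instance is my own illustrative choice, not part of the problem: scalar x and y with f(x) = x^2 (convex), g(y) = -y^2 (concave), and M = m, so the inner maximization has the closed form h(x) = mx/2.

```python
m = 3.0  # the "matrix" M, here just a scalar (an illustrative choice)

def L(x, y):
    # L(x, y) = f(x) + g(y) + x*M*y with f(x) = x^2 and g(y) = -y^2
    return x**2 - y**2 + m * x * y

def h(x):
    # Inner maximization in closed form:
    # d/dy (-y^2 + m*x*y) = -2y + m*x = 0, so the maximizer is y = m*x/2.
    return m * x / 2.0

# Sanity check: h(x) should beat nearby values of y in the inner max.
x = 0.7
assert L(x, h(x)) >= L(x, h(x) + 0.1)
assert L(x, h(x)) >= L(x, h(x) - 0.1)
```

For this instance, L(x, h(x)) = x^2 (1 + m^2/4), which is easy to differentiate by hand as a cross-check.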

There’s a surprising (to me) result that the gradient of the function L(x) \equiv L(x, h(x)) is
\nabla_{x} L = \nabla_x f + h(x)^\top M^\top.

This gradient somehow ignores the Jacobian of h(x), even though h clearly depends on x. Put differently, the total derivative coincides with the partial gradient with respect to x, as if y were frozen at h(x). Why does the dependence on y disappear? Let’s try to see why this is true.
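A quick finite-difference check supports the claim. The sketch below uses an illustrative scalar instance of my own choosing (f(x) = x^2, g(y) = -y^2, M = m) and compares the numerical derivative of L(x, h(x)) against the claimed formula, which never differentiates through h:

```python
m = 3.0

def h(x):
    return m * x / 2.0          # argmax_y of -y^2 + m*x*y

def L_reduced(x):
    # L(x, h(x)) = f(x) + g(h(x)) + x*M*h(x)
    y = h(x)
    return x**2 - y**2 + m * x * y

def claimed_grad(x):
    # nabla_x f + h(x)^T M^T, which here is 2x + m*h(x);
    # note it treats h(x) as a constant.
    return 2 * x + m * h(x)

x, eps = 0.7, 1e-6
numeric = (L_reduced(x + eps) - L_reduced(x - eps)) / (2 * eps)
assert abs(numeric - claimed_grad(x)) < 1e-6
```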

Before we do that, let me give an example of where this form of optimization arises. The most prominent example is Lagrangian relaxation. When you’re trying to minimize some function J(x) subject to a constraint Ax = b, you can form the Lagrangian problem
\min_x ~ \max_y ~ J(x) + y^\top (Ax - b),
which takes the general form we started with if f(x) = J(x), g(y) = -y^\top b, and M = A^\top. This general form also arises in structured prediction, for example when the inner maximization is the separation oracle of a structured support vector machine or the variational form of inference in a Markov random field.
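As a sanity check of the Lagrangian mapping above (f = J, g(y) = -y^\top b, M = A^\top), here's a 1-D instance with J, a, and b chosen arbitrarily by me; the Lagrangian and the general form agree pointwise:

```python
def J(x):
    return (x - 1.0)**2

a, b = 2.0, 0.5  # 1-D stand-ins for A and b (illustrative choices)

def lagrangian(x, y):
    # J(x) + y^T (Ax - b), scalar case
    return J(x) + y * (a * x - b)

def general_form(x, y):
    # f(x) + g(y) + x^T M y with f = J, g(y) = -y*b, M = A^T = a
    f = J(x)
    g = -y * b
    return f + g + x * a * y

for x, y in [(0.3, -1.2), (2.0, 0.7)]:
    assert abs(lagrangian(x, y) - general_form(x, y)) < 1e-12
```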

Back to the general form, let’s try taking the gradient the traditional way, starting with
\nabla_x L = \nabla_x f + \nabla_x g(h(x)) + \nabla_x x^\top M h(x).

The second term can be expanded with some chain rule action:
\nabla_x g(h(x)) = \underbrace{\left(\frac{d~g(h(x))}{d~h(x)}\right)}_{1 \times |y|} \underbrace{\left(\frac{d~h(x)}{d~x}\right)}_{|y| \times |x|}. (Gradients here are row vectors, so the dimensions compose correctly.)

The third term can be expanded with product rule:
\nabla_x x^\top M h(x) = h(x)^\top M^\top + x^\top M \left(\frac{d~h(x)}{d~x}\right).

We also know something about h(x). Since it maximizes L over y, the gradient of L with respect to y vanishes there, i.e.,
\nabla_y L(x, h(x)) = 0,
which means \frac{d~g(h(x))}{d~h(x)} + x^\top M = 0, and therefore \frac{d ~ g(h(x))}{d~h(x)} = - x^\top M.
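On the scalar example from before (f(x) = x^2, g(y) = -y^2, M = m, h(x) = mx/2; all illustrative choices of mine), this first-order condition is easy to verify: g'(h(x)) should equal -xm at every x.

```python
m = 3.0

def h(x):
    return m * x / 2.0          # argmax_y of -y^2 + m*x*y

def g_prime(y):
    return -2.0 * y             # derivative of g(y) = -y^2

# First-order condition of the inner max: g'(h(x)) + x*m = 0.
for x in [-1.0, 0.7, 2.5]:
    assert abs(g_prime(h(x)) + x * m) < 1e-12
```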

The second term can then be rewritten as
\left(\frac{d~g(h(x))}{d~h(x)}\right) \left(\frac{d~h(x)}{d~x}\right) = - x^\top M \left(\frac{d~h(x)}{d~x}\right).

This replacement exactly cancels one of the terms from the product rule (the third term), leaving us with
\nabla_x L = \nabla_x f - x^\top M \left(\frac{d~h(x)}{d~x}\right) + h(x)^\top M^\top + x^\top M \left(\frac{d~h(x)}{d~x}\right) = \nabla_x f + h(x)^\top M^\top.

I suspect there’s an even more general form of this property, perhaps relaxing the bilinearity of the problem to some other kind of convex-concave coupling between x and y. Let me know if you know of anything along those lines.