Negative Free Energy and the Log Partition Function via Jensen’s Inequality

For years, I’ve been using the duality between the log partition function and the negative Gibbs free energy, hoping that one day I’d gain some intuition about why these two quantities are so intrinsically tied. Though it still seems somewhat mystical to me, I was messing around with some math a few weeks ago and came across one connection that helps a bit. The duality, which shows up in lots of machine learning analyses, concerns the log partition function \log Z(\theta), where

p(x) := \frac{1}{Z(\theta)} \exp\left(\langle \theta, \phi(x) \rangle \right)


Z(\theta) := \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right).
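These definitions are easy to check numerically. Here is a minimal sketch in Python, using a made-up discrete model with a handful of states and random features (the state space, feature map, and parameters are all hypothetical, chosen just to make the quantities concrete):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 5 discrete states, 3-dimensional features.
n_states, d = 5, 3
phi = rng.normal(size=(n_states, d))   # phi[x] is the feature vector of state x
theta = rng.normal(size=d)             # natural parameters

scores = phi @ theta                   # <theta, phi(x)> for each state x
Z = np.exp(scores).sum()               # partition function Z(theta)
p = np.exp(scores) / Z                 # p(x), which sums to 1 by construction
```

(For larger models one would work in log space to avoid overflow, but the direct computation is fine at this toy scale.)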

The famous duality is

\log Z(\theta) = \sup_{\mu \in {\cal M}} \langle \theta, \mu \rangle + H(\mu),

where {\cal M} is the marginal polytope and H(\mu) is the entropy of the distribution with marginals \mu. The marginal polytope is the convex hull of all possible marginal vectors, or more generally, the convex hull of all possible expectations of the features \phi. That is, \mu is in {\cal M} if and only if there exists a distribution q whose expected feature vector is \mu:

{\cal M} := \{ \mu : \exists q, \mu = \mathbb{E}_{q} \left[\phi(x) \right] \}.

This duality is often justified as an instance of Fenchel duality, which comes from some notions of convex conjugacy. I’m convinced these concepts are useful and intuitive to some people, since people often say it’s a Fenchel dual and just leave it at that. But to me, this terminology is rote jargon that I haven’t built a good intuition for in my many years of staring at this duality.

A side note: the quantity on the right-hand side of the duality, which combines an expected log-potential (an energy term) with an entropy, is known as the negative Gibbs free energy, and it is strongly connected to certain variational methods related to belief propagation.

Using the generalization of marginals to expectations, we can write an equivalent duality, which is what I will analyze with Jensen’s inequality. I believe Jensen’s inequality is much easier to think about at an intuitive level. (If you’ve done derivations for variational inference, you might see where I’m headed right away.)

\log Z(\theta) = \sup_{q \in \Delta} \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right] + H(q)

where \Delta is the simplex of all valid distributions. This is the version of the duality I’ll derive. Let’s start with the log partition function

\log Z(\theta) = \log \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right).

Multiplying each term in the summation by q(x) / q(x) = 1, where q is any distribution with full support, we get the equivalent form

\log Z(\theta) = \log \sum_x \frac{q(x)}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) = \log ~ \mathbb{E}_{q(x)} \left[ \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right].
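This rewriting is worth a quick numerical sanity check: the expectation under any strictly positive q of \exp(\langle \theta, \phi(x) \rangle) / q(x) recovers Z(\theta) exactly. A short sketch with a hypothetical toy model (random features, random parameters, and a random q drawn from the simplex, all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 5 discrete states, 3-dimensional features.
n_states, d = 5, 3
phi = rng.normal(size=(n_states, d))
theta = rng.normal(size=d)
scores = phi @ theta                   # <theta, phi(x)> for each state x
Z = np.exp(scores).sum()               # partition function, summed directly

# Any strictly positive q works; draw one at random from the simplex.
q = rng.dirichlet(np.ones(n_states))

# Z(theta) rewritten as an expectation under q of exp(<theta, phi(x)>) / q(x).
Z_as_expectation = (q * (np.exp(scores) / q)).sum()
```

The two computations agree to floating-point precision, for any choice of q with full support.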

Here’s where we use Jensen’s inequality, in the usual trick from lots of variational analyses. We move the concave \log function inside the expectation, which can only decrease the value:

\begin{aligned}  \log Z(\theta) &\ge \mathbb{E}_{q(x)} \left[ \log \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\  &= \mathbb{E}_{q(x)} \left[ \log \exp \left( \langle \theta, \phi(x) \rangle \right) - \log q(x) \right]\\  &= \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right] - \mathbb{E}_{q(x)} \left[ \log q(x) \right]\\  &= \langle \theta, \mu \rangle + H(q),  \end{aligned}

where \mu := \mathbb{E}_{q(x)} \left[ \phi(x) \right].

Thus, applying Jensen’s inequality to the log partition function yields a lower bound that holds for any distribution q, or equivalently for any expectation vector \mu. The last remaining step, to show that equality holds at the supremum, is to exhibit a q that achieves it. It’s pretty easy to guess what that is. Let q = p. Then
\begin{aligned}  & \mathbb{E}_{p(x)} \left[ \log \frac{1}{p(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\  &= \mathbb{E}_{p(x)} \left[ \log \frac{Z(\theta)}{\exp \left( \langle \theta, \phi(x) \rangle \right)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\  &= \mathbb{E}_{p(x)} \left[ \log Z(\theta) \right] = \log Z(\theta)  \end{aligned},
since Z(\theta) is constant with respect to x.
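The whole argument can be verified numerically in a few lines. In a hypothetical toy model (a few discrete states with random features, made up for illustration), the negative Gibbs free energy evaluated at an arbitrary q stays below \log Z(\theta), and evaluating it at q = p recovers \log Z(\theta) exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 5 discrete states, 3-dimensional features.
n_states, d = 5, 3
phi = rng.normal(size=(n_states, d))
theta = rng.normal(size=d)
scores = phi @ theta                  # <theta, phi(x)> for each state x
log_Z = np.log(np.exp(scores).sum())  # log partition function
p = np.exp(scores - log_Z)            # the model distribution p(x)

def neg_gibbs_free_energy(q):
    """E_q[<theta, phi(x)>] + H(q), the right-hand side of the duality."""
    return q @ scores - q @ np.log(q)

q = rng.dirichlet(np.ones(n_states))  # an arbitrary distribution on the states

# Jensen: a lower bound for any q, tight at q = p.
assert neg_gibbs_free_energy(q) <= log_Z + 1e-12
assert np.isclose(neg_gibbs_free_energy(p), log_Z)
```

Maximizing neg_gibbs_free_energy over the simplex (e.g., with a constrained optimizer) would drive q toward p and the objective toward \log Z(\theta), which is exactly the supremum form of the duality.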

The problem with this analysis, as much as I like it, is that it depends on having an intuitive trust of Jensen’s inequality. I think I have that trust, and I think many of the folks I talk to about this stuff do too. Jensen’s inequality is really just an observation about the definition of convexity, and as such, I’m able to reason about it naturally. I don’t have the same feeling about convex conjugacy, though I suspect that, since this duality is so straightforward via Jensen’s, convex conjugacy is just as obvious once you “get it.”

I’ve never seen this analysis written out before in this context. It’s nearly identical to many derivations for variational inference methods, so it’s relatively elementary and merely at the level of an advanced homework exercise. But every time I’ve seen this duality discussed, it’s justified via Fenchel duality identities and convex conjugacy.

Let me know if I messed anything up here, or if you have any good pointers to help improve my intuition about the Fenchel dual view of this same identity.



  1. Pingback: Fenchel duality, entropy, and the log partition function | An Ergodic Walk

  2. Stephen Bach pointed out to me that Wainwright and Jordan include a proof of the duality using Jensen’s inequality in a later section of their famous 2008 Foundations and Trends in Machine Learning monograph, buried in an analysis of mean-field methods. I still think this derivation is the first one that should be presented, especially to machine learning audiences. To me, Jensen’s inequality is very natural, and is just a restatement of the definition of convexity. Anand Sarwate wrote a post on how straightforward this duality is if you’re comfortable with Fenchel duality, but I just don’t have as good an intuition for Fenchel duality as I do for Jensen’s inequality. Maybe that’s my own thing that I just need to deal with…

    I guess it’s silly, because when our intuitions line up with how numbers work in high-dimensional space, it’s usually just coincidence. In my experience, the rate at which high-dimensional math and intuition agree suggests that they are completely uncorrelated.
