# Machine Learning’s Poor Fit for Real Data

There’s a growing sentiment, with all the wonderful things happening in artificial intelligence, machine learning, and data science, that these technologies are ready to solve all the things (including how to kill all humans). The reality is that there are still a bunch of significant hurdles between us and the AI dystopia/utopia. One big one, and the main impetus behind my research, is the disconnect between the statistical foundations of machine learning and how real data works.

Machine learning technology is built on a foundation of formal theory. Statistical ideas, computer science algorithms, and information-theoretic concepts integrate to yield practical methods that analyze large, noisy data sets to train actionable and predictive models. The power of these methods has caused many to realize the value of data.

Yet, as data collection accelerates, weaknesses of existing machine learning methods reveal themselves. The nature of larger-scale data collection violates key assumptions in the foundation that made machine learning so effective. Most notably, statistical independence is no longer achievable with large-scale data. Data is being collected from highly interacting, entangled, complex systems. Human data describes people interacting in a single global social network; ecological data represents measurements of organisms inhabiting complex, shared ecosystems; and medical data measures the interconnected, biological systems that govern health.

## Origins in Experimental Statistics

The concept of statistical independence is a natural fit for laboratory experimentation. In laboratory experiments, scientists test hypotheses by running repeated experiments in closed environments. By design, the measurements taken during each experiment are independent. Because one experiment can’t affect another’s result, classical statistics can confidently quantify the effects of factors in the experiment, even in the presence of randomness.

For example, a typical pre-clinical laboratory drug trial would use a population of animal subjects, administering a drug to part of the population and giving no treatment to a separate control subpopulation. The two subpopulations would be managed to ensure that confounding factors, such as genetics, are equally distributed. The individual subjects would be kept separated in isolated environments. By preventing subjects from interacting with each other, any observations of the drug’s effects can be considered fully independent samples, and classical statistics would enable comparison of effects with quickly converging confidence intervals.

In modern data, measurements are taken from “the wild.” Modern data is analogous to a version of the experiment where the animal subjects interact, sharing food, contact, and communication. Generally, data collections describe large populations of interacting parts, with each measurement related through correlated paths. As modern data collection technology becomes faster and cheaper, the data necessarily becomes increasingly interdependent.
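To make the cost of dependence concrete, here is a minimal sketch under a deliberately simple assumption that is not part of the argument above: every pair of measurements shares the same correlation `rho` (an equicorrelated population). In that case the variance of the sample mean is $\frac{\sigma^2}{n}\left(1 + (n-1)\rho\right)$, which stops shrinking as $n$ grows. The function names and the value `rho=0.05` are illustrative choices.

```python
def variance_of_mean(n, sigma2=1.0, rho=0.0):
    """Variance of the mean of n measurements that each have variance
    sigma2 and share a common pairwise correlation rho (an assumption)."""
    return sigma2 / n * (1.0 + (n - 1) * rho)


def effective_sample_size(n, rho=0.0):
    """Number of independent samples that would give the same variance."""
    return n / (1.0 + (n - 1) * rho)


for n in (10, 1000, 100000):
    independent = variance_of_mean(n)            # shrinks like 1 / n
    dependent = variance_of_mean(n, rho=0.05)    # levels off near rho * sigma2
    print(n, independent, dependent, effective_sample_size(n, rho=0.05))
```

Under this toy model, even a small shared correlation caps the effective sample size at roughly `1 / rho`, no matter how many more interacting subjects are measured.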

Illustration of the data cleaning task. A common view of complex data is that it can be “cleaned” to fit the structure expected in classical statistics and machine learning methods. But this cleaning typically decimates the nuanced information present in data from real-world, complex phenomena.

## The Myth of Clean Data

It’s tempting to interpret these nuances of real-world data to be simply nuisances. One may attempt to convince oneself that the discrepancy between classical statistical methods and real data can be remedied by data cleaning. Data cleaning—a key skill in modern data analysis—involves taking raw data, processing it to remove spurious outliers, undesired dependencies, and biases created by interacting measurements, then performing supposedly clean analysis on the supposedly clean data.

The clean data concept encourages deliberate omission of information, introduction of unjustified assumptions, and fabrication of facts, to turn real-world data, with all its complexities, into rectangular tables of independent samples. These manipulations undo many of the virtues of data-driven thinking.

The perception that machine learning methods require such destructive preprocessing is a major failure in the technology transfer from machine learning research to practical application. And the reasons for this failure manifest in the various costs associated with more nuanced machine learning methods. Methods that can reason about interdependent data require more computational cost—the amount of computer time and energy needed to learn, reason, and predict using these methods—and cognitive cost—the amount of expertise necessary to apply, understand, and interpret these methods.

## “Conclusion”

So what’s the point of my arguments here? I’m not super certain, but here are a few possible takeaway points:

1. Big data is complex data. As we go out and collect more data from a finite world, we’re necessarily going to start collecting more and more interdependent data. Back when we had hundreds of people in our databases, it was plausible that none of our data examples were socially connected. But when our databases are significant fractions of the world population, we are much farther away from the controlled samples of good laboratory science. This means…
2. Data science as it’s currently practiced is essentially bad science. When we take a biased, dependent population of samples and try to generalize a conclusion from it, we need to be fully aware of how flawed our study is. That doesn’t mean things we discover using data analytics aren’t useful, but they need to be understood through the lens of the bias and complex dependencies present in the training data.
3. Computational methods should be aware of, and take advantage of, known dependencies. Some subfields of data mining and machine learning address this, like structured output learning, graph mining, relational learning, and more. But there is a lot of research progress needed. The data we’re mostly interested in nowadays comes from complex phenomena, which means we have to pay for accurate modeling with a little computational and cognitive complexity. How we manage that is a big open problem.

# Inner dual methods for structured output learning

One of the big challenges in making structured predictors more practical is that they can be really slow compared to the hack of assuming your variables are independent and using unstructured models.

There’s a line of research that partially addresses this expensiveness by avoiding repeated inference during the learning phase. I call it “inner-dual learning,” but it goes by many names.

The basic idea is that we do prediction by solving an optimization problem

$\arg\max_x f(x; \theta)$.

For example, the objective $f$ could be an energy and an entropy, which would make this optimization related to variational marginal inference in probabilistic models, or it could just be a linear objective over a structured output space, which is typical in combinatorial structured prediction. For various reasons, we often solve these inference objectives using a dual version of the inference optimization

$\arg\min_\alpha g(\alpha; \theta)$,

where

$\min_\alpha g(\alpha; \theta) = \max_x f(x; \theta)~\text{and}~g(\alpha; \theta) \ge f(x; \theta), \forall x, \alpha$.

The learning optimization often takes the form of

$\min_\theta ~ R(\theta) + \max_x f(x; \theta) + \text{something}$

which is a saddle-point problem minimizing over parameters $\theta$ and maximizing over the predictions $x$. The “something” is typically a function of $\theta$ that doesn’t depend on $x$.

The standard way to solve saddle-point-y things like this is to repeatedly

1. solve the inner inference optimization,
2. take the (sub)gradient with respect to the parameters $\theta$, and
3. update the parameters using your favorite gradient-based optimization routine.

Having to solve the inference for each gradient step is painful, especially when you’re dealing with complex structured outputs. So the inner-dual idea is to replace that inner maximization with a minimization of its dual. (For real duals of concave functions, this forms an upper bound on the original learning objective!)

$\min_{\theta, \alpha} R(\theta) + g(\alpha; \theta) + \text{something}$.

The problem then becomes a joint minimization over the parameters and the dual variables! Moreover, we often have very fast message-passing style algorithms for solving the dual minimization. That means we can, in a principled way, interleave learning and inference, rather than using one as a subroutine for the other. The learning loop then becomes

1. make a single-pass update to the inner inference dual variables (e.g., pass messages),
2. take the (sub)gradient of the dual objective with respect to the parameters $\theta$, and
3. update the parameters using your favorite gradient-based optimization routine.
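The interleaved loop above can be sketched on a toy problem. Everything specific here is an assumption for illustration and is not taken from any of the papers discussed in this post: the inner problem is a loss-augmented multiclass prediction, $\max_x \langle s, x \rangle$ over the simplex, its Lagrangian dual is $g(\alpha) = \alpha + \sum_k \max(0, s_k - \alpha)$, and the "something" is minus the score of the true label. The names `scores`, `dual_inner`, and `inner_dual_train` are hypothetical.

```python
def scores(theta, phi, y):
    """Loss-augmented scores s_k = <theta_k, phi> + 1[k != y]."""
    return [sum(t * p for t, p in zip(row, phi)) + (0.0 if k == y else 1.0)
            for k, row in enumerate(theta)]


def primal_inner(s):
    """max over the simplex of <s, x>: the best loss-augmented score."""
    return max(s)


def dual_inner(s, alpha):
    """Dual objective g(alpha) = alpha + sum_k max(0, s_k - alpha)."""
    return alpha + sum(max(0.0, sk - alpha) for sk in s)


def inner_dual_train(data, num_labels, dim, steps=200, lr=0.1, reg=0.01):
    theta = [[0.0] * dim for _ in range(num_labels)]
    alphas = [0.0] * len(data)  # one inner dual variable per example
    for _ in range(steps):
        for i, (phi, y) in enumerate(data):
            s = scores(theta, phi, y)
            # 1. one subgradient step on the dual variable (no full inference)
            num_active = sum(1 for sk in s if sk > alphas[i])
            alphas[i] -= lr * (1.0 - num_active)
            # 2-3. subgradient step on theta for g(alpha; theta) - true score;
            # the regularizer gradient is applied at every example step
            for k in range(num_labels):
                coef = (1.0 if s[k] > alphas[i] else 0.0) - (1.0 if k == y else 0.0)
                for j in range(dim):
                    theta[k][j] -= lr * (coef * phi[j] + reg * theta[k][j])
    return theta, alphas


# Tiny made-up data set: (feature vector, label) pairs.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.1], 0), ([0.1, 1.0], 1)]
theta, alphas = inner_dual_train(data, num_labels=2, dim=2)
for (phi, y), alpha in zip(data, alphas):
    s = scores(theta, phi, y)
    print(primal_inner(s), dual_inner(s, alpha))  # the dual never falls below the primal
```

In this toy case the inner maximization is trivial, so nothing is gained computationally. The point is only the shape of the loop: the dual variables and the parameters take turns making cheap updates, and the dual objective upper-bounds the inner maximum at every step.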

As far as I know, the first such inner-dual method was done in the seminal paper by Ben Taskar and folks at ICML ’05 (link). That paper is really well cited and has been read by tons of people, but seemingly few have picked up on this trick. Instead, I usually see it cited for the primal structured-output learning objective they introduced, which has been the foundation of a lot of related work. Part of the reason for the lack of recognition for this idea is that Ben et al. formulated the dual as a quadratic program that had to be passed into a general-purpose quadratic programming tool, which I imagine is super slow.

Years later in 2010, Ofer Meshi and folks (link), and Tamir Hazan and Raquel Urtasun (link) used the same idea but using the dual objectives from fast message-passing algorithms for graphical model inference as the inner dual. Later on in 2012, Alex Schwing and folks (link) used this for latent variable modeling, also using message-passing duals. In 2015, in work with Stephen Bach, Jordan Boyd-Graber, and Lise Getoor, we used the inner dual method twice to even more aggressively dualize expensive inferences during latent variable learning (link). We did this specifically for hinge-loss MRFs and with ADMM inference, but I’m working with my current students on extending this to be more general now. Also last year in 2015, Chen, Schwing et al. revisited the inner dual idea to train deep models with structured predictors attached to them (link).

I’m sure there are other examples of this trick in the literature, though one problem with finding it is that it hasn’t been consistently named. Neither the Taskar nor the Hazan (et al.) papers really give it a name, referring to the idea as just a dual objective; Meshi et al. refer to the idea as using a dual loss; Bach and I referred to it as inner dual (or in our case, since there were two dual inferences for latent variable learning, we called it paired dual learning); and Chen/Schwing et al. called it blending inference and learning. My preference is inner dual, as should be obvious by how I refer to it throughout this post. I think it captures the fact that we’re using the dual objective of the inner optimization. But keep an eye out for these other names!

While inner-dual methods seem to do a great job of alleviating the computational cost of learning, they still train models that are expected to run expensive inference at test time. That’s not great. There are related lines of research on training structured predictors that will do something cheap at test time, like truncate a message passing optimization, but it remains to be seen how to integrate these different approaches without nullifying their benefits.

# Scientific Hypothesis: We are the Best

It’s reviewing season for the summer conferences, so here’s something that’s on my mind as I’m doing my reviews.

One crappy thing that happens a lot in machine learning research is that researchers do non-scientific things like over-claiming, taking ownership, and bad experiment design. We end up with paper after paper, each claiming to present the best method with cherry-picked experiments that only demonstrate that the authors can draw prettier curves than other authors.

Sometimes authors use phrases like “our method” a lot in their description of the approach they’re demonstrating. Sometimes I even see tables or plots describing the final results from experiments where the legend entries are “our method,” “So and so’s method,” “SVM,” etc. This type of naming hints at a lack of objectivity.

Naming the proposed method is usually better, especially when the name actually describes the thing (so not an acronym that uses a letter from the middle of one of the words… c’mon people). Then the authors become scientists trying to understand the properties of some approach they discovered. And yes, they still get credit for discovering it, they just get fewer eye rolls.

This attitude also encourages poor experiment design. As computer scientists, we should want to understand the behavior of certain algorithms, so really good experiments would test many hypotheses about how the new algorithm performs under different conditions. We want to understand the strengths, weaknesses, and tradeoffs in comparison to other known methods. But many experiments in papers only test one hypothesis: “our method is the best method ever and you should purchase it.”

This problem is bad enough that I almost never trust the results of experiments in papers, or I always just think of them as synthetic sanity checks, even when they are using real data.

I’m certainly also quite guilty of this unscientific attitude and behavior. It’s very hard to avoid. On one hand, as scientists, we want to advance the world’s knowledge on machine learning, but on the other hand, as people who do science for a living, we want credit for advancing the world’s knowledge. That often leads to our papers reading more like patent applications than descriptions of scientific discovery. Yuck.

In conclusion, I’ve pointed out an annoyance and proposed no great solution for it. So I guess this qualifies as just ranting. But my method of pointing out this problem improves upon the state-of-the-art method by so-and-so et al. by 11%.

# Upcoming AISTATS paper

In about a week, our paper Unifying Local Consistency and MAX SAT Relaxations for Scalable Inference with Rounding Guarantees by Steve, me, and Lise will appear at AISTATS in San Diego. (Steve will be giving a talk on it Monday morning.)

The paper title is a mouthful, and it includes a pretty technical result, so here’s a weak attempt at explaining the intuition and take-away messages.

The paper is about MAP inference, or finding the most likely state in a probability distribution. Specifically, it’s about MAP inference in what we are calling logical Markov random fields (MRFs), which are MRFs whose potentials are defined by weighted logical clauses of a particular form. We show equivalences between two different approaches (and a third bonus one) for approximating MAP inference in these logical MRFs. These equivalences are theoretically interesting, but at least as importantly, they allow us to get the benefits of each approach, leading to fast MAP approximations that have constant-factor approximation guarantees.

What’s special about logical MRFs is that MAP inference is equivalent to finding the maximum weighted satisfying assignment to the logical variables, aka the MAX SAT problem. On the surface, this equivalence doesn’t seem exciting, because it equates one NP-hard problem to another. What makes it exciting is that there are some nice approximation algorithms for MAX SAT with quality guarantees, and these algorithms can therefore be applied to get quality guarantees for MAP. Unfortunately, the approximation method by the famous computer science duo, Michael Goemans and David Williamson, requires solving a linear program (LP) that scales not-so-well in practice, when using off-the-shelf linear programming algorithms.

Another approach for approximating MAP inference borrows from recent developments on doing fast inference in MRFs using local-consistency relaxations. At a very high level, these approaches relax the space of possible marginal probabilities (i.e., the marginal polytope) to a simpler space that only requires local consistency between marginal probabilities of variables and factors (i.e., the local marginal polytope). By solving the corresponding optimization over this simpler, relaxed set of constraints, many very fast message-passing algorithms have been discovered in the past few years. So it’s natural to try to use one of these local-consistency relaxation message-passing algorithms to do MAP inference in a logical MRF.

The main result we show in the paper is that these two seemingly different approaches are equivalent. This equivalence means that when we use these fast-in-practice local-consistency relaxation algorithms that pass messages to quickly find an approximate solution, we’re also able to get the solution to the linear-program subproblem of the MAX SAT approximation algorithm. Using the LP solution, we can perform a special rounding scheme that guarantees a constant-factor approximation.
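As a rough illustration of the rounding step, here is a sketch of independent randomized rounding for weighted MAX SAT. It assumes the rounding function $p_i = \frac{1}{2} y_i + \frac{1}{4}$, which is one choice known from the Goemans-Williamson analysis to give a 3/4 guarantee; this post does not say which rounding function the paper uses, so treat that choice as an assumption. The clause encoding, the function names, and the toy instance are all hypothetical, and the LP solution `y` is supplied by hand rather than computed.

```python
def rounding_probabilities(y):
    """Map a fractional LP solution y_i in [0, 1] to p_i = y_i / 2 + 1 / 4."""
    return [0.5 * yi + 0.25 for yi in y]


def expected_satisfied_weight(clauses, p):
    """Expected total weight of satisfied clauses when variable i is set to
    True independently with probability p[i].

    Each clause is a tuple (weight, positive_indices, negated_indices).
    """
    total = 0.0
    for weight, positive, negated in clauses:
        prob_unsatisfied = 1.0
        for i in positive:
            prob_unsatisfied *= 1.0 - p[i]  # positive literal comes out False
        for i in negated:
            prob_unsatisfied *= p[i]        # negated literal comes out False
        total += weight * (1.0 - prob_unsatisfied)
    return total


# Toy instance: (x0 OR x1) with weight 1 and (NOT x0) with weight 1.
clauses = [(1.0, [0, 1], []), (1.0, [], [0])]
y = [0.0, 1.0]  # an optimal LP solution for this toy instance, with LP value 2
p = rounding_probabilities(y)
print(expected_satisfied_weight(clauses, p))  # 1.5625, at least 3/4 of the LP value
```

In the setting described above, the vector `y` would come from the message-passing solver rather than from an off-the-shelf LP solver.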

The last bonus equivalence is that both of these approaches are also equivalent to the linear form of a hinge-loss Markov random field, which we’ve been studying over the past few years as a powerful class of probabilistic models with efficiently optimizable energy functions. The conversion from logic to hinge-loss MRFs (i.e., the principles behind probabilistic soft logic) had previously been motivated by connections to fuzzy logic, and now we have these other relationships to MAX SAT relaxation and local-consistency relaxation.

Behind the scenes, this last bonus piece is how we happened to find these equivalences. We initially were working with these hinge-loss MRFs and we had thought that these other approaches to doing inference in logically-defined MRFs seemed so different, that it’d be interesting to compare them. So we ran a few tests and discovered they were behaving similarly… very similarly; they were returning solutions with differences that were small enough to be numerical errors. This surprising behavior led to confusion, derivations, and finally understanding.

Finally, the super weird part: any (nondeterministic factored) discrete MRF can be converted to an equivalent logical MRF. This, in some sense, means the constant-factor quality guarantees that come from the equivalence between logical MRF inference and MAX SAT, also apply to any discrete MRF. But along the way of converting a discrete MRF to the restricted logical MRF form, the strength of this constant-factor guarantee must be weakened, especially since we know that MAP can’t be approximated to a constant factor unless P = NP. But it’s still an open question how this all fits together. Check out the paper, and maybe we can brainstorm about this in San Diego.

# Our NIPS 2014 Workshop papers

Two of our papers appeared at the NIPS 2014 workshops this past week.

• “On the Collective Stability of Variational Inference,” by Ben London, me, and Lise Getoor, shows some of our theoretical analyses on bounding the curvature of variational entropy surrogates. Because computing and optimizing the entropy of graphical models is computationally difficult, a lot of inference methods replace the entropy term with a convex surrogate that’s efficient to work with. In the paper, we specifically analyze two of these: tree reweighting, and region-based counting-number adjustment. We provide new, tighter bounds on how strongly convex tree entropies can be, and provide new conditions for the counting numbers of regions that guarantee strongly convex entropies (whereas the previous conditions only guaranteed strict convexity). In both cases, we identify conditions under which the modulus of strong convexity does not grow as the number of variables grows, which our previous theory suggests is necessary for generalization.
• “Rounding Guarantees for Message-Passing MAP Inference with Logical Dependencies,” by Stephen H. Bach, me, and Lise Getoor, recounts our recent discovery that the local LP relaxation that is extremely popular for fast inference in discrete MRFs is actually equivalent to an old LP-based approximation algorithm for weighted MAX SAT, for a certain class of graphical models. This equivalence implies that the local LP relaxations, which our community has developed some blazing-fast algorithms to solve, automatically inherit the 3/4 approximation guarantees associated with the old MAX SAT method. And even better, the equivalence class also happens to include a huge subset of hinge-loss MRFs, which we’ve been looking at over the past few years.

Thanks to everyone who came to talk to us during the workshops! I was bouncing back and forth between the two workshops during the poster times, but the conversations I had were super interesting and helpful.

# Data Science Research for Social Good

In a few weeks, there will be a workshop at KDD on data science for social good, run by Rayid Ghani and folks. Rayid has also been heading the Data Science for Social Good fellowship program the past couple years.

I really love the whole mission behind the workshop and fellowship (and Stephen H. Bach, Lise Getoor, and I have submitted a short paper to the workshop on how probabilistic soft logic is particularly useful for problems that have social benefit). But I wanted to point out a distinction that I think will help bring in more of the best minds in machine learning research to applications with potential to do social good.

My view of the mission behind the existing effort to encourage social good applications of data science is that it is driven by the question, “What can state-of-the-art data science do to benefit society?”

This is an unarguably important question, and answers to this question should improve the lives of many immediately. The projects that are being tackled by the Data Science for Social Good Fellows should have measurable impact on the well-being of people now, in 2014. This type of immediate impact is crucially missing from most research, making it rather attractive to spend research effort on these problems.

On the other hand, what can easily be lost in the excitement over immediate tangible impact is the advancement of scientific knowledge, in algorithms or theory. When working on applied machine learning research, one typically explores what known, existing methods can be applied to novel problems.

Because of this tendency, I suggest emphasizing another important question, which exists in all novel applications to some degree, but is often paid less attention. The question is, “What can’t state-of-the-art data science do to benefit society?” I.e., what are we missing, that we need new algorithms, new models, or new theory to support?

We have developed a huge toolbox of machine learning over the past few decades, so the first question is a completely valid one to focus solely on, but it happens so often that the form of data relevant to new problems is intrinsically different from the assumed form of known tools. This discrepancy means that we often need new machine learning research to do data analysis, whether that is truly new, undiscovered research, or simply new research that has only recently been discovered.

For example, human data is inherently social and thus contains statistical dependencies that make it an ill-fit for traditional statistical tools that rely on independent samples. While machine learning research has developed methods to handle heavily dependent and structured data over the past decade, these methods are often overlooked in practice. The technology transfer has not been fully or successfully executed between researchers and practitioners.

Part of the reason for all the excitement over data science is that everyone is noticing the gap created as knowledge of machine learning and analytic tools is outpacing their application in the real world. Most of the energy now is in filling that gap by finding new applications, and this direction is completely justified. We have not filled this gap yet as a whole, but as we start using machine learning on new applications, we will start inverting parts of this gap by identifying situations where known machine learning tools are insufficient for the problems we want to solve. These situations are where a lot of important scientific research can happen, and I can think of few better settings where I want this to happen than the context of social good.

# Research in “the cloud” and the honeymoon phase

A few days ago, I watched a TED talk by physicist Uri Alon on the emotional experience of scientific research. It is a fun talk, and I think the message is important.

See the video here.

Uri argues that science is presented in a manner that poorly sets up expectations for scientists: the presented story is that we scientists brilliantly devise a hypothesis, run some experiments, and verify this hypothesis. Then we hit print and science is done. Instead, Uri sheds light on the actual process, which involves a taxing trek through what he calls “the cloud.” The real story is that a scientist has a hypothesis, tests it, finds that it is wrong, and then enters this cloud, which is a place of uncertainty, despair, and confusion. The scientist has to challenge his or her basic assumptions, and only by doing so, will be able to produce a new hypothesis that truly advances human knowledge.

Though Uri is a physicist, I think this message is just as relevant for computer science. I can’t speak for all fields within computer science, but at least in machine learning research, this is almost always the way progress occurs.

One problem of applying Uri’s message to computer science is the nomenclature. The cloud (or is it the Cloud?) is a buzzword that is already somewhat defined (but not very well). So using it in this context could lead to confusion (which is a bit too meta to be useful). We need another term for the computer science version of the cloud. Or we could just use it and stop using “the cloud” to describe the Internet, because we already have a word for that.

I would add to the process Uri describes another important concept to prepare researchers for the emotional roller coaster of science: the honeymoon phase.

The honeymoon phase is what I call the period when I’ve come up with an idea, perhaps it’s an algorithm, a theorem, or an application, that I think will work. As I start incrementally testing the idea–coding up the algorithm, for example–it starts to seem more and more correct. A subroutine of the algorithm works exactly as it should on paper! The learning algorithm correctly learns from a synthetic data set! If we assume a lemma, the proof actually works out! These small victories invoke a sense of euphoria and often come with daydreams of how easily this new research will lead to publication and recognition.

In reality, the honeymoon phase is almost always followed by a discovery that something is wrong, which leads to a sharp turn directly into the cloud. This contrast from the highs of the honeymoon phase to the lows of the cloud is jarring.

Like the message from the TED talk, I believe acknowledging that this sharp transition is part of a shared experience could help limit the emotional distress caused by the process of research. I’m not sure if there is any particular strategy for intelligently handling the highs of the honeymoon phase better, and I’m hesitant to suggest to anyone not to enjoy it while it’s happening.

Next time on Terrible Emotions in Science: Rejection…

# Negative Free Energy and the Log Partition Function via Jensen’s Inequality

For years, I’ve been using the duality between the log partition function and the negative Gibbs free energy, hoping that one day I’d gain some intuition about why these two quantities are so intrinsically tied. Though it still seems somewhat mystical to me, I was messing around with some math a few weeks ago and came across one connection that helps a bit. The duality—which is used in lots of machine learning analyses—reasons about the log partition function $\log Z(\theta)$, where

$p(x) := \frac{1}{Z(\theta)} \exp\left(\langle \theta, \phi(x) \rangle \right)$

and

$Z(\theta) := \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right)$.

The famous duality is

$\log Z(\theta) = \sup_{\mu \in {\cal M}} \langle \theta, \mu \rangle + H(\mu)$,

where ${\cal M}$ is the marginal polytope, and $H(\mu)$ is the entropy of the distribution with marginals $\mu$. The marginal polytope is the convex hull of all possible marginal vectors, or more generally, the convex hull of all possible expectations of features $\phi$. That is, $\mu$ is in $\cal M$ if and only if there exists a distribution $q$ whose expected feature value is $\mu$:

${\cal M} := \{ \mu : \exists q, \mu = \mathbb{E}_{q(x)} \left[\phi(x) \right] \}$.

This duality is often justified as an instance of Fenchel duality, which comes from some notions of convex conjugacy. I’m convinced these concepts are useful and intuitive to some people, since people often say it’s a Fenchel dual and just leave it at that. But to me, this terminology is rote jargon that I haven’t built a good intuition for in my many years of staring at this duality.

A side note: the quantity on the right-hand side of the duality, which combines a log-potential energy and an entropy, is known as the negative Gibbs free energy, and it is strongly connected to certain variational methods related to belief propagation.

Using the generalization of marginals to expectations, we can write an equivalent duality, which is what I will analyze with Jensen’s inequality. I believe Jensen’s inequality is much easier to think about at an intuitive level. (If you’ve done derivations for variational inference, you might see where I’m headed right away.)

$\log Z(\theta) = \sup_{q \in \Delta} \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right] + H(q)$

where $\Delta$ is the simplex of all valid distributions. This is the version of the duality I’ll derive. Let’s start with the log partition function

$\log Z(\theta) = \log \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right)$.

Multiplying the terms in the summation by $q(x) / q(x) = 1$, we get the equivalent form

$\log Z(\theta) = \log \sum_x \frac{q(x)}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) = \log ~ \mathbb{E}_{q(x)} \left[ \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]$.

Here’s where we use Jensen’s inequality in the usual trick from lots of variational analyses. We move the concave $\log$ function inside the expectation, which can only decrease the value:

$\begin{aligned} \log Z(\theta) &\ge \mathbb{E}_{q(x)} \left[ \log \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\ &= \mathbb{E}_{q(x)} \left[ \log \exp \left( \langle \theta, \phi(x) \rangle \right) - \log q(x) \right]\\ &= \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right] - \mathbb{E}_{q(x)} \left[ \log q(x) \right]\\ &\equiv \langle \theta, \mu \rangle + H(\mu) \end{aligned}$.

Thus, simplifying the application of Jensen’s to the log-partition gets us the lower bound on the log partition, which holds for any distribution $q$ or any marginal vector $\mu$. The last remaining step to show the equality at the supremum is to show that there exists a setting of $q$ that achieves equality. It’s pretty easy to guess what that is. Let $q = p$. Then

$\begin{aligned} & \mathbb{E}_{p(x)} \left[ \log \frac{1}{p(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\ &= \mathbb{E}_{p(x)} \left[ \log \frac{Z(\theta)}{\exp \left( \langle \theta, \phi(x) \rangle \right)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\ &= \mathbb{E}_{p(x)} \left[ \log Z(\theta) \right] = \log Z(\theta) \end{aligned}$,

since $Z(\theta)$ is constant with respect to $x$.
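The bound and the equality at $q = p$ can be checked numerically on a tiny model. This is only a sanity-check sketch: the two-variable binary model, the feature map `phi`, and the particular `theta` are made up for illustration.

```python
import math
from itertools import product


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Hypothetical features for two binary variables: singletons and a pair."""
    return (x[0], x[1], x[0] * x[1])


states = list(product([0, 1], repeat=2))
theta = (0.5, -1.0, 2.0)

# Exact log partition function and the distribution p by brute-force enumeration.
weights = [math.exp(dot(theta, phi(x))) for x in states]
log_z = math.log(sum(weights))
p = [w / sum(weights) for w in weights]


def lower_bound(q):
    """E_q[<theta, phi(x)>] + H(q), the right-hand side of the duality."""
    energy = sum(qx * dot(theta, phi(x)) for qx, x in zip(q, states))
    entropy = -sum(qx * math.log(qx) for qx in q if qx > 0)
    return energy + entropy


uniform = [0.25] * 4
print(log_z, lower_bound(uniform), lower_bound(p))
```

The bound evaluated at the uniform distribution falls below $\log Z(\theta)$, while the bound evaluated at $p$ matches it up to floating-point error.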

The problem with this analysis, as much as I like it, is that it depends on having an intuitive trust of Jensen’s inequality. I think I have that intuitive trust, and I think many of the folks I talk to about this stuff do too. Jensen’s inequality is really just an observation about the definition of convexity, and as such, I’m able to reason about it naturally. I don’t have the same feeling about convex conjugacy, though I suspect since this duality is so straightforward via Jensen’s that convex conjugacy is just as obvious once you “get it.”

I’ve never seen this analysis written out before in this context. It’s nearly identical to many derivations for variational inference methods, so it’s relatively elementary and merely at the level of an advanced homework exercise. But every time I’ve seen this duality discussed, it’s justified via Fenchel duality identities and convex conjugacy.

Let me know if I messed anything up here, or if you have any good pointers to help improve my intuition about the Fenchel dual view of this same identity.