Research in “the cloud” and the honeymoon phase

A few days ago, I watched a TED talk by physicist Uri Alon on the emotional experience of scientific research. It is a fun talk, and I thought its message was important.

See the video here.

Uri argues that science is presented in a manner that sets up poor expectations for scientists: the story we are told is that we scientists brilliantly devise a hypothesis, run some experiments, and verify the hypothesis. Then we hit print and science is done. Instead, Uri sheds light on the actual process, which involves a taxing trek through what he calls “the cloud.” The real story is that a scientist has a hypothesis, tests it, finds that it is wrong, and then enters this cloud, a place of uncertainty, despair, and confusion. The scientist has to challenge his or her basic assumptions, and only by doing so can he or she produce a new hypothesis that truly advances human knowledge.


Though Uri is a physicist, I think this message is just as relevant for computer science. I can’t speak for all fields within computer science, but at least in machine learning research, this is almost always the way progress occurs.

One problem of applying Uri’s message to computer science is the nomenclature. The cloud (or is it the Cloud?) is a buzzword that is already somewhat defined (but not very well). So using it in this context could lead to confusion (which is a bit too meta to be useful). We need another term for the computer science version of the cloud. Or we could just use it and stop using “the cloud” to describe the Internet, because we already have a word for that.

I would add to the process Uri describes another important concept to prepare researchers for the emotional roller coaster of science: the honeymoon phase.

The honeymoon phase is what I call the period after I’ve come up with an idea (perhaps an algorithm, a theorem, or an application) that I think will work. As I start incrementally testing the idea, coding up the algorithm, for example, it starts to seem more and more correct. A subroutine of the algorithm works exactly as it should on paper! The learning algorithm correctly learns from a synthetic data set! If we assume a lemma, the proof actually works out! These small victories invoke a sense of euphoria and often come with daydreams of how easily this new research will lead to publication and recognition.

In reality, the honeymoon phase is almost always followed by a discovery that something is wrong, which leads to a sharp turn directly into the cloud. This contrast from the highs of the honeymoon phase to the lows of the cloud is jarring.

Like the message from the TED talk, I believe acknowledging that this sharp transition is part of a shared experience could help limit the emotional distress caused by the process of research. I’m not sure if there is any particular strategy for intelligently handling the highs of the honeymoon phase, and I’m hesitant to suggest to anyone not to enjoy it while it’s happening.

Next time on Terrible Emotions in Science: Rejection…


Negative Free Energy and the Log Partition Function via Jensen’s Inequality

For years, I’ve been using the duality between the log partition function and the negative Gibbs free energy, hoping that one day I’d gain some intuition about why these two quantities are so intrinsically tied. Though it still seems somewhat mystical to me, I was messing around with some math a few weeks ago and came across one connection that helps a bit. The duality—which is used in lots of machine learning analyses—reasons about the log partition function \log Z(\theta), where

p(x) := \frac{1}{Z(\theta)} \exp\left(\langle \theta, \phi(x) \rangle \right)


Z(\theta) := \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right).

The famous duality is

\log Z(\theta) = \sup_{\mu \in {\cal M}} \langle \theta, \mu \rangle + H(\mu),

where {\cal M} is the marginal polytope, and H(\mu) is the entropy of the distribution with marginals \mu. The marginal polytope is the convex hull of all possible marginal vectors, or more generally, the convex hull of all possible expectations of features \phi. That is, \mu is in {\cal M} if and only if there exists a distribution q whose expected feature value is \mu:

{\cal M} := \{ \mu : \exists q, \mu = \mathbb{E}_{q(x)} \left[\phi(x) \right] \}.
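To make the setup concrete, here is a small numerical sketch. The state space, the parameter vector, and the identity feature map \phi(x) = x are all arbitrary choices of mine for illustration, not anything specific from the duality itself:

```python
# Toy exponential family over all binary vectors of length 3,
# with the identity feature map phi(x) = x (arbitrary choices).
import numpy as np
from itertools import product

theta = np.array([0.5, -1.0, 2.0])                   # natural parameters
states = np.array(list(product([0, 1], repeat=3)))   # all 8 binary states x
scores = states @ theta                              # <theta, phi(x)> per state

Z = np.exp(scores).sum()                             # partition function Z(theta)
p = np.exp(scores) / Z                               # p(x)
mu = p @ states                                      # E_p[phi(x)], a point in M
```

Here mu is one explicit member of the marginal polytope: the expected feature vector under p itself.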

This duality is often justified as an instance of Fenchel duality, which comes from some notions of convex conjugacy. I’m convinced these concepts are useful and intuitive to some people, since people often say it’s a Fenchel dual and just leave it at that. But to me, this terminology is rote jargon that I haven’t built a good intuition for in my many years of staring at this duality.

A side note: the quantity on the right-hand side of the duality, which combines a log-potential energy and an entropy, is known as the negative Gibbs free energy, and it is strongly connected to certain variational methods related to belief propagation.

Using the generalization of marginals to expectations, we can write an equivalent duality, which is what I will analyze with Jensen’s inequality. I believe Jensen’s inequality is much easier to think about at an intuitive level. (If you’ve done derivations for variational inference, you might see where I’m headed right away.)

\log Z(\theta) = \sup_{q \in \Delta} \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right] + H(q)

where \Delta is the simplex of all valid distributions. This is the version of the duality I’ll derive. Let’s start with the log partition function

\log Z(\theta) = \log \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right).

Multiplying the terms in the summation by q(x) / q(x) = 1, we get the equivalent form

\log Z(\theta) = \log \sum_x \frac{q(x)}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) = \log ~ \mathbb{E}_{q(x)} \left[ \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right].
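This identity is exact for any strictly positive q, and it is easy to sanity-check numerically. The sketch below reuses an arbitrary toy state space of my own choosing and computes the expectation in closed form:

```python
# Sanity check: log E_q[ exp(<theta, phi(x)>) / q(x) ] = log Z(theta)
# for any strictly positive q. The model here is an arbitrary toy example.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0, 2.0])
states = np.array(list(product([0, 1], repeat=3)))
scores = states @ theta                      # <theta, phi(x)>
logZ = np.log(np.exp(scores).sum())

q = rng.dirichlet(np.ones(len(states)))      # random point in the simplex
lhs = np.log(np.sum(q * np.exp(scores) / q))
assert np.isclose(lhs, logZ)                 # the q(x)/q(x) factor cancels
```

The cancellation is trivial when the expectation is exact; the identity only becomes interesting once Jensen's inequality is applied, as below.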

Here’s where we use Jensen’s inequality in the usual trick from lots of variational analyses. We move the concave \log function inside the expectation:

\begin{aligned}  \log Z(\theta) &\ge \mathbb{E}_{q(x)} \left[ \log \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\  &= \mathbb{E}_{q(x)} \left[ \log \exp \left( \langle \theta, \phi(x) \rangle \right) - \log q(x) \right]\\  &= \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right] - \mathbb{E}_{q(x)} \left[ \log q(x) \right]\\  &\equiv \langle \theta, \mu \rangle + H(\mu)  \end{aligned}.

Thus, applying Jensen’s inequality to the log partition function gives a lower bound that holds for any distribution q or, equivalently, any marginal vector \mu. The last remaining step to show the equality at the supremum is to show that there exists a setting of q that achieves equality. It’s pretty easy to guess what that is. Let q = p. Then
\begin{aligned}  & \mathbb{E}_{p(x)} \left[ \log \frac{1}{p(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\  &= \mathbb{E}_{p(x)} \left[ \log \frac{Z(\theta)}{\exp \left( \langle \theta, \phi(x) \rangle \right)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\  &= \mathbb{E}_{p(x)} \left[ \log Z(\theta) \right] = \log Z(\theta)  \end{aligned},
since Z(\theta) is constant with respect to x.
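The whole argument can be checked numerically. In this sketch (again an arbitrary toy state space of my own choosing), the Jensen lower bound \mathbb{E}_{q}[\langle \theta, \phi(x) \rangle] + H(q) stays below \log Z(\theta) for random distributions q and is tight at q = p:

```python
# Numerical check of the derivation: the negative Gibbs free energy
# lower-bounds log Z for every q, with equality at q = p.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0, 2.0])
states = np.array(list(product([0, 1], repeat=3)))
scores = states @ theta
logZ = np.log(np.exp(scores).sum())
p = np.exp(scores - logZ)

def free_energy(q):
    """E_q[<theta, phi(x)>] + H(q), the negative Gibbs free energy."""
    return q @ scores - np.sum(q * np.log(q))

# Random q: always a lower bound on log Z ...
for _ in range(1000):
    q = rng.dirichlet(np.ones(len(states)))
    assert free_energy(q) <= logZ + 1e-9

# ... and the bound is tight at q = p.
assert np.isclose(free_energy(p), logZ)
```

Running an optimizer over the simplex would find the same maximizer, but the closed-form check at q = p is enough to confirm the supremum is attained.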

The problem with this analysis, as much as I like it, is that it depends on having an intuitive trust of Jensen’s inequality. I think I have that intuitive trust, and I think many of the folks I talk to about this stuff do too. Jensen’s inequality is really just an observation about the definition of convexity, and as such, I’m able to reason about it naturally. I don’t have the same feeling about convex conjugacy, though I suspect since this duality is so straightforward via Jensen’s that convex conjugacy is just as obvious once you “get it.”

I’ve never seen this analysis written out before in this context. It’s nearly identical to many derivations for variational inference methods, so it’s relatively elementary and merely at the level of an advanced homework exercise. But every time I’ve seen this duality discussed, it’s justified via Fenchel duality identities and convex conjugacy.

Let me know if I messed anything up here, or if you have any good pointers to help improve my intuition about the Fenchel dual view of this same identity.

Inaugural Post and Announcement

I’ve been meaning to start a blog for a while now (in fact, I registered this WordPress sometime in 2011), so since I have a big announcement, this is about as good a time as any. I’ll start with the announcement:

I’m announcing that starting in December 2014, I’ll be joining Virginia Tech as an Assistant Professor in the Department of Computer Science.

I’m naturally thrilled about this next phase of my career. There are many things about Virginia Tech that make it a great place for me. Over the past couple of years, I’ve been collaborating with a team composed primarily of researchers from Virginia Tech, led by Professor Naren Ramakrishnan, on applying machine learning to predict civil unrest using public data. This interdisciplinary project has been a major push of VT’s Discovery Analytics Center (DAC), which I will be officially joining as a faculty member. DAC has a strong component of new faculty, including a few recent hires who share research interests with me. I look forward to discussing ideas on machine learning, graph mining, and structured prediction with folks like Dhruv Batra and Aditya Prakash.

As for the inauguration of this blog, one of the reasons I’ve been wanting to start blogging for a while is that every now and then I come up with some idea or an analysis of a well-known idea that is either too minor, interesting but not useful enough, or too flat-out wrong to publish in a scientific manuscript but would be fun to share with folks. I’ll consider this an experiment to see if writing in this context is interesting for me and for readers.