# Contraction-based convergence proofs

Every now and then, teaching gives us some really valuable experiences. I’m teaching graduate level AI, and I had a really great experience learning something I had never paid much attention to before. Specifically, I read a section of Russell and Norvig’s book that deals with the convergence of Bellman’s utility updates for Markov decision processes.

When it comes to convergence proofs, the kind I’ve seen usually reason about the high-dimensional curvature of some objective function, or they do nasty things like analyzing the unrolled recursion tree. Yuck! For this reason, when I saw a section on a convergence proof, my immediate instinct was to run away and skip it for my and my students’ sakes.

But the convergence of Bellman’s updates can be proven using the concept of contraction, which says that an operator $f$ on two inputs $x$ and $x'$ makes them “closer,” according to some measure of distance $d$.

$d(x, x') < d(f(x), f(x'))$

This property means that repeating the operator, e.g., $f(f(f(f(x))))$ and $f(f(f(f(x'))))$, makes any two different initializations $x$ and $x'$ get closer and closer, and if the contraction is strong enough, makes them converge to the same value, giving a proof of convergence and uniqueness for the update $\latex f$. Beautiful, no?

The work comes in proving the contraction bound. There’s freedom in choosing the distance metric. The stronger the contraction, the faster the convergence. The Bellman updates contract by a constant factor, which I imagine is about as fast a convergence as you can get. And I imagine there can be some traps when things contract, but the amount of contraction becomes infinitesimal; in that case, this argument doesn’t actually show convergence.

I’m now curious what other convergent algorithms can be proven to converge using a contraction argument. Or more importantly, whether I can reach into my bag of algorithms-that-I-can’t-yet-prove-convergence-of and find one that I can finally solve using this approach.

I’m also curious if some type of contraction is a necessary condition for a convergent operator with a unique fixed point. I’m sure I could find that by doing some reading. But first, I have to put out a million tiny fires (this is how I describe my life now as a professor).

# A quick Keynote tip

A few months ago, I discovered that Keynote has connection lines, which dynamically connect objects that you draw. I used to draw graphs and diagrams in Omnigraffle for this reason, but with native Keynote connection lines, you can animate them directly, etc. The ability to animate graphs is super important, and I’m a huge proponent in using animation in talks (in the right places) to help audiences visualize some of the absurdly complex things we present.

But it still wasn’t perfect, because Keynote doesn’t have built-in keyboard shortcuts to insert connection lines, so you have to select the objects you want, go to the pulldown menu, and select the command. This process is not good for avoiding repetitive-stress injury.

Well, I recently discovered that Mac OS lets you define keyboard shortcuts for any application. Maybe this is common knowledge, but in System Preferences, the Keyboard module lets you add App Shortcuts.

I set mine to cmd-L for a “Straight Connection Line” and cmd-option-L for a “Curved Connection Line.” And now I draw graphs natively in Keynote.

Unfortunately, Magic Move doesn’t work great with connection lines when they have to change orientation. They disappear as the nodes move and reappear once the nodes settle into their location on the next slide. That would be a really cool and useful animation to help visualize graphs. Oh well. Can’t win ’em all.

# On the NIPS Experiment and Review Process

A lot of attention has been focused on the NIPS experiment on its own reviewing system, Eric Price’s blog post on it, and unfortunately my flippant tweet pointing out some inaccuracies in the blog post. So I’ll try to clarify what I meant here.

Edit: folks looking for more info on the NIPS experiment should check out Neil Lawrence’s initial blog post on it and stay tuned, as he and Corinna Cortes will be eventually releasing a more thorough writeup on the results.

Eric Price’s Post

In the spirit of Eric’s post, here’s a TL;DR: the minor inaccuracies in Eric’s post should not detract from the main message that the results from the NIPS experiment are concerning, but we should be careful to get the details right before trying to explain the scary results with faulty information.

The three details that popped out at me as inaccurate were the idea that the program committee was split into two independent committees, that the committee only discussed papers with an average score between 6.0 and 6.5, and that area chairs could not see if a paper was a duplicate. In Eric’s defense, a lot of these details and more were clarified in the comments and discussion below his post, so what I’m writing here is somewhat redundant (e.g., he points out in the comments that NIPS does not have a fixed acceptance rate).

On the first point. I’m not completely sure how the program chairs implemented the duplication. But I don’t think the concept of splitting the program committee in half is correct or makes sense the way NIPS reviewing is organized. Most of the area chairing, fostering discussion, quality control of reviews, etc., is done independently by each area chair, so there no real concept of split independent committees. I’m not even sure what that would mean. The committee is in some sense already split 92 ways, for each area chair, but reviewers may be assigned papers with different chairs, so it’s not really independent. The way Eric’s post describes it invokes an image of two huge conference rooms of area chairs talking about the papers, which is a model that doesn’t scale to the absurd size of the NIPS conference nowadays.

As for the issue of which papers were discussed, I believe this was simply a recantation of a half-joking description of what the review process was, but the tongue-in-cheek tone is lost in writing. First, the reviewers discussed any paper that wasn’t immediately an obvious, high-confidence consensus. Then the area chairs examined all these reviews, papers, discussions, joining in the discussion in most cases. Then the area chairs met in pairs to go over all their papers together (except conflicted papers), making pair judgement calls on each and identifying controversial or tricky decisions that needed to be discussed at the larger meetings. After that, the chairs met in groups of four with one of the program chairs to go over these tricky decisions. Nowhere in this process does the average score even come up other than as a very rough way for area chairs to sort the papers, all of which they have to individually consider. But we also had much better heuristics to sort by, like controversiality, spread of scores, and low confidence scores. I’ll write more about my own personal experience with this process later.

Lastly, area chairs, who had access to author identities (because we are partially responsible for preventing conflicts of interest), could in fact see if a paper was a duplicate. To make the experiment work on CMT, the program chairs had to create a fake additional author, whose name was something like NIPS Dup1. So it was pretty obvious. It’s not clear how this affected the experiment. Different area chairs might have reacted differently to it. I did my best to ignore this bit of information, to preserve the experiment, but there’s no way that my awareness of this didn’t affect my behavior. Did I give these papers more attention because I knew my area chairing would be compared to someone else’s? Or did I give them less attention because I wanted to focus on my actual assigned papers? I wanted to do neither, but who knows how successful I was. This fact surely contaminates the experimental results and any conclusions we might be able to definitely make about it.

I think the reason I was irked by these nitpicky details was partially because the discussion in the comments seemed to suggest that they were the problem with NIPS and the reason for the inconsistency. But I was hasty to criticize on Twitter, because Eric’s post really brings up a few important points and interesting discussion and hypotheses. It is truly great that people are talking about it, and lots of non-machine-learning communities have been made aware of the NIPS experiment through Eric’s post. I’d like some more caution about overgeneralizing from the experimental results, like Eric’s TL;DR does, but I suppose that’s inevitable, so we might as well get right to it with the first public writeup on the results. Hopefully people read on past that, since the rest of his analysis is pretty level-headed and thoughtful.

My own experience as area chair

Since this was my first year on the senior program committee, I got a new perspective on the NIPS review process that might be helpful to share to others. It wasn’t a huge difference from what I’d seen as a reviewer in the past, but since I was responsible for chairing 21 papers, I got a somewhat larger sample size. But it was still only 21 papers, which is much less than the number of papers that the duplication experiment used, and my sample is super biased toward my subject areas. So take this with a HUGE grain of salt. It’s just an anecdote, not a scientific study.

My initial experience with the process was disappointing. I watched the reviewers enter their reviews, and more often than not, these reported opinions that were unsubstantiated, unsupported in the review writeups, late, and unproofread. As a rough guess of the ratio here, each paper had three reviewers initially, and I would say about 3 out of 4 papers had one thoughtful review. After that, the reviewers with crappy reviews became very hard to reach. They didn’t participate in the discussion without a ton of prodding, they didn’t respond to the author rebuttals, and when they did respond, they surprisingly tended to stick to their original, seemingly unsubstantiated opinions.

So that was terrible. The good news is all that happened afterwards. Like I mentioned above, we area chairs met in pairs. (Even though I have only good things to say about the area chairs I met with, I’ll leave names off to preserve the anonymity of the review process.) My AC partner met with me on Skype for (I think) about two hours, maybe three, and we talked through each of our papers. The easy decisions, where all reviewers agreed, reported high confidence, and demonstrated high confidence through their words, were very quick discussions, but most of our time was spent debating about the more difficult decisions. Throughout this discussion, it was clear to me how thoughtfully my partner had considered the reviews and the papers, understanding when reviewers were making valid arguments and when they were being unfair. In many ways, I was reassured that another area chair was trying as hard as I was to find the signal in the noise.

The next stage was a meeting with four area chairs and a program chair. Again, we met on Skype, and this time went through a filtered list of the papers our respective AC pairs had decided needed further discussion. This list mostly included papers on the borderline of acceptance or ones where the reviewers were unwilling to agree on a decision. Reading the reviews and looking at the papers, we did our best to make a decision during the meeting, and again, anytime we couldn’t reach a decision, it was marked for further discussion and consideration as a borderline paper by the next level of the hierarchy: the program chairs.

After that point, I’m in the dark as to what the PCs’ decision process was. I know at some point they had to cut off acceptances for reasons of physical venue space, but it’s my understanding that the acceptance rate before that point is already pretty close to the usual 25%-ish rate.

So what now?

So my experience with the whole process could be summarized by saying that I saw some really disappointing reviewers, but was rescued from losing faith in NIPS by the thoughtfulness of the area chairs I worked with and the program chairs. But even within the sea of bad reviews, there were some standout individuals who anchored (mixed maritime metaphor…) these discussions and decisions in reason, fairness, and perspective. So I think as NIPS continues expanding to accommodate the growing interest in its topics, we’ll have to figure out how to address the growing proportion of bad reviewers that we’ll need to recruit to handle its scale. Maybe the answer is better reviewer training, or maybe the answer is more work for the more competent reviewers, or maybe there is no answer.

One important aspect to consider is, when we talk about peer review being broken, unscalable, or any other complaint, that the primary purpose of these huge processes is not to assign credit for peoples’ work, it’s to decide what content to present in a conference, and despite all the noise, and all the flaws in the system, the quality of the conferences I’ve been attending has always been consistently high. More specifically, to not generalize beyond what I’m discussing in this post, NIPS 2014 was a great conference. Of course, in the real world, assigning credit is a super-important part of this whole deal. (Luckily, the process for journals and for funding decisions tends to happen at a smaller scale, so my guess is it’s less noisy and less crappy, but that certainly isn’t perfect either.) So something does have to be fixed, but it’s not as broken as it may feel at times.

(There are people experimenting with new models of publication to try to address these issues, e.g., see the talks from the ICML 13 Workshop on Peer Review, and the International Conference on Learning Representations (ICLR))

Lastly, I’ll conclude with two questions I’ve been asking myself and my colleagues lately. Of all the peer review experiences you’ve had as a reviewer, a chair, or an author, how often do all the reviewers understand the paper and make a valid decision based on this understanding? For me, the answer is nearly zero percent of the time. Reviewers almost never understand the papers I submit, whether they accept or reject, and when I’m a reviewer or chair, reviewers almost always have different interpretations of what’s going on in the paper, which means they can’t all be correct. So peer review is broken, right? Maybe, but as a second question, how often is the final decision the right decision? For me, the answer is pretty close to always. E.g., the papers I’ve had rejected for dumb reasons have always needed a lot of improvement in retrospect, or maybe the papers reviewers don’t get aren’t written or thought through well enough. Maybe despite reviewers not knowing what they’re reading, their B.S. detectors still work fine. But we shouldn’t just settle for that if it’s true. I dunno. Lots to think about…

# Our NIPS 2014 Workshop papers

Two of our papers appeared at the NIPS 2014 workshops this past week.

• On the Collective Stability of Variational Inference,” by Ben London, me, and Lise Getoor, shows some of our theoretical analyses on bounding the curvature of variational entropy surrogates. Since a lot of inference methods get around the computational difficulty of computing and optimizing the entropy of graphical models, people often replace the entropy term with a convex surrogate that’s efficient to work with. In the paper, we specifically analyze two of these: tree reweighting, and region-based counting-number adjustment. We provide new, tighter bounds on how strongly convex tree entropies can be, and provide new conditions for the counting numbers of regions that guarantee strongly convex entropies (whereas the previous conditions only guaranteed strict convexity). In both cases, we identify conditions under which the modulus of strong convexity does not grow as the number of variables grows, which our previous theory suggests is necessary for generalization.
• Rounding Guarantees for Message-Passing MAP Inference with Logical Dependencies,” by Stephen H. Bach, me, and Lise Getoor, recounts our recent discovery that the local LP relaxation that is extremely popular for fast inference in discrete MRFs is actually equivalent to an old LP-based approximation algorithm for weighted MAX SAT, for a certain class of graphical models. This equivalence implies that the local LP relaxations, which our community has developed some blazing-fast algorithms to solve, automatically inherits the 3/4 approximation guarantees associated with the old MAX SAT method. And even better, the equivalence class also happens to include a huge subset of hinge-loss MRFs, which we’ve been looking at over the past few years.

Thanks to everyone who came to talk to us during the workshops! I was bouncing back and forth between the two workshops during the poster times, but the conversations I had were super interesting and helpful.

# Data Science Research for Social Good

In a few weeks, there will be a workshop at KDD on data science for social good, run by Rayid Ghani and folks. Rayid has also been heading the Data Science for Social Good fellowship program the past couple years.

I really love the whole mission behind the workshop and fellowship (and Stephen H. Bach, Lise Getoor, and I have submitted a short paper to the workshop on how probabilistic soft logic is particularly useful for problems that have social benefit). But I wanted to point out a distinction that I think will help bring in more of the best minds in machine learning research to applications with potential to do social good.

My view of the mission behind the existing effort to encourage social good applications of data science is that it is driven by the question, “What can state-of-the-art data science do to benefit society?”

This is an unarguably important question, and answers to this question should improve the lives of many immediately. The projects that are being tackled by the Data Science for Social Good Fellows should have measurable impact on the well-being of people now, in 2014. This type of immediate impact is crucially missing from most research, making it rather attractive to spend research effort on these problems.

On the other hand, what can easily be lost in the excitement over immediate tangible impact is the advancement of scientific knowledge, in algorithms or theory. When working on applied machine learning research, one typically explores what known, existing methods can be applied to novel problems.

Because of this tendency, I suggest emphasizing another important question, which exists in all novel applications to some degree, but is often paid less attention. The question is, “What can’t state-of-the-art data science do to benefit society?” I.e., what are we missing, that we need new algorithms, new models, or new theory to support?

We have developed a huge toolbox of machine learning over the past few decades, so the first question is a completely valid one to focus solely on, but it happens so often that the form of data relevant to new problems is intrinsically different from the assumed form of known tools. This discrepancy means that we often need new machine learning research to do data analysis, whether that is truly new, undiscovered research, or simply new research that has only recently been discovered.

For example, human data is inherently social and thus contains statistical dependencies that make it an ill-fit for traditional statistical tools that rely on independent samples. While machine learning research has developed methods to handle heavily dependent and structured data over the past decade, these methods are often overlooked in practice. The technology transfer has not been fully or successfully executed between researchers and practitioners.

Part of the reason for all the excitement over data science is that everyone is noticing the gap created as knowledge of machine learning and analytic tools is outpacing their application in the real world. Most of the energy now is in filling that gap by finding new applications, and this direction is completely justified. We have not filled this gap yet as a whole, but as we start using machine learning on new applications, we will start inverting parts of this gap by identifying situations where known machine learning tools are insufficient for the problems we want to solve. These situations are where a lot of important scientific research can happen, and I can think of few better settings where I want this to happen than the context of social good.

# Research in “the cloud” and the honeymoon phase

A few days ago, I watched a TED talk by physicist Uri Alon on the emotional experience of scientific research. It is a fun talk, and I thought the message is important.

See the video here.

Uri argues that science is presented in a manner that poorly sets up expectations for scientists: the presented story is that we scientists brilliantly devise a hypothesis, run some experiments, and verify this hypothesis. Then we hit print and science is done. Instead, Uri sheds light on the actual process, which involves a taxing trek through what he calls “the cloud.” The real story is that a scientist has a hypothesis, tests it, finds that it is wrong, and then enters this cloud, which is a place of uncertainty, despair, and confusion. The scientist has to challenge his or her basic assumptions, and only by doing so, will be able to produce a new hypothesis that truly advances human knowledge.

Though Uri is a physicist, I think this message is just as relevant for computer science. I can’t speak for all fields within computer science, but at least in machine learning research, this is almost always the way progress occurs.

One problem of applying Uri’s message to computer science is the nomenclature. The cloud (or is it the Cloud?) is a buzzword that is already somewhat defined (but not very well). So using it in this context could lead to confusion (which is a bit too meta to be useful). We need another term for the computer science version of the cloud. Or we could just use it and stop using “the cloud” to describe the Internet, because we already have a word for that.

I would add to the process Uri describes another important concept to prepare researchers for the emotional roller coaster of science: the honeymoon phase.

The honeymoon phase is what I call the period when I’ve come up with an idea, perhaps it’s an algorithm, a theorem, or an application, that I think will work. As I start incrementally testing the idea–coding up the algorithm, for example–it starts to seem more and more correct. A subroutine of the algorithm works exactly as it should on paper! The learning algorithm correctly learns from a synthetic data set! If we assume a lemma, the proof actually works out! These small victories invoke a sense of euphoria and often come with daydreams of how easily this new research will lead to publication and recognition.

In reality, the honeymoon phase is almost always followed by a discovery that something is wrong, which leads to a sharp turn directly into the cloud. This contrast from the highs of the honeymoon phase to the lows of the cloud is jarring.

Like the message from the TED talk, I believe acknowledging that this sharp transition is part of a shared experience could help limit the emotional distress cause by the process of research. I’m not sure if there is any particular strategy for intelligently handling the highs of the honeymoon phase better, and I’m hesitant to suggest to anyone not to enjoy it while it’s happening.

Next time on Terrible Emotions in Science: Rejection…

# Negative Free Energy and the Log Partition Function via Jensen’s Inequality

For years, I’ve been using the duality between the log partition function and the negative Gibbs free energy, hoping that one day I’d gain some intuition about why these two quantities are so intrinsically tied. Though it still seems somewhat mystical to me, I was messing around with some math a few weeks ago and came across one connection that helps a bit. The duality—which is used in lots of machine learning analyses—reasons about the log partition function $\log Z(\theta)$, where

$p(x) := \frac{1}{Z(\theta)} \exp\left(\langle \theta, \phi(x) \rangle \right)$

and

$Z(\theta) := \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right)$.

The famous duality is

$\log Z(\theta) = \sup_{\mu \in {\cal M}} \langle \theta, \mu \rangle + H(\mu)$,

where ${\cal M}$ is the marginal polytope, and $H(\mu)$ is the entropy of the distribution with marginals $\mu$. The marginal polytope is the convex hull of all possible marginal vectors, or more generally, the convex hull of all possible expectations of features $\phi$. That is $\mu$ is in $\cal M$ if and only if there exists a distribution $q$ whose expected feature value is $\mu$:

${\cal M} := \{ \mu : \exists q, \mu = \mathbb{E} \left[\phi(x) \right] \}$

This duality is often justified as an instance of Fenchel duality, which comes from some notions of convex conjugacy. I’m convinced these concepts are useful and intuitive to some people, since people often say it’s a Fenchel dual and just leave it at that. But to me, this terminology is rote jargon that I haven’t built a good intuition for in my many years of staring at this duality.

A side note: the quantity on the right-hand side of the duality, which combines a log-potential energy and an entropy is known as the negative Gibbs free energy, and it is strongly connected to certain variational methods related to belief propagation.

Using the generalization of marginals to expectations, we can write an equivalent duality, which is what I will analyze with Jensen’s inequality. I believe Jensen’s inequality is much easier to think about at an intuitive level. (If you’ve done derivations for variational inference, you might see where I’m headed right away.)

$\log Z(\theta) = \sup_{q \in \Delta} \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right] + H(q)$

where $\Delta$ is the simplex of all valid distributions. This is the version of the duality I’ll derive. Let’s start with the log partition function

$\log Z(\theta) = \log \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right)$.

Multiplying the terms in the summation by $q(x) / q(x) = 1$, we get the equivalent form

$\log Z(\theta) = \log \sum_x \frac{q(x)}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) = \log ~ \mathbb{E}_{q(x)} \left[ \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]$.

Here’s where we use Jensen’s inequality in the usual trick from lots of variational analyses. We move the convex $\log$ function inside the expectation:

\begin{aligned} \log Z(\theta) &\ge \mathbb{E}_{q(x)} \left[ \log \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\ &= \mathbb{E}_{q(x)} \left[ \log \exp \left( \langle \theta, \phi(x) \rangle \right) - \log q(x) \right]\\ &= \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right) ] - \mathbb{E}_{q(x)} \left[ \log q(x) \right]\\ &\equiv \langle \theta, \mu \rangle + H(\mu) \end{aligned}.

Thus, simplifying the application of Jensen’s to the log-partition gets us the lower bound on the log partition, which holds for any distribution $q$ or any marginal vector $\mu$. The last remaining step to show the equality at the supremum is to show that there exists a setting of $q$ that achieves equality. It’s pretty easy to guess what that is. Let $q = p$. Then
\begin{aligned} & \mathbb{E}_{p(x)} \left[ \log \frac{1}{p(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\ &= \mathbb{E}_{p(x)} \left[ \log \frac{Z(\theta)}{\exp \left( \langle \theta, \phi(x) \rangle \right)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\ &= \mathbb{E}_{p(x)} \left[ \log Z(\theta) \right] = \log Z(\theta) \end{aligned},
since $Z(\theta)$ is constant with respect to $x$.

The problem with this analysis, as much as I like it, is that it depends on having an intuitive trust of Jensen’s inequality. I think I have that intuitive trust, and I think many of the folks I talk to about this stuff do too. Jensen’s inequality is really just an observation about the definition of convexity, and as such, I’m able to reason about it naturally. I don’t have the same feeling about convex conjugacy, though I suspect since this duality is so straightforward via Jensen’s that convex conjugacy is just as obvious once you “get it.”

I’ve never seen this analysis written out before in this context. It’s nearly identical to many derivations for variational inference methods, so it’s relatively elementary and merely at the level of an advanced homework exercise. But every time I’ve seen this duality discussed, it’s justified via Fenchel duality identities and convex conjugacy.

Let me know if I messed anything up here, or if you have any good pointers to help improve my intuition about the Fenchel dual view of this same identity.

# Inaugural Post and Announcement

I’ve been meaning to start a blog for a while now (in fact, I registered this WordPress sometime in 2011), so since I have a big announcement, this is about as good a time as any. I’ll start with the announcement:

I’m announcing that starting in December 2014, I’ll be joining Virginia Tech as an Assistant Professor in the Department of Computer Science.

I’m naturally thrilled about this next phase of my career. There are many things about Virginia Tech that make it a great place for me. Over the past couple years, I’ve been collaborating with a team comprised primarily of researchers from Virginia Tech, led by Professor Naren Ramakrishnan, on applying machine learning to predict civil unrest using public data. This interdisciplinary project has been a major push of VT’s Discovery Analytics Center (DAC), which I will be officially joining as a faculty member. DAC has a strong component of new faculty, including a few recent hires who share research interests with me. I look forward to discussing ideas on machine learning, graph mining, and structured prediction with folks like Dhruv Batra and Aditya Prakash.

As for the inauguration of this blog, one of the reasons I’ve been wanting to start blogging for a while is that every now and then I come up with some idea or an analysis of a well-known idea that is either too minor, interesting but not useful enough, or too flat-out wrong to publish in a scientific manuscript but would be fun to share with folks. I’ll consider this an experiment to see if writing in this context is interesting for me and for readers.