# Data Science Research for Social Good

In a few weeks, there will be a workshop at KDD on data science for social good, run by Rayid Ghani and folks. Rayid has also been heading the Data Science for Social Good fellowship program the past couple years.

I really love the whole mission behind the workshop and fellowship (and Stephen H. Bach, Lise Getoor, and I have submitted a short paper to the workshop on how probabilistic soft logic is particularly useful for problems that have social benefit). But I wanted to point out a distinction that I think will help bring in more of the best minds in machine learning research to applications with potential to do social good.

My view of the mission behind the existing effort to encourage social good applications of data science is that it is driven by the question, “What can state-of-the-art data science do to benefit society?”

This is an unarguably important question, and answers to this question should improve the lives of many immediately. The projects that are being tackled by the Data Science for Social Good Fellows should have measurable impact on the well-being of people now, in 2014. This type of immediate impact is crucially missing from most research, making it rather attractive to spend research effort on these problems.

On the other hand, what can easily be lost in the excitement over immediate tangible impact is the advancement of scientific knowledge, in algorithms or theory. When working on applied machine learning research, one typically explores what known, existing methods can be applied to novel problems.

Because of this tendency, I suggest emphasizing another important question, which exists in all novel applications to some degree, but is often paid less attention. The question is, “What can’t state-of-the-art data science do to benefit society?” I.e., what are we missing, that we need new algorithms, new models, or new theory to support?

We have developed a huge toolbox of machine learning over the past few decades, so the first question is a completely valid one to focus solely on, but it happens so often that the form of data relevant to new problems is intrinsically different from the assumed form of known tools. This discrepancy means that we often need new machine learning research to do data analysis, whether that is truly new, undiscovered research, or simply new research that has only recently been discovered.

For example, human data is inherently social and thus contains statistical dependencies that make it an ill-fit for traditional statistical tools that rely on independent samples. While machine learning research has developed methods to handle heavily dependent and structured data over the past decade, these methods are often overlooked in practice. The technology transfer has not been fully or successfully executed between researchers and practitioners.

Part of the reason for all the excitement over data science is that everyone is noticing the gap created as knowledge of machine learning and analytic tools is outpacing their application in the real world. Most of the energy now is in filling that gap by finding new applications, and this direction is completely justified. We have not filled this gap yet as a whole, but as we start using machine learning on new applications, we will start inverting parts of this gap by identifying situations where known machine learning tools are insufficient for the problems we want to solve. These situations are where a lot of important scientific research can happen, and I can think of few better settings where I want this to happen than the context of social good.

# Research in “the cloud” and the honeymoon phase

A few days ago, I watched a TED talk by physicist Uri Alon on the emotional experience of scientific research. It is a fun talk, and I thought the message is important.

See the video here.

Uri argues that science is presented in a manner that poorly sets up expectations for scientists: the presented story is that we scientists brilliantly devise a hypothesis, run some experiments, and verify this hypothesis. Then we hit print and science is done. Instead, Uri sheds light on the actual process, which involves a taxing trek through what he calls “the cloud.” The real story is that a scientist has a hypothesis, tests it, finds that it is wrong, and then enters this cloud, which is a place of uncertainty, despair, and confusion. The scientist has to challenge his or her basic assumptions, and only by doing so, will be able to produce a new hypothesis that truly advances human knowledge.

Though Uri is a physicist, I think this message is just as relevant for computer science. I can’t speak for all fields within computer science, but at least in machine learning research, this is almost always the way progress occurs.

One problem of applying Uri’s message to computer science is the nomenclature. The cloud (or is it the Cloud?) is a buzzword that is already somewhat defined (but not very well). So using it in this context could lead to confusion (which is a bit too meta to be useful). We need another term for the computer science version of the cloud. Or we could just use it and stop using “the cloud” to describe the Internet, because we already have a word for that.

I would add to the process Uri describes another important concept to prepare researchers for the emotional roller coaster of science: the honeymoon phase.

The honeymoon phase is what I call the period when I’ve come up with an idea, perhaps it’s an algorithm, a theorem, or an application, that I think will work. As I start incrementally testing the idea–coding up the algorithm, for example–it starts to seem more and more correct. A subroutine of the algorithm works exactly as it should on paper! The learning algorithm correctly learns from a synthetic data set! If we assume a lemma, the proof actually works out! These small victories invoke a sense of euphoria and often come with daydreams of how easily this new research will lead to publication and recognition.

In reality, the honeymoon phase is almost always followed by a discovery that something is wrong, which leads to a sharp turn directly into the cloud. This contrast from the highs of the honeymoon phase to the lows of the cloud is jarring.

Like the message from the TED talk, I believe acknowledging that this sharp transition is part of a shared experience could help limit the emotional distress cause by the process of research. I’m not sure if there is any particular strategy for intelligently handling the highs of the honeymoon phase better, and I’m hesitant to suggest to anyone not to enjoy it while it’s happening.

Next time on Terrible Emotions in Science: Rejection…

# Negative Free Energy and the Log Partition Function via Jensen’s Inequality

For years, I’ve been using the duality between the log partition function and the negative Gibbs free energy, hoping that one day I’d gain some intuition about why these two quantities are so intrinsically tied. Though it still seems somewhat mystical to me, I was messing around with some math a few weeks ago and came across one connection that helps a bit. The duality—which is used in lots of machine learning analyses—reasons about the log partition function $\log Z(\theta)$, where

$p(x) := \frac{1}{Z(\theta)} \exp\left(\langle \theta, \phi(x) \rangle \right)$

and

$Z(\theta) := \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right)$.

The famous duality is

$\log Z(\theta) = \sup_{\mu \in {\cal M}} \langle \theta, \mu \rangle + H(\mu)$,

where ${\cal M}$ is the marginal polytope, and $H(\mu)$ is the entropy of the distribution with marginals $\mu$. The marginal polytope is the convex hull of all possible marginal vectors, or more generally, the convex hull of all possible expectations of features $\phi$. That is $\mu$ is in $\cal M$ if and only if there exists a distribution $q$ whose expected feature value is $\mu$:

${\cal M} := \{ \mu : \exists q, \mu = \mathbb{E} \left[\phi(x) \right] \}$

This duality is often justified as an instance of Fenchel duality, which comes from some notions of convex conjugacy. I’m convinced these concepts are useful and intuitive to some people, since people often say it’s a Fenchel dual and just leave it at that. But to me, this terminology is rote jargon that I haven’t built a good intuition for in my many years of staring at this duality.

A side note: the quantity on the right-hand side of the duality, which combines a log-potential energy and an entropy is known as the negative Gibbs free energy, and it is strongly connected to certain variational methods related to belief propagation.

Using the generalization of marginals to expectations, we can write an equivalent duality, which is what I will analyze with Jensen’s inequality. I believe Jensen’s inequality is much easier to think about at an intuitive level. (If you’ve done derivations for variational inference, you might see where I’m headed right away.)

$\log Z(\theta) = \sup_{q \in \Delta} \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right] + H(q)$

where $\Delta$ is the simplex of all valid distributions. This is the version of the duality I’ll derive. Let’s start with the log partition function

$\log Z(\theta) = \log \sum_{x} \exp \left( \langle \theta, \phi(x) \rangle \right)$.

Multiplying the terms in the summation by $q(x) / q(x) = 1$, we get the equivalent form

$\log Z(\theta) = \log \sum_x \frac{q(x)}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) = \log ~ \mathbb{E}_{q(x)} \left[ \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]$.

Here’s where we use Jensen’s inequality in the usual trick from lots of variational analyses. We move the convex $\log$ function inside the expectation:

\begin{aligned} \log Z(\theta) &\ge \mathbb{E}_{q(x)} \left[ \log \frac{1}{q(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\ &= \mathbb{E}_{q(x)} \left[ \log \exp \left( \langle \theta, \phi(x) \rangle \right) - \log q(x) \right]\\ &= \mathbb{E}_{q(x)} \left[ \langle \theta, \phi(x) \rangle \right) ] - \mathbb{E}_{q(x)} \left[ \log q(x) \right]\\ &\equiv \langle \theta, \mu \rangle + H(\mu) \end{aligned}.

Thus, simplifying the application of Jensen’s to the log-partition gets us the lower bound on the log partition, which holds for any distribution $q$ or any marginal vector $\mu$. The last remaining step to show the equality at the supremum is to show that there exists a setting of $q$ that achieves equality. It’s pretty easy to guess what that is. Let $q = p$. Then
\begin{aligned} & \mathbb{E}_{p(x)} \left[ \log \frac{1}{p(x)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\ &= \mathbb{E}_{p(x)} \left[ \log \frac{Z(\theta)}{\exp \left( \langle \theta, \phi(x) \rangle \right)} \exp \left( \langle \theta, \phi(x) \rangle \right) \right]\\ &= \mathbb{E}_{p(x)} \left[ \log Z(\theta) \right] = \log Z(\theta) \end{aligned},
since $Z(\theta)$ is constant with respect to $x$.

The problem with this analysis, as much as I like it, is that it depends on having an intuitive trust of Jensen’s inequality. I think I have that intuitive trust, and I think many of the folks I talk to about this stuff do too. Jensen’s inequality is really just an observation about the definition of convexity, and as such, I’m able to reason about it naturally. I don’t have the same feeling about convex conjugacy, though I suspect since this duality is so straightforward via Jensen’s that convex conjugacy is just as obvious once you “get it.”

I’ve never seen this analysis written out before in this context. It’s nearly identical to many derivations for variational inference methods, so it’s relatively elementary and merely at the level of an advanced homework exercise. But every time I’ve seen this duality discussed, it’s justified via Fenchel duality identities and convex conjugacy.

Let me know if I messed anything up here, or if you have any good pointers to help improve my intuition about the Fenchel dual view of this same identity.

# Inaugural Post and Announcement

I’ve been meaning to start a blog for a while now (in fact, I registered this WordPress sometime in 2011), so since I have a big announcement, this is about as good a time as any. I’ll start with the announcement:

I’m announcing that starting in December 2014, I’ll be joining Virginia Tech as an Assistant Professor in the Department of Computer Science.

I’m naturally thrilled about this next phase of my career. There are many things about Virginia Tech that make it a great place for me. Over the past couple years, I’ve been collaborating with a team comprised primarily of researchers from Virginia Tech, led by Professor Naren Ramakrishnan, on applying machine learning to predict civil unrest using public data. This interdisciplinary project has been a major push of VT’s Discovery Analytics Center (DAC), which I will be officially joining as a faculty member. DAC has a strong component of new faculty, including a few recent hires who share research interests with me. I look forward to discussing ideas on machine learning, graph mining, and structured prediction with folks like Dhruv Batra and Aditya Prakash.

As for the inauguration of this blog, one of the reasons I’ve been wanting to start blogging for a while is that every now and then I come up with some idea or an analysis of a well-known idea that is either too minor, interesting but not useful enough, or too flat-out wrong to publish in a scientific manuscript but would be fun to share with folks. I’ll consider this an experiment to see if writing in this context is interesting for me and for readers.