Data Science Research for Social Good

In a few weeks, there will be a workshop at KDD on data science for social good, run by Rayid Ghani and folks. Rayid has also been heading the Data Science for Social Good fellowship program the past couple years.

I really love the whole mission behind the workshop and fellowship (and Stephen H. Bach, Lise Getoor, and I have submitted a short paper to the workshop on how probabilistic soft logic is particularly useful for problems that have social benefit). But I wanted to point out a distinction that I think will help bring in more of the best minds in machine learning research to applications with potential to do social good.

My view of the mission behind the existing effort to encourage social good applications of data science is that it is driven by the question, “What can state-of-the-art data science do to benefit society?”

This is an unarguably important question, and answers to this question should improve the lives of many immediately. The projects that are being tackled by the Data Science for Social Good Fellows should have measurable impact on the well-being of people now, in 2014. This type of immediate impact is crucially missing from most research, making it rather attractive to spend research effort on these problems.

On the other hand, what can easily be lost in the excitement over immediate tangible impact is the advancement of scientific knowledge, in algorithms or theory. When working on applied machine learning research, one typically explores what known, existing methods can be applied to novel problems.

Because of this tendency, I suggest emphasizing another important question, which exists in all novel applications to some degree, but is often paid less attention. The question is, “What can’t state-of-the-art data science do to benefit society?” I.e., what are we missing, that we need new algorithms, new models, or new theory to support?

We have developed a huge toolbox of machine learning over the past few decades, so the first question is a completely valid one to focus solely on, but it happens so often that the form of data relevant to new problems is intrinsically different from the assumed form of known tools. This discrepancy means that we often need new machine learning research to do data analysis, whether that is truly new, undiscovered research, or simply new research that has only recently been discovered.

For example, human data is inherently social and thus contains statistical dependencies that make it an ill-fit for traditional statistical tools that rely on independent samples. While machine learning research has developed methods to handle heavily dependent and structured data over the past decade, these methods are often overlooked in practice. The technology transfer has not been fully or successfully executed between researchers and practitioners.

Part of the reason for all the excitement over data science is that everyone is noticing the gap created as knowledge of machine learning and analytic tools is outpacing their application in the real world. Most of the energy now is in filling that gap by finding new applications, and this direction is completely justified. We have not filled this gap yet as a whole, but as we start using machine learning on new applications, we will start inverting parts of this gap by identifying situations where known machine learning tools are insufficient for the problems we want to solve. These situations are where a lot of important scientific research can happen, and I can think of few better settings where I want this to happen than the context of social good.

Inaugural Post and Announcement

I’ve been meaning to start a blog for a while now (in fact, I registered this WordPress sometime in 2011), so since I have a big announcement, this is about as good a time as any. I’ll start with the announcement:

I’m announcing that starting in December 2014, I’ll be joining Virginia Tech as an Assistant Professor in the Department of Computer Science.

I’m naturally thrilled about this next phase of my career. There are many things about Virginia Tech that make it a great place for me. Over the past couple years, I’ve been collaborating with a team comprised primarily of researchers from Virginia Tech, led by Professor Naren Ramakrishnan, on applying machine learning to predict civil unrest using public data. This interdisciplinary project has been a major push of VT’s Discovery Analytics Center (DAC), which I will be officially joining as a faculty member. DAC has a strong component of new faculty, including a few recent hires who share research interests with me. I look forward to discussing ideas on machine learning, graph mining, and structured prediction with folks like Dhruv Batra and Aditya Prakash.

As for the inauguration of this blog, one of the reasons I’ve been wanting to start blogging for a while is that every now and then I come up with some idea or an analysis of a well-known idea that is either too minor, interesting but not useful enough, or too flat-out wrong to publish in a scientific manuscript but would be fun to share with folks. I’ll consider this an experiment to see if writing in this context is interesting for me and for readers.