Data Science Research for Social Good

In a few weeks, there will be a workshop at KDD on data science for social good, run by Rayid Ghani and folks. Rayid has also been heading the Data Science for Social Good fellowship program for the past couple of years.

I really love the whole mission behind the workshop and fellowship (and Stephen H. Bach, Lise Getoor, and I have submitted a short paper to the workshop on how probabilistic soft logic is particularly useful for problems that have social benefit). But I wanted to point out a distinction that I think will help bring in more of the best minds in machine learning research to applications with potential to do social good.

My view of the mission behind the existing effort to encourage social good applications of data science is that it is driven by the question, “What can state-of-the-art data science do to benefit society?”

This is an unquestionably important question, and answers to it should immediately improve the lives of many people. The projects being tackled by the Data Science for Social Good Fellows should have measurable impact on the well-being of people now, in 2014. This kind of immediate impact is crucially missing from most research, which makes these problems rather attractive places to spend research effort.

On the other hand, what can easily be lost in the excitement over immediate tangible impact is the advancement of scientific knowledge, in algorithms or theory. When working on applied machine learning research, one typically explores what known, existing methods can be applied to novel problems.

Because of this tendency, I suggest emphasizing another important question, one that exists in all novel applications to some degree but often receives less attention. The question is, “What can’t state-of-the-art data science do to benefit society?” That is, what are we missing that requires new algorithms, new models, or new theory?

We have developed a huge toolbox of machine learning methods over the past few decades, so it is completely valid to focus solely on the first question. But it happens very often that the form of the data relevant to a new problem is intrinsically different from the form that known tools assume. This discrepancy means that we often need new machine learning research to do the data analysis, whether that is research that has yet to be done or simply research that was done only recently.

For example, human data is inherently social and thus contains statistical dependencies that make it a poor fit for traditional statistical tools that rely on independent samples. Machine learning research has developed methods to handle heavily dependent and structured data over the past decade, but these methods are often overlooked in practice. The technology transfer between researchers and practitioners has not been fully or successfully carried out.
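To make the independence point concrete, here is a minimal sketch in Python. It uses made-up simulated data, not any real social data set, and the group structure, sizes, and function names are all my own assumptions for illustration. People in the same group share a common effect, so their values are dependent. The sketch compares the standard error you would compute by assuming independent samples against how much the sample mean actually varies across repeated simulations.

```python
import random
import statistics


def simulate_sample(n_groups, group_size, rng):
    """Draw one data set in which members of a group share a common effect."""
    values = []
    for _ in range(n_groups):
        group_effect = rng.gauss(0.0, 1.0)  # shared by everyone in the group
        for _ in range(group_size):
            values.append(group_effect + rng.gauss(0.0, 1.0))
    return values


def naive_standard_error(values):
    """Standard error of the mean, assuming (wrongly here) independent samples."""
    return statistics.stdev(values) / len(values) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_groups, group_size, n_trials = 20, 25, 500  # hypothetical sizes

means = []
naive_ses = []
for _ in range(n_trials):
    sample = simulate_sample(n_groups, group_size, rng)
    means.append(statistics.fmean(sample))
    naive_ses.append(naive_standard_error(sample))

# How much the sample mean really varies from one simulated data set to the next.
actual_spread = statistics.stdev(means)
# What the independence assumption claims that variation should be, on average.
average_naive_se = statistics.fmean(naive_ses)

print("actual spread of the mean:", round(actual_spread, 3))
print("average naive standard error:", round(average_naive_se, 3))
```

Under these assumptions the naive standard error comes out several times smaller than the actual spread of the mean, because 500 dependent values carry much less information than 500 independent ones. This is only a toy illustration of why the independence assumption matters, not a stand-in for the structured methods mentioned above.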

Part of the reason for all the excitement over data science is that everyone is noticing the gap created as knowledge of machine learning and analytic tools outpaces their application in the real world. Most of the energy now goes into filling that gap by finding new applications, and this direction is completely justified. We have not filled the gap as a whole yet, but as we start using machine learning on new applications, we will start inverting parts of it: we will identify situations where known machine learning tools are insufficient for the problems we want to solve. These situations are where a lot of important scientific research can happen, and I can think of few better settings for that research than social good.