Stephen Bach (http://stephenbach.net) pointed out to me that Wainwright and Jordan include a proof of the duality using Jensen's inequality in a later section of their famous 2008 Foundations and Trends in Machine Learning monograph, buried in an analysis of mean-field methods. I still think this derivation is the first one that should be presented, especially to machine learning audiences. To me, Jensen's inequality is very natural: it is just a restatement of the definition of convexity. Anand Sarwate wrote a post (http://ergodicity.net/2014/10/31/fenchel-duality-entropy-and-the-log-partition-function/) on how straightforward this duality is if you're comfortable with Fenchel duality, but I just don't have as good an intuition for Fenchel's inequality as I do for Jensen's. Maybe that's my own thing that I just need to deal with…
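For concreteness, here is a minimal numerical sketch of the Jensen-based bound on a finite domain: writing $\log Z = \log \sum_x e^{s(x)}$, Jensen's inequality gives $\log Z \ge \mathbb{E}_q[s] + H(q)$ for any distribution $q$, with equality at the Gibbs distribution $q(x) \propto e^{s(x)}$. (The scores and variable names below are illustrative, not from any particular model.)

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=6)                 # arbitrary scores s(x) over a finite domain
logZ = np.log(np.sum(np.exp(s)))       # log-partition function

# Jensen: for ANY distribution q, log Z >= E_q[s(x)] + H(q)
q = rng.dirichlet(np.ones(6))          # an arbitrary distribution q
bound = q @ s - q @ np.log(q)          # E_q[s] + entropy of q
assert bound <= logZ + 1e-12

# Equality holds at the Gibbs distribution p(x) = exp(s(x)) / Z
p = np.exp(s - logZ)
tight = p @ s - p @ np.log(p)
assert abs(tight - logZ) < 1e-9
```

Maximizing the right-hand side over $q$ recovers $\log Z$ exactly, which is the variational/dual characterization the post is about.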
It's probably silly to want that intuition anyway: when our intuitions line up with how numbers behave in high-dimensional space, it's usually just coincidence. In my experience, the rate at which high-dimensional math and intuition agree suggests that the two are completely uncorrelated.
