Optimization of Expected Value

Framework

Say we have a function $f(x)$ that maps some high-dimensional space to a real value. In a conventional optimization approach, we would try to minimize $f(x)$ directly, and the relevant gradient would be $\nabla f(x)$. An alternative approach seeks to minimize the expected value of $f$ with respect to a probability distribution. In particular, assume we have a distribution $p_\theta$ over the domain of $f$ that’s parametrized by $\theta$. Unlike the direct approach, we can compute the gradient of this expectation with respect to $\theta$ without differentiating through $f$, so $f$ only needs to be evaluated, not differentiated.

$\begin{aligned} \nabla_\theta \mathbb{E}_{p_\theta}[f(x)] &= \nabla_\theta \sum_x p_\theta(x) f(x) \\ &= \sum_x \nabla_\theta p_\theta(x) f(x) \\ &= \sum_x \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)} p_\theta(x) f(x) \\ &= \sum_x \nabla_\theta [ \ln p_\theta(x) ] p_\theta(x) f(x) \\ &= \mathbb{E}_{p_\theta}[\nabla_\theta \ln p_\theta(x) f(x)] \end{aligned}$
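
The derivation above writes the expectation as a sum, which fits a discrete $x$. For a continuous distribution such as a Gaussian, the same steps should go through with integrals, under the assumption that $p_\theta$ is regular enough in $\theta$ for the gradient to be moved inside the integral:

$\begin{aligned} \nabla_\theta \mathbb{E}_{p_\theta}[f(x)] &= \nabla_\theta \int p_\theta(x) f(x) \, dx \\ &= \int \nabla_\theta [ \ln p_\theta(x) ] \, p_\theta(x) f(x) \, dx \\ &= \mathbb{E}_{p_\theta}[\nabla_\theta \ln p_\theta(x) f(x)] \end{aligned}$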

So as long as we can draw samples $x_1, \dots, x_N$ from $p_\theta$ and evaluate $\nabla_\theta \ln p_\theta$ at them, the sample average is an unbiased estimator of the gradient.

$\nabla_\theta \mathbb{E}_{p_\theta}[f(x)] \approx \frac{1}{N} \sum_{i = 1}^N \nabla_\theta \ln p_\theta(x_i) f(x_i)$

So we can do gradient descent on $\theta$ to improve our distribution, that is, to lower the expected value of $f$ under it.
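
As a rough illustration of the estimator and the descent loop, here is a minimal NumPy sketch. It is not taken from the text: it assumes a small categorical distribution with a softmax parametrization, for which $\nabla_\theta \ln p_\theta(x)$ is the one-hot vector of $x$ minus $p_\theta$. The table `f_values`, the step size, the sample counts, and all function names are arbitrary choices made for the example.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax: p_theta over a finite set of outcomes.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def score_function_gradient(theta, f_values, n_samples, rng):
    # Monte Carlo estimate of grad_theta E_{p_theta}[f(x)].
    p = softmax(theta)
    xs = rng.choice(len(theta), size=n_samples, p=p)
    # For a softmax, grad_theta ln p_theta(x) = onehot(x) - p.
    scores = np.eye(len(theta))[xs] - p
    # Average of grad ln p(x_i) * f(x_i); f is only evaluated, never differentiated.
    return (scores * f_values[xs, None]).mean(axis=0)

def exact_gradient(theta, f_values):
    # Closed form for this toy case: p_j * (f_j - E[f]).
    p = softmax(theta)
    return p * (f_values - p @ f_values)

rng = np.random.default_rng(0)
f_values = np.array([3.0, 1.0, 4.0, 2.0])  # hypothetical objective, one value per outcome
theta = np.zeros(4)

# Compare the sampled estimate with the exact gradient at theta = 0.
estimate = score_function_gradient(theta, f_values, 200_000, rng)
exact = exact_gradient(theta, f_values)

# Plain gradient descent on theta using the noisy estimate.
theta_opt = theta.copy()
initial_value = softmax(theta_opt) @ f_values
for _ in range(200):
    theta_opt -= 0.5 * score_function_gradient(theta_opt, f_values, 1000, rng)
final_value = softmax(theta_opt) @ f_values
```

With enough samples, `estimate` should sit close to `exact`, and the descent loop should shift probability mass toward the outcome with the smallest value of `f_values`, lowering the expected value.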

Using a Multivariate Gaussian

Suppose our distribution is a multivariate Gaussian over $\mathbb{R}^k$ with mean $\mu$ and covariance matrix $\Sigma$.

$\begin{aligned} p_{\mu, \Sigma}(x) &= (2 \pi)^{-\frac{k}{2}} \det(\Sigma)^{-\frac{1}{2}} e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)} \\ \ln p_{\mu, \Sigma}(x) &= -\frac{k}{2} \ln(2 \pi) - \frac{1}{2} \ln \det \Sigma - \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \\ \nabla_\mu \ln p_{\mu, \Sigma}(x) &= \Sigma^{-1} (x - \mu) \\ \nabla_\Sigma \ln p_{\mu, \Sigma}(x) &= - \frac{1}{2} \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} (x - \mu) (x - \mu)^T \Sigma^{-1} \end{aligned}$
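
Closed-form gradients like these are easy to get wrong by a sign or a transpose, so a finite-difference check is a cheap safeguard. The sketch below is an illustration, not part of the original derivation: it compares $\nabla_\mu \ln p = \Sigma^{-1}(x - \mu)$ and $\nabla_\Sigma \ln p = -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}(x - \mu)(x - \mu)^T\Sigma^{-1}$ against central differences of the log-density. It assumes the entries of $\Sigma$ are treated as independent variables, ignoring the symmetry constraint, and the test point, dimension, and function names are made up for the example.

```python
import numpy as np

def log_density(x, mu, sigma):
    # ln p_{mu,Sigma}(x) for a k-dimensional Gaussian.
    k = len(mu)
    d = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(sigma, d))

def grad_mu(x, mu, sigma):
    # Sigma^{-1} (x - mu), computed without forming the inverse.
    return np.linalg.solve(sigma, x - mu)

def grad_sigma(x, mu, sigma):
    # -1/2 Sigma^{-1} + 1/2 Sigma^{-1} (x - mu)(x - mu)^T Sigma^{-1}
    sigma_inv = np.linalg.inv(sigma)
    d = x - mu
    return -0.5 * sigma_inv + 0.5 * sigma_inv @ np.outer(d, d) @ sigma_inv

def numerical_grad(fn, a, eps=1e-6):
    # Central differences, perturbing one entry of `a` at a time.
    g = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        step = np.zeros_like(a)
        step[idx] = eps
        g[idx] = (fn(a + step) - fn(a - step)) / (2 * eps)
    return g

# Arbitrary test point with a symmetric positive definite covariance.
rng = np.random.default_rng(0)
k = 3
a = rng.normal(size=(k, k))
sigma = a @ a.T + 3.0 * np.eye(k)
mu = rng.normal(size=k)
x = rng.normal(size=k)

num_mu = numerical_grad(lambda m: log_density(x, m, sigma), mu)
num_sigma = numerical_grad(lambda s: log_density(x, mu, s), sigma)

print(np.max(np.abs(grad_mu(x, mu, sigma) - num_mu)))
print(np.max(np.abs(grad_sigma(x, mu, sigma) - num_sigma)))
```

Both printed differences should be close to zero if the formulas and the code agree. In an actual optimizer, $\Sigma$ is usually updated through a parametrization that keeps it symmetric positive definite (for example a Cholesky factor), which this check does not cover.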