Approximate MAP Inference

by Nicholas Ruozzi

Recall that the MAP inference task is to compute the maximizing assignment of a (conditional) probability distribution. \begin{eqnarray} {\arg \max}_{x} p(x) \end{eqnarray} We saw how to compute the MAP assignment when the probability distribution factorizes over a tree-structured graphical model by performing variable elimination starting at the leaves of the tree and working our way to the root. However, if the graph is not a tree, the treewidth may grow with the number of nodes in the graph, in which case there may be no elimination order that leads to an efficient MAP inference scheme. In this section, we will take a closer look at MAP inference in loopy MRFs and develop methods to perform approximate MAP inference.

We begin with the simple observation that the maximum value of a function is always at least as large as its average value. More generally, given a function $f(x)$ and a probability distribution $q(x)$, \begin{eqnarray} \max_{x} f(x) \geq \sum_x q(x)f(x). \end{eqnarray} The inequality is tight (i.e., is satisfied with equality) whenever $q(x)$ is a probability distribution that puts all of its mass on some assignment $x^* \in \arg\max_x f(x)$. Combining this with the observation that $\log(x)$ is a monotonically increasing function, we can reformulate the MAP problem as \begin{eqnarray} {\arg \max}_{x} p(x) = {\arg \max_{q\in Q}} \sum_x q(x) \log p(x) \end{eqnarray} where $Q$ is the set of all probability distributions over the random vector $x$. The two problems are equivalent in the sense that every maximizing $q$ places all of its mass on MAP assignments, so a MAP assignment can be read off from any maximizer on the right-hand side.
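As a quick sanity check, the short Python sketch below (with an arbitrary toy function $f$ and arbitrary distributions, chosen only for illustration) verifies that the expectation of $f$ under any distribution $q$ never exceeds its maximum and that a point mass on a maximizer attains the bound.

```python
import numpy as np

# Toy check of max_x f(x) >= sum_x q(x) f(x); f and q are arbitrary choices.
rng = np.random.default_rng(0)
f = rng.normal(size=5)                 # f(x) for x in {0, ..., 4}

q_random = rng.dirichlet(np.ones(5))   # an arbitrary distribution q
q_point = np.zeros(5)
q_point[np.argmax(f)] = 1.0            # point mass on a maximizer of f

print(f.max())                         # max_x f(x)
print(q_random @ f)                    # expectation under q_random (never larger)
print(q_point @ f)                     # expectation under the point mass (equal)
```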

Now, suppose that $p$ is a probability distribution that factorizes over the set of cliques $C$ of some graph $G = (V, E)$. That is, $p(x) = \frac{1}{Z} \prod_{c\in C} \psi_c(x_c)$. Applying the above argument, the MAP problem can be expressed as follows. \begin{align*} {\arg \max}_{x} \log p(x) &= {\arg \max_{q\in Q}} \sum_x q(x) \log p(x)\\ & = {\arg \max_{q\in Q}} \sum_x q(x) \log \left(\frac{1}{Z}\prod_{c\in C} \psi_c(x_c)\right)\\ & = {\arg \max_{q\in Q}} \left[-\log Z + \sum_x \sum_{c\in C} q(x) \log \psi_c(x_c)\right]\\ & = {\arg \max_{q\in Q}} \sum_x \sum_{c\in C} q(x) \log \psi_c(x_c)\\ & = {\arg \max_{q\in Q}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \end{align*} where $q_c(x_c) \equiv \sum_{x':x'_c = x_c} q(x')$ is the marginal distribution of $q$ over the variables in the clique $c\in C$.
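The only step above that deserves a second look is the last one, which rewrites an expectation over the joint distribution $q$ as a sum of expectations over its clique marginals. The sketch below checks this identity numerically on a toy three-node chain with pairwise potentials; the chain structure, the potentials $\psi$, and the joint distribution $q$ are all assumptions made for illustration.

```python
import itertools
import numpy as np

# Check: sum_x q(x) sum_c log psi_c(x_c) == sum_c sum_{x_c} q_c(x_c) log psi_c(x_c)
# on a toy 3-node chain with binary variables and pairwise (edge) potentials.
rng = np.random.default_rng(1)
d, edges = 2, [(0, 1), (1, 2)]
psi = {e: rng.uniform(0.5, 2.0, size=(d, d)) for e in edges}

q = rng.dirichlet(np.ones(d ** 3)).reshape(d, d, d)   # an arbitrary joint q(x)

# Left-hand side: expectation of the log potentials under the joint q.
lhs = sum(q[x] * sum(np.log(psi[(i, j)][x[i], x[j]]) for (i, j) in edges)
          for x in itertools.product(range(d), repeat=3))

# Right-hand side: the same quantity written in terms of the edge marginals q_c.
rhs = 0.0
for (i, j) in edges:
    q_ij = q.sum(axis=tuple(k for k in range(3) if k not in (i, j)))
    rhs += np.sum(q_ij * np.log(psi[(i, j)]))

print(lhs, rhs)   # the two values agree up to floating point error
```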

The Marginal Polytope

At first glance, the new optimization problem \begin{eqnarray} {\arg \max_{q\in Q}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \end{eqnarray} appears to be more difficult than the original MAP optimization problem as, instead of optimizing over all possible assignments to the random vector $x$, we must now optimize over $Q$, the set of all possible probability distributions over $x$. As we observed earlier, we do not need to optimize over the entire set $Q$: it would suffice to optimize over only those distributions that place all of their mass on a single assignment. If $x\in \{1,\ldots,d\}^n$, then there are $d^n$ possible distributions with this property. We also observe that the optimization problem only requires certain marginals of the distribution $q$ and not the entire joint distribution. Every marginal distribution that arises from some joint distribution $q$ must satisfy several constraints. First, the marginal distributions must agree on their overlap. That is, if $i \in c$ and $i \in c'$, then \begin{align*} \sum_{x'_c:x'_i = x_i} q_c(x'_c) = q_i(x_i) = \sum_{x'_{c'}:x'_i = x_i} q_{c'}(x'_{c'}). \end{align*} Second, to be a probability distribution, each node marginal $q_i$ must sum to one. \begin{align*} \sum_{x_i} q_i(x_i) = 1 \end{align*}

The marginal polytope, $M$, is defined to be the collection of all marginal distributions $(q_{i\in V}, q_{c\in C})$ that satisfy the above constraints and place all of their mass on a single assignment. More formally, $(q_{i\in V}, q_{c\in C}) \in M$ if \begin{align*} &\text{For all }c\in C, i\in c, x_i\in\{1,\ldots,d\},& \sum_{x'_c:x'_i = x_i} q_c(x'_c) = q_i(x_i)\\ &\text{For all }i\in V,& \sum_{x_i} q_i(x_i) = 1\\ &\text{For all }i\in V, x_i\in\{1,\ldots,d\},& q_i(x_i)\in \{0,1\}\\ &\text{For all }c\in C, x_c\in\{1,\ldots,d\}^{|c|},& q_c(x_c)\in \{0,1\}. \end{align*}
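To make the constraints concrete, here is a rough membership test for $M$ in the pairwise case, where every clique is an edge $(i,j)$, node marginals are length-$d$ vectors, and edge marginals are $d \times d$ arrays; the function names and data layout are assumptions made only for this sketch.

```python
import numpy as np

def is_integral(a, tol=1e-9):
    # every entry must be (numerically) either 0 or 1
    return np.all(np.minimum(np.abs(a), np.abs(a - 1.0)) <= tol)

def in_marginal_polytope(q_nodes, q_edges, tol=1e-9):
    for q_i in q_nodes.values():
        # node marginals must sum to one and be {0,1}-valued
        if abs(q_i.sum() - 1.0) > tol or not is_integral(q_i, tol):
            return False
    for (i, j), q_ij in q_edges.items():
        if not is_integral(q_ij, tol):
            return False
        # edge marginals must agree with both endpoint marginals on the overlap
        if np.any(np.abs(q_ij.sum(axis=1) - q_nodes[i]) > tol):
            return False
        if np.any(np.abs(q_ij.sum(axis=0) - q_nodes[j]) > tol):
            return False
    return True

# Example: the marginals of the point mass on the assignment (0, 1)
# for a two-node model with a single edge.
q_nodes = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
q_edges = {(0, 1): np.array([[0.0, 1.0], [0.0, 0.0]])}
print(in_marginal_polytope(q_nodes, q_edges))   # True
```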

Theorem: $(q_{i\in V}, q_{c\in C}) \in M$ if and only if there exists a probability distribution $q'\in Q$ that places all of its mass on a single assignment and whose marginals are given by $(q_{i\in V}, q_{c\in C})$.

The proof of the theorem is relatively straightforward. First, given some $(q_{i\in V}, q_{c\in C}) \in M$, we can construct $q'\in Q$ as follows. For each $i$, choose $y_i \in \arg \max_{x_i} q_i(x_i)$. Since each $q_i$ is integral and sums to one, there is exactly one $x_i \in \{1,\dots, d\}$ with $q_i(x_i) = 1$, so $y_i$ is uniquely determined. Let $q'$ be the probability distribution that places all of its mass on the assignment $y$. We can easily verify that the marginals of $q'$ are exactly those given by $q_{i\in V}$ and $q_{c\in C}$.

For the other direction, given a $q' \in Q$ that places all of its mass on a single assignment, we select $q_i(x_i) = q'_i(x_i)$ for all $i\in V$ and all $x_i\in\{1,\ldots,d\}$ and $q_c(x_c) = q'_c(x_c)$ for all $c\in C, x_c\in\{1,\ldots,d\}^{|c|}$. This choice of marginals is in $M$ essentially by definition: each marginal is $\{0,1\}$-valued and the marginals agree on their overlaps.

The above argument allows us to reformulate the MAP problem as \begin{eqnarray} {\arg \max_{q\in Q}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) = {\arg \max_{(q_{i\in V}, q_{c\in C}) \in M}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c). \end{eqnarray} For simplicity we often write this as ${\arg \max_{q \in M}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$.
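Since the elements of $M$ correspond exactly to assignments, the reformulated problem can still be solved by brute force when the model is tiny, simply by enumerating assignments. The sketch below does so for the same toy chain used earlier.

```python
import itertools
import numpy as np

# Exact MAP by enumerating all assignments (the elements of M) on the toy chain.
rng = np.random.default_rng(1)
d, edges = 2, [(0, 1), (1, 2)]
psi = {e: rng.uniform(0.5, 2.0, size=(d, d)) for e in edges}

def objective(x):
    # sum_c log psi_c(x_c) for the point-mass marginals induced by the assignment x
    return sum(np.log(psi[(i, j)][x[i], x[j]]) for (i, j) in edges)

x_map = max(itertools.product(range(d), repeat=3), key=objective)
print(x_map, objective(x_map))   # a MAP assignment and its (unnormalized) log score
```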

The Local Marginal Polytope

Finally, we have all of the ingredients that we need in order to construct an approximate version of the MAP optimization problem. The optimization problem \begin{eqnarray} {\arg \max_{(q_{i\in V}, q_{c\in C}) \in M}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \end{eqnarray} is sometimes referred to as an integer programming problem. Integer programming problems are constrained optimization problems in which you are asked to maximize a linear function over vectors of integers subject to a series of linear constraints. Integer programming problems are NP-hard in general. However, linear programming problems, in which you are asked to maximize a linear function over vectors of real numbers subject to a series of linear constraints, are known to be solvable in polynomial time. We can always obtain a linear programming problem from an integer programming problem by relaxing the integrality constraint. Specifically, consider the local marginal polytope, $L$, which is obtained by relaxing the integrality constraint in the marginal polytope. A collection of marginals $(q_{i\in V}, q_{c\in C}) \in L$ if \begin{align*} &\text{For all }c\in C, i\in c, x_i\in\{1,\ldots,d\},& \sum_{x'_c:x'_i = x_i} q_c(x'_c) = q_i(x_i)\\ &\text{For all }i\in V,& \sum_{x_i} q_i(x_i) = 1\\ &\text{For all }i\in V, x_i\in\{1,\ldots,d\},& q_i(x_i)\in [0,1]\\ &\text{For all }c\in C, x_c\in\{1,\ldots,d\}^{|c|},& q_c(x_c)\in [0,1]. \end{align*}
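To see what the relaxation looks like in practice, the sketch below solves the linear program over $L$ for the same toy pairwise chain using scipy's linprog; the variable layout and the choice of this particular solver are illustrative assumptions rather than part of the method.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# LP relaxation over the local marginal polytope L for the toy 3-node chain.
rng = np.random.default_rng(1)
n, d, edges = 3, 2, [(0, 1), (1, 2)]
psi = {e: rng.uniform(0.5, 2.0, size=(d, d)) for e in edges}

# Variable layout: node marginals q_i(x_i) first, then edge marginals q_ij(x_i, x_j).
def node_idx(i, xi):
    return i * d + xi

def edge_idx(e, xi, xj):
    return n * d + edges.index(e) * d * d + xi * d + xj

num_vars = n * d + len(edges) * d * d

# Objective: maximize sum_c sum_{x_c} q_c(x_c) log psi_c(x_c); linprog minimizes,
# so we negate the coefficients (only edge potentials appear in this toy model).
c = np.zeros(num_vars)
for e in edges:
    for xi, xj in itertools.product(range(d), repeat=2):
        c[edge_idx(e, xi, xj)] = -np.log(psi[e][xi, xj])

A_eq, b_eq = [], []
for i in range(n):
    # normalization: sum_{x_i} q_i(x_i) = 1
    row = np.zeros(num_vars)
    row[[node_idx(i, xi) for xi in range(d)]] = 1.0
    A_eq.append(row)
    b_eq.append(1.0)
for (i, j) in edges:
    # consistency: the edge marginal must agree with both endpoint marginals
    for xi in range(d):
        row = np.zeros(num_vars)
        for xj in range(d):
            row[edge_idx((i, j), xi, xj)] = 1.0
        row[node_idx(i, xi)] = -1.0
        A_eq.append(row)
        b_eq.append(0.0)
    for xj in range(d):
        row = np.zeros(num_vars)
        for xi in range(d):
            row[edge_idx((i, j), xi, xj)] = 1.0
        row[node_idx(j, xj)] = -1.0
        A_eq.append(row)
        b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0.0, 1.0))
print(-res.fun)   # an upper bound on max_x sum_c log psi_c(x_c)
```

The optimal value of this linear program upper bounds the true MAP value; when the returned marginals happen to be integral, they lie in $M$ and therefore encode an exact MAP assignment.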

Notice that the only difference between $M$ and $L$ lies in the final two constraints. As $M \subseteq L$, we must have that \begin{eqnarray} {\max_{(q_{i\in V}, q_{c\in C}) \in M}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \leq {\max_{(q_{i\in V}, q_{c\in C}) \in L}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c). \end{eqnarray} These two optimization problems sometimes have the same optimal value (for example, the relaxation is always tight when $G$ is a tree), though this is typically not the case. However, in practice we are often willing to settle for an upper bound on the MAP problem, as linear programming problems can be solved in polynomial time while exact MAP inference does not typically admit a polynomial time solution for arbitrary MRFs.

Reparameterizations and the MAP Problem

We can also approximate the MAP problem using reparameterizations of the graphical model. See the lecture slides for more details.