Chapter 6 Variational Inference

6.1 Introduction

The previous chapters developed estimation theory from the bottom up: starting from the linear-Gaussian case, extending step by step to nonlinear and non-ideal situations, and arriving at a family of estimators (EKF, IEKF, SPKF, batch MAP, ...). This chapter switches to a top-down view: starting from a single unified probabilistic objective, we re-derive (and extend) these methods, and additionally unlock the ability to identify model parameters.

The core problem: for nonlinear models, the true Bayesian posterior $p(\mathbf{x}|\mathbf{z})$ is generally non-Gaussian and computationally intractable. The idea of **variational inference** is:

rather than computing the true posterior, search within the family of Gaussian distributions for the Gaussian $q(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ that is closest to the true posterior.

"Closest" is measured by the **KL divergence** (Kullback-Leibler divergence). Variational inference restricted to a Gaussian approximation is called **Gaussian variational inference (GVI)**.

Intuition: imagine you want to know the shape of a rugged valley (the true posterior), but surveying it exactly (exact computation) is too expensive. The variational strategy is to approximate the valley with an ellipsoid (a Gaussian) and find the ellipsoid that covers it best. MAP finds only the lowest point of the valley; GVI also finds the ellipsoid that best represents the valley's overall shape.

Key advantages of GVI over MAP:
- It returns a full Gaussian approximation (mean and covariance), not just a point estimate.
- The covariance accounts for the nonlinearity, rather than being merely a Laplace approximation at the mode.
- It can jointly estimate model parameters (system matrices, biases, noise covariances).
6.2 Gaussian Variational Inference

6.2.1 Loss Functional

The KL divergence comes in two directions, with different properties:

\text{KL}(p \| q) = E_p[\ln p(\mathbf{x}|\mathbf{z}) - \ln q(\mathbf{x})], \tag{6.4}

\text{KL}(q \| p) = E_q[\ln q(\mathbf{x}) - \ln p(\mathbf{x}|\mathbf{z})]. \tag{6.5}

We choose $\text{KL}(q \| p)$ because its expectation is taken over $q(\mathbf{x})$ — our own Gaussian estimate — and can therefore be evaluated by sampling or analytically; the expectation in $\text{KL}(p \| q)$ is over the intractable true posterior $p(\mathbf{x}|\mathbf{z})$.

Expanding $\text{KL}(q \| p)$ and dropping the terms that do not depend on $q$, we define the loss functional:

V(q) = \underbrace{E_q[\phi(\mathbf{x})]}_{\text{data fit}} + \underbrace{\frac{1}{2}\ln|\boldsymbol{\Sigma}^{-1}|}_{\text{entropy penalty}}, \tag{6.7}

where $\phi(\mathbf{x}) = -\ln p(\mathbf{x}, \mathbf{z})$ is the negative log-likelihood of the joint distribution.

Note that we describe $q$ by its precision (information) matrix $\boldsymbol{\Sigma}^{-1}$ rather than its covariance $\boldsymbol{\Sigma}$, because the precision matrix is sparse in batch estimation.

Intuition for the loss functional:
- The first term, $E_q[\phi(\mathbf{x})]$, encourages the Gaussian $q$ to concentrate where the data likelihood is high (data fit).
- The second term, $\frac{1}{2}\ln|\boldsymbol{\Sigma}^{-1}|$, equals the negative Gaussian entropy (up to a constant) and penalizes overly concentrated (overconfident) distributions.
- Their balance yields the optimal Gaussian approximation — the one closest to the true posterior in KL divergence.

$V(q)$ is, up to a constant, the negative of the so-called **evidence lower bound (ELBO)**.
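To make the data-fit vs. entropy trade-off concrete, here is a minimal sketch (not from the text) that evaluates $V(q)$ by 3-point Gauss-Hermite cubature for a toy joint with $\phi(x) = \frac{1}{2}x^2$, so the true posterior is exactly $\mathcal{N}(0,1)$; the names `phi` and `V` are made up for the illustration:

```python
import numpy as np

# Toy joint: phi(x) = 0.5 * x^2, so the true posterior is N(0, 1).
# Evaluate V(q) = E_q[phi(x)] + 0.5 * ln|Sigma^{-1}| for q = N(0, sigma^2),
# using 3-point Gauss-Hermite cubature (exact for polynomials up to degree 5).

def phi(x):
    return 0.5 * x**2

def V(sigma):
    # Gauss-Hermite points/weights for a standard normal, scaled to N(0, sigma^2)
    pts = sigma * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    data_fit = np.sum(wts * phi(pts))               # E_q[phi(x)]
    entropy_penalty = 0.5 * np.log(1.0 / sigma**2)  # 0.5 * ln|Sigma^{-1}|
    return data_fit + entropy_penalty

# Overconfident (sigma too small), matched, and underconfident candidates:
v_half, v_one, v_two = V(0.5), V(1.0), V(2.0)
```

An overconfident $q$ ($\sigma = 0.5$) is punished by the entropy term, an underconfident one ($\sigma = 2$) by the data-fit term; the loss is minimized at $\sigma = 1$, exactly the true posterior, as the theory predicts.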
6.2.2 Optimization Scheme

The goal is to minimize $V(q)$ with respect to the parameters $\{\boldsymbol{\mu}, \boldsymbol{\Sigma}^{-1}\}$ of $q$. Using Stein's lemma (see the appendix), the derivatives of the loss functional with respect to these parameters reduce to expectations of derivatives of $\phi$ itself:

\frac{\partial V(q)}{\partial \boldsymbol{\mu}^T} = \boldsymbol{\Sigma}^{-1} E_q[(\mathbf{x}-\boldsymbol{\mu})\phi(\mathbf{x})], \tag{6.8a}

leading to the iterative update scheme:

\frac{\partial^2 V(q)}{\partial \boldsymbol{\mu}^T \partial \boldsymbol{\mu}} = E_q\left[\frac{\partial^2 \phi(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}}\right] = \boldsymbol{\Sigma}^{-1(i+1)}, \tag{6.23a}

\boldsymbol{\Sigma}^{-1(i+1)} \delta\boldsymbol{\mu} = -E_{q^{(i)}}\left[\frac{\partial \phi(\mathbf{x})}{\partial \mathbf{x}^T}\right], \tag{6.23b}

\boldsymbol{\mu}^{(i+1)} = \boldsymbol{\mu}^{(i)} + \delta\boldsymbol{\mu}. \tag{6.23c}

This is a Newton-style iteration: the expectations are evaluated at the current estimate $q^{(i)}$, a linear system is solved for the mean update, and the new inverse covariance is obtained directly as the expected Hessian.

Connection to MAP: if each expectation is approximated by the integrand's value at the mean (equivalent to using a single sample point), (6.23) reduces exactly to Newton's method for MAP with the Laplace covariance. MAP is thus a coarse, one-point approximation of GVI that uses only the mean of $q$. Using more cubature points gives a better approximation of the expectations, and hence a better-quality posterior estimate.

Convergence guarantee: a Taylor expansion of the loss shows that each iteration does not increase $V$:

V(q^{(i+1)}) - V(q^{(i)}) \approx -\frac{1}{2}\delta\boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1(i+1)} \delta\boldsymbol{\mu} - \frac{1}{2}\text{tr}(\boldsymbol{\Sigma}^{(i)}\delta\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}^{(i)}\delta\boldsymbol{\Sigma}^{-1}) \leq 0, \tag{6.14}

so the step is a strict descent step whenever $\delta\boldsymbol{\mu} \neq \mathbf{0}$ or $\delta\boldsymbol{\Sigma}^{-1} \neq \mathbf{0}$, and local convergence is guaranteed.
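As a concrete scalar illustration of (6.23) — a sketch only, with a made-up measurement model $g(x) = x^2$ and made-up prior/noise values — the following runs a MAP (one-point) Newton initialization followed by full GVI iterations, with expectations evaluated by 3-point Gauss-Hermite cubature (exact here, since the integrands are polynomials of degree at most four):

```python
import numpy as np

# Scalar GVI iteration (6.23) on phi(x) = 0.5*(x - xc)^2/Pc + 0.5*(y - g(x))^2/R
# with a hypothetical nonlinear measurement g(x) = x^2.
xc, Pc = 1.0, 1.0   # prior mean and variance (made-up numbers)
y, R = 1.2, 0.1     # measurement and noise variance (made-up numbers)

def dphi(x):   # dphi/dx
    return (x - xc) / Pc + (x**2 - y) * 2.0 * x / R

def d2phi(x):  # d^2 phi / dx^2
    return 1.0 / Pc + (4.0 * x**2 + 2.0 * (x**2 - y)) / R

def expect(f, mu, var):
    # 3-point Gauss-Hermite cubature for E_{N(mu, var)}[f(x)]
    pts = mu + np.sqrt(var) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    return np.sum(wts * f(pts))

# Initialize from MAP (one-point Newton), as the text suggests:
mu = xc
for _ in range(20):
    mu -= dphi(mu) / d2phi(mu)
var = 1.0 / d2phi(mu)   # Laplace covariance as a starting point

# GVI iterations: new precision = expected Hessian, Newton step on the mean.
for _ in range(50):
    prec = expect(d2phi, mu, var)        # Sigma^{-1(i+1)}, (6.23a)
    mu += -expect(dphi, mu, var) / prec  # (6.23b)-(6.23c)
    var = 1.0 / prec
```

In this run the MAP estimate sits near 1.09, while the GVI mean settles a little lower with a comparable variance: the expectation-based objective shifts the Gaussian to cover the asymmetric posterior rather than just its mode, which is exactly the MAP-vs-GVI distinction described above.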
6.2.3 Natural Gradient Descent Interpretation

It can be shown that the optimization scheme above is equivalent to natural gradient descent (NGD) on the stacked variational parameter vector $\boldsymbol{\alpha}$:

\delta\boldsymbol{\alpha} = -\mathcal{I}_\alpha^{-1} \frac{\partial V(q)}{\partial \boldsymbol{\alpha}^T}, \tag{6.19}

where $\mathcal{I}_\alpha$ is the Fisher information matrix (FIM) of $q$ with respect to $\boldsymbol{\alpha}$. The natural gradient preconditions the ordinary gradient by the information geometry of the Gaussian family, making each update step more efficient in parameter space and providing a geometric explanation of why the method converges well.
6.3 Exact Sparsity

For large-scale state estimation (e.g., a robot trajectory of length $K$), applying (6.23) directly costs $O(N^3)$ per iteration, where $N$ is the total state dimension — prohibitively expensive. This section shows how the factored structure of the joint likelihood reduces the computational cost dramatically.
6.3.1 Factored Joint Likelihood

In many practical estimation problems, the joint likelihood factors:

\phi(\mathbf{x}) = \sum_{k=1}^K \phi_k(\mathbf{x}_k), \tag{6.32}

where each factor $\phi_k$ involves only a subset $\mathbf{x}_k$ of the state. For example:
- motion factors involve only the states at two adjacent times;
- measurement factors involve only the state at the current time.

Key observation: the factorization reduces all three required expectations to expectations over the local marginals of the factors:

E_q[\phi(\mathbf{x})] = \sum_k E_{q_k}[\phi_k(\mathbf{x}_k)], \tag{6.33}

E_q\left[\frac{\partial \phi}{\partial \mathbf{x}^T}\right] = \sum_k \mathbf{P}_k^T E_{q_k}\left[\frac{\partial \phi_k}{\partial \mathbf{x}_k^T}\right], \tag{6.35}

E_q\left[\frac{\partial^2 \phi}{\partial \mathbf{x}^T \partial \mathbf{x}}\right] = \sum_k \mathbf{P}_k^T E_{q_k}\left[\frac{\partial^2 \phi_k}{\partial \mathbf{x}_k^T \partial \mathbf{x}_k}\right] \mathbf{P}_k, \tag{6.36}

where $\mathbf{P}_k$ is the projection matrix that extracts $\mathbf{x}_k$ from the full state $\mathbf{x}$, and $q_k = \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_{kk})$ is the local marginal of $q$ associated with $\mathbf{x}_k$. This is not an approximation. Two important consequences follow:

- Exact sparsity of the inverse covariance: the non-zero blocks of $\boldsymbol{\Sigma}^{-1}$ occur only between variables linked by a common factor, with a fixed sparsity pattern determined by the factor graph. For batch trajectory estimation (each pair of adjacent times connected), $\boldsymbol{\Sigma}^{-1}$ is block-tridiagonal — the same sparsity as in MAP.
- No full covariance needed: we never require the full (dense) covariance matrix $\boldsymbol{\Sigma}$; only the sub-blocks of $\boldsymbol{\Sigma}$ corresponding to the non-zero blocks of $\boldsymbol{\Sigma}^{-1}$ are needed (usually far fewer than the full matrix).

Intuition: for a robot on a timeline, time $k$ only "sees" its neighbours ($k-1$ and $k+1$). This local dependence structure makes most entries of the global precision matrix zero, so only the non-zero part needs attention. This exactly sparse variant of GVI is referred to as ESGVI (Exactly Sparse GVI).
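The assembly pattern of (6.36) can be sketched directly — the example below uses a hypothetical chain of scalar states with motion factors $\phi_k = \frac{1}{2}(x_k - a x_{k-1})^2 / q$ (all numbers made up), and shows that summing projected local Hessians produces an exactly tridiagonal precision:

```python
import numpy as np

# Assemble Sigma^{-1} = sum_k P_k^T H_k P_k for a chain of K = 6 scalar states
# with motion factors phi_k(x_{k-1}, x_k) = 0.5*(x_k - a*x_{k-1})^2 / q.
# Each local Hessian H_k is 2x2; the projection P_k extracts states (k-1, k).
K, a, q = 6, 0.9, 0.5
H_local = np.array([[a * a / q, -a / q],
                    [-a / q, 1.0 / q]])   # Hessian of one motion factor

A = np.zeros((K, K))
for k in range(1, K):
    P = np.zeros((2, K))
    P[0, k - 1] = 1.0   # extract x_{k-1}
    P[1, k] = 1.0       # extract x_k
    A += P.T @ H_local @ P   # the (6.36) assembly pattern

# A is exactly tridiagonal (block-tridiagonal in the general vector case).
```

Only the factor-linked entries of `A` are ever touched; everything farther than one step off the diagonal is exactly zero, which is the "exact sparsity" the text describes.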
6.3.2 Partial Computation of the Covariance

We need to recover the required sub-blocks of $\boldsymbol{\Sigma}$ efficiently from the (sparse) $\boldsymbol{\Sigma}^{-1}$, without explicitly forming the dense full covariance.

The algorithm of Takahashi et al. (1973) solves exactly this problem: alongside the $\mathbf{LDL}^T$ decomposition of $\boldsymbol{\Sigma}^{-1}$ (already needed to solve the mean-update equation), a backward-substitution recursion efficiently extracts all required covariance sub-blocks:

\boldsymbol{\Sigma}^{-1} = \mathbf{LDL}^T, \tag{6.40}

\boldsymbol{\Sigma}_{K,K} = \mathbf{D}_{K,K}^{-1}, \tag{6.48a}

\boldsymbol{\Sigma}_{j,k} = \delta(j,k)\mathbf{D}_{j,k}^{-1} - \sum_{\ell=k+1}^K \boldsymbol{\Sigma}_{j,\ell} \mathbf{L}_{\ell,k}, \quad j \geq k. \tag{6.48d}

"Four corners of a box" rule: whenever the recursion for a required block $\boldsymbol{\Sigma}_{j,k}$ involves a non-zero $\mathbf{L}_{\ell,k}$, the block $\boldsymbol{\Sigma}_{j,\ell}$ must also be computed; the index pairs involved form the four corners of a box, and the set of required blocks is closed under this rule, which ensures every needed covariance sub-block can be computed completely.

Computational cost: the overall complexity is the same order as the MAP approach (both are dominated by the same $\mathbf{LDL}^T$ decomposition, which is piggybacked on to solve for $\delta\boldsymbol{\mu}$), with only a larger constant factor coming from the extra marginal-expectation evaluations.
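A minimal sketch of the Takahashi recursion (6.48), specialized to a tridiagonal precision with scalar blocks and made-up numbers; it recovers only the diagonal and first off-diagonal of $\boldsymbol{\Sigma}$, never forming the dense inverse:

```python
import numpy as np

# Partial inverse of a tridiagonal SPD precision matrix A via LDL^T plus the
# Takahashi backward recursion (6.48). Only the needed entries of Sigma = A^{-1}
# (diagonal and first off-diagonal) are computed.
n = 8
A = np.diag(2.0 * np.ones(n)) \
    + np.diag(-0.5 * np.ones(n - 1), 1) \
    + np.diag(-0.5 * np.ones(n - 1), -1)    # toy tridiagonal precision

# LDL^T factorization specialized to the tridiagonal case:
d = np.zeros(n)
l = np.zeros(n)          # l[k] = L[k, k-1]
d[0] = A[0, 0]
for k in range(1, n):
    l[k] = A[k, k - 1] / d[k - 1]
    d[k] = A[k, k] - l[k] ** 2 * d[k - 1]

# Backward recursion: Sigma_{K,K} = 1/d_K, then (6.48d) with the sum collapsing
# to a single term because L has only one sub-diagonal.
S_diag = np.zeros(n)     # Sigma[k, k]
S_off = np.zeros(n)      # S_off[k] = Sigma[k+1, k]
S_diag[n - 1] = 1.0 / d[n - 1]
for k in range(n - 2, -1, -1):
    S_off[k] = -S_diag[k + 1] * l[k + 1]          # Sigma_{k+1,k}
    S_diag[k] = 1.0 / d[k] - S_off[k] * l[k + 1]  # Sigma_{k,k}
```

Comparing against the dense inverse confirms the recursion is exact while touching only $O(n)$ entries instead of $O(n^2)$.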
6.3.3 Marginal Sampling (Cubature)

The optimization scheme requires, for each factor $\phi_k$, the local expectations:

E_{q_k}[\phi_k(\mathbf{x}_k)], \quad E_{q_k}\left[\frac{\partial \phi_k}{\partial \mathbf{x}_k^T}\right], \quad E_{q_k}\left[\frac{\partial^2 \phi_k}{\partial \mathbf{x}_k^T \partial \mathbf{x}_k}\right]. \tag{6.49}

Derivative-free implementation: applying Stein's lemma once more converts the expectations involving derivatives of $\phi_k$ into expectations of $\phi_k$ itself (times polynomial factors):

E_{q_k}\left[\frac{\partial \phi_k}{\partial \mathbf{x}_k^T}\right] = \boldsymbol{\Sigma}_{kk}^{-1} E_{q_k}[(\mathbf{x}_k - \boldsymbol{\mu}_k)\phi_k(\mathbf{x}_k)], \tag{6.52}

E_{q_k}\left[\frac{\partial^2 \phi_k}{\partial \mathbf{x}_k^T \partial \mathbf{x}_k}\right] = \boldsymbol{\Sigma}_{kk}^{-1}E_{q_k}[(\mathbf{x}_k-\boldsymbol{\mu}_k)(\mathbf{x}_k-\boldsymbol{\mu}_k)^T\phi_k]\boldsymbol{\Sigma}_{kk}^{-1} - \boldsymbol{\Sigma}_{kk}^{-1}E_{q_k}[\phi_k]. \tag{6.53}

These expectations are approximated numerically with cubature / sigma-point methods:

E_{q_k}[\phi_k(\mathbf{x}_k)] \approx \sum_\ell w_{k,\ell} \phi_k(\mathbf{x}_{k,\ell}), \tag{6.56a}

where the $\mathbf{x}_{k,\ell}$ are sigma points and the $w_{k,\ell}$ are the corresponding weights.

Common choices (see Chapter 4):
- UKF sigma points ($2n_k + 1$ points, including the central point);
- the spherical cubature rule ($2n_k$ points);
- Gauss-Hermite cubature (more accurate, $M^{n_k}$ points — suitable when the factor dimension is small).

Full ESGVI algorithm:

Initialize μ, Σ^{-1} (e.g., from MAP)
Repeat until convergence:
1. For each factor k:
   a. Draw sigma points {x_{k,ℓ}, w_{k,ℓ}} from q_k = N(μ_k, Σ_{kk})
   b. Evaluate the three local expectations (6.56)
2. Assemble the global expectations (6.35), (6.36)
3. Update Σ^{-1} via (6.23a) — it is directly the expected Hessian
4. Factor Σ^{-1} = LDL^T
5. Solve Σ^{-1} δμ = −(expected gradient) by forward/backward substitution
6. Recover the required sub-blocks of Σ by Takahashi back-substitution
7. Update the mean: μ ← μ + δμ

Intuition: ESGVI is essentially a "smarter IEKF / batch MAP" — instead of linearizing only at the mean, it takes expectations at sample points of the current Gaussian posterior, which captures the model nonlinearity better. As the iterations proceed, the whole posterior (mean and covariance) converges together, not just the mean. Since the expectations are taken over the improving posterior (not held fixed about the prior, as in the SPKF/ISPKF), ESGVI is a posterior statistical linearization method.
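The Stein-lemma identity (6.52) can be checked numerically in a few lines — a sketch with a made-up test function $\phi(x) = x^3$ under a scalar Gaussian, where 3-point Gauss-Hermite cubature is exact and a closed-form answer is available for comparison:

```python
import numpy as np

# Derivative-free expectation via Stein's lemma, eq. (6.52):
#   E_q[dphi/dx] = Sigma^{-1} E_q[(x - mu) phi(x)].
# Checked for phi(x) = x^3 under q = N(mu, var), using 3-point Gauss-Hermite
# cubature (exact for polynomials up to degree 5).
mu, var = 0.7, 0.3   # made-up marginal mean and variance

def expect(f):
    pts = mu + np.sqrt(var) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    return np.sum(wts * f(pts))

lhs = expect(lambda x: 3.0 * x**2)             # E_q[dphi/dx], using derivatives
rhs = expect(lambda x: (x - mu) * x**3) / var  # Stein's lemma form, derivative-free
analytic = 3.0 * (mu**2 + var)                 # closed form for a Gaussian
```

Both sides agree with the closed form, confirming that the derivative-free form (6.52) lets each factor be evaluated from sigma-point function values alone.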
6.4 Extensions

6.4.1 Alternate Loss Functional (ESGVI-GN)

When the factors have the quadratic form $\phi(\mathbf{x}) = \frac{1}{2}\mathbf{e}(\mathbf{x})^T \mathbf{W}^{-1} \mathbf{e}(\mathbf{x})$, we can define an alternate loss:

V'(q) = \frac{1}{2}E_q[\mathbf{e}(\mathbf{x})]^T \mathbf{W}^{-1} E_q[\mathbf{e}(\mathbf{x})] + \frac{1}{2}\ln|\boldsymbol{\Sigma}^{-1}|, \tag{6.69}

By Jensen's inequality, $V'(q) \leq V(q)$, so $V'$ is a conservative lower bound for $V$; this corresponds to a Gauss-Newton-style approximation (the expectation of the error replaces the expectation of the quadratic, skipping the full Hessian computation). The ESGVI-GN update equations are:

\boldsymbol{\Sigma}^{-1(i+1)} = \bar{\mathbf{E}}^{(i)T}\mathbf{W}^{-1}\bar{\mathbf{E}}^{(i)}, \tag{6.74}

\boldsymbol{\Sigma}^{-1(i+1)}\delta\boldsymbol{\mu} = -\bar{\mathbf{E}}^{(i)T}\mathbf{W}^{-1}\bar{\mathbf{e}}^{(i)}, \tag{6.75}

where $\bar{\mathbf{e}}^{(i)} = E_{q^{(i)}}[\mathbf{e}(\mathbf{x})]$ and $\bar{\mathbf{E}}^{(i)}$ is the statistical Jacobian — the expected Jacobian of $\mathbf{e}(\mathbf{x})$ over $q^{(i)}$.

Advantages:
- No full Hessian of $\phi$ is needed, and the cubature rules need only half the polynomial accuracy, so fewer sigma points suffice.
- The resulting inverse covariance is more conservative (smaller than that of full ESGVI), which helps avoid overconfidence.
- It can serve as a cheap preprocessing stage for full ESGVI.
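The updates (6.74)-(6.75) can be sketched on the same made-up scalar toy problem as before (prior on $x$ plus a hypothetical measurement $g(x) = x^2$), with both $\bar{\mathbf{e}}$ and the statistical Jacobian $\bar{\mathbf{E}}$ evaluated by cubature, so no second derivatives ever appear:

```python
import numpy as np

# ESGVI-GN sketch (6.74)-(6.75): stacked error e(x) = [x - xc, y - g(x)]^T with
# weights W = diag(Pc, R) and a hypothetical measurement g(x) = x^2.
xc, Pc, y, R = 1.0, 1.0, 1.2, 0.1   # made-up prior/measurement values
Winv = np.diag([1.0 / Pc, 1.0 / R])

def expectations(mu, var):
    # 3-point Gauss-Hermite cubature over q = N(mu, var)
    pts = mu + np.sqrt(var) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    e = np.stack([pts - xc, y - pts**2])            # e(x) at each sigma point
    de = np.stack([np.ones_like(pts), -2.0 * pts])  # de/dx at each sigma point
    return e @ wts, de @ wts                        # e_bar, statistical Jacobian

mu, var = xc, Pc
for _ in range(100):
    e_bar, E_bar = expectations(mu, var)
    prec = E_bar @ Winv @ E_bar           # (6.74), a scalar here
    mu += -(E_bar @ Winv @ e_bar) / prec  # (6.75) plus the mean update
    var = 1.0 / prec
```

The iteration converges to a fixed point where the expected (statistically linearized) Gauss-Newton gradient vanishes, using half as many function evaluations per step as the full expected-Hessian update would need.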
6.4.2 Parameter Estimation (EM)

The ESGVI framework extends naturally to jointly estimating the model parameters $\boldsymbol{\theta}$ (e.g., the system matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and the noise covariances $\mathbf{Q}$, $\mathbf{R}$) alongside the state. This is achieved with the **expectation maximization (EM)** algorithm:

| Step | Operation |
|---|---|
| E-step (expectation) | Hold $\boldsymbol{\theta}$ fixed; run ESGVI to convergence to obtain $q(\mathbf{x})$ |
| M-step (minimization) | Hold $q(\mathbf{x})$ fixed; minimize $V(q \mid \boldsymbol{\theta})$ over $\boldsymbol{\theta}$ |

Setting the derivative of the M-step objective with respect to $\boldsymbol{\theta}$ to zero yields closed-form update formulas. Because the likelihood is factored, the M-step expectations require only the sub-blocks of $\boldsymbol{\Sigma}$ corresponding to the non-zero blocks of $\boldsymbol{\Sigma}^{-1}$ — exactly those already computed in the E-step.

Covariance estimation (closed-form M-step):

\mathbf{W} = \frac{1}{K}\sum_{k=1}^K E_{q_k}[\mathbf{e}_k(\mathbf{x}_k)\mathbf{e}_k(\mathbf{x}_k)^T]. \tag{6.87}

This update includes a correction term for the uncertainty in the state estimate, not just the squared residuals. EM convergence: each EM iteration does not increase the negative log-likelihood of the measurement data, $-\ln p(\mathbf{z}|\boldsymbol{\theta})$, but convergence is only guaranteed to a local optimum.
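For a scalar example of the M-step update (6.87) — a sketch with made-up measurements $y_k = x + n_k$ and a fixed posterior $q(x) = \mathcal{N}(\mu, \sigma^2)$ — the marginal expectation evaluates to the squared residual plus the state-uncertainty correction mentioned above:

```python
import numpy as np

# M-step covariance update (6.87) for e_k(x) = y_k - x under a fixed Gaussian
# posterior q(x) = N(mu, var). By cubature, E_q[(y_k - x)^2] = (y_k - mu)^2 + var:
# squared residual plus a correction for the uncertainty in the state estimate.
mu, var = 2.0, 0.04                       # made-up posterior over the state
ys = np.array([1.7, 2.2, 2.05, 1.9, 2.3]) # made-up measurements

pts = mu + np.sqrt(var) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])

# W = (1/K) sum_k E_q[e_k(x) e_k(x)^T], evaluated by cubature:
W_hat = np.mean([np.sum(wts * (yk - pts) ** 2) for yk in ys])
W_analytic = np.mean((ys - mu) ** 2) + var
```

Without the `+ var` correction the noise covariance would be underestimated, because part of the residual scatter is explained by uncertainty in the state rather than by measurement noise.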
6.5 Special Case: Linear Systems

6.5.1 Recovery of the Batch Solution

For linear-Gaussian systems, one can verify that GVI converges exactly, in a single iteration, to the batch MAP (Cholesky smoother) solution:

\hat{\mathbf{P}}^{-1} = \underbrace{\mathbf{A}^{-T}\mathbf{Q}^{-1}\mathbf{A}^{-1} + \mathbf{C}^T\mathbf{R}^{-1}\mathbf{C}}_{\text{block-tridiagonal}}, \tag{6.91a}

\hat{\mathbf{P}}^{-1}\hat{\mathbf{x}} = \mathbf{A}^{-T}\mathbf{Q}^{-1}\mathbf{Bu} + \mathbf{C}^T\mathbf{R}^{-1}\mathbf{y}. \tag{6.91b}

This agrees exactly with the result derived in Chapter 3, and the block-tridiagonal sparsity of $\hat{\mathbf{P}}^{-1}$ is preserved. For nonlinear systems, GVI differs from MAP: the GVI iteration uses full marginal posterior expectations, whereas MAP expands only about the mean, so GVI provides a richer approximation of the posterior.
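The batch information form can be sketched for a tiny scalar linear-Gaussian chain (all numbers made up): stacking the prior, motion, and measurement constraints and forming the normal equations yields a tridiagonal precision, and the mean matches a whitened least-squares solve of the same stacked system:

```python
import numpy as np

# Batch linear-Gaussian estimation in information form (6.91) for a scalar
# chain: x_k = a*x_{k-1} + w_k, y_k = c*x_k + n_k (made-up small example).
K = 5
a, c = 0.9, 1.0
q, r, p0 = 0.1, 0.5, 1.0                      # Q, R, P0 variances
x0 = 0.0                                      # prior mean on x_0
ys = np.array([0.1, -0.2, 0.3, 0.0, 0.15])    # made-up measurements

rows, rhs, w = [], [], []
e = np.zeros(K); e[0] = 1.0
rows.append(e); rhs.append(x0); w.append(1.0 / p0)        # prior on x_0
for k in range(1, K):
    e = np.zeros(K); e[k] = 1.0; e[k - 1] = -a
    rows.append(e); rhs.append(0.0); w.append(1.0 / q)    # motion constraint
for k in range(K):
    e = np.zeros(K); e[k] = c
    rows.append(e); rhs.append(ys[k]); w.append(1.0 / r)  # measurement

H = np.array(rows); b = np.array(rhs); Winv = np.diag(w)
prec = H.T @ Winv @ H                      # P_hat^{-1}: exactly tridiagonal
x_hat = np.linalg.solve(prec, H.T @ Winv @ b)
```

The sparsity of `prec` is exactly the block-tridiagonal pattern of (6.91a), and `x_hat` is the batch smoothed estimate.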
6.5.2 System Identification

Via the EM algorithm, the M-step for a linear time-invariant (LTI) system has closed-form updates; for example,

[\mathbf{A} \; \mathbf{B}] = \left[\sum_k (\hat{\mathbf{x}}_k \hat{\mathbf{x}}_{k-1}^T + \hat{\mathbf{P}}_{k,k-1}), \; \sum_k \hat{\mathbf{x}}_k \mathbf{u}_k^T\right] \cdot \left[\begin{smallmatrix} \sum_k (\hat{\mathbf{x}}_{k-1}\hat{\mathbf{x}}_{k-1}^T + \hat{\mathbf{P}}_{k-1}) & \sum_k \hat{\mathbf{x}}_{k-1}\mathbf{u}_k^T \\ \sum_k \mathbf{u}_k\hat{\mathbf{x}}_{k-1}^T & \sum_k \mathbf{u}_k\mathbf{u}_k^T \end{smallmatrix}\right]^{-1}, \tag{6.99a}

\mathbf{Q} = \frac{1}{K}\sum_{k=1}^K \left[(\hat{\mathbf{x}}_k - \mathbf{A}\hat{\mathbf{x}}_{k-1} - \mathbf{B}\mathbf{u}_k)(\cdots)^T + \hat{\mathbf{P}}_k - \hat{\mathbf{P}}_{k,k-1}\mathbf{A}^T - \mathbf{A}\hat{\mathbf{P}}_{k,k-1}^T + \mathbf{A}\hat{\mathbf{P}}_{k-1}\mathbf{A}^T\right], \tag{6.101b}

with similar closed forms for the remaining parameters. Here $\hat{\mathbf{P}}_{k,k-1}$ is the cross-covariance block between adjacent times (an off-diagonal block of $\hat{\mathbf{P}}$), computed as part of the RTS smoother in the E-step.

Comparison with adaptive covariance estimation (Chapter 5): the EM approach computes errors using the smoothed state estimates, which are correlated with the residuals, and corrects for this with covariance terms; the adaptive approach uses the filtered (predicted) state estimates, which are uncorrelated with the errors, and corrects by subtracting state-estimate covariance terms. The two methods are similar in form but differ in these details.
6.6 Nonlinear Systems

6.6.1 GVI for the Kalman Filter Correction Step

GVI can also be applied to a single filter correction step: given the predicted distribution $\mathcal{N}(\check{\mathbf{x}}, \check{\mathbf{P}})$ and a measurement $\mathbf{y}$, minimize

V(q) = \phi_p(\mathbf{x}) + \phi_m(\mathbf{x}) = \frac{1}{2}(\mathbf{x}-\check{\mathbf{x}})^T\check{\mathbf{P}}^{-1}(\mathbf{x}-\check{\mathbf{x}}) + \frac{1}{2}(\mathbf{y}-\mathbf{g}(\mathbf{x}))^T\mathbf{R}^{-1}(\mathbf{y}-\mathbf{g}(\mathbf{x})). \tag{6.102}

Using the alternate loss (ESGVI-GN) produces a Kalman-style correction:

\mathbf{K} = \check{\mathbf{P}}\bar{\mathbf{G}}^T(\mathbf{R} + \bar{\mathbf{G}}\check{\mathbf{P}}\bar{\mathbf{G}}^T)^{-1}, \tag{6.107a}

\mathbf{P}_{\text{op}} \leftarrow (\mathbf{1} - \mathbf{K}\bar{\mathbf{G}})\check{\mathbf{P}}, \tag{6.107b}

\mathbf{x}_{\text{op}} \leftarrow \check{\mathbf{x}} + \mathbf{K}(\mathbf{y} - \bar{\mathbf{g}} - \bar{\mathbf{G}}(\check{\mathbf{x}} - \mathbf{x}_{\text{op}})), \tag{6.107c}

where $\bar{\mathbf{g}} = E_q[\mathbf{g}(\mathbf{x})]$ and the statistical Jacobian $\bar{\mathbf{G}}$ replaces the standard Jacobian; both can be computed with sigma points, evaluated at the iteratively improving posterior rather than at the predicted prior.

The essential difference between GVI and the ISPKF:
- ISPKF sigma points are drawn from the prior $\mathcal{N}(\check{\mathbf{x}}, \check{\mathbf{P}})$ (prior statistical linearization);
- GVI sigma points are drawn from the iteratively updated posterior $\mathcal{N}(\mathbf{x}_{\text{op}}, \mathbf{P}_{\text{op}})$ (posterior statistical linearization).

This difference matters greatly for strong nonlinearities: the posterior is usually significantly narrower than the prior, so linearizing about it approximates the nonlinearity more accurately.
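A scalar sketch of the iterated correction (6.107), using a hypothetical stereo-like measurement model $g(x) = fb/x$ with made-up numbers (`fb`, the prior, and the noise level are all assumptions for illustration); the statistical Jacobian is computed via Stein's lemma from sigma points of the current posterior iterate:

```python
import numpy as np

# Iterated GVI-style Kalman correction (6.107) for a scalar stereo-like model
# g(x) = fb / x. Both g_bar and the statistical Jacobian G_bar are cubature
# expectations over the *current posterior* iterate N(x_op, P_op).
fb = 400.0          # focal length times baseline (hypothetical)
xp, Pp = 20.0, 9.0  # predicted (prior) mean and variance (made up)
R = 0.09            # measurement noise variance (made up)
y = fb / 24.0 + 0.1 # a measurement generated near x = 24

def g(x):
    return fb / x

x_op, P_op = xp, Pp
for _ in range(100):
    # 3-point Gauss-Hermite sigma points from the current posterior estimate:
    pts = x_op + np.sqrt(P_op) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    g_bar = np.sum(wts * g(pts))                        # E[g(x)]
    G_bar = np.sum(wts * (pts - x_op) * g(pts)) / P_op  # statistical Jacobian (Stein)
    Kgain = Pp * G_bar / (R + G_bar * Pp * G_bar)       # (6.107a)
    P_op = (1.0 - Kgain * G_bar) * Pp                   # (6.107b)
    x_op = xp + Kgain * (y - g_bar - G_bar * (xp - x_op))  # (6.107c)
```

Because the sigma points shrink with `P_op` as the iteration proceeds, the linearization is taken over the narrow posterior rather than the wide prior — the posterior statistical linearization the text contrasts with the ISPKF.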
6.6.2 Stereo Camera Example Revisited

Returning to the stereo camera example of Chapter 4, we compare the estimated means of the various methods (averaged over many Monte Carlo trials):

| Method | Estimated mean | Bias |
|---|---|---|
| MAP / IEKF | 24.5694 m | 33.0 cm (large bias) |
| ISPKF | 24.7414 m | 3.84 cm |
| GVI (ESGVI) | 24.7792 m | 0.28 cm (tiny bias) |
| True posterior mean | 24.7770 m | — |

The GVI estimate is closest to the true posterior mean, with a very small bias (0.28 cm vs. 33.0 cm for MAP and 3.84 cm for ISPKF). This validates the GVI objective: it genuinely finds the Gaussian closest to the true posterior in KL divergence — not just the mode (MAP) or an approximation of the mean (ISPKF).

Core insight: the essential difference between MAP, IEKF, ISPKF, and GVI is how many points are used to approximate the expectations, and where those points are placed:
- MAP / IEKF: one point (expansion at the mean only);
- SPKF (non-iterated): multiple sigma points, drawn from the prior;
- ISPKF: multiple sigma points, drawn iteratively from the prior;
- GVI: multiple sigma points, drawn iteratively from the posterior.

This reveals a spectrum of approximation quality: more sigma points, evaluated at the iterating posterior, give a better approximation of the true Bayesian posterior.
6.7 Summary

This chapter built up the complete variational-inference framework:

| Concept | Key point |
|---|---|
| Choice of KL direction | Use $\text{KL}(q \| p)$: the expectation is over $q$ and is computable |
| Loss functional | $V(q) = E_q[\phi(\mathbf{x})] + \frac{1}{2}\ln\lvert\boldsymbol{\Sigma}^{-1}\rvert$: data fit + entropy penalty |
| Optimization scheme | Newton iteration: GVI update = expected Hessian + a linear system driven by the expected gradient |
| Exact sparsity | Factored likelihood → block-tridiagonal $\boldsymbol{\Sigma}^{-1}$ → only local marginal covariance sub-blocks needed |
| Cubature | Expectations computed with sigma points; no explicit derivatives required |
| Parameter estimation | EM fits in naturally: E-step = GVI, M-step = closed-form updates |
| Relation to MAP | MAP = one-point approximation of GVI (a single sigma point at the mean) |

Four core conclusions:

1. GVI is a unifying framework for estimation theory: MAP, the Kalman filter, smoothers, and system identification can all be derived within GVI, as special cases or approximations of it.
2. Exact sparsity makes GVI the same order of cost as MAP: the factored likelihood gives GVI the same computational complexity in the state dimension as MAP, with only a larger constant factor.
3. GVI gets closer to the true posterior than MAP: for nonlinear systems it can reduce bias dramatically (from 33 cm to 0.28 cm in the stereo camera example), at the cost of more sigma-point evaluations per iteration.
4. Posterior statistical linearization beats prior statistical linearization: GVI draws sigma points from the posterior and is therefore more accurate than the ISPKF (which draws them from the prior), especially for strong nonlinearities.

In short, GVI provides a principled top-down framework that unifies state estimation, parameter estimation, and system identification. Choosing $\text{KL}(q \| p)$ leads to the (negative) ELBO loss, whose minimization recovers MAP (with one sigma point) or achieves a better Gaussian posterior approximation (with more). Factored likelihoods ensure exact block-tridiagonal sparsity in $\boldsymbol{\Sigma}^{-1}$, making ESGVI computationally tractable via the Takahashi partial covariance algorithm, and the EM extension enables joint state and parameter estimation from data alone. The progression MAP → SPKF/ISPKF → GVI trades more computation for a closer match to the true Bayesian posterior.

Part II moves from purely algebraic vector states to three-dimensional geometry: rotations are not ordinary vectors, and the pose of a robot in 3D space requires specialized mathematical tools. The next chapter introduces the necessary 3D geometry primer.