Chapter 6 Variational Inference

6.1 Introduction

The previous chapters developed estimation theory from the bottom up: starting from the linear-Gaussian case, extending step by step to nonlinear and non-ideal situations, and arriving at a family of estimators (EKF, IEKF, SPKF, batch MAP, ...). This chapter switches to a top-down view: starting from a single unified probabilistic objective, we re-derive (and extend) these methods, and additionally unlock the ability to identify model parameters.

The core problem: for nonlinear models, the true Bayesian posterior $p(\mathbf{x}|\mathbf{z})$ is generally non-Gaussian and computationally intractable. The idea of **variational inference** is:

rather than computing the true posterior, search within the family of Gaussian distributions for the Gaussian $q(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ that is closest to the true posterior.

"Closest" is measured by the **KL divergence** (Kullback-Leibler divergence). Variational inference restricted to a Gaussian approximation is called **Gaussian variational inference (GVI)**.

Intuition: imagine you want to know the shape of a rugged valley (the true posterior), but surveying it exactly (exact computation) is too expensive. The variational strategy is to approximate the valley with an ellipsoid (a Gaussian) and find the ellipsoid that covers it best. MAP finds only the lowest point of the valley; GVI also finds the ellipsoid that best represents the valley's overall shape.

Key advantages of GVI over MAP:
- It returns a full Gaussian approximation (mean and covariance), not just a point estimate.
- The covariance accounts for the nonlinearity, rather than being merely a Laplace approximation at the mode.
- It can jointly estimate model parameters (system matrices, biases, noise covariances).
6.2 Gaussian Variational Inference

6.2.1 Loss Functional

The KL divergence comes in two directions, with different properties:

\text{KL}(p \| q) = E_p[\ln p(\mathbf{x}|\mathbf{z}) - \ln q(\mathbf{x})], \tag{6.4}

\text{KL}(q \| p) = E_q[\ln q(\mathbf{x}) - \ln p(\mathbf{x}|\mathbf{z})]. \tag{6.5}

We choose $\text{KL}(q \| p)$ because its expectation is taken over $q(\mathbf{x})$ — our own Gaussian estimate — and can therefore be evaluated by sampling or analytically; the expectation in $\text{KL}(p \| q)$ is over the intractable true posterior $p(\mathbf{x}|\mathbf{z})$.

Expanding $\text{KL}(q \| p)$ and dropping the terms that do not depend on $q$, we define the loss functional:

V(q) = \underbrace{E_q[\phi(\mathbf{x})]}_{\text{data fit}} + \underbrace{\frac{1}{2}\ln|\boldsymbol{\Sigma}^{-1}|}_{\text{entropy penalty}}, \tag{6.7}

where $\phi(\mathbf{x}) = -\ln p(\mathbf{x}, \mathbf{z})$ is the negative log-likelihood of the joint distribution.

Note that we describe $q$ by its precision (information) matrix $\boldsymbol{\Sigma}^{-1}$ rather than its covariance $\boldsymbol{\Sigma}$, because the precision matrix is sparse in batch estimation.

Intuition for the loss functional:
- The first term, $E_q[\phi(\mathbf{x})]$, encourages the Gaussian $q$ to concentrate where the data likelihood is high (data fit).
- The second term, $\frac{1}{2}\ln|\boldsymbol{\Sigma}^{-1}|$, equals the negative Gaussian entropy (up to a constant) and penalizes overly concentrated (overconfident) distributions.
- Their balance yields the optimal Gaussian approximation — the one closest to the true posterior in KL divergence.

$V(q)$ is, up to a constant, the negative of the so-called **evidence lower bound (ELBO)**.
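To make the data-fit vs. entropy trade-off concrete, here is a minimal sketch (not from the text) that evaluates $V(q)$ by 3-point Gauss-Hermite cubature for a toy joint with $\phi(x) = \frac{1}{2}x^2$, so the true posterior is exactly $\mathcal{N}(0,1)$; the names `phi` and `V` are made up for the illustration:

```python
import numpy as np

# Toy joint: phi(x) = 0.5 * x^2, so the true posterior is N(0, 1).
# Evaluate V(q) = E_q[phi(x)] + 0.5 * ln|Sigma^{-1}| for q = N(0, sigma^2),
# using 3-point Gauss-Hermite cubature (exact for polynomials up to degree 5).

def phi(x):
    return 0.5 * x**2

def V(sigma):
    # Gauss-Hermite points/weights for a standard normal, scaled to N(0, sigma^2)
    pts = sigma * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    data_fit = np.sum(wts * phi(pts))               # E_q[phi(x)]
    entropy_penalty = 0.5 * np.log(1.0 / sigma**2)  # 0.5 * ln|Sigma^{-1}|
    return data_fit + entropy_penalty

# Overconfident (sigma too small), matched, and underconfident candidates:
v_half, v_one, v_two = V(0.5), V(1.0), V(2.0)
```

An overconfident $q$ ($\sigma = 0.5$) is punished by the entropy term, an underconfident one ($\sigma = 2$) by the data-fit term; the loss is minimized at $\sigma = 1$, exactly the true posterior, as the theory predicts.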
6.2.2 Optimization Scheme

The goal is to minimize $V(q)$ with respect to the parameters $\{\boldsymbol{\mu}, \boldsymbol{\Sigma}^{-1}\}$ of $q$. Using Stein's lemma (see the appendix), the derivatives of the loss functional with respect to these parameters reduce to expectations of derivatives of $\phi$ itself:

\frac{\partial V(q)}{\partial \boldsymbol{\mu}^T} = \boldsymbol{\Sigma}^{-1} E_q[(\mathbf{x}-\boldsymbol{\mu})\phi(\mathbf{x})], \tag{6.8a}

leading to the iterative update scheme:

\frac{\partial^2 V(q)}{\partial \boldsymbol{\mu}^T \partial \boldsymbol{\mu}} = E_q\left[\frac{\partial^2 \phi(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}}\right] = \boldsymbol{\Sigma}^{-1(i+1)}, \tag{6.23a}

\boldsymbol{\Sigma}^{-1(i+1)} \delta\boldsymbol{\mu} = -E_{q^{(i)}}\left[\frac{\partial \phi(\mathbf{x})}{\partial \mathbf{x}^T}\right], \tag{6.23b}

\boldsymbol{\mu}^{(i+1)} = \boldsymbol{\mu}^{(i)} + \delta\boldsymbol{\mu}. \tag{6.23c}

This is a Newton-style iteration: the expectations are evaluated at the current estimate $q^{(i)}$, a linear system is solved for the mean update, and the new inverse covariance is obtained directly as the expected Hessian.

Connection to MAP: if each expectation is approximated by the integrand's value at the mean (equivalent to using a single sample point), (6.23) reduces exactly to Newton's method for MAP with the Laplace covariance. MAP is thus a coarse, one-point approximation of GVI that uses only the mean of $q$. Using more cubature points gives a better approximation of the expectations, and hence a better-quality posterior estimate.

Convergence guarantee: a Taylor expansion of the loss shows that each iteration does not increase $V$:

V(q^{(i+1)}) - V(q^{(i)}) \approx -\frac{1}{2}\delta\boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1(i+1)} \delta\boldsymbol{\mu} - \frac{1}{2}\text{tr}(\boldsymbol{\Sigma}^{(i)}\delta\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}^{(i)}\delta\boldsymbol{\Sigma}^{-1}) \leq 0, \tag{6.14}

so the step is a strict descent step whenever $\delta\boldsymbol{\mu} \neq \mathbf{0}$ or $\delta\boldsymbol{\Sigma}^{-1} \neq \mathbf{0}$, and local convergence is guaranteed.
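As a concrete scalar illustration of (6.23) — a sketch only, with a made-up measurement model $g(x) = x^2$ and made-up prior/noise values — the following runs a MAP (one-point) Newton initialization followed by full GVI iterations, with expectations evaluated by 3-point Gauss-Hermite cubature (exact here, since the integrands are polynomials of degree at most four):

```python
import numpy as np

# Scalar GVI iteration (6.23) on phi(x) = 0.5*(x - xc)^2/Pc + 0.5*(y - g(x))^2/R
# with a hypothetical nonlinear measurement g(x) = x^2.
xc, Pc = 1.0, 1.0   # prior mean and variance (made-up numbers)
y, R = 1.2, 0.1     # measurement and noise variance (made-up numbers)

def dphi(x):   # dphi/dx
    return (x - xc) / Pc + (x**2 - y) * 2.0 * x / R

def d2phi(x):  # d^2 phi / dx^2
    return 1.0 / Pc + (4.0 * x**2 + 2.0 * (x**2 - y)) / R

def expect(f, mu, var):
    # 3-point Gauss-Hermite cubature for E_{N(mu, var)}[f(x)]
    pts = mu + np.sqrt(var) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    return np.sum(wts * f(pts))

# Initialize from MAP (one-point Newton), as the text suggests:
mu = xc
for _ in range(20):
    mu -= dphi(mu) / d2phi(mu)
var = 1.0 / d2phi(mu)   # Laplace covariance as a starting point

# GVI iterations: new precision = expected Hessian, Newton step on the mean.
for _ in range(50):
    prec = expect(d2phi, mu, var)        # Sigma^{-1(i+1)}, (6.23a)
    mu += -expect(dphi, mu, var) / prec  # (6.23b)-(6.23c)
    var = 1.0 / prec
```

In this run the MAP estimate sits near 1.09, while the GVI mean settles a little lower with a comparable variance: the expectation-based objective shifts the Gaussian to cover the asymmetric posterior rather than just its mode, which is exactly the MAP-vs-GVI distinction described above.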
6.2.3 Natural Gradient Descent Interpretation

It can be shown that the optimization scheme above is equivalent to natural gradient descent (NGD) on the stacked variational parameter vector $\boldsymbol{\alpha}$:

\delta\boldsymbol{\alpha} = -\mathcal{I}_\alpha^{-1} \frac{\partial V(q)}{\partial \boldsymbol{\alpha}^T}, \tag{6.19}

where $\mathcal{I}_\alpha$ is the Fisher information matrix (FIM) of $q$ with respect to $\boldsymbol{\alpha}$. The natural gradient preconditions the ordinary gradient by the information geometry of the Gaussian family, making each update step more efficient in parameter space and providing a geometric explanation of why the method converges well.
6.3 Exact Sparsity

For large-scale state estimation (e.g., a robot trajectory of length $K$), applying (6.23) directly costs $O(N^3)$ per iteration, where $N$ is the total state dimension — prohibitively expensive. This section shows how the factored structure of the joint likelihood reduces the computational cost dramatically.
6.3.1 Factored Joint Likelihood

In many practical estimation problems, the joint likelihood factors:

\phi(\mathbf{x}) = \sum_{k=1}^K \phi_k(\mathbf{x}_k), \tag{6.32}

where each factor $\phi_k$ involves only a subset $\mathbf{x}_k$ of the state. For example:
- motion factors involve only the states at two adjacent times;
- measurement factors involve only the state at the current time.

Key observation: the factorization reduces all three required expectations to expectations over the local marginals of the factors:

E_q[\phi(\mathbf{x})] = \sum_k E_{q_k}[\phi_k(\mathbf{x}_k)], \tag{6.33}

E_q\left[\frac{\partial \phi}{\partial \mathbf{x}^T}\right] = \sum_k \mathbf{P}_k^T E_{q_k}\left[\frac{\partial \phi_k}{\partial \mathbf{x}_k^T}\right], \tag{6.35}

E_q\left[\frac{\partial^2 \phi}{\partial \mathbf{x}^T \partial \mathbf{x}}\right] = \sum_k \mathbf{P}_k^T E_{q_k}\left[\frac{\partial^2 \phi_k}{\partial \mathbf{x}_k^T \partial \mathbf{x}_k}\right] \mathbf{P}_k, \tag{6.36}

where $\mathbf{P}_k$ is the projection matrix that extracts $\mathbf{x}_k$ from the full state $\mathbf{x}$, and $q_k = \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_{kk})$ is the local marginal of $q$ associated with $\mathbf{x}_k$. This is not an approximation. Two important consequences follow:

- Exact sparsity of the inverse covariance: the non-zero blocks of $\boldsymbol{\Sigma}^{-1}$ occur only between variables linked by a common factor, with a fixed sparsity pattern determined by the factor graph. For batch trajectory estimation (each pair of adjacent times connected), $\boldsymbol{\Sigma}^{-1}$ is block-tridiagonal — the same sparsity as in MAP.
- No full covariance needed: we never require the full (dense) covariance matrix $\boldsymbol{\Sigma}$; only the sub-blocks of $\boldsymbol{\Sigma}$ corresponding to the non-zero blocks of $\boldsymbol{\Sigma}^{-1}$ are needed (usually far fewer than the full matrix).

Intuition: for a robot on a timeline, time $k$ only "sees" its neighbours ($k-1$ and $k+1$). This local dependence structure makes most entries of the global precision matrix zero, so only the non-zero part needs attention. This exactly sparse variant of GVI is referred to as ESGVI (Exactly Sparse GVI).
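The assembly pattern of (6.36) can be sketched directly — the example below uses a hypothetical chain of scalar states with motion factors $\phi_k = \frac{1}{2}(x_k - a x_{k-1})^2 / q$ (all numbers made up), and shows that summing projected local Hessians produces an exactly tridiagonal precision:

```python
import numpy as np

# Assemble Sigma^{-1} = sum_k P_k^T H_k P_k for a chain of K = 6 scalar states
# with motion factors phi_k(x_{k-1}, x_k) = 0.5*(x_k - a*x_{k-1})^2 / q.
# Each local Hessian H_k is 2x2; the projection P_k extracts states (k-1, k).
K, a, q = 6, 0.9, 0.5
H_local = np.array([[a * a / q, -a / q],
                    [-a / q, 1.0 / q]])   # Hessian of one motion factor

A = np.zeros((K, K))
for k in range(1, K):
    P = np.zeros((2, K))
    P[0, k - 1] = 1.0   # extract x_{k-1}
    P[1, k] = 1.0       # extract x_k
    A += P.T @ H_local @ P   # the (6.36) assembly pattern

# A is exactly tridiagonal (block-tridiagonal in the general vector case).
```

Only the factor-linked entries of `A` are ever touched; everything farther than one step off the diagonal is exactly zero, which is the "exact sparsity" the text describes.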
6.3.2 Partial Computation of the Covariance

We need to recover the required sub-blocks of $\boldsymbol{\Sigma}$ efficiently from the (sparse) $\boldsymbol{\Sigma}^{-1}$, without explicitly forming the dense full covariance.

The algorithm of Takahashi et al. (1973) solves exactly this problem: alongside the $\mathbf{LDL}^T$ decomposition of $\boldsymbol{\Sigma}^{-1}$ (already needed to solve the mean-update equation), a backward-substitution recursion efficiently extracts all required covariance sub-blocks:

\boldsymbol{\Sigma}^{-1} = \mathbf{LDL}^T, \tag{6.40}

\boldsymbol{\Sigma}_{K,K} = \mathbf{D}_{K,K}^{-1}, \tag{6.48a}

\boldsymbol{\Sigma}_{j,k} = \delta(j,k)\mathbf{D}_{j,k}^{-1} - \sum_{\ell=k+1}^K \boldsymbol{\Sigma}_{j,\ell} \mathbf{L}_{\ell,k}, \quad j \geq k. \tag{6.48d}

"Four corners of a box" rule: whenever the recursion for a required block $\boldsymbol{\Sigma}_{j,k}$ involves a non-zero $\mathbf{L}_{\ell,k}$, the block $\boldsymbol{\Sigma}_{j,\ell}$ must also be computed; the index pairs involved form the four corners of a box, and the set of required blocks is closed under this rule, which ensures every needed covariance sub-block can be computed completely.

Computational cost: the overall complexity is the same order as the MAP approach (both are dominated by the same $\mathbf{LDL}^T$ decomposition, which is piggybacked on to solve for $\delta\boldsymbol{\mu}$), with only a larger constant factor coming from the extra marginal-expectation evaluations.
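A minimal sketch of the Takahashi recursion (6.48), specialized to a tridiagonal precision with scalar blocks and made-up numbers; it recovers only the diagonal and first off-diagonal of $\boldsymbol{\Sigma}$, never forming the dense inverse:

```python
import numpy as np

# Partial inverse of a tridiagonal SPD precision matrix A via LDL^T plus the
# Takahashi backward recursion (6.48). Only the needed entries of Sigma = A^{-1}
# (diagonal and first off-diagonal) are computed.
n = 8
A = np.diag(2.0 * np.ones(n)) \
    + np.diag(-0.5 * np.ones(n - 1), 1) \
    + np.diag(-0.5 * np.ones(n - 1), -1)    # toy tridiagonal precision

# LDL^T factorization specialized to the tridiagonal case:
d = np.zeros(n)
l = np.zeros(n)          # l[k] = L[k, k-1]
d[0] = A[0, 0]
for k in range(1, n):
    l[k] = A[k, k - 1] / d[k - 1]
    d[k] = A[k, k] - l[k] ** 2 * d[k - 1]

# Backward recursion: Sigma_{K,K} = 1/d_K, then (6.48d) with the sum collapsing
# to a single term because L has only one sub-diagonal.
S_diag = np.zeros(n)     # Sigma[k, k]
S_off = np.zeros(n)      # S_off[k] = Sigma[k+1, k]
S_diag[n - 1] = 1.0 / d[n - 1]
for k in range(n - 2, -1, -1):
    S_off[k] = -S_diag[k + 1] * l[k + 1]          # Sigma_{k+1,k}
    S_diag[k] = 1.0 / d[k] - S_off[k] * l[k + 1]  # Sigma_{k,k}
```

Comparing against the dense inverse confirms the recursion is exact while touching only $O(n)$ entries instead of $O(n^2)$.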
6.3.3 Marginal Sampling (Cubature)

The optimization scheme requires, for each factor $\phi_k$, the local expectations:

E_{q_k}[\phi_k(\mathbf{x}_k)], \quad E_{q_k}\left[\frac{\partial \phi_k}{\partial \mathbf{x}_k^T}\right], \quad E_{q_k}\left[\frac{\partial^2 \phi_k}{\partial \mathbf{x}_k^T \partial \mathbf{x}_k}\right]. \tag{6.49}

Derivative-free implementation: applying Stein's lemma once more converts the expectations involving derivatives of $\phi_k$ into expectations of $\phi_k$ itself (times polynomial factors):

E_{q_k}\left[\frac{\partial \phi_k}{\partial \mathbf{x}_k^T}\right] = \boldsymbol{\Sigma}_{kk}^{-1} E_{q_k}[(\mathbf{x}_k - \boldsymbol{\mu}_k)\phi_k(\mathbf{x}_k)], \tag{6.52}

E_{q_k}\left[\frac{\partial^2 \phi_k}{\partial \mathbf{x}_k^T \partial \mathbf{x}_k}\right] = \boldsymbol{\Sigma}_{kk}^{-1}E_{q_k}[(\mathbf{x}_k-\boldsymbol{\mu}_k)(\mathbf{x}_k-\boldsymbol{\mu}_k)^T\phi_k]\boldsymbol{\Sigma}_{kk}^{-1} - \boldsymbol{\Sigma}_{kk}^{-1}E_{q_k}[\phi_k]. \tag{6.53}

These expectations are approximated numerically with cubature / sigma-point methods:

E_{q_k}[\phi_k(\mathbf{x}_k)] \approx \sum_\ell w_{k,\ell} \phi_k(\mathbf{x}_{k,\ell}), \tag{6.56a}

where the $\mathbf{x}_{k,\ell}$ are sigma points and the $w_{k,\ell}$ are the corresponding weights.

Common choices (see Chapter 4):
- UKF sigma points ($2n_k + 1$ points, including the central point);
- the spherical cubature rule ($2n_k$ points);
- Gauss-Hermite cubature (more accurate, $M^{n_k}$ points — suitable when the factor dimension is small).

Full ESGVI algorithm:

Initialize μ, Σ^{-1} (e.g., from MAP)
Repeat until convergence:
1. For each factor k:
   a. Draw sigma points {x_{k,ℓ}, w_{k,ℓ}} from q_k = N(μ_k, Σ_{kk})
   b. Evaluate the three local expectations (6.56)
2. Assemble the global expectations (6.35), (6.36)
3. Update Σ^{-1} via (6.23a) — it is directly the expected Hessian
4. Factor Σ^{-1} = LDL^T
5. Solve Σ^{-1} δμ = −(expected gradient) by forward/backward substitution
6. Recover the required sub-blocks of Σ by Takahashi back-substitution
7. Update the mean: μ ← μ + δμ

Intuition: ESGVI is essentially a "smarter IEKF / batch MAP" — instead of linearizing only at the mean, it takes expectations at sample points of the current Gaussian posterior, which captures the model nonlinearity better. As the iterations proceed, the whole posterior (mean and covariance) converges together, not just the mean. Since the expectations are taken over the improving posterior (not held fixed about the prior, as in the SPKF/ISPKF), ESGVI is a posterior statistical linearization method.
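The Stein-lemma identity (6.52) can be checked numerically in a few lines — a sketch with a made-up test function $\phi(x) = x^3$ under a scalar Gaussian, where 3-point Gauss-Hermite cubature is exact and a closed-form answer is available for comparison:

```python
import numpy as np

# Derivative-free expectation via Stein's lemma, eq. (6.52):
#   E_q[dphi/dx] = Sigma^{-1} E_q[(x - mu) phi(x)].
# Checked for phi(x) = x^3 under q = N(mu, var), using 3-point Gauss-Hermite
# cubature (exact for polynomials up to degree 5).
mu, var = 0.7, 0.3   # made-up marginal mean and variance

def expect(f):
    pts = mu + np.sqrt(var) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    return np.sum(wts * f(pts))

lhs = expect(lambda x: 3.0 * x**2)             # E_q[dphi/dx], using derivatives
rhs = expect(lambda x: (x - mu) * x**3) / var  # Stein's lemma form, derivative-free
analytic = 3.0 * (mu**2 + var)                 # closed form for a Gaussian
```

Both sides agree with the closed form, confirming that the derivative-free form (6.52) lets each factor be evaluated from sigma-point function values alone.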
6.4 Extensions

6.4.1 Alternate Loss Functional (ESGVI-GN)

When the factors have the quadratic form $\phi(\mathbf{x}) = \frac{1}{2}\mathbf{e}(\mathbf{x})^T \mathbf{W}^{-1} \mathbf{e}(\mathbf{x})$, we can define an alternate loss:

V'(q) = \frac{1}{2}E_q[\mathbf{e}(\mathbf{x})]^T \mathbf{W}^{-1} E_q[\mathbf{e}(\mathbf{x})] + \frac{1}{2}\ln|\boldsymbol{\Sigma}^{-1}|, \tag{6.69}

By Jensen's inequality, $V'(q) \leq V(q)$, so $V'$ is a conservative lower bound for $V$; this corresponds to a Gauss-Newton-style approximation (the expectation of the error replaces the expectation of the quadratic, skipping the full Hessian computation). The ESGVI-GN update equations are:

\boldsymbol{\Sigma}^{-1(i+1)} = \bar{\mathbf{E}}^{(i)T}\mathbf{W}^{-1}\bar{\mathbf{E}}^{(i)}, \tag{6.74}

\boldsymbol{\Sigma}^{-1(i+1)}\delta\boldsymbol{\mu} = -\bar{\mathbf{E}}^{(i)T}\mathbf{W}^{-1}\bar{\mathbf{e}}^{(i)}, \tag{6.75}

where $\bar{\mathbf{e}}^{(i)} = E_{q^{(i)}}[\mathbf{e}(\mathbf{x})]$ and $\bar{\mathbf{E}}^{(i)}$ is the statistical Jacobian — the expected Jacobian of $\mathbf{e}(\mathbf{x})$ over $q^{(i)}$.

Advantages:
- No full Hessian of $\phi$ is needed, and the cubature rules need only half the polynomial accuracy, so fewer sigma points suffice.
- The resulting inverse covariance is more conservative (smaller than that of full ESGVI), which helps avoid overconfidence.
- It can serve as a cheap preprocessing stage for full ESGVI.
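The updates (6.74)-(6.75) can be sketched on the same made-up scalar toy problem as before (prior on $x$ plus a hypothetical measurement $g(x) = x^2$), with both $\bar{\mathbf{e}}$ and the statistical Jacobian $\bar{\mathbf{E}}$ evaluated by cubature, so no second derivatives ever appear:

```python
import numpy as np

# ESGVI-GN sketch (6.74)-(6.75): stacked error e(x) = [x - xc, y - g(x)]^T with
# weights W = diag(Pc, R) and a hypothetical measurement g(x) = x^2.
xc, Pc, y, R = 1.0, 1.0, 1.2, 0.1   # made-up prior/measurement values
Winv = np.diag([1.0 / Pc, 1.0 / R])

def expectations(mu, var):
    # 3-point Gauss-Hermite cubature over q = N(mu, var)
    pts = mu + np.sqrt(var) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    e = np.stack([pts - xc, y - pts**2])            # e(x) at each sigma point
    de = np.stack([np.ones_like(pts), -2.0 * pts])  # de/dx at each sigma point
    return e @ wts, de @ wts                        # e_bar, statistical Jacobian

mu, var = xc, Pc
for _ in range(100):
    e_bar, E_bar = expectations(mu, var)
    prec = E_bar @ Winv @ E_bar           # (6.74), a scalar here
    mu += -(E_bar @ Winv @ e_bar) / prec  # (6.75) plus the mean update
    var = 1.0 / prec
```

The iteration converges to a fixed point where the expected (statistically linearized) Gauss-Newton gradient vanishes, using half as many function evaluations per step as the full expected-Hessian update would need.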
6.4.2 Parameter Estimation (EM)

The ESGVI framework extends naturally to jointly estimating the model parameters $\boldsymbol{\theta}$ (e.g., the system matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and the noise covariances $\mathbf{Q}$, $\mathbf{R}$) alongside the state. This is achieved with the **expectation maximization (EM)** algorithm:

| Step | Operation |
|---|---|
| E-step (expectation) | Hold $\boldsymbol{\theta}$ fixed; run ESGVI to convergence to obtain $q(\mathbf{x})$ |
| M-step (minimization) | Hold $q(\mathbf{x})$ fixed; minimize $V(q \mid \boldsymbol{\theta})$ over $\boldsymbol{\theta}$ |

Setting the derivative of the M-step objective with respect to $\boldsymbol{\theta}$ to zero yields closed-form update formulas. Because the likelihood is factored, the M-step expectations require only the sub-blocks of $\boldsymbol{\Sigma}$ corresponding to the non-zero blocks of $\boldsymbol{\Sigma}^{-1}$ — exactly those already computed in the E-step.

Covariance estimation (closed-form M-step):

\mathbf{W} = \frac{1}{K}\sum_{k=1}^K E_{q_k}[\mathbf{e}_k(\mathbf{x}_k)\mathbf{e}_k(\mathbf{x}_k)^T]. \tag{6.87}

This update includes a correction term for the uncertainty in the state estimate, not just the squared residuals. EM convergence: each EM iteration does not increase the negative log-likelihood of the measurement data, $-\ln p(\mathbf{z}|\boldsymbol{\theta})$, but convergence is only guaranteed to a local optimum.
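For a scalar example of the M-step update (6.87) — a sketch with made-up measurements $y_k = x + n_k$ and a fixed posterior $q(x) = \mathcal{N}(\mu, \sigma^2)$ — the marginal expectation evaluates to the squared residual plus the state-uncertainty correction mentioned above:

```python
import numpy as np

# M-step covariance update (6.87) for e_k(x) = y_k - x under a fixed Gaussian
# posterior q(x) = N(mu, var). By cubature, E_q[(y_k - x)^2] = (y_k - mu)^2 + var:
# squared residual plus a correction for the uncertainty in the state estimate.
mu, var = 2.0, 0.04                       # made-up posterior over the state
ys = np.array([1.7, 2.2, 2.05, 1.9, 2.3]) # made-up measurements

pts = mu + np.sqrt(var) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])

# W = (1/K) sum_k E_q[e_k(x) e_k(x)^T], evaluated by cubature:
W_hat = np.mean([np.sum(wts * (yk - pts) ** 2) for yk in ys])
W_analytic = np.mean((ys - mu) ** 2) + var
```

Without the `+ var` correction the noise covariance would be underestimated, because part of the residual scatter is explained by uncertainty in the state rather than by measurement noise.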
6.5 Special Case: Linear Systems

6.5.1 Recovery of the Batch Solution

For linear-Gaussian systems, one can verify that GVI converges exactly, in a single iteration, to the batch MAP (Cholesky smoother) solution:

\hat{\mathbf{P}}^{-1} = \underbrace{\mathbf{A}^{-T}\mathbf{Q}^{-1}\mathbf{A}^{-1} + \mathbf{C}^T\mathbf{R}^{-1}\mathbf{C}}_{\text{block-tridiagonal}}, \tag{6.91a}

\hat{\mathbf{P}}^{-1}\hat{\mathbf{x}} = \mathbf{A}^{-T}\mathbf{Q}^{-1}\mathbf{Bu} + \mathbf{C}^T\mathbf{R}^{-1}\mathbf{y}. \tag{6.91b}

This agrees exactly with the result derived in Chapter 3, and the block-tridiagonal sparsity of $\hat{\mathbf{P}}^{-1}$ is preserved. For nonlinear systems, GVI differs from MAP: the GVI iteration uses full marginal posterior expectations, whereas MAP expands only about the mean, so GVI provides a richer approximation of the posterior.
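The batch information form can be sketched for a tiny scalar linear-Gaussian chain (all numbers made up): stacking the prior, motion, and measurement constraints and forming the normal equations yields a tridiagonal precision, and the mean matches a whitened least-squares solve of the same stacked system:

```python
import numpy as np

# Batch linear-Gaussian estimation in information form (6.91) for a scalar
# chain: x_k = a*x_{k-1} + w_k, y_k = c*x_k + n_k (made-up small example).
K = 5
a, c = 0.9, 1.0
q, r, p0 = 0.1, 0.5, 1.0                      # Q, R, P0 variances
x0 = 0.0                                      # prior mean on x_0
ys = np.array([0.1, -0.2, 0.3, 0.0, 0.15])    # made-up measurements

rows, rhs, w = [], [], []
e = np.zeros(K); e[0] = 1.0
rows.append(e); rhs.append(x0); w.append(1.0 / p0)        # prior on x_0
for k in range(1, K):
    e = np.zeros(K); e[k] = 1.0; e[k - 1] = -a
    rows.append(e); rhs.append(0.0); w.append(1.0 / q)    # motion constraint
for k in range(K):
    e = np.zeros(K); e[k] = c
    rows.append(e); rhs.append(ys[k]); w.append(1.0 / r)  # measurement

H = np.array(rows); b = np.array(rhs); Winv = np.diag(w)
prec = H.T @ Winv @ H                      # P_hat^{-1}: exactly tridiagonal
x_hat = np.linalg.solve(prec, H.T @ Winv @ b)
```

The sparsity of `prec` is exactly the block-tridiagonal pattern of (6.91a), and `x_hat` is the batch smoothed estimate.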
6.5.2 System Identification

Via the EM algorithm, the M-step for a linear time-invariant (LTI) system has closed-form updates; for example,

[\mathbf{A} \; \mathbf{B}] = \left[\sum_k (\hat{\mathbf{x}}_k \hat{\mathbf{x}}_{k-1}^T + \hat{\mathbf{P}}_{k,k-1}), \; \sum_k \hat{\mathbf{x}}_k \mathbf{u}_k^T\right] \cdot \left[\begin{smallmatrix} \sum_k (\hat{\mathbf{x}}_{k-1}\hat{\mathbf{x}}_{k-1}^T + \hat{\mathbf{P}}_{k-1}) & \sum_k \hat{\mathbf{x}}_{k-1}\mathbf{u}_k^T \\ \sum_k \mathbf{u}_k\hat{\mathbf{x}}_{k-1}^T & \sum_k \mathbf{u}_k\mathbf{u}_k^T \end{smallmatrix}\right]^{-1}, \tag{6.99a}

\mathbf{Q} = \frac{1}{K}\sum_{k=1}^K \left[(\hat{\mathbf{x}}_k - \mathbf{A}\hat{\mathbf{x}}_{k-1} - \mathbf{B}\mathbf{u}_k)(\cdots)^T + \hat{\mathbf{P}}_k - \hat{\mathbf{P}}_{k,k-1}\mathbf{A}^T - \mathbf{A}\hat{\mathbf{P}}_{k,k-1}^T + \mathbf{A}\hat{\mathbf{P}}_{k-1}\mathbf{A}^T\right], \tag{6.101b}

with similar closed forms for the remaining parameters. Here $\hat{\mathbf{P}}_{k,k-1}$ is the cross-covariance block between adjacent times (an off-diagonal block of $\hat{\mathbf{P}}$), computed as part of the RTS smoother in the E-step.

Comparison with adaptive covariance estimation (Chapter 5): the EM approach computes errors using the smoothed state estimates, which are correlated with the residuals, and corrects for this with covariance terms; the adaptive approach uses the filtered (predicted) state estimates, which are uncorrelated with the errors, and corrects by subtracting state-estimate covariance terms. The two methods are similar in form but differ in these details.
6.6 Nonlinear Systems

6.6.1 GVI for the Kalman Filter Correction Step

GVI can also be applied to a single filter correction step: given the predicted distribution $\mathcal{N}(\check{\mathbf{x}}, \check{\mathbf{P}})$ and a measurement $\mathbf{y}$, minimize

V(q) = \phi_p(\mathbf{x}) + \phi_m(\mathbf{x}) = \frac{1}{2}(\mathbf{x}-\check{\mathbf{x}})^T\check{\mathbf{P}}^{-1}(\mathbf{x}-\check{\mathbf{x}}) + \frac{1}{2}(\mathbf{y}-\mathbf{g}(\mathbf{x}))^T\mathbf{R}^{-1}(\mathbf{y}-\mathbf{g}(\mathbf{x})). \tag{6.102}

Using the alternate loss (ESGVI-GN) produces a Kalman-style correction:

\mathbf{K} = \check{\mathbf{P}}\bar{\mathbf{G}}^T(\mathbf{R} + \bar{\mathbf{G}}\check{\mathbf{P}}\bar{\mathbf{G}}^T)^{-1}, \tag{6.107a}

\mathbf{P}_{\text{op}} \leftarrow (\mathbf{1} - \mathbf{K}\bar{\mathbf{G}})\check{\mathbf{P}}, \tag{6.107b}

\mathbf{x}_{\text{op}} \leftarrow \check{\mathbf{x}} + \mathbf{K}(\mathbf{y} - \bar{\mathbf{g}} - \bar{\mathbf{G}}(\check{\mathbf{x}} - \mathbf{x}_{\text{op}})), \tag{6.107c}

where $\bar{\mathbf{g}} = E_q[\mathbf{g}(\mathbf{x})]$ and the statistical Jacobian $\bar{\mathbf{G}}$ replaces the standard Jacobian; both can be computed with sigma points, evaluated at the iteratively improving posterior rather than at the predicted prior.

The essential difference between GVI and the ISPKF:
- ISPKF sigma points are drawn from the prior $\mathcal{N}(\check{\mathbf{x}}, \check{\mathbf{P}})$ (prior statistical linearization);
- GVI sigma points are drawn from the iteratively updated posterior $\mathcal{N}(\mathbf{x}_{\text{op}}, \mathbf{P}_{\text{op}})$ (posterior statistical linearization).

This difference matters greatly for strong nonlinearities: the posterior is usually significantly narrower than the prior, so linearizing about it approximates the nonlinearity more accurately.
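A scalar sketch of the iterated correction (6.107), using a hypothetical stereo-like measurement model $g(x) = fb/x$ with made-up numbers (`fb`, the prior, and the noise level are all assumptions for illustration); the statistical Jacobian is computed via Stein's lemma from sigma points of the current posterior iterate:

```python
import numpy as np

# Iterated GVI-style Kalman correction (6.107) for a scalar stereo-like model
# g(x) = fb / x. Both g_bar and the statistical Jacobian G_bar are cubature
# expectations over the *current posterior* iterate N(x_op, P_op).
fb = 400.0          # focal length times baseline (hypothetical)
xp, Pp = 20.0, 9.0  # predicted (prior) mean and variance (made up)
R = 0.09            # measurement noise variance (made up)
y = fb / 24.0 + 0.1 # a measurement generated near x = 24

def g(x):
    return fb / x

x_op, P_op = xp, Pp
for _ in range(100):
    # 3-point Gauss-Hermite sigma points from the current posterior estimate:
    pts = x_op + np.sqrt(P_op) * np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    wts = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
    g_bar = np.sum(wts * g(pts))                        # E[g(x)]
    G_bar = np.sum(wts * (pts - x_op) * g(pts)) / P_op  # statistical Jacobian (Stein)
    Kgain = Pp * G_bar / (R + G_bar * Pp * G_bar)       # (6.107a)
    P_op = (1.0 - Kgain * G_bar) * Pp                   # (6.107b)
    x_op = xp + Kgain * (y - g_bar - G_bar * (xp - x_op))  # (6.107c)
```

Because the sigma points shrink with `P_op` as the iteration proceeds, the linearization is taken over the narrow posterior rather than the wide prior — the posterior statistical linearization the text contrasts with the ISPKF.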
6.6.2 Stereo Camera Example Revisited

Returning to the stereo camera example of Chapter 4, we compare the estimated means of the various methods (averaged over many Monte Carlo trials):

| Method | Estimated mean | Bias |
|---|---|---|
| MAP / IEKF | 24.5694 m | 33.0 cm (large bias) |
| ISPKF | 24.7414 m | 3.84 cm |
| GVI (ESGVI) | 24.7792 m | 0.28 cm (tiny bias) |
| True posterior mean | 24.7770 m | — |

The GVI estimate is closest to the true posterior mean, with a very small bias (0.28 cm vs. 33.0 cm for MAP and 3.84 cm for ISPKF). This validates the GVI objective: it genuinely finds the Gaussian closest to the true posterior in KL divergence — not just the mode (MAP) or an approximation of the mean (ISPKF).

Core insight: the essential difference between MAP, IEKF, ISPKF, and GVI is how many points are used to approximate the expectations, and where those points are placed:
- MAP / IEKF: one point (expansion at the mean only);
- SPKF (non-iterated): multiple sigma points, drawn from the prior;
- ISPKF: multiple sigma points, drawn iteratively from the prior;
- GVI: multiple sigma points, drawn iteratively from the posterior.

This reveals a spectrum of approximation quality: more sigma points, evaluated at the iterating posterior, give a better approximation of the true Bayesian posterior.
6.7 Summary

This chapter built up the complete variational-inference framework:

| Concept | Key point |
|---|---|
| Choice of KL direction | Use $\text{KL}(q \| p)$: the expectation is over $q$ and is computable |
| Loss functional | $V(q) = E_q[\phi(\mathbf{x})] + \frac{1}{2}\ln\lvert\boldsymbol{\Sigma}^{-1}\rvert$: data fit + entropy penalty |
| Optimization scheme | Newton iteration: GVI update = expected Hessian + a linear system driven by the expected gradient |
| Exact sparsity | Factored likelihood → block-tridiagonal $\boldsymbol{\Sigma}^{-1}$ → only local marginal covariance sub-blocks needed |
| Cubature | Expectations computed with sigma points; no explicit derivatives required |
| Parameter estimation | EM fits in naturally: E-step = GVI, M-step = closed-form updates |
| Relation to MAP | MAP = one-point approximation of GVI (a single sigma point at the mean) |

Four core conclusions:

1. GVI is a unifying framework for estimation theory: MAP, the Kalman filter, smoothers, and system identification can all be derived within GVI, as special cases or approximations of it.
2. Exact sparsity makes GVI the same order of cost as MAP: the factored likelihood gives GVI the same computational complexity in the state dimension as MAP, with only a larger constant factor.
3. GVI gets closer to the true posterior than MAP: for nonlinear systems it can reduce bias dramatically (from 33 cm to 0.28 cm in the stereo camera example), at the cost of more sigma-point evaluations per iteration.
4. Posterior statistical linearization beats prior statistical linearization: GVI draws sigma points from the posterior and is therefore more accurate than the ISPKF (which draws them from the prior), especially for strong nonlinearities.

In short, GVI provides a principled top-down framework that unifies state estimation, parameter estimation, and system identification. Choosing $\text{KL}(q \| p)$ leads to the (negative) ELBO loss, whose minimization recovers MAP (with one sigma point) or achieves a better Gaussian posterior approximation (with more). Factored likelihoods ensure exact block-tridiagonal sparsity in $\boldsymbol{\Sigma}^{-1}$, making ESGVI computationally tractable via the Takahashi partial covariance algorithm, and the EM extension enables joint state and parameter estimation from data alone. The progression MAP → SPKF/ISPKF → GVI trades more computation for a closer match to the true Bayesian posterior.

Part II moves from purely algebraic vector states to three-dimensional geometry: rotations are not ordinary vectors, and the pose of a robot in 3D space requires specialized mathematical tools. The next chapter introduces the necessary 3D geometry primer.