第二章　概率论基础

Chapter 2 Primer on Probability Theory

本章目标：建立全书所需的概率工具箱。核心是三件事：（1）用概率密度函数描述不确定性；（2）用贝叶斯公式融合先验知识和新测量；（3）理解高斯分布的一切性质——因为后续几乎所有算法都建立在它之上。

Chapter goal: Build the probability toolkit for the entire book. Three essentials: (1) represent uncertainty with probability density functions; (2) fuse prior knowledge with new measurements via Bayes’ rule; (3) master Gaussian distributions—virtually every algorithm in this book relies on them.

2.0 从零开始：什么是概率？ / Starting from Zero: What Is Probability?

中文

在进入数学之前，先建立直觉。

频率派 vs 贝叶斯派：概率有两种解释方式。

频率派：概率是长期频率。掷骰子掷 1000 次，1 朝上出现约 167 次，所以 $P (出现 1) = 1/6$ 。概率是客观的、可重复实验的属性。
贝叶斯派：概率是信念程度（degree of belief）。“明天下雨的概率是 70%“——这个事件只会发生一次，不能重复。但你可以用 0.7 这个数字表达你的信念强度。

本书采用贝叶斯观点。对我们来说，“机器人在位置 $x$ 处的概率”不是”机器人走了无穷多次路径中有多少次在 $x$ “，而是”给定我迄今为止的所有测量，我相信机器人在 $x$ 处的程度“。随着测量的增加，这个信念会不断更新。

连续 vs 离散：日常生活中的概率往往是离散的（骰子有6个面，抛硬币有2种结果）。但机器人的位置是连续的——它可以在空间中的任意一点。处理连续量需要概率密度函数（PDF），而不是简单的概率列表。

English

Before the mathematics, build intuition.

Frequentist vs Bayesian. Probability has two interpretations.

Frequentist: probability is long-run frequency. Roll a die 1000 times; “1” appears about 167 times, so $P (face 1) = 1/6$ . Probability is an objective property of repeatable experiments.
Bayesian: probability is a degree of belief. “70% chance of rain tomorrow”—this event happens only once, not repeatedly. The number 0.7 quantifies strength of belief.

This book adopts the Bayesian view. “The probability that the robot is at position $x$ ” means: given all measurements collected so far, how strongly do I believe the robot is at $x$ ? As measurements arrive, beliefs are updated.

Continuous vs discrete. Daily-life probabilities are often discrete (dice: 6 faces). But a robot’s position is continuous—it can be anywhere in space. Handling continuous quantities requires probability density functions (PDFs) rather than simple probability lists.

2.1 概率密度函数 / Probability Density Functions

2.1.1 定义 / Definitions

中文

基础概念：随机变量

设 $x$ 是一个随机变量——它的值不是确定的，而是按某种概率分布取值。例如：

机器人的位置估计 $x \in R$ （一维情形）
传感器的噪声 $ϵ \in R$

描述随机变量 $x$ 在区间 $[a, b]$ 内取各值可能性的函数，叫做概率密度函数（PDF） $p (x)$ ，它满足两个条件：

非负性： $p (x) \geq 0$ （概率不能为负）
归一性（全概率公理）： $\int_a^b p(x)\, dx = 1 \tag{2.1}$ 即所有可能结果的概率之和为 1。

密度 ≠ 概率： $p (x)$ 是概率密度，不是概率本身，它可以大于 1！概率是密度函数下的面积。

类比：想象一块铁板，质量密度（单位面积质量）可以很大，但你要得到一小块的质量，需要用密度乘以面积（积分）。

$x$ 落在区间 $[c, d]$ 内的概率是： $\Pr(c \leq x \leq d) = \int_c^d p(x)\, dx \tag{2.2}$

累积分布函数（CDF） $P (x)$ 给出 $x$ 小于等于某值的概率： $P(x) = \Pr(x' \leq x) = \int_{-\infty}^x p(x')\, dx' \tag{2.3}$

$P (x)$ 是单调不减的，从 $0$ （在 $- \infty$ 处）增长到 $1$ （在 $+ \infty$ 处）。

条件概率密度： $p (x ∣ y)$ 表示”在已知 $y$ 的条件下 $x$ 的概率密度”，对每个固定的 $y$ ，它仍然是关于 $x$ 的合法 PDF： $(\forall y)\quad \int_a^b p(x \mid y)\, dx = 1 \tag{2.4}$

多维情形：当状态向量 $x = (x_{1}, \dots, x_{N})^{T} \in R^{N}$ 时，联合 PDF $p (x)$ 满足： $\int \cdots \int p(x_1, x_2, \ldots, x_N)\, dx_1\, dx_2 \cdots dx_N = 1 \tag{2.7}$

English

Random variable. A random variable $x$ does not have a fixed value but instead takes values according to a probability distribution. Examples: a robot’s estimated position, sensor noise.

A probability density function (PDF) $p (x)$ over the interval $[a, b]$ satisfies:

Non-negativity: $p (x) \geq 0$
Total probability axiom: $\int_{a}^{b} p (x) d x = 1$

Density ≠ probability. $p (x)$ can exceed 1. Probability is the area under the density curve over an interval: $\Pr (c \leq x \leq d) = \int_{c}^{d} p (x)\, d x$

The CDF $P (x) = \int_{- \infty}^{x} p (x^{'}) d x^{'}$ gives the probability that the variable is $\leq x$ .
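The point that a density is not a probability can be checked numerically. The sketch below assumes a uniform density on $[0, 0.5]$ (an illustrative choice, not taken from the text): the density equals 2 on its whole support, yet the area under it is 1. The helper names `uniform_pdf` and `integrate` are hypothetical.

```python
def uniform_pdf(x, a=0.0, b=0.5):
    """Density of a uniform distribution on [a, b] (illustrative choice)."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def integrate(f, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h


peak_density = uniform_pdf(0.25)                # a density value above 1
total_prob = integrate(uniform_pdf, 0.0, 0.5)   # area under the whole curve
prob_c_d = integrate(uniform_pdf, 0.1, 0.2)     # Pr(0.1 <= x <= 0.2)
```

Here `peak_density` is 2, while `total_prob` and `prob_c_d` come out close to 1 and 0.2: only the integrals are probabilities.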

2.1.2 边缘化与贝叶斯定理 / Marginalization and Bayes’ Rule

中文

这一节是全书最重要的内容之一。

联合分布的因式分解

对于两个随机变量 $x$ 和 $y$ 的联合分布 $p (x, y)$ ，有一个基本的分解： $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x}) \tag{2.8}$

这就是说，“x 和 y 同时发生”的概率 = “已知 y 时 x 发生的概率” × “y 发生的概率”。

直觉：抛一枚硬币，再掷一个骰子。“硬币正面且骰子出现3”的概率 = $P (骰子 3 ∣ 正面) \times P (正面) = \frac{1}{6} \times \frac{1}{2} = \frac{1}{12}$ 。

边缘化（Marginalization）

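Integrating the joint density $p (x, y)$ over $x$ leaves the marginal density of $y$ , because a conditional density integrates to 1 over its own variable: $\int p(\mathbf{x}, \mathbf{y})\, d\mathbf{x} = \int p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{x} = p(\mathbf{y}) \underbrace{\int p(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x}}_{1} = p(\mathbf{y}) \tag{2.9}$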

直觉：你知道北京今天是晴天还是阴天的联合概率 $p (天气, 温度)$ 。如果你对温度不感兴趣，只想知道天气的概率，就把所有可能的温度”积分掉”，得到 $p (天气)$ 。这就是边缘化——“抹去”不关心的变量。

贝叶斯定理（Bayes’ Rule）—— 全书的核心引擎

由公式 (2.8) 的两种因式分解，立刻得到： $\boxed{p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})}{p(\mathbf{y})}} \tag{2.10}$

在状态估计中，各项的含义是：

符号	名称	含义
$p (x)$	先验（prior）	在获得测量之前，对状态的信念
$p (y ∣ x)$	似然（likelihood）	传感器模型：如果状态是 $x$ ，得到测量 $y$ 的概率
$p (x ∣ y)$	后验（posterior）	获得测量 $y$ 之后，对状态的更新信念
$p (y)$	归一化常数	与 $x$ 无关，确保后验积分为 1

用文字表达： $后验 \propto 似然 \times 先验$

直觉：你站在一个房间里，不知道自己在哪里。先验 $p (x)$ 是你”蒙眼走进来”时各位置的概率分布（假设均匀分布）。然后你看到窗户——这是一个测量 $y$ 。似然 $p (y ∣ x)$ 告诉你：如果你在位置 $x$ ，看到窗户的可能性有多大。贝叶斯公式把这两个信息结合起来，给出后验 $p (x ∣ y)$ ——有窗户的墙附近概率更高。

分母通过边缘化计算： $p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \tag{2.12}$

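In general this integral is expensive to compute. The key strength of the Kalman filter is that when the prior $p (x)$ is Gaussian and the likelihood $p (y ∣ x)$ is Gaussian with a mean that is linear in $x$ , the integral has a closed-form solution.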

English

Joint factorization: $p (x, y) = p (x ∣ y) p (y) = p (y ∣ x) p (x)$

Marginalization: integrating a joint density over one variable yields the marginal of the other: $p (y) = \int p (x, y) d x$

Bayes’ rule rearranges the factorization to give the posterior: $p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})}{p(\mathbf{y})} \tag{2.10}$

In state estimation: $p (x)$ is the prior (belief before the measurement), $p (y ∣ x)$ is the likelihood given by the sensor model (how probable the measurement $y$ is if the state is $x$ ), and $p (x ∣ y)$ is the posterior (updated belief after the measurement). The denominator $p (y)$ is a normalizing constant that does not depend on $x$ . In words: $\text{posterior} \propto \text{likelihood} \times \text{prior}$

The denominator requires computing $p (y) = \int p (y ∣ x) p (x) d x$ , which is generally expensive. The Kalman filter obtains it in closed form when the prior is Gaussian and the likelihood is Gaussian with a mean that is linear in $x$ .
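On a discrete grid, the integral for $p (y)$ becomes a sum and Bayes' rule can be applied directly. The following is a minimal sketch, assuming a corridor split into five cells with a uniform prior and made-up likelihood values for a "window seen" measurement; `bayes_update` is a hypothetical helper name.

```python
def bayes_update(prior, likelihood):
    """Posterior over a discrete grid: normalise likelihood * prior."""
    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalised)            # p(y), the marginal over x
    return [u / evidence for u in unnormalised], evidence


prior = [0.2, 0.2, 0.2, 0.2, 0.2]           # uniform belief over 5 cells
likelihood = [0.1, 0.1, 0.6, 0.8, 0.1]      # assumed p(y = "window seen" | x)
posterior, evidence = bayes_update(prior, likelihood)
```

The posterior concentrates on the cells where the measurement is most likely, and `evidence` plays the role of the normalizing constant $p (y)$ .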

2.1.3 期望与矩 / Expectations and Moments

中文

期望算子

期望算子 $E [\cdot]$ 计算某个关于随机变量 $x$ 的函数 $f (x)$ 的”平均值”： $E[f(\mathbf{x})] = \int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \tag{2.13}$

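Intuition: for a fair die, weight each face by its probability to get the expected value $E [x] = (1 + 2 + 3 + 4 + 5 + 6) /6 = 3.5$ . If you roll the die many times and average the results, the average approaches this value. In the continuous case the sum becomes an integral.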

两个最重要的期望：均值与协方差

均值（mean） $μ$ 是 PDF 的”重心”，即随机变量的期望值： $\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x} \tag{2.16}$

协方差矩阵（covariance matrix） $Σ$ 描述随机变量围绕均值的散布程度和各分量之间的相关性： $\boldsymbol{\Sigma} = E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right] \tag{2.17}$

协方差矩阵的物理意义（以二维为例）：

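$Σ = \begin{bmatrix} σ_{11} & σ_{12} \\ σ_{21} & σ_{22} \end{bmatrix}$

$σ_{11}$ : the variance of $x_{1}$ , i.e. the mean squared deviation of $x_{1}$ from its mean. Its square root $\sqrt{σ_{11}}$ is the standard deviation.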

$σ_{22}$ ： $x_{2}$ 的方差。

$σ_{12} = σ_{21}$ ： $x_{1}$ 与 $x_{2}$ 的协方差。若 $σ_{12} > 0$ ，则 $x_{1}$ 大时 $x_{2}$ 也倾向于大（正相关）；若 $σ_{12} < 0$ ，反相关；若 $σ_{12} = 0$ ，不相关。

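A covariance matrix is always symmetric positive-semidefinite: $v^{T} Σ v \geq 0$ for every vector $v$ . When it is also positive-definite ( $v^{T} Σ v > 0$ for every nonzero $v$ , as the Gaussian density in §2.2.1 requires), it geometrically defines an ellipsoidal uncertainty region.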

两个联合分布变量 $x$ 和 $y$ 的互协方差： $\text{cov}(\mathbf{x}, \mathbf{y}) = E\left[(\mathbf{x} - E[\mathbf{x}])(\mathbf{y} - E[\mathbf{y}])^T\right] = E[\mathbf{x}\mathbf{y}^T] - E[\mathbf{x}]E[\mathbf{y}]^T \tag{2.18}$

English

The expectation of a function $f (x)$ under PDF $p (x)$ is its probability-weighted average: $E [f (x)] = \int f (x) p (x) d x$

The mean $μ = E [x]$ is the centre of mass of the PDF.

The covariance matrix $Σ = E [(x - μ) (x - μ)^{T}]$ encodes both the spread of each component (diagonal entries = variances) and pairwise correlations (off-diagonal entries). It is always symmetric positive-semidefinite; when it is positive-definite, as the Gaussian density requires, it geometrically defines an ellipsoidal uncertainty region.

2.1.4 统计独立与不相关 / Independence and Uncorrelatedness

中文

统计独立：如果知道 $y$ 的值对 $x$ 的概率分布没有任何影响，则 $x$ 和 $y$ 统计独立： $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})\, p(\mathbf{y}) \tag{2.19}$

不相关：若互协方差为零，即 $cov (x, y) = 0$ ，则称不相关： $E[\mathbf{x}\mathbf{y}^T] = E[\mathbf{x}]E[\mathbf{y}]^T \tag{2.20}$

重要区别：

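Statistically independent $\Rightarrow$ uncorrelated (this can be proven)

Uncorrelated $\not\Rightarrow$ statistically independent (in general)

Exception: for jointly Gaussian variables, uncorrelated is equivalent to statistically independent (see §2.2.3). This special property of the Gaussian greatly simplifies calculations.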

English

$x$ and $y$ are statistically independent if $p (x, y) = p (x) p (y)$ . They are uncorrelated if $cov (x, y) = 0$ .

Independence implies uncorrelatedness, but not vice versa in general. For jointly Gaussian variables specifically, the two conditions are equivalent (§2.2.3).
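A standard counterexample shows why zero covariance does not imply independence. The sketch below assumes $x$ is uniform on $\{-1, 0, 1\}$ and $y = x^{2}$ (a textbook-style choice, not from this chapter); exact fractions avoid rounding issues.

```python
from fractions import Fraction

# x is uniform on {-1, 0, 1}; y = x**2 is a deterministic function of x.
outcomes = [(-1, 1), (0, 0), (1, 1)]
p = Fraction(1, 3)

E_x = sum(p * x for x, _ in outcomes)
E_y = sum(p * y for _, y in outcomes)
E_xy = sum(p * x * y for x, y in outcomes)
cov_xy = E_xy - E_x * E_y                  # zero: x and y are uncorrelated

# Independence would require p(x, y) = p(x) p(y) for every pair of values.
p_joint = p                                # Pr(x = 1, y = 1)
p_x1 = p                                   # Pr(x = 1)
p_y1 = Fraction(2, 3)                      # Pr(y = 1)
independent = (p_joint == p_x1 * p_y1)     # False: 1/3 != 2/9
```

The covariance is exactly zero even though $y$ is fully determined by $x$ , so the two variables are uncorrelated but not independent.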

2.1.5 香农信息与互信息 / Shannon Information and Mutual Information

中文

香农熵（Shannon entropy） 衡量一个 PDF 的”不确定性”有多大： $H(\mathbf{x}) = -E[\ln p(\mathbf{x})] = -\int p(\mathbf{x})\ln p(\mathbf{x})\, d\mathbf{x} \tag{2.21}$

直觉：

如果 PDF 非常尖锐（集中在一点附近）， $p (x)$ 在峰值处很大， $ln p (x)$ 很大（负号后变小），熵 $H$ 小。→ 不确定性小。

如果 PDF 非常平坦（均匀分布），所有值等可能， $H$ 最大。→ 不确定性最大。

互信息（mutual information） 衡量”知道 $y$ 之后，对 $x$ 的不确定性减少了多少”： $I(\mathbf{x}, \mathbf{y}) = \iint p(\mathbf{x}, \mathbf{y})\ln\frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{x})p(\mathbf{y})}\, d\mathbf{x}\, d\mathbf{y} \tag{2.22}$

性质：

$I (x, y) \geq 0$ ，等号成立当且仅当 $x$ 与 $y$ 统计独立
$I (x, y) = H (x) + H (y) - H (x, y)$

互信息常用于传感器选择：选择那个能最大程度减少状态不确定性的传感器。

English

Shannon entropy $H (x) = - E [ln p (x)]$ quantifies the uncertainty of a PDF: a sharply peaked PDF has low entropy (low uncertainty); a flat PDF has high entropy (high uncertainty).

Mutual information $I (x, y) \geq 0$ measures how much knowing $y$ reduces uncertainty about $x$ . It equals zero if and only if $x$ and $y$ are independent.

2.1.6 KL 散度：衡量两个 PDF 的差异 / Kullback–Leibler Divergence

中文

给定两个关于同一随机变量的 PDF $p_{1} (x)$ 和 $p_{2} (x)$ ，KL 散度衡量它们之间的”距离”： $\text{KL}(p_2 \| p_1) = -\int p_2(\mathbf{x})\ln\frac{p_1(\mathbf{x})}{p_2(\mathbf{x})}\, d\mathbf{x} \geq 0 \tag{2.25}$

性质：

Always non-negative; $KL (p_{2} ∥ p_{1}) = 0$ if and only if $p_{1} = p_{2}$
Asymmetric: in general $KL (p_{2} ∥ p_{1}) \neq KL (p_{1} ∥ p_{2})$ (so it is not a distance in the strict sense)

KL 散度在第 6 章的变分推断中扮演核心角色，用于找到一个简单分布（高斯）来近似一个复杂的后验分布。

English

The KL divergence $KL (p_{2} ∥ p_{1}) \geq 0$ measures how different two PDFs are. It is zero only when they are identical, and is asymmetric ( $KL (p_{2} ∥ p_{1}) \neq KL (p_{1} ∥ p_{2})$ in general). It plays a central role in variational inference (Chapter 6).

2.1.7–2.1.9 随机采样与归一化乘积 / Sampling and Normalized Product

中文

随机采样：从 PDF $p (x)$ 生成一个随机样本（记作 $x_{meas} \leftarrow p (x)$ ）就像”按照概率的权重随机抽签”。粒子滤波器（第 4 章）大量使用这一操作。

对于标量高斯，可以先从均匀分布 $U [0, 1]$ 采样，再通过分位数函数（CDF 的反函数）变换得到高斯样本。多维情形见 §2.2.13。

样本均值与协方差：给定 $N$ 个样本 $x_{1, meas}, \dots, x_{N, meas}$ ，用以下公式估计真实均值和协方差： $\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}_{i,\text{meas}}, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N-1}\sum_{i=1}^N (\mathbf{x}_{i,\text{meas}} - \hat{\boldsymbol{\mu}})(\mathbf{x}_{i,\text{meas}} - \hat{\boldsymbol{\mu}})^T \tag{2.29}$

分母用 $N - 1$ 而非 $N$ ，称为 Bessel 修正，使样本协方差成为真实协方差的无偏估计。

归一化乘积：两个 PDF 的归一化乘积： $p(\mathbf{x}) = \eta\, p_1(\mathbf{x})\, p_2(\mathbf{x}), \quad \eta = \left(\int p_1(\mathbf{x})\, p_2(\mathbf{x})\, d\mathbf{x}\right)^{-1} \tag{2.30}$

在贝叶斯框架下，若两个独立测量 $y_{1}$ 、 $y_{2}$ 各自给出对 $x$ 的估计，在均匀先验下可以通过归一化乘积来融合： $p(\mathbf{x} \mid \mathbf{y}_1, \mathbf{y}_2) = \eta\, p(\mathbf{x} \mid \mathbf{y}_1)\, p(\mathbf{x} \mid \mathbf{y}_2) \tag{2.32}$

这是传感器融合的数学基础。

English

Random sampling ( $x_{meas} \leftarrow p (x)$ ): draw a realization of the random variable; fundamental to particle filters (Chapter 4).

Sample mean and covariance from $N$ samples use the Bessel-corrected denominator $N - 1$ to give an unbiased covariance estimate.

Normalized product: $p (x) = η p_{1} (x) p_{2} (x)$ is a valid PDF. Under a uniform prior, and for measurements that are conditionally independent given $x$ , fusing the two estimates via the normalized product gives $p (x ∣ y_{1}, y_{2}) = η p (x ∣ y_{1}) p (x ∣ y_{2})$ , the mathematical foundation of sensor fusion.
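The sample mean and the Bessel-corrected sample covariance can be sketched with NumPy as follows. The three data points are made up for illustration and `sample_mean_cov` is a hypothetical helper; `np.cov` with `rowvar=False` uses the same $N - 1$ denominator by default and serves as a cross-check.

```python
import numpy as np


def sample_mean_cov(samples):
    """Sample mean and Bessel-corrected covariance of an (N, D) array."""
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    mu_hat = samples.mean(axis=0)
    centred = samples - mu_hat
    sigma_hat = centred.T @ centred / (n - 1)   # divide by N - 1, not N
    return mu_hat, sigma_hat


samples = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]])  # made-up data
mu_hat, sigma_hat = sample_mean_cov(samples)
```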

2.1.10 Cramér–Rao 下界与 Fisher 信息 / CRLB and Fisher Information

中文

问题：我有一个未知参数 $θ$ （比如机器人的真实位置），通过传感器得到测量 $x_{meas} \sim p (x ∣ θ)$ 。我用某种方法从测量中估计 $\hat{θ}$ 。我能把这个估计做得多精确？

Cramér–Rao 下界（CRLB） 给出任何无偏估计器协方差的理论下界： $\text{cov}(\hat{\boldsymbol{\theta}} \mid \mathbf{x}_\text{meas}) \geq \mathcal{I}_{\boldsymbol{\theta}}^{-1} \tag{2.39}$

其中 Fisher 信息矩阵（FIM） 衡量测量中关于 $θ$ 包含多少信息： $\mathcal{I}_{\boldsymbol{\theta}} = E\left[\frac{\partial^2(-\ln p(\mathbf{x} \mid \boldsymbol{\theta}))}{\partial \boldsymbol{\theta}^T \partial \boldsymbol{\theta}}\right] \tag{2.41}$

直觉：

Fisher 信息越大 → 测量对参数越”灵敏”→ 估计误差的下界越小

Fisher 信息越小 → 测量对参数不敏感 → 无论用什么估计器，误差都不可能太小

这给了我们一个基准：任何无偏估计器的精度都不可能超过 CRLB。能达到 CRLB 的估计器称为有效估计器（efficient estimator）。

English

The CRLB gives the theoretical lower bound on the covariance of any unbiased estimator $\hat{θ}$ : $\text{cov} (\hat{θ}) \geq I_{θ}^{- 1}$ , where the Fisher information matrix $I_{θ}$ measures how much information the measurement carries about $θ$ . An estimator achieving this bound is called efficient. The CRLB sets a fundamental precision limit: no unbiased estimator, however cleverly designed, can beat it.

2.2 高斯概率密度函数 / Gaussian Probability Density Functions

中文

高斯分布（也叫正态分布）是本书的主角。为什么？

中心极限定理：大量独立随机变量之和趋向高斯分布。传感器噪声往往是许多小误差之和，自然呈高斯分布。
数学可处理性：高斯分布在线性变换、条件化、边缘化、乘积等操作下都保持封闭（结果还是高斯），这使得许多推断有解析解。
最大熵：在给定均值和协方差的所有分布中，高斯分布的熵最大——它是”最不假设额外信息”的分布。

English

The Gaussian is the central distribution in this book because:

Central limit theorem: sums of many independent errors tend toward a Gaussian, so sensor noise that aggregates many small independent effects is often well approximated as Gaussian.
Closed-form tractability: Gaussian distributions stay Gaussian under linear transforms, conditioning, marginalisation, and products—enabling analytic solutions.
Maximum entropy: among all distributions with a given mean and covariance, the Gaussian maximises entropy—it is the “least informative” choice.

2.2.1 定义 / Definitions

中文

一维高斯分布

$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{2.42}$

参数：

$μ$ ：均值（mean），PDF 的对称中心，也是最可能的值（众数）
$σ^{2}$ ：方差（variance）， $σ$ 称为标准差，描述散布宽度

记作 $x \sim N (μ, σ^{2})$ 。

经验法则（3σ 规则）：

$[μ - σ, μ + σ]$ 内包含约 68.3% 的概率

$[μ - 2 σ, μ + 2 σ]$ 内包含约 95.4% 的概率

$[μ - 3 σ, μ + 3 σ]$ 内包含约 99.7% 的概率

多维高斯分布

$p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^N \det\boldsymbol{\Sigma}}}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) \tag{2.45}$

参数：

$μ \in R^{N}$ ：均值向量
$Σ \in R^{N \times N}$ ：协方差矩阵（对称正定）

记作 $x \sim N (μ, Σ)$ 。

几何理解：多维高斯的等概率面是椭球面，满足： $(x - μ)^{T} Σ^{- 1} (x - μ) = c^{2}$ 椭球的形状和朝向由 $Σ$ 的特征值和特征向量决定。 $Σ$ 的特征值越大，椭球在那个方向越”胖”（不确定性越大）。

指数内部的量

公式中的二次型 $(x - μ)^{T} Σ^{- 1} (x - μ)$ 称为马氏距离（Mahalanobis distance）的平方（见 §2.2.9），它是”用不确定性校正过的欧氏距离”。协方差矩阵的逆 $Σ^{- 1}$ 称为精度矩阵（precision matrix）或信息矩阵，见 §2.2.4。

English

Univariate Gaussian: $p (x ∣ μ, σ^{2}) = \frac{1}{\sqrt{2 π σ^{2}}} \exp (- \frac{(x - μ)^{2}}{2 σ^{2}})$ , notation: $x \sim N (μ, σ^{2})$ .

Empirical rule (3σ): $\approx$ 68% of probability within $\pm σ$ ; $\approx$ 95% within $\pm 2 σ$ ; $\approx$ 99.7% within $\pm 3 σ$ .

Multivariate Gaussian: $p (x ∣ μ, Σ) \propto exp (- \frac{1}{2} (x - μ)^{T} Σ^{- 1} (x - μ))$ , notation: $x \sim N (μ, Σ)$ .

Equiprobability surfaces are ellipsoids; their shape and orientation are determined by the eigendecomposition of $Σ$ .
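The empirical-rule percentages can be reproduced from the Gaussian CDF. For a univariate Gaussian, $\Pr (|x - μ| \leq k σ) = \text{erf} (k / \sqrt{2})$ , which the sketch below evaluates with the standard-library `math.erf`; `prob_within` is a hypothetical helper name.

```python
import math


def prob_within(k):
    """Pr(|x - mu| <= k * sigma) for a univariate Gaussian, via erf."""
    return math.erf(k / math.sqrt(2.0))


coverage = {k: prob_within(k) for k in (1, 2, 3)}
```

The three values round to the 68.3%, 95.4% and 99.7% quoted above, independent of $μ$ and $σ$ .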

2.2.2 联合高斯与条件推断 / Joint Gaussian and Conditional Inference

中文

这一节是卡尔曼滤波器的数学核心。

联合高斯分布：两个向量 $(x, y)$ 服从联合高斯： $p(\mathbf{x}, \mathbf{y}) = \mathcal{N}\!\left(\begin{bmatrix}\boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y\end{bmatrix},\, \begin{bmatrix}\boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_{yy}\end{bmatrix}\right) \tag{2.48}$

其中 $Σ_{y x} = Σ_{x y}^{T}$ ，互协方差描述 $x$ 和 $y$ 之间的相关性。

核心结论：条件高斯分布

利用 Schur 补（一种矩阵分块求逆技巧），可以证明：给定 $y$ 后， $x$ 的条件分布仍然是高斯分布：

$\boxed{p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}\!\left(\boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}(\mathbf{y} - \boldsymbol{\mu}_y),\;\; \boldsymbol{\Sigma}_{xx} - \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{yx}\right)} \tag{2.52b}$

解读这个公式：

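New mean = old mean + correction: $μ_{x ∣ y} = μ_{x} + \underbrace{Σ_{x y} Σ_{yy}^{- 1}}_{\text{gain}} \underbrace{(y - μ_{y})}_{\text{innovation}}$
- Innovation = actual measurement minus expected measurement, i.e. how much the measurement "surprised" us
- Gain = $Σ_{x y} Σ_{yy}^{- 1}$ , which determines how that surprise propagates into the state estimate
New covariance = old covariance minus a reduction: $Σ_{x ∣ y} = Σ_{xx} - \underbrace{Σ_{x y} Σ_{yy}^{- 1} Σ_{y x}}_{\geq 0}$ The subtracted term is positive-semidefinite, so the covariance can never grow: conditioning on a measurement never increases uncertainty, and leaves it unchanged only when $Σ_{x y} = 0$ .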

这就是卡尔曼滤波更新步骤的本质！ 第 3 章将把它系统地发展成完整的滤波算法。

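Intuition: you predict that a friend is at a certain café (mean $μ_{x}$ , uncertainty $Σ_{xx}$ ). You phone them (measurement $y$ ) and hear a subway in the background, which is not the café noise you expected. You use this innovation to update the position estimate: it shifts toward places near a subway line, and its uncertainty shrinks.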

English

Joint Gaussian: $p (x, y) = N (μ, Σ)$ with block mean and covariance.

Key result (conditional Gaussian): given a measured value of $y$ , the posterior $p (x ∣ y)$ is also Gaussian: $p (x ∣ y) = N (μ_{x} + Σ_{x y} Σ_{yy}^{- 1} (y - μ_{y}), Σ_{xx} - Σ_{x y} Σ_{yy}^{- 1} Σ_{y x})$

The updated mean adds a correction proportional to the innovation $(y - μ_{y})$ (the difference between the actual and the predicted measurement). The updated covariance is never larger than the prior covariance, because the subtracted term is positive-semidefinite: conditioning on a measurement cannot increase uncertainty. This formula is the mathematical heart of the Kalman filter update step (Chapter 3).
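The conditional Gaussian formula translates almost line by line into NumPy. This is a minimal sketch with made-up scalar blocks; `condition_gaussian` is a hypothetical helper, and it uses `np.linalg.solve` rather than an explicit inverse as a numerical design choice.

```python
import numpy as np


def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """Mean and covariance of p(x | y) for a joint Gaussian."""
    # Gain K = S_xy S_yy^{-1}; solve() avoids forming the explicit inverse.
    K = np.linalg.solve(S_yy.T, S_xy.T).T
    mu_post = mu_x + K @ (y - mu_y)        # prior mean + gain * innovation
    S_post = S_xx - K @ S_xy.T             # uses S_yx = S_xy^T
    return mu_post, S_post


mu_x, mu_y = np.array([0.0]), np.array([0.0])
S_xx = np.array([[4.0]])                   # made-up prior variance of x
S_xy = np.array([[2.0]])                   # made-up cross-covariance
S_yy = np.array([[2.0]])                   # made-up variance of y
mu_post, S_post = condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, np.array([1.0]))
```

With these numbers the gain is 1, so the mean moves from 0 to 1 and the variance drops from 4 to 2.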

2.2.3 独立性与不相关：高斯的特殊性 / Independence and Uncorrelatedness for Gaussians

中文

一般情形下，独立 ⇒ 不相关，反之不成立。但高斯分布特殊：

高斯分布的重要性质： $x$ 和 $y$ 服从联合高斯分布时，不相关 $\Leftrightarrow$ 统计独立。

证明很简单：若 $Σ_{x y} = 0$ ，代入 (2.52b) 得到 $p (x ∣ y) = p (x) = N (μ_{x}, Σ_{xx})$ ，即 $y$ 的值不影响 $x$ 的分布，就是独立的。

这个性质让我们可以用”互协方差为零”直接判断高斯随机变量的独立性，极大简化推导。

English

For jointly Gaussian $x$ and $y$ : uncorrelated $\Leftrightarrow$ statistically independent. This equivalence, which fails for general distributions, simplifies proofs throughout the book: checking $Σ_{x y} = 0$ is sufficient to establish independence.

2.2.4 信息形式 / Information Form

中文

高斯分布通常用 $(μ, Σ)$ 表达，称为矩形式（moment form）。另一种等价表达是信息形式（information form）：

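Name	Symbol	Definition
Precision (information) matrix	$Λ = Σ^{- 1}$	inverse of the covariance
Information vector	$ξ = Σ^{- 1} μ$	mean premultiplied by the precision matrix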

高斯分布的指数部分可以写成： $- \frac{1}{2} (x - μ)^{T} Σ^{- 1} (x - μ) = - \frac{1}{2} x^{T} Λ x + ξ^{T} x + 常数$

为什么需要信息形式？

当某个变量完全未知（no prior information），对应的协方差 $Σ \to \infty$ （无穷大矩阵）。在矩形式中这无法表达，但在信息形式中， $Λ = Σ^{- 1} \to 0$ （全零矩阵），很好处理。

多个独立测量的归一化乘积在信息形式下极其简单：精度矩阵相加（见 §2.2.8）。

稀疏图优化（第 9–11 章）中，系统信息矩阵是稀疏的，可以高效求解。

English

The information form parameterises a Gaussian by precision matrix $Λ = Σ^{- 1}$ and information vector $ξ = Σ^{- 1} μ$ .

Advantages: (1) a completely uninformative prior has $Λ = 0$ (vs $Σ = \infty$ ); (2) fusing independent measurements simply adds precision matrices; (3) the system information matrix in batch estimation problems is typically sparse, enabling efficient solvers.

2.2.5 联合高斯的边缘分布 / Marginals of a Joint Gaussian

中文

联合高斯的边缘分布直接从均值和协方差的对应子块读出： $p(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{xx}), \qquad p(\mathbf{y}) = \mathcal{N}(\boldsymbol{\mu}_y, \boldsymbol{\Sigma}_{yy}) \tag{2.59}$

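Marginalization in information form is more involved and requires the Schur complement: if the joint information matrix is $\begin{bmatrix} A_{xx} & A_{x y} \\ A_{y x} & A_{yy} \end{bmatrix}$ , then the precision matrix of the marginal over $x$ is: $\boldsymbol{\Sigma}_{xx}^{-1} = \mathbf{A}_{xx} - \mathbf{A}_{xy}\mathbf{A}_{yy}^{-1}\mathbf{A}_{yx} \tag{2.61}$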

这正是矩阵 Schur 补的形式，会在 SLAM（第 10 章）的稀疏求解中再次出现。

English

The marginals of a joint Gaussian are simply the diagonal blocks: $p (x) = N (μ_{x}, Σ_{xx})$ . In information form, marginalisation requires the Schur complement: $Σ_{xx}^{- 1} = A_{xx} - A_{x y} A_{yy}^{- 1} A_{y x}$ , a formula that recurs in sparse SLAM solvers (Chapters 9–10).
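The Schur-complement formula for the marginal precision can be checked numerically. The sketch below assumes a made-up $3 \times 3$ joint covariance in which $x$ is the first two components and $y$ the last one; all variable names are illustrative.

```python
import numpy as np

Sigma = np.array([[2.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])    # made-up joint covariance
nx = 2                                 # x = first two components
A = np.linalg.inv(Sigma)               # joint information matrix
A_xx, A_xy = A[:nx, :nx], A[:nx, nx:]
A_yx, A_yy = A[nx:, :nx], A[nx:, nx:]

# Marginal precision of x via the Schur complement of A_yy.
Lambda_x = A_xx - A_xy @ np.linalg.solve(A_yy, A_yx)
Sigma_xx = Sigma[:nx, :nx]             # moment form: read off the block
```

`Lambda_x` agrees with the inverse of `Sigma_xx`, which is why marginalization is trivial in moment form but needs a Schur complement in information form.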

2.2.6 线性变量变换 / Linear Change of Variables

中文

从 $x$ 到 $y$ （正向）：设 $x \sim N (μ_{x}, Σ_{xx})$ ， $y = Gx$ （线性变换），则： $\mathbf{y} \sim \mathcal{N}(\mathbf{G}\boldsymbol{\mu}_x,\; \mathbf{G}\boldsymbol{\Sigma}_{xx}\mathbf{G}^T) \tag{2.64}$

证明思路：对均值，线性算子可以直接穿过期望算子： $E [Gx] = G E [x] = G μ_{x}$ 。对协方差，类似地得到 $G Σ_{xx} G^{T}$ 。

直觉：把圆形不确定椭球通过矩阵 $G$ 拉伸，就得到新的椭球。 $G$ 在某方向拉伸越多，那个方向的不确定性越大。

这个公式是机器人运动学中不确定性传播的基础：已知状态 $x$ 的不确定性，通过雅可比矩阵 $G$ 计算输出的不确定性。

English

If $x \sim N (μ_{x}, Σ_{xx})$ and $y = Gx$ , then: $y \sim N (G μ_{x}, G Σ_{xx} G^{T})$

This uncertainty propagation formula is fundamental in robotics: given state uncertainty $Σ_{xx}$ and a Jacobian $G$ , it gives the output uncertainty $Σ_{yy}$ .
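As a small illustration of the propagation rule, the sketch below pushes a circular (isotropic) uncertainty through an assumed diagonal matrix $G$ that stretches one axis and shrinks the other. The numbers are made up.

```python
import numpy as np

mu_x = np.array([1.0, 2.0])
Sigma_xx = np.array([[1.0, 0.0],
                     [0.0, 1.0]])       # circular (isotropic) uncertainty
G = np.array([[3.0, 0.0],
              [0.0, 0.5]])              # stretch axis 1, shrink axis 2 (assumed)

mu_y = G @ mu_x                         # mean: G mu_x
Sigma_yy = G @ Sigma_xx @ G.T           # covariance: G Sigma_xx G^T
```

The output standard deviations are 3 and 0.5: the circle has become an ellipse elongated along the stretched axis.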

2.2.7 通过非线性传播高斯 / Passing a Gaussian through a Nonlinearity

中文

真实传感器模型是非线性的（相机透视投影、距离-角度传感器等），因此我们常常需要计算： $p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}, \quad \text{其中} \quad p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{g}(\mathbf{x}), \mathbf{R}) \tag{2.74}$

$g (\cdot)$ 是非线性函数， $R$ 是测量噪声协方差。

线性化近似（一阶泰勒展开）：在均值 $μ_{x}$ 附近线性化： $\mathbf{g}(\mathbf{x}) \approx \mathbf{g}(\boldsymbol{\mu}_x) + \mathbf{G}(\mathbf{x} - \boldsymbol{\mu}_x), \quad \mathbf{G} = \left.\frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}}\right|_{\mathbf{x}=\boldsymbol{\mu}_x} \tag{2.83}$

其中 $G$ 是 $g$ 在均值处的雅可比矩阵（Jacobian）。

经过线性化后，代入 (2.74) 积分，结果是： $\mathbf{y} \approx \mathcal{N}(\mathbf{g}(\boldsymbol{\mu}_x),\; \mathbf{R} + \mathbf{G}\boldsymbol{\Sigma}_{xx}\mathbf{G}^T) \tag{2.88}$

解读：

均值：直接把均值代入非线性函数

协方差：线性传播项 $G Σ_{xx} G^{T}$ （状态不确定性经 Jacobian 传播）加上测量噪声 $R$

这个公式将直接导出扩展卡尔曼滤波器（EKF）的预测步骤（第 4 章）。

更精确的方法——Sigma 点变换：线性化在非线性较强时精度不足。一种更精确的方法是通过一组精心选择的”sigma 点”来捕捉非线性效果，见第 4 章。

English

For a stochastic nonlinearity $p (y ∣ x) = N (g (x), R)$ , the integral $p (y) = \int p (y ∣ x) p (x) d x$ has no closed form in general.

Linearization (first-order Taylor): approximate $g (x) \approx g (μ_{x}) + G (x - μ_{x})$ where $G$ is the Jacobian at the mean. This gives: $y \approx N (g (μ_{x}), R + G Σ_{xx} G^{T})$

The output mean is $g$ evaluated at the input mean; the output covariance is the Jacobian-propagated input uncertainty plus measurement noise. This first-order propagation is what the EKF uses to push covariances through its nonlinear models (Chapter 4).
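A minimal sketch of the linearized propagation, assuming a polar-to-Cartesian map as the nonlinearity $g$ (an illustrative choice, not one prescribed by the text) with made-up noise levels. The Jacobian is written out analytically in the hypothetical helper `jacobian_g`.

```python
import numpy as np


def g(x):
    """Assumed nonlinearity: polar (range, bearing) to Cartesian."""
    r, th = x
    return np.array([r * np.cos(th), r * np.sin(th)])


def jacobian_g(x):
    """Analytic Jacobian of g with respect to (r, th)."""
    r, th = x
    return np.array([[np.cos(th), -r * np.sin(th)],
                     [np.sin(th),  r * np.cos(th)]])


mu_x = np.array([10.0, 0.0])              # 10 m range, zero bearing
Sigma_xx = np.diag([0.1**2, 0.05**2])     # std: 0.1 m and 0.05 rad
R = 0.01 * np.eye(2)                      # additive noise covariance

G = jacobian_g(mu_x)                      # linearize at the mean
mu_y = g(mu_x)                            # mean passed straight through g
Sigma_yy = R + G @ Sigma_xx @ G.T         # first-order covariance
```

At 10 m range the bearing uncertainty dominates the cross-range variance (0.25 of the 0.26 total), which shows how the Jacobian scales each input uncertainty.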

2.2.8 高斯的归一化乘积 / Normalized Product of Gaussians

中文

关键结论： $K$ 个高斯 PDF 的归一化乘积仍然是高斯，且在信息形式下非常简洁：

$\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) = \eta \prod_{k=1}^K \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right) \tag{2.89}$

合并后的精度矩阵和信息向量为： $\boldsymbol{\Sigma}^{-1} = \sum_{k=1}^K \boldsymbol{\Sigma}_k^{-1}, \qquad \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} = \sum_{k=1}^K \boldsymbol{\Sigma}_k^{-1}\boldsymbol{\mu}_k \tag{2.90}$

直觉（1维情形）：两个关于同一变量的高斯估计，信息量（精度 = 方差的倒数）直接相加，融合后的均值是精度加权平均： $\frac{1}{σ ^{2}} = \frac{1}{σ _{1}^{2}} + \frac{1}{σ _{2}^{2}}, \frac{μ}{σ ^{2}} = \frac{μ _{1}}{σ _{1}^{2}} + \frac{μ _{2}}{σ _{2}^{2}}$ 融合后的方差比任意一个单独估计的方差都小——两个传感器总比一个好。

更一般的形式（带矩阵 $G_{k}$ ）： $\boldsymbol{\Sigma}^{-1} = \sum_{k=1}^K \mathbf{G}_k^T \boldsymbol{\Sigma}_k^{-1} \mathbf{G}_k, \qquad \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} = \sum_{k=1}^K \mathbf{G}_k^T \boldsymbol{\Sigma}_k^{-1}\boldsymbol{\mu}_k \tag{2.92}$

这是批量估计（batch estimation）中最小二乘的矩阵形式，第 3 章将大量使用。

English

The normalized product of $K$ Gaussians is Gaussian. In information form, precision matrices simply add: $Σ^{- 1} = \sum_{k = 1}^{K} Σ_{k}^{- 1}, Σ^{- 1} μ = \sum_{k = 1}^{K} Σ_{k}^{- 1} μ_{k}$

Fusion always reduces uncertainty: the combined variance is smaller than any individual variance. The generalized form with matrices $G_{k}$ is the matrix form of least squares, central to batch estimation (Chapter 3).
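The information-form fusion rule can be sketched in a few lines. The example assumes two made-up scalar estimates of the same quantity; `fuse_gaussians` is a hypothetical helper that works for any dimension.

```python
import numpy as np


def fuse_gaussians(means, covs):
    """Normalised product of Gaussians: add precisions and information vectors."""
    Lambda = sum(np.linalg.inv(S) for S in covs)                 # sum of Sigma_k^{-1}
    xi = sum(np.linalg.inv(S) @ m for m, S in zip(means, covs))  # sum of Sigma_k^{-1} mu_k
    Sigma = np.linalg.inv(Lambda)
    return Sigma @ xi, Sigma


means = [np.array([0.0]), np.array([2.0])]       # made-up estimates
covs = [np.array([[1.0]]), np.array([[4.0]])]    # made-up variances
mu, Sigma = fuse_gaussians(means, covs)
```

The fused mean 0.4 lies closer to the more precise estimate, and the fused variance 0.8 is smaller than either input variance.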

2.2.9 马氏距离与卡方分布 / Mahalanobis Distance and Chi-Squared

中文

马氏距离是”考虑了协方差结构的欧氏距离”： $d_M^2 = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \tag{2.94}$

当 $x \sim N (μ, Σ)$ 时， $d_{M}^{2}$ 服从卡方分布 $χ^{2} (N)$ （ $N$ 为维数）：

均值 $= N$ ，方差 $= 2 N$

为什么要用马氏距离而非欧氏距离？

假设机器人位置的不确定性在东西方向很大（ $σ_{x} = 10 m$ ），南北方向很小（ $σ_{y} = 1 m$ ）。用欧氏距离，偏东 5 m 和偏北 5 m 一样”远”。但用马氏距离，偏东 5 m（ $= 0.5 σ_{x}$ ，很正常）比偏北 5 m（ $= 5 σ_{y}$ ，很异常）小得多——符合直觉。

马氏距离用于：

检验点 $x$ 是否属于某个高斯分布（ $d_{M}^{2}$ 太大则视为离群点）
估计器性能评估（见 §5.1 NEES/NIS 检验）
最大后验估计中的目标函数即马氏距离的平方和

English

The squared Mahalanobis distance $d_{M}^{2} = (x - μ)^{T} Σ^{- 1} (x - μ)$ is a covariance-weighted Euclidean distance. For $x \sim N (μ, Σ)$ , it follows $χ^{2} (N)$ with mean $N$ and variance $2 N$ .

Applications: outlier detection (a point with large $d_{M}^{2}$ is anomalous given the Gaussian model), estimator consistency testing (Chapter 5), and as the objective function in MAP estimation.
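The east/north example can be reproduced numerically, together with a simple outlier gate. The sketch assumes a 95% gate, which is an illustrative threshold; for $N = 2$ the chi-squared CDF is $1 - e^{- d^{2} /2}$ , so the gate follows in closed form. `mahalanobis_sq` is a hypothetical helper.

```python
import math
import numpy as np


def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))


mu = np.zeros(2)
Sigma = np.diag([10.0**2, 1.0**2])       # sigma_east = 10 m, sigma_north = 1 m
d2_east = mahalanobis_sq(np.array([5.0, 0.0]), mu, Sigma)    # 0.5 sigma
d2_north = mahalanobis_sq(np.array([0.0, 5.0]), mu, Sigma)   # 5 sigma

# For N = 2 the chi-squared CDF is 1 - exp(-d2 / 2), so the 95% gate is:
gate_95 = -2.0 * math.log(1.0 - 0.95)
is_outlier = {"east": d2_east > gate_95, "north": d2_north > gate_95}
```

Both offsets are 5 m in Euclidean terms, but only the northward one exceeds the gate.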

2.2.10–2.2.12 香农信息、互信息与 KL 散度（高斯情形）

中文

高斯 PDF 的香农熵： $H(\mathbf{x}) = \frac{1}{2}\ln\left((2\pi e)^N \det\boldsymbol{\Sigma}\right) \tag{2.99}$

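Geometric interpretation: $\sqrt{\det Σ}$ is proportional to the volume of the uncertainty ellipsoid. Larger covariance → larger ellipsoid → higher entropy → more uncertainty.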

联合高斯的互信息： $I(\mathbf{x}, \mathbf{y}) = -\frac{1}{2}\ln\det\!\left(\mathbf{1} - \boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{yx}\right) \geq 0 \tag{2.105}$

当 $Σ_{x y} = 0$ （不相关）时， $I = 0$ ，与独立性条件一致。

两个高斯的 KL 散度（闭合形式）： $\text{KL}(p_2 \| p_1) = \frac{1}{2}\left[(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) + \ln\frac{\det\boldsymbol{\Sigma}_1}{\det\boldsymbol{\Sigma}_2} + \text{tr}(\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_2) - N\right] \tag{2.107}$

这个解析公式在变分推断中极其重要（第 6 章）。

English

For Gaussian $x \sim N (μ, Σ)$ :

Shannon entropy: $H (x) = \frac{1}{2} \ln ((2 π e)^{N} \det Σ)$ , which depends only on $Σ$ and equals the log-volume of the uncertainty ellipsoid up to an additive constant.
Mutual information between $x$ and $y$ in a joint Gaussian has a closed-form expression involving the cross-covariance block.
KL divergence between two Gaussians has a closed-form expression involving mean differences and covariance ratios (eq. 2.107), essential for variational inference.
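The closed-form entropy and KL divergence are straightforward to code. The sketch below uses `np.linalg.slogdet` for the log-determinants and made-up scalar Gaussians as inputs; `gaussian_entropy` and `gaussian_kl` are hypothetical helper names.

```python
import numpy as np


def gaussian_entropy(Sigma):
    """H = 0.5 * ln((2 pi e)^N det(Sigma)), in nats."""
    n = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + logdet)


def gaussian_kl(mu2, Sigma2, mu1, Sigma1):
    """KL(p2 || p1) for p_k = N(mu_k, Sigma_k), closed form."""
    n = mu1.shape[0]
    dmu = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    return 0.5 * (dmu @ np.linalg.solve(Sigma1, dmu)
                  + logdet1 - logdet2
                  + np.trace(np.linalg.solve(Sigma1, Sigma2))
                  - n)


mu1, S1 = np.array([0.0]), np.array([[1.0]])   # made-up p1
mu2, S2 = np.array([1.0]), np.array([[4.0]])   # made-up p2
H1 = gaussian_entropy(S1)
kl_21 = gaussian_kl(mu2, S2, mu1, S1)
kl_12 = gaussian_kl(mu1, S1, mu2, S2)
```

The two KL values differ (about 1.31 versus 0.44 nats), a concrete instance of the asymmetry of the divergence.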

2.2.13 多维高斯采样 / Sampling from a Multivariate Gaussian

中文

要从 $z \sim N (μ, Σ)$ 采样，步骤如下：

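Cholesky-factor the covariance: $Σ = V V^{T}$ ( $V$ is lower-triangular)
Draw a vector of $N$ independent standard normal samples: $x_{meas} \leftarrow N (0, 1)$ , where $1$ denotes the $N \times N$ identity matrix
Transform: $z_{meas} = μ + V x_{meas}$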

验证： $E [z] = μ$ ， $cov (z) = V V^{T} = Σ$ 。✓

这个采样方法在粒子滤波器和变分推断的随机梯度算法中广泛使用。

English

To sample from $N (μ, Σ)$ : (1) Cholesky-factor $Σ = V V^{T}$ ; (2) draw a vector of independent standard normals $x \leftarrow N (0, 1)$ , with $1$ the identity matrix; (3) form $z = μ + Vx$ .
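The three sampling steps can be sketched with NumPy as follows. The mean, covariance, sample count and seed are made-up values, and `sample_gaussian` is a hypothetical helper; samples are stored as rows, so the transform appears as `x @ V.T`.

```python
import numpy as np


def sample_gaussian(mu, Sigma, n_samples, rng):
    """Draw samples from N(mu, Sigma) via the Cholesky factor Sigma = V V^T."""
    V = np.linalg.cholesky(Sigma)                       # lower-triangular factor
    x = rng.standard_normal((n_samples, mu.shape[0]))   # rows ~ N(0, identity)
    return mu + x @ V.T                                 # each row is mu + V x


rng = np.random.default_rng(0)                          # fixed seed for repeatability
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
z = sample_gaussian(mu, Sigma, 200_000, rng)
```

With this many samples, the sample mean and sample covariance of `z` should land close to `mu` and `Sigma`.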

2.2.14 CRLB 应用于高斯 / CRLB for Gaussians

中文

设 $K$ 个独立同分布样本 $x_{k} \sim N (μ, Σ)$ ，估计均值 $μ$ 。Fisher 信息矩阵为： $\mathcal{I}_{\boldsymbol{\mu}} = K\boldsymbol{\Sigma}^{-1} \tag{2.116}$

CRLB 给出： $cov (\hat{μ}) \geq \frac{1}{K} Σ$

样本均值 $\hat{μ} = \frac{1}{K} \sum_{k} x_{k}$ 恰好达到此下界，因此是有效估计器。结论符合直觉：测量越多（ $K$ 越大），对均值的估计越精确。

English

With $K$ iid samples from $N (μ, Σ)$ , the CRLB gives $cov (\hat{μ}) \geq \frac{1}{K} Σ$ . The sample mean achieves this bound exactly and is therefore efficient.
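A Monte Carlo sketch can illustrate that the sample mean sits on the bound. It assumes a scalar Gaussian with made-up values for the variance, the sample size $K$ , the number of repeated experiments and the seed.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, K, trials = 4.0, 25, 40_000      # assumed variance, sample size, repetitions

# Each row is one experiment: K iid draws from N(mu = 3, sigma2).
data = 3.0 + np.sqrt(sigma2) * rng.standard_normal((trials, K))
mu_hat = data.mean(axis=1)               # sample-mean estimator per experiment

crlb = sigma2 / K                        # inverse Fisher information, Sigma / K
empirical_var = mu_hat.var(ddof=1)       # spread of the estimator across trials
```

The empirical variance of the estimator should be close to `crlb` (0.16 here), consistent with the sample mean being efficient.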

2.2.15 Sherman–Morrison–Woodbury 恒等式 / SMW Identity

中文

这是一组矩阵求逆恒等式，通过两种分解（LDU 和 UDL）推导得出：

$\boxed{(\mathbf{A}^{-1} + \mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1} = \mathbf{A} - \mathbf{A}\mathbf{B}(\mathbf{D} + \mathbf{C}\mathbf{A}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}} \tag{2.124a}$

为什么重要？

在估计中经常出现如下场景：状态维数很高（ $N$ 大），但测量维数很低（ $M$ 小）。直接求逆 $(Σ^{- 1} + G^{T} R^{- 1} G)^{- 1}$ 需要 $O (N^{3})$ 计算量。用 SMW 恒等式可以将其转化为只需求逆 $M \times M$ 矩阵：

$(Σ^{- 1} + G^{T} R^{- 1} G)^{- 1} = Σ - Σ G^{T} (R + G Σ G^{T})^{- 1} G Σ$

当 $M ≪ N$ 时，这大幅降低计算量。卡尔曼滤波器的两种等价形式（协方差形式 vs 信息形式）之间的变换就依赖 SMW 恒等式。

English

The SMW identity allows efficient computation of matrix inverses of the form $(A^{- 1} + B D^{- 1} C)^{- 1}$ by reducing it to an inversion of a matrix the size of $D$ rather than $A$ . When the measurement dimension $M ≪$ state dimension $N$ , this reduces $O (N^{3})$ inversion to $O (M^{3})$ —critical for the Kalman filter (Chapter 3).
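The identity can be verified numerically in its measurement-update form, $(Σ^{- 1} + G^{T} R^{- 1} G)^{- 1} = Σ - Σ G^{T} (R + G Σ G^{T})^{- 1} G Σ$ . The sketch assumes random made-up matrices with $N = 6$ and $M = 2$ ; note that the right-hand side only solves an $M \times M$ system.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 2                                # state and measurement dimensions (assumed)

L = rng.standard_normal((N, N))
Sigma = L @ L.T + N * np.eye(N)            # random symmetric positive-definite matrix
G = rng.standard_normal((M, N))
R = 0.5 * np.eye(M)

# Left-hand side: invert an N x N matrix directly.
lhs = np.linalg.inv(np.linalg.inv(Sigma) + G.T @ np.linalg.inv(R) @ G)

# Right-hand side: only an M x M system is solved.
S = R + G @ Sigma @ G.T
rhs = Sigma - Sigma @ G.T @ np.linalg.solve(S, G @ Sigma)
```

Both sides agree to numerical precision, but the right-hand side avoids inverting any $N \times N$ matrix.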

2.2.16–2.2.17 Stein 引理与 Isserlis 定理 / Stein’s Lemma and Isserlis’ Theorem

中文

Stein 引理：设 $x \sim N (μ, Σ)$ ， $f (x)$ 是可微标量函数，则： $E\left[(\mathbf{x} - \boldsymbol{\mu})\, f(\mathbf{x})\right] = \boldsymbol{\Sigma}\, E\left[\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}^T}\right] \tag{2.125}$

直觉：高斯分布的一个深刻性质——“与均值的偏差”和”函数值”之间的期望相关性，可以通过函数的导数来计算，而不需要对整个分布积分。在变分推断（第 6 章）中，Stein 引理让我们可以用样本（而非积分）来估计期望梯度。

Isserlis' theorem (Wick's theorem): for a zero-mean Gaussian, higher-order moments decompose into sums of products of second-order moments: $E[x_i x_j x_k x_\ell] = E[x_i x_j]E[x_k x_\ell] + E[x_i x_k]E[x_j x_\ell] + E[x_i x_\ell]E[x_j x_k] \tag{2.129}$

这意味着高斯分布被均值和协方差完全确定——所有高阶矩都可以由二阶矩推算，无需单独指定。

English

Stein’s lemma: For $x \sim N (μ, Σ)$ : $E [(x - μ) f (x)] = Σ E [\partial f / \partial x^{T}]$ . Used in variational inference (Chapter 6) to compute gradient expectations from samples.

Isserlis’ theorem: for a zero-mean Gaussian, even-order moments factor into sums of products of second-order moments, and odd-order moments vanish. Since a Gaussian is fully characterised by its mean and covariance, all higher moments are determined by them.
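As a quick worked instance, take the scalar zero-mean case $x \sim N (0, σ^{2})$ (an illustrative choice, not a case worked in the text). Setting all four indices equal in the fourth-moment formula, each of the three pairings contributes $(E [x^{2}])^{2} = σ^{4}$ , so $E [x^{4}] = 3 (E [x^{2}])^{2} = 3 σ^{4}$ . Stein's lemma gives the same value: with $f (x) = x^{3}$ and $μ = 0$ , $E [x \cdot x^{3}] = σ^{2} E [3 x^{2}] = 3 σ^{4}$ .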

2.3 高斯过程 / Gaussian Processes

中文

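So far we have dealt with finite-dimensional random variables $x \in R^{N}$ , i.e. vectors of fixed dimension. A robot's trajectory, however, is a continuous function of time $x (t)$ . How do we build a probabilistic model over continuous functions?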

高斯过程（Gaussian Process, GP） 是对连续函数的概率分布。记： $\mathbf{x}(t) \sim \mathcal{GP}(\boldsymbol{\mu}(t),\, \boldsymbol{\Sigma}(t, t')) \tag{2.139}$

其中：

$μ (t)$ ：均值函数，描述轨迹的”平均走向”
$Σ (t, t^{'})$ ：协方差函数（核函数），描述时刻 $t$ 和 $t^{'}$ 处状态的相关性

直觉：想象你在画一条曲线，但手抖了（有随机性）。高斯过程描述了所有可能曲线的概率分布：

均值函数告诉你”平均”的曲线形状

协方差函数告诉你曲线的”抖动范围”和”平滑程度”——时刻 $t$ 和 $t^{'}$ 越接近，它们的值越相关，曲线越平滑

Relation to finite-dimensional Gaussians: at any finite set of times $\{t_{1}, t_{2}, \dots, t_{K}\}$ , a GP yields a joint Gaussian distribution. For a single time $τ$ : $x (τ) \sim N (μ (τ), Σ (τ, τ))$

白噪声过程：一个特殊的 GP 是零均值白噪声： $\mathbf{w}(t) \sim \mathcal{GP}(\mathbf{0},\, Q\,\delta(t - t')) \tag{2.141}$

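$Q$ is the power spectral density matrix and $δ (\cdot)$ is the Dirac delta function. "White noise" means that the noise at different times is completely uncorrelated (the covariance is zero when $t \neq t^{'}$ ).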

GP 与连续时间状态估计：第 3、4、11 章将展示，连续时间轨迹估计可以自然地视为高斯过程回归（GP regression）——在已知若干时刻观测值的条件下，推断轨迹在其他时刻的值及其不确定性。

English

A Gaussian process $x (t) \sim \mathcal{GP} (μ (t), Σ (t, t^{'}))$ is a probability distribution over functions: evaluated at any finite collection of time instants, it gives a joint Gaussian.

$μ (t)$ : mean function (the “average” trajectory)
$Σ (t, t^{'})$ : covariance kernel (controls smoothness; nearby times are more correlated)

Zero-mean white noise $w (t) \sim \mathcal{GP} (0, Q δ (t - t^{'}))$ : different time instants are completely uncorrelated; $Q$ is the power spectral density.

GP regression—inferring a continuous trajectory from discrete noisy observations—is the framework for continuous-time state estimation in Chapters 3, 4, and 11.
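The "finite collection of times gives a joint Gaussian" property is what makes a GP computable. The sketch below draws one sample trajectory at nine time instants, assuming a zero mean function and a squared-exponential kernel; the kernel, its parameters and the helper name `se_kernel` are illustrative assumptions, not choices made in this chapter.

```python
import numpy as np


def se_kernel(t, t_prime, variance=1.0, length=0.4):
    """Squared-exponential covariance function (an assumed, illustrative kernel)."""
    dt = t[:, None] - t_prime[None, :]
    return variance * np.exp(-0.5 * (dt / length) ** 2)


times = np.linspace(0.0, 2.0, 9)               # K = 9 query times
mean = np.zeros_like(times)                    # zero mean function
K_tt = se_kernel(times, times)                 # K x K joint covariance

rng = np.random.default_rng(3)
jitter = 1e-9 * np.eye(times.size)             # numerical safeguard for Cholesky
V = np.linalg.cholesky(K_tt + jitter)
trajectory = mean + V @ rng.standard_normal(times.size)   # one sampled function
```

Nearby times share large covariance entries in `K_tt`, so neighbouring values of `trajectory` are strongly correlated and the sampled curve is smooth.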

2.4 本章小结 / Chapter Summary

中文

本章建立了全书的概率论基础。核心要点：

概率密度函数（PDF） 用连续函数表达对连续状态的不确定性；所有可能状态下的密度积分为 1。
贝叶斯定理 是状态估计的核心引擎： $后验 \propto 似然 \times 先验$ 它将先验知识与新测量融合，得到更新的信念。
高斯分布 是本书的主要工具：
- 在线性变换、条件化、边缘化、乘积等操作下保持封闭
- 完全由均值 $μ$ 和协方差 $Σ$ 确定
- 对于高斯：不相关 ⟺ 统计独立
关键公式总结：

Operation	Formula
Linear transformation	$y = Gx \Rightarrow y \sim N (G μ, G Σ G^{T})$
Conditioning (core of Gaussian inference)	$p (x ∣ y) = N (μ_{x} + Σ_{x y} Σ_{yy}^{- 1} (y - μ_{y}), Σ_{xx} - Σ_{x y} Σ_{yy}^{- 1} Σ_{y x})$
Normalized product (sensor fusion)	$Σ^{- 1} = \sum_{k} Σ_{k}^{- 1}$ , $Σ^{- 1} μ = \sum_{k} Σ_{k}^{- 1} μ_{k}$
Nonlinear propagation (linearized)	$y \approx N (g (μ_{x}), R + G Σ_{xx} G^{T})$

高斯过程 将离散高斯推广到连续时间函数，是连续时间轨迹估计的数学基础。

English

Key takeaways:

PDFs represent continuous-state uncertainty; the total probability axiom ensures they integrate to 1.
Bayes’ rule $p (x ∣ y) \propto p (y ∣ x) p (x)$ is the core engine: it fuses prior beliefs with new measurements to form a posterior.
Gaussians are the workhorse: closed under linear transforms, conditioning, marginalisation, and products; fully characterised by $(μ, Σ)$ ; uncorrelated ⟺ independent.
Four fundamental formulas (linear propagation, conditional Gaussian, normalised product, linearised nonlinear propagation) underlie every estimator in this book.
Gaussian processes extend Gaussians to continuous-time functions, enabling continuous-time trajectory estimation.

下一章将用这里建立的工具，推导线性高斯系统的完整估计框架，从批量最大后验估计出发，导出卡尔曼滤波器和平滑器。/ The next chapter uses these tools to derive the complete estimation framework for linear-Gaussian systems—from batch MAP estimation to the Kalman filter and smoother.
