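Week 10: Reinforcement Learning for Large Language Models (RLHF / PPO / DPO / GRPO)

📝 Tutorial Guide
This tutorial covers the reinforcement learning techniques used in training large language models (LLMs), from basic concepts to hands-on engineering. Topics include KL divergence, the three-stage RLHF pipeline, the PPO algorithm, DPO (Direct Preference Optimization), GRPO (Group Relative Policy Optimization), and matching code implementations. After working through it, you should be able to build an RL-based LLM alignment training pipeline on your own.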

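I. Where Reinforcement Learning Fits in LLMs

1.1 The Four Stages of an LLM Pipeline

Taking a large language model from scratch to a usable system typically involves the four stages below. The first three update model weights; RAG is applied at inference time and does not.

Stage	Goal	Technique
Pre-training	Learn general-purpose language representations	Next Token Prediction
Supervised fine-tuning (SFT)	Learn to follow instructions for specific tasks	Instruction Tuning
Reinforcement learning (RLHF/DPO/GRPO)	Align with human preferences and improve output quality	PPO / DPO / GRPO
RAG	Connect to an external knowledge base at inference time	Retrieval-Augmented Generation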

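1.2 Why Is Reinforcement Learning Still Needed After SFT?

What? RL-based alignment is a training paradigm that builds on SFT and uses a reward signal to further optimize model outputs so that they better match human preferences.

Why? SFT has the following limitations:

Limited diversity: SFT only learns the "reference answer", yet one question can have many good answers
Conflicting objectives: safety, helpfulness, and truthfulness are hard to optimize together with a single loss function
Distribution shift: SFT data is static and cannot cover every output the model may produce at inference time

How? Introduce a Reward Model or direct preference data, then adjust the model parameters with an RL algorithm (PPO/DPO/GRPO) so that generation favors high-reward outputs.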

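1.3 Basic Reinforcement Learning Concepts

In the RL framing of LLM training, the core concepts map as follows:

RL concept	Counterpart in an LLM
Policy $\pi_\theta$	The language model currently being trained
State $s_t$	The prompt plus the tokens generated so far, $(x, y_{<t})$
Action $a_t$	The token $y_t$ generated at the current time step
Reward $r$	The Reward Model's score for the whole response, or a rule-based score
Reference model $\pi_{\text{ref}}$	The SFT model with frozen weights, used for the KL constraint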

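II. KL Divergence

2.1 Definition of KL Divergence

What? KL divergence (Kullback-Leibler divergence) measures how much one probability distribution differs from another. In RL-based alignment it is the main tool for keeping the policy model from drifting.

Why? During RLHF training, an unconstrained policy may chase high reward by generating unnatural text (reward hacking). A KL penalty keeps the trained model from moving too far from the original SFT model.

The standard KL divergence formula:

$$ D_{KL}[P \| Q] = \sum_x P(x) \log \frac{P(x)}{Q(x)} $$

For an LLM this is evaluated per token position. At each state $s_t$ along a sampled response, the KL is a sum over the vocabulary $\mathcal{V}$, and the per-position values are added up over the response:

$$ D_{KL}[\pi_\theta \| \pi_{\text{ref}}] = \sum_t \sum_{v \in \mathcal{V}} \pi_\theta(v | s_t) \log \frac{\pi_\theta(v | s_t)}{\pi_{\text{ref}}(v | s_t)} $$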
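As a quick numeric check of the definition, the sketch below computes the KL divergence between two made-up next-token distributions over a three-token vocabulary. The function name `kl_divergence` and the probabilities are illustrative assumptions, not values from the tutorial's experiments.

```python
import math


def kl_divergence(p, q):
    """D_KL[P || Q] for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions at a single state s_t.
policy = [0.7, 0.2, 0.1]      # pi_theta(. | s_t)
reference = [0.5, 0.3, 0.2]   # pi_ref(. | s_t)

kl_forward = kl_divergence(policy, reference)   # D_KL[pi_theta || pi_ref]
kl_reverse = kl_divergence(reference, policy)   # D_KL[pi_ref || pi_theta]
print(kl_forward, kl_reverse)
```

The two directions give different values (about 0.085 and 0.092 nats here), a reminder that KL divergence is not symmetric, so the order of the arguments in the penalty matters.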

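2.2 Unbiased, Low-Variance KL Estimation in GRPO

GRPO uses an alternative form of KL estimate (described by John Schulman), known as the k3 estimator:

$$ D_{KL}[\pi_\theta \| \pi_{\text{ref}}] = \frac{\pi_{\text{ref}}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - 1 $$

Letting $\gamma = \frac{\pi_{\text{ref}}}{\pi_\theta}$ (evaluated at the sampled token; this $\gamma$ is a probability ratio, not the discount factor used in GAE), this simplifies to:

$$ D_{KL} = (\gamma - 1) - \log \gamma $$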
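The reasoning behind this form can be written out in two lines. This is a short derivation sketch under the assumption that the token $o$ is sampled from $\pi_\theta$. First, the extra term $\gamma - 1$ has zero mean:

$$ \mathbb{E}_{o \sim \pi_\theta}[\gamma - 1] = \sum_o \pi_\theta(o) \frac{\pi_{\text{ref}}(o)}{\pi_\theta(o)} - 1 = 0 $$

so the expectation of the whole expression equals the true KL divergence:

$$ \mathbb{E}_{o \sim \pi_\theta}[(\gamma - 1) - \log \gamma] = \mathbb{E}_{o \sim \pi_\theta}\left[\log \frac{\pi_\theta(o)}{\pi_{\text{ref}}(o)}\right] = D_{KL}[\pi_\theta \| \pi_{\text{ref}}] $$

In addition, $\log \gamma \le \gamma - 1$ for every $\gamma > 0$, so each individual sample of the estimate is non-negative, which the naive $-\log \gamma$ estimate does not guarantee.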

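Comparison of the three KL estimators. Samples $x$ are drawn from $q$, the quantity being estimated is $D_{KL}[q \| p]$, and $r = \frac{p(x)}{q(x)}$. In GRPO, $q = \pi_\theta$ and $p = \pi_{\text{ref}}$, so $r$ is the $\gamma$ defined above.

Estimator	Formula	Bias	Variance
k1 (naive)	$-\log r$	Unbiased	High
k2 (low variance)	$\frac{1}{2}(\log r)^2$	Biased	Low
k3 (unbiased, low variance)	$(r-1) - \log r$	Unbiased	Low

💡 Key Takeaway
GRPO adopts the k3 estimator because it is both unbiased and low-variance. This matters for token-level KL computation, where every per-token KL value needs to be numerically stable and reliable.

Code implementation:

def grpo_kl(pi_logprob, pi_ref_logprob):
    """Per-token k3 KL estimate used by GRPO; both inputs are log-probability tensors."""
    log_ratio = pi_ref_logprob - pi_logprob  # log(pi_ref / pi_theta)
    # (gamma - 1) - log(gamma), computed from the log-ratio for numerical stability
    return log_ratio.exp() - log_ratio - 1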
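To make the bias and variance claims concrete, here is a small Monte Carlo sketch using only the standard library. It assumes a toy setup that is not part of the tutorial's experiments: $q$ is a standard normal, $p$ is a unit-variance normal with mean 0.1, samples come from $q$, and $r = p(x)/q(x)$, so the true value of $D_{KL}[q \| p]$ is $0.1^2 / 2 = 0.005$.

```python
import math
import random
import statistics

random.seed(0)
MU = 0.1                 # mean of p; q is N(0, 1) (assumed toy setup)
TRUE_KL = MU ** 2 / 2    # KL[q || p] for two unit-variance Gaussians

k1, k2, k3 = [], [], []
for _ in range(100_000):
    x = random.gauss(0.0, 1.0)            # sample x ~ q
    log_r = MU * x - MU ** 2 / 2          # log p(x) - log q(x)
    k1.append(-log_r)                     # naive estimator
    k2.append(0.5 * log_r ** 2)           # low variance, biased
    k3.append(math.expm1(log_r) - log_r)  # (r - 1) - log r

for name, samples in (("k1", k1), ("k2", k2), ("k3", k3)):
    # Print the sample mean and standard deviation of each estimator.
    print(name, statistics.fmean(samples), statistics.pstdev(samples))
```

In this setup the k1 and k3 means both land close to 0.005, while the spread of k3 is far smaller than that of k1, and no k3 sample is negative.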

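III. The Complete RLHF Pipeline

3.1 Overview of the Three Stages

What? RLHF (Reinforcement Learning from Human Feedback) is a training paradigm that optimizes a language model using human feedback signals. It was systematized for LLMs by OpenAI in the InstructGPT paper.

Why? Human preferences are multi-dimensional and subjective, and cannot be captured by a simple formula. RLHF trains a reward model that "compresses" human preferences into a scalar score, then optimizes the policy with an RL algorithm.

How? The full pipeline has three stages:

Stage 1: SFT (supervised fine-tuning)
    ↓ yields a base dialogue model
Stage 2: Reward Model training
    ↓ yields a reward model that can score responses
Stage 3: PPO policy optimization
    ↓ yields the final aligned model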

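3.2 Stage 1: Supervised Fine-Tuning (SFT)

In the hands-on RLHF project (using NL2SQL as the example), the SFT stage fine-tunes facebook/opt-1.3b with LoRA so that the model generates SQL statements from natural-language questions.

import torch

# Hand-written LoRA module (rank 128)
class Lora(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear
        # A starts small and random, B starts at zero, so the LoRA branch is initially a no-op
        self.lora_A = torch.nn.Parameter(torch.randn(linear.in_features, 128) * 0.1)
        self.lora_B = torch.nn.Parameter(torch.zeros(128, linear.out_features))
        # Freeze the original weight; only lora_A and lora_B are trained
        self.linear.weight.requires_grad = False

    def forward(self, x):
        y_linear = self.linear(x)
        y_lora = x.matmul(self.lora_A).matmul(self.lora_B)
        # Scale the low-rank update by 1 / rank
        return y_linear + y_lora / 128

📝 Practical Scenario
In the NL2SQL scenario, the SFT data format is prompt + chosen, where prompt is "context: CREATE TABLE ... question: ... answer: " and chosen is the corresponding SQL statement.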

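IV. Reward Model Training

4.1 Building Preference Data

What? A Reward Model maps a (prompt, response) pair to a scalar score and stands in for online human scoring.

Why? PPO training needs a large number of reward signals, and having humans score in real time is not feasible. The Reward Model compresses "human preference" into a differentiable scoring function.

Preference data is stored as JSON Lines. Each record contains question, chosen (the better answer), and rejected (the worse answer); one record is shown pretty-printed here:

{
  "question": "What is machine learning?",
  "chosen": "Machine learning is an approach to artificial intelligence that uses algorithms and statistical models to let computer systems perform specific tasks without explicit programming instructions.",
  "rejected": "Machine learning is a field of computer science."
}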

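4.2 The Bradley-Terry Model and Loss Function

Reward Model training is based on the Bradley-Terry ranking model, whose core assumption is:

$$ P(\text{chosen} \succ \text{rejected}) = \sigma(r(\text{chosen}) - r(\text{rejected})) $$

where $\sigma$ is the sigmoid function. The corresponding loss, with $y_w$ the chosen response and $y_l$ the rejected one, is:

$$ \mathcal{L}_{\text{RM}} = -\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l)) $$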
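The behaviour of this loss is easy to check on plain numbers. The sketch below is a minimal scalar version using only the standard library; the function names `sigmoid` and `pairwise_rm_loss` and the example scores are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pairwise_rm_loss(r_chosen, r_rejected):
    """Bradley-Terry loss -log sigmoid(r_chosen - r_rejected) for one pair."""
    margin = r_chosen - r_rejected
    # -log sigmoid(m) = log(1 + exp(-m)); fall back to -m for very negative margins
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin


print(pairwise_rm_loss(2.0, 0.0))  # chosen scored higher: small loss
print(pairwise_rm_loss(0.0, 0.0))  # no preference: log 2
print(pairwise_rm_loss(0.0, 2.0))  # rejected scored higher: large loss
```

Only the difference between the two scores matters, so adding a constant to every reward leaves the loss unchanged. This is why reward model outputs have no absolute scale.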

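In the hand-written implementation, the corresponding code is as follows:

import torch
from transformers import AutoModel

class CriticModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rwtransformer = AutoModel.from_pretrained('facebook/opt-350m', dropout=0.0)
        # The value head reads 512-dimensional hidden states from the backbone
        self.v_head = torch.nn.Linear(512, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        # The batch stacks the chosen sequences first, then the matching rejected ones
        value = self.rwtransformer(input_ids=input_ids,
                                    attention_mask=attention_mask).last_hidden_state
        value = self.v_head(value).squeeze(-1)  # [2 * n_pairs, seq_len]

        n = input_ids.shape[0] // 2  # the original hard-codes 4 pairs per batch
        losses = []
        # Pairwise loss for each (chosen, rejected) pair
        for (input_ids_chosen, input_ids_rejected, mask_chosen, mask_rejected,
             value_chosen, value_rejected) in zip(
                input_ids[:n], input_ids[n:], attention_mask[:n], attention_mask[n:],
                value[:n], value[n:]):
            # First position where the two sequences differ (end of the shared prompt)
            start = (input_ids_chosen == input_ids_rejected).tolist().index(False)
            # Assumption: `end` is not defined in the original snippet. Here it is the
            # longer unpadded length of the pair, which presumes right padding.
            end = int(torch.max(mask_chosen.sum(), mask_rejected.sum()))
            diff = value_chosen[start:end] - value_rejected[start:end]
            losses.append(-torch.nn.functional.logsigmoid(diff).mean())
        return torch.stack(losses).mean()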

4.3 Training a Reward Model with the TRL Library

The TRL library provides RewardTrainer, which greatly simplifies Reward Model training:

from trl import RewardTrainer, RewardConfig

# `tokenizer`, `model`, and `dataset` are assumed to be created earlier in the script.

# Preprocessing: turn chosen/rejected into input_ids_chosen / input_ids_rejected
def process_func(example):
    chosen = example["question"] + example["chosen"]
    rejected = example["question"] + example["rejected"]
    tokenized_chosen = tokenizer(chosen)
    tokenized_rejected = tokenizer(rejected)
    return {
        "input_ids_chosen": tokenized_chosen["input_ids"],
        "attention_mask_chosen": tokenized_chosen["attention_mask"],
        "input_ids_rejected": tokenized_rejected["input_ids"],
        "attention_mask_rejected": tokenized_rejected["attention_mask"],
    }

# Configuration and training
# Note: argument names depend on the TRL release; newer releases take
# `processing_class=` in place of `tokenizer=`.
config = RewardConfig(output_dir="./reward_model")
trainer = RewardTrainer(model=model, tokenizer=tokenizer, args=config, train_dataset=dataset)
trainer.train()

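V. The PPO Algorithm

5.1 Core Idea of PPO

What? PPO (Proximal Policy Optimization) is a policy-gradient algorithm proposed by OpenAI that keeps training stable by limiting the size of each policy update.

Why? Plain policy-gradient methods can collapse the policy when an update step is too large. PPO limits the size of each update by clipping the probability ratio (the Clipped Surrogate Objective).

How? The key components of PPO are:

5.1.1 Notation

Symbol	Meaning
$\pi_\theta$	Current policy (the trainable LLM)
$\pi_{\text{old}}$	Old policy, frozen at sampling time
$\pi_{\text{ref}}$	Reference model (SFT weights, never updated)
$R_{\text{RM}}(x,y)$	Reward Model score for the whole sequence
$\beta$	KL coefficient (roughly 0.02 - 0.2)
$\varepsilon$	PPO clip range (typically 0.1 or 0.2)

5.1.2 Reward Design (Sequence-Level Reward + Token-Level KL)

$$ r_t = \begin{cases} -\beta[\log\pi_\theta(y_t|s_t) - \log\pi_{\text{ref}}(y_t|s_t)] & 1 \le t < T \\ R_{\text{RM}}(x,y) - \beta[\log\pi_\theta(y_T|s_T) - \log\pi_{\text{ref}}(y_T|s_T)] & t = T \end{cases} $$

Every step carries a KL penalty, which keeps the policy from drifting away from the reference model
The final step additionally receives the Reward Model score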
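A tiny worked example of this reward shaping, written as a sketch in plain Python. The function name `token_rewards` and the log-probabilities are hypothetical; only the formula itself comes from the text above.

```python
def token_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref), plus the RM score on the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # the sequence-level score lands on the final token only
    return rewards


# Hypothetical log-probabilities for a 3-token response.
r = token_rewards(
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.2, -0.5, -1.0],
    rm_score=1.5,
)
print(r)
```

The first token is slightly penalized because the policy assigns it more probability than the reference does. The last token gets a small bonus for the opposite reason, on top of the Reward Model score.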

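5.1.3 Probability Ratio

$$ \rho_t(\theta) = \frac{\pi_\theta(y_t|s_t)}{\pi_{\text{old}}(y_t|s_t)} $$

5.1.4 Advantage Estimation (GAE)

Here $\gamma$ is the discount factor, $\lambda$ is the GAE smoothing parameter, and $V_\phi$ is the value function (critic):

$$ A_t^{(\lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k [r_{t+k} + \gamma V_\phi(s_{t+k+1}) - V_\phi(s_{t+k})] $$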
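For a finite response the infinite sum collapses into a backward recursion, $A_t = \delta_t + \gamma\lambda A_{t+1}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The following is a minimal pure-Python sketch of that recursion; the function name `gae`, the default values, and the toy numbers are assumptions for illustration, and the value after the final token is taken to be 0.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation by backward recursion.

    values[t] is V(s_t); the value after the last step is treated as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages


# Toy episode: reward only on the last token, constant value estimates.
adv = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
print(adv)
```

With $\lambda = 1$ and $\gamma = 1$ the same function returns the plain "return minus value" advantage. Smaller $\lambda$ trades variance for bias by relying more on the critic.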

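5.1.5 Clipped Surrogate Objective (Policy Loss)

$$ \mathcal{L}_{\text{clip}} = -\mathbb{E}_t \left[ \min\left( \rho_t A_t,\ \text{clip}(\rho_t, 1-\varepsilon, 1+\varepsilon) A_t \right) \right] $$

5.1.6 Value Function Loss

Here $\hat{G}_t$ is the return target, commonly taken as $A_t + V_{\phi_{\text{old}}}(s_t)$:

$$ \mathcal{L}_V = \frac{1}{2} \mathbb{E}_t \left[ (V_\phi(s_t) - \hat{G}_t)^2 \right] $$

5.1.7 Total Loss

$$ \mathcal{L} = \mathcal{L}_{\text{clip}} + c_V \mathcal{L}_V + \mathcal{L}_{\text{ent}} $$

where $c_V$ weights the value loss and $\mathcal{L}_{\text{ent}} = -c_{\text{ent}} \mathbb{E}_t[H(\pi_\theta(\cdot|s_t))]$ is an optional entropy bonus that encourages exploration.

⚠️ Core Understanding
PPO's clip mechanism keeps policy updates from being too aggressive. The gradient is cut off only when moving further would improve the objective: when $A_t > 0$ and $\rho_t > 1+\varepsilon$, or when $A_t < 0$ and $\rho_t < 1-\varepsilon$. In the other out-of-range cases the unclipped term is the minimum, so the gradient still flows and pulls the policy back.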
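The clipping behaviour can be traced case by case with a scalar sketch. The helper below is hypothetical (its name and return convention are not from the tutorial's code). It returns the per-token objective together with a flag telling whether the clipped branch is the one selected, which is the situation in which the policy receives no gradient from that token.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Return (min(rho*A, clip(rho)*A), gradient_blocked) for a single token."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    unclipped = ratio * advantage
    clipped = clipped_ratio * advantage
    objective = min(unclipped, clipped)
    # The gradient vanishes only if the ratio is out of range AND the clipped term is selected.
    gradient_blocked = clipped_ratio != ratio and clipped <= unclipped
    return objective, gradient_blocked


for rho, adv in [(1.5, 1.0), (1.5, -1.0), (0.5, 1.0), (0.5, -1.0), (1.1, 2.0)]:
    print(rho, adv, clipped_surrogate(rho, adv))
```

Only the pairs (ratio 1.5, positive advantage) and (ratio 0.5, negative advantage) are blocked. In the other two out-of-range cases the unclipped term is smaller, so the update can still correct a policy that moved in the wrong direction.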

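5.2 Hand-Written PPO Implementation (NL2SQL Scenario)

Below is the core code of the full PPO training script in the RLHF directory.

Combining reward and KL:

def get_reward_kl(end, prob_old, prob_ref, reward):
    # prob_old / prob_ref: per-token log-probs from the sampling policy and the reference model
    # Per-token KL penalty with beta = 0.1
    reward_kl = -0.1 * (prob_old - prob_ref)
    # Add the (clamped) Reward Model score on each sequence's last token
    for i, e in enumerate(end):
        reward_kl[i, e] += reward[i].clamp(-5, 5)
    return reward_kl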

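Advantage estimation (an efficient GAE implementation):

import torch

def get_delta(value_old, reward_kl):
    # Backward GAE recursion with gamma = 1 and lambda = 0.95.
    # Index 255 is hard-coded in the original as the first position to process;
    # it presumably corresponds to a fixed prompt length of 256 tokens.
    delta = []
    for i in reversed(range(255, value_old.shape[1])):
        # The value after the final position is treated as 0
        value_next = 0.0
        if i != value_old.shape[1] - 1:
            value_next = value_old[:, i + 1]
        # TD residual: r_t + V(s_{t+1}) - V(s_t)
        d = reward_kl[:, i] + value_next - value_old[:, i]
        if len(delta):
            d += 0.95 * delta[-1]
        delta.append(d)
    # Restore chronological order: [batch, n_positions]
    delta = torch.stack(delta[::-1], dim=1)
    return delta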

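Actor loss (policy loss):

import torch

def get_loss_actor(prob_new, prob_old, delta, generate_mask):
    # prob_new / prob_old are log-probs, so exp of the difference is the probability ratio
    ratio = ((prob_new - prob_old) * generate_mask).exp()
    loss1 = delta * ratio
    loss2 = delta * ratio.clamp(0.8, 1.2)  # clip range [0.8, 1.2], i.e. epsilon = 0.2
    loss = torch.min(loss1, loss2) * generate_mask
    # Mean over generated tokens; the constant divisor 8 is kept from the original
    # (the source does not explain it)
    return -(loss.sum() / generate_mask.sum() / 8)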

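Critic loss (value function loss):

import torch

def get_loss_critic(value_new, value_old, delta, generate_mask):
    # Regression target is the return estimate: delta (advantage) + value_old
    loss1 = (value_new - delta - value_old)**2
    # Clip the new value to within 0.2 of the old value, mirroring the policy clip
    value_new_clipped = value_new.clamp(value_old - 0.2, value_old + 0.2)
    loss2 = (value_new_clipped - delta - value_old)**2
    # Take the more pessimistic (larger) of the two losses
    loss = torch.max(loss1, loss2) * generate_mask
    return loss.sum() / 2 / generate_mask.sum() / 8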

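5.3 PPO Sentiment Hands-On (IMDB + DistilBERT)

What? Train Qwen2.5-0.5B-Instruct with PPO so that its movie-review continuations lean toward positive sentiment.

How? The full setup is:

Dataset: the IMDB movie review dataset, with the first 10 tokens of each review used as the prompt
Policy model: AutoModelForCausalLMWithValueHead (Qwen with a value head)
Reference model: a model of the same architecture with frozen parameters
Reward model: a pre-trained DistilBERT sentiment classifier

import torch
from transformers import pipeline
from trl import PPOConfig

# This snippet follows the older TRL PPO interface (ppo_epochs, ppo_trainer.step, ...);
# argument names differ between TRL releases. `ppo_trainer`, `tokenizer`, `reward_model`,
# `reward_tokenizer`, `device`, and `generation_kwargs` are assumed to be defined elsewhere.

# PPO configuration
config = PPOConfig(
    model_name="Qwen2.5-0.5B-Instruct",
    learning_rate=6e-6,
    batch_size=128,
    mini_batch_size=16,
    target_kl=0.03,
    kl_penalty="kl",
    ppo_epochs=1,
)

# Reward computation: a sentiment-analysis pipeline
sentiment_pipe = pipeline("sentiment-analysis", model=reward_model,
                          tokenizer=reward_tokenizer, device=device)

# Training loop
for step, batch in enumerate(ppo_trainer.dataloader):
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
    # Decode the responses so they can be scored as text (missing in the original)
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    # Concatenate prompt + response and send it to the reward model
    texts_to_score = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts_to_score)

    # Use P(POSITIVE) as the reward for every sample. The original filtered on the
    # POSITIVE label, which dropped samples and left fewer rewards than responses.
    # Assumes a binary classifier, so P(POSITIVE) = 1 - P(NEGATIVE).
    rewards = [torch.tensor(out['score'] if out['label'] == 'POSITIVE' else 1.0 - out['score'])
               for out in pipe_outputs]

    # PPO optimization step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

💡 Design of the Value Head
In TRL, AutoModelForCausalLMWithValueHead builds the value head directly into the policy model. The policy network and the value network then share the same underlying feature extractor, which improves training efficiency and simplifies the architecture.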

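5.4 PPO Training with the TRL Library

The TRL library wraps the PPO training loop:

from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# `model_path`, `peft_config`, `bnb_config`, `tokenizer`, `queries_dataset`, `collator`,
# and `generation_kwargs` are assumed to be defined elsewhere. Argument names follow
# the older TRL PPO interface and differ between TRL releases.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_path,
    reward_adapter="./reward_model",  # load the Reward Model adapter
    peft_config=peft_config,
    quantization_config=bnb_config
)

ppo_config = PPOConfig(
    kl_penalty="full",
    ppo_epochs=3,
    batch_size=2,
    mini_batch_size=1
)

# With a PEFT model, ref_model=None lets the trainer use the base weights
# (adapters disabled) as the reference model
ppo_trainer = PPOTrainer(
    config=ppo_config, model=model,
    ref_model=None, tokenizer=tokenizer,
    dataset=queries_dataset, data_collator=collator
)

for batch in ppo_trainer.dataloader:
    response_tensors = ppo_trainer.generate(batch, **generation_kwargs)
    # `combined_ids` is not defined in the original snippet; it is expected to hold
    # the concatenated query + response token ids for each sample
    scores = [model.compute_reward_score(input_ids=ids)[0, -1, 0]
              for ids in combined_ids]
    stats = ppo_trainer.step(batch, response_tensors, scores)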

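VI. DPO: Direct Preference Optimization

6.1 How DPO Works

What? DPO (Direct Preference Optimization) is a preference-alignment method that does not require explicitly training a Reward Model. It optimizes the policy model directly on preference data, using a mathematical derivation that turns the RL problem into a simple classification loss.

Why? The PPO pipeline is complex (it needs 4 models: Actor, Critic, Reward, Reference), can be unstable to train, and is sensitive to hyperparameters. DPO reduces the whole RL procedure to a binary classification problem that needs only 2 models (the current model and the reference model).

How? The core derivation of DPO is as follows.

Start from the RLHF objective:

$$ \max_\pi \mathbb{E}[r(x,y)] - \beta D_{KL}[\pi \| \pi_{\text{ref}}] $$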
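The intermediate step, stated here as a brief sketch of the standard derivation, is that this KL-regularized objective has a closed-form maximizer:

$$ \pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right), \qquad Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right) $$

Taking the logarithm of both sides and solving for $r(x,y)$ expresses the reward in terms of the policy. The partition function $Z(x)$ is intractable to compute, but it depends only on the prompt $x$, so it cancels whenever the rewards of two responses to the same prompt are subtracted.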

From this, an implicit reward function can be derived:

$$ r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x) $$

Substituting the implicit reward into the Bradley-Terry model gives the DPO loss:

$$ \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right] $$
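For a single preference pair, the loss reduces to a few arithmetic operations on sequence-level log-probabilities. The sketch below uses only the standard library; the function name `dpo_loss` and the numbers are hypothetical.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair, given log pi(y_w|x), log pi(y_l|x) and the reference values."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log sigmoid(margin)


# At initialization the policy equals the reference, so the margin is 0.
loss_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After training, the policy has raised the chosen answer and lowered the rejected one.
loss_later = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(loss_init, loss_later)
```

The loss starts at $\log 2 \approx 0.693$ and decreases as the policy widens the gap between chosen and rejected responses relative to the reference model. Absolute log-probabilities do not matter, only the change against the reference.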

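⚠️ Important Concept
DPO is an offline method, usually described as off-policy. Its training data (preference pairs) is collected in advance, and no online sampling is needed during training. This is in sharp contrast to PPO, which is on-policy.

6.2 DPO in Native PyTorch

Below is the core code for training DPO with GPT-2 on the NL2SQL dataset:

import torch
import torch.nn.functional as F

def get_prob_diff(actor, input_ids, attention_mask, answer_mask):
    # Batch layout: chosen sequences first, then the matching rejected ones
    logits = actor(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so that position t predicts token t + 1
    input_ids = input_ids[:, 1:]
    answer_mask = answer_mask[:, 1:]
    logits = logits[:, :-1]

    # Log-probability assigned to each actual next token
    logprob = logits.log_softmax(dim=2)
    logprob = logprob.gather(2, index=input_ids.unsqueeze(2)).squeeze(2)

    # Joint log-probability of the answer part (sum of token log-probs)
    logprob = (logprob * answer_mask).sum(1)

    # log pi(chosen) - log pi(rejected); `b` was undefined in the original
    b = logprob.shape[0] // 2
    return logprob[:b] - logprob[b:]

# DPO training loop. `get_data`, `model_actor`, `model_actor_ref`, and `optimizer`
# are assumed to be defined elsewhere; `data` must provide input_ids,
# attention_mask, and answer_mask.
for i in range(8000):
    data = get_data()
    prob_diff = get_prob_diff(model_actor, **data)
    with torch.no_grad():
        prob_diff_ref = get_prob_diff(model_actor_ref, **data)

    # DPO loss: -log sigmoid(beta * (prob_diff - prob_diff_ref)), beta = 0.1
    loss = -F.logsigmoid(0.1 * (prob_diff - prob_diff_ref)).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

📝 How the Data Is Built
In this DPO preference data, chosen is the reference SQL statement and rejected is an empty string (or a low-quality answer). In real projects, you can have the model generate several answers and label them by hand to build preference pairs.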

6.3 DPO Training with the TRL Library

TRL provides DPOTrainer, which makes DPO training more concise:

from trl import DPOConfig, DPOTrainer

# `model_actor`, `model_actor_ref`, `dataset`, and `tokenizer` are assumed to be
# defined elsewhere. Argument names depend on the TRL release; newer releases
# take `processing_class=` in place of `tokenizer=`.
args = DPOConfig(
    output_dir='output_dir',
    loss_type='sigmoid',       # standard DPO loss
    beta=0.1,                  # KL coefficient
    per_device_train_batch_size=8,
    max_steps=80000,
    learning_rate=1e-5,
    optim='rmsprop',
    max_length=100,
    max_prompt_length=100,
)

# The dataset must contain three columns: prompt, chosen, rejected
trainer = DPOTrainer(
    model=model_actor,
    ref_model=model_actor_ref,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()

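VII. GRPO: Group Relative Policy Optimization

7.1 Core Idea of GRPO

What? GRPO (Group Relative Policy Optimization) is an RL method introduced by DeepSeek in the DeepSeekMath paper and later used to train DeepSeek-R1. As applied in this tutorial, it differs from PPO in two key ways:

No Reward Model is trained; rewards are determined directly by rules (the algorithm itself can also be paired with a learned reward model)
No Value Model is trained; the advantage is computed by relative comparison within a group

Why? PPO has to maintain 4 models at once (Actor + Critic + Reward + Reference), which makes training expensive and the pipeline complex. GRPO simplifies the pipeline considerably through "group sampling + rule-based rewards" and is especially suited to tasks with a clear correct answer, such as mathematical reasoning.

How? For each question $q$, GRPO samples $G$ responses (for example $G=64$), scores each response with rules, and then computes advantages by standardizing the rewards within the group.

7.2 GRPO Loss Function

$$ \mathcal{L}_{GRPO}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min\left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\text{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip}\left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\text{old}}(o_{i,t}|q, o_{i,<t})}, 1-\varepsilon, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta D_{KL}[\pi_\theta \| \pi_{\text{ref}}] \right] $$

Here $\varepsilon$ is the clip range, $\beta$ is the KL coefficient, and $D_{KL}$ is the per-token k3 estimate from Section 2.2.

7.2.1 Group-Relative Advantage

$$ \hat{A}_{i,t} = \frac{r_i - \text{mean}(r)}{\text{std}(r)} $$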

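where $r = \{r_1, r_2, \dots, r_G\}$ is the set of rewards of all responses in the group.

💡 Key Properties
Every token within one response has the same advantage value (sentence-level advantage)
When all responses in a group are correct, or all are wrong, the rewards are identical, every advantage is 0, and there is no learning signal; such a batch should be skipped
The advantage comes from real environment rewards rather than from a value-function estimate

Code implementation:

import torch

def grpo_advantage(rewards):
    # Small positive constant to avoid division by zero. The original used
    # 0.001 * torch.randn(1), which is random and can be negative.
    epsilon = 1e-4
    rewards = torch.tensor(rewards, dtype=torch.float)
    A = (rewards - rewards.mean()) / (rewards.std() + epsilon)
    return A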
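A worked example helps show both the normal case and the degenerate case listed under the key properties. The following sketch uses only the standard library rather than torch; the function name `group_advantages`, the `eps` value, and the reward numbers are illustrative assumptions. It returns `None` when the group carries no signal, so the caller can skip it.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Standardize rewards within one group; return None if all rewards are equal."""
    std = statistics.stdev(rewards)  # sample standard deviation, like torch.Tensor.std()
    if std == 0.0:
        return None                  # all correct or all wrong: nothing to learn from
    mean = statistics.fmean(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


mixed = group_advantages([2.0, 0.0, 0.0, 2.0])    # two correct, two wrong
all_right = group_advantages([2.0, 2.0, 2.0, 2.0])
print(mixed, all_right)
```

In the mixed group the correct responses get a positive advantage of about 0.866 and the wrong ones the same magnitude with a negative sign. Each value is then shared by every token of the corresponding response.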

7.2.2 Rule-Based Reward Functions

GRPO here uses rule-based rewards rather than a trained neural reward model:

import re

# Helper not shown in the source; a typical implementation (assumed) returns
# the text between <answer> and </answer>
def extract_xml_answer(text):
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

# A fully correct answer earns 2 points
def correctness_reward_func(prompts, completions, answer, **kwargs):
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

# An integer answer earns 0.5 points
def int_reward_func(completions, **kwargs):
    extracted = [extract_xml_answer(c[0]['content']) for c in completions]
    return [0.5 if r.isdigit() else 0.0 for r in extracted]

# Passing the strict format check earns 0.5 points
def strict_format_reward_func(completions, **kwargs):
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [c[0]["content"] for c in completions]
    return [0.5 if re.match(pattern, r, re.DOTALL) else 0.0 for r in responses]
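To see how these partial rewards add up, the sketch below scores three hand-written responses against the same gold answer. It is a simplified, self-contained variant that works on plain strings instead of the chat-style `completions` structure. The helper `extract_xml_answer` is a hypothetical implementation and `total_reward` is an illustrative name, so treat both as assumptions.

```python
import re


def extract_xml_answer(text):
    """Hypothetical helper: return the text between <answer> and </answer>."""
    return text.split("<answer>")[-1].split("</answer>")[0].strip()


def total_reward(response, gold):
    """Sum of the correctness (2.0), integer (0.5), and strict-format (0.5) scores."""
    answer = extract_xml_answer(response)
    correctness = 2.0 if answer == gold else 0.0
    is_int = 0.5 if answer.isdigit() else 0.0
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    strict = 0.5 if re.match(pattern, response, re.DOTALL) else 0.0
    return correctness + is_int + strict


good = "<reasoning>\n48 / 2 = 24, 48 + 24 = 72\n</reasoning>\n<answer>\n72\n</answer>\n"
sloppy = "The answer is <answer>72</answer>"
wrong = "<reasoning>\nguess\n</reasoning>\n<answer>\n70\n</answer>\n"

for name, text in (("good", good), ("sloppy", sloppy), ("wrong", wrong)):
    print(name, total_reward(text, gold="72"))
```

A well-formatted wrong answer still earns 1.0 from the format and integer checks. Graded partial rewards of this kind give the group-relative advantage something to rank even before the model starts producing correct answers.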

7.2.3 Hand-Written GRPO Loss

import torch

def grpo_loss(pi_logprob, pi_old_logprob, pi_ref_logprob, advantage, input_len, len_oi):
    epsilon = 0.2
    beta = 0.01
    bs, seq_len = pi_logprob.shape
    # `group_num` was undefined in the original; this assumes the batch holds
    # exactly one group of G responses
    group_num = bs

    # Compute the loss on the response part only. Padding after shorter
    # responses is not masked here.
    mask = torch.zeros(bs, seq_len, device=pi_logprob.device)
    mask[:, input_len:] = 1

    # Policy ratio + clip
    ratio = torch.exp(pi_logprob - pi_old_logprob)
    ratio_clip = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    advantage = advantage.unsqueeze(dim=1)  # broadcast one advantage per response

    policy_gradient = torch.minimum(ratio * advantage, ratio_clip * advantage)
    kl = grpo_kl(pi_logprob, pi_ref_logprob)  # k3 estimator from Section 2.2

    loss = (policy_gradient - beta * kl) * mask
    # Average over tokens of each response, then over the group
    loss = (-1 / group_num) * (1 / len_oi.unsqueeze(dim=1)) * loss
    return loss.sum()

7.3 GRPO Math Reasoning Hands-On (GSM8K Dataset)

Scenario: use GRPO to train the mathematical reasoning ability of Qwen2.5-0.5B-Instruct on the GSM8K dataset.

Data preparation:

# GSM8K data format
# The number after #### in the answer field is the final answer
data['train'][0] = {
    'question': 'Natalia sold clips to 48 of her friends in April...',
    'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n...#### 72'
}

# Extract the gold answer
def extract_hash_answer(text):
    return text.split("####")[1].strip()  # returns "72" for the sample above

Training configuration:

from trl import GRPOConfig, GRPOTrainer

# `model`, `tokenizer`, `dataset`, and the reward functions not shown earlier
# (xmlcount_reward_func, soft_format_reward_func) are assumed to be defined elsewhere.
training_args = GRPOConfig(
    output_dir="outputs/Qwen2.5-0.5B-reasoning-GRPO",
    learning_rate=5e-6,
    adam_beta1=0.9, adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    bf16=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_generations=8,          # sample 8 responses per question
    max_completion_length=200,
    num_train_epochs=1,
    max_grad_norm=0.1,
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,         # XML tag structure, 0 to 0.5 points
        soft_format_reward_func,      # loose format, 0.5 points
        strict_format_reward_func,    # strict format, 0.5 points
        int_reward_func,              # integer answer, 0.5 points
        correctness_reward_func,      # correct answer, 2.0 points
    ],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

7.4 Qwen3-4B GRPO Training (Unsloth Framework)

What? GRPO training with the Unsloth framework and vLLM acceleration, with the goal of turning Qwen3-4B-Base into a reasoning model.

How?

Step 1: Format pre-fine-tuning (SFT warmup)

First run SFT on a small number of samples (about 59) so that the model learns the custom reasoning format:

<start_working_out>
{reasoning process}
<end_working_out>
<SOLUTION>{final answer}</SOLUTION>

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen3-4B-Base",
    max_seq_length=2048,
    load_in_4bit=False,
    fast_inference=True,        # enable vLLM fast inference
    max_lora_rank=32,
)

# Attach LoRA adapters (rank 32) to the attention and MLP projections
model = FastLanguageModel.get_peft_model(model, r=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    lora_alpha=64,
    use_gradient_checkpointing="unsloth",
)

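Step 2: GRPO training

Training uses the DAPO-Math-17k dataset from Open R1 with multi-dimensional reward functions:

# `match_format` is a compiled regex assumed to be defined elsewhere; group(1) is
# assumed to capture the text inside <SOLUTION>...</SOLUTION>.

# Exact format match: 3 points
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        if match_format.search(response) is not None:
            score = 3.0
        else:
            score = 0
        scores.append(score)
    return scores

# Answer check: 5 points (exact match), 3.5 (match after stripping whitespace),
# 2.0 / 1.5 (numerically close)
def check_answer(prompts, completions, answer, **kwargs):
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [m.group(1) if (m := match_format.search(r)) else None
                           for r in responses]
    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0.0
        if guess is None:
            pass  # nothing extracted; the source does not show how this case is scored
        elif guess == true_answer: score += 5.0
        elif guess.strip() == true_answer.strip(): score += 3.5
        else:
            try:
                ratio = float(guess) / float(true_answer)
                if 0.9 <= ratio <= 1.1: score += 2.0
                elif 0.8 <= ratio <= 1.2: score += 1.5
                else: score -= 2.5
            except (ValueError, ZeroDivisionError):
                pass  # non-numeric answer; scoring for this case is not shown in the source
        scores.append(score)
    return scores

Before/After Comparison
Before training (base model): asked "What is the sqrt of 101?", the model returns large amounts of unrelated web-page excerpts.
After training (GRPO model): the model outputs a structured reasoning process, computes sqrt(101) ≈ 10.050 step by step, and verifies the result with Newton's method.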

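VIII. Comparison of the Three Methods

Dimension	RLHF (PPO)	DPO	GRPO
Models required	4 (Actor + Critic + Reward + Ref)	2 (Policy + Ref)	2 (Policy + Ref)
Reward Model needed?	Yes, trained separately	No, implicit reward	No, rule-based reward
Value Model needed?	Yes (Critic)	No	No
Sampling	On-policy (online sampling)	Off-policy (offline data)	On-policy (group sampling)
Advantage estimation	GAE (needs a value network)	None (direct preference comparison)	Within-group standardization
Training complexity	High	Low	Medium
Training stability	Needs careful tuning	Relatively stable	Fairly stable
Typical use	General alignment	Settings with preference data	Settings with objectively correct answers (math, code)
Representative work	InstructGPT, ChatGPT	Zephyr, Llama 3	DeepSeekMath, DeepSeek-R1
KL constraint	Added to the reward	Implicit in the loss function	Separate KL penalty term

💡 How to Choose?
General dialogue alignment: PPO (when data is plentiful) or DPO (when preference data is plentiful)
Math reasoning / code generation: GRPO (there is a clear rule for judging correctness)
Limited resources: DPO (the simplest; needs only preference data and 2 models)
Best possible results: PPO (requires strong engineering and tuning experience)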

IX. Key Code Snippets

9.1 LoRA Module (Hand-Written)

import torch

class Lora(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear
        self.lora_A = torch.nn.Parameter(torch.randn(linear.in_features, 128) * 0.1)
        self.lora_B = torch.nn.Parameter(torch.zeros(128, linear.out_features))
        self.linear.weight.requires_grad = False

    def forward(self, x):
        y_linear = self.linear(x)
        y_lora = x.matmul(self.lora_A).matmul(self.lora_B)
        return y_linear + y_lora / 128

def merge(model):
    """Merge the trained LoRA weights back into the original model."""
    # list() takes a snapshot so modules can be replaced while iterating
    for name, layer in list(model.named_modules()):
        if isinstance(layer, Lora):
            linear = layer.linear
            # nn.Linear stores its weight as (out, in), hence the transpose
            linear.weight.data += layer.lora_A.matmul(layer.lora_B).t() / 128
            # `set_layer` is a helper not shown in the source; it is assumed to
            # replace the module at `name` with `linear`
            set_layer(model, name, linear)

9.2 GRPO KL Divergence

def grpo_kl(pi_logprob, pi_ref_logprob):
    """Unbiased, low-variance KL estimate (k3 estimator)."""
    log_ratio = pi_ref_logprob - pi_logprob  # log(pi_ref / pi_theta)
    return log_ratio.exp() - log_ratio - 1

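9.3 GRPO Advantage Function

import torch

def grpo_advantage(rewards):
    """Group-relative advantage: each reward minus the group mean, divided by the group std."""
    # Small positive constant; the original used 0.001 * torch.randn(1), which can be negative
    epsilon = 1e-4
    rewards = torch.tensor(rewards, dtype=torch.float)
    return (rewards - rewards.mean()) / (rewards.std() + epsilon)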

9.4 DPO Loss Function

import torch

# Core of DPO: the log-probability gap between chosen and rejected
prob_diff = get_prob_diff(model_actor, **data)             # current model
with torch.no_grad():
    prob_diff_ref = get_prob_diff(model_actor_ref, **data)  # reference model

margin = 0.1 * (prob_diff - prob_diff_ref)  # beta = 0.1
loss = -torch.nn.functional.logsigmoid(margin).mean()  # numerically stable log-sigmoid

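X. Quiz

Complete the following 15 questions. Answers are in 教程_强化学习_答案.md

1. In the RLHF framing of LLMs, what does "state" correspond to?

2. What role does KL divergence play in RLHF? What goes wrong without a KL penalty?

3. What are the three stages of RLHF? List them in order.

4. What is the core assumption of the Bradley-Terry model used by the Reward Model? Write down the loss function.

5. What are typical values of the clip range $\varepsilon$ in PPO's Clipped Surrogate Objective? What is the purpose of clipping?

6. In PPO's reward design, how is the KL penalty folded into each token's reward? What is special about the last token?

7. What is DPO's biggest advantage over PPO? How many models does DPO need?

8. Is DPO on-policy or off-policy? What does that imply?

9. What form does DPO's implicit reward function take? Write down the formula.

10. What are the two core differences between GRPO and PPO?

11. What is the formula for GRPO's advantage $\hat{A}_{i,t}$? Why does every token in the same response share the same advantage value?

12. What happens to the advantage when all responses in a GRPO group are correct, or all are wrong? How should this be handled?

13. What is the formula of the KL estimator (k3) used by GRPO? What is its advantage over the naive estimator k1?

14. In the GRPO GSM8K math reasoning exercise, which dimensions does the reward function cover, and how many points is each worth?

15. Compare PPO, DPO, and GRPO along three dimensions: "number of models required", "sampling method", and "typical use".

XI. Suggested Mind Map Structure

Below is a mind map structure for this tutorial, suitable for Obsidian Canvas or a mind map plugin: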

Reinforcement Learning for Large Language Models
├── I. Why RLHF Is Needed
│   ├── Limitations of SFT
│   │   ├── Limited diversity
│   │   ├── Conflicting objectives
│   │   └── Distribution shift
│   └── Mapping of core RL concepts
│       ├── Policy → LLM
│       ├── State → prompt + generated tokens
│       ├── Action → next token
│       └── Reward → RM score / rule-based score
│
├── II. KL Divergence
│   ├── Standard KL formula
│   ├── Token-level KL
│   └── GRPO unbiased low-variance KL (k3 estimator)
│       ├── k1: unbiased, high variance
│       ├── k2: biased, low variance
│       └── k3: unbiased, low variance ← used by GRPO
│
├── III. The Complete RLHF (PPO) Pipeline
│   ├── Stage 1: SFT
│   ├── Stage 2: Reward Model
│   │   ├── Preference data (chosen/rejected)
│   │   ├── Bradley-Terry model
│   │   └── Pairwise Loss
│   └── Stage 3: PPO
│       ├── Reward design (RM + KL)
│       ├── GAE advantage estimation
│       ├── Clipped Surrogate Objective
│       ├── Value Loss
│       └── Hands-on: IMDB + DistilBERT / NL2SQL
│
├── IV. DPO: Direct Preference Optimization
│   ├── Derivation
│   │   ├── RL objective → closed-form solution
│   │   ├── Implicit reward function
│   │   └── DPO Loss
│   ├── Off-policy nature
│   ├── Native PyTorch implementation
│   └── TRL DPOTrainer implementation
│
├── V. GRPO: Group Relative Policy Optimization
│   ├── Core differences from PPO
│   │   ├── Rule-based rewards replace the Reward Model
│   │   └── Group advantage replaces the Value Model
│   ├── Group-relative advantage
│   ├── Rule-based reward design
│   ├── GRPO Loss
│   ├── GRPO KL (k3 estimator)
│   ├── GSM8K math reasoning hands-on
│   └── Qwen3-4B + Unsloth hands-on
│
└── VI. Method Comparison
    ├── PPO: 4 models, on-policy, general alignment
    ├── DPO: 2 models, off-policy, preference data available
    └── GRPO: 2 models, on-policy, rule-based judgment available