## 1. Closed-loop driving objective

At each time step $t$, the ego vehicle receives an ego-centric observation $o_t$, outputs a continuous control action $a_t$, and the simulator transitions to the next state. The policy objective is to maximize the expected discounted cumulative reward:

$$
J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{T-1} \gamma^t R_t \right].
$$

Where:

- $o_t$: ego-centric observation at time step $t$
- $a_t$: action at time step $t$
- $\pi_\theta(a_t \mid o_t)$: policy parameterized by $\theta$
- $R_t$: environment reward at time step $t$
- $\gamma$: discount factor
- $T$: scenario horizon

---

## 2. Multiplicative reward definition

Instead of a standard linear (weighted-sum) reward, the environment reward is defined as a multiplicative combination of normalized score terms:

$$
R_t = \prod_i S_{i,t}^{w_i},
$$

where:

- $S_{i,t}$: normalized score of reward component $i$ at time step $t$
- $w_i$: weight (hyperparameter) of reward component $i$

In the current implementation, the active reward consists of three terms:

$$
R_t = S_{\text{route\_safe},t}^{\alpha} \cdot S_{\text{termination\_switch},t}^{\beta} \cdot S_{\text{forward\_quality},t}^{\gamma},
$$

with default values

$$
\alpha = 0.5, \qquad \beta = 0.8, \qquad \gamma = 1.2.
$$

Here:

- $\alpha, \beta, \gamma$ are hyperparameters (note that this exponent $\gamma$ is a reward weight, distinct from the discount factor $\gamma$ of Section 1)
- each $S$ is a score calculated from the state at each step

---

## 3. Route-safe term

The route-safe term combines a local safe-bubble score with an off-route suppression term:

$$
S_{\text{route\_safe},t} = \operatorname{clip} \left( S_{\text{bubble},t} \cdot S_{\text{offroute},t}, \; s_{\min}, \; 1 \right).
$$

Where:

- $S_{\text{bubble},t}$: local safety score around the ego vehicle
- $S_{\text{offroute},t}$: route consistency score
- $s_{\min}$: minimum score floor

In the current design:

$$
s_{\min} = 0.2.
$$

### 3.1 Off-route score

The off-route score is

$$
S_{\text{offroute},t} = \frac{1}{1+\lambda \max(d_{\text{offroute},t},0)},
$$

where:

- $d_{\text{offroute},t}$: off-route deviation at time step $t$
- $\lambda$: penalty scaling factor (unrelated to the GAE $\lambda$ used later)

This term decreases smoothly as route deviation increases.

### 3.2 Safe-bubble score

The safe-bubble score is computed from local interaction geometry and TTC-inspired safety margins:

$$
S_{\text{bubble},t} = f_{\text{bubble}} \bigl( \text{relative geometry}, \text{distance}, \text{TTC-like margins} \bigr),
$$

and is clipped into

$$
S_{\text{bubble},t} \in [s_{\min}, 1].
$$

Engineering interpretation:

- $S_{\text{bubble},t}$ is a dense pre-collision safety signal
- it penalizes unsafe proximity before an actual collision occurs
- it encourages PPO to maintain a local safety envelope around the ego vehicle

---

## 4. Terminal safety gate

The terminal safety gate is defined by a binary severe-violation indicator:

$$
v_{\text{term},t} = \mathbb{I} \bigl( \text{offroad}_t \lor \text{overlap}_t \lor \text{run\_red\_light}_t \bigr).
$$

The corresponding reward score is

$$
S_{\text{termination\_switch},t} =
\begin{cases}
1.0, & v_{\text{term},t}=0, \\
s_{\min}, & v_{\text{term},t}=1.
\end{cases}
$$

So the worst score of this term is

$$
S_{\text{termination\_switch},t} = s_{\min} = 0.2.
$$

Its contribution to the final reward is

$$
S_{\text{termination\_switch},t}^{\beta} = 0.2^{0.8} \approx 0.276
\quad \text{when a terminal violation occurs.}
$$

---

## 5. Forward-quality term

The forward-quality term combines forward progress and comfort.

Let:

- $p_t$: binary making-progress signal
- $c_t$: continuous comfort reward

Then the intermediate quantity is

$$
q_t = \operatorname{clip}(p_t \cdot c_t, 0, 1).
$$

The final forward-quality score is

$$
S_{\text{forward\_quality},t} = 1 + q_t.
$$

Thus,

$$
S_{\text{forward\_quality},t} \in [1, 2].
$$

### 5.1 Making-progress signal

In the current implementation, progress is binary:

$$
p_t = \mathbb{I} \bigl( \text{progression}_t > \text{progression}_{t-1} \bigr).
$$

### 5.2 Comfort signal

The comfort signal is written as

$$
c_t = f_{\text{comfort}}(o_t, a_t, o_{t+1}),
$$

where larger values indicate smoother motion.

So this term rewards the policy only when it both:

- moves forward
- remains smooth

---

## 6. Full reward expression

Putting the terms together:

$$
R_t =
\left[ \operatorname{clip} \left( S_{\text{bubble},t} \cdot \frac{1}{1+\lambda \max(d_{\text{offroute},t},0)}, \; s_{\min}, \; 1 \right) \right]^{\alpha}
\cdot S_{\text{termination\_switch},t}^{\beta}
\cdot (1 + q_t)^{\gamma}.
$$

With the current setting:

$$
R_t = S_{\text{route\_safe},t}^{0.5} \cdot S_{\text{termination\_switch},t}^{0.8} \cdot S_{\text{forward\_quality},t}^{1.2}.
$$

---

## 7. How reward enters PPO

The PPO optimizer itself is standard. Your reward affects PPO through the return and the advantage.

### 7.1 Discounted return

$$
G_t = \sum_{l=0}^{T-t-1} \gamma^l R_{t+l}.
$$

So your multiplicative reward $R_t$ is the per-step environment reward used to build the return.

---

## 8. Temporal-difference residual

The one-step temporal-difference residual is

$$
\delta_t = R_t + \gamma (1-d_t)V(o_{t+1}) - V(o_t),
$$

where:

- $V(o_t)$: value function estimate
- $d_t$: terminal indicator at step $t$

This is the direct point where your reward enters GAE.

---

## 9. Generalized Advantage Estimation (GAE)

The advantage is estimated using GAE:

$$
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \left( \prod_{j=0}^{l-1}(1-d_{t+j}) \right) \delta_{t+l}.
$$

Equivalently, recursively:

$$
\hat{A}_t = \delta_t + \gamma \lambda (1-d_t)\hat{A}_{t+1}.
$$

Thus the chain is:

$$
R_t \;\longrightarrow\; \delta_t \;\longrightarrow\; \hat A_t.
$$

---

## 10. PPO probability ratio

The PPO policy ratio is

$$
\rho_t(\theta) =
\frac{\pi_\theta(a_t \mid o_t)}
{\pi_{\theta_k}(a_t \mid o_t)}.
$$

Where:

- $\theta_k$: parameters of the policy used to collect data
- $\theta$: updated policy parameters

---

## 11. PPO clipped surrogate objective

The clipped surrogate objective is

$$
L_t^{\text{clip}}(\theta) =
\min \left(
\rho_t(\theta)\hat{A}_t, \;
\operatorname{clip} \bigl( \rho_t(\theta), 1-\epsilon, 1+\epsilon \bigr) \hat{A}_t
\right),
$$

where $\epsilon$ is the PPO clipping coefficient.

---

## 12. Value target and value loss

A common value target is

$$
V_t^{\text{target}} = \hat{A}_t + V(o_t).
$$

Then the value loss is

$$
L_t^{\text{value}}(\theta) = \left( V_\theta(o_t) - V_t^{\text{target}} \right)^2.
$$

---

## 13. Entropy regularization

To encourage exploration, PPO includes an entropy term:

$$
\mathcal{H}\bigl(\pi_\theta(\cdot \mid o_t)\bigr).
$$

---

## 14. Full PPO objective

The total PPO objective is

$$
L^{\text{PPO}}(\theta) =
\mathbb{E}_t \left[
L_t^{\text{clip}}(\theta)
- c_v L_t^{\text{value}}(\theta)
+ c_e \mathcal{H}\bigl(\pi_\theta(\cdot \mid o_t)\bigr)
\right],
$$

where:

- $c_v$: value loss coefficient
- $c_e$: entropy coefficient

---

## 15. Full dependency chain from reward to PPO update

Putting everything together:

$$
R_t = \prod_i S_{i,t}^{w_i}
$$

$$
\Downarrow
$$

$$
\delta_t = R_t + \gamma(1-d_t)V(o_{t+1}) - V(o_t)
$$

$$
\Downarrow
$$

$$
\hat{A}_t = \text{GAE}(\delta_t)
$$

$$
\Downarrow
$$

$$
L_t^{\text{clip}}(\theta) = \min \left( \rho_t(\theta)\hat{A}_t, \operatorname{clip}(\rho_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t \right)
$$

$$
\Downarrow
$$

$$
L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ L_t^{\text{clip}}(\theta) - c_v L_t^{\text{value}}(\theta) + c_e \mathcal{H}(\pi_\theta) \right]
$$

$$
\Downarrow
$$

$$
\theta_{k+1} = \arg\max_\theta L^{\text{PPO}}(\theta)
$$

---

## 16. Short interpretation

Your contribution does not modify the PPO optimizer itself.
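What it does modify is the per-step reward. As a concrete illustration, the multiplicative reward of Sections 2-6 can be sketched as a small function; all names are hypothetical, and this is a sketch under the stated defaults, not the project's actual code:

```python
import numpy as np

# Hypothetical sketch of the multiplicative reward (Sections 2-6).
S_MIN = 0.2
ALPHA, BETA, GAMMA_W = 0.5, 0.8, 1.2  # reward exponents; GAMMA_W is not the discount factor

def multiplicative_reward(s_bubble, d_offroute, terminal_violation, p_t, c_t, lam=1.0):
    """Compute R_t = S_route_safe^alpha * S_term^beta * S_forward^gamma."""
    # Section 3.1: off-route score decays smoothly with deviation.
    s_offroute = 1.0 / (1.0 + lam * max(d_offroute, 0.0))
    # Section 3: route-safe term, floored at s_min.
    s_route_safe = np.clip(s_bubble * s_offroute, S_MIN, 1.0)
    # Section 4: binary terminal safety gate.
    s_term = S_MIN if terminal_violation else 1.0
    # Section 5: forward-quality term in [1, 2].
    q_t = np.clip(p_t * c_t, 0.0, 1.0)
    s_forward = 1.0 + q_t
    return (s_route_safe ** ALPHA) * (s_term ** BETA) * (s_forward ** GAMMA_W)

# A fully safe, smoothly progressing step earns the maximum reward 1 * 1 * 2^1.2:
print(multiplicative_reward(1.0, 0.0, False, 1, 1.0))  # ~2.297
```

Because the terms multiply, any single degraded score drags down the whole reward, which is the point of the multiplicative design.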
Instead, it modifies the task signal used by PPO:

- the environment reward is replaced by a structured multiplicative reward
- this reward changes the TD residual
- which changes the GAE advantage
- which changes the PPO policy update

In short:

$$
\text{reward design} \;\Rightarrow\; \text{return / advantage} \;\Rightarrow\; \text{PPO learning behavior}.
$$

Last modified: April 1, 2026 · © Reproduction permitted with proper attribution
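The chain reward → TD residual → GAE advantage → clipped surrogate (Sections 8-11) can also be sketched numerically. Function names and the toy numbers below are illustrative only:

```python
import numpy as np

# Illustrative sketch of R_t -> delta_t -> A_hat_t -> L_clip (Sections 8-11).
def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward-recursive GAE: A_t = delta_t + gamma*lam*(1-d_t)*A_{t+1}."""
    T = len(rewards)
    adv = np.zeros(T)
    a = 0.0
    for t in reversed(range(T)):
        # One-step TD residual; `values` has length T+1 to supply the bootstrap V(o_{t+1}).
        delta = rewards[t] + gamma * (1 - dones[t]) * values[t + 1] - values[t]
        a = delta + gamma * lam * (1 - dones[t]) * a
        adv[t] = a
    return adv

def clipped_surrogate(log_probs_new, log_probs_old, adv, eps=0.2):
    """PPO clipped surrogate: mean of min(rho*A, clip(rho, 1-eps, 1+eps)*A)."""
    rho = np.exp(log_probs_new - log_probs_old)
    return np.mean(np.minimum(rho * adv, np.clip(rho, 1 - eps, 1 + eps) * adv))

# Toy 3-step rollout: the multiplicative reward enters only through `rewards`.
rewards = np.array([2.0, 1.5, 0.276])     # e.g. values of R_t from Section 6
values  = np.array([1.0, 1.0, 1.0, 0.0])  # V(o_0..o_T); last entry is the bootstrap
dones   = np.array([0, 0, 1])
adv = gae_advantages(rewards, values, dones)
```

Note that the reward design never touches `clipped_surrogate`: it only changes the `rewards` array, exactly as the dependency chain in Section 15 states.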