## 1. Closed-loop driving objective

At each time step $t$, the ego vehicle receives an ego-centric observation $o_t$, outputs a continuous control action $a_t$, and the simulator transitions to the next state. The policy objective is to maximize the expected discounted cumulative reward:

$$
J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{T-1} \gamma^t R_t \right].
$$

Where:

- $o_t$: ego-centric observation at time step $t$
- $a_t$: action at time step $t$
- $\pi_\theta(a_t \mid o_t)$: policy parameterized by $\theta$
- $R_t$: environment reward at time step $t$
- $\gamma$: discount factor
- $T$: scenario horizon

---

## 2. Multiplicative reward definition

Instead of a standard linear (weighted-sum) reward, the environment reward is defined as a multiplicative combination of normalized score terms:

$$
R_t = \prod_i S_{i,t}^{w_i},
$$

where:

- $S_{i,t}$: normalized score of reward component $i$ at time step $t$
- $w_i$: weight (hyperparameter) of reward component $i$

In the current implementation, the active reward consists of three terms:

$$
R_t = S_{\text{route\_safe},t}^{\alpha} \cdot S_{\text{termination\_switch},t}^{\beta} \cdot S_{\text{forward\_quality},t}^{\gamma},
$$

with default values

$$
\alpha = 0.5, \qquad \beta = 0.8, \qquad \gamma = 1.2.
$$

Here:

- $\alpha, \beta, \gamma$ are hyperparameters (note that this exponent $\gamma$ is a reward weight, distinct from the discount factor $\gamma$ of Section 1)
- each $S$ is a score calculated from the state at each step

---

## 3. Route-safe term

The route-safe term combines a local safe-bubble score with an off-route suppression term:

$$
S_{\text{route\_safe},t} = \operatorname{clip} \left( S_{\text{bubble},t} \cdot S_{\text{offroute},t}, \; s_{\min}, \; 1 \right).
$$

Where:

- $S_{\text{bubble},t}$: local safety score around the ego vehicle
- $S_{\text{offroute},t}$: route consistency score
- $s_{\min}$: minimum score floor

In the current design:

$$
s_{\min} = 0.2.
$$

### 3.1 Off-route score

The off-route score is

$$
S_{\text{offroute},t} = \frac{1}{1+\lambda \max(d_{\text{offroute},t},0)},
$$

where:

- $d_{\text{offroute},t}$: off-route deviation at time step $t$
- $\lambda$: penalty scaling factor (unrelated to the GAE $\lambda$ used later)

This term decreases smoothly as route deviation increases.

### 3.2 Safe-bubble score

The safe-bubble score is computed from local interaction geometry and TTC-inspired safety margins:

$$
S_{\text{bubble},t} = f_{\text{bubble}} \bigl( \text{relative geometry}, \text{distance}, \text{TTC-like margins} \bigr),
$$

and is clipped into

$$
S_{\text{bubble},t} \in [s_{\min}, 1].
$$

Engineering interpretation:

- $S_{\text{bubble},t}$ is a dense pre-collision safety signal
- it penalizes unsafe proximity before an actual collision occurs
- it encourages PPO to maintain a local safety envelope around the ego vehicle

---

## 4. Terminal safety gate

The terminal safety gate is defined by a binary severe-violation indicator:

$$
v_{\text{term},t} = \mathbb{I} \bigl( \text{offroad}_t \lor \text{overlap}_t \lor \text{run\_red\_light}_t \bigr).
$$

The corresponding reward score is

$$
S_{\text{termination\_switch},t} =
\begin{cases}
1.0, & v_{\text{term},t}=0, \\
s_{\min}, & v_{\text{term},t}=1.
\end{cases}
$$

So the worst score of this term is

$$
S_{\text{termination\_switch},t} = s_{\min} = 0.2.
$$

Its contribution to the final reward is

$$
S_{\text{termination\_switch},t}^{\beta} = 0.2^{0.8} \approx 0.276
\quad \text{when a terminal violation occurs.}
$$

---

## 5. Forward-quality term

The forward-quality term combines forward progress and comfort.

Let:

- $p_t$: binary making-progress signal
- $c_t$: continuous comfort reward

Then the intermediate quantity is

$$
q_t = \operatorname{clip}(p_t \cdot c_t, 0, 1).
$$

The final forward-quality score is

$$
S_{\text{forward\_quality},t} = 1 + q_t.
$$

Thus,

$$
S_{\text{forward\_quality},t} \in [1, 2].
$$

### 5.1 Making-progress signal

In the current implementation, progress is binary:

$$
p_t = \mathbb{I} \bigl( \text{progression}_t > \text{progression}_{t-1} \bigr).
$$

### 5.2 Comfort signal

The comfort signal is written as

$$
c_t = f_{\text{comfort}}(o_t, a_t, o_{t+1}),
$$

where larger values indicate smoother motion.

So this term rewards the policy only when it both:

- moves forward
- remains smooth

---

## 6. Full reward expression

Putting the terms together:

$$
R_t =
\left[ \operatorname{clip} \left( S_{\text{bubble},t} \cdot \frac{1}{1+\lambda \max(d_{\text{offroute},t},0)}, \; s_{\min}, \; 1 \right) \right]^{\alpha}
\cdot S_{\text{termination\_switch},t}^{\beta}
\cdot (1 + q_t)^{\gamma}.
$$

With the current setting:

$$
R_t = S_{\text{route\_safe},t}^{0.5} \cdot S_{\text{termination\_switch},t}^{0.8} \cdot S_{\text{forward\_quality},t}^{1.2}.
$$

---

## 7. How reward enters PPO

The PPO optimizer itself is standard. Your reward affects PPO through the return and the advantage.

### 7.1 Discounted return

$$
G_t = \sum_{l=0}^{T-t-1} \gamma^l R_{t+l}.
$$

So your multiplicative reward $R_t$ is the per-step environment reward used to build the return.

---

## 8. Temporal-difference residual

The one-step temporal-difference residual is

$$
\delta_t = R_t + \gamma (1-d_t)V(o_{t+1}) - V(o_t),
$$

where:

- $V(o_t)$: value function estimate
- $d_t$: terminal indicator at step $t$

This is the direct point where your reward enters GAE.

---

## 9. Generalized Advantage Estimation (GAE)

The advantage is estimated using GAE:

$$
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \left( \prod_{j=0}^{l-1}(1-d_{t+j}) \right) \delta_{t+l}.
$$

Equivalently, recursively:

$$
\hat{A}_t = \delta_t + \gamma \lambda (1-d_t)\hat{A}_{t+1}.
$$

Thus the chain is:

$$
R_t \;\longrightarrow\; \delta_t \;\longrightarrow\; \hat A_t.
$$

---

## 10. PPO probability ratio

The PPO policy ratio is

$$
\rho_t(\theta) =
\frac{\pi_\theta(a_t \mid o_t)}
{\pi_{\theta_k}(a_t \mid o_t)}.
$$

Where:

- $\theta_k$: parameters of the policy used to collect data
- $\theta$: updated policy parameters

---

## 11. PPO clipped surrogate objective

The clipped surrogate objective is

$$
L_t^{\text{clip}}(\theta) =
\min \left(
\rho_t(\theta)\hat{A}_t, \;
\operatorname{clip} \bigl( \rho_t(\theta), 1-\epsilon, 1+\epsilon \bigr) \hat{A}_t
\right),
$$

where $\epsilon$ is the PPO clipping coefficient.

---

## 12. Value target and value loss

A common value target is

$$
V_t^{\text{target}} = \hat{A}_t + V(o_t).
$$

Then the value loss is

$$
L_t^{\text{value}}(\theta) = \left( V_\theta(o_t) - V_t^{\text{target}} \right)^2.
$$

---

## 13. Entropy regularization

To encourage exploration, PPO includes an entropy term:

$$
\mathcal{H}\bigl(\pi_\theta(\cdot \mid o_t)\bigr).
$$

---

## 14. Full PPO objective

The total PPO objective is

$$
L^{\text{PPO}}(\theta) =
\mathbb{E}_t \left[
L_t^{\text{clip}}(\theta)
- c_v L_t^{\text{value}}(\theta)
+ c_e \mathcal{H}\bigl(\pi_\theta(\cdot \mid o_t)\bigr)
\right],
$$

where:

- $c_v$: value loss coefficient
- $c_e$: entropy coefficient

---

## 15. Full dependency chain from reward to PPO update

Putting everything together:

$$
R_t = \prod_i S_{i,t}^{w_i}
$$

$$
\Downarrow
$$

$$
\delta_t = R_t + \gamma(1-d_t)V(o_{t+1}) - V(o_t)
$$

$$
\Downarrow
$$

$$
\hat{A}_t = \text{GAE}(\delta_t)
$$

$$
\Downarrow
$$

$$
L_t^{\text{clip}}(\theta) = \min \left( \rho_t(\theta)\hat{A}_t, \operatorname{clip}(\rho_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t \right)
$$

$$
\Downarrow
$$

$$
L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ L_t^{\text{clip}}(\theta) - c_v L_t^{\text{value}}(\theta) + c_e \mathcal{H}(\pi_\theta) \right]
$$

$$
\Downarrow
$$

$$
\theta_{k+1} = \arg\max_\theta L^{\text{PPO}}(\theta)
$$

---

## 16. Short interpretation

Your contribution does not modify the PPO optimizer itself.
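What it does modify is the per-step reward. As a concrete illustration, the multiplicative reward of Sections 2-6 can be sketched as a small function; all names are hypothetical, and this is a sketch under the stated defaults, not the project's actual code:

```python
import numpy as np

# Hypothetical sketch of the multiplicative reward (Sections 2-6).
S_MIN = 0.2
ALPHA, BETA, GAMMA_W = 0.5, 0.8, 1.2  # reward exponents; GAMMA_W is not the discount factor

def multiplicative_reward(s_bubble, d_offroute, terminal_violation, p_t, c_t, lam=1.0):
    """Compute R_t = S_route_safe^alpha * S_term^beta * S_forward^gamma."""
    # Section 3.1: off-route score decays smoothly with deviation.
    s_offroute = 1.0 / (1.0 + lam * max(d_offroute, 0.0))
    # Section 3: route-safe term, floored at s_min.
    s_route_safe = np.clip(s_bubble * s_offroute, S_MIN, 1.0)
    # Section 4: binary terminal safety gate.
    s_term = S_MIN if terminal_violation else 1.0
    # Section 5: forward-quality term in [1, 2].
    q_t = np.clip(p_t * c_t, 0.0, 1.0)
    s_forward = 1.0 + q_t
    return (s_route_safe ** ALPHA) * (s_term ** BETA) * (s_forward ** GAMMA_W)

# A fully safe, smoothly progressing step earns the maximum reward 1 * 1 * 2^1.2:
print(multiplicative_reward(1.0, 0.0, False, 1, 1.0))  # ~2.297
```

Because the terms multiply, any single degraded score drags down the whole reward, which is the point of the multiplicative design.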
Instead, it modifies the task signal used by PPO:

- the environment reward is replaced by a structured multiplicative reward
- this reward changes the TD residual
- which changes the GAE advantage
- which changes the PPO policy update

In short:

$$
\text{reward design} \;\Rightarrow\; \text{return / advantage} \;\Rightarrow\; \text{PPO learning behavior}.
$$

Last modified: April 1, 2026 · © Reproduction permitted with proper attribution
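The chain reward → TD residual → GAE advantage → clipped surrogate (Sections 8-11) can also be sketched numerically. Function names and the toy numbers below are illustrative only:

```python
import numpy as np

# Illustrative sketch of R_t -> delta_t -> A_hat_t -> L_clip (Sections 8-11).
def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward-recursive GAE: A_t = delta_t + gamma*lam*(1-d_t)*A_{t+1}."""
    T = len(rewards)
    adv = np.zeros(T)
    a = 0.0
    for t in reversed(range(T)):
        # One-step TD residual; `values` has length T+1 to supply the bootstrap V(o_{t+1}).
        delta = rewards[t] + gamma * (1 - dones[t]) * values[t + 1] - values[t]
        a = delta + gamma * lam * (1 - dones[t]) * a
        adv[t] = a
    return adv

def clipped_surrogate(log_probs_new, log_probs_old, adv, eps=0.2):
    """PPO clipped surrogate: mean of min(rho*A, clip(rho, 1-eps, 1+eps)*A)."""
    rho = np.exp(log_probs_new - log_probs_old)
    return np.mean(np.minimum(rho * adv, np.clip(rho, 1 - eps, 1 + eps) * adv))

# Toy 3-step rollout: the multiplicative reward enters only through `rewards`.
rewards = np.array([2.0, 1.5, 0.276])     # e.g. values of R_t from Section 6
values  = np.array([1.0, 1.0, 1.0, 0.0])  # V(o_0..o_T); last entry is the bootstrap
dones   = np.array([0, 0, 1])
adv = gae_advantages(rewards, values, dones)
```

Note that the reward design never touches `clipped_surrogate`: it only changes the `rewards` array, exactly as the dependency chain in Section 15 states.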