Inspirations and Takeaways
Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
- The PPO implementation (openai/baselines) contains many non-trivial optimizations that are not (or only barely) described in its corresponding paper (one of them is sketched below).
- The authors find that PPO’s marked improvement over TRPO (and even over stochastic gradient descent) can be largely attributed to these optimizations.
- When building algorithms, we should understand precisely how each component impacts agent training, both in terms of overall performance and underlying algorithmic behavior.
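As a concrete example, here is a minimal sketch of one such code-level optimization, value-function clipping, in which the critic's new prediction is only trusted within a small band around its old prediction. The function and argument names (clipped_value_loss, clip_eps, etc.) are my own; this illustrates the idea rather than reproducing the baselines code.

```python
import torch

def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    """Value-function clipping, sketched: keep the new value prediction
    within clip_eps of the old one and take the more pessimistic loss."""
    values_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Elementwise max penalizes whichever prediction (clipped or unclipped) is worse.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```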
Code
implementation-matters/agent.py at master · MadryLab/implementation-matters
Contributions and Key Results
다크 프로그래머 (Dark Programmer) :: An Intuitive Understanding of Optimization Techniques
Trust Region Policy Optimization — Spinning Up documentation (openai.com)
- Normal policy gradient keeps new and old policies close in parameter space.
- But even seemingly small differences in parameter space can translate into very large differences in performance, so a single bad step can collapse the policy's performance. This makes it dangerous to use large step sizes with vanilla policy gradient, thus hurting its sample efficiency.
- TRPO nicely avoids this kind of collapse and tends to quickly and monotonically improve performance (its constrained update is written out after this list).
- On-policy: it explores by sampling actions according to the latest version of its stochastic policy.
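For reference, one common way to write the TRPO update (following the Spinning Up presentation) maximizes the surrogate advantage subject to a KL-divergence trust region of radius δ around the current policy π_θk:

```latex
% TRPO update: maximize the surrogate advantage while keeping the average
% KL divergence between the old and new policies within a radius delta.
\theta_{k+1} = \arg\max_{\theta}\;
  \mathbb{E}_{s,a \sim \pi_{\theta_k}}\!\left[
    \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s,a)
  \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{s \sim \pi_{\theta_k}}\!\left[
    D_{\mathrm{KL}}\!\big(\pi_{\theta_k}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big)
  \right] \le \delta
```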
Proximal Policy Optimization — Spinning Up documentation (openai.com)

- When the advantage of an action is positive, we want to increase the probability of that action, but the clipped objective gives no extra benefit for pushing it beyond (1+ε) times the probability it had under the old policy.
- When the advantage of an action is negative, we want to decrease the probability of that action, but the clipped objective gives no extra benefit for pushing it below (1−ε) times the probability it had under the old policy (see the loss sketch after this list).
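The two rules above are what PPO's clipped surrogate loss implements: taking the minimum of the clipped and unclipped terms removes any benefit from moving the probability ratio outside [1−ε, 1+ε]. A minimal PyTorch sketch, with function and variable names of my own choosing:

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clamp the ratio to [1 - eps, 1 + eps] before weighting by the advantage.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum means a positive-advantage action gains nothing
    # past (1 + eps) x its old probability, and a negative-advantage action's
    # penalty cannot be reduced past (1 - eps) x its old probability.
    return -torch.min(unclipped, clipped).mean()
```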