Inspirations and Takeaways
Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
- The PPO implementation (openai/baselines) contains many non-trivial optimizations that are not (or only barely) described in its corresponding paper (one of them is sketched below).
- The authors find that PPO’s marked improvement over TRPO (and even over stochastic gradient descent) can be largely attributed to these optimizations.
- When building algorithms, we should understand precisely how each component impacts agent training, both in terms of overall performance and underlying algorithmic behavior.
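As a concrete example, here is a minimal sketch of one such code-level optimization, value-function clipping, in which the critic's new prediction is only trusted within a small band around its old prediction. The function and argument names (clipped_value_loss, clip_eps, etc.) are my own; this illustrates the idea rather than reproducing the baselines code.

```python
import torch

def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    """Value-function clipping, sketched: keep the new value prediction
    within clip_eps of the old one and take the more pessimistic loss."""
    values_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Elementwise max penalizes whichever prediction (clipped or unclipped) is worse.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```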
Code
implementation-matters/agent.py at master · MadryLab/implementation-matters
Contributions and Key Results
다크 프로그래머 (Dark Programmer) :: An Intuitive Understanding of Optimization Techniques
Trust Region Policy Optimization — Spinning Up documentation (openai.com)
- Normal policy gradient keeps new and old policies close in parameter space.
- But even seemingly small differences in parameter space can translate into very large differences in performance, so a single bad step can collapse the policy's performance. This makes it dangerous to use large step sizes with vanilla policy gradient, thus hurting its sample efficiency.
- TRPO nicely avoids this kind of collapse and tends to quickly and monotonically improve performance (its constrained update is written out after this list).
- On-policy: it explores by sampling actions according to the latest version of its stochastic policy.
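For reference, one common way to write the TRPO update (following the Spinning Up presentation) maximizes the surrogate advantage subject to a KL-divergence trust region of radius δ around the current policy π_θk:

```latex
% TRPO update: maximize the surrogate advantage while keeping the average
% KL divergence between the old and new policies within a radius delta.
\theta_{k+1} = \arg\max_{\theta}\;
  \mathbb{E}_{s,a \sim \pi_{\theta_k}}\!\left[
    \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s,a)
  \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{s \sim \pi_{\theta_k}}\!\left[
    D_{\mathrm{KL}}\!\big(\pi_{\theta_k}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big)
  \right] \le \delta
```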
Proximal Policy Optimization — Spinning Up documentation (openai.com)

- When the advantage of an action is positive, we want to increase the probability of that action, but the clipped objective gives no extra benefit for pushing it beyond (1+ε) times the probability it had under the old policy.
- When the advantage of an action is negative, we want to decrease the probability of that action, but the clipped objective gives no extra benefit for pushing it below (1−ε) times the probability it had under the old policy (see the loss sketch after this list).
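The two rules above are what PPO's clipped surrogate loss implements: taking the minimum of the clipped and unclipped terms removes any benefit from moving the probability ratio outside [1−ε, 1+ε]. A minimal PyTorch sketch, with function and variable names of my own choosing:

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clamp the ratio to [1 - eps, 1 + eps] before weighting by the advantage.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum means a positive-advantage action gains nothing
    # past (1 + eps) x its old probability, and a negative-advantage action's
    # penalty cannot be reduced past (1 - eps) x its old probability.
    return -torch.min(unclipped, clipped).mean()
```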