The code is a TRPO implementation. In "get_kl", I can't understand the difference between "mean0, log_std0, std0" and "mean1, log_std1, std1": aren't they equal in the code? Similarly, in "get_loss", aren't the log-probs of the old policy and the new policy equal? Thanks for the help!
Good point...I'll check in more detail when I get a chance later today! I would suggest looking at a more recent implementation like https://github.com/DLR-RM/stable-baselines3 or https://github.com/thu-ml/tianshou if you're trying to build. https://spinningup.openai.com/en/latest/algorithms/trpo.html is particularly good for understanding
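For what it's worth, in many PyTorch TRPO implementations the "0" quantities are *detached* copies of the "1" quantities. They are numerically equal at the current parameters, but autograd treats the detached copies as constants, so the KL's value and gradient are zero while its second derivatives (the Fisher matrix) are not, and those are what the conjugate-gradient step needs. Here's a minimal sketch of that pattern, with made-up policy outputs standing in for what a policy network would produce (this is illustrative, not the exact code from the question):

```python
import torch

# Hypothetical diagonal-Gaussian policy outputs (stand-ins for a
# policy_net forward pass; not the original code).
mean1 = torch.tensor([0.5, -0.3], requires_grad=True)
log_std1 = torch.tensor([0.1, 0.2], requires_grad=True)
std1 = log_std1.exp()

# The "0" copies: numerically equal, but detached, so autograd treats
# them as constants (the "old" policy frozen at the current parameters).
mean0 = mean1.detach()
log_std0 = log_std1.detach()
std0 = std1.detach()

# KL( N(mean0, std0) || N(mean1, std1) ), summed over action dimensions.
kl = (log_std1 - log_std0
      + (std0.pow(2) + (mean0 - mean1).pow(2)) / (2.0 * std1.pow(2))
      - 0.5).sum()

# At the current parameters the KL value is exactly zero...
print(kl.item())  # 0.0

# ...but its second derivatives are not: they form the Fisher matrix,
# which TRPO probes through Hessian-vector products (double backward).
grads = torch.autograd.grad(kl, [mean1, log_std1], create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])
v = torch.ones_like(flat_grad)  # arbitrary probe vector
fvp = torch.autograd.grad(flat_grad @ v, [mean1, log_std1])
print([t.tolist() for t in fvp])  # nonzero Fisher-vector product
```

The same reasoning applies to the log-prob ratio in the loss: old and new log-probs coincide at the current parameters (the ratio is 1), but only the new one carries gradients, so the surrogate loss still produces a useful policy gradient.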