The code is a TRPO implementation. In "get_kl", I can't understand the difference between "mean0, log_std0, std0" and "mean1, log_std1, std1": aren't they equal in the code? Similarly, in "get_loss", aren't the log-probs of the old policy and the new policy equal? Thanks for the help!
Good point...I'll check in more detail when I get a chance later today! I would suggest looking at a more recent implementation like https://github.com/DLR-RM/stable-baselines3 or https://github.com/thu-ml/tianshou if you're trying to build. https://spinningup.openai.com/en/latest/algorithms/trpo.html is particularly good for understanding
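For what it's worth, in many PyTorch TRPO implementations the "0" quantities are *detached* copies of the "1" quantities. They are numerically equal at the current parameters, but autograd treats the detached copies as constants, so the KL's value and gradient are zero while its second derivatives (the Fisher matrix) are not, and those are what the conjugate-gradient step needs. Here's a minimal sketch of that pattern, with made-up policy outputs standing in for what a policy network would produce (this is illustrative, not the exact code from the question):

```python
import torch

# Hypothetical diagonal-Gaussian policy outputs (stand-ins for a
# policy_net forward pass; not the original code).
mean1 = torch.tensor([0.5, -0.3], requires_grad=True)
log_std1 = torch.tensor([0.1, 0.2], requires_grad=True)
std1 = log_std1.exp()

# The "0" copies: numerically equal, but detached, so autograd treats
# them as constants (the "old" policy frozen at the current parameters).
mean0 = mean1.detach()
log_std0 = log_std1.detach()
std0 = std1.detach()

# KL( N(mean0, std0) || N(mean1, std1) ), summed over action dimensions.
kl = (log_std1 - log_std0
      + (std0.pow(2) + (mean0 - mean1).pow(2)) / (2.0 * std1.pow(2))
      - 0.5).sum()

# At the current parameters the KL value is exactly zero...
print(kl.item())  # 0.0

# ...but its second derivatives are not: they form the Fisher matrix,
# which TRPO probes through Hessian-vector products (double backward).
grads = torch.autograd.grad(kl, [mean1, log_std1], create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])
v = torch.ones_like(flat_grad)  # arbitrary probe vector
fvp = torch.autograd.grad(flat_grad @ v, [mean1, log_std1])
print([t.tolist() for t in fvp])  # nonzero Fisher-vector product
```

The same reasoning applies to the log-prob ratio in the loss: old and new log-probs coincide at the current parameters (the ratio is 1), but only the new one carries gradients, so the surrogate loss still produces a useful policy gradient.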