MBPO-MDP

A re-implementation of the MBPO algorithm for a tabular MDP environment.

What does this repo contain:

  • We have built a framework in which an MBPO-based agent interacts with an environment whose dynamics are given by a tabular MDP (MBPO, Model-Based Policy Optimisation, is a model-based RL algorithm). This strengthens our understanding of what exactly is going on inside MBPO: since the optimal policy can be computed explicitly for tabular MDPs, we can compare it with the policy the MBPO-based agent learns. In short, the user can use this framework for debugging purposes; a minimal sketch of computing the reference optimal policy is given below.
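
Since the comparison relies on knowing the exact optimum, here is a minimal sketch (my own illustration, not the repo's code; `value_iteration`, `P`, and `R` are hypothetical names) of how the optimal policy of a tabular MDP can be computed by value iteration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Compute the optimal policy of a tabular MDP by value iteration.

    P : (S, A, S) array of transition probabilities P[s, a, s'].
    R : (S, A) array of expected rewards.
    Returns the optimal values V* and a greedy policy pi*.
    (Hypothetical helper for illustration, not the repo's API.)
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)

# Tiny random MDP as a smoke test.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))   # 4 states, 2 actions
R = rng.normal(size=(4, 2))
V_star, pi_star = value_iteration(P, R)
print("optimal policy:", pi_star)
```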

Objective:

  • To learn about and explore the real-data vs. fake-data trade-off that exists in the Model-Based Policy Optimisation (MBPO) algorithm.

Pre-requisites:

A basic understanding of the following topics is expected:

  • The MBPO algorithm.
  • Off-policy policy gradient (importance-sampling based); a minimal sketch of the estimator is given after this list.
  • MDPs.
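
Since the training approaches described below rest on the importance-sampling-based off-policy policy gradient, here is a minimal sketch of that estimator for a tabular softmax policy. This is an illustration under my own assumptions (per-episode importance weights, hypothetical names), not the repo's implementation:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def is_reinforce_gradient(episodes, theta, mu, gamma=0.99):
    """Importance-sampling REINFORCE gradient for a tabular softmax policy.

    episodes : list of episodes, each a list of (s, a, r) tuples collected
               under a behaviour policy mu (mu[s, a] = prob of action a in state s).
    theta    : (S, A) logits of the target policy pi.
    Returns an estimate of grad_theta J(pi) using per-episode IS weights.
    (Hypothetical sketch, not the repo's code.)
    """
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for ep in episodes:
        # Per-episode importance weight: prod_t pi(a_t|s_t) / mu(a_t|s_t).
        w = np.prod([pi[s, a] / mu[s, a] for s, a, _ in ep])
        # Discounted return-to-go G_t for every step of the episode.
        G, returns = 0.0, []
        for _, _, r in reversed(ep):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, a, _), G_t in zip(ep, returns):
            # grad log pi(a|s) for a softmax policy: one-hot(a) - pi(.|s).
            g = -pi[s].copy()
            g[a] += 1.0
            grad[s] += w * G_t * g
    return grad / max(len(episodes), 1)
```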

Below, I will touch upon the MBPO algorithm, a practical version of it, and the intuition behind it.

Algorithm proposed (simplified version):

(Figure: pseudocode of the simplified algorithm.)

Algorithm proposed (implementation version):

(Figure: pseudocode of the implementation version.)

Intuition:

MBPO optimizes a policy under a learned model, collects data under the updated policy, and uses that data to train a new model.
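
That loop can be sketched as the skeleton below. The callables are hypothetical placeholders standing in for the repo's components; this only shows the order of operations, not the actual interface:

```python
def mbpo_loop(env, policy, train_model, rollout_model, update_policy,
              collect_real_data, num_iterations=100):
    """Skeleton of the MBPO loop described above.

    All callables are hypothetical placeholders, not the repo's API.
    """
    real_buffer = collect_real_data(env, policy)       # seed with real data
    for _ in range(num_iterations):
        model = train_model(real_buffer)               # fit dynamics to real data
        fake_buffer = rollout_model(model, policy)     # short rollouts inside the model
        policy = update_policy(policy, fake_buffer)    # policy gradient on fake data
        real_buffer += collect_real_data(env, policy)  # fresh real data under updated policy
    return policy
```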

The agent can be switched to any of the following training approaches:

A Note before:

  • By real episodes, I mean episodes obtained by interacting with the actual environment.
  • By fake episodes, I mean episodes obtained by interacting with the estimated environment. As per the MBPO algorithm, the agent maintains an estimate of the environment's dynamics; using this estimate, the agent generates a bunch of fake trajectories and then performs policy gradient on this fake data. A minimal sketch of this estimate-and-rollout step is given after this list.
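
For a tabular MDP, estimating the dynamics and generating a fake episode could look roughly like the sketch below (my own illustration with hypothetical names such as `estimate_mdp` and `fake_episode`, not the repo's code):

```python
import numpy as np

def estimate_mdp(transitions, n_states, n_actions):
    """Maximum-likelihood estimate of a tabular MDP from real transitions.

    transitions : iterable of (s, a, r, s_next) tuples from real episodes.
    Returns estimated P_hat[s, a, s'] and R_hat[s, a].
    (Hypothetical helper, not the repo's code.)
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform next-state distribution.
    P_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_states)
    R_hat = reward_sum / np.maximum(visits.squeeze(-1), 1)
    return P_hat, R_hat

def fake_episode(P_hat, R_hat, policy, s0, horizon, rng):
    """Roll out one fake episode inside the estimated model.

    policy : (S, A) array of action probabilities.
    """
    episode, s = [], s0
    for _ in range(horizon):
        a = rng.choice(len(policy[s]), p=policy[s])          # sample from the policy
        s_next = rng.choice(P_hat.shape[2], p=P_hat[s, a])   # sample from learned dynamics
        episode.append((s, a, R_hat[s, a], s_next))
        s = s_next
    return episode
```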

Three approaches:

  • Policy gradient on real episodes (pure): this is the same as the normal off-policy policy gradient.
  • Policy gradient on fake episodes (pure): policy gradient is performed using only the fake-episode data.
  • Policy gradient on a mixture of real and fake episodes (mixed): policy gradient is performed on a data buffer comprising a mixture of real and fake episodes. The user can specify the mixing ratio as a parameter; a sketch of drawing such a mixed batch is given after this list.
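
For the mixed approach, a training batch can be assembled from the two buffers according to the mixing ratio. A minimal sketch, assuming each buffer is simply a list of episodes (hypothetical helper, not the repo's code):

```python
import random

def sample_mixed_batch(real_episodes, fake_episodes, batch_size, real_ratio=0.5):
    """Draw a training batch from a mixture of real and fake episodes.

    real_ratio plays the role of the user-specified mixing parameter: the
    fraction of the batch taken from real episodes; the rest is fake.
    (Hypothetical helper for illustration.)
    """
    n_real = int(round(batch_size * real_ratio))
    n_fake = batch_size - n_real
    batch = random.choices(real_episodes, k=n_real) + \
            random.choices(fake_episodes, k=n_fake)
    random.shuffle(batch)
    return batch
```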

Thanks/References:
