
On the model-based stochastic value gradient for continuous reinforcement learning

This repository is by Brandon Amos, Samuel Stanton, Denis Yarats, and Andrew Gordon Wilson and contains the PyTorch source code to reproduce the experiments in our L4DC 2021 paper On the model-based stochastic value gradient for continuous reinforcement learning. Videos of our agents are available here.

Setup and dependencies

After cloning this repository and installing PyTorch on your system, you can set up the code with:

pip install -r requirements.txt
python3 setup.py develop

Setting up MuJoCo 1.5

This code was developed against an older version of MuJoCo and is only known to be compatible with MuJoCo 1.5. It can be set up as described in Raj Ghugare's ALM README, reproduced here:

  1. Download the mjpro150 binaries here.
  2. Extract the downloaded mjpro150 directory into ~/.mujoco/.
  3. Download the free activation key from here and place it in ~/.mujoco/.
  4. Update LD_LIBRARY_PATH to the binaries. Also consider adding it to your shell's initialization file such as ~/.bashrc.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mjpro150/bin

If you are using a more recent version of MuJoCo (> 2.0), it may produce inaccurate or zero contact forces in the Humanoid and Ant environments; see #2593, #1541, and #1636. If you encounter any errors, check the troubleshooting section of mujoco-py.
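To sanity-check the installation, a short mujoco-py snippet such as the one below should run without errors. This is an illustrative check rather than part of the repository; it assumes mujoco-py is installed and uses the humanoid.xml example model that ships with the mjpro150 binaries:

import os

import mujoco_py

# Load one of the example models bundled with the mjpro150 binaries.
model_path = os.path.expanduser("~/.mujoco/mjpro150/model/humanoid.xml")
model = mujoco_py.load_model_from_path(model_path)
sim = mujoco_py.MjSim(model)

# Step the simulation once; failures here usually point to a bad
# LD_LIBRARY_PATH or a missing license key.
sim.step()
print("MuJoCo 1.5 is working; qpos shape:", sim.data.qpos.shape)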

A basic run and analysis

You can start a single local run on the humanoid with:

./train.py env=mbpo_humanoid

This will create an experiment directory in exp/local/<date>/ containing model checkpoints and logging info. Once the first model has been saved, you can render a video of the agent with some diagnostic information using:

./eval-vis-model.py exp/local/2021.05.07

Reproducing our main experimental results

The default hyper-parameters in this repository are set to the best values we found in a hyper-parameter search. The following command reproduces our final results across five environments and 10 seeds with these hyper-parameters (seed=$(seq -s, 10) expands to seed=1,2,...,10, so Hydra launches one run per environment and seed combination):

./train.py -m experiment=mbpo_final env=mbpo_cheetah,mbpo_hopper,mbpo_walker2d,mbpo_humanoid,mbpo_ant seed=$(seq -s, 10)

The results from this experiment can be plotted with our notebook nbs/mbpo.ipynb, which can also serve as a starting point for analyzing and developing further methods.
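If you prefer a standalone script to the notebook, a minimal plotting sketch along the following lines should work. It assumes each run writes an eval.csv with step and episode_reward columns; the exact file and column names may differ, so check the CSVs written by train.py (or nbs/mbpo.ipynb) for the real schema:

import glob

import matplotlib.pyplot as plt
import pandas as pd

# Collect the per-seed evaluation logs from the experiment output.
frames = [pd.read_csv(p) for p in glob.glob("exp/local/*/*/eval.csv")]
runs = pd.concat(frames)

# Aggregate across seeds and plot the mean with a one-standard-deviation band.
grouped = runs.groupby("step")["episode_reward"]
mean, std = grouped.mean(), grouped.std()
plt.plot(mean.index, mean.values)
plt.fill_between(mean.index, mean - std, mean + std, alpha=0.3)
plt.xlabel("environment steps")
plt.ylabel("episode reward")
plt.savefig("learning_curve.png")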

Experiment run for a trained humanoid attaining 10k reward

The directory trained-humanoid contains the experiment logs (*.csv) and checkpoint file (latest.pkl) for a humanoid agent that attains approximately 10k reward. It can be evaluated and visualized with:

./eval-vis-model.py trained-humanoid
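./eval-vis-model.py is the supported entry point, but if you want to inspect the checkpoint directly, something like the following is a reasonable starting point. It assumes latest.pkl was written with torch.save, which is a guess based on the file extension; fall back to the standard pickle module if that fails:

import torch

# Load the checkpoint onto the CPU and inspect its structure; the
# contents of the object are repository-specific.
ckpt = torch.load("trained-humanoid/latest.pkl", map_location="cpu")
print(type(ckpt))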

Reproducing our sweeps and ablations

Our main hyper-parameter sweeps are run with Hydra's multi-run mode and can be launched with the following command after uncommenting the hydra/sweeper line in config/train.yaml:

./train.py -m experiment=full_poplin_sweep
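For reference, Hydra selects its sweeper through the defaults list of the config file, so the line you are uncommenting in config/train.yaml looks something like the entry below. The Ax sweeper plugin is shown purely as an illustration; the actual sweeper is whichever one the repository's commented-out line names:

defaults:
  - hydra/sweeper: ax   # illustrative; use the sweeper named in config/train.yaml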

The results from this experiment can be plotted with our notebook nbs/poplin.ipynb.

Citations

If you find this repository helpful for your publications, please consider citing our paper:

@inproceedings{amos2021svg,
  title={On the model-based stochastic value gradient for continuous reinforcement learning},
  author={Amos, Brandon and Stanton, Samuel and Yarats, Denis and Wilson, Andrew Gordon},
  booktitle={L4DC},
  year={2021}
}

Licensing

This repository is licensed under the CC BY-NC 4.0 License.
