
Training in process/core level parallelism #3

Open
thisiscam opened this issue Apr 21, 2016 · 5 comments

thisiscam commented Apr 21, 2016

Hi @Zeta36

Great project! I'm trying to run some experiments with the code. It seems that the code currently uses threading with TensorFlow, and from my observation the training loop does not really run in full parallel because it uses threads instead of processes. Ideally, I think each learner should run in a separate process to fully utilize a modern machine.
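To illustrate what I mean, here is a minimal sketch (not this project's actual code) of the pattern of several learner threads sharing one session inside a single Python process; the names and the toy op are made up:

```python
# Minimal sketch (not this project's actual code): learner threads sharing one
# tf.Session in a single Python process. session.run() can release the GIL
# while TensorFlow ops execute, but all the Python-side work per step
# (environment stepping, preprocessing, bookkeeping) is serialized by the GIL.
import threading
import tensorflow as tf

counter = tf.Variable(0)
train_op = counter.assign_add(1)  # stand-in for a real training op

sess = tf.Session()
sess.run(tf.initialize_all_variables())

def learner_loop(thread_id):
    for _ in range(10000):
        # Any per-step Python work here competes for the single GIL.
        sess.run(train_op)

threads = [threading.Thread(target=learner_loop, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sess.run(counter))  # 8 * 10000 if every increment was applied
```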

This might be relevant:
http://stackoverflow.com/questions/34900246/tensorflow-passing-a-session-to-a-python-multiprocess

But it looks like bad news: I can't just spawn a bunch of processes and let them share the same TensorFlow session. So maybe a distributed TensorFlow session is what we need:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/distributed/index.md
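Something along these lines, roughly (a sketch of between-graph replication; the addresses and the toy loss are made up, each process would be launched with its own job_name/task_index, and the ps process would just call server.join()):

```python
# Rough sketch of between-graph replication with one parameter server and
# two workers; addresses and the toy loss are placeholders only.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # Shared parameters are placed on the ps task; each worker builds its own
    # copy of the rest of the graph.
    w = tf.Variable(tf.zeros([10]))
    loss = tf.reduce_sum(tf.square(w - 1.0))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session(server.target) as sess:
    sess.run(tf.initialize_all_variables())
    sess.run(train_op)
```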


ahundt commented Dec 10, 2016

I've run this as well and, as @thisiscam mentions, it doesn't appear to actually run in parallel with good utilization. When I run the program, most Python threads are at 5% core utilization except for one thread at 97%, which means that collectively only about two cores are actually in use.


ahundt commented Dec 10, 2016

@thisiscam Distributed TensorFlow, as per your link, is meant for many physical machines networked together. Before taking that approach, it is important to fully utilize the capabilities of a single machine.


ahundt commented Dec 10, 2016

The threading and queues mechanism is more likely to be the right way to go: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/threading_and_queues/index.md
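Roughly, that pattern looks like this (a sketch with made-up shapes and a random-data feeder standing in for the emulator); the feeder threads only push data into a TensorFlow queue, and the heavy ops run inside the TF runtime:

```python
# Rough sketch of the queue pattern: feeder threads push frames into a
# tf.FIFOQueue, a Coordinator manages shutdown, and the training side
# dequeues batches without holding the Python GIL.
import threading
import numpy as np
import tensorflow as tf

queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], shapes=[[84, 84, 4]])
example = tf.placeholder(tf.float32, shape=[84, 84, 4])
enqueue_op = queue.enqueue([example])
batch = queue.dequeue_many(32)  # would feed the training op

sess = tf.Session()
coord = tf.train.Coordinator()

def feeder():
    try:
        while not coord.should_stop():
            frame = np.random.rand(84, 84, 4).astype(np.float32)  # stand-in for ALE frames
            sess.run(enqueue_op, feed_dict={example: frame})
    except tf.errors.CancelledError:
        pass  # queue was closed during shutdown

threads = [threading.Thread(target=feeder) for _ in range(4)]
for t in threads:
    t.start()

print(sess.run(batch).shape)  # (32, 84, 84, 4)
coord.request_stop()
sess.run(queue.close(cancel_pending_enqueues=True))  # unblock feeders stuck in enqueue
coord.join(threads)
```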


ahundt commented Dec 10, 2016

@thisiscam I saw you made some changes in a branch here: https://github.com/thisiscam/Asynchronous-Methods-for-Deep-Reinforcement-Learning/tree/ale

But it looks like you forgot to add a file for some of the functions, such as load_ale(), which is simply not present.

thisiscam (Author) commented

load_ale comes from ale_python_interface:
https://github.com/bbitmaster/ale_python_interface
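For anyone else looking at the branch, a load_ale-style helper on top of that package looks roughly like this (the ROM path and the frame_skip value are only examples):

```python
# Minimal sketch of a load_ale-style helper built on ale_python_interface;
# the ROM path and frame_skip setting below are illustrative only.
from ale_python_interface import ALEInterface

def load_ale(rom_path, seed=123, frame_skip=4):
    ale = ALEInterface()
    ale.setInt('random_seed', seed)
    ale.setInt('frame_skip', frame_skip)
    ale.loadROM(rom_path)
    return ale

ale = load_ale('roms/breakout.bin')  # example ROM path
actions = ale.getMinimalActionSet()
reward = ale.act(actions[0])         # step the emulator by one action
screen = ale.getScreenRGB()          # current frame as an RGB array
```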

However, from my experiments I have not yet found parameters that work well enough. It might be due to some bug in the code.
