yzhwang/jax-multi-gpu-resnet50-example

~~# jax-multi-gpu-resnet50-example~~

This repo shows how to use JAX for multi-node, multi-GPU training. The example is adapted from the ResNet-50 example in dm-haiku (https://github.com/deepmind/dm-haiku/tree/main/examples/imagenet). It only requires that each node know the IP address of the rank 0 node, much like PyTorch's DDP.

~~When two containers on the same cluster are running, one can run the following script in each container to launch a multi-node multi-GPU training job:~~

~~python train.py --server_ip=$ROOT_IP --server_port=$PORT --num_hosts=$NUM_HOSTS --host_idx=$HOST_IDX~~

THIS IS OBSOLETE

Multi-host GPU setup in JAX is now much easier. See the JAX project's

documentation: https://jax.readthedocs.io/en/latest/_autosummary/jax.distributed.initialize.html

related test: https://github.com/google/jax/blob/main/tests/distributed_test.py

and the PR that enables this in one of Google Research's repos: google-research/t5x#626
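As a rough illustration of the newer approach, the sketch below maps the four launch flags from the obsolete command above (`--server_ip`, `--server_port`, `--num_hosts`, `--host_idx`) onto the `coordinator_address`, `num_processes`, and `process_id` arguments of `jax.distributed.initialize`. The helper names `parse_args`, `distributed_kwargs`, and `main` are hypothetical and are not taken from this repo's `train.py`. The flag-to-argument mapping is an assumption about how the two setups correspond.

```python
import argparse


def parse_args(argv=None):
    """Parse the same launch flags that the obsolete command used."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--server_ip", required=True)              # IP of the rank 0 node
    parser.add_argument("--server_port", type=int, required=True)  # port on the rank 0 node
    parser.add_argument("--num_hosts", type=int, required=True)    # total number of processes
    parser.add_argument("--host_idx", type=int, required=True)     # index of this process
    return parser.parse_args(argv)


def distributed_kwargs(args):
    """Build keyword arguments for jax.distributed.initialize (assumed mapping)."""
    return dict(
        coordinator_address=f"{args.server_ip}:{args.server_port}",
        num_processes=args.num_hosts,
        process_id=args.host_idx,
    )


def main(argv=None):
    # Imported lazily so the helpers above can be used without JAX installed.
    import jax

    args = parse_args(argv)
    # Must be called before any other JAX computation in the process.
    jax.distributed.initialize(**distributed_kwargs(args))
    print(
        f"process {jax.process_index()} of {jax.process_count()}: "
        f"{jax.local_device_count()} local devices, "
        f"{jax.device_count()} global devices"
    )
```

To try it, call `main()` at the end of the script and launch one copy per host with the same flags as before, changing only `--host_idx` on each host.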

