🔮 Execution time predictions for deep neural network training iterations across different GPUs.

Overview

Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training

Habitat is a tool that predicts a deep neural network's training iteration execution time on a given GPU. It currently supports PyTorch. To learn more about how Habitat works, please see our research paper.

Running From Source

Currently, the only way to run Habitat is to build it from source. You should use the Docker image provided in this repository to ensure that the code compiles correctly.

  1. Download the Habitat pre-trained models.
  2. Run extract-models.sh under analyzer/ to extract and install the pre-trained models.
  3. Run setup.sh under docker/ to build the Habitat container image.
  4. Run start.sh to start a new container. By default, your home directory will be mounted inside the container under ~/home.
  5. Once inside the container, run install-dev.sh under analyzer/ to build and install the Habitat package.
  6. In your scripts, import habitat to get access to Habitat. See experiments/run_experiment.py for a complete example of how to use Habitat; a minimal sketch is also shown below.
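
For reference, here is a minimal sketch of what a prediction script could look like. The class and attribute names used here (OperationTracker, Device, get_tracked_trace, to_device, run_time_ms) are illustrative assumptions; refer to experiments/run_experiment.py for the actual API.

    import torch
    import habitat

    # Assumed workflow: run one training iteration on the local GPU while
    # Habitat tracks the operations, then predict the iteration's run time
    # on a different target GPU.
    model = torch.nn.Linear(1024, 1024).cuda()
    inputs = torch.randn(64, 1024).cuda()

    tracker = habitat.OperationTracker(device=habitat.Device.P100)  # local GPU
    with tracker.track():
        out = model(inputs)
        out.sum().backward()

    trace = tracker.get_tracked_trace()
    prediction = trace.to_device(habitat.Device.V100)  # target GPU
    print("Predicted iteration time (ms):", prediction.run_time_ms)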

License

The code in this repository is licensed under the Apache 2.0 license (see LICENSE and NOTICE), with the exception of the files mentioned below.

This software contains source code provided by NVIDIA Corporation. These files are:

  • The code under cpp/external/cupti_profilerhost_util/ (CUPTI sample code)
  • cpp/src/cuda/cuda_occupancy.h

The code mentioned above is licensed under the NVIDIA Software Development Kit End User License Agreement.

We include the implementations of several deep neural networks under experiments/ for our evaluation. These implementations are copyrighted by their original authors and carry their original licenses. Please see the corresponding README files and license files inside the subdirectories for more information.

Research Paper

Habitat began as a research project in the EcoSystem Group at the University of Toronto. The accompanying research paper will appear in the proceedings of USENIX ATC'21. If you are interested, you can read a preprint of the paper here.

If you use Habitat in your research, please consider citing our paper:

@inproceedings{habitat-yu21,
  author = {Yu, Geoffrey X. and Gao, Yubo and Golikov, Pavel and Pekhimenko,
    Gennady},
  title = {{Habitat: A Runtime-Based Computational Performance Predictor for
    Deep Neural Network Training}},
  booktitle = {{Proceedings of the 2021 USENIX Annual Technical Conference
    (USENIX ATC'21)}},
  year = {2021},
}
Comments
  • I wonder what the meaning of 'varying kernel' is.

    Hi, I am reading the Habitat research paper.

    I wonder what the meaning of 'varying kernel' is. I thought a GPU kernel is a collection of instructions that run in parallel; is that right?

    Can you give me an example of this phrase: 'some DNN operations are implemented using different GPU kernels on different GPUs'?

    Thank you for taking the time to read.

    question 
    opened by Baek-sohyeon 6
  •  error: function cuptiProfilerBeginSession(&begin_session_params) failed with error CUPTI_ERROR_UNKNOWN

    Hi @geoffxy,

    Great work here. I am quite interested in your project and have been trying to reproduce it on my side. However, I hit the error in the title. I suspect it may be caused by an incompatibility between the CUPTI and NVIDIA driver versions. Could you share your experiment setup, mainly on the host side: are you still using Ubuntu 18.04, which NVIDIA driver version are you on, did you use nvidia-docker2 or nvidia-container-runtime, and what is your Docker version?

    On my side, I am using Ubuntu 18.04 as the host, driver 470.103.01, nvidia-docker2, and Docker 20.10.12.

    Thanks, Liang

    opened by liayan 3
  • How does Habitat measure the execution time associated with an operation’s backward pass?

    Hi! Thanks for your excellent work.

    It's easy to understand how the execution time is measured in the forward pass, but how does Habitat do this for the backward pass? I think that is undoubtedly a different procedure, right?

    @geoffxy Hoping for your reply soon!

    question 
    opened by xiyiyia 2
  • Large Prediction Errors

    Hi, I am reproducing the experiments in Habitat now. This is interesting work, and it's very convenient to run Habitat and process the results using the following two scripts:

    bash habitat/experiments/gather_raw_data.sh  <target_device>
    bash habitat/experiments/process_raw_data.sh
    

    Due to limited GPU resources, I cannot access all the GPU models listed in the paper and have only tested on the V100, P100, and T4. However, the prediction error is quite large compared to that reported in the paper. You can check the results here.

    Basically, the setup I used follows habitat/docker/Dockerfile. Here are some of my experiment settings that may differ from yours:

    • CUDA driver version: 455.32.00,
    • I do not mount the user account on the host machine into the container

    So,

    1. Is there any hyperparameter I need to tune to get a better prediction error?
    2. Can you share the cross-GPU prediction error between each pair of GPUs, or just the output of habitat/experiments/process_raw_data.sh? Fig. 3 in the paper only shows the results "averaged across all other “origin” GPUs".
    3. Will the setting differences listed above affect the prediction error? Or are there any other possible reasons?

    Thanks.

    question 
    opened by joapolarbear 2
  • CMake Error at CMakeLists.txt:22 (pybind11_add_module):

    When running install-dev.sh, I hit the error below:

    CMake Error at CMakeLists.txt:22 (pybind11_add_module): Unknown CMake command "pybind11_add_module".

    -- Configuring incomplete, errors occurred!

    opened by liayan 1
  • CUPTI_ERROR_INSUFFICIENT_PRIVILEGES in container

    The default configuration on my OS, combined with the current directions in the README, may lead to a CUPTI_ERROR_INSUFFICIENT_PRIVILEGES error when using CUPTI inside the container.

    The example log is attached below:

    /home/ubuntu/home/habitat/cpp/src/cuda/cupti_tracer.cpp:120: error: function cuptiActivityRegisterCallbacks(cuptiBufferRequested, cuptiBufferCompleted) failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES.
    Traceback (most recent call last):
      File "run_experiment.py", line 246, in <module>
        main()
      File "run_experiment.py", line 238, in main
        run_dcgan_experiments(context)
      File "run_experiment.py", line 155, in run_dcgan_experiments
        context,
      File "run_experiment.py", line 85, in run_experiment_config
        threshold = compute_threshold(runnable, context)
      File "run_experiment.py", line 66, in compute_threshold
        runnable()
      File "run_experiment.py", line 150, in runnable
        iteration(*inputs)
      File "/home/ubuntu/home/habitat/experiments/dcgan/entry_point.py", line 41, in iteration
        netD.zero_grad()
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1098, in zero_grad
        p.grad.detach_()
      File "/home/ubuntu/home/habitat/analyzer/habitat/tracking/operation.py", line 62, in hook
        kwargs,
      File "/home/ubuntu/home/habitat/analyzer/habitat/profiling/operation.py", line 45, in measure_operation
        record_kernels,
      File "/home/ubuntu/home/habitat/analyzer/habitat/profiling/operation.py", line 164, in _to_run_time_measurement
        if record_kernels else []
      File "/home/ubuntu/home/habitat/analyzer/habitat/profiling/kernel.py", line 34, in measure_kernels
        self._measure_kernels_raw(runnable, fname)
      File "/home/ubuntu/home/habitat/analyzer/habitat/profiling/kernel.py", line 48, in _measure_kernels_raw
        time_kernels = hc.profile(runnable)
    RuntimeError: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
    

    My solution: add options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf and reboot.

    Ref:

    • https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters
    • https://github.com/tensorflow/tensorflow/issues/35860#issuecomment-585436324
    opened by yzs981130 1
  • Fail to build the image

    Hi, I am following the steps here to reproduce Habitat. When running setup.sh to build the image, the following error occurs:

    Step 14/19 : RUN gpg --keyserver ha.pool.sks-keyservers.net --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4
    ---> Running in d42ae3b13a05
    gpg: WARNING: unsafe permissions on homedir '/root/.gnupg'
    gpg: keybox '/root/.gnupg/pubring.kbx' created
    gpg: keyserver receive failed: No name
    The command '/bin/sh -c gpg --keyserver ha.pool.sks-keyservers.net --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4' returned a non-zero code: 2
    

    Does this mean the keyserver ha.pool.sks-keyservers.net is no longer accessible?

    I also wonder whether it is necessary to duplicate the host machine's user account inside the container. With a root account in the container, I can access everything mounted from the host machine. What problems would that cause?

    Looking forward to your reply. Thanks.

    opened by joapolarbear 1
  • Fix format specifier for size_t

    https://stackoverflow.com/questions/2524611/how-can-one-print-a-size-t-variable-portably-using-the-printf-family

    Signed-off-by: Kiruya Momochi [email protected]

    opened by KiruyaMomochi 0
  • Broken Pillow dependency for torchvision in Dockerfile causes docker build to fail

    Currently, pip3 install torchvision==0.5.0 fails due to a broken Pillow dependency, as shown in the following CI build:

    https://github.com/yzs-lab/habitat/runs/4311964953?check_suite_focus=true#step:3:915

    Corresponding logs are attached below:

    The headers or library files could not be found for zlib,
        a required dependency when compiling Pillow from source.
        
        Please see the install instructions at:
           https://pillow.readthedocs.io/en/latest/installation.html
        
        Traceback (most recent call last):
          File "/tmp/pip-build-c0iq5ua_/pillow/setup.py", line 1024, in <module>
            zip_safe=not (debug_build() or PLATFORM_MINGW),
          File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 129, in setup
            return distutils.core.setup(**attrs)
          File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
            dist.run_commands()
          File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
            self.run_command(cmd)
          File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
            cmd_obj.run()
          File "/usr/lib/python3/dist-packages/setuptools/command/install.py", line 61, in run
            return orig.install.run(self)
          File "/usr/lib/python3.6/distutils/command/install.py", line 589, in run
            self.run_command('build')
          File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
            self.distribution.run_command(command)
          File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
            cmd_obj.run()
          File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
            self.run_command(cmd_name)
          File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
            self.distribution.run_command(command)
          File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
            cmd_obj.run()
          File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 78, in run
            _build_ext.run(self)
          File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
            self.build_extensions()
          File "/tmp/pip-build-c0iq5ua_/pillow/setup.py", line 790, in build_extensions
            raise RequiredDependencyException(f)
        __main__.RequiredDependencyException: zlib
        
        During handling of the above exception, another exception occurred:
        
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-build-c0iq5ua_/pillow/setup.py", line 1037, in <module>
            raise RequiredDependencyException(msg)
        __main__.RequiredDependencyException:
        
        The headers or library files could not be found for zlib,
        a required dependency when compiling Pillow from source.
        
        Please see the install instructions at:
           https://pillow.readthedocs.io/en/latest/installation.html
        
        
        
        ----------------------------------------
    Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-c0iq5ua_/pillow/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-8eakyb7g-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-c0iq5ua_/pillow/
    The command '/bin/sh -c pip3 install   torch==1.4.0   torchvision==0.5.0   pandas==1.1.2   tqdm==4.49.0' returned a non-zero code: 1
    
    opened by yzs981130 0
Releases (v1.0.0)
  • v1.0.0 (Jun 1, 2021)

    This release is the first feature release of Habitat.

    Habitat is a tool that predicts a deep neural network's training iteration execution time on a given GPU. To learn more about how Habitat works, please see our research paper.

Owner
Geoffrey Yu
Computer Science PhD Student at MIT | Software Engineering '18 @uWaterloo