DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

When used standalone, the DirectML API is a low-level DirectX 12 library and is suitable for high-performance, low-latency applications such as frameworks, games, and other real-time applications. Its seamless interoperability with Direct3D 12, together with its low overhead and conformance across hardware, makes DirectML ideal for accelerating machine learning when both high performance and reliable, predictable results across hardware are critical.

More information about DirectML can be found in Introduction to DirectML.

Visit the DirectX Landing Page for more resources for DirectX developers.

Getting Started with DirectML

DirectML is distributed as a system component of Windows 10, and is available as part of the Windows 10 operating system (OS) in Windows 10, version 1903 (10.0; Build 18362), and newer.

Starting with DirectML version 1.4.0, DirectML is also available as a standalone redistributable package (see Microsoft.AI.DirectML), which is useful for applications that wish to use a fixed version of DirectML, or when running on older versions of Windows 10.

Hardware requirements

DirectML requires a DirectX 12-capable device. Almost all commercially available graphics cards released in the last several years support DirectX 12. Examples of compatible hardware include:

  • AMD GCN 1st Gen (Radeon HD 7000 series) and above
  • Intel Haswell (4th-gen Core) HD Integrated Graphics and above
  • NVIDIA Kepler (GTX 600 series) and above
  • Qualcomm Adreno 600 and above

For application developers

DirectML exposes a native C++ DirectX 12 API. The header and library (DirectML.h/DirectML.lib) are available as part of the redistributable NuGet package, and are also included in the Windows 10 SDK version 10.0.18362 or newer.

For users, data scientists, and researchers

DirectML is built into several frameworks as a backend, including Windows ML, ONNX Runtime, and TensorFlow.

See the following sections for more information:

DirectML Samples

DirectML C++ sample code is available under Samples.

  • HelloDirectML: A minimal "hello world" application that executes a single DirectML operator.
  • DirectMLSuperResolution: A sample that uses DirectML to execute a basic super-resolution model to upscale video from 540p to 1080p in real time.
  • yolov4: YOLOv4 is an object detection model capable of recognizing up to 80 different classes of objects in an image. This sample contains a complete end-to-end implementation of the model using DirectML, and is able to run in real time on a user-provided video stream.

DirectML Python sample code is available under Python/samples. The samples require PyDirectML, an open source Python projection library for DirectML, which can be built and installed into a Python execution environment from Python/src. Refer to the Python/README.md file for more details.

Windows ML on DirectML

Windows ML (WinML) is a high-performance, reliable API for deploying hardware-accelerated ML inference on Windows devices. DirectML provides the GPU backend for Windows ML.

DirectML acceleration can be enabled in Windows ML using the LearningModelDevice with any one of the DirectX DeviceKinds.

For more information, see Get Started with Windows ML.

ONNX Runtime on DirectML

ONNX Runtime is a cross-platform inferencing and training accelerator compatible with many popular ML/DNN frameworks, including PyTorch, TensorFlow/Keras, scikit-learn, and more.

DirectML is available as an optional execution provider for ONNX Runtime that provides hardware acceleration when running on Windows 10.

For more information about getting started, see Using the DirectML execution provider.
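
As an illustration, here is a minimal Python sketch of inference through the DirectML execution provider; it assumes the onnxruntime-directml PyPI package is installed, and the model file name "model.onnx" and its float32 input are hypothetical placeholders:

    import numpy as np
    import onnxruntime as ort

    # Request the DirectML execution provider, with CPU as a fallback.
    session = ort.InferenceSession(
        "model.onnx",  # hypothetical model file
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )

    # Build a dummy input matching the model's first input; dynamic
    # dimensions (reported as strings or None) are replaced with 1.
    meta = session.get_inputs()[0]
    shape = [d if isinstance(d, int) else 1 for d in meta.shape]
    dummy = np.zeros(shape, dtype=np.float32)

    # Run inference; outputs come back as a list of NumPy arrays.
    outputs = session.run(None, {meta.name: dummy})
    print(outputs[0].shape)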

TensorFlow with DirectML

TensorFlow is a popular open source platform for machine learning and a leading framework for training machine learning models.

DirectML acceleration for TensorFlow 1.15 is currently available for Public Preview. TensorFlow on DirectML enables training and inference of complex machine learning models on a wide range of DirectX 12-compatible hardware.

TensorFlow on DirectML is supported on both the latest versions of Windows 10 and the Windows Subsystem for Linux, and is available for download as a PyPI package. For more information about getting started, see GPU accelerated ML training (docs.microsoft.com).
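
As a quick smoke test (a minimal sketch, assuming the tensorflow-directml PyPI package is installed), the same TensorFlow 1.15 pattern that appears in the issue reports below confirms whether ops land on a DirectML device:

    import tensorflow.compat.v1 as tf

    # log_device_placement prints which device each op runs on; with
    # tensorflow-directml installed, DirectML adapters are used automatically.
    tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))

    print(tf.add([1.0, 2.0], [3.0, 4.0]))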

PyTorch with DirectML

DirectML acceleration for PyTorch 1.8.0 is currently available for Public Preview. PyTorch with DirectML enables training and inference of complex machine learning models on a wide range of DirectX 12-compatible hardware.

PyTorch on DirectML is supported on both the latest versions of Windows 10 and the Windows Subsystem for Linux, and is available for download as a PyPI package. For more information about getting started, see GPU accelerated ML training (docs.microsoft.com).
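
As a rough sketch (assuming the pytorch-directml preview package is installed), tensors are placed on the DirectML device through the "dml" device string, a pattern several issue reports below also use; note that later torch-directml packages expose torch_directml.device() instead:

    import torch

    # The pytorch-directml preview maps the "dml" device string to the
    # default DirectML adapter.
    device = torch.device("dml")

    a = torch.randn(2000, 2000).to(device)
    b = torch.randn(2000, 2000).to(device)
    c = a + b  # element-wise add executed via DirectML

    print(c.shape)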

Feedback

We look forward to hearing from you!

External Links

Documentation

DirectML programming guide
DirectML API reference

More information

Introducing DirectML (Game Developers Conference '19)
Accelerating GPU Inferencing with DirectML and DirectX 12 (SIGGRAPH '18)
Windows AI: hardware-accelerated ML on Windows devices (Microsoft Build '20)
Gaming with Windows ML (DirectX Developer Blog)
DirectML at GDC 2019 (DirectX Developer Blog)
DirectX Linux (DirectX Developer Blog)

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Comments
  • DirectML is x2.8 slower than CUDA

    I tested training the same deepfake model on the same hardware using tensorflow-cuda and tensorflow-directml. (my project https://github.com/iperov/DeepFaceLab)

DirectML: avg iter time 626ms

CUDA: avg iter time 222ms

    DirectML is x2.8 slower :-(

    I think that's what I was talking about here https://github.com/microsoft/DirectML/issues/104

    So what is the point of using DirectML if every millisecond of training acceleration is important in today's world?

    x2.8 slower is serious performance degradation. I reached the same speed in my weekend OpenCL NN library in pure python (https://github.com/iperov/litenn)

But you guys are from Microsoft. Don't you think there is no point in further development of DirectML until it reaches the level of CUDA performance?

    opened by iperov 36
  • Could not load dynamic library 'libcuda.so.1'

    Followed the instructions here

    ~ » cat /proc/version                                                                                                                                                             1 ↵ [email protected]
    Linux version 4.4.0-20150-Microsoft ([email protected]) (gcc version 5.4.0 (GCC) ) #1000-Microsoft Thu Jun 12 17:34:00 PST 2020
    

    I'm running build 20150, but am getting this error:

    Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21)
    [GCC 7.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tensorflow.compat.v1 as tf
    >>>
    >>> tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
    >>>
    >>> print(tf.add([1.0, 2.0], [3.0, 4.0]))
    2020-06-17 16:36:05.469811: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
    2020-06-17 16:36:05.469926: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
    2020-06-17 16:36:05.470029: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (MAKERPC): /proc/driver/nvidia/version does not exist
    2020-06-17 16:36:05.470532: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
    2020-06-17 16:36:05.483133: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3400000000 Hz
    2020-06-17 16:36:05.487879: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fffe52ac420 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2020-06-17 16:36:05.488038: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    tf.Tensor([4. 6.], shape=(2,), dtype=float32)
    
    opened by jflam 23
  • [installation] Could not find a version that satisfies the requirement tensorflow-directml (from versions: none)

    Hi,

    After following the steps described in https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-wsl till pip install tensorflow-directml,

    the error appeared as

    ERROR: Could not find a version that satisfies the requirement tensorflow-directml (from versions: none)
    ERROR: No matching distribution found for tensorflow-directml

BTW, I am using Python 3.8,

and I did pip list tensorflow*, which outputted

    Package     Version
    ----------  -------------------
    certifi     2020.6.20
    pip         20.1.1
    setuptools  49.2.0.post20200714
    wheel       0.34.2

    opened by shuwang1 19
  • How to get available devices and set a specific device in Pytorch-DML?

Hi, for accessing available devices in PyTorch we'd normally do:

        print(f'available devices: {torch.cuda.device_count()}')
        print(f'current device: { torch.cuda.current_device()}')
    

However, I noticed this fails (AssertionError: Torch not compiled with CUDA enabled).
I thought the transition would be minimal, and stuff like this would work out of the box! Especially so, after noting we can't write:

        print(f'available devices: {torch.dml.device_count()}')
        print(f'current device: { torch.dml.current_device()}')
    

as it fails with the error:

    AttributeError: module 'torch.dml' has no attribute 'device_count'
    

Apart from this, trying to specify a device using the form "dml:number" fails if number > 0! That is, this fails for "dml:1":

    import torch 
    import time
    def bench(device ='cpu'):
        print(f'running on {device}:')
        a = torch.randn(size=(2000,2000)).to(device=device)
        b = torch.randn(size=(2000,2000)).to(device=device)
       
        start = time.time()
        c = a+b
        end = time.time()
        
        # print(f'available devices: {torch.dml.device_count()}')
        # print(f'current device: { torch.dml.current_device()}')
        print(f'--took {end-start:.2f} seconds')
    
    bench('cpu')
    bench('dml')
    bench('dml:0')
    bench('dml:1')    
    

it outputs:

    running on cpu:
    --took 0.00 seconds
    running on dml:
    --took 0.01 seconds
    running on dml:0:
    --took 0.00 seconds
    running on dml:1:
    

and that's it; it doesn't execute when it comes to "dml:1".

Also, trying to do:

    import torch 
    import time
    def bench(device ='cpu'):
        print(f'running on {device}:')
        a = torch.randn(size=(2000,2000)).to(device=device)
        b = torch.randn_like(a).to(device=device)
        
        start = time.time()
        c = a+b
        end = time.time()
        
        # print(f'available devices: {torch.dml.device_count()}')
        # print(f'current device: { torch.dml.current_device()}')
        print(f'--took {end-start:.2f} seconds')
    
    bench('cpu')
    bench('dml')
    bench('dml:0')
    bench('dml:1')    
    

This fails with the following error:

    running on cpu:
    --took 0.00 seconds
    running on dml:
    Traceback (most recent call last):
      File "g:\tests.py", line 1246, in <module>
        bench('dml')
      File "g:\tests.py", line 1235, in bench
        b = torch.randn_like(a).to(device=device)
    RuntimeError: Could not run 'aten::normal_' with arguments from the 'UNKNOWN_TENSOR_TYPE_ID' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom 
    build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::normal_' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
    
    CPU: registered at D:\a\_work\1\s\build\aten\src\ATen\RegisterCPU.cpp:5926 [kernel]
    BackendSelect: fallthrough registered at D:\a\_work\1\s\aten\src\ATen\core\BackendSelectFallbackKernel.cpp:3 [backend fallback]
    Named: fallthrough registered at D:\a\_work\1\s\aten\src\ATen\core\NamedRegistrations.cpp:11 [kernel]
    AutogradOther: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradCPU: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradCUDA: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradXLA: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradNestedTensor: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    UNKNOWN_TENSOR_TYPE_ID: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse1: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse2: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse3: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    Tracer: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\TraceType_4.cpp:10612 [kernel]
    Autocast: fallthrough registered at D:\a\_work\1\s\aten\src\ATen\autocast_mode.cpp:250 [backend fallback]
    Batched: registered at D:\a\_work\1\s\aten\src\ATen\BatchingRegistrations.cpp:1016 [backend fallback]
    VmapMode: registered at D:\a\_work\1\s\aten\src\ATen\VmapModeRegistrations.cpp:37 [kernel]
    
    
    pytorch-directml 
    opened by Coderx7 11
  • Conv2D-Fail: internal compiler error, abnormal program termination

I ran across DirectML a few hours ago and am currently playing around with it on a Surface Pro 6 with an Intel HD Graphics 620. To set it all up, I followed this article to the letter: https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-windows

    For testing purposes, I used a slightly modified version of my small go-to script:

    import tensorflow.compat.v1 as tf 
    
    tf.enable_eager_execution(tf.ConfigProto(log_device_placement=False)) 
    
    fashion_mnist = tf.keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    
    
    class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                   'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
    
    train_images = train_images.reshape(60000, 28, 28, 1)
    train_images = train_images / 255.0
    
    test_images = test_images.reshape(10000, 28, 28, 1)
    test_images = test_images / 255.0
    
    #model = tf.keras.Sequential([
    #    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    #    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    #    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    #])
    
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, (3,3), activation=tf.nn.relu, input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(2,2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ])
    
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    
    model.fit(train_images, train_labels, epochs=5)
    
    test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
    
    print('Test accuracy:', test_acc)
    

    The version of the model without convolutions runs absolutely fine. But as soon as I add the Conv2D layer, nothing works anymore.

    The entire output I get is:

    2021-04-23 21:23:05.241248: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library C:\Users\cyphus309\.conda\envs\directml\lib\site-packages\tensorflow_core\python/directml.b6e3bc69b89cfca5486e178bb9d51724d0c4a94a.dll
    2021-04-23 21:23:05.298554: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:249] DirectML device enumeration: found 1 compatible adapters.
    2021-04-23 21:23:05.299189: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
    2021-04-23 21:23:05.331743: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:185] DirectML: creating device on adapter 0 (Intel(R) HD Graphics 620)
    2021-04-23 21:23:05.363568: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library Kernel32.dll
    Train on 60000 samples
    Epoch 1/5
    
    internal compiler error, abnormal program termination
    
    

    Any ideas?

    bug 
    opened by kampfhamster309 11
  • Tensorflow directml crashes my python session

    Hi,

I've recently purchased a 6900 XT GPU which I would like to use with TensorFlow. I followed the installation guide on https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-windows, which worked, but the issue I have now is that whenever I try to use TensorFlow it closes my Python environment.

I've attached an image to show what I mean. I can import TensorFlow fine, and it shows me that I have version 1.15.5 available. The problem is that when I want to check if my GPU is available, I get two messages and then it crashes me out of my Python environment.

    Does anybody know how to solve this issue and what is going on?

    Thank you in advance!

    bug 
    opened by bwintertkb 9
  • C++ DirectML.dll causes crash in debug x64 mode when using NuGet package Microsoft.AI.MachineLearning 1.5.2

    Hello,

    I'm experiencing a runtime crash with the C++ DirectML API in Debug x64 mode after upgrading my NuGet package Microsoft.AI.MachineLearning from version 1.4.0 to 1.5.2. There is no error in Release x64 mode.

The reason I'm using this package is that the included DirectML.dll improves DirectML performance greatly. There seems to be an issue when creating a DirectML operator. The operator type is DML_OPERATOR_JOIN.

Can you please help me identify the issue? Also, how can I find the latest DirectML.dll file without downloading the package?

    opened by momower1 9
  • Performance will be improved by setting input strides=output strides for Clip in DirectMLX

I am investigating the performance of MobileNet V2 from TFLite models with "nhwc" layout and MobileNet V2 from ONNX models with "nchw" layout, implemented with the DirectML and DirectMLX APIs.

I find that the nhwc MobileNetV2 model has lots of Clip operators after Conv2d, and these Clips cost much time during inference. I guess that Clip does a memory copy and hasn't been optimized in the compilation stage.

I have a workaround to resolve this problem: set Clip's input strides to be the same as its output strides by changing this line to TensorDesc outputTensor = inputTensor in DirectMLX.h. The Clip will then be optimized as if fused into Conv2d, and the inference time is significantly reduced to the same as the nchw MobileNetV2.

When building the nhwc MobileNetV2 model, we need to append an Identity after each Conv2d to transpose the output tensor from the default nchw to nhwc, then transpose this output tensor from nhwc back to nchw as the next Conv2d's input tensor. In my opinion, the Identity and Reinterpret could be optimized by DML in this model, like Conv0->Identity(nchw->nhwc)->Reinterpret strides(nhwc->nchw)->Conv1, just like transpose sinking in the OpenVINO backend.

I guess that the Identity and Reinterpret sinking may be blocked when there is a Clip in between, as in Conv0->Identity(nchw->nhwc)->Clip->Reinterpret strides(nhwc->nchw)->Conv1. I verified that if I remove the Identity and run Conv0->Reinterpret strides(nchw->nhwc)->Clip(input strides = output strides)->Reinterpret strides(nhwc->nchw)->Conv1, the inference time is much lower than before.

So in conclusion, I suggest setting Clip's input strides to be the same as its output strides by changing this line to TensorDesc outputTensor = inputTensor in DirectMLX.h.

    opened by mingmingtasd 8
  • TensorFlow & DirectML & ROCm performance and roadmap

The current DirectML library for GPU is more than 2x slower than the TensorFlow CPU library. When will the DirectML team improve the performance of the library? Could you share a roadmap for DirectML? Will the DirectML team cooperate with the ROCm team (https://github.com/RadeonOpenCompute/ROCm), Intel, and NVIDIA to improve performance?

    opened by YuriyTigiev 8
  • pytorch-directml simple command error

Just trying a simple command with pytorch-directml 1.8.0a0.dev220224 and getting an error:

    >>> torch.tensor([1], dtype=torch.float32, device='dml')
    
    Traceback (most recent call last):
      File "<console>", line 1, in <module>
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\tensor.py", line 193, in __repr__
        return torch._tensor_str._str(self)
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\_tensor_str.py", line 383, in _str
        return _str_intern(self)
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\_tensor_str.py", line 358, in _str_intern
        tensor_str = _tensor_str(self, indent)
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\_tensor_str.py", line 242, in _tensor_str
        formatter = _Formatter(get_summarized_data(self) if summarize else self)
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\_tensor_str.py", line 90, in __init__
        nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
    RuntimeError: Could not run 'aten::masked_select' with arguments from the 'UNKNOWN_TENSOR_TYPE_ID' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::masked_select' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
    
    CPU: registered at D:\a\_work\1\s\pytorch-directml\build\aten\src\ATen\RegisterCPU.cpp:5926 [kernel]
    BackendSelect: fallthrough registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\core\BackendSelectFallbackKernel.cpp:3 [backend fallback]
    Named: fallthrough registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\core\NamedRegistrations.cpp:11 [kernel]
    AutogradOther: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradCPU: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradCUDA: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradXLA: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradNestedTensor: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    UNKNOWN_TENSOR_TYPE_ID: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse1: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse2: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse3: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    Tracer: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\TraceType_4.cpp:10612 [kernel]
    Autocast: fallthrough registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\autocast_mode.cpp:250 [backend fallback]
    Batched: registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\BatchingRegistrations.cpp:1016 [backend fallback]
    VmapMode: fallthrough registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\VmapModeRegistrations.cpp:33 [backend fallback]
    

CPU is fine:

    >>> torch.tensor([1], dtype=torch.float32, device='cpu')
    tensor([1.])
    
    pytorch-directml 
    opened by iperov 7
  • Is there any low power mode for DirectML

Hi, I now have a model that is quick enough (120 fps) and will run at 20 fps; what I need is to use as little GPU power as possible, but I find the GPU frequency jumps to 1150 MHz too many times. By comparison, in Tencent Meeting ("https://voovmeeting.com/download-center.html?from=1001") with human segmentation enabled, on an 8xxx laptop the GPU frequency holds below 400 MHz while GPU load is over 75%, which is strange for the frequency policy.
So I guess maybe DirectX 12 or DX11 has some low-power mode? Or some other way, e.g. adding some wait in each op (e.g. the convolution op)?

    opened by liyuming1978 7
  • pytorch-directml produces "[W dml_heap_allocator.cc:97] DML allocator out of memory!"

    I was trying to run the simple code below:

    import torch
    import torch_directml
    dml = torch_directml.device()

    print(f"dml={dml}")

    tensor1 = torch.tensor([1])
    print(tensor1)
    tensor1 = tensor1.to(dml)

When running tensor1.to(dml), I got the following error:

    [W dml_heap_allocator.cc:97] DML allocator out of memory!
    Traceback (most recent call last):
      File "/home/fnz/workspace/direct-ml/main.py", line 9, in <module>
        tensor1=tensor1.to(dml)
    RuntimeError: Unknown error -2147024882

    It seems that my pytorch-directml doesn't work at all.

Below are my packages in conda:

    (direct_ml) [email protected]:~/workspace/direct-ml$ conda list | grep torch
    torch                     1.13.1                  pypi_0    pypi
    torch-directml            0.1.13.dev221216        pypi_0    pypi

BTW, my environment is WSL2 on top of Windows 11 Pro.

The TensorFlow DirectML package seems to work well.

Any ideas?

Thanks,

    Feng

    opened by virtual-feng 1
  • torch-directml : torch.div with trunc rounding on int64 fails with RuntimeError

Hi, because 'aten::fmod.Tensor_out' is not implemented, I tried to implement it myself. I encountered a new error when using the rounding mode trunc with an int64 tensor.

    Code:

    import torch
    import torch_directml
    dml = torch_directml.device()
    
    a = torch.tensor([1,2,3]).to(dml) #
    b = 2
    a = a - torch.div(a, b, rounding_mode="trunc") * b
    
    opened by Theucalyptus 0
  • Very low validation and testing accuracy on CNN

Hello everyone. I am facing an issue; let me explain what I am trying to do. I have a traffic and road sign dataset that contains 43 classes, and I am trying to classify the images. I am using the resnet34 pre-trained model, and I have an AMD RX 6600 GPU that I use for running the model. For running the model on my AMD GPU I am using PyTorch DirectML. Until now everything has worked fine: training speed is fast enough, GPU utilization is near 100%, and training loss decreases per epoch.

But when I check the model using validation data after one training phase, validation loss increases and validation accuracy is too low, even though training is OK. When I run the same code on my friend's PC, who has an NVIDIA GPU, all is OK: validation loss decreases and it converges, and I got an accuracy of 98% when running the same code on the NVIDIA GPU. I cannot figure out what the problem is. I also tuned the hyperparameters but had no luck. One strange thing is that this problem only arises when I use a CNN-based model; I ran the NLP pre-trained model BERT on my AMD GPU and there was no issue, validation loss decreases and it converges. Can anyone help me with this issue? I am giving the code below. Thanks in advance.

    opened by AtiqurRahmanAni 0
  • Spacy seems outdated + problems running attention...

    Disclaimer: NOT a coder. Generally curious individual with just enough copy-paste and google skills. I may not know what I'm talking about.

Just playing around with the repo. The install failed for me because of the spaCy version in requirements.txt. Using Python 3.10 on Ubuntu 22.10. I changed spaCy to 3.4.4 (which I had cached, so I just did pip install spacy, to see whichever worked).

It installed, but gave further warnings like:

    ⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the full pipeline package name 'en_core_web_sm' instead.
    Collecting en-core-web-sm==3.4.1...

and

    ⚠ As of spaCy v3.0, shortcuts like 'de' are deprecated. Please use the full pipeline package name 'de_core_news_sm' instead.
    Collecting de-core-news-sm==3.4.0

    opened by Vidyut 0
  • Operator 'aten::amax.out' is not currently supported on the DML backend.

    C:\ProgramData\Anaconda3\envs\torchdml\lib\site-packages\torch\optim\adamax.py:231: UserWarning: The operator 'aten::amax.out' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:16.)
      torch.amax(norm_buf, 0, keepdim=False, out=exp_inf)

    opened by rmskmr05 0