Hi,
I am trying to pretrain the model on the CelebA-HQ dataset, and I have successfully created the corresponding grayscale depth maps (PNG) and grayscale segmentation maps (PNG) for pretraining.
However, when I try to train with "OMP_NUM_THREADS=1 torchrun --nproc_per_node=8 run_pretraining_multimae.py --config cfgs/pretrain/multimae-b_98_rgb+-depth-semseg_1600e.yaml --data_path /home/gargatik/gargatik/Datasets/copy/multimae/train",
I am facing the following issue:
Start_______________________
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [92,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [92,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "run_pretraining_multimae.py", line 585, in
main(opts)
File "run_pretraining_multimae.py", line 414, in main
train_stats = train_one_epoch(
File "run_pretraining_multimae.py", line 501, in train_one_epoch
preds, masks = model(
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/multimae.py", line 312, in forward
input_task_tokens = {
File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/multimae.py", line 313, in
domain: self.input_adaptersdomain
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/input_adapters.py", line 232, in forward
x_patch = rearrange(self.proj(x), 'b d nh nw -> b (nh nw) d')
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f1c685031ee in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0xf3c2d (0x7f1caad91c2d in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: + 0xf6f6e (0x7f1caad94f6e in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: + 0x463418 (0x7f1cba0f6418 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f1c684ea7a5 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: + 0x35f2f5 (0x7f1cb9ff22f5 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x679288 (0x7f1cba30c288 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f1cba30c655 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5ccad3]
frame #9: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5d270c]
frame #10: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5ec780]
frame #11: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5441f8]
frame #12: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x54424a]
frame #13: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x54424a]
frame #14: PyDict_SetItemString + 0x536 (0x5d1686 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #15: PyImport_Cleanup + 0x79 (0x684619 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #16: Py_FinalizeEx + 0x7f (0x67f8af in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #17: Py_RunMain + 0x32d (0x6b70fd in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #18: Py_BytesMain + 0x2d (0x6b736d in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #19: __libc_start_main + 0xf3 (0x7f1cd8fc10b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: _start + 0x2e (0x5fa5ce in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
Traceback (most recent call last):
File "run_pretraining_multimae.py", line 585, in
main(opts)
File "run_pretraining_multimae.py", line 414, in main
train_stats = train_one_epoch(
File "run_pretraining_multimae.py", line 501, in train_one_epoch
preds, masks = model(
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/multimae.py", line 312, in forward
input_task_tokens = {
File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/multimae.py", line 313, in
domain: self.input_adaptersdomain
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/inpaint_proj/MultiMAE/multimae/input_adapters.py", line 232, in forward
x_patch = rearrange(self.proj(x), 'b d nh nw -> b (nh nw) d')
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([256, 64, 56, 56], dtype=torch.half, device='cuda', requires_grad=True).to(memory_format=torch.channels_last)
net = torch.nn.Conv2d(64, 768, kernel_size=[4, 4], padding=[0, 0], stride=[4, 4], dilation=[1, 1], groups=1)
net = net.cuda().half().to(memory_format=torch.channels_last)
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [0, 0, 0]
stride = [4, 4, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0xc853ff10
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 256, 64, 56, 56,
strideA = 200704, 1, 3584, 64,
output: TensorDescriptor 0xc8540270
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 256, 768, 14, 14,
strideA = 150528, 1, 10752, 768,
weight: FilterDescriptor 0x819c34f0
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NHWC
nbDims = 4
dimA = 768, 64, 4, 4,
Pointer addresses:
input: 0x7f12aa000000
output: 0x7f12ca000000
weight: 0x7f13d9200c00
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f147c2811ee in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0xf3c2d (0x7f14beb0fc2d in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: + 0xf6f6e (0x7f14beb12f6e in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: + 0x463418 (0x7f14cde74418 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f147c2687a5 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: + 0x35f2f5 (0x7f14cdd702f5 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x679288 (0x7f14ce08a288 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f14ce08a655 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5ccad3]
frame #9: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5d270c]
frame #10: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5ec780]
frame #11: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x5441f8]
frame #12: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x54424a]
frame #13: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python() [0x54424a]
frame #14: PyDict_SetItemString + 0x536 (0x5d1686 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #15: PyImport_Cleanup + 0x79 (0x684619 in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #16: Py_FinalizeEx + 0x7f (0x67f8af in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #17: Py_RunMain + 0x32d (0x6b70fd in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #18: Py_BytesMain + 0x2d (0x6b736d in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
frame #19: __libc_start_main + 0xf3 (0x7f14ecd3f0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: _start + 0x2e (0x5fa5ce in /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182198 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182199 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182200 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182202 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182203 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182204 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182205 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 182201) of binary: /mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/python
Traceback (most recent call last):
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/bin/torchrun", line 8, in
sys.exit(main())
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/train-data-3-ssd/gargatik/virtual_env/multimae/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run_pretraining_multimae.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-07-09_16:11:15
host : Norwalk
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 182201)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 182201
END______________
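The `srcIndex < srcSelectDimSize` assertion near the top of the log makes me suspect that my grayscale segmentation PNGs may contain class indices larger than what the semseg input adapter's embedding expects. Below is a minimal sanity-check sketch I plan to run over my label maps; the `semseg` subfolder name and the `NUM_CLASSES` value are placeholders for my own setup, not values taken from the MultiMAE config:

import sys
from pathlib import Path

import numpy as np
from PIL import Image

# This kind of assertion usually means an embedding lookup received an index
# that is >= the embedding table size, so verify that every grayscale semseg
# PNG only contains class ids below the number of classes the adapter expects.
NUM_CLASSES = 133  # placeholder: whatever the semseg input adapter is configured with
semseg_dir = Path("/home/gargatik/gargatik/Datasets/copy/multimae/train/semseg")  # placeholder subfolder

max_label = -1
for png_path in sorted(semseg_dir.rglob("*.png")):
    labels = np.array(Image.open(png_path))
    file_max = int(labels.max())
    max_label = max(max_label, file_max)
    if file_max >= NUM_CLASSES:
        print(f"out-of-range label {file_max} in {png_path}", file=sys.stderr)

print(f"highest label found: {max_label} (adapter expects labels < {NUM_CLASSES})")

If this reports out-of-range labels, I assume I need to remap my segmentation classes (or adjust the number of semseg classes in the config) before pretraining.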
Thanks for the help