A NumPy-compatible array library accelerated by CUDA

Overview

CuPy : A NumPy-compatible array library accelerated by CUDA


Website | Docs | Install Guide | Tutorial | Examples | API Reference | Forum

CuPy is an implementation of a NumPy-compatible multi-dimensional array on CUDA. CuPy consists of the core multi-dimensional array class, cupy.ndarray, and many functions on it.
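
A minimal illustration of the NumPy-compatible interface (a sketch; any NumPy-style calls could be substituted):

import cupy as cp

x = cp.arange(6, dtype=cp.float32).reshape(2, 3)   # cupy.ndarray living on the GPU
y = cp.sum(x, axis=1)                              # same call signature as numpy.sum
print(y)               # [ 3. 12.]
print(cp.asnumpy(y))   # copy the result back to a numpy.ndarray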

Installation

Wheels (precompiled binary packages) are available for Linux (x86_64) and Windows (amd64). Choose the right package for your platform.

Platform Command
CUDA 9.0 pip install cupy-cuda90
CUDA 9.2 pip install cupy-cuda92
CUDA 10.0 pip install cupy-cuda100
CUDA 10.1 pip install cupy-cuda101
CUDA 10.2 pip install cupy-cuda102
CUDA 11.0 pip install cupy-cuda110
CUDA 11.1 pip install cupy-cuda111
CUDA 11.2 pip install cupy-cuda112
ROCm 4.0 pip install cupy-rocm-4-0 (experimental; see docs for details)

See the Installation Guide if you are using Conda/Anaconda or to build from source.
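
After installing, one quick way to confirm the wheel matches your CUDA toolkit is to print the build configuration:

import cupy
cupy.show_config()   # prints the CUDA and library versions this CuPy build was compiled against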

Run on Docker

Use the NVIDIA Container Toolkit to run the CuPy image with GPU support.

$ docker run --gpus all -it cupy/cupy


License

MIT License (see LICENSE file).

CuPy is designed based on NumPy's API and SciPy's API (see docs/LICENSE_THIRD_PARTY file).

CuPy is being maintained and developed by Preferred Networks Inc. and community contributors.

Reference

Ryosuke Okuta, Yuya Unno, Daisuke Nishino, Shohei Hido and Crissman Loomis. CuPy: A NumPy-Compatible Library for NVIDIA GPU Calculations. Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), (2017). http://learningsys.org/nips17/assets/papers/paper_16.pdf

@inproceedings{cupy_learningsys2017,
  author       = "Okuta, Ryosuke and Unno, Yuya and Nishino, Daisuke and Hido, Shohei and Loomis, Crissman",
  title        = "CuPy: A NumPy-Compatible Library for NVIDIA GPU Calculations",
  booktitle    = "Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS)",
  year         = "2017",
  url          = "http://learningsys.org/nips17/assets/papers/paper_16.pdf"
}
Comments
  • Build the `cupy.cuda.cub` module by default

    Build the `cupy.cuda.cub` module by default

    Close #3078. Close #3075. Close #3108. Close #3507.

    This PR includes all the CUB v1.8.0 headers (see notes below) so that the cupy.cuda.cub module can be built by default. This also avoids the need to document how to build it.

    Note that the CUB_PATH env variable is no longer needed.

    TODO:

    • [ ] ~~Write a unit test for cupy.cuda.cub~~ (related: #2579) UPDATE: see #2598
    • [x] Discuss with CuPy core devs whether we should set cupy.cuda.cub_enabled = False during import for backward compatibility, and whether the CUB_DISABLED env variable is still needed (a usage sketch follows this list).
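
    A minimal sketch of the backward-compatibility toggle discussed in the TODO above, assuming the module-level cupy.cuda.cub_enabled flag keeps that name and is honored by the reduction routines (illustrative only, not a final API):

    import cupy

    cupy.cuda.cub_enabled = True    # assumed flag name from the TODO above; routes supported reductions through cupy.cuda.cub
    x = cupy.arange(10 ** 6, dtype=cupy.float32)
    print(x.sum())                  # uses the CUB path while the flag is enabled

    cupy.cuda.cub_enabled = False   # fall back to the plain reduction kernels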

    Note:

    1. The CUB project is open source under the BSD license, so redistributing its source code is allowed if the license file is included.
    2. Since CUB is a header-only library, many of its files are actually needed. Below is the list of CUB headers included when compiling cupy/cuda/cupy_cub.cu (obtained via cpp -MM -I(...) cupy/cuda/cupy_cub.cu):
    cub/agent/agent_reduce.cuh
    cub/agent/agent_reduce_by_key.cuh
    cub/agent/agent_scan.cuh
    cub/agent/single_pass_scan_operators.cuh
    cub/block/block_discontinuity.cuh
    cub/block/block_exchange.cuh
    cub/block/block_load.cuh
    cub/block/block_raking_layout.cuh
    cub/block/block_reduce.cuh
    cub/block/block_scan.cuh
    cub/block/block_store.cuh
    cub/block/specializations/block_reduce_raking.cuh
    cub/block/specializations/block_reduce_raking_commutative_only.cuh
    cub/block/specializations/block_reduce_warp_reductions.cuh
    cub/block/specializations/block_scan_raking.cuh
    cub/block/specializations/block_scan_warp_scans.cuh
    cub/device/device_reduce.cuh
    cub/device/device_segmented_reduce.cuh
    cub/device/dispatch/dispatch_reduce.cuh
    cub/device/dispatch/dispatch_reduce_by_key.cuh
    cub/device/dispatch/dispatch_scan.cuh
    cub/grid/grid_even_share.cuh
    cub/grid/grid_mapping.cuh
    cub/grid/grid_queue.cuh
    cub/iterator/arg_index_input_iterator.cuh
    cub/iterator/cache_modified_input_iterator.cuh
    cub/iterator/constant_input_iterator.cuh
    cub/thread/thread_load.cuh
    cub/thread/thread_operators.cuh
    cub/thread/thread_reduce.cuh
    cub/thread/thread_scan.cuh
    cub/thread/thread_store.cuh
    cub/util_arch.cuh
    cub/util_debug.cuh
    cub/util_device.cuh
    cub/util_macro.cuh
    cub/util_namespace.cuh
    cub/util_ptx.cuh
    cub/util_type.cuh
    cub/warp/specializations/warp_reduce_shfl.cuh
    cub/warp/specializations/warp_reduce_smem.cuh
    cub/warp/specializations/warp_scan_shfl.cuh
    cub/warp/specializations/warp_scan_smem.cuh
    cub/warp/warp_reduce.cuh
    cub/warp/warp_scan.cuh
    

    Since so many of them are needed, we might as well copy the entire cub folder for future extensibility and easier maintainability.
    3. CUB v1.8.0 is quite stable (no new release since Feb 2018), so it is suitable as a dependency.

    cat:enhancement no-compat 
    opened by leofang 96
  • Support cuFFT callbacks

    Support cuFFT callbacks

    Close #4105.

    UPDATE: This is a very unusual PR due to the static linking requirement; please read on.

    Design considerations

    1. cuFFT static (libcufft_static.a) and shared (libcufft.so) libraries cannot be mixed and matched: for example, one cannot generate a plan handle using cufftCreate from the shared library and then call cufftXtSetCallback from the static library on it. This leads to the distinction between a "static" plan and a "shared" plan, based on the library each is associated with.
    2. We want to be able to do things in the Python space, so we would like to reuse the cupy.cuda.cufft module as much as possible.
    3. The load/store callbacks have to be visible at module build time; it's not possible to retrieve a pointer to a device function at runtime and "link" it against libcufft_static.a. This also means that in the distributed wheel or Conda package we cannot link against the static library: doing so would make no sense (callbacks can only be supplied at runtime) and would only inflate the file size.
    4. If we don't do things correctly, there'd be a bunch of undefined symbols leaking into the Python module, causing import cupy to fail; see the discussion in #4105 for detail.

    Approach

    For every distinct pair of load and store callbacks (either of them could be None but not both), we generate a stub containing the callback implementations, copy cupy/cuda/cufft.pyx and friends to a temporary directory, and compile at runtime a new Python module cupy_callback_<hash>.cpython-XXm-x86_64-linux-gnu.so in which all things are statically linked together. This is basically a static version of cupy.cuda.cufft, from which we can generate static plans, set the callbacks, and execute them. The generated modules are cached on disk (default: ~/.cupy/callback_cache) and can be reused for all kinds of transforms as long as they use the same pair of callbacks. To avoid collisions, the compile time options along with the callbacks are used to generate a distinct <hash> string (XX stands for the Python version).

    This approach is backward compatible in that the cupy.cuda.cufft module is still linked to cuFFT dynamically and continues to function as usual. Also, the static plans can be cached in the new cuFFT plan cache just like the shared plans. The only two drawbacks of this approach are:

    1. The generated Python module can be fat (on my system it's 159M each...)
    2. Runtime compilation is slow the first time each callback pair is used.

    See the docstring in cupy.fft.config.set_cufft_callbacks for more detail.

    Example

    This works now:

    import cupy as cp
    
    
    code = r'''
    __device__ cufftComplex CB_ConvertInputC(
        void *dataIn, 
        size_t offset, 
        void *callerInfo, 
        void *sharedPtr) 
    {
        cufftComplex x;
        x.x = 1.;
        x.y = 0.;
        return x;
    }
    __device__ cufftCallbackLoadC d_loadCallbackPtr = CB_ConvertInputC;
    '''
    
    a = cp.random.random((64, 128, 128)).astype(cp.complex64)
    
    # this fftn call uses callback
    with cp.fft.config.set_cufft_callbacks(cb_load=code):
        b = cp.fft.fftn(a, axes=(1,2))
    
    # this one does not use a callback
    c = cp.fft.fftn(cp.ones(shape=a.shape, dtype=cp.complex64), axes=(1,2))
    
    # result agrees
    assert cp.allclose(b, c)
    
    # static plans are also cached
    cp.fft.config.show_plan_cache_info()
    
    cat:feature st:test-and-merge prio:medium 
    opened by leofang 87
  • Add a cuFFT plan cache

    Add a cuFFT plan cache

    UPDATE: Close #3588.

    This PR implements a least recently used (LRU) cache for cuFFT plans. The implementation is done in Cython to minimize the Python overhead; yet, I still use cdef classes (instead of pointers to structs) to avoid managing memory myself, and cdef'ing as much as I can.

    Properties of this cache:

    • Per-thread, per-device
    • The "size" of the cache can be set by both the number of plans and the amount of (GPU) memory used by the plans (i.e., the work areas)
    • Enabled by default (with size = 16, which I picked arbitrarily)
    • Good performance:
      • Greatly reduced the CPU overhead of plan allocation, especially with non-prime lengths (#3556): https://github.com/cupy/cupy/pull/3730#issuecomment-670061799
      • Fast access time (< 1 us get/set): https://github.com/cupy/cupy/pull/3730#issuecomment-671725253
      • Bonus: a bit of GPU time reduction (as certain plan allocations would launch a few kernels): https://github.com/cupy/cupy/pull/3730#issuecomment-670061799
    • Adding and removing a multi-GPU plan is done collectively among all participating GPUs' caches to respect the memory limitation.
    • A few handles are exposed to cupy.fft.config:
      • Documented: get_plan_cache(), show_plan_cache_info() (a usage sketch follows this list)
      • Undocumented: the five APIs from scipy/scipy#12512
      • I do not expect them to be needed, though
      • Question: How can I generate a doc page for PlanCache without explicitly referencing it in the autosummary?
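
    A minimal usage sketch of the documented handles above (the cache size value is arbitrary, and set_size here reflects the "size can be set" property described above; the exact method name may differ):

    import cupy as cp

    cache = cp.fft.config.get_plan_cache()   # per-thread, per-device LRU cache
    cache.set_size(32)                       # cap the number of cached plans (default: 16)

    a = cp.random.random((4, 64)).astype(cp.complex64)
    cp.fft.fft(a, axis=-1)                   # the generated plan is now cached
    cp.fft.config.show_plan_cache_info()     # print hit/miss statistics and cached plans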

    What is NOT done in this PR (see the discussions in the replies below):

    • Lock stream on to work area
    • Manage a work area pool by the cache

    I think these are out of scope, require careful planning, and the performance gain, if any, is unknown.


    ~~Work in progress. Description to follow. All tests passed locally.~~ ~~Aim to address #3588 and follow scipy/scipy#12512.~~

    TODO:

    • [ ] Decide to what extent we would like to expose the cache to end users (the public API from scipy/scipy#12512 is a different thing)
    • [x] Per-device cache
    • [x] Merge add_multi_gpu_plan() and __setitem__()
    • [x] Add PR description
    • [x] Add tests
    • [x] Add comments to code
    • [x] Mark the public API from scipy/scipy#12512 experimental
    • [ ] ~~Add get_fft_plan outcomes to the cache?~~
    cat:enhancement to-be-backported 
    opened by leofang 78
  • Small fixes for CUB block reduction kernels

    Small fixes for CUB block reduction kernels

    1. Remove all of the type constraints: I remember setting these limitations due to some errors when running the full test suite, but I can no longer reproduce them (with the latest master). @grlee77's new norm kernels might also need support for complex numbers.
    2. Add a possible exception to the optimizer: during Optuna optimization, CUDADriverError could be raised due to running out of resources. This was first observed in https://github.com/cupy/cupy/pull/3244#issuecomment-641479088, and I thought constraining the search range would remedy it, but today I encountered it a few more times for different tasks, so apparently this is necessary. After adding this, I see that the error is handled gracefully:
    [I 2020-06-30 22:29:07,612] Finished trial#1 with value: inf with parameters: {'block_size_log': 9, 'items_per_thread': 28}. Best is trial#0 with value: 0.0029116286219972552.
    
    3. Allow compiler exceptions to propagate upward.
    4. (UPDATE) Make complex<T> (almost) obey the rule of three (to fix fp16 -> complex conversion): This is basically a follow-up of #2629 and #2741. It turns out that by ensuring the rule of three (except for the destructor, which is trivial), we get the float16 -> complex<T> conversion for free (through the C++ implicit conversion fp16 -> fp32 -> complex) without additional change. I should have done this when working on #2741...😢 Note the changes are in line with the Thrust implementation.
    5. (UPDATE) Fix tests
    cat:enhancement 
    opened by leofang 77
  • Adds nvcc as a RawKernel backend

    Adds nvcc as a RawKernel backend

    Adds nvcc as a backend for RawKernel (issue https://github.com/cupy/cupy/issues/1928). nvcc.py is largely a cut-and-paste of functionality from cupy_setup_build.py and install/*.py -- I've avoided refactoring it for now, as I'd like to invite input on whether you think this PR is useful.
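
    A minimal sketch of what this enables; the backend='nvcc' keyword is an assumption for illustration, and the exact spelling is up for review in this PR:

    import cupy as cp

    code = r'''
    extern "C" __global__ void axpy(const float a, const float* x, float* y, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    '''

    # compile with nvcc instead of the default NVRTC path (assumed keyword, per this PR)
    axpy = cp.RawKernel(code, 'axpy', backend='nvcc')

    n = 1024
    x = cp.arange(n, dtype=cp.float32)
    y = cp.ones(n, dtype=cp.float32)
    axpy((n // 256,), (256,), (cp.float32(2.0), x, y, cp.int32(n)))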

    cat:feature st:test-and-merge 
    opened by sjperkins 70
  • Implementation of ndimage filters

    Implementation of ndimage filters

    So far this PR includes improved correlate and convolve functions (faster, using a technique similar to #3179), along with implementations of correlate1d and convolve1d and tests for them. The underlying kernel creation has been generalized so that it can be used to implement all of the other filters; those will be progressively added to this PR (I have implementations of them but haven't tested them yet).

    This works to address #2099 and #3111.

    Current test status: 7120 tests pass (in test_filters.py) and 8 fail (all of the failures involve the 1d functions along axis 0 with mode=mirror on 4D images; no other axis, mode, or lower-dimensional image fails).
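
    A minimal sketch of the functions this PR covers, mirroring the scipy.ndimage signatures (array contents here are arbitrary):

    import cupy as cp
    from cupyx.scipy import ndimage

    x = cp.arange(25, dtype=cp.float32).reshape(5, 5)
    w = cp.array([1.0, 2.0, 1.0], dtype=cp.float32)

    y2d = ndimage.correlate(x, cp.outer(w, w), mode='reflect')   # n-d correlation
    y1d = ndimage.convolve1d(x, w, axis=0, mode='mirror')        # 1-d convolution along an axis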

    cat:feature st:test-and-merge 
    opened by coderforlife 64
  • Add compressed sparse `__setitem__`

    Add compressed sparse `__setitem__`

    Closes #3115 Closes #2676 Closes #2677

    This PR builds upon #3486, also porting over from SciPy the functions necessary to set values along both the major and minor axes, using integers, slices, and arrays.
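
    A minimal sketch of the indexing this PR enables (values are arbitrary; SciPy emits a sparsity-structure warning for this kind of assignment, and the port is expected to behave similarly):

    import cupy as cp
    from cupyx.scipy import sparse

    A = sparse.csr_matrix(cp.zeros((4, 4), dtype=cp.float32))

    A[0, 1] = 3.0                          # integer indexing
    A[2, 1:3] = cp.array([1.0, 2.0])       # slice along the minor axis
    A[cp.array([1, 3]), 0] = 7.0           # array indexing along the major axis

    print(A.toarray())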

    cat:feature to-be-backported blocking 
    opened by cjnolet 60
  • CUDA 11 Test: `TestFftAllocate`

    CUDA 11 Test: `TestFftAllocate`

    I built the latest master and fixed #3757 with #3775, and the only error I got from all FFT tests we have is this:

    $ pytest tests/cupy_tests/fft_tests/test_fft.py
    ========================================================================= test session starts =========================================================================
    platform linux -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
    rootdir: /home/leofang/cupy_cuda11, configfile: setup.cfg
    collected 717 items                                                                                                                                                   
    
    tests/cupy_tests/fft_tests/test_fft.py ........................................................................................................................ [ 16%]
    ............................................................................................................................................................... [ 38%]
    ............................................................................................................................................................... [ 61%]
    ............................................................................................................................................................... [ 83%]
    .....................................................................................................................F..                                        [100%]
    
    ============================================================================== FAILURES ===============================================================================
    __________________________________________________________________ TestFftAllocate.test_fft_allocate __________________________________________________________________
    
    self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>
    
        def test_fft_allocate(self):
            # Check CuFFTError is not raised when the GPU memory is enough.
            # See https://github.com/cupy/cupy/issues/1063
            # TODO(mizuno): Simplify "a" after memory compaction is implemented.
            a = []
            for i in range(10):
                a.append(cupy.empty(100000000))
            del a
            b = cupy.empty(100000007, dtype=cupy.float32)
    >       cupy.fft.fft(b)
    
    tests/cupy_tests/fft_tests/test_fft.py:336: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    cupy/fft/fft.py:567: in fft
        return _fft(a, (n,), (axis,), norm, cupy.cuda.cufft.CUFFT_FORWARD)
    cupy/fft/fft.py:182: in _fft
        a = _fft_c2c(a, direction, norm, axes, overwrite_x, plan=plan)
    cupy/fft/fft.py:152: in _fft_c2c
        a = _exec_fft(a, direction, 'C2C', norm, axis, overwrite_x, plan=plan)
    cupy/fft/fft.py:109: in _exec_fft
        plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
    cupy/cuda/cufft.pyx:277: in cupy.cuda.cufft.Plan1d.__init__
        self._single_gpu_get_plan(plan, nx, fft_type, batch)
    cupy/cuda/cufft.pyx:306: in cupy.cuda.cufft.Plan1d._single_gpu_get_plan
        check_result(result)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    >   raise CuFFTError(result)
    E   cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR
    
    cupy/cuda/cufft.pyx:147: CuFFTError
    ======================================================================= short test summary info =======================================================================
    FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR
    =================================================================== 1 failed, 716 passed in 13.16s ====================================================================
    
    issue-checked 
    opened by leofang 54
  • Add initial cupyx.spatial.distance support from pylibraft

    Add initial cupyx.spatial.distance support from pylibraft

    This is an initial PR to establish integration of pylibraft within CuPy. For now, the intention is to use RAFT for the supported distances and types, and CuPy kernels for the others, until we support them in RAFT. This gives us a path to make use of what's available now and then continue optimizing RAFT underneath, allowing CuPy to reap the benefits immediately.

    This is really just a start in the direction of backing the entire cupyx.scipy.spatial package with RAFT, as well as the cupyx.scipy.cluster and cupyx.scipy.optimize packages (and potentially cupyx.scipy.stats / cupyx.scipy.qmc in the future).
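
    A minimal sketch of the entry point this adds (metric names follow SciPy; whether a given metric is dispatched to RAFT or to a CuPy kernel is an internal detail):

    import cupy as cp
    from cupyx.scipy.spatial import distance

    a = cp.random.random((100, 16), dtype=cp.float32)
    b = cp.random.random((50, 16), dtype=cp.float32)

    d = distance.cdist(a, b, metric='euclidean')   # (100, 50) pairwise distance matrix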

    Tasks:

    • [x] Add cdist
    • [x] Add scipy.spatial.distance_matrix, minkowski_distance
    • [x] Add tests
    cat:feature prio:medium 
    opened by cjnolet 48
  • Provide full coverage for NCCL APIs in CuPy

    Provide full coverage for NCCL APIs in CuPy

    The main purpose of this PR is to add a factory function to NcclCommunicator so that it is possible to create a group of NCCL communicators for multiple devices in a single process:

    from cupy.cuda import nccl
    
    # Use GPU #0, #2, and #3
    # comms is a list of NcclCommunicator
    comms = nccl.NcclCommunicator.initAll([0, 2, 3])
    

    Since Python/Cython does not support overloading constructors, adding a factory function seems to be a necessary design decision in order to preserve backward compatibility. Please let me know if you have any better alternatives, thanks!

    In addition, this PR wraps several other NCCL APIs (mainly to serve the above need). With this PR, CuPy now covers the full NCCL API! See below for a complete and finalized list of the changes:

    === FINAL UPDATE ===

    • added initAll(), groupStart() and groupEnd() to support controlling multiple devices in a single process.
    • added size() that returns the total number of NCCL ranks.
    • added tests for the above functions
    • added documentation for NCCL APIs.
    • backward compatibility is still preserved

    For a working demo enabled by this PR, see https://github.com/cupy/cupy/pull/2325#issuecomment-515299903.
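
    A minimal sketch combining initAll() with the new group calls (assumes three visible GPUs; the device IDs and buffer contents are arbitrary):

    import cupy as cp
    from cupy.cuda import nccl

    devices = [0, 2, 3]
    comms = nccl.NcclCommunicator.initAll(devices)

    # one buffer per device
    bufs = []
    for dev in devices:
        with cp.cuda.Device(dev):
            bufs.append(cp.arange(4, dtype=cp.float32))

    # group the per-communicator collectives into a single aggregated launch
    nccl.groupStart()
    for comm, dev, buf in zip(comms, devices, bufs):
        with cp.cuda.Device(dev):
            comm.allReduce(buf.data.ptr, buf.data.ptr, buf.size,
                           nccl.NCCL_FLOAT32, nccl.NCCL_SUM,
                           cp.cuda.Stream.null.ptr)
    nccl.groupEnd()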

    cat:feature st:test-and-merge 
    opened by leofang 44
  • CUDA 11.2: Support the built-in Stream Ordered Memory Allocator

    CUDA 11.2: Support the built-in Stream Ordered Memory Allocator

    While this is working, I mark it as Work in Progress as there are some issues to be discussed with our NVIDIA friends 🙂

    May be blocked by #4443 (?)

    This PR exposes CUDA's new Stream Ordered Memory Allocator (added in CUDA 11.2) to CuPy. A new memory type, MemoryAsync, is added, which is backed by cudaMallocAsync() and cudaFreeAsync().

    To use this feature, one simply sets the allocator to malloc_async, similar to what's done for managed memory:

    import cupy as cp
    
    cp.cuda.set_allocator(cp.cuda.malloc_async)
    # from now on, memory is allocated on the current stream by the Stream Ordered Memory Allocator
    

    On older CUDA (<11.2) or unsupported devices, using this new allocator will raise an error at runtime.

    (I didn't add support for a customized mempool via cudaMemPool*()/cudaMallocFromPoolAsync() -- which could be the next PR -- as the benefit of using non-default mempools is unclear to me. Also, note that there is no API to expose any current information about the mempool, so it wouldn't be compatible with CuPy's MemoryPool API, such as used_bytes(), etc.)

    Currently observed issues

    I think nothing is wrong with my implementation; most likely these come from CUDA 😁

    1. UPDATE: This is unrelated to this PR; see https://github.com/cupy/cupy/pull/4537#issuecomment-757429743 and #4538.
    2. It is unclear from the CUDA documentation if a stream is allowed to be destroyed before all memory allocated on it is freed. It could be that the driver performs a ref count internally (so we don't have to), but we need to make sure. If it's not the case, then in MemoryAsync we will also need to hold the reference to the stream (object), not just its pointer.
    3. nvprof python my_script.py will fail if malloc_async is used in the workload:
    $ nvprof --device-buffer-size 2048 --profiling-semaphore-pool-size 128000 pytest tests/cupy_tests/fft_tests/test_fft.py -k TestFFt
    ========================================================================= test session starts =========================================================================
    platform linux -- Python 3.7.9, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
    rootdir: /home/leofang/dev/cupy_cuda112, configfile: setup.cfg
    collecting ... ==31333== NVPROF is profiling process 31333, command: /home/leofang/miniconda3/envs/cupy_cuda112_dev/bin/python /home/leofang/miniconda3/envs/cupy_cuda112_dev/bin/pytest tests/cupy_tests/fft_tests/test_fft.py -k TestFFt
    collected 717 items / 410 deselected / 307 selected                                                                                                                   
    
    tests/cupy_tests/fft_tests/test_fft.py ..................................................==31333== Error: Internal profiling error 3938:999.
    .....................................^C======== Warning: 569 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
    ======== Warning: 293 records have invalid timestamps due to insufficient semaphore pool size. You can configure the pool size using the option --profiling-semaphore-pool-size.
    ======== Profiling result:
    ...
    

    We need to confirm if this is nvprof's problem/limitation (very likely it is), as it could be annoying to our users.

    TODO

    • [x] Add tests
    • [x] Add tutorial to docs/source/reference/memory.rst?
    • [x] Fix/update docstrings
    • [x] Mark it experimental (as I did)?

    cc: @jakirkham @pentschev @maxpkatz Could you help address the three observed issues? 🙂

    cat:feature st:test-and-merge prio:medium 
    opened by leofang 42
  • Support passing int as shape to `broadcast_to`

    Support passing int as shape to `broadcast_to`

    Reported internally. broadcast_to should accept int as shape.

    https://numpy.org/doc/stable/reference/generated/numpy.broadcast_to.html

    shape : tuple or int The shape of the desired array. A single integer i is interpreted as (i,).
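
    A small illustration of the requested behavior (the int form is what this issue asks CuPy to accept; the tuple form already works):

    import numpy as np
    import cupy as cp

    print(np.broadcast_to(np.arange(3), 3).shape)     # (3,): NumPy interprets 3 as (3,)
    print(cp.broadcast_to(cp.arange(3), (3,)).shape)  # (3,): the tuple form already works in CuPy
    # cp.broadcast_to(cp.arange(3), 3)                # requested: should behave like the line above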

    opened by kmaehashi 0
  • cupy raw kernel cannot handle view of cupy ndarray

    cupy raw kernel cannot handle view of cupy ndarray

    Description

    When feeding a view of a CuPy ndarray into a kernel, for example a slice of a big ndarray, the result suggests that the kernel reads the original big ndarray, not the slice of it.

    To Reproduce

    import cupy as cp
    x = cp.arange(10, dtype=cp.complex64).reshape(2,5)
    
    show = cp.RawKernel(r'''
    #include <cuComplex.h>
    
    extern "C" __global__
    void show(const cuFloatComplex* x, const int N){
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      
      if(i == 0 ){
          printf("%f\n",cuCrealf(x[N]));
      }
    }
    ''', 'show')
    

    When calling the kernel:

    show((2,), (5,), (x,cp.int32(6)))
    cp.cuda.runtime.deviceSynchronize()
    

    It will print:

    6.000000
    

    But if a slice of x is fed:

    x_slice = x[:,:4]
    show((2,), (5,), (x_slice,cp.int32(6)))
    cp.cuda.runtime.deviceSynchronize()
    

    It also prints:

    6.000000
    

    which is not what is wanted.

    However, if a copy is fed:

    x_slice = x[:,:4].copy()
    show((2,), (5,), (x_slice,cp.int32(6)))
    cp.cuda.runtime.deviceSynchronize()
    

    It prints:

    7.000000
    

    as expected.

    Installation

    Conda-Forge (conda install ...)

    Environment

    OS                           : Linux-3.10.0-1127.18.2.el7.x86_64-x86_64-with-glibc2.17
    Python Version               : 3.9.13
    CuPy Version                 : 11.2.0
    CuPy Platform                : NVIDIA CUDA
    NumPy Version                : 1.23.4
    SciPy Version                : 1.9.3
    Cython Build Version         : 0.29.32
    Cython Runtime Version       : None
    CUDA Root                    : /users/kangl/miniconda3/envs/rapids-22.10
    nvcc PATH                    : /users/kangl/miniconda3/envs/rapids-22.10/bin/nvcc
    CUDA Build Version           : 11020
    CUDA Driver Version          : 11040
    CUDA Runtime Version         : 11070
    cuBLAS Version               : (available)
    cuFFT Version                : 10702
    cuRAND Version               : 10301
    cuSOLVER Version             : (11, 4, 0)
    cuSPARSE Version             : (available)
    NVRTC Version                : (11, 7)
    Thrust Version               : 101000
    CUB Build Version            : 101000
    Jitify Build Version         : 343be31
    cuDNN Build Version          : None
    cuDNN Version                : None
    NCCL Build Version           : 21403
    NCCL Runtime Version         : 21403
    cuTENSOR Version             : None
    cuSPARSELt Build Version     : None
    Device 0 Name                : Tesla V100-SXM2-32GB
    Device 0 Compute Capability  : 70
    Device 0 PCI Bus ID          : 0000:15:00.0
    

    Additional Information

    No response

    cat:bug 
    opened by kanglcn 5