Recent Releases of CUDA

CUDA - v5.8.3

CUDA v5.8.3

Diff since v5.8.2

Merged pull requests: - More tests for diagm (#2791) (@kshyatt) - Add JLD2 to test env (#2792) (@christiangnrd) - cuTENSOR: Destroy plan description and preference after construction. (#2794) (@maleadt) - More tests for sparse matrix dimension checks (#2796) (@kshyatt) - Better error messages and tests for sm2 (#2797) (@kshyatt) - Reorganize interfaces tests, lower allocations (#2799) (@kshyatt) - Cleanup and less memory use for cusparse linalg tests (#2800) (@kshyatt) - Separately version the CUDA compiler (#2801) (@maleadt) - Remove shape-preserving Diagonal conversion constructors. (#2805) (@maleadt) - More accumulation and reduction benchmarks (#2808) (@christiangnrd) - Rationalize and try to fix failing ldiv tests (#2809) (@kshyatt) - Simplify specifying benchmark output file (#2814) (@christiangnrd) - Add KA unified memory support (#2819) (@christiangnrd) - Augment docs about setting runtime version (#2822) (@david-macmahon) - Fix a list numbering problem in docs (#2824) (@david-macmahon) - Move things to GPUToolbox. (#2826) (@maleadt) - Initial compatibility with CUDA 13 (#2834) (@maleadt)

Closed issues: - Array constructors for ones, zeros, rand, ... (#159) - CuSparse documentation (#135) - gemm_strided_batched throws error on windows (#132) - Documentation: An example for allocating Unified Memory arrays (#33) - norm function errors on big arrays (#598) - CuSparse factorizations (#1396) - opnorm(::CuMatrix, p) for p = (1, Inf) (#1533) - CUSPARSE: support broadcasting for CuSparseVectors (#2699) - Remove erroneous CuArray(::Diagonal) methods (#2734) - Matrix-Matrix-Multiplication fails with CuSparseMatrixBSR. (#2745) - Possible CPU memory leak in cuTENSOR plans (#2793) - Test fail for libraries/cublas/level1 (#2810) - CI is failing due to CUSPARSEVEC (#2817) - Support for CUDA 13 (#2831)

- Julia
Published by github-actions[bot] 10 months ago

CUDA - v5.8.2

CUDA v5.8.2

Diff since v5.8.1

Merged pull requests: - Fix spdiagm with specified pairs (#2784) (@ErikQQY) - Add diagm in CUBLAS (#2786) (@ErikQQY)

Closed issues: - Where to host extension(s) (#2735) - spdiagm doesn't support specified diagonal elements (#2783) - CUDA failed to create a diagonal matrix of CuArray(u) (#2785)

Published by github-actions[bot] about 1 year ago

CUDA - v5.8.1

CUDA v5.8.1

Diff since v5.8.0

Merged pull requests: - CUSPARSE: Bugfixes for sparse vector broadcast. (#2780) (@maleadt)

Published by github-actions[bot] about 1 year ago

CUDA - v5.8.0

CUDA v5.8.0

Diff since v5.7.3

Merged pull requests: - SparseMatricesCSR Dispatch (#2720) (@Abdelrahman912) - Very rough implementation of bcast for CuSparseVector (#2733) (@kshyatt) - Possible fix for #2745, change args in call to cusparseCreateBsr (#2747) (@manuelbb-upb) - Simple tests for check and explaineltype (#2748) (@kshyatt) - Test for printing OutOfGPUMemoryError (#2749) (@kshyatt) - Fix logmessage pileup (#2750) (@fps) - Test for parselimit (#2751) (@kshyatt) - unsafewrap for symbols (#2753) (@vchuravy) - Use thread adoption to handle log messages. (#2754) (@maleadt) - Add pre-commit configuration (#2755) (@vchuravy) - Broaden check for eltypes to make sure we don't allow invalid stuff (#2756) (@kshyatt) - Prefer alignedsizeof (#2757) (@vchuravy) - More array tests (#2758) (@kshyatt) - A few more tests for CUSOLVER Q mats (#2759) (@kshyatt) - More tests for CuArrayPtr (#2760) (@kshyatt) - [CUSOLVER] Update gesvdp! (#2763) (@amontoison) - Get rid of unneeded version checks (#2765) (@kshyatt) - Remove second import of alignedsizeof (#2767) (@vchuravy) - CUSPARSE SpGEMM: Support algorithms 2 and 3 (#2769) (@maleadt) - Update to CUDA 12.9. (#2772) (@maleadt) - Fix SPGEMM_ALGOS setup (#2773) (@jonas-schulze) - Support new functionality from KA 0.9.32 (#2774) (@michel2323) - cuTENSOR: Preserve storage type when multiplying (#2775) (@christiangnrd) - Update subpackages. (#2776) (@maleadt) - Remove the unnecessary reshape during mapreduce. (#2778) (@maleadt)

Closed issues: - Type conversions in broadcast fails when compiling with always_inline=true (#2722) - cuDNN loses memory to log messages in Pluto.jl context (#2743) - Xgesvdp! failure when only requesting singular values (#2761) - CUDA 5.7.3 fails to precompile on Julia 1.12.0-beta2 (#2762) - aligned_sizeof with an existing identifier (#2766) - CUSPARSE_SPGEMM_ALG2 not working (#2768) - sum! throws dispatch error beyond a threshold number of rows (#2777)

Published by github-actions[bot] about 1 year ago

CUDA - v5.7.3

CUDA v5.7.3

Diff since v5.7.2

Merged pull requests: - Merge CSC/CSR broadcast kernels (#2731) (@kshyatt) - GPUToolbox v0.2 take 2 (#2736) (@christiangnrd) - Add dispatches to access device matrix data via SparseArrays interface (#2738) (@termi-official) - More tests for CuContext (#2739) (@kshyatt) - Fill in missing KA functionality (KA.functional + sparse matrices adaption from CUDAbackend) (#2740) (@Abdelrahman912) - Small tests and changes for coverage (#2742) (@kshyatt) - More tests and better error type for cusparse generic (#2744) (@kshyatt) - Restore the descriptors in CUSPARSE (#2746) (@amontoison)

Published by github-actions[bot] about 1 year ago

CUDA - v5.7.2

CUDA v5.7.2

Diff since v5.7.1

Merged pull requests: - Support disabling implicit synchronization (#2662) (@vchuravy) - More tests and bugfixes for CUSOLVER (#2707) (@kshyatt) - Set neutral element to zero for sparse reduce (#2710) (@kshyatt) - Bugfix and tests for cusolver/base (#2712) (@kshyatt) - Small fixes and missed tests for CUTENSORNET (#2713) (@kshyatt) - Even more tests and small fixes for CUTENSORNET (#2715) (@kshyatt) - Tests for CUSTATEVEC errors (#2716) (@kshyatt) - Add compat entries for recent devices and toolkits. (#2717) (@maleadt) - Split out copyto for texture arrays and add more tests (#2719) (@kshyatt) - Add a docstring for pointer (#2721) (@maleadt) - More CUSOLVER dense tests (#2723) (@kshyatt) - Tests for some helper functions (#2724) (@kshyatt) - More tests and bugfixing for CUSPARSE (#2725) (@kshyatt) - Add more methods for all versions to unstick tests (#2726) (@kshyatt)

Closed issues: - Ability to opt out of / improved automatic synchronization between tasks for shared array usage (#2617) - maximum(abs, CuSparseMatrixCSR) returns Inf (#2705) - mapreduce(f, op, A) for sparse A is wrong if f(0) =/= 0 (#2709)

Published by github-actions[bot] about 1 year ago

CUDA - v5.7.1

CUDA v5.7.1

Diff since v5.7.0

Merged pull requests: - Tests for MIME printing and indexing (#2686) (@kshyatt) - Loosen VERSION check for sketchy test (#2688) (@kshyatt) - CompatHelper: bump compat for GPUToolbox to 0.2, (keep existing compat) (#2689) (@github-actions[bot]) - Even more sparse printing and tril/triu tests (#2692) (@kshyatt) - Even more sparse tests (#2695) (@kshyatt) - More tests and a matmatmul fix (#2697) (@kshyatt) - Sparse conversion tests (#2698) (@kshyatt) - Tests for descriptors (#2700) (@kshyatt) - More tests for some missing kron methods (#2701) (@kshyatt) - Don't duplicate const defs (#2703) (@kshyatt) - Exclude device-side sorting code from coverage (#2704) (@kshyatt) - More tests for CuRef/CuRefArray (#2706) (@kshyatt) - Update Project.toml (#2708) (@kshyatt)

Closed issues: - GC corruption on 1.10 during cusparse/reduce tests (#2027) - Launch bounds interface (#2674) - Precompilation errors: ERROR: LoadError: invalid redefinition of constant CUSPARSE.CuSparseUpperOrUnitUpperTriangular (#2690)

Published by github-actions[bot] about 1 year ago

CUDA - v5.7.0

CUDA v5.7.0

Diff since v5.6.1

Merged pull requests: - Bugfix for batched gemv (#2481) (@kose-y) - Split out level 3 gemm tests (#2610) (@kshyatt) - Switch CUBLAS to device-side pointer mode (#2616) (@kshyatt) - Elide bounds checks when kernels contains manual ones. (#2621) (@maleadt) - Support passing symbols as arguments (#2624) (@vchuravy) - Remove eager synchronization with HtoD copies. (#2625) (@maleadt) - Don't prefetch on multi-device systems (#2626) (@vchuravy) - Cooperative groups: add a boundscheck to avoid confusing inexact errors. (#2631) (@maleadt) - NFC fixes (#2632) (@maleadt) - Update to CUDA 12.8 (#2634) (@maleadt) - [CUSOLVER] Update the test of syevBatched! (#2636) (@amontoison) - Improve NSight Systems activation by inspecting the session list. (#2638) (@maleadt) - [CUSPARSE] Support CuSparseMatrixBSR in the generic mm! (#2639) (@amontoison) - [CUSOLVER] Support symmetric factorization without pivoting (#2640) (@amontoison) - Wrap the Givens rotation methods (#2642) (@kshyatt) - Remove kron methods and use those in GPUArrays (#2643) (@kshyatt) - Add a simpler CuRefValue. (#2645) (@maleadt) - Use GPUToolbox.jl (#2646) (@christiangnrd) - DtoH copies: perform a nonblocking sync before calling into libcuda. (#2648) (@maleadt) - Support Adjoint/Transpose -> COO (#2649) (@kshyatt) - Support cuTENSOR contractors for 1D views (#2650) (@kshyatt) - Re-enable mixed precision sparse mv (#2651) (@kshyatt) - Proper support for similar on CuSparseMats (#2652) (@kshyatt) - Test error throw for accumulate (#2656) (@kshyatt) - Lots more tests for CUBLAS (#2657) (@kshyatt) - MORE tests for CUBLAS and a bugfix (#2659) (@kshyatt) - Add tests for gemmEx in fast math mode (#2660) (@kshyatt) - More tests/better coverage for CUSPARSE (#2663) (@kshyatt) - Fixes and tests for CuStateVec (#2664) (@kshyatt) - Re-enable NVTX on Windows. (#2665) (@maleadt) - Protect against occupancy calculations with very large numbers. (#2666) (@maleadt) - Fixes and tests for COO indexing, exclude more kernels from coverage (#2668) (@kshyatt) - Exclude lib*jl from coverage also for CUSTATEVEC, CUTENSOR, and CUTENSORNET (#2669) (@kshyatt) - Even MORE tests and cov for CUBLAS (#2670) (@kshyatt) - Fix and test for mgpu batch measure (#2671) (@kshyatt) - Remove some invalid conversions and test more (#2673) (@kshyatt) - Exclude more device side code in CUSPARSE (#2676) (@kshyatt) - More tests, better errors, more exclusions for CUSPARSE (#2677) (@kshyatt) - Try re-enabling the convolution tests (#2678) (@kshyatt) - Fix Markdown formatting in overview.md (#2680) (@singularitti) - Even more CUSPARSE tests (#2682) (@kshyatt) - Fix inference of FFT plan creation (#2683) (@jipolanco) - Some cudadrv tests (#2684) (@kshyatt)

Closed issues: - Batched strided GEMM tests fail (#151) - CuArrays.CURAND.curand missing methods (#141) - Rationals behave badly (#118) - Matrix inversion for CuArray (#116) - Dot product of a complex CuArray with a real CuArray performance (#668) - Sporadic cudnn/convolution test failures (#725) - Support for LinearAlgebra.pinv (#883) - Update mv!, mm!, sv! and sm! with the future release of CUPARSE (#1610) - [CUSPARSE] changing size in similar returns a cpu array (#1667) - Mix precision sparse mul is not dispatched correctly (#1760) - Make CuRef(Value) behave more like Ref (#1803) - [cuTENSOR] Issue when contracting views of CuArrays with cuTENSOR (#2407) - versioninfo broken on Jetson Orin due to NVML lookup failure (#2542) - CUBLAS: Improve concurrency using device pointer mode (#2571) - NVML issues on Jetson Nano Orin (#2580) - Passing Symbol as a an argument fails (#2590) - Remove kron functionality (#2602) - Disable or make automatic prefecthing of unified memory optional (#2618) - Circular dependency in CUDA with Julia 1.10 (#2622) - Regression with nsys profile and CUDA.@profile (#2629) - PrecompileTools.jl with CUDA.jl causes kernels to fail to run on 1.11 (#2637) - Support Adjoint Sparse Matrices for CuSparseMatrixCOO (#2647) - Implicit stream sync in tasks serialise kernel execution (#2654) - Broadcasting on arrays larger than typemax(Int32) yields truncation error (#2658) - Problem with function in CUDA (#2667) - CUDA.limit errors with invalid argument (code 1, ERROR_INVALID_VALUE) (#2672) - CUDA.jl does not support tuples of UInt128 (#2675) - Can not permutedims! CuArray with length larger that typemax(Int32) (#2679) - Support for older GPUs (#2685)

Published by github-actions[bot] about 1 year ago

CUDA - v5.6.1

CUDA v5.6.1

Diff since v5.6.0

Merged pull requests: - Support GPUArrays allocations cache (#2593) (@pxl-th) - Fix resize! when pool=none is in use (#2613) (@luraess) - Update to new alloc cache interface. (#2614) (@maleadt) - Work around NVML issue on Jetson Orin. (#2620) (@maleadt)

Closed issues: - Add strides, implement CUDA Array Interface (#1298) - Restore broken CUBLAS test (#2584) - Issues with multiple GPUs on a single node (#2615)

Published by github-actions[bot] over 1 year ago

CUDA - v5.6.0

CUDA v5.6.0

Diff since v5.5.2

CUDA.jl v5.6 is a relatively minor release, with the most important change happening behind the scenes: GPUArrays.jl v11 has switched to KernelAbstractions.jl (#2524).

Features

  • Update to CUDA 12.6.2 (#2512)
  • CUSOLVER: support for Xgeev! (#2513), XsyevBatched (#2577), gesv! and gels! (#2406)
  • CUBLAS: added multiplication of transpose / adjoint matrices by diagonal matrices (#2518, #2538)
  • Improve handle cache performance in the presence of many short-lived tasks (#2583)
  • CUFFT: Pre-allocate the buffer required for complex-to-real FFTs only once (#2578)
  • Improved batched pointer conversion for very large batches (#2608)

Bug fixes

  • Fix findall with an empty CuArray (#2554)
  • CUBLAS: Fix use of level 1 methods with strided arrays (#2528)
  • CUSOLVER: Fix Xgesvdr! (#2556)
  • Preserve the array buffer type with more linear algebra operations (#2534)
  • Work around LinearAlgebra.jl breakage in Julia 1.11.2 concerning generic triangular (l/r)mul! (#2585)
  • Fix ambiguity of LinearAlgebra.dot (#2569)
  • Native RNG: Fixes when working with very large arrays (#2561)
  • Avoid a deadlock due to union splitting in the mapreduce kernel (#2595)
  • Fix pinning of resized CPU memory by automatically re-pinning (#2599)

Merged pull requests: - [CUSOLVER] Interface gesv! and gels! (#2406) (@amontoison) - Update wrappers for CUDA v12.6.2 (#2512) (@amontoison) - [CUSOLVER] Interface Xgeev! (#2513) (@amontoison) - Added multiplication of transpose / adjoint matrices by diagonal matrices (#2518) (@amontoison) - CompatHelper: bump compat for GPUCompiler to 1, (keep existing compat) (#2521) (@github-actions[bot]) - Adapt to GPUArrays.jl transition to KernelAbstractions.jl. (#2524) (@maleadt) - Switch CI to 1.11. (#2525) (@maleadt) - CUTENSOR: Reduce amount of broadcasts compiled during tests. (#2527) (@maleadt) - CUBLAS: Don't use BLAS1 wrappers for strided arrays, only vectors. (#2528) (@maleadt) - Clarify the synchronize(ctx)/device_synchronize() docstrings (#2532) (@JamesWrigley) - Issue #2533: Preserving the buffer type in linear algebra (#2534) (@kmp5VT) - Clarify description of how LocalPreferences.toml is generated in the docs (#2535) (@glwagner) - Adapt to JuliaGPU/GPUArrays.jl#567. (#2537) (@maleadt) - Removed allocations for transpose/adjoint - diagonal multiplications (#2538) (@RedRussianBear) - Consistent use of Nsight Compute (#2541) (@huiyuxie) - Fix formatting in profiling docs page (#2543) (@efaulhaber) - Fix typo in EnzymeCoreExt.jl (#2550) (@wsmoses) - Enhance warning under a profiler (#2552) (@huiyuxie) - Fix findall with an empty CuArray of Bool (#2554) (@amontoison) - [CUSOLVER] Fix Xgesvdr! (#2556) (@amontoison) - Test restore Enzyme.jl (#2557) (@wsmoses) - Native RNG fixes for very large arrays (#2561) (@maleadt) - [Enzyme] Mark launch_configuration as inactive (#2563) (@wsmoses) - Update EnzymeCoreExt.jl (#2565) (@simenhu) - Fix ambiguity of LinearAlgebra.dot (#2569) (@amontoison) - [CUSOLVER] Add more tests for the dense SVD (#2574) (@amontoison) - [CUSOLVER] Interface XsyevBatched (#2577) (@amontoison) - [CUFFT] Preallocate a buffer for complex-to-real FFT (#2578) (@amontoison) - Run the GC when failing to find a handle, but lots are active. (#2583) (@maleadt) - Work around LinearAlgebra.jl breakage in 1.11.2. (#2585) (@maleadt) - mapreduce: avoid deadlock by forcing the accumulator type. (#2596) (@maleadt) - Switch to GitHub Actions-based benchmarks. (#2597) (@maleadt) - Re-pin variable sized memory (#2599) (@jipolanco) - Enzyme: add make_zero of cuarrays (#2600) (@wsmoses) - Update cache.jl (#2604) (@jarbus) - Enzyme: mark device_sync as non-differentiable only downstream (@wsmoses) - Move strided batch pointer conversion to GPU (#2608) (@THargreaves) - Split linalg tests into multiple files (#2609) (@kshyatt)

Closed issues: - Inference failure with sort(::CuMatrix) after loading MLDatasets (#2258) - Kron Support for CuSparseMatrixCSC (#2370) - Broadcasting a function returning an anonymous function with a constructor over CUDA arrays fails to compile, "not isbits" (#2514) - CuArray view has different variable type outside x inside the cuda kernel (#2516) - Can't build cuDNN on centos7.8 (#2517) - Precompile errors (#2519) - Precompile errors (#2520) - Error returned from CUDA function in CUDA-aware MPI multi-GPU test (#2522) - Broadcasting over random static array errors on Julia 1.11 (#2523) - gemm_strided_batched only using strided CUDA kernel when first matrix is transposed (#2529) - CUDA runtime libraries are loaded from a system path due to LD_LIBRARY_PATH being set (#2530) - [Bug] UnifiedMemory buffer changes during LinearAlgebra operations (#2533) - Improve system library warning when running under profiler (#2540) - Local CUDA settings not propagated to Pkg.test (#2545) - Out of Memory when working with Distributed for Small Matricies (#2548) - findall is not working with an empty vector of bool (#2553) - CUDA code does not return when running under VSC Debugging mode (#2558) - dot is quite slow in multinest Arrays (#2559) - UndefVarError: backend not defined in GPUArrays (#2564) - view() returns CuArray instead of view for 1-D CuArrays (#2566) - dot ambiguity (#2568) - InvalidIRError thrown only if critical function is not previously compiled (#2573) - circular dependency during precompilation (#2579) - Sparse MatVec Is Nondeterministic? (#2582) - CUDA triggers long Circular dependency list (#2586) - Release v5.5.3 for GPUArray v11? (#2587) - 'dot' gives different answers when viewing rather than slicing multidimensional arrays (#2589) - Scalar indexing when performing kron on two CuVectors (#2591) - Faster strided-batched to batched wrapper (#2592) - Error when copying data to pinned and resized CPU array (#2594) - mapreducedim! size-dependent fail when narrowing float element types (#2595) - Missing Enzyme.make_zero in Enzyme extension leads to incorrect behaviour (#2598) - 'ArgumentError: array must be non-empty' when attempting to pop idle handles from HandleCache (#2603) - Do a release as current one doesn't support GPUArrays v11 (#2606)

Published by github-actions[bot] over 1 year ago

CUDA - v5.5.2

CUDA v5.5.2

Diff since v5.5.1

Merged pull requests: - Fix type of AbstractFFTs.Plan for real-complex FFTs (#2504) (@jipolanco) - Profiler: Demangle kernel names. (#2505) (@maleadt) - Bump CUDNN. (#2507) (@maleadt) - Restore Enzyme checks (#2508) (@wsmoses)

Published by github-actions[bot] over 1 year ago

CUDA - v5.5.1

What's Changed

  • Update wrappers for CUDA v12.6.1 by @amontoison in https://github.com/JuliaGPU/CUDA.jl/pull/2499
  • Enzyme: adapt to pending version breaking update by @wsmoses in https://github.com/JuliaGPU/CUDA.jl/pull/2490

Full Changelog: https://github.com/JuliaGPU/CUDA.jl/compare/v5.5.0...v5.5.1

Published by maleadt over 1 year ago

CUDA - v5.5.0

CUDA v5.5.0

Blog post

Diff since v5.4.3

Merged pull requests: - Add support for arbitrary group sizes in gemm_grouped_batched! (#2334) (@lpawela) - Add kernel compilation requirements to docs (#2416) (@termi-official) - Enzyme: reverse mode kernels (#2422) (@wsmoses) - CUFFT: Support Float16 (#2430) (@eschnett) - Updated compute-sanitizer documentation (#2440) (@alexp616) - Add troubleshooting section for NSight Compute (#2442) (@efaulhaber) - Correct typo in documentation (#2445) (@eschnett) - Bump minimal Julia requirement to v1.10. (#2447) (@maleadt) - fix compute-sanitizer typo (#2448) (@alexp616) - Address a corner case when establishing p2p access (#2457) (@findmyway) - Implementation of spdiagm for CUSPARSE (#2458) (@walexaindre) - Update to CUDA 12.6. (#2461) (@maleadt) - CompatHelper: bump compat for GPUCompiler to 0.27, (keep existing compat) (#2462) (@github-actions[bot]) - Bump CUDA driver JLL. (#2463) (@maleadt) - CUSOLVER (dense): cache workspace in fat handle (#2465) (@bjarthur) - Revert "Run full GC when under very high memory pressure." (#2469) (@maleadt) - Fix a method deprecation. (#2470) (@maleadt) - Add Enzyme sum derivatives (#2471) (@wsmoses) - Re-use pre-converted kernel arguments when launching kernels. (#2472) (@maleadt) - Bump LLVM compat (#2473) (@maleadt) - Bump subpackage compat. (#2475) (@maleadt) - Enzyme: Reversemode cudaconvert (#2476) (@wsmoses) - Ignore Enzyme.jl CI failures (#2479) (@maleadt) - Re-enable enzyme testing (#2480) (@wsmoses) - Add missing GC.@preserves. (#2487) (@maleadt) - [CUSPARSE] Implement a sparse GEMV for CuSparseMatrixCSC * CuSparseVector (#2488) (@amontoison) - [CUSPARSE] Add conversions between CuSparseVector and CuSparseMatrices (#2489) (@amontoison) - Update to LLVM 9.1. (#2491) (@maleadt) - Use at-consistent_overlay for 1.11 compatibility. (#2492) (@maleadt) - Rework NNlib CI. (#2493) (@maleadt) - CUSPARSE: Fix sparse constructor with duplicate elements. (#2495) (@maleadt)

Closed issues: - LinearAlgebra.norm(x) falls back to generic implementation for x::Transpose and x::Adjoint (#1782) - dlclose'ing the compatibility driver can fail (#1848) - Creating a sparse diagonal matrix of CuArray(u) (#1857) - Support for Julia 1.11 (#2241) - CUDA 12.4 Update 1: CUPTI does not trace kernels anymore (#2328) - Adding CUDA to a PackageCompiler sysimage causes segfault (#2428) - Error using CUDA on Julia 1.10: Number of threads per block exceeds kernel limit (#2438) - Error when I load my model (#2439) - Driver JLL improvements (#2446) - Deadlock when callling CUDA.jl in an adopted thread while blocking the main thread (#2449) - CUDA.Mem.unregister fails with CUDA.jl 5.4 (not with 5.3) (#2452) - Segmentation Fault on Loading CUDA (#2453) - Invalid instruction error when using CUDA (#2454) - Missing adapt for sparse and CUDABackend (#2459) - CUDA precompile cannot find/load "cupti64_2024.2.1.dll" during precompilation (juliaup 1.10.4, Windows 11) (#2466) - Request: Option to disable the "full GC when under very high memory pressure". (#2467) - copyto! ambiguous (#2477) - NeuralODE training failed on GPU with Enzyme (#2478) - issue with atomic - when running standard test, @atomic modify expression missing field access (#2483) - Support for creating a CuSparseMatrixCSC from a CuSparseVector (#2484) - Issue with compiling CUDA and cuTENSOR using local libraries (#2486) - Memory Access error in sparse array constructor (#2494) - Forwards-compatible driver breaks CURAND (#2496) - CUDA 12.6 Update 1 (#2497)

Published by github-actions[bot] over 1 year ago

CUDA - v5.4.3

CUDA v5.4.3

Diff since v5.4.2

Merged pull requests: - add cublasgetrsBatched (#2385) (@bjarthur) - add two quirks for rationals (#2403) (@lanceXwq) - Bump cuDNN (#2404) (@maleadt) - Add convert method for ScaledPlan (#2409) (@david-macmahon) - Conditionalize a quirk. (#2411) (@maleadt) - Relax signature of generic matvecmul! (#2414) (@dkarrasch) - Fix kron launch configuration. (#2418) (@maleadt) - Run full GC when under very high memory pressure. (#2421) (@maleadt) - Enzyme: Fix cuarray return type (#2425) (@wsmoses) - CompatHelper: bump compat for LLVM to 8, (keep existing compat) (#2426) (@github-actions[bot]) - pre-allocated pivot and info buffers for getrf_batched (#2431) (@bjarthur) - Profiler tweaks. (#2432) (@maleadt) - Update the Julia wrappers for CUDA v12.5.1 (#2436) (@amontoison) - Correct workspace handling (#2437) (@maleadt)

Closed issues: - Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053) - Broadcasted multiplication with a rational doesn't work (#1926) - Incorrect grid size in kron (#2410) - GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412) - Failure of Eigenvalue Decomposition for Large Matrices. (#2413) - CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415) - Recurrence of integer overflow bug (#1880) for a large matrix (#2427) - CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429) - CUDARuntimeDiscovery Did not find cupti on Arm system with nvhpc (#2433) - CUDA.jl won't install/run on Jetson Orin NX (#2435)

Published by github-actions[bot] almost 2 years ago

CUDA - v5.4.2

CUDA v5.4.2

Diff since v5.4.1

Merged pull requests: - Fix and test the legacy memory pool. (#2402) (@maleadt)

Published by github-actions[bot] about 2 years ago

CUDA - v5.4.1

CUDA v5.4.1

Diff since v5.4.0

Merged pull requests: - Fixup Enzyme: Mark CuArray as noalias (#2401) (@wsmoses)

Published by github-actions[bot] about 2 years ago

CUDA - v5.4.0

CUDA v5.4.0

Diff since v5.3.5

Merged pull requests: - Support CUDA 12.5 (#2392) (@maleadt) - Mark cuarray as noalias (#2395) (@wsmoses) - Update Julia wrappers for CUDA v12.5 (#2396) (@amontoison) - Enable correct pool access for cublasXt. (#2398) (@maleadt) - More fine-grained CUPTI version checks. (#2399) (@maleadt)

Closed issues: - CUTENSOR breaks after device_reset! (#2319) - cuBLASXt's `xtgemm!` incompatible with stream-ordered allocated memory (#2320) - Add helper function to recompile CUDA stack (#2364)

Published by github-actions[bot] about 2 years ago

CUDA - v5.3.5

CUDA v5.3.5

Diff since v5.3.4

Merged pull requests: - Avoid constructing MulAddMuls on Julia v1.12+ (#2277) (@dkarrasch) - CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot]) - Enzyme: allocation functions (#2386) (@wsmoses) - Tweaks to prevent context construction on some operations (#2387) (@maleadt) - Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt) - CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt) - Backport: Enzyme allocation fns (#2393) (@wsmoses)

Closed issues: - Indexing a view uses scalar indexing (#1472) - EnzymeCore is an unconditional dependency. (#2380) - cuBLASLt wrappers ccall into cuBLAS (#2388) - generic_trimatmul! error (#2389)

Published by github-actions[bot] about 2 years ago

CUDA - v5.3.4

CUDA v5.3.4

Diff since v5.3.3

Merged pull requests: - Add Enzyme Forward mode custom rule (#1869) (@wsmoses) - Handle cache improvements (#2352) (@maleadt) - Fix cuTensorNet compat (#2354) (@maleadt) - Optimize array allocation. (#2355) (@maleadt) - Change type restrictions in cuTENSOR operations (#2356) (@lkdvos) - Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot]) - Suggest use of 32 bit types over 64 instead of just Float32 over Float64 skip ci (@Zentrik) - Make generic_trimatmul more specific (#2359) (@tgymnich) - Return the currect memory type when wrapping system memory. (#2363) (@maleadt) - Mark cublas version/handle as non-differentiable (#2368) (@wsmoses) - Enzyme: Forward mode sync (#2369) (@wsmoses) - Enzyme: support fill (#2371) (@wsmoses) - unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt) - Remove external_gvars. (#2373) (@maleadt) - Tegra support with artifacts (#2374) (@maleadt) - Backport Enzyme extension (#2375) (@wsmoses) - Add note about --check-bounds=yes (#2378) (@Zinoex) - Test Enzyme in a separate CI job. (#2379) (@maleadt) - Fix tests for Tegra. (#2381) (@maleadt) - Update Project.toml remove EnzymeCore unconditional dep (@wsmoses)

Closed issues: - Native Softmax (#175) - CUSOLVER: support eigendecomposition (#173) - backslash with gpu matrices crashes julia (#161) - at-benchmark captures GPU arrays (#156) - Support kernels returning Union{} (#62) - mul! falls back to generic implementation (#148) - \ on qr factorization objects gives a method error (#138) - Compiler failure if dependent module only contains a japi1 function (#49) - copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126) - Calling Flux.gpu on a view dumps core (#125) - Creating CuArray{Tracker.TrackedReal{Float64},1} a few times causes segfaults (#121) - Guard against exceeding maximum kernel parameter size (#32) - Detect common API misuse in error handlers (#31) - rand and friends default to Float64 (#108) - \ does not work for least squares (#104) - ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94) - CuIterator assumes batches to consist of multiple arrays (#86) - Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85) - Document (un)supported language features for kernel programming (#13) - Missing dispatch for indexing of reshaped arrays (#556) - Track array ownership to avoid illegal memory accesses (#763) - NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793) - Support for sm_80 cp.async: asynchronous on-device copies (#850) - Profiling Julia with Nsight Systems on Windows results in blank window (#862) - sort! and partialsort! are considerably slower than CPU versions (#937) - mul! does not dispatch on Adjoint (#1363) - Cross-device copy of wrapped arrays fails (#1377) - Memory allocation becomes very slow when reserved bytes is large (#1540) - Cannot reclaim GPU Memory; CUDA.reclaim() (#1562) - Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572) - device_reset! does not seem to work anymore (#1579) - device-side rand() are not random between successive kernel launches (#1633) - Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811) - `cusparseSetStream_v2` not defined (#1820) - Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821) - KernelAbstractions.jl-related issues (#1838) - lock failing in multithreaded plan_fft() (#1921) - CUSolver finalizer tries to take ReentrantLock (#1923) - Testsuite could be more careful about parallel testing (#2192) - Opportunistic GC collection (#2303) - Unable to use local CUDA runtime toolkit (#2367) - Enzyme prevents testing on 1.11 (#2376)

Published by github-actions[bot] about 2 years ago

CUDA - v5.3.3

CUDA v5.3.3

Diff since v5.3.2

Merged pull requests: - Rework context handling (#2346) (@maleadt) - fix kernel launch logic (#2353) (@xaellison)

Closed issues: - Excessive allocations when running on multiple threads (#1429) - Fix and test multigpu support (#2218) - Bitonic sort exceeds launch resources (#2331)

Published by github-actions[bot] about 2 years ago

CUDA - v5.3.2

CUDA v5.3.2

Diff since v5.3.1

Merged pull requests: - Add EnzymeCore extension for parent_job (#2281) (@vchuravy) - Consider running GC when allocating and synchronizing (#2304) (@maleadt) - Refactor memory wrappers (#2335) (@maleadt) - Auto-detect external profilers. (#2339) (@maleadt) - Fix performance of indexing unified memory. (#2340) (@maleadt) - Improve exception output (#2342) (@maleadt) - Test multigpu on CI (#2348) (@maleadt) - cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt) - cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)

Closed issues: - CuArrays don't seem to display correctly in VS code (#875) - Task scheduling can result in delays when synchronizing (#1525) - Docs: add example on task-based parallelism with explicit synchronization (#1566) - Exception output from many threads is not helpful (#1780) - Autodetect external profiler (#2176) - LazyInitialized is not GC-safe (#2216) - Track CuArray stream usage (#2236) - Improve cross-device usage (#2323) - CUBLASLt wrapper for cublasLtMatmulDescSetAttribute can have device buffers as input (#2337) - Improve error message when assigning real valued arrray with complex numbers (#2341) - @device_code_sass broken (#2343) - Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345) - @gcsafe_ccall breaks inlining of ccall wrappers (#2347)

Published by github-actions[bot] about 2 years ago

CUDA - v5.3.1

CUDA v5.3.1

Diff since v5.3.0

Merged pull requests: - [CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison) - Regenerate headers (#2324) (@maleadt) - Add some installation tips to docs/README.md (#2326) (@jlchan) - fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3) - Diagnose kernel limits on launch failure. (#2329) (@maleadt) - Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)

Closed issues: - Missing CUBLASLt wrappers (#2322) - error when switching device (#2323) - v5.3.0: regression in Zygote performance (#2333)

Published by github-actions[bot] about 2 years ago

CUDA - v5.3.0

CUDA v5.3.0

Diff since v5.2.0

Merged pull requests: - CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj) - Slightly rework error handling (#2245) (@maleadt) - cuTENSOR improvements (#2246) (@maleadt) - Make @device_code_sass work with non-Julia kernels. (#2247) (@maleadt) - Improve Tegra detection. (#2251) (@maleadt) - Added few SparseArrays functions (#2254) (@albertomercurio) - Reduce locking in the handle cache (#2256) (@maleadt) - Mark all CUDA ccalls as GC safe (#2262) (@vchuravy) - cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos) - cuTENSOR: refactor obtaining computetype as part of plan (#2264) (@lkdvos) - Re-generate headers. (#2265) (@maleadt) - Update to CUDNN 9. (#2267) (@maleadt) - [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison) - CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot]) - Minor improvements to nonblocking synchronization. (#2272) (@maleadt) - Add extension package for StaticArrays (#2273) (@trahflow) - Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4) - Cached workspace prototype for custatevec (#2279) (@kshyatt) - Update the Julia wrappers for v12.4 (#2282) (@amontoison) - Add support for CUDA 12.4. (#2286) (@maleadt) - Test suite changes (#2288) (@maleadt) - Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt) - Towards supporting Julia 1.11 (#2291) (@maleadt) - Fix typo in performance tips (#2294) (@Zentrik) - Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt) - Set default buffer size in CUSPARSE mm! functions (#2298) (@lpawela) - Avoid OOMs during OOM handling. (#2299) (@maleadt) - [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison) - [CUSOLVER] Interface larft! (#2301) (@amontoison) - Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt) - sortperm with dims (#2308) (@xaellison) - [CUBLAS] Interface gemmgroupedbatched (#2310) (@amontoison) - [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison) - Avoid capturing AbstractArrays in BoundsError (#2314) (@lcw) - Clarify debug level hint. (#2316) (@maleadt)

Closed issues: - Failed to compile PTX code when using NSight on Win11 (#1601) - sortperm fails with dims keyword (#2061) - NVTX-related segfault on Windows under compute-sanitizer (#2204) - Inverse Complex-to-Real FFT allocates GPU memory (#2249) - cuDNN not available for your platform (#2252) - Cannot reset CuArray to zero (#2257) - Cannot take gradient of sort on 2D CuArray (#2259) - Multi-threaded code hanging forever with Julia 1.10 (#2261) - CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268) - Adjoint not supported on Diagonal arrays (#2275) - Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276) - Release v5.3? (#2283) - Wrap CUDSS? (#2287) - Bug concerning broadcast between device array and unified array (#2289) - StackOverflowError trying to throw OutOfGPUMemoryError, subsequent errors (#2292) - BUG: sortperm! seems to perform much slower than it should (#2293) - Multiplying CuSparseMatrixCSC by CuMatrix results in Out of GPU memory (#2296) - BFloat16 support broken on Julia 1.11 (#2306) - does not emit line info for debugging/profiling (#2312) - Kernel using StaticArray compiles in julia v1.9.4 but not in v1.10.2 (#2313) - Using copyto! with SharedArray trigger scalar indexing disallowed error (#2317)

Published by github-actions[bot] about 2 years ago

CUDA - v4.4.2

CUDA v4.4.2

Diff since v4.4.1

Merged pull requests: - Added support for more transform directions (#1903) (@RainerHeintzmann) - CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj) - Add some performance tips to the documentation (#1999) (@Zentrik) - Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt) - Adapt to GPUCompiler#master. (#2062) (@maleadt) - Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj) - Use released GPUCompiler. (#2064) (@maleadt) - Fixes for Windows. (#2065) (@maleadt) - Switch to GPUArrays buffer management. (#2068) (@maleadt) - Update CUDA 12 to Update 2. (#2071) (@maleadt) - [CUSOLVER] Add generic routines (#2074) (@amontoison) - Update manifest (#2076) (@github-actions[bot]) - Test improvements (#2079) (@maleadt) - Rework and extend the cooperative groups API. (#2081) (@maleadt) - Update manifest (#2082) (@github-actions[bot]) - [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison) - Fix some typos in performance tips (#2086) (@Zentrik) - Improve PTX ISA selection (#2088) (@maleadt) - Update manifest (#2090) (@github-actions[bot]) - support ChainRulesCore inplaceability (#2091) (@piever) - Add a method inv(CuMatrix) (#2095) (@amontoison) - Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison) - Add CUDARuntimeDiscovery dependency to sublibraries. (#2097) (@maleadt) - Handle and test zero-size inputs to RNGs. (#2098) (@maleadt) - Add a with_workspaces function (#2099) (@amontoison) - [CUSOLVER] Add a method for getrf! (#2100) (@amontoison) - [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison) - Call exit when handling exceptions. (#2103) (@maleadt) - Bump packages. (#2104) (@maleadt) - Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot]) - Update manifest (#2107) (@github-actions[bot]) - Make Ref mutable on the GPU. (#2109) (@maleadt) - CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot]) - Small profiler improvements (#2113) (@maleadt) - Update manifest (#2114) (@github-actions[bot]) - [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison) - [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison) - Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert) - [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison) - Update manifest (#2123) (@github-actions[bot]) - Profiler: Show used local memory. (#2124) (@maleadt) - Support for CUDA 12.3 (#2125) (@maleadt) - [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison) - [CUSOLVER] Add Xgesvdp (#2128) (@amontoison) - Profiler: don't crop when rendering to a file. (#2131) (@maleadt) - Regenerate headers for CUDA 12.3. (#2132) (@maleadt) - [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison) - CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot]) - CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot]) - Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt) - Better support for unified and host memory (#2138) (@maleadt) - Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt) - Avoid allocations during derived array construction. (#2142) (@maleadt) - More performance tweaks for memory copying (#2143) (@maleadt) - Don't use libdevice's fmin/fmax. (#2144) (@maleadt) - Update documentation (#2146) (@maleadt) - Fixes for sm_61 (#2151) (@maleadt) - Update sparse factorizations (#2152) (@amontoison) - Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt) - Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt) - Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt) - Sanitizer improvements. (#2157) (@maleadt) - [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison) - Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt) - [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison) - [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison) - Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt) - expand docs on launch parameters (#2167) (@simonbyrne) - Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt) - kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne) - [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison) - Added kronecker product support for dense matrices (#2177) (@albertomercurio) - Update to CUTENSOR 2.0 (#2178) (@maleadt) - Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik) - provide more information on kernel compilation error (#2180) (@simonbyrne) - [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison) - [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison) - [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison) - Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison) - [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison) - Support more kwarg syntax with kernel launches (#2189) (@maleadt) - Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt) - NVML: Add support for clock queries. (#2194) (@maleadt) - Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth) - Improvements to context handling (#2200) (@maleadt) - Add a concurrent kwarg to profiling macros. (#2201) (@maleadt) - Rework unique context management. (#2202) (@maleadt) - Preserve the buffer type when broadcasting. (#2203) (@maleadt) - Fixes for Windows (#2206) (@maleadt) - Bump Aqua. (#2207) (@maleadt) - Updates for new CUQUANTUM (#2210) (@kshyatt) - CUSPARSE: Eagerly combine duplicate element on construction. (#2213) (@maleadt) - CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot]) - Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt) - Default to testing with only a single device. (#2221) (@maleadt) - Backports for v5.1 (#2224) (@maleadt) - Take care not to spawn tasks during precompilation. (#2226) (@maleadt) - cuTensor fixes (#2228) (@maleadt) - Bump versions. (#2229) (@maleadt) - Add a note about threaded for-blocks. (#2232) (@kshyatt) - cuTENSOR plan handling changes. (#2234) (@maleadt) - Fix dynamic dispatch issues (#2235) (@MilesCranmer) - CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt) - Fixes for nightly (#2240) (@maleadt) - CUBLAS: Support more strided inputs (#2242) (@maleadt) - CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj) - Slightly rework error handling (#2245) (@maleadt) - cuTENSOR improvements (#2246) (@maleadt) - Make @device_code_sass work with non-Julia kernels. (#2247) (@maleadt) - Improve Tegra detection. (#2251) (@maleadt) - Added few SparseArrays functions (#2254) (@albertomercurio) - Reduce locking in the handle cache (#2256) (@maleadt) - Mark all CUDA ccalls as GC safe (#2262) (@vchuravy) - cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos) - cuTENSOR: refactor obtaining computetype as part of plan (#2264) (@lkdvos) - Re-generate headers. (#2265) (@maleadt) - Update to CUDNN 9. (#2267) (@maleadt) - [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison) - CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot]) - Minor improvements to nonblocking synchronization. (#2272) (@maleadt) - Add extension package for StaticArrays (#2273) (@trahflow) - Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4) - Cached workspace prototype for custatevec (#2279) (@kshyatt) - Update the Julia wrappers for v12.4 (#2282) (@amontoison) - Add support for CUDA 12.4. (#2286) (@maleadt) - Test suite changes (#2288) (@maleadt) - Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt) - Fix typo in performance tips (#2294) (@Zentrik) - Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt) - Set default buffer size in CUSPARSE mm! functions (#2298) (@lpawela) - Avoid OOMs during OOM handling. (#2299) (@maleadt) - [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison) - [CUSOLVER] Interface larft! (#2301) (@amontoison) - Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt) - [CUBLAS] Interface gemmgroupedbatched (#2310) (@amontoison) - [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)

Closed issues: - Element-wise conversion to Duals (#127) - IDEA: CuHostArray (#28) - Make Ref pass by-reference (#267) - Failed to compile PTX code when using NSight on Win11 (#1601) - view(data, idx) boundschecking is disproportionately expensive (#1678) - [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767) - Trouble using nsight systems for profiling CUDA in Julia (#1779) - dlopen("libcudart") results in duplicate libraries (#1814) - Support for JLD2 (#1833) - Windows Defender mis-labels artifacts as threat (#1836) - Support Cholesky factorization of CuSparseMatrixCSR (#1855) - Runtime not re-selected after driver upgrade (#1877) - Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945) - Cannot precompile GPU code with PrecompileTools (#2006) - Evaluating sparse matrices in the REPL has a huge memory footprint (#2016) - CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066) - StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069) - Support for LinearAlgebra.pinv (#2070) - PTX ISA 8.1 support (#2080) - Segmentation fault when importing CUDA (#2083) - "No system CUDA driver found" on NixOS (#2089) - CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093) - Missing CUDARuntimeDiscovery as a dependency in cuDNN (#2094) - Binaries for Jetson (#2105) - Minimum/maximum of array of NaNs is infinity (#2111) - Performance regression for multiple @sync copyto! on CUDA v5 (#2112) - [CUBLAS] Regenerate the wrappers with updated argument types (#2115) - More informative errors when parameter size is too big (#2119) - Unable to allocate unified memory buffers (#2120) - CUDA 12.3 has been released (#2122) - atomic min, max for Float32 and Float64 (#2129) - Native profiler output is limited to around 100 columns when printing to a file (#2130) - Intermittent CI failure: Segfault during nonblocking synchronization (#2141) - LLVM generates max.NaN which only works on sm_80 (#2148) - Unified memory-related error on Tegra T194 (#2149) - Errors on sm_61 (#2150) - First test for Julia/CUDA with 15 failures (#2158) - High CPU load during GPU synchronization (#2161) - Modifying struct containing CuArray fails in threads in 5.0.0 and 5.1.0 (#2171) - Update to CUTENSOR 2.0 (#2174) - Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175) - Support for combining duplicate elements in sparse matrices (#2185) - Interactive sessions: periodically trim the memory pool (#2190) - Broadcast does not preserve buffer type (#2191) - CUDA doesn't precompile on Julia nightly/1.11 (#2195) - Latest julia: UndefVarError: make_seed not defined in Random (#2198) - NVTX-related segfault on Windows under compute-sanitizer (#2204) - CUDA installation fails on Apple Silicon/Julia 1.10 (#2211) - Most recent package versions not supported on CUDA.jl (#2212) - Testing of CUDA fails (#2222) - Tests fail for CUDA#master (#2223) - --debug-info=2 makes NNlibCUDACUDNNExt precompilation run forever (#2225) - Test failures on Nvidia GH200 (#2227) - mul! should support strided outputs (#2230) - Please add support for older cuda versions (cuda 8 and older) (#2231) - NSight Compute: prevent API calls during precompilation (#2233) - Integrated profiler: detect lack of permissions (#2237) - Inverse Complex-to-Real FFT allocates GPU memory (#2249) - cuDNN not available for your platform (#2252) - Cannot reset CuArray to zero (#2257) - Cannot take gradient of sort on 2D CuArray (#2259) - Multi-threaded code hanging forever with Julia 1.10 (#2261) - CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268) - Adjoint not supported on Diagonal arrays (#2275) - Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276) - Release v5.3? (#2283) - Wrap CUDSS? (#2287) - Bug concerning broadcast between device array and unified array (#2289) - StackOverflowError trying to throw OutOfGPUMemoryError, subsequent errors (#2292) - BUG: sortperm! seems to perform much slower than it should (#2293) - Multiplying CuSparseMatrixCSC by CuMatrix results in Out of GPU memory (#2296) - BFloat16 support broken on Julia 1.11 (#2306)

Published by github-actions[bot] about 2 years ago

CUDA - v5.2.0

CUDA v5.2.0

Diff since v5.1.2

Merged pull requests: - CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj) - Update to CUTENSOR 2.0 (#2178) (@maleadt) - Updates for new CUQUANTUM (#2210) (@kshyatt) - Take care not to spawn tasks during precompilation. (#2226) (@maleadt) - cuTensor fixes (#2228) (@maleadt) - Bump versions. (#2229) (@maleadt) - Add a note about threaded for-blocks. (#2232) (@kshyatt) - cuTENSOR plan handling changes. (#2234) (@maleadt) - Fix dynamic dispatch issues (#2235) (@MilesCranmer) - CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt) - Fixes for nightly (#2240) (@maleadt) - CUBLAS: Support more strided inputs (#2242) (@maleadt)

Closed issues: - Trouble using nsight systems for profiling CUDA in Julia (#1779) - Evaluating sparse matrices in the REPL has a huge memory footprint (#2016) - Intermittent CI failure: Segfault during nonblocking synchronization (#2141) - First test for Julia/CUDA with 15 failures (#2158) - Update to CUTENSOR 2.0 (#2174) - Tests fail for CUDA#master (#2223) - Test failures on Nvidia GH200 (#2227) - mul! should support strided outputs (#2230) - Please add support for older cuda versions (cuda 8 and older) (#2231) - NSight Compute: prevent API calls during precompilation (#2233) - Integrated profiler: detect lack of permissions (#2237)

Published by github-actions[bot] over 2 years ago

CUDA - v5.1.2

CUDA v5.1.2

Diff since v5.1.1

Merged pull requests: - kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne) - [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison) - Added kronecker product support for dense matrices (#2177) (@albertomercurio) - Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik) - provide more information on kernel compilation error (#2180) (@simonbyrne) - [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison) - [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison) - [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison) - Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison) - [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison) - Support more kwarg syntax with kernel launches (#2189) (@maleadt) - Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt) - NVML: Add support for clock queries. (#2194) (@maleadt) - Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth) - Improvements to context handling (#2200) (@maleadt) - Add a concurrent kwarg to profiling macros. (#2201) (@maleadt) - Rework unique context management. (#2202) (@maleadt) - Preserve the buffer type when broadcasting. (#2203) (@maleadt) - Fixes for Windows (#2206) (@maleadt) - Bump Aqua. (#2207) (@maleadt) - CUSPARSE: Eagerly combine duplicate element on construction. (#2213) (@maleadt) - CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot]) - Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt) - Default to testing with only a single device. (#2221) (@maleadt) - Backports for v5.1 (#2224) (@maleadt)

Closed issues: - More informative errors when parameter size is too big (#2119) - Modifying struct containing CuArray fails in threads in 5.0.0 and 5.1.0 (#2171) - Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175) - Support for combining duplicate elements in sparse matrices (#2185) - Interactive sessions: periodically trim the memory pool (#2190) - Broadcast does not preserve buffer type (#2191) - CUDA doesn't precompile on Julia nightly/1.11 (#2195) - Latest julia: UndefVarError: make_seed not defined in Random (#2198) - CUDA installation fails on Apple Silicon/Julia 1.10 (#2211) - Most recent package versions not supported on CUDA.jl (#2212) - Testing of CUDA fails (#2222) - --debug-info=2 makes NNlibCUDACUDNNExt precompilation run forever (#2225)

Published by github-actions[bot] over 2 years ago

CUDA - v5.1.1

CUDA v5.1.1

Diff since v5.1.0

Merged pull requests: - Sanitizer improvements. (#2157) (@maleadt) - [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison) - Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt) - [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison) - [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison) - Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt) - expand docs on launch parameters (#2167) (@simonbyrne) - Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)

Closed issues: - High CPU load during GPU synchronization (#2161)

Published by github-actions[bot] over 2 years ago

CUDA - v5.1.0

CUDA v5.1.0

CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming. For more details, see the blog post.

Diff since v5.0.0

Merged pull requests: - [CUSOLVER] Add generic routines (#2074) (@amontoison) - Rework and extend the cooperative groups API. (#2081) (@maleadt) - [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison) - Fix some typos in performance tips (#2086) (@Zentrik) - Improve PTX ISA selection (#2088) (@maleadt) - Update manifest (#2090) (@github-actions[bot]) - support ChainRulesCore inplaceability (#2091) (@piever) - Add a method inv(CuMatrix) (#2095) (@amontoison) - Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison) - Add CUDARuntimeDiscovery dependency to sublibraries. (#2097) (@maleadt) - Handle and test zero-size inputs to RNGs. (#2098) (@maleadt) - Add a with_workspaces function (#2099) (@amontoison) - [CUSOLVER] Add a method for getrf! (#2100) (@amontoison) - [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison) - Call exit when handling exceptions. (#2103) (@maleadt) - Bump packages. (#2104) (@maleadt) - Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot]) - Update manifest (#2107) (@github-actions[bot]) - Make Ref mutable on the GPU. (#2109) (@maleadt) - CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot]) - Small profiler improvements (#2113) (@maleadt) - Update manifest (#2114) (@github-actions[bot]) - [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison) - [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison) - Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert) - [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison) - Update manifest (#2123) (@github-actions[bot]) - Profiler: Show used local memory. (#2124) (@maleadt) - Support for CUDA 12.3 (#2125) (@maleadt) - [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison) - [CUSOLVER] Add Xgesvdp (#2128) (@amontoison) - Profiler: don't crop when rendering to a file. (#2131) (@maleadt) - Regenerate headers for CUDA 12.3. (#2132) (@maleadt) - [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison) - CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot]) - CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot]) - Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt) - Better support for unified and host memory (#2138) (@maleadt) - Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt) - Avoid allocations during derived array construction. (#2142) (@maleadt) - More performance tweaks for memory copying (#2143) (@maleadt) - Don't use libdevice's fmin/fmax. (#2144) (@maleadt) - Update documentation (#2146) (@maleadt) - Fixes for sm_61 (#2151) (@maleadt) - Update sparse factorizations (#2152) (@amontoison) - Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt) - Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt) - Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)

Closed issues: - Element-wise conversion to Duals (#127) - IDEA: CuHostArray (#28) - Make Ref pass by-reference (#267) - view(data, idx) boundschecking is disproportionately expensive (#1678) - [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767) - dlopen("libcudart") results in duplicate libraries (#1814) - Support for JLD2 (#1833) - Windows Defender mis-labels artifacts as threat (#1836) - Support Cholesky factorization of CuSparseMatrixCSR (#1855) - Runtime not re-selected after driver upgrade (#1877) - Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945) - Cannot precompile GPU code with PrecompileTools (#2006) - CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066) - PTX ISA 8.1 support (#2080) - Segmentation fault when importing CUDA (#2083) - "No system CUDA driver found" on NixOS (#2089) - CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093) - Missing CUDARuntimeDiscovery as a dependency in cuDNN (#2094) - Binaries for Jetson (#2105) - Minimum/maximum of array of NaNs is infinity (#2111) - Performance regression for multiple @sync copyto! on CUDA v5 (#2112) - [CUBLAS] Regenerate the wrappers with updated argument types (#2115) - Unable to allocate unified memory buffers (#2120) - CUDA 12.3 has been released (#2122) - atomic min, max for Float32 and Float64 (#2129) - Native profiler output is limited to around 100 columns when printing to a file (#2130) - LLVM generates max.NaN which only works on sm_80 (#2148) - Unified memory-related error on Tegra T194 (#2149) - Errors on sm_61 (#2150)

Published by github-actions[bot] over 2 years ago

CUDA - v5.0.0

CUDA v5.0.0

Blog post: https://info.juliahub.com/cuda-jl-5-0-changes

This is a breaking release, but the breaking changes are minimal (see the blog post for details): - Julia 1.8 is now required, and only CUDA 11.4+ is supported - selection of local toolkits has changed slightly


Diff since v4.4.1

Merged pull requests: - Added support for more transform directions (#1903) (@RainerHeintzmann) - Add some performance tips to the documentation (#1999) (@Zentrik) - Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt) - Adapt to GPUCompiler#master. (#2062) (@maleadt) - Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj) - Use released GPUCompiler. (#2064) (@maleadt) - Fixes for Windows. (#2065) (@maleadt) - Switch to GPUArrays buffer management. (#2068) (@maleadt) - Update CUDA 12 to Update 2. (#2071) (@maleadt) - Update manifest (#2076) (@github-actions[bot]) - Test improvements (#2079) (@maleadt) - Update manifest (#2082) (@github-actions[bot])

Closed issues: - StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069) - Support for LinearAlgebra.pinv (#2070)

Published by github-actions[bot] over 2 years ago

CUDA - v4.4.1

CUDA v4.4.1

Diff since v4.4.0

Closed issues: - CUDA driver device support does not match toolkit (#70) - Launching kernels should not allocate (#66) - syncthreads() appears to not be sync'ing threads (#61) - Exception when using CuArrays with Flux (#129) - Kernel using MVector fails to compile or crashes at runtime due to heap allocation (#45) - Performance regression on matrix multiplication between CUDA.jl 1.3.3 and 2.1.0/master (#538) - Improve 'VS C++ redistributable' error message (#764) - CUSPARSE does not support reductions (#1406) - CUDA test failed (#1690) - Type constructor in broadcast doesn't compile (#1761) - accumulate(+) gives different results for CuArray compared to Array. (#1810) - Compat driver: preload all libraries (#1859) - Stream synchronization is slow when waiting on the event from CUDA (#1910) - cuDNN: Store convolution algorithm choice to disk. (#1947) - Disable 'No CUDA-capable device found' error log (#1955) - CUDNN_STATUS_NOT_SUPPORTED using 1D CNN model (#1977) - Memory allocations during in-place sparse matrix-vector multiplication (#1982) - CUSPARSE.sum_dim1 sums the absolute values of elements (#1983) - Update to CUDA 12.2 (#1984) - unsafe_wrap fails on zero element CuArrays (#1985) - rand in kernel works in a deterministic way (#2008) - Scalar indexing with CuArray * ReshapedArray{SubArray{CuArray}}} (#2009) - volumerhs performance regression (#2010) - CuSparseMatrix constructors allocate too much memory? (#2015) - Native profiler using CUPTI (#2017) - libLLVM-15jl.so (#2018) - "symbol multiply defined" error (#2021) - Confusion on row major vs column major (#2023) - Printing of CuArrays gives zeros or random numbers (#2033) - sortperm! fails when output is UInt vector (#2046) - Re-introduce spinning loop before nonblocking synchronization (#2057)

Merged pull requests: - Check mathType only if not Float32 (#1943) (@RomeoV) - 1.10 enablement (#1946) (@dkarrasch) - Implement reverse lookup (Ptr->Tuple) for CUDNN descriptors. (#1948) (@RomeoV) - Wrapper with tests for gemmBatchedEx! (#1975) (@lpawela) - Add wrappers for gemv_batched! (#1981) (@lpawela) - Update CUSPARSE.sum_dim<n> to allow for arbitrary function on elements (#1987) (@lpawela) - Update manifest (#1988) (@github-actions[bot]) - Add vectorized cached loads (#1993) (@Zentrik) - Update manifest (#1995) (@github-actions[bot]) - Fix typo in captured macro example (#1996) (@Zentrik) - Adapt Type call broadcasting to a function (#2000) (@simonbyrne) - [CUSPARSE] Added support for generalized dot product dot(x, A, y) = dot(x, A * y) without allocating A * y (#2001) (@albertomercurio) - Update manifest (#2002) (@github-actions[bot]) - Support for printing types. (#2003) (@maleadt) - Fix accumulate bug (#2005) (@chrstphrbrns) - Update manifest (#2013) (@github-actions[bot]) - Add a raw mode to code_sass. (#2019) (@maleadt) - Update manifest (#2022) (@github-actions[bot]) - Add a native profiler. (#2024) (@maleadt) - Perform synchronization on a worker thread (#2025) (@maleadt) - Remove broken video link in docs (#2028) (@christiangnrd) - When freeing memory, use the high-level device getter. (#2029) (@maleadt) - Add support for @cuda fastmath (#2030) (@maleadt) - Make "CUDA.jl" a link on the doc entry page (#2031) (@carstenbauer) - Add support for CUDA 12.2. (#2034) (@maleadt) - rand: seed kernels from the host. (#2035) (@maleadt) - Update wrappers for CUDA 12.2. (#2039) (@maleadt) - On CUDA 12.2, have the memory pool enforce hard memory limits. (#2040) (@maleadt) - Delay all initialization errors until run time. (#2041) (@maleadt) - JLL/CI/Julia changes. (#2042) (@maleadt) - Add support for NVTX events to the integrated profiler. (#2043) (@maleadt) - Update cuStateVec to cuQuantum 23.6. 
(#2044) (@maleadt) - Add some more fastmath functions (#2047) (@Zentrik) - Fixup wrong key lookup. (#2048) (@RomeoV) - Update manifest (#2049) (@github-actions[bot]) - Make sortperm! resilient to type mismatches. (#2051) (@maleadt) - Disable tests that cause GC corruption on 1.10. (#2053) (@maleadt) - enable dependabot for GitHub actions (#2054) (@ranocha) - Bump actions/checkout from 2 to 3 (#2055) (@dependabot[bot]) - Bump peter-evans/create-pull-request from 3 to 5 (#2056) (@dependabot[bot]) - Rework how local toolkits are selected. (#2058) (@maleadt) - Busy-wait before doing nonblocking synchronization. (#2059) (@maleadt)

- Julia
Published by github-actions[bot] almost 3 years ago

CUDA - v4.4.0

CUDA v4.4.0

Diff since v4.3.2

Closed issues: - Unreachable control flow leads to illegal divergent barriers (#1746) - CUBLAS fails on new CUDA.jl v4 (#1852) - Sort fails on Lovelace (sm8.9) GPUs (#1874) - gesvd! crashes on Pascal and v12.0 (#1932) - No effect for calling "nsys launch" (#1938) - Basic math operations with nested adjoint and transpose (#1940) - CPU and GPU implementations return results at dissimilar scales, even in double precision arithmetics (#1950) - Failed CUDA.jl initialization breaks Flux? (#1952) - Recent mul! changes break multiplication with matrices that have StaticArray elements (#1953) - Test infrastructure: define test groups (#1961) - Strange rand errors when sampling large matrices (#1963) - Add aqua tests (#1964) - Support of Orin GPU from Nvidia ? (#1966) - Crash in LLVM (#1971) - Warning cuDNN Convolution (#1972) - Strange behaviour when installed at system level (#1973)

Merged pull requests: - Update benchmarks for 1.8 and 1.9 (#1933) (@maleadt) - CUSOLVER: Explicitly pass NULL when not requesting svd outputs. (#1934) (@maleadt) - Detect and complain about loading system libraries. (#1935) (@maleadt) - Update manifest (#1936) (@github-actions[bot]) - Avoid stack overflow with early OOM reporting. (#1937) (@maleadt) - [CUSPARSE] Improved support for UniformScaling and Diagonal (#1941) (@albertomercurio) - Update manifest (#1949) (@github-actions[bot]) - Update GPUCompiler to fix unreachable control flow. (#1951) (@maleadt) - Allow StaticArray eltype in matmat{vec,mul} (#1954) (@lcw) - Bump CUDNN to v8.9. (#1959) (@maleadt) - Bump CUTENSOR to v1.7. (#1960) (@maleadt) - Add and fix some aqua tests (#1965) (@charleskawczynski) - Fix compatibility of CUDA 11.4 to support Orin. (#1967) (@maleadt) - Don't use Int32 indices in rand kernels. (#1969) (@maleadt) - CI simplifications (#1970) (@maleadt) - Use Base.pkgversion on 1.9. (#1974) (@maleadt) - Update to LLVM.jl 6. (#1976) (@maleadt) - fix launch config bug in bitonic sort (#1979) (@xaellison) - Update manifest (#1980) (@github-actions[bot])

- Julia
Published by github-actions[bot] almost 3 years ago

CUDA - v4.3.2

CUDA v4.3.2

Diff since v4.3.1

Merged pull requests: - Reduce load time by shifting mul! definition (#1904) (@dkarrasch)

- Julia
Published by github-actions[bot] almost 3 years ago

CUDA - v4.3.1

CUDA v4.3.1

Diff since v4.3.0

Closed issues: - Array testsuite compiles kernel with large types (#1902) - CUDA.jl v4 installs CUDA runtime despite version=local (#1922) - Occasional "CUSOLVERError: an internal operation failed (code 7, CUSOLVER_STATUS_INTERNAL_ERROR)" (#1924) - Does cuDNN@v1.0.4 need CUDA@v4.3? (#1929)

Merged pull requests: - Simplify libdevice linking. (#1927) (@maleadt) - Add a show method for kernel objects. (#1928) (@maleadt) - Update manifest (#1930) (@github-actions[bot]) - Pass a higher capability to ptxas. (#1931) (@maleadt)

- Julia
Published by github-actions[bot] about 3 years ago

CUDA - v4.3.0

CUDA v4.3.0

Diff since v4.2.0

Closed issues: - Multidimensional reverse (#1126) - Test errors on master (#1866) - Integer overflow error with svd for large matrix (#1880) - Erratic behaviour of CUDA.jl if used in the REPL of VSCode. (#1892) - QR decomposition requires scalar indexing (#1893) - BSOD during package tests (#1898) - Insufficient coverage of CuArrays in the documentation (#1901) - Failed to compile with Julia v1.9 on PowerPC (#1911) - CUDA test failed in wmma.jl (#1914) - Fix deprecation warnings (#1920)

Merged pull requests: - CUSOLVER: Fix workspace size passing. (#1890) (@maleadt) - Lovelace fixes (#1894) (@maleadt) - Update manifest (#1897) (@github-actions[bot]) - Reverse with multiple dimensions (#1899) (@RainerHeintzmann) - Restrict number of test jobs based on available memory. (#1900) (@maleadt) - Avoid unneeded macros to cut down on generated code (#1905) (@maleadt) - Avoid unneeded macros to cut down on generated code (#1906) (@maleadt) - Update manifest (#1907) (@github-actions[bot]) - Bump GPUCompiler. (#1908) (@maleadt) - Don't use Float64 atomics on unsupported platforms. (#1912) (@maleadt) - Report package versions as part of versioninfo(). (#1913) (@maleadt) - Align variables in constant memory by 256 bit (#1915) (@Zentrik) - Add norm functions for 3 floats (#1916) (@Zentrik) - cuDNN: only choose conv algorithms if they match descriptor mathType (#1917) (@ToucheSir) - Update manifest (#1918) (@github-actions[bot]) - Skip Integer WMMA tests on older devices. (#1919) (@maleadt)

- Julia
Published by github-actions[bot] about 3 years ago

CUDA - v4.2.0

CUDA v4.2.0

Diff since v4.1.4

Closed issues: - NVTX: consider using Start/End for ranges (#1485) - Limitations of CuIterator (#1768) - Testing fails on unsupported devices. (#1815) - Local runtime discovery does not work for external libraries (CUDNN, CUTENSOR) (#1850) - Passing tests using Github CI workflow errors with libcuda not defined (#1867) - Cannot precompile GPU code with SnoopPrecompile (#1870) - Incorrect kernel execution with bounds checking using Julia 1.9.0-rc2 (#1875) - Fake CUDA library (#1879) - Error thrown when launching Julia with Nsight systems or compute. (#1886) - Cannot construct CuDeviceArray (#1887) - Incorrect colVal array when using CuSparseMatrixCSR command on sparse matrix (#1888)

Merged pull requests: - Use adapt symmetrically in CuIterator (#1769) (@mcabbott) - Allow but warn when testing on not fully-supported devices. (#1818) (@maleadt) - Support runtime discovery for non-toolkit libraries (CUTENSOR, CUDNN, CUQUANTUM) (#1858) (@mloubout) - Add KernelAbstractions.jl unsafe_free! (#1863) (@pxl-th) - Allow precompiling CUDA code. (#1865) (@maleadt) - Assert CUDA.jl is functional when creating the TLS. (#1868) (@maleadt) - Update manifest (#1871) (@github-actions[bot]) - Don't collect AbstractQ objects in tests (#1872) (@dkarrasch) - Add compatibility entry for Lovelace (#1873) (@xaellison) - remove some type-piracy from cusparse (#1876) (@vtjnash) - Remove more unneeded ndims methods. (#1878) (@maleadt) - Guard the initialization-time CUDA driver check in a try/catch. (#1881) (@maleadt) - Update manifest (#1882) (@github-actions[bot]) - Update CUDA 12.1 to 12.1.1. (#1883) (@maleadt) - Use atomics for allocation statistics. (#1884) (@maleadt) - Fix atomic increment of alloc stats. (#1885) (@maleadt) - Update manifest (#1889) (@github-actions[bot])

- Julia
Published by github-actions[bot] about 3 years ago

CUDA - v4.1.4

CUDA v4.1.4

Diff since v4.1.3

Closed issues: - Buggy precompilation of __init__-defined symbols can break CUDA_Driver_jll initialization (#1798) - Calling CUDA.set_runtime_version!() with float parameter makes CUDA.jl unusable. (#1831) - Unexpected memory allocation when using randn! (#1856) - The memory copy speed seems to exceed the hardware limit (#1860) - PCG produces different output on GPU (via Krylov.jl) (#1864)

Merged pull requests: - Fix system_driver_version on platforms not supported by CUDA_Driver_jll. (#1854) (@maleadt) - Update manifest (#1861) (@github-actions[bot])

- Julia
Published by github-actions[bot] about 3 years ago

CUDA - v4.1.3

CUDA v4.1.3

Diff since v4.1.2

Closed issues: - CUDA.versioninfo() triggers download of lazy artifacts (#1844)

Merged pull requests: - Choose parallel tests based on CPUs, not threads. (#1842) (@maleadt) - Adapt to LLVM.jl 5 and GPUCompiler.jl 0.19. (#1847) (@maleadt)

- Julia
Published by github-actions[bot] about 3 years ago

CUDA - v4.1.2

CUDA v4.1.2

Diff since v4.1.1

Closed issues: - Flux's gradient differentiating rfft leads to non-bit error (#1835)

Merged pull requests: - switch to using defined globals (#1832) (@simonbyrne) - Update manifest (#1837) (@github-actions[bot])

- Julia
Published by github-actions[bot] about 3 years ago

CUDA - v4.1.1

CUDA v4.1.1

Diff since v4.1.0

Merged pull requests: - Fix export of CUDABackend (#1834) (@vchuravy)

- Julia
Published by github-actions[bot] about 3 years ago

CUDA - v4.1.0

CUDA v4.1.0

Diff since v4.0.1

Closed issues: - ERROR: LoadError: bin\cublas64_11.dll when installing CUDA (#1750) - System-wide CUDA in LD_LIBRARY_PATH breaks CUBLAS (#1755) - CuDeviceTexture getindex breaks when executed on the CPU (#1757) - cuDNN.version can cause Julia to crash, missing `cudnn_ops_infer64_8.dll` (#1777) - cuDNN compile error "ERROR: LoadError: ArgumentError: invalid version string: local" (#1783) - "Error: No CUDA Runtime library found" for ≥v4.0.0 (#1808) - sqrt broken in kernels 'Format of __nvvm_reflect function not recognized' (#1817)

Merged pull requests: - Add support for CUDA 12.0. (#1742) (@maleadt) - Add more fixes and tests for CUDA toolkit 12.0 (#1756) (@amontoison) - Update manifest (#1758) (@github-actions[bot]) - Fix test/cusparse/interfaces.jl (#1762) (@amontoison) - Simplify the function sig. (#1763) (@N5N3) - Update manifest (#1770) (@github-actions[bot]) - Make versioninfo() resilient against NVML EPERM. (#1771) (@maleadt) - Move CUDAKernels to CUDA.jl (#1772) (@vchuravy) - [CUSPARSE] Improve conversion and tests between sparse matrices (#1774) (@amontoison) - Use geam for + and - operations with CuMatrix{<:CublasFloat} (#1775) (@amontoison) - Update manifest (#1776) (@github-actions[bot]) - Update manifest (#1781) (@github-actions[bot]) - Update manifest (#1784) (@github-actions[bot]) - [CUSPARSE] Update preconditioners.jl (#1785) (@amontoison) - [CUSOLVER] Avoid the conversion to CSR format for reordering routines (#1786) (@amontoison) - Bump GPUCompiler. (#1787) (@maleadt) - Remove unneeded variable. (#1788) (@maleadt) - [CUSPARSE] Update conversions.jl (#1791) (@amontoison) - Update to CUDNN 8.8.1 for CUDA 12 compatibility. (#1792) (@maleadt) - Add support for CUDA 12.1 (#1793) (@maleadt) - [CUSPARSE] Interface color reordering (#1794) (@amontoison) - [CUSPARSE] Interface gtsv2 (#1795) (@amontoison) - Update manifest (#1796) (@github-actions[bot]) - Adapt to GPUCompiler 0.18 (#1799) (@maleadt) - Follow Array's behavior when initializing (#1800) (@lcw) - [CUSOLVER] Support A \ b for rectangular matrices (#1802) (@amontoison) - Use symbols instead of values when emitting code, when possible. (#1804) (@maleadt) - Refactor CI pipeline a little. (#1805) (@maleadt) - [CUSOLVER] Improve the dispatch for LAPACK routines (#1806) (@amontoison) - Diagonal for lower triangular of LU decomposition set incorrectly (#1813) (@tgymnich) - CompatHelper: add new compat entry for "KernelAbstractions" at version "0.9" (#1824) (@github-actions[bot]) - Rebuild CUPTI API with support for STRUCT_SIZE (#1827) (@vchuravy) - Release CUDA 4.1 (#1828) (@vchuravy)

- Julia
Published by github-actions[bot] about 3 years ago

CUDA - v4.0.1

What's Changed

  • Warn when using old devices by @maleadt in https://github.com/JuliaGPU/CUDA.jl/pull/1752
  • Silence some errors to support conditional use. by @maleadt in https://github.com/JuliaGPU/CUDA.jl/pull/1754

Full Changelog: https://github.com/JuliaGPU/CUDA.jl/compare/v4.0.0...v4.0.1

- Julia
Published by vchuravy over 3 years ago

CUDA - v4.0.0

CUDA v4.0.0

Diff since v3.13.1

Closed issues: - Missing implementation of right multiply for QR decomposition (#1738) - [CUSPARSE] Type error with mm! (#1743)

Merged pull requests: - Implement rmul for qr. (#1739) (@maleadt) - Update manifest (#1741) (@github-actions[bot]) - Update CUSPARSE for CUDA v12.0 (#1744) (@amontoison) - Fix nvprof command (#1745) (@lucifer1004) - Update manifest (#1747) (@github-actions[bot]) - Fix grammar (#1748) (@lucifer1004)

- Julia
Published by github-actions[bot] over 3 years ago

CUDA - v3.13.1

CUDA v3.13.1

Diff since v3.13.0

Closed issues: - CUDA.jl cuFFT underperforming against CuPy cuFFT (#1682) - Is block-spmm supported? (#1736)

Merged pull requests: - Introduce cuFFT plan cache; switch to auto-managed memory. (#1734) (@maleadt) - Stop pirating GPUArrays' RNG methods. (#1735) (@maleadt)

- Julia
Published by github-actions[bot] over 3 years ago

CUDA - v3.12.2

CUDA v3.12.2

Diff since v3.12.1

Closed issues: - CUDA.jl cuFFT underperforming against CuPy cuFFT (#1682) - Error during CUDA test (#1718) - Kernel error from bad broadcast (should be regular error?) (#1720) - Freeze into StackOverflow when JULIA_DEBUG=CUDA set (#1721) - Use of linear operators in CUDA.jl (#1727) - Is block-spmm supported? (#1736)

Merged pull requests: - Allow copy(::RNG) (#1719) (@mcabbott) - Update manifest (#1722) (@github-actions[bot]) - Simplify CuError rendering before library initialization. (#1723) (@maleadt) - Simplify CuError rendering before library initialization (master branch version) (#1724) (@maleadt) - Make device RNG test more robust. (#1725) (@maleadt) - Rely on LLVM.jl's typed_ccall for more intrinsics. (#1728) (@maleadt) - Backports for 3.13 (#1729) (@maleadt) - Simplify CUBLAS and CUSPARSE wrappers, reducing code generated. (#1730) (@maleadt) - Add Julia 1.9 CI. (#1731) (@maleadt) - Use released dependencies. (#1732) (@maleadt) - Remove NVTX. (#1733) (@maleadt) - Introduce cuFFT plan cache; switch to auto-managed memory. (#1734) (@maleadt) - Stop pirating GPUArrays' RNG methods. (#1735) (@maleadt)

- Julia
Published by github-actions[bot] over 3 years ago

CUDA - v3.13.0

CUDA v3.13.0

Diff since v3.12.1

Closed issues: - Error during CUDA test (#1718) - Kernel error from bad broadcast (should be regular error?) (#1720) - Freeze into StackOverflow when JULIA_DEBUG=CUDA set (#1721) - Use of linear operators in CUDA.jl (#1727)

Merged pull requests: - Allow copy(::RNG) (#1719) (@mcabbott) - Update manifest (#1722) (@github-actions[bot]) - Simplify CuError rendering before library initialization. (#1723) (@maleadt) - Simplify CuError rendering before library initialization (master branch version) (#1724) (@maleadt) - Make device RNG test more robust. (#1725) (@maleadt) - Rely on LLVM.jl's typed_ccall for more intrinsics. (#1728) (@maleadt) - Backports for 3.13 (#1729) (@maleadt) - Simplify CUBLAS and CUSPARSE wrappers, reducing code generated. (#1730) (@maleadt) - Add Julia 1.9 CI. (#1731) (@maleadt) - Use released dependencies. (#1732) (@maleadt) - Remove NVTX. (#1733) (@maleadt)

- Julia
Published by github-actions[bot] over 3 years ago

CUDA - v3.12.1

CUDA v3.12.1

Diff since v3.12.0

Closed issues: - Accumulate doesn't work on >=4 dim Arrays with dims <= ndims(A) - 3 (#1039) - CUSPARSE does not support dense-sparse matrix multiplication (#1403) - Scalar indexing when comparing a CuArray to the identity matrix (#1557) - CUBLAS_STATUS_NOT_INITIALIZED (#1567) - LinearAlgebra./ and LinearAlgebra.\ breaks CuArray (#1568) - Window size in grid-stride loop (#1573) - Matrix multiplication works for primitive and non-primitive custom number types on the CPU, but it fails for primitive custom number types on the GPU. (#1574) - CuIterator doesn't specify IteratorSize but has no length() (#1583) - Garbage collection doesn't work as shown in the documentation (#1586) - Adding sparse adjoint results in kernel error (#1591) - sparse - sparse matrix multiplication partially missing (#1599) - FastMath sincos(), cis(), exp(im..) aren't as fast as C++ (#1606) - wrong type in wrapper of a cusolver function (#1621) - Adding CUDNN support for 3D convolutions/cross-correlations (#1631) - copyto! does not work between a CuArray and a view(Array) (#1634) - Minor issue with sparse function (#1641) - Scalar indexing when displaying Diagonal{Int64, CuSparseVector{Int64, Int32}} (#1645) - Many errors running test suite on GTX 960 4GB (#1650) - Driver discovery broken on platforms without compat driver (#1653) - Aliasing/Polluted Result from rfftplan for Float32 2^n 3D array (#1656) - Re-instate memory limit (#1670) - Split libnvToolsExt from CUDA_Runtime_jll? (#1672) - accumulate(op, a) causes scalar indexing (#1680) - CUSPARSE CI failures (#1692) - axpy! for nested base types (reshapedarray/adjoint/view) (#1696) - copyto! between a PermutedDimsArray view and a CuArray doesn't work (#1697) - WMMA test failure (#1700) - UndefVarError when a binary is not found (#1701) - Is CUSPARSELT supported? (#1702) - Best practices to reduce startup time (#1707) - 1.9 compatibility (#1710) - WARNING: unused variadic parameters. (#1712)

Merged pull requests: - Remove/rework CuDeviceArray constructors (#1308) (@maleadt) - Add always_inline kernel parameter (#1554) (@lcw) - Update manifest (#1564) (@github-actions[bot]) - Update manifest (#1569) (@github-actions[bot]) - Update manifest (#1571) (@github-actions[bot]) - Fix native RNG window calculation. (#1575) (@maleadt) - Use Base.active_project. (#1576) (@maleadt) - Fixes for and tests using JET. (#1577) (@maleadt) - Update manifest (#1578) (@github-actions[bot]) - Docs, remove global variables in intro benchmark (#1580) (@SteffenPL) - Update manifest (#1581) (@github-actions[bot]) - Update manifest (#1582) (@github-actions[bot]) - Bugfixes when using \ operator with non square matrices (#1584) (@GVigne) - remove unbound type parameters (#1585) (@nsajko) - added --openacc-profiling off to the nvprof (#1587) (@mbeltagy) - Update manifest (#1588) (@github-actions[bot]) - Wrap at-cuda's code in a let block. (#1589) (@maleadt) - Revert: Use JET during test suite. (#1590) (@maleadt) - [CUSPARSE] Update mv! and mm! functions for CuSparseMatrixCOO and CuSparseMatrixCSC (#1592) (@amontoison) - [CUSPARSE] Add sv! and sm! routines (#1593) (@amontoison) - CompatHelper: bump compat for "BFloat16s" to "0.3" (#1594) (@github-actions[bot]) - Update wrap.jl (#1595) (@amontoison) - Provide more useful explanation why an eltype is unsupported. (#1596) (@maleadt) - CompatHelper: bump compat for "BFloat16s" to "0.4" (#1597) (@github-actions[bot]) - Improve eltype error reporting. (#1598) (@maleadt) - Add () at the end of the library name in all ccall (#1600) (@amontoison) - Define length for CuIterator (#1602) (@mcabbott) - Added more sparse functions like: kron, tril, triu, reshape, adjoint, transpose, sparse-sparse multiplication (#1603) (@albertomercurio) - Fix rotate! and reflect! for the generic fallback in GPUArrays.jl (#1604) (@amontoison) - Update manifest (#1605) (@github-actions[bot]) - Update manifest (#1609) (@github-actions[bot]) - [CUSPARSE] Interface generic routines (#1611) (@amontoison) - [CUSPARSE] Update sparse-sparse GEMM (#1613) (@amontoison) - [CUSPARSE] Add sddmm! and gemvi! routines (#1615) (@amontoison) - Update manifest (#1616) (@github-actions[bot]) - Don't use isbitsunion to support structs of union types. (#1617) (@maleadt) - Update CUDA driver compatibility package to 11.8. (#1618) (@maleadt) - Update CUDA artifacts to 11.7 Update 1. (#1619) (@maleadt) - Update to CUDA 11.8 (#1620) (@maleadt) - Update to CUDNN 8.6. (#1622) (@maleadt) - Move CUDNN and CUTENSOR into separate packages (#1624) (@maleadt) - Bump BFloat16s. (#1625) (@maleadt) - fix #1621 (#1626) (@jemiryguo) - Restore functionality of FastMath.sincos. (#1627) (@maleadt) - Update manifest (#1628) (@github-actions[bot]) - Switch from manual artifact handling to automated JLLs (#1629) (@maleadt) - [CUSPARSE] Add CuMatrix * CuSparseMatrix products (#1632) (@amontoison) - Silence some test warnings. (#1635) (@maleadt) - Update CUTENSOR to v1.6 (#1636) (@maleadt) - [CUSPARSE] Add SparseMatrix * SparseVector products (#1637) (@amontoison) - Upgrade CUSTATEVEC to v1.1 (#1638) (@maleadt) - Upgrade CUTENSORNET to v1.1 (#1639) (@maleadt) - [CUSPARSE] Add CuSparseVector ± CuSparseVector (#1640) (@amontoison) - CompatHelper: add new compat entry for "Preferences" at version "1" (#1642) (@github-actions[bot]) - Fix #1641 (#1643) (@amontoison) - Update manifest (#1646) (@github-actions[bot]) - [CUSPARSE] Add dot(CuSparseVector,CuVector) and vice-versa (#1647) (@amontoison) - [CUSPARSE] Add ldiv! for CuSparseMatrixCOO and geam for CuSparseMatrixCSC (#1648) (@amontoison) - Update autogenerated headers (#1649) (@maleadt) - Remove deprecations (#1651) (@maleadt) - Don't warn about the old JULIA_CUDA_USE_BINARYBUILDER env var when using preferences (#1652) (@maleadt) - Update CUTENSORNET to use new slice group (#1654) (@kshyatt) - [CUSPARSE] Fix conversions between CuSparseMatrixCOO and CuSparseMatrixCSC (#1655) (@amontoison) - Include compiler options in error log. (#1657) (@maleadt) - Discover the system driver when CUDA_Driver_jll isn't available. (#1658) (@maleadt) - Preserve buffer type when adapting to CuArray. (#1659) (@maleadt) - Update manifest (#1661) (@github-actions[bot]) - Extend conversion of QRPackedQ object to CuArray (#1662) (@GVigne) - [CUSPARSE] Add CuSparseMatrixCSC * CuSparseMatrixCSC (#1663) (@amontoison) - Update manifest (#1665) (@github-actions[bot]) - [CUSPARSE] Add more tests (#1668) (@amontoison) - Update manifest (#1671) (@github-actions[bot]) - Update manifest (#1676) (@github-actions[bot]) - Fix eigen when using Hermitian or Symmetric matrices (#1677) (@GVigne) - Update manifest (#1679) (@github-actions[bot]) - adding defaults for accumulate(op, a) with modified code from Base.accumulate (#1681) (@leios) - Add right division operator for Diagonal matrices (#1683) (@GVigne) - Update manifest (#1686) (@github-actions[bot]) - Bump CUQUANTUM libraries (#1688) (@maleadt) - typo (#1689) (@ArnoStrouwen) - Retry CUSOLVER handle creation when encountering an internal error. (#1691) (@maleadt) - Fix #1692 (#1693) (@amontoison) - Update manifest (#1694) (@github-actions[bot]) - [CUSPARSE] Support kron with Diagonal arguments (#1695) (@albertomercurio) - Re-introduce memory limits. (#1698) (@maleadt) - Adapt to GPUCompiler changes. (#1699) (@maleadt) - WMMA: Don't wrap fragments of size 1 in a struct. (#1704) (@maleadt) - Update manifest (#1708) (@github-actions[bot]) - Use plain llvmcall calling convention for WMMA intrinsics. (#1709) (@maleadt) - Reclaim in cuDNN conv algorithm search (#1711) (@ToucheSir) - CUBLAS: test against generic axp(b)y, not the BLAS-specific one. (#1713) (@maleadt) - Fix LU getproperty invoke. (#1714) (@maleadt) - Backports for 3.12.1 (#1715) (@maleadt) - Specialize cholcopy to avoid scalar indexing. (#1716) (@maleadt) - Fix handling of inline-allocated structures with unions. (#1717) (@maleadt)

- Julia
Published by github-actions[bot] over 3 years ago

CUDA - v3.12.0

CUDA v3.12.0

Diff since v3.11.0

Closed issues: - Implement Base.repeat (#177) - repeat performs scalar indexing for multi-dimensional arrays (#1051) - The GPU compiler fails on a call to maximum (#1548) - versioninfo triggers artifact downloads (#1549) - Error when broadcasting composed functions (#1550) - overload Base.copy! for AbstractGPUArray{<:Any,1} (#1555)

Merged pull requests: - Fix math quirk. (#1546) (@maleadt) - Wrap cusolverRf.h and cusolverSp_LOWLEVEL_PREVIEW.h (#1547) (@frapac) - Update manifest (#1551) (@github-actions[bot]) - tighten unsafe_wrap signature on scalar length (#1552) (@sjkelly) - Update Documenter key. (#1553) (@maleadt) - Update manifest (#1556) (@github-actions[bot]) - Import factorisation internal types from LinearAlgebra (#1558) (@theabhirath) - Update manifest (#1560) (@github-actions[bot]) - add reshape for CuDeviceArray (#1561) (@omlins)

- Julia
Published by github-actions[bot] almost 4 years ago

CUDA - v3.11.0

CUDA v3.11.0

Diff since v3.10.1

Closed issues: - CUSPARSE: Diagonal + CSC/CSR gives dense array (#1469) - CUBLAS: Multiplication of UpperTriangular/LowerTriangular not supported (#1486) - CUTENSOR tests consume lots of memory, breaking other tests (#1501) - CUFFT doesn't work for ComplexF64 C2C in-place (#1519) - Inconsistency of == and isequal for CuArray (#1524) - Setting CUDA seed the first time changes Random's RNG non-deterministically (#1526) - Undefined exported symbols (#1527) - Could not load library libLLVMExtra-14.dll (#1535) - Add an rrule for cholesky to CUDA.jl (#1541)

Merged pull requests: - specialize +/- op for sparse diag (#1514) (@Roger-luo) - Make sure instantiating RNGs doesn't affect the global CPU RNG. (#1530) (@maleadt) - Update manifest (#1531) (@github-actions[bot]) - ldiv! for LU Decomposition (#1532) (@SBuercklin) - Lower dmax for contraction tests (#1534) (@kshyatt) - Fix convolution algorithm search (#1536) (@maxfreu) - Update manifest (#1537) (@github-actions[bot]) - add specializations for some triangular-triangular multiplications (#1538) (@Red-Portal) - Add a utility to download artifacts without a functional driver. (#1539) (@maleadt) - Update manifest (#1543) (@github-actions[bot]) - Explicit tests for type conversion (#1544) (@kshyatt) - Remove unused exports. (#1545) (@maleadt)

- Julia
Published by github-actions[bot] almost 4 years ago

CUDA - v3.10.1

CUDA v3.10.1

Diff since v3.10.0

Closed issues: - Overflow in randn using CUDA.jl's native RNG (#1464) - Segmentation fault with pre-compiled library importing CUDA (#1465) - Julia freezes when using Polynomials with CuArray (#1497) - Launch overhead regression (#1503) - CUSOLVER: Matrix division requires identical types (#1512) - Incorrect distribution for complex standard normals when using CUDA.default_rng() (#1515) - loggamma (#1528)

Merged pull requests: - CUSPARSE: Support mixed type mv (#1475) (@Roger-luo) - Add method for LinearAlgebra.opnorm2 (#1516) (@danielwe) - Promote to common eltype in matrix division (#1517) (@danielwe) - Fix Box-Muller transformation for complex eltypes (#1518) (@danielwe) - Update manifest (#1521) (@github-actions[bot]) - Use at-dispose for LLVM.jl resource cleanup. (#1523) (@maleadt) - loggamma (#1529) (@cossio)

- Julia
Published by github-actions[bot] about 4 years ago

CUDA - v3.10.0

CUDA v3.10.0

Diff since v3.9.1

Closed issues: - Error while freeing DeviceBuffer-warning when using multiple GPUs (#1454) - CUDNN cache locking prevents finalizers resulting in OOMs (#1461) - EOFError from pool_cleanup when closing REPL (#1495) - TypeError in compiler with custom kernel (#1496)

Merged pull requests: - expose sparse mv/mm algo selection (#1201) (@Roger-luo) - Always inspect the task-local context when verifying before freeing. (#1462) (@maleadt) - support sparse opnorm (#1466) (@Roger-luo) - Move CUSTATEVEC and CUTENSORNET into lib/ (#1478) (@vchuravy) - Adapt to GPUCompiler 0.15 changes (#1488) (@maleadt) - Limit time held by CUDNN locks. (#1491) (@maleadt) - Docstring for cu (#1493) (@mcabbott) - Update manifest (#1499) (@github-actions[bot]) - Silence EOFError in pool_cleanup (#1502) (@Octogonapus) - Adapt to GPUCompiler changes (#1504) (@maleadt) - Fixes for CUSPARSE 11.7.1. (#1505) (@maleadt) - Update artifacts (#1507) (@maleadt) - Update manifest (#1509) (@github-actions[bot]) - Add a new cache for HostKernel objects. (#1510) (@maleadt)

- Julia
Published by github-actions[bot] about 4 years ago

CUDA - v3.9.1

CUDA v3.9.1

Diff since v3.9.0

Closed issues: - Issue with copy_cublasfloat (#1476) - Errors when broadcasting random number generators (#1480) - CPU version of linear algebra routine is dispatched when using Zygote.gradient (#1481) - scan! fails on vectors of structs (#1482) - InexactError when getting CUDA version info (#1489)

Merged pull requests: - Allow more integer argument types for byteperm (#1420) (@eschnett) - support CuSparseMatrix(::Diagonal) (#1470) (@Roger-luo) - Don't emit debug info until the next CUDA version. (#1473) (@maleadt) - Update manifest (#1474) (@github-actions[bot]) - Update manifest (#1479) (@github-actions[bot]) - fix unsafe_wrap docstring and widen signature (#1483) (@piever) - Update manifest (#1484) (@github-actions[bot]) - Check whether cudaRuntimeGetVersion succeeded. (#1490) (@maleadt) - Update manifest (#1494) (@github-actions[bot]) - Fix #1476: Allow any container in copy_cublasfloat (#1498) (@danielwe)

- Julia
Published by github-actions[bot] about 4 years ago

CUDA - v3.9.0

CUDA v3.9.0

Diff since v3.8.5

Closed issues: - Tests for showing (#35) - Support LU factorizations (#1193) - Int8 WMMA not working in 3.8.4 and 3.8.5 despite merged PR. Add more unit tests? (#1442) - Optional CPU cpu kernel call with @cuda (#1443) - Add library/artifact management for NCCL (#1446) - permutedims returns a lowertriangular matrix (#1451) - New broadcast corrupts memory? (#1457) - norm does not dispatch on CuSparseMatrixCSC (#1460) - scalar * sparse multiplication (#1468)

Merged pull requests: - CUTENSOR: axpy! and axpby! not mutating fixed (#1416) (@yapanuwan) - Initial wrap of cuquantum (#1437) (@kshyatt) - CompatHelper: bump compat for "GPUCompiler" to "0.14" (#1441) (@github-actions[bot]) - Fix return type of nrm2 for ComplexF16 (#1444) (@danielwe) - Use a build matrix. (#1445) (@maleadt) - Update manifest (#1447) (@github-actions[bot]) - Rework factorizations (#1449) (@maleadt) - Add NCCL binaries. (#1450) (@maleadt) - Support general eltypes in matrix division and SVD (#1453) (@danielwe) - Update manifest (#1456) (@github-actions[bot]) - Look at more environment variables to find nsys. (#1459) (@maleadt) - Fixes for 1.8 (#1463) (@maleadt)

- Julia
Published by github-actions[bot] about 4 years ago

CUDA - v3.8.5

CUDA v3.8.5

Diff since v3.8.4

Merged pull requests: - Update manifest (#1440) (@github-actions[bot])

- Julia
Published by github-actions[bot] about 4 years ago

CUDA - v3.8.4

CUDA v3.8.4

Diff since v3.8.3

Closed issues: - sparse-sparse and sparse-constant multiplication lose sparsity (output dense matrix) (#1264) - LLVMExtra fails to load on Julia 1.8 and PPC (#1387) - compute-sanitizer CUDA_ERROR_INVALID_VALUE on CUDA.jl 3.0+ (#1415) - @cudnnDescriptor is not threadsafe (#1421) - Precompilation of CUDA 3.8.3 broken on 1.7.1 due to changes in Random123.jl (#1422) - OOM error should include memory status (#1427) - WMMA kernel works with Julia 1.7.2 but fails with illegal memory access for Julia 1.8.0-beta1 (#1431) - Non Int64 local memory size leads to dynamic function invocation (#1434) - "initialization" test failing (#1435) - cuda with julia 1.8 not working on windows (working fine(?) on wsl2) (#1436)

Merged pull requests: - Add Int8 WMMA Support (#1119) (@max-Hawkins) - Wrap generic sparse-sparse GEMM (#1285) (@kshyatt) - Fix sparse COO to CSR conversion. (#1412) (@maleadt) - Drop support for CUDA 10.1 and below (#1414) (@maleadt) - Update manifest (#1417) (@github-actions[bot]) - Report the OOM memory status at the time of the error. (#1428) (@maleadt) - Lock CUDNN descriptor cache lookups. (#1430) (@maleadt) - Switch to new LLVM context management for 1.9 compatibility. (#1432) (@maleadt) - Update manifest (#1433) (@github-actions[bot]) - Backports for 3.8.4 (#1438) (@maleadt)

- Julia
Published by github-actions[bot] about 4 years ago

CUDA - v3.8.3

CUDA v3.8.3

Diff since v3.8.2

Closed issues: - Sparse matrix addition not working (#528) - Native implementation of sparse arrays (#829) - CUSPARSE: Adding a value to the diagonal (#1372) - Conversion by cu casts Float64 to Float32 but not Int64 to Int32 (#1388) - CUDA.math_mode!(...; precision) option not working (#1392) - cuIpcGetMemHandle failure resulting in CUDA-aware MPI to fail (#1398) - axpby! support for BFloat16 (#1399) - CUSPARSE does not support integer matrices, breaks printing (#1402) - sparse(I, J, V) doesn't support unsorted inputs (#1407)

Merged pull requests: - General purpose broadcast for sparse CSR matrices. (#1380) (@maleadt) - Update manifest (#1389) (@github-actions[bot]) - Implement sparse operations with UniformScaling using broadcast. (#1390) (@maleadt) - Prevent toplevel compilation. (#1391) (@maleadt) - Fix and test math precision. (#1394) (@maleadt) - Bump artifacts (#1397) (@maleadt) - support BFloat16 for atomic_cas (#1400) (@bjarthur) - Implement sparse broadcasting with CSC matrices. (#1401) (@maleadt) - Always report issues with discovering CUDA. (#1404) (@maleadt) - Fix sparse 1-argument broadcast output type. (#1405) (@maleadt) - CUSPARSE BSR improvements (#1409) (@maleadt) - Support limited sparse integer arrays by bitcasting to floating point. (#1410) (@maleadt) - Support using sparse with unsorted inputs. (#1411) (@maleadt) - Backports for 3.8.3 (#1413) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.8.2

CUDA v3.8.2

Diff since v3.8.1

Closed issues: - CuSparseMatrixCSC missing lu and interactions with UniformScaling (#79) - CUSPARSE typo (#1231) - similar(A::CuSparse,eltype) returns an Array (#1316) - "errormonitor" undefined in julia1.6 (#1375) - Pool free can switch tasks (#1384)

Merged pull requests: - Define a compatibility shim for errormonitor (#1378) (@vchuravy) - Backport #1361 to 3.8 (#1379) (@vchuravy) - Backports for 3.8.2 (#1381) (@maleadt) - Remove broken errormonitor implementation, just don't use it on 1.6. (#1382) (@maleadt) - Memory pool improvements (#1383) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.8.1

CUDA v3.8.1

Diff since v3.8.0

Closed issues: - one(::CuMatrix) result on cpu (#142) - Broadcasted setindex! triggers scalar setindex! (#101) - OutOfGPUMemoryError With Available Memory (#1346) - Distributions.jl with CuArrays (#1347) - Views of Flux OneHotArrays (#1349) - synchronize(blocking = false) hangs in julia 1.7 eventually (#1350) - unsupported call through a literal pointer (call to log1pf) on Julia 1.6.5 (#1352) - SpecialFunctions ^1.8 compat entry? (#1354) - Performance deprecation using ^ on Float32 (#1358) - Method definition setindex!(LinearAlgebra.Diagonal{T, V} ... overwritten in module CUDA (#1364) - [PackageCompiler] Segmentation fault with CUDA.jl in multiversioning (#1365) - Vectors in customary structs make julia stuck (#1366) - sparseCSC-dense matrix multiplication yields unstable results (#1368) - UndefVarError: parameters not defined on Windows10 (#1371)

Merged pull requests: - Optimize memoization helpers. (#1345) (@maleadt) - Update manifest (#1348) (@github-actions[bot]) - Update manifest (#1355) (@github-actions[bot]) - Fastmath improvements (#1356) (@maleadt) - Make the default pool visible when doing P2P (#1357) (@maleadt) - Fix resize of empty arrays. (#1359) (@maleadt) - CUSPARSE: add COO ctors and similar with eltype. (#1360) (@maleadt) - Add device_override for SpecialFunctions.gamma (#1361) (@vchuravy) - Implement (limited) broadcast of sparse arrays (#1367) (@maleadt) - Make nonblocking synchronization robust to errors. (#1369) (@maleadt) - Update manifest (#1370) (@github-actions[bot]) - Backports for 3.8.1 (#1374) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.8.0

CUDA v3.8.0

Diff since v3.7.1

Closed issues: - Consider reserving memory (#1320)

Merged pull requests: - Slight changes to pool management (#1344) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.7.1

CUDA v3.7.1

Diff since v3.7.0

Closed issues: - Moving data between devices (#1136) - Repeated has_cuda_gpu errors when CUDA_VISIBLE_DEVICES is empty (#1331) - Error when env var CUDA_VISIBLE_DEVICES is set but empty (#1336)

Merged pull requests: - Wrap and test peer to peer memory copies (#1284) (@kshyatt) - Update manifest (#1332) (@github-actions[bot]) - Have libcuda() fail repeatedly if anything (e.g. init) failed. (#1333) (@maleadt) - Simplify workarounds. (#1334) (@maleadt) - Properly detect a missing driver. (#1335) (@maleadt) - Various small fixes (#1337) (@maleadt) - Move CUDA.jl global state innto CUDAdrv wrapper "submodule" (#1338) (@maleadt) - Add CUDA.return_type (#1339) (@tkf) - Compute-sanitizer QOL improvements and docs (#1340) (@maleadt) - Fix regression in backwards CUFFT plans. (#1341) (@maleadt) - Don't assume host pointers are directly usable on the device. (#1342) (@maleadt) - Backports for 3.7.1 (#1343) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.7.0

CUDA v3.7.0

Diff since v3.6.4

Closed issues: - mul! is missing for plan_fft! (#1311) - Segfault with CUDA in a sysimage (#1314) - CuSparse does not support broadcast (#1317) - CUDA.functional(true) errors instead of printing "why" and returning false (#1318) - Interesting timings (#1323) - Syncronization how to? (#1324)

Merged pull requests: - Remove debug info hack. (#1259) (@maleadt) - Update manifest (#1312) (@github-actions[bot]) - CUFFT improvements (#1313) (@maleadt) - Add additional quirks. (#1315) (@maleadt) - Use pointer to async_send directly instead of a wrapper function (#1319) (@vchuravy) - Update manifest (#1325) (@github-actions[bot]) - Add support and test CUDA 11.6. (#1326) (@maleadt) - Bump CUTENSOR, expose libcutensorMg. (#1327) (@maleadt) - Bump CUDNN to v8.3.2. (#1328) (@maleadt) - Enable use of CUDA 11.6. (#1329) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.6.4

CUDA v3.6.4

Diff since v3.6.3

Closed issues: - Artifacts.toml has bad git-tree-sha1 values (#1309)

Merged pull requests: - Fix artifact-related issues (#1310) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.6.3

CUDA v3.6.3

Diff since v3.6.2

Closed issues: - CUDA.@atomic deadlocks when overwriting NaN (#1299) - Unreasonablely slow copy kernel (#1301) - Passing a LogicalIndex(::CuArray) fails (#1304)

Merged pull requests: - Allow sorting of tuples of numbers (#1196) (@mcabbott) - Use === for generic atomic updates with compare-and-swap (#1300) (@guyvdbroeck) - Update manifest (#1302) (@github-actions[bot]) - Store the array length next to its dimensions. (#1303) (@maleadt) - Disallow calling CUDA device array intrinsics on the host. (#1305) (@maleadt) - Support logical indexing with CPU sources. (#1306) (@maleadt) - Activate a context when calling device!. (#1307) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.6.2

CUDA v3.6.2

Diff since v3.6.1

Closed issues: - Norm of complex-typed CuArray is not real (#1290) - Calling @show on Symmetric of a CuArray triggers Scalar Indexing (#1294) - CUSPARSE Error when solving a linear system (#1296)

Merged pull requests: - Correctly handle missing cached_memory. (#1295) (@maleadt) - Update manifest (#1297) (@github-actions[bot])

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.6.1

CUDA v3.6.1

Diff since v3.6.0

Closed issues: - reduceblock error on Complex type (#1289) - cudnn_cnn_infer64_8 could not be laoded (#1291) - Support to find the first k eigenvalues of a sparse matrix (#1292)

Merged pull requests: - Bump CUDNN artifacts (#1293) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.6.0

CUDA v3.6.0

Diff since v3.5.0

Closed issues: - Conversion issue (#157) - Extend new RNG to Complex numbers & normal distributions (#726) - Fatal errors during sorting tests (#916) - deepcopy failing (#1202) - Kernel compilation fails when specifying shared memory array size as a tuple consisting of block dimension and kernel argument (#1205) - ERROR: LoadError: The artifact at C:\Users\name.julia\artifacts\58bd87695e9ccdb508cb38be1ab717315ecc9152 is empty. (#1209) - InvalidIRError when displaying a model which is on the GPU (#1212) - CUDA.jl tries to load CUDA compat loaded via jll even though system package is installed (#1216) - Synchronizing over blocks (#1220) - assignment changes random seed (#1226) - accumulate gives wrong answer when init != 0 (#1227) - Generic dot kernel: use multiple kernels instead of atomics (#1244) - integer division error creating CuVector of missing and nothing (#1251) - unsupported dynamic function invocation with union type of more than 2 elements (#1252) - three CUDA.@atomic in a row result in out-of-bounds error (#1254) - Float16 CAS cannot use atom.cas.b16.global on sm61 (#1258) - cu(::SVector) gives SVector, cu(::MVector) gives CuArray (#1262) - Get back `unsafe_copyto!` methods for unified<-unified and unified<->device (#1263) - Passing and using a FFT plan in a CUDA kernel seems impossible (#1266) - Inplace Complex FFT and Threads (#1268) - `sort` returns nothing (#1270) - Release a new version (#1276) - `init_driver` not called in 3.5 (#1280) - Shared memory does not support isbits unions. (#1281) - NVIDIA Nsight Systems and `CUDA.@profile` error (#1282) - nvprof with `using CUDA` crashes julia (#1283)

Merged pull requests: - Addition over CuSparseMatrix (#1195) (@yuehhua) - [CUSOLVER] Add ordering functions (#1198) (@amontoison) - Correctly handle multi-GPU instances with NVML. (#1199) (@maleadt) - CI improvements. (#1200) (@maleadt) - fix FFT workarea typo leading to memory corruption (#1204) (@marius311) - Update manifest (#1206) (@github-actions[bot]) - Minor improvements for library wrappers (#1207) (@maleadt) - Various small improvements (#1210) (@maleadt) - Extend CuDeviceArray ctors for mixed-int indices. (#1211) (@maleadt) - Deprecate non-blocking sync, and always call the synchronization API. (#1213) (@maleadt) - Generic CUSPARSE: use the index arguments. (#1214) (@maleadt) - Add bitonic sort implementation (#1217) (@xaellison) - Update manifest (#1218) (@github-actions[bot]) - Reverted deepcopy, added test (#1221) (@birkmichael) - Use broadcast instead of copies to initialize mapreduce buffers. (#1223) (@maleadt) - Remove some unneeded Base module prefixes. (#1224) (@maleadt) - Update manifest (#1225) (@github-actions[bot]) - Cherry-picked improvements (#1228) (@maleadt) - Update introduction.jl (#1232) (@aramirezreyes) - Update manifest (#1233) (@github-actions[bot]) - Fix SpMV for CUDA 11.5 (#1234) (@amontoison) - Add support for randn and randexp. (#1236) (@maleadt) - Avoid double-initializing partial accumulate results. (#1237) (@maleadt) - Fix cuTENSOR contractions not working for FP16 inputs (#1238) (@thomasfaingnaert) - Bump CUTENSOR and fix on CUDA 11.5 (#1239) (@maleadt) - Support dot product on GPU between CuArrays with inconsistent eltypes (#1240) (@findmyway) - Update manifest (#1241) (@github-actions[bot]) - Optimize CUTENSOR contraction. (#1243) (@maleadt) - Don't use nondeterministic atomics in dot when requested. (#1245) (@maleadt) - Remove CUBLAS decomposition tests without pivoting. (#1246) (@maleadt) - Update manifest (#1247) (@github-actions[bot]) - wrap CUBLAS spmv and spr (#1248) (@bjarthur) - CompatHelper: bump compat for "SpecialFunctions" to "2" (#1249) (@github-actions[bot]) - Update manifest (#1250) (@github-actions[bot]) - Store array offset as elements to fix all-singleton case. (#1255) (@maleadt) - Update CUDA to 11.5 Update 1. (#1256) (@maleadt) - Use Base functionality for iteration Union type components. (#1257) (@maleadt) - Bump CI to Julia 1.7. (#1260) (@maleadt) - Update manifest (#1261) (@github-actions[bot]) - Use CUDA APIs for unoptimized copies. (#1265) (@maleadt) - Bump CUDNN to 8.3.1, enable CUDA 11.5 by default. (#1267) (@maleadt) - Adding stream update for inplace complex FFT (#1269) (@ovanvincq) - Fix sort! return type. (#1272) (@maleadt) - Add const keyword to type aliases declarations. (#1273) (@eliascarv) - Update manifest (#1274) (@github-actions[bot]) - Avoid eager expansion of CUDAcompat artifact string. (#1275) (@maleadt) - Allow copies between unified arrays in different contexts. (#1277) (@maleadt) - fix zeros and ones for user defined types (#1278) (@GiggleLiu) - Make CUDNN depend on CUBLAS. (#1279) (@maleadt) - Update manifest (#1286) (@github-actions[bot]) - Restore call to init_driver. (#1287) (@maleadt) - Improvements for isbits union shared memory (#1288) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.5.0

CUDA v3.5.0

Diff since v3.4.2

Closed issues: - Illegal memory access on 3.3 (#975) - Forward compatibility (#1071) - ambiguous sparse constructor (#1088) - Map reduce with float 16 (#1124) - Allow invalid GPU pointers not allowed in unsafe_wrap (#1125) - Scalar Indexing error in the Introduction docs (#1127) - stackoverflow when printing a custom subtype of AbstractCuSparseMatrix (#1128) - missing rand methods (#1138) - Error mapreducing over a 0 dimensional array (#1141) - seed! is not thread safe (#1158) - Simplify Int32-based indices (#1160) - Concatenating a scalar to a CuArray gives an Array (#1162) - Calling `byte_perm` with `Int32` values inserts sign checks (#1165) - `sum!` does not compile for large arrays (#1169) - Same random sequence on GPU and CPU? (#1170) - Specifying eltype and buffer type when adapting to `CuArray`? (#1171) - Inefficient `lop3.lut` instructions generated (#1172) - Writing temporary PTX files can fail (#1173) - Switching devices doesn't switch the REPL's output task (#1175) - GC is not working for CuSparseMatrixCSR (#1178) - sparse*dense operations shouldn't drop sparseness (#1188) - Raises illegal memory access error randomly (#1189)

Merged pull requests: - CI fixes (#950) (@maleadt) - implement sparse (#1093) (@CarloLucibello) - Use the kernel state object to pass the exception flag location. (#1110) (@maleadt) - Update manifest (#1123) (@github-actions[bot]) - Improve show methods in sparse GPU arrays. (#1129) (@maleadt) - Use warp intrinsics for a wider range of reductions. (#1130) (@maleadt) - Support wrapping a host buffer with a CuArray (#1131) (@maleadt) - support transpose CSC to CUDA CSR (#1132) (@Roger-luo) - Small improvements to discovery of local toolkits. (#1134) (@maleadt) - Rework device and context getters. (#1135) (@maleadt) - Avoid memory operations during graph capture. (#1137) (@maleadt) - Streamline the random number interface. (#1146) (@maleadt) - Native device synchronization (#1147) (@maleadt) - support interpret(reshape) (#1149) (@Roger-luo) - add a gitignore (#1150) (@Roger-luo) - Fix normalize on complex number (#1151) (@maleadt) - Addition and multiplication over cuarray and cusparse (#1152) (@maleadt) - Preserve Int32 hardware indices (#1153) (@maleadt) - remove mutable to make device sparse type bitstype (#1154) (@Roger-luo) - Update manifest (#1155) (@github-actions[bot]) - CompatHelper: bump compat for "BFloat16s" to "0.2" (#1156) (@github-actions[bot]) - Perform actual synchronization API calls when we need the memory (#1157) (@maleadt) - Binary dependency changes (#1159) (@maleadt) - Bump dependencies. (#1161) (@maleadt) - Generalize Sparse Array Indices Type in Struct Def (#1163) (@Roger-luo) - Use unchecked type conversions for byte_perm arguments (#1166) (@eschnett) - Fix performance regressions (#1167) (@maleadt) - Fix big mapreduce kernel for inputs without neutral element. (#1174) (@maleadt) - Switch contexts before performing memory operations on arrays (#1176) (@maleadt) - Improvements to stream-ordered memory management (#1177) (@maleadt) - Update manifest (#1180) (@github-actions[bot]) - Consistently use chars instead of raw enums in CUSPARSE/CUSOLVER functions. (#1181) (@maleadt) - Implement forward compatibility (#1182) (@maleadt) - Bump GPUCompiler for 1.8 compat. (#1183) (@maleadt) - Bump GPUArrays. (#1186) (@maleadt) - Update documentation (#1187) (@maleadt)

- Julia
Published by github-actions[bot] over 4 years ago

CUDA - v3.4.2

CUDA v3.4.2

Diff since v3.4.1

Closed issues: - Broadcasting a datatype does not work (#261) - CUDA error: invalid argument during Zygote/Flux gradient computation (#1107) - EXCEPTION_ACCESS_VIOLATION when using shared memory allocations. (#1116)

Merged pull requests: - add symmetric support for mul (#217) (@Roger-luo) - adds a device array type for CuSparseMatrixCSR to support using it in kernel functions (#1106) (@Roger-luo) - Update manifest (#1108) (@github-actions[bot]) - Specialize Ref{<:Type} for GPU compatibility. (#1109) (@maleadt) - Use the documented version of the enable_finalizers API. (#1111) (@maleadt) - Don't embed the method table in the AST. (#1112) (@maleadt) - Remove the hacky unique'ing of shmem GVs. (#1114) (@maleadt) - Introduce a macro for marking multiple functions as device-only. (#1117) (@maleadt) - Simplify library loading. (#1121) (@maleadt) - Backports for 3.4.2 (#1122) (@maleadt)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.4.1

CUDA v3.4.1

Diff since v3.4.0

Closed issues: - cudnnFindConvolutionAlgorithmWorkspaceSize uses removed function cached_memory (#1101)

Merged pull requests: - Update manifest (#1102) (@github-actions[bot]) - Release hotfixes (#1103) (@maleadt) - Reverse CI for NNlibCUDA.jl (#1104) (@maleadt)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.4.0

CUDA v3.4.0

Diff since v3.3.6

Merged pull requests: - Update GPUArrays and GPUCompiler. (#1100) (@maleadt)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.3.6

CUDA v3.3.6

Diff since v3.3.5

Closed issues: - LinearAlgebra.mul! with scalar arguments triggers scalar iteration (#790) - Kernel fails if input is struct with function (#1094) - cusparse: sparse matrix - matrix multiplication broken with transpose operation (#1095)

Merged pull requests: - lib cusparse: fix #1095 (broken sparse matrix-matrix multiplication with transpose operation) (#1096) (@frapac) - Only export the atomic macro on 1.6. (#1097) (@maleadt) - Support more inplace atomic operations. (#1098) (@maleadt) - Backports for 3.3.6 (#1099) (@maleadt)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.3.5

CUDA v3.3.5

Diff since v3.3.4

Closed issues: - Integer division error for the product of sparse times empty matrices (#962) - Bad conversion from QR to CuArray (#969) - Errors during installation test (#1004) - Be explicit about imports (#1028) - Exponentiation with constants can produce bad GPU code compared to the CPU (#1031) - rem uses wrong intrinsic (#1040) - test Cuda fails on gpuarrays\reductions/minimum maximum (#1043) - Broadcasted type conversion on literal value doesn't work (#1044) - CUDA overrides somehow screwing up customized printing? (#1055) - Is it possible to copy any data into GPU via recursive CuDeviceArray construction? (#1057) - CUDA doesn't compile after upgrade to Julia 1.6.2 (#1065) - Timing discrepancy between CUDA.@time and Benchmarktools for Flux model (#1067) - cannot convert range to Curray (#1070) - Thread safety issue with gemv! (#1072) - CuSparseMatrixCSC conversion errors (#1075) - cublasHgemmStridedBatched (#1076) - ERROR: UndefKeywordError: keyword argument elements not assigned (#1077) - Support for generating Float16 random numbers (#1081) - Illegal memory access during complex exponential with large imaginary part as exponent (#1085) - "Error: CUDA.jl does not yet support CUDA with ptxas 11.3.109" when using "JULIA_CUDA_USE_BINARYBUILDER=false" (#1089)

Merged pull requests: - Add support for unified arrays. (#1023) (@maleadt) - Look for libcuda in more places. (#1030) (@maleadt) - Detect common integer exponentiations and handle them directly. (#1033) (@maleadt) - Allow strided inputs to various library functions. (#1038) (@maleadt) - Use correct intrinsics for rem (#1041) (@simonbyrne) - update Package Manager link (#1052) (@ehgus) - Update manifest (#1054) (@github-actions[bot]) - Add test for math_mode (#1056) (@kshyatt) - Streamline atomics. (#1059) (@maleadt) - Add support for device capability-dependent code. (#1060) (@maleadt) - Adapt to GPUArrays changes. (#1061) (@maleadt) - Add special constructors to work around Base AbstractQ size weirdness. (#1063) (@maleadt) - Update manifest (#1064) (@github-actions[bot]) - Small allocator improvements (#1068) (@maleadt) - Latency improvements (bis) (#1069) (@maleadt) - lib: cusparse: fix #962 (#1073) (@thazhemadam) - Make handle cache thread-safe. (#1074) (@maleadt) - Bump GPUCompiler. (#1079) (@maleadt) - add support for half-precision gemm (#1080) (@bjarthur) - Extend and switch to the new CUDA RNG (#1082) (@maleadt) - cusparse: fix conversion from sparse matrix to dense matrix (#1083) (@maleadt) - Support/bump for CUDA 11.4.1 and CUDNN 8.2.2 (#1084) (@maleadt) - Use sincos from libdevice to perform illegal global load. (#1086) (@maleadt) - Bump GPUCompiler; use our own opt pipeline. (#1087) (@maleadt) - Update manifest (#1090) (@github-actions[bot]) - Backports for 3.3.5 (#1091) (@maleadt)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.3.4

CUDA v3.3.4

Diff since v3.3.3

Closed issues: - Cholesky on 1.8 doesn't dispatch correctly (#1046)

Merged pull requests: - restore lost tests (#1042) (@vchuravy) - Base.unsafe_lenght is deprecated on 1.8 (#1045) (@vchuravy) - Update manifest (#1048) (@github-actions[bot]) - Fix cholesky on 1.8, fix #1046 (#1049) (@kshyatt) - Backport changes for 3.3.4 (#1050) (@vchuravy)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.3.3

CUDA v3.3.3

Diff since v3.3.2

Merged pull requests: - Adapt to LLVM changes. (#1022) (@maleadt) - Update manifest (#1029) (@github-actions[bot]) - just some simple printing tests (#1032) (@kshyatt) - Test for is_capturing (#1034) (@kshyatt) - Tests for buffer printing (#1035) (@kshyatt) - Make it possible to change the pool alloc and handle types. (#1036) (@maleadt) - Backports for 3.3 (#1037) (@maleadt)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.3.2

CUDA v3.3.2

Diff since v3.3.1

Closed issues: - Missing artifacts errors (#1003) - Relax restriction on types allowed in kernels? (#1005) - PPC: Atomic{Float64} is not supported (#1008) - Unexpected result in combination with Zygote.gradient() (#1019) - Both ExprTools and LLVM export "parameters"; uses of it in module CUDA must be qualified (#1025)

Merged pull requests: - Fixes for artifact loading. (#1006) (@maleadt) - dlopen CUBLAS before CUTENSOR. (#1007) (@maleadt) - Use a plain integer to keep track of pool last use time. (#1009) (@maleadt) - More fixes to artifact discovery. (#1010) (@maleadt) - add custom structs tutorial (#1011) (@jw3126) - big mapreduce performance (#1012) (@xaellison) - Fixes for Julia 1.7 (#1013) (@maleadt) - Update manifest (#1014) (@github-actions[bot]) - Remove memory pools (#1015) (@maleadt) - Move refcounting to an array storage type (#1016) (@maleadt) - Remove unneeded disambiguation method. (#1017) (@maleadt) - Simplify context validity check. (#1018) (@maleadt) - Improve LazyInitialized (#1020) (@maleadt) - More allocator clean-ups (#1021) (@maleadt) - CUDA 11.4 (#1024) (@maleadt) - Only import from ExprTools what we need. (#1026) (@maleadt) - Backports release 3.3 (#1027) (@maleadt)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.3.1

CUDA v3.3.1

Diff since v3.3.0

Closed issues: - Reclaim with stream-ordered allocator (#952) - possible hanging with CUDA.@profile? (#961) - Upgrading from v3.2.1 to v3.3.0 broke my installation (#970) - Calls to has_cudnn running on wrong CuDevice? (#978) - Test does not run on MIT Supercloud after upgrading to 3.3.0 (#980) - Performance issue with complicated loops in function (#984) - Is it possible to set cache config in CUDA.jl? (#988) - @atomic should perform type conversions (#989) - Compatible NVIDIA driver but still got compatibility warning (#1001)

Merged pull requests: - Update manifest (#971) (@github-actions[bot]) - Fix disambiguation of CUDA 11.1 using CUSOLVER. (#972) (@maleadt) - Simplify initialization helper macro. (#973) (@maleadt) - Move at-typed_ccall to LLVM.jl. (#976) (@maleadt) - Replace workspace macro with function (#981) (@maleadt) - Implement and improve reclaim for the stream-ordered allocator (#983) (@maleadt) - Bump GPUCompiler to fix WMMA test issue. (#985) (@maleadt) - Rework memoization (#986) (@maleadt) - Fixes for CUBLAS/CUDNN logging (#987) (@maleadt) - Perform type conversions in at-atomic. (#990) (@maleadt) - Don't initialize the API when setting log callbacks. (#992) (@maleadt) - Create a helper for lazy, thread-safe initialization. (#993) (@maleadt) - Optimize library handles (#996) (@maleadt) - Optimize PerDevice for abstract element types. (#997) (@maleadt) - Update manifest (#999) (@github-actions[bot]) - Replace PerDevice with context-keyed dictionaries. (#1000) (@maleadt) - Improve launch latency (#1002) (@maleadt)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.3.0

CUDA v3.3.0

Diff since v3.2.1

Closed issues: - PTX code missing DWARF debug information (#72) - Suggestion - Disable AbstractArray indexing fallback by default (#178) - Support isbits Union Arrays (#103) - Missing norm(x, p) kernel (#84) - CUDA enhanced compatibility (#832) - Support for CuSparseMatrixCSC{Float16} x CuVector{Float16} (#849) - CuArray to zeroth power returns Matrix (#897) - Fatal errors during sorting tests (#916) - Error when computing reductions into a view with reduce_blocks > 1 (#919) - CUDA FFT plan application runs Out of Memory in Pluto (#926) - has_cuda() errors in CPU-only environments on master (#928) - Race condition when computing mean! of large arrays? (#929) - Supporting union bits types (#934) - test failing in device/intrinsics (#942) - Memory allocation fails for multi-GPU (#943) - Scalar operations when using output of cu(::OffsetArray) (#954) - Quicksort kernel does not cope with reduced threads (#955) - CUDA.jl cannot find installed CUPTI libraries with local installation on linux (#956) - Error for complex sparse-dense Matrix-vector multiplication (#958) - "using CUDA" gives error in type inference of Ref{Bool} (#965)

Merged pull requests: - Override outlined throw functions. (#874) (@maleadt) - Enable location and debug info. (#891) (@maleadt) - Compile using the toolkit, not the driver. (#892) (@maleadt) - Rework timings (#898) (@maleadt) - Fix #849, allow CUSPARSE to use F16 (#904) (@kshyatt) - Add Windows CI. (#907) (@maleadt) - Split test for better parallelization. (#908) (@maleadt) - Update manifest (#909) (@github-actions[bot]) - Improve package latency. (#910) (@maleadt) - Just some missing tests for CUBLAS (#911) (@kshyatt) - Fix bug and add tests for iamax/iamin (#913) (@kshyatt) - Fix profiler initialization and exception handling. (#914) (@maleadt) - Add a show method for devices(). (#915) (@maleadt) - Fix update of CUFFT handle. (#921) (@maleadt) - Update manifest (#922) (@github-actions[bot]) - Reinstate compatibility with Kepler GPUs. (#923) (@maleadt) - Use multiple GPUs on CI when available. (#924) (@maleadt) - Fix two-step mapreduce with wrapped output. (#925) (@maleadt) - Eagerly free the CUFFT workspace when generating a new one. (#927) (@maleadt) - Fix CUDA.function without throwing. (#930) (@maleadt) - Fix the REPL synchronization hook. (#931) (@maleadt) - Re-initialize the random seed every time. (#932) (@maleadt) - Protect against race in iterating compute processes. (#933) (@maleadt) - Helper function to get the device given a cu ptr. (#935) (@akashkgarg) - Implement CUDA's Enhanced Compatibility when selecting a toolkit. (#936) (@maleadt) - Update manifest (#939) (@github-actions[bot]) - Re-introduce specialization of cufunction. (#940) (@maleadt) - Support isbits union element types with CuArray. (#941) (@maleadt) - Try generating code with unreachable control flow. (#944) (@maleadt) - Upgrade to CUDA 11.3 Update 1. (#945) (@maleadt) - Always use exit instead of trap. (#947) (@maleadt) - Select devices without NVML. (#948) (@maleadt) - Fixes for Julia 1.7. (#949) (@maleadt) - Query the CUBLAS version without requiring a handle. (#951) (@maleadt) - Improve CUBLAS and CUDNN logging. (#953) (@maleadt) - Update manifest (#957) (@github-actions[bot]) - Enable sorting with reduced block sizes (#959) (@xaellison) - Adapt to GPUCompiler changes, bump GPUArrays. (#963) (@maleadt) - Adapt to change in allowscalar. (#964) (@maleadt) - Don't disable the CUDNN log callback on Windows. (#966) (@maleadt) - Use released dependencies. (#968) (@maleadt)

- Julia
Published by github-actions[bot] almost 5 years ago

CUDA - v3.2.1

CUDA v3.2.1

Diff since v3.2.0

Closed issues: - adding constant to an array: performance regression compared to CUDAdrv (#838) - CUDA.abs() on vector input: performance regression compared to CUDAdrv (#839) - CUDA.@sync seems to be using a lot of CPU while waiting (#893) - Memory leaks with repeated use of fft of a CUDA Array (#894) - CUDA.jl v3.2 seems to download wrong version of CUDNN and CUTENSOR (#899)

Merged pull requests: - Rework synchronization: first spin, then yield, and finally block. (#896) (@maleadt) - Make cusolvermg really optional. (#900) (@maleadt) - Rebuild artifacts. (#901) (@maleadt) - Take back control over the CUFFT work area. (#902) (@maleadt)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v3.2.0

CUDA v3.2.0

Diff since v3.1.0

Closed issues: - Explore CUDA graph API (#65) - Runtime functions are missing debug information (#53) - Native RNGs do not pass SmallCrush (#803) - Remaining threads/FFT/mult-gpu error (#876)

Merged pull requests: - Add wrappers for the CUDA graph API. (#877) (@maleadt) - Use the profiler API to start capture. (#878) (@maleadt) - Duplicate RNG state across block to avoid need for synchronization (#879) (@maleadt) - Support for printing tuples. (#880) (@maleadt) - Support unsigned inputs to integer intrinsics. (#881) (@maleadt) - Switch to Philox2x32 for device-side RNG (#882) (@maleadt) - Update manifest (#884) (@github-actions[bot]) - Treat CartesianIndices in views as scalars. (#886) (@maleadt) - Robustly get variables from the environment during init. (#887) (@maleadt) - Move Statistics functionality to GPUArrays. (#888) (@maleadt) - Update artifacts and use sources from unified JLLs. (#889) (@maleadt) - Lazy initialization of CUDNN and CUTENSOR (#890) (@maleadt) - Update manifest (#895) (@github-actions[bot])

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v3.1.0

CUDA v3.1.0

Diff since v3.0.3

Closed issues: - GPU Implementation of partialsort! (#93) - Document associativity requirements of scan/reduce operators (#819) - Problem in reduceblock? (#843) - CUDNN convolution incorrect for small images (#848) - Newly-spawned tasks should re-set the device (#851) - sort!(CUDA.zeros(2^25)) throws invalid configuration argument (code 9, cudaErrorInvalidConfiguration) (#852) - Type-preserving upload about cu in doc may be wrong (#855) - Memory corruption / segfault with Threads.@async and planned FFTs (#859) - Don't call nvmlErrorString (during init?) to prevent crashes on WSL (#860) - unsafe_copy3d! does not work with stream-ordered allocations (#863) - CUDA3 seems to have memory leak (#866)

Merged pull requests: - Implement statistics functions: correlation and covariance (#509) (@berquist) - @atomic support * and / (#842) (@yuehhua) - CUDNN docstring revisions. (#844) (@GunnarFarneback) - Sorting perf (again) (#845) (@xaellison) - Update manifest (#846) (@github-actions[bot]) - Remove extraneous apostrophe (#847) (@kshyatt) - reduceblock fixes. (#853) (@maleadt) - Fix sorting large arrays. (#854) (@maleadt) - Remove unsupported config launch keyword. (#856) (@maleadt) - Identify the buffer during unsafe_wrap to support unified free. (#857) (@maleadt) - Add support for CUDA 11.3. (#858) (@maleadt) - Work around buggy NVML initialization on WSL (#861) (@maleadt) - ae/partialsort (#864) (@xaellison) - Update manifest (#865) (@github-actions[bot]) - Improve multitasking with CUFFT. (#867) (@maleadt) - Introduce a HandleCache type. (#868) (@maleadt) - Improve multitasking with CURAND (#869) (@maleadt) - Document associativity requirement of accumulate (#870) (@HenriDeh) - Half-Precision Intrinsics (#871) (@iyaja) - Work around offset calculation bug in cuMemcpy3DAsync. (#872) (@maleadt) - fix #848: CUDNN convolution incorrect for small images (#873) (@denizyuret)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v3.0.3

CUDA v3.0.3

Diff since v3.0.2

Closed issues: - CUDA.jl init error in the REPL without using a CUDA feature (#841)

Merged pull requests: - Only synchronize the REPL when CUDA is configured. (#840) (@maleadt)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v3.0.2

CUDA v3.0.2

Diff since v3.0.1

Closed issues: - REPL display happens in different task, breaking synchronization (#831) - map! function raise a InvalidIRError (#833) - Compile error on shfl_down_sync (#834) - Error broadcasting some Base intrinsics (eq sqrt) over complex (but not real) CuArray (#836)

Merged pull requests: - Add an integration benchmark. (#835) (@maleadt) - Synchronize REPL expressions before returning. (#837) (@maleadt)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v3.0.1

CUDA v3.0.1

Diff since v3.0.0

Closed issues: - Sort overwriting values in target array (#822)

Merged pull requests: - sort bugfix (#823) (@xaellison) - Update manifest (#824) (@github-actions[bot]) - Test validation of GPU-only function. (#827) (@maleadt)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v3.0.0

CUDA v3.0.0

Diff since v2.6.3

Closed issues: - Driver crashes when running tests (#136) - dilation=0 causes CUDNN_STATUS_BAD_PARAM (#122) - CUBLASXT test errors (#112) - Program used external function '__nv_powi' which could not be resolved! (#109) - Wrapping Thrust for sorting (#107) - Prevent CUDA.@cufunc to transform the type's Int parameter to Int32. (#420) - CUBLASError: the GPU program failed to execute (#447) - Julia crashes on windows when using CUDA together with a system image (#479) - Consider link-time optimization (#505) - Segmentation fault on exiting Julia (#533) - Heisenbug in NNlib.conv! with nonzero beta (#736) - Benchmark suite segfaults on PRs (#794) - Performance issue with Pluto.jl (#815) - missing kernel for partialsort (#817)

Merged pull requests: - Improvements to try and fix benchmark OOMs (#809) (@maleadt) - Use the function type, not its instance, as the *Kernel typevar. (#816) (@maleadt) - More OOM fixes for the stream-ordered allocator (#818) (@maleadt) - Make reduce_warp exception free (#820) (@vchuravy) - Remove unused Adapt rule. (#821) (@maleadt)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v2.4.3

CUDA v2.4.3

Diff since v2.4.2

Closed issues: - Cannot select __powidf2 while lowering powi.f64 (#76) - Cannot select fpow while lowering pow.f32 (#71) - Any chance x^.a, for (a> 2.0) will be supported at some point? (#171) - support for vecnorm (#169) - Accidentally calling GPU intrinsics on the host causes segfaults (#60) - Partial support for Dual Numbers (#140) - Bug involving broadcasting, sqrt, and Flux (#130) - Add support for sincos, cis? (#42) - at-extalloc should not try/catch (#99) - Linalg support for non-contiguous views of arrays (#96) - Segfault during handle finalization (#95) - Error in `julia': double free or corruption (out): 0x00007ffee6bd8d00 (#88) - Kernel launch overhead regression from 1.7.3 (#80) - Thread safety issue with free (#595) - Bounds checking very slow with @views (#597) - unspecified launch failure (code 719, ERROR_LAUNCH_FAILED) (#606) - cusolver errors during `]test CUDA` (#616) - argmax (i.e. findmax) fails with Bool arrays (#659) - Set random seed is extremely slow (#685) - Modulo operator is actually the remainder operator (#748) - reinterpret on cuDynamicSharedMem throws ERROR_ILLEGAL_ADDRESS (#752) - sampler randbinomial! for generating binomially distributed CuArrays directly on the GPU (#767) - compute-sanitizer out-of-bounds failure (#780) - Could not load the CUDA 11.1.0 artifact (#784) - precompile error "ci_cache not defined" (#787) - Conditional element-wise assignment still won't compile (#789) - performance regression compared to CUDAdrv (#799) - Slow contiguous view() on a CuArray (#802) - ConvTranspose with negative padding fails on GPU (#810) - Performance issue with Pluto.jl (#815) - missing kernel for partialsort (#817)

Merged pull requests: - cuSolverMg wrappers (#308) (@kshyatt) - Use contextual dispatch for device functions. (#750) (@maleadt) - Implement reinterpret on CuDeviceArray (#755) (@tkf) - Quicksort performance update (#762) (@xaellison) - Add a basic implementation of rand() for use inside kernels (#772) (@S-D-R) - Deduplicate launch code. (#773) (@maleadt) - typo fixes - oliver (#774) (@kw-fn) - CompatHelper: add new compat entry for "SpecialFunctions" at version "1.3" (#775) (@github-actions[bot]) - Additional uses of contextual dispatch (#776) (@maleadt) - Use more inner constructors to ensure handle validity. (#777) (@maleadt) - Don't free memory asynchronously from finalizers. (#778) (@maleadt) - Set exception flag asynchronously. (#779) (@maleadt) - Debug memory pinning. (#781) (@maleadt) - Additional fixes for finalization when using the stream-ordered allocator (#782) (@maleadt) - Use legacy streams from finalizers to block on other streams (#783) (@maleadt) - Simplify thread/task state management. (#785) (@maleadt) - CompatHelper: add new compat entry for "RandomNumbers" at version "1.4" (#786) (@github-actions[bot]) - Speed-up rand: Tausworthe RNG with shared random state. (#788) (@maleadt) - Update manifest (#791) (@github-actions[bot]) - Optimize array construction and allocation (#792) (@maleadt) - Don't use the stream-ordered memory pool for PR benchmarks. (#795) (@maleadt) - Update introduction.jl (#796) (@Satvik) - Various improvements (#801) (@maleadt) - Speed-up view boundscheck. (#804) (@maleadt) - Don't use mod from libdevice. (#805) (@maleadt) - fix bounds check for reverse kwarg and add tests (#806) (@kshyatt) - Update manifest (#808) (@github-actions[bot]) - Improvements to try and fix benchmark OOMs (#809) (@maleadt) - Fix detection of need-for-cudadevrt. (#811) (@maleadt) - Backport #811 (#813) (@maleadt) - Use the function type, not its instance, as the *Kernel typevar. (#816) (@maleadt)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v2.6.3

CUDA v2.6.3

Diff since v2.6.2

Closed issues: - Cannot select __powidf2 while lowering powi.f64 (#76) - Cannot select fpow while lowering pow.f32 (#71) - Any chance x^.a, for (a> 2.0) will be supported at some point? (#171) - support for vecnorm (#169) - Accidentally calling GPU intrinsics on the host causes segfaults (#60) - Partial support for Dual Numbers (#140) - Bug involving broadcasting, sqrt, and Flux (#130) - Add support for sincos, cis? (#42) - at-extalloc should not try/catch (#99) - Linalg support for non-contiguous views of arrays (#96) - Segfault during handle finalization (#95) - Error in `julia': double free or corruption (out): 0x00007ffee6bd8d00 (#88) - Kernel launch overhead regression from 1.7.3 (#80) - Thread safety issue with free (#595) - Bounds checking very slow with @views (#597) - unspecified launch failure (code 719, ERROR_LAUNCH_FAILED) (#606) - cusolver errors during `]test CUDA` (#616) - argmax (i.e. findmax) fails with Bool arrays (#659) - Set random seed is extremely slow (#685) - Modulo operator is actually the remainder operator (#748) - reinterpret on cuDynamicSharedMem throws ERROR_ILLEGAL_ADDRESS (#752) - sampler randbinomial! for generating binomially distributed CuArrays directly on the GPU (#767) - compute-sanitizer out-of-bounds failure (#780) - Could not load the CUDA 11.1.0 artifact (#784) - precompile error "ci_cache not defined" (#787) - Conditional element-wise assignment still won't compile (#789) - performance regression compared to CUDAdrv (#799) - Slow contiguous view() on a CuArray (#802) - ConvTranspose with negative padding fails on GPU (#810)

Merged pull requests: - cuSolverMg wrappers (#308) (@kshyatt) - Use contextual dispatch for device functions. (#750) (@maleadt) - Implement reinterpret on CuDeviceArray (#755) (@tkf) - Quicksort performance update (#762) (@xaellison) - Add a basic implementation of rand() for use inside kernels (#772) (@S-D-R) - Deduplicate launch code. (#773) (@maleadt) - typo fixes - oliver (#774) (@kw-fn) - CompatHelper: add new compat entry for "SpecialFunctions" at version "1.3" (#775) (@github-actions[bot]) - Additional uses of contextual dispatch (#776) (@maleadt) - Use more inner constructors to ensure handle validity. (#777) (@maleadt) - Don't free memory asynchronously from finalizers. (#778) (@maleadt) - Set exception flag asynchronously. (#779) (@maleadt) - Debug memory pinning. (#781) (@maleadt) - Additional fixes for finalization when using the stream-ordered allocator (#782) (@maleadt) - Use legacy streams from finalizers to block on other streams (#783) (@maleadt) - Simplify thread/task state management. (#785) (@maleadt) - CompatHelper: add new compat entry for "RandomNumbers" at version "1.4" (#786) (@github-actions[bot]) - Speed-up rand: Tausworthe RNG with shared random state. (#788) (@maleadt) - Update manifest (#791) (@github-actions[bot]) - Optimize array construction and allocation (#792) (@maleadt) - Don't use the stream-ordered memory pool for PR benchmarks. (#795) (@maleadt) - Update introduction.jl (#796) (@Satvik) - Various improvements (#801) (@maleadt) - Speed-up view boundscheck. (#804) (@maleadt) - Don't use mod from libdevice. (#805) (@maleadt) - fix bounds check for reverse kwarg and add tests (#806) (@kshyatt) - Update manifest (#808) (@github-actions[bot]) - Fix detection of need-for-cudadevrt. (#811) (@maleadt) - Backport #811 (#813) (@maleadt)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v2.4.2

CUDA v2.4.2

Diff since v2.4.1

Closed issues: - High allocations and getindex (#150) - ResNet spending much time in CuArrays GC (#149) - Broadcast inference failure results in scalar iteration (#145) - Allocator very slow to reclaim memory after running for sufficiently long (#137) - Assignment using logical indexing (#131) - CUDNN convolution allocates outside of the memory pool (#111) - Logical indexing per-dim (#106) - Threading-related assertion failure in split allocator (#97) - dims support for softmax (#226) - Memory pinning needs more features (#242) - External allocations fail under high memory pressure (#340) - Incomplete CUDNN wrappers (#343) - softmax(x) and logsoftmax(x) update their arguments (#592) - Freeing large buffers takes a while (#594) - softmax has problem with dim parameter (#599) - CUDA 11.2 (#601) - gemmEx on sm52 results in CUBLAS_STATUS_ARCH_MISMATCH (#609) - could not load cublas64_11.dll (#670) - LLVM not found (#681) - about the document of conditional use (#689) - GPU run out of memory if 2 workers use the same GPU (#692) - CURAND handles are collected early (#699) - cudnnConvolutionForward fails memory checking (#702) - Deadlock during OOM (#706) - Segfault during trampoline allocation when querying occupancy from multiple threads (#707) - Ballot intrinsics should use .sync variety (#711) - cfunction $shmemcint use after free (#713) - OOM when evaluating a small resnet (with both Flux and Knet) (#714) - Support CUDA 11.2 Update 1 (#715) - Base.mapreducedim returns wrong answer with non-zero target array (#720) - CUBLAS_STATUS_ARCH_MISMATCH (#722) - Test failures on linux (#727) - Switching devices causes GC errors (#731) - Pin CPU buffers when doing memory copies (#735) - Memory free error with CUDA 11.2 and multi threads/GPUs (#737) - Per-device memory pool (#742) - Could not load library cudnn_ops_infer64_8.dll (#757) - CUDA.lgamma(x) crashes Julia (#758)

Merged pull requests: - New high level interface for cuDNN (#523) (@denizyuret) - bilinear upsampling (#636) (@maxfreu) - Automatic task-based concurrency using local streams (#662) (@maleadt) - Fix version lookups. (#671) (@maleadt) - add beta keyword to conv (#672) (@jw3126) - Update manifest (#673) (@github-actions[bot]) - Protect the kernel closure from GC collection. (#674) (@maleadt) - Track external globals, use it to avoid needless exception flags (#675) (@maleadt) - Adapt to GPUCompiler changes. (#676) (@maleadt) - Minor improvements (#677) (@maleadt) - CompatHelper: add new compat entry for "Memoize" at version "0.4" (#678) (@github-actions[bot]) - Use CUDA 11.2's stream-ordered allocator (#679) (@maleadt) - Support an additional nvdisasm version. (#680) (@maleadt) - Add fast getristridedbatch (#682) (@cfranken) - Use released GPUCompiler. (#683) (@maleadt) - v2.6.1 (#684) (@maleadt) - Fix race during multi-threaded init. (#687) (@maleadt) - Update manifest (#690) (@github-actions[bot]) - Change to Buildkite v1 plugins. (#691) (@maleadt) - CUPTI improvements for multithreading (#693) (@maleadt) - Fix exception flag linkage for linking. (#694) (@maleadt) - Update manifest (#695) (@github-actions[bot]) - Use simpler try/catch in show(CuError). (#696) (@maleadt) - fix bug in CURAND.jl's setstream function. (#698) (@norci) - Update CUDNN to 8.1. (#701) (@maleadt) - Remove special-cased algorithm selection for CUDNN convolution (#703) (@denizyuret) - Keep track of active handles to avoid early collection. (#704) (@maleadt) - Backports for Julia 1.5 (#705) (@maleadt) - Support for cushow-ing multiple values, including LLVMPtrs. (#709) (@maleadt) - Make CUDNN tests eagerly invoke at-test for better error reporting. (#710) (@maleadt) - Report JIT error log with linker errors. (#712) (@maleadt) - Update manifest (#716) (@github-actions[bot]) - Flip exceptionflag filter! predicate (#717) (@S-D-R) - Keep some memory reserved for external allocations. (#718) (@maleadt) - Upgrade CUDA 11.2 to Update 1. (#719) (@maleadt) - Add Abstract FFT compat (#721) (@DhairyaLGandhi) - Add support for and switch test to warp-synchronous vote intrinsics. (#723) (@maleadt) - Specialize Base.toindex for AnyCuArray{Bool} (#724) (@pabloferz) - Update manifest (#728) (@github-actions[bot]) - Eagerly dlopen cublasLt to prevent a system library getting picked up. (#729) (@maleadt) - Switch tests over to compute-sanitizer. (#730) (@maleadt) - Perform pool operations in the correct context. (#732) (@maleadt) - Streamline use of retryreclaim. (#733) (@maleadt) - Threading fixes (#734) (@maleadt) - copied the old rnn.jl->rnncompat.jl for Flux compatibility (#738) (@denizyuret) - fix testmode batchnorm back (#739) (@CarloLucibello) - Don't error out if failing to parse the local CUDA version. (#740) (@maleadt) - Backport #739 (#741) (@maleadt) - Add back an older artifact for CUDNN on PPC with CUDA 10.2. (#743) (@maleadt) - Use the default memory pool. (#745) (@maleadt) - Use a memory pool per device. (#746) (@maleadt) - Test sort with at-test at the toplevel, for better reporting. (#749) (@maleadt) - remove NNlib (#753) (@CarloLucibello) - Update manifest (#754) (@github-actions[bot]) - Fix cuda-memcheck, don't use memory pools. (#756) (@maleadt) - Update generated wrappers (#759) (@maleadt) - Rework memory pinning and speed up async ops on unpinned memory (#760) (@maleadt) - Improve context switching (#761) (@maleadt) - Update manifest (#765) (@github-actions[bot]) - Docs on multitasking (#766) (@maleadt) - Update to CUDA 11.2 Update 2. (#768) (@maleadt) - Small backports for CUDA 2.4 / Julia 1.5 (#770) (@maleadt)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v2.6.2

CUDA v2.6.2

Diff since v2.6.1

Closed issues: - High allocations and getindex (#150) - ResNet spending much time in CuArrays GC (#149) - Broadcast inference failure results in scalar iteration (#145) - Allocator very slow to reclaim memory after running for sufficiently long (#137) - Assignment using logical indexing (#131) - CUDNN convolution allocates outside of the memory pool (#111) - Logical indexing per-dim (#106) - Threading-related assertion failure in split allocator (#97) - dims support for softmax (#226) - Memory pinning needs more features (#242) - External allocations fail under high memory pressure (#340) - Incomplete CUDNN wrappers (#343) - softmax(x) and logsoftmax(x) update their arguments (#592) - Freeing large buffers takes a while (#594) - softmax has problem with dim parameter (#599) - gemmEx on sm52 results in CUBLAS_STATUS_ARCH_MISMATCH (#609) - about the document of conditional use (#689) - GPU run out of memory if 2 workers use the same GPU (#692) - CURAND handles are collected early (#699) - cudnnConvolutionForward fails memory checking (#702) - Deadlock during OOM (#706) - Segfault during trampoline allocation when querying occupancy from multiple threads (#707) - Ballot intrinsics should use .sync variety (#711) - cfunction $shmemcint use after free (#713) - OOM when evaluating a small resnet (with both Flux and Knet) (#714) - Support CUDA 11.2 Update 1 (#715) - Base.mapreducedim returns wrong answer with non-zero target array (#720) - CUBLAS_STATUS_ARCH_MISMATCH (#722) - Test failures on linux (#727) - Switching devices causes GC errors (#731) - Pin CPU buffers when doing memory copies (#735) - Memory free error with CUDA 11.2 and multi threads/GPUs (#737) - Per-device memory pool (#742) - Could not load library cudnn_ops_infer64_8.dll (#757) - CUDA.lgamma(x) crashes Julia (#758)

Merged pull requests: - New high level interface for cuDNN (#523) (@denizyuret) - bilinear upsampling (#636) (@maxfreu) - Use CUDA 11.2's stream-ordered allocator (#679) (@maleadt) - Support an additional nvdisasm version. (#680) (@maleadt) - Add fast getristridedbatch (#682) (@cfranken) - Fix race during multi-threaded init. (#687) (@maleadt) - Update manifest (#690) (@github-actions[bot]) - Change to Buildkite v1 plugins. (#691) (@maleadt) - CUPTI improvements for multithreading (#693) (@maleadt) - Fix exception flag linkage for linking. (#694) (@maleadt) - Update manifest (#695) (@github-actions[bot]) - Use simpler try/catch in show(CuError). (#696) (@maleadt) - fix bug in CURAND.jl's setstream function. (#698) (@norci) - Update CUDNN to 8.1. (#701) (@maleadt) - Remove special-cased algorithm selection for CUDNN convolution (#703) (@denizyuret) - Keep track of active handles to avoid early collection. (#704) (@maleadt) - Backports for Julia 1.5 (#705) (@maleadt) - Support for cushow-ing multiple values, including LLVMPtrs. (#709) (@maleadt) - Make CUDNN tests eagerly invoke at-test for better error reporting. (#710) (@maleadt) - Report JIT error log with linker errors. (#712) (@maleadt) - Update manifest (#716) (@github-actions[bot]) - Flip exceptionflag filter! predicate (#717) (@S-D-R) - Keep some memory reserved for external allocations. (#718) (@maleadt) - Upgrade CUDA 11.2 to Update 1. (#719) (@maleadt) - Add Abstract FFT compat (#721) (@DhairyaLGandhi) - Add support for and switch test to warp-synchronous vote intrinsics. (#723) (@maleadt) - Specialize Base.toindex for AnyCuArray{Bool} (#724) (@pabloferz) - Update manifest (#728) (@github-actions[bot]) - Eagerly dlopen cublasLt to prevent a system library getting picked up. (#729) (@maleadt) - Switch tests over to compute-sanitizer. (#730) (@maleadt) - Perform pool operations in the correct context. (#732) (@maleadt) - Streamline use of retryreclaim. (#733) (@maleadt) - Threading fixes (#734) (@maleadt) - copied the old rnn.jl->rnncompat.jl for Flux compatibility (#738) (@denizyuret) - fix testmode batchnorm back (#739) (@CarloLucibello) - Don't error out if failing to parse the local CUDA version. (#740) (@maleadt) - Backport #739 (#741) (@maleadt) - Add back an older artifact for CUDNN on PPC with CUDA 10.2. (#743) (@maleadt) - Use the default memory pool. (#745) (@maleadt) - Use a memory pool per device. (#746) (@maleadt) - Test sort with at-test at the toplevel, for better reporting. (#749) (@maleadt) - remove NNlib (#753) (@CarloLucibello) - Update manifest (#754) (@github-actions[bot]) - Fix cuda-memcheck, don't use memory pools. (#756) (@maleadt) - Update generated wrappers (#759) (@maleadt) - Rework memory pinning and speed up async ops on unpinned memory (#760) (@maleadt) - Improve context switching (#761) (@maleadt) - Update manifest (#765) (@github-actions[bot]) - Docs on multitasking (#766) (@maleadt) - Update to CUDA 11.2 Update 2. (#768) (@maleadt) - Small backports for CUDA 2.4 / Julia 1.5 (#770) (@maleadt) - Backports for CUDA 2.6 / Julia 1.6 (#771) (@maleadt)

- Julia
Published by github-actions[bot] about 5 years ago

CUDA - v2.6.1

CUDA v2.6.1

Diff since v2.6.0

Closed issues: - CUDA 11.2 (#601) - LLVM not found (#681)

Merged pull requests: - Automatic task-based concurrency using local streams (#662) (@maleadt) - add beta keyword to conv (#672) (@jw3126) - Protect the kernel closure from GC collection. (#674) (@maleadt) - Track external globals, use it to avoid needless exception flags (#675) (@maleadt) - Adapt to GPUCompiler changes. (#676) (@maleadt) - Minor improvements (#677) (@maleadt) - CompatHelper: add new compat entry for "Memoize" at version "0.4" (#678) (@github-actions[bot]) - Use released GPUCompiler. (#683) (@maleadt) - v2.6.1 (#684) (@maleadt)

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.6.0

CUDA v2.6.0

Diff since v2.5.0

Closed issues: - Invalid results due to shared memory + multiple function exits (?) mysteriously solved by @cuprintf (#43) - NVML-related segfault on Windows (#610) - @cuda with config keyword sometimes allocate lots of memory (#643) - Can someone with push access run the TagBot workflow? (#644) - Taking gradient with Flux results in NaNs when using CUDA arrays but not when using CPU arrays (#657) - Broadcasting fails in a special case (#658) - view causes KeyError in alias (#661) - PTXCompilerTarget error when creating a CuArray with Float64 (#664) - Complex dot product performance of CuArrays and of StructArrays of CuArrays (#667) - could not load cublas64_11.dll (#670)

Merged pull requests: - CUDA quicksort (#431) (@xaellison) - Bump Reexport to 1.0 (#640) (@DhairyaLGandhi) - Use newer NVML initialization method. (#641) (@maleadt) - README: add some information on viewing capabilities of your devices (#642) (@DilumAluthge) - Remove duplicate functions. (#645) (@maleadt) - Use released version of Adapt.jl (#646) (@maleadt) - Simplify list of tests to skip. (#647) (@maleadt) - Use a test-specific Project.toml. (#648) (@maleadt) - Use raw output for CUBLAS log message. (#649) (@maleadt) - Close the async condition used to call host functions. (#650) (@maleadt) - Backports for Julia 1.5 / CUDA 2.4 (#651) (@maleadt) - Allow running benchmarks outside of the master branch on other systems. (#652) (@maleadt) - Bump GPUCompiler. (#653) (@maleadt) - Reuse the compiler when generating SASS code. (#654) (@maleadt) - Run the tests from the current directory. (#655) (@maleadt) - Configure the PTX GPUCompiler codegen quirks. (#656) (@maleadt) - Update manifest (#660) (@github-actions[bot]) - Support view on unmanaged arrays. (#663) (@maleadt) - Retry CuModule creation when OOM. (#665) (@maleadt) - Make fill async. (#669) (@maleadt) - Fix version lookups. (#671) (@maleadt) - Update manifest (#673) (@github-actions[bot])

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.4.1

CUDA v2.4.1

Diff since v2.4.0

Closed issues: - cudaconvert for closures (#67) - Invalid results due to shared memory + multiple function exits (?) mysteriously solved by @cuprintf (#43) - NVML-related segfault on Windows (#610) - Update Reexport compat (#629) - Incomplete CUDA device attributes list (#637) - @cuda with config keyword sometimes allocate lots of memory (#643) - Can someone with push access run the TagBot workflow? (#644) - Taking gradient with Flux results in NaNs when using CUDA arrays but not when using CPU arrays (#657) - Broadcasting fails in a special case (#658) - view causes KeyError in alias (#661) - PTXCompilerTarget error when creating a CuArray with Float64 (#664) - Complex dot product performance of CuArrays and of StructArrays of CuArrays (#667)

Merged pull requests: - CUDA quicksort (#431) (@xaellison) - cudaconvert captured values in closures. (#625) (@maleadt) - CompatHelper: only instantiate /Manifest.toml (the manifest file in the root of the repository) (#631) (@DilumAluthge) - CompatHelper: bump compat for "Reexport" to "1.0" (#633) (@github-actions[bot]) - CompatHelper: bump compat for "AbstractFFTs" to "1.0" (#634) (@github-actions[bot]) - Update wrappers (#638) (@maleadt) - Bump artifacts for Windows/Julia 1.6 compatibility. (#639) (@maleadt) - Bump Reexport to 1.0 (#640) (@DhairyaLGandhi) - Use newer NVML initialization method. (#641) (@maleadt) - README: add some information on viewing capabilities of your devices (#642) (@DilumAluthge) - Remove duplicate functions. (#645) (@maleadt) - Use released version of Adapt.jl (#646) (@maleadt) - Simplify list of tests to skip. (#647) (@maleadt) - Use a test-specific Project.toml. (#648) (@maleadt) - Use raw output for CUBLAS log message. (#649) (@maleadt) - Close the async condition used to call host functions. (#650) (@maleadt) - Backports for Julia 1.5 / CUDA 2.4 (#651) (@maleadt) - Allow running benchmarks outside of the master branch on other systems. (#652) (@maleadt) - Bump GPUCompiler. (#653) (@maleadt) - Reuse the compiler when generating SASS code. (#654) (@maleadt) - Run the tests from the current directory. (#655) (@maleadt) - Configure the PTX GPUCompiler codegen quirks. (#656) (@maleadt) - Update manifest (#660) (@github-actions[bot]) - Support view on unmanaged arrays. (#663) (@maleadt) - Retry CuModule creation when OOM. (#665) (@maleadt) - Make fill async. (#669) (@maleadt)

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.5.0

CUDA v2.5.0

Diff since v2.4.0

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.4.0

CUDA v2.4.0

Diff since v2.3.0

Closed issues: - cublasXtStrmm test failures on Windows 10 Julia 1.1 (#124) - CUSPARSE tests broken (#259) - Make @cuda return a kernel object (#341) - Depend on CompilerSupportLibraries (#359) - CUBLAS and exceptions test failures on Windows (#536) - argmax(::CuArray) returns nothing with NaN-values (#553) - Multiple @cuDynamicSharedMem in kernel causes unexpected behavior (#555) - Illegal memory access with atomic shared memory (#558) - CUDA.sqrt will not found symbol "__nv_sqrt" (#559) - Exception with CUDA.exp (#561) - Use LazyArtifacts instead of Pkg (#570) - Test runner: early bail out (#578) - memory reporting issue (#579) - c[3:4]=0 leads to exception (#580) - Add math ops (including broadcast) for half types (#581) - Dot product of Array and CuArray fails with CPU address error. (#586) - Support for CUDA-capable GPU with compute capability 4.0 like GTX 1080 (#587) - mapreducedim! not threadsafe (#588) - Allow separate directories for cuda and cudnn (#590) - Difficulties installing CUDA on Julia 1.6.0 . (#591) - Bug in Initialisation Error (#603) - CUDA.jl initialisation fails after suspending Ubuntu 20.04 with CUDA 11.2 (#605) - CUDA 11.2 CUBLASError and "CUDA.jl does not yet support CUDA with nvdisasm 11.2.67" (#607) - This intrinsic must be compiled to be called (#611) - OpenGL interop (#612) - Add support for CuFFT callback functions (#614) - I can’t multiply a CSR sparse matrix anymore (#615) - Julia version requirement (#619)

Merged pull requests: - Support all combinations of datatypes and transposes/adjoints in LinearAlgebra (#535) (@cqql) - Use structs for texture intrinsic return types. (#554) (@maleadt) - Backport some 1.6 fixes (#557) (@maleadt) - Update manifest (#560) (@github-actions[bot]) - Correct dims error (#562) (@DhairyaLGandhi) - Lock _shmem_cb (#564) (@vchuravy) - Move to Julia 1.6 (#566) (@maleadt) - Adapt to JuliaLang/julia#38487. (#568) (@maleadt) - Support for 'delayed kernels' (#569) (@maleadt) - Run cuda-memcheck as part of CI (#571) (@maleadt) - Use at-sync instead of calls to synchronize in tests. (#572) (@maleadt) - Update artifacts to include cuda-memcheck (#573) (@maleadt) - Use LazyArtifacts instead of Pkg. (#574) (@maleadt) - Improve LinearAlgebra impl methods for triangular types (#575) (@maleadt) - New findmin/max implementation using single-pass reduction (#576) (@maleadt) - Fix synchronization before testing cublasXt calls. (#577) (@maleadt) - Fix used memory reporting. (#582) (@maleadt) - Implement Statistics.varm/stdm instead of Statistics._var (#583) (@sdewaele) - Test for #558. (#584) (@maleadt) - Add a quick failure option to the test runner. (#585) (@maleadt) - Add lock around cfunction lookup (#589) (@vchuravy) - Catch all initialization errors. (#593) (@maleadt) - Update dependencies. (#596) (@maleadt) - Fix wrong initialisation error message (#604) (@qin-yu) - Fixes wrong spacing in docstring admonition (#608) (@navidcy) - Fix broadcasting with Base.angle (#618) (@marius311) - Test with the 1.6 nightly, not 1.7. (#620) (@maleadt) - Wrap cudaGL.h (#621) (@maleadt) - Initial compatibility with CUDA 11.2. (#622) (@maleadt) - 1.5 compatibility release (#623) (@maleadt) - Add CUDA 11.2 artifacts. (#624) (@maleadt)

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.3.0

CUDA v2.3.0

Diff since v2.2.1

Closed issues: - Misaligned address on load from Const (#548)

Merged pull requests: - Allow PermutedDimsArray in gemm_strided_batched (#539) (@mcabbott) - Fix broken checkbounds for CuSparseMatrixCSR and tests (#545) (@achuchmala) - Emphasize rebooting option. (#547) (@xanfus) - fix address calculation for ldg (#549) (@vchuravy) - Don't use explicit per-stream threads. (#551) (@maleadt)

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.2.1

CUDA v2.2.1

Diff since v2.2.0

- Julia
Published by maleadt over 5 years ago

CUDA - v2.2.0

CUDA v2.2.0

Diff since v2.1.0

Closed issues: - cudnn missing after downloading artifact (#521) - Downloading artifact: CUDA110 when using DiffEqFlux (#542)

Merged pull requests: - Update manifest (#520) (@github-actions[bot]) - Try out Buildkite. (#522) (@maleadt) - Update manifest (#529) (@github-actions[bot]) - Support for / Upgrade to CUDA 11.1 update 1. (#530) (@maleadt) - Fix and test svd! (#531) (@maleadt) - Move more CI to Buildkite. (#532) (@maleadt) - Use type symbols to generate wrapper methods (#534) (@cqql) - Fully move to Buildkite. (#537) (@maleadt) - Add unit_diag option for sv2! functions (#540) (@amontoison) - Documentation fixes (#543) (@maleadt)

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.1.0

CUDA v2.1.0

Diff since v2.0.2

Closed issues: - CUDNN convolution with Float16 always returns zeros (#92) - axp(b)y! and mul! (scalar multiplication) with mixed argument types (#144) - Dispatching to generic matmul instead of CUBLAS (#164) - Support for Ints and Float16? (#165) - Subarrays/views support (#172) - Easy way to pick among multiple GPUs (#174) - More prominently document JULIA_CUDA_USE_BINARYBUILDER (#204) - ERROR_COOPERATIVE_LAUNCH_TOO_LARGE during tests (#247) - Pkg.test error for cutensor test on Windows (#422) - Runtime build improvements (#456) - Fusing Wrappers (#467) - Could not find nvToolsExt (libnvToolsExt.dylib.1.0 or libnvToolsExt.dylib.1) in /Users/imac/.julia/artifacts/b502baf54095dff4a69fd6aba8667124583f6929/lib (#482) - mapreduce assumes commutative op (#484) - SubArray Broadcast Bug in 2.0 (#488) - Nested SubArray Scalar Indexing (#490) - Sparse matrix * view(vector) regression in 2.0 (#493) - Error transforming a reshaped 0-dimensional GPU array to a CPU array (#494) - test cuda FAILURE (#496) - Reshaped CuArray is not DenseCuArray (#511) - assignment failure when using array slicing. (#516)

Merged pull requests: - Use the correct CUDNN scaling parameter type. (#454) (@maleadt) - Fix versioned dylib discovery. (#486) (@maleadt) - Move inv from GPUArrays. (#487) (@maleadt) - Use dense array types in sparse wrappers. (#495) (@maleadt) - Update manifest (#497) (@github-actions[bot]) - Revert array wrapper union changes (#498) (@maleadt) - Clean-up pointer field. (#499) (@maleadt) - mapreduce: change iteration for compatibility with non-commutative operators. (#500) (@maleadt) - Use versioned libcuda (#502) (@maleadt) - Dynamically choose versioned libcuda (#503) (@mustafaquraish) - Update multigpu.md (#504) (@efmanu) - Upgrade artifacts for CUDA 11 compatibility. (#506) (@maleadt) - Update dependencies. (#507) (@maleadt) - Convert unsigned short ints to Cint for printf. (#508) (@maleadt) - Update manifest (#510) (@github-actions[bot]) - Fix reshape with missing dimensions. (#512) (@maleadt) - Don't return a pointer from 'alias'. (#513) (@maleadt) - Add some docs (#514) (@maleadt) - Fix CUDNN-optimized activation broadcasts (#515) (@maleadt) - Fix cooperative launch test. (#517) (@maleadt) - Fixes for Windows (#518) (@maleadt) - CUTENSOR fixes on Windows (#519) (@maleadt)

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.0.2

CUDA v2.0.2

Diff since v2.0.1

Closed issues: - cu() behavior for complex floating point numbers (#91) - Error when following example on using multiple GPUs on multiple processes (#468) - MacOS without nvidia GPU is trying to download CUDA111 on julia nightly (#469) - Drop BinaryProvider? (#474) - Latest version of master doesn't work on Windows (#477) - sum(CUDA.rand(3,3)) broken (#480) - copyto!() between cpu and gpu with subarrays (#491)

Merged pull requests: - Adapt to GPUCompiler changes. (#458) (@maleadt) - Fix initialization of global state (#471) (@maleadt) - Remove 'view' implementation. (#472) (@maleadt) - Workaround new artifact"" eagerness that prevents loading on unsupported platforms (#473) (@ianshmean) - Remove BinaryProvider dep. (#475) (@maleadt) - typo: libcuda.dll -> libcuda.so on Linux (#476) (@Alexander-Barth) - NFC array simplifications. (#481) (@maleadt) - Update manifest (#485) (@github-actions[bot]) - Convert AbstractArray{ComplexF64} to CuArray{ComplexF32} by default (#489) (@pabloferz)

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.0.1

CUDA v2.0.1

Diff since v2.0.0

Closed issues: - Can't update (#462)

Merged pull requests: - Remove duplicate comment (#464) (@blegat) - Add functionality to precompile the runtime library. (#465) (@maleadt) - Update manifest (#470) (@github-actions[bot])

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v2.0.0

CUDA v2.0.0

Diff since v1.3.3

Closed issues: - Test failure during threading tests (#15) - Bad allocations in memory pool after device_reset! (#16) - CuArrays can lose Blas on reshaped views (#78) - allowscalar performance (#87) - Indexing with a CuArrays causes a 'scalar indexing disallowed' error from checkbounds (#90) - 5-arg mul! for CUSPARSE (#98) - copyto!(Device, Host) uses scalar iteration in case of type mismatch (#105) - Array primitives broken for CUSPARSE arrays (#113) - SplittingPool: CPU allocations (#117) - error while concatenating to an empty CuArray (#139) - Showing sparse arrays goes wrong (#146) - Improve test coverage (#147) - CuArrays allocates a lot of memory on the default GPU (#153) - [Feature Request] Indexing CuArray with CuArray (#155) - Reshaping CuArray throws error during backpropagation (#162) - Match syntax and APIs against Julia 1.0 standard libraries (#163) - CURAND_STATUS_PREEXISTING_FAILURE when setting seed multiple times. (#212) - RFC: converts SparseMatrixCSC to CuSparseMatrixCSR via cu by default (#216) - Add a CuSparseMatrixCOO type (#220) - Test runner stumbles over path separators (#236) - Error: Invalid bitcode signature when loading CUDA.jl after precompilation (#293) - Atomic operations only work on global memory (#311) - Performance: cudnn algorithm selection (#318) - CUSPARSE is broken in CUDA.jl 1.2 (#322) - Device-side broadcast regression on 1.5 (#350) - API for fast math-like mode (#354) - CUDA 11.0 Update 1: cublasSetWorkspace (#365) - Can't precompile CUDA.jl on Kubuntu 20.04 (#396) - CuPtr should be Ptr in cudnnGetDropoutDescriptor (#397) - CUDA throws OOM error when initializing API on multiple devices (#398) - Cannot launch kernel with > 5 args using Dynamic Parallelism (#401) - Reverse performance regression (#410) - Tag for LLVM 3? (#412) - CUDA not working (#415) - StatsBase.transform fails on CuArray (#426) - Further unification of CUBLAS.axpy! and LinearAlgebra.BLAS.axpy! (#432) - size(range), length(range) and range[end] fail inside CUDA kernels (#434) - InitError: Cannot use memory pool 'binned' when CUDA.jl was precompiled for memory pool 'split'. (#446) - Missing dispatch for matrix multiplication with views? (#448) - New version not available yet? (#452) - using CUDA or CUArray, output: UndefVarError: AddrSpacePtr not defined (#457) - Unable to upgrade to the latest version (#459)

Merged pull requests: - Performance improvements by calling cuDNN API (#321) (@gartangh) - Use ccall wrapper for correct pointer type conversions (#392) (@maleadt) - Simplify Statistics.var and fix dims=tuple. (#393) (@maleadt) - Adapt to GPUArrays test change. (#394) (@maleadt) - Default to per-thread stream semantics (#395) (@maleadt) - Add a missing context argument for stateless codegen. (#399) (@maleadt) - Keep track of package latency timings. (#400) (@maleadt) - Update manifest (#402) (@github-actions[bot]) - Latency improvements (#403) (@maleadt) - Fix bounds checking with GPU views. (#404) (@maleadt) - Force specialization for dynamic_cudacall to support more arguments. (#407) (@maleadt) - Fix some wrong pointer types in the CUDNN headers. (#408) (@maleadt) - Refactor CUSPARSE (#409) (@maleadt) - Fix typo (#411) (@yixingfu) - Update manifest (#413) (@github-actions[bot]) - Simplify library wrappers by introducing a CUDA Ref (#414) (@maleadt) - Simplify and update wrappers (#416) (@maleadt) - GEMM improvements (#417) (@maleadt) - CompatHelper: add new compat entry for "BFloat16s" at version "0.1" (#418) (@github-actions[bot]) - add CuSparseMatrixCOO (#421) (@marius311) - Update manifest (#423) (@github-actions[bot]) - Global math mode for easy use of lower-precision functionality (#424) (@maleadt) - Improve init error message (#425) (@maleadt) - CUBLAS: wrap rot! to implement rotate! and reflect! (#427) (@maleadt) - CUFFT-related optimizations (#428) (@maleadt) - Fix reverse/view regression (#429) (@maleadt) - Update packages (#433) (@maleadt) - Introduce StridedCuArray (#435) (@maleadt) - Retry curandGenerateSeeds when OOM. (#436) (@maleadt) - Introduce DenseCuArray union (#437) (@maleadt) - Array simplifications (#438) (@maleadt) - Fix and test reverse on wrapped array. (#439) (@maleadt) - Fixes after recent array wrapper changes (#441) (@maleadt) - Adapt to GPUArrays changes. (#442) (@maleadt) - Provide CUBLAS with a pool-backed workspace. (#443) (@maleadt) - Fix finalization of copied arrays. (#444) (@maleadt) - Support for/Add CUDA 11.1 (#445) (@maleadt) - Update manifest (#449) (@github-actions[bot]) - Allow use of strided vectors with mul! (gemv! and gemm!) (#450) (@maleadt) - Have convert call CuSparseArray's constructors. (#451) (@maleadt)

- Julia
Published by github-actions[bot] over 5 years ago

CUDA - v1.3.3

CUDA v1.3.3

Diff since v1.3.2

Closed issues: - Type changing Array conversions give error when allowscalar(false) (#344) - getindex(::CuArray, ::Adjoint, ::Colon) fails (#345) - View with array indices causes memory copy before broadcast (#384) - Regression with Julia 1.5 (#390)

Merged pull requests: - Replace DevicePtr with Core.LLVMPtr. (#199) (@maleadt) - Make sure view indices reside on the GPU too. (#388) (@maleadt) - CompatHelper: Update DataStructures to v0.18 (#389) (@ChrisRackauckas)

- Julia
Published by github-actions[bot] almost 6 years ago