Recent Releases of CUDA
CUDA - v5.8.3
CUDA v5.8.3
Merged pull requests:
- More tests for diagm (#2791) (@kshyatt)
- Add JLD2 to test env (#2792) (@christiangnrd)
- cuTENSOR: Destroy plan description and preference after construction. (#2794) (@maleadt)
- More tests for sparse matrix dimension checks (#2796) (@kshyatt)
- Better error messages and tests for sm2 (#2797) (@kshyatt)
- Reorganize interfaces tests, lower allocations (#2799) (@kshyatt)
- Cleanup and less memory use for cusparse linalg tests (#2800) (@kshyatt)
- Separately version the CUDA compiler (#2801) (@maleadt)
- Remove shape-preserving Diagonal conversion constructors. (#2805) (@maleadt)
- More accumulation and reduction benchmarks (#2808) (@christiangnrd)
- Rationalize and try to fix failing ldiv tests (#2809) (@kshyatt)
- Simplify specifying benchmark output file (#2814) (@christiangnrd)
- Add KA unified memory support (#2819) (@christiangnrd)
- Augment docs about setting runtime version (#2822) (@david-macmahon)
- Fix a list numbering problem in docs (#2824) (@david-macmahon)
- Move things to GPUToolbox. (#2826) (@maleadt)
- Initial compatibility with CUDA 13 (#2834) (@maleadt)
Closed issues:
- Array constructors for ones, zeros, rand, ... (#159)
- CuSparse documentation (#135)
- gemm_strided_batched throws error on windows (#132)
- Documentation: An example for allocating Unified Memory arrays (#33)
- norm function errors on big arrays (#598)
- CuSparse factorizations (#1396)
- opnorm(::CuMatrix, p) for p = (1, Inf) (#1533)
- CUSPARSE: support broadcasting for CuSparseVectors (#2699)
- Remove erroneous CuArray(::Diagonal) methods (#2734)
- Matrix-Matrix-Multiplication fails with CuSparseMatrixBSR. (#2745)
- Possible CPU memory leak in cuTENSOR plans (#2793)
- Test fail for libraries/cublas/level1 (#2810)
- CI is failing due to CUSPARSEVEC (#2817)
- Support for CUDA 13 (#2831)
Published by github-actions[bot] 10 months ago
CUDA - v5.8.2
CUDA v5.8.2
Merged pull requests:
- Fix spdiagm with specified pairs (#2784) (@ErikQQY)
- Add diagm in CUBLAS (#2786) (@ErikQQY)
Closed issues:
- Where to host extension(s) (#2735)
- spdiagm doesn't support specified diagonal elements (#2783)
- CUDA failed to create a diagonal matrix of CuArray(u) (#2785)
Published by github-actions[bot] about 1 year ago
CUDA - v5.8.0
CUDA v5.8.0
Merged pull requests:
- SparseMatricesCSR Dispatch (#2720) (@Abdelrahman912)
- Very rough implementation of bcast for CuSparseVector (#2733) (@kshyatt)
- Possible fix for #2745, change args in call to cusparseCreateBsr (#2747) (@manuelbb-upb)
- Simple tests for check and explaineltype (#2748) (@kshyatt)
- Test for printing OutOfGPUMemoryError (#2749) (@kshyatt)
- Fix logmessage pileup (#2750) (@fps)
- Test for parselimit (#2751) (@kshyatt)
- unsafe_wrap for symbols (#2753) (@vchuravy)
- Use thread adoption to handle log messages. (#2754) (@maleadt)
- Add pre-commit configuration (#2755) (@vchuravy)
- Broaden check for eltypes to make sure we don't allow invalid stuff (#2756) (@kshyatt)
- Prefer aligned_sizeof (#2757) (@vchuravy)
- More array tests (#2758) (@kshyatt)
- A few more tests for CUSOLVER Q mats (#2759) (@kshyatt)
- More tests for CuArrayPtr (#2760) (@kshyatt)
- [CUSOLVER] Update gesvdp! (#2763) (@amontoison)
- Get rid of unneeded version checks (#2765) (@kshyatt)
- Remove second import of aligned_sizeof (#2767) (@vchuravy)
- CUSPARSE SpGEMM: Support algorithms 2 and 3 (#2769) (@maleadt)
- Update to CUDA 12.9. (#2772) (@maleadt)
- Fix SPGEMM_ALGOS setup (#2773) (@jonas-schulze)
- Support new functionality from KA 0.9.32 (#2774) (@michel2323)
- cuTENSOR: Preserve storage type when multiplying (#2775) (@christiangnrd)
- Update subpackages. (#2776) (@maleadt)
- Remove the unnecessary reshape during mapreduce. (#2778) (@maleadt)
Closed issues:
- Type conversions in broadcast fails when compiling with always_inline=true (#2722)
- cuDNN loses memory to log messages in Pluto.jl context (#2743)
- Xgesvdp! failure when only requesting singular values (#2761)
- CUDA 5.7.3 fails to precompile on Julia 1.12.0-beta2 (#2762)
- aligned_sizeof with an existing identifier (#2766)
- CUSPARSE_SPGEMM_ALG2 not working (#2768)
- sum! throws dispatch error beyond a threshold number of rows (#2777)
Published by github-actions[bot] about 1 year ago
CUDA - v5.7.3
CUDA v5.7.3
Merged pull requests:
- Merge CSC/CSR broadcast kernels (#2731) (@kshyatt)
- GPUToolbox v0.2 take 2 (#2736) (@christiangnrd)
- Add dispatches to access device matrix data via SparseArrays interface (#2738) (@termi-official)
- More tests for CuContext (#2739) (@kshyatt)
- Fill in missing KA functionality (KA.functional + sparse matrices adaption from CUDAbackend) (#2740) (@Abdelrahman912)
- Small tests and changes for coverage (#2742) (@kshyatt)
- More tests and better error type for cusparse generic (#2744) (@kshyatt)
- Restore the descriptors in CUSPARSE (#2746) (@amontoison)
Published by github-actions[bot] about 1 year ago
CUDA - v5.7.2
CUDA v5.7.2
Merged pull requests:
- Support disabling implicit synchronization (#2662) (@vchuravy)
- More tests and bugfixes for CUSOLVER (#2707) (@kshyatt)
- Set neutral element to zero for sparse reduce (#2710) (@kshyatt)
- Bugfix and tests for cusolver/base (#2712) (@kshyatt)
- Small fixes and missed tests for CUTENSORNET (#2713) (@kshyatt)
- Even more tests and small fixes for CUTENSORNET (#2715) (@kshyatt)
- Tests for CUSTATEVEC errors (#2716) (@kshyatt)
- Add compat entries for recent devices and toolkits. (#2717) (@maleadt)
- Split out copyto for texture arrays and add more tests (#2719) (@kshyatt)
- Add a docstring for pointer (#2721) (@maleadt)
- More CUSOLVER dense tests (#2723) (@kshyatt)
- Tests for some helper functions (#2724) (@kshyatt)
- More tests and bugfixing for CUSPARSE (#2725) (@kshyatt)
- Add more methods for all versions to unstick tests (#2726) (@kshyatt)
Closed issues:
- Ability to opt out of / improved automatic synchronization between tasks for shared array usage (#2617)
- maximum(abs, CuSparseMatrixCSR) returns Inf (#2705)
- mapreduce(f, op, A) for sparse A is wrong if f(0) =/= 0 (#2709)
Published by github-actions[bot] about 1 year ago
CUDA - v5.7.1
CUDA v5.7.1
Merged pull requests:
- Tests for MIME printing and indexing (#2686) (@kshyatt)
- Loosen VERSION check for sketchy test (#2688) (@kshyatt)
- CompatHelper: bump compat for GPUToolbox to 0.2, (keep existing compat) (#2689) (@github-actions[bot])
- Even more sparse printing and tril/triu tests (#2692) (@kshyatt)
- Even more sparse tests (#2695) (@kshyatt)
- More tests and a matmatmul fix (#2697) (@kshyatt)
- Sparse conversion tests (#2698) (@kshyatt)
- Tests for descriptors (#2700) (@kshyatt)
- More tests for some missing kron methods (#2701) (@kshyatt)
- Don't duplicate const defs (#2703) (@kshyatt)
- Exclude device-side sorting code from coverage (#2704) (@kshyatt)
- More tests for CuRef/CuRefArray (#2706) (@kshyatt)
- Update Project.toml (#2708) (@kshyatt)
Closed issues:
- GC corruption on 1.10 during cusparse/reduce tests (#2027)
- Launch bounds interface (#2674)
- Precompilation errors: ERROR: LoadError: invalid redefinition of constant CUSPARSE.CuSparseUpperOrUnitUpperTriangular (#2690)
Published by github-actions[bot] about 1 year ago
CUDA - v5.7.0
CUDA v5.7.0
Merged pull requests:
- Bugfix for batched gemv (#2481) (@kose-y)
- Split out level 3 gemm tests (#2610) (@kshyatt)
- Switch CUBLAS to device-side pointer mode (#2616) (@kshyatt)
- Elide bounds checks when kernels contains manual ones. (#2621) (@maleadt)
- Support passing symbols as arguments (#2624) (@vchuravy)
- Remove eager synchronization with HtoD copies. (#2625) (@maleadt)
- Don't prefetch on multi-device systems (#2626) (@vchuravy)
- Cooperative groups: add a boundscheck to avoid confusing inexact errors. (#2631) (@maleadt)
- NFC fixes (#2632) (@maleadt)
- Update to CUDA 12.8 (#2634) (@maleadt)
- [CUSOLVER] Update the test of syevBatched! (#2636) (@amontoison)
- Improve NSight Systems activation by inspecting the session list. (#2638) (@maleadt)
- [CUSPARSE] Support CuSparseMatrixBSR in the generic mm! (#2639) (@amontoison)
- [CUSOLVER] Support symmetric factorization without pivoting (#2640) (@amontoison)
- Wrap the Givens rotation methods (#2642) (@kshyatt)
- Remove kron methods and use those in GPUArrays (#2643) (@kshyatt)
- Add a simpler CuRefValue. (#2645) (@maleadt)
- Use GPUToolbox.jl (#2646) (@christiangnrd)
- DtoH copies: perform a nonblocking sync before calling into libcuda. (#2648) (@maleadt)
- Support Adjoint/Transpose -> COO (#2649) (@kshyatt)
- Support cuTENSOR contractors for 1D views (#2650) (@kshyatt)
- Re-enable mixed precision sparse mv (#2651) (@kshyatt)
- Proper support for similar on CuSparseMats (#2652) (@kshyatt)
- Test error throw for accumulate (#2656) (@kshyatt)
- Lots more tests for CUBLAS (#2657) (@kshyatt)
- MORE tests for CUBLAS and a bugfix (#2659) (@kshyatt)
- Add tests for gemmEx in fast math mode (#2660) (@kshyatt)
- More tests/better coverage for CUSPARSE (#2663) (@kshyatt)
- Fixes and tests for CuStateVec (#2664) (@kshyatt)
- Re-enable NVTX on Windows. (#2665) (@maleadt)
- Protect against occupancy calculations with very large numbers. (#2666) (@maleadt)
- Fixes and tests for COO indexing, exclude more kernels from coverage (#2668) (@kshyatt)
- Exclude lib*jl from coverage also for CUSTATEVEC, CUTENSOR, and CUTENSORNET (#2669) (@kshyatt)
- Even MORE tests and cov for CUBLAS (#2670) (@kshyatt)
- Fix and test for mgpu batch measure (#2671) (@kshyatt)
- Remove some invalid conversions and test more (#2673) (@kshyatt)
- Exclude more device side code in CUSPARSE (#2676) (@kshyatt)
- More tests, better errors, more exclusions for CUSPARSE (#2677) (@kshyatt)
- Try re-enabling the convolution tests (#2678) (@kshyatt)
- Fix Markdown formatting in overview.md (#2680) (@singularitti)
- Even more CUSPARSE tests (#2682) (@kshyatt)
- Fix inference of FFT plan creation (#2683) (@jipolanco)
- Some cudadrv tests (#2684) (@kshyatt)
Closed issues:
- Batched strided GEMM tests fail (#151)
- CuArrays.CURAND.curand missing methods (#141)
- Rationals behave badly (#118)
- Matrix inversion for CuArray (#116)
- Dot product of a complex CuArray with a real CuArray performance (#668)
- Sporadic cudnn/convolution test failures (#725)
- Support for LinearAlgebra.pinv (#883)
- Update mv!, mm!, sv! and sm! with the future release of CUSPARSE (#1610)
- [CUSPARSE] changing size in similar returns a cpu array (#1667)
- Mix precision sparse mul is not dispatched correctly (#1760)
- Make CuRef(Value) behave more like Ref (#1803)
- [cuTENSOR] Issue when contracting views of CuArrays with cuTENSOR (#2407)
- versioninfo broken on Jetson Orin due to NVML lookup failure (#2542)
- CUBLAS: Improve concurrency using device pointer mode (#2571)
- NVML issues on Jetson Nano Orin (#2580)
- Passing Symbol as an argument fails (#2590)
- Remove kron functionality (#2602)
- Disable or make automatic prefetching of unified memory optional (#2618)
- Circular dependency in CUDA with Julia 1.10 (#2622)
- Regression with nsys profile and CUDA.@profile (#2629)
- PrecompileTools.jl with CUDA.jl causes kernels to fail to run on 1.11 (#2637)
- Support Adjoint Sparse Matrices for CuSparseMatrixCOO (#2647)
- Implicit stream sync in tasks serialise kernel execution (#2654)
- Broadcasting on arrays larger than typemax(Int32) yields truncation error (#2658)
- Problem with function in CUDA (#2667)
- CUDA.limit errors with invalid argument (code 1, ERROR_INVALID_VALUE) (#2672)
- CUDA.jl does not support tuples of UInt128 (#2675)
- Can not permutedims! CuArray with length larger than typemax(Int32) (#2679)
- Support for older GPUs (#2685)
Published by github-actions[bot] about 1 year ago
CUDA - v5.6.1
CUDA v5.6.1
Merged pull requests:
- Support GPUArrays allocations cache (#2593) (@pxl-th)
- Fix resize! when pool=none is in use (#2613) (@luraess)
- Update to new alloc cache interface. (#2614) (@maleadt)
- Work around NVML issue on Jetson Orin. (#2620) (@maleadt)
Closed issues:
- Add strides, implement CUDA Array Interface (#1298)
- Restore broken CUBLAS test (#2584)
- Issues with multiple GPUs on a single node (#2615)
Published by github-actions[bot] over 1 year ago
CUDA - v5.6.0
CUDA v5.6.0
CUDA.jl v5.6 is a relatively minor release, with the most important change being behind the scenes: GPUArrays.jl v11 has switched to KernelAbstractions.jl (#2524).
Features
- Update to CUDA 12.6.2 (#2512)
- CUSOLVER: support for `Xgeev!` (#2513), `XsyevBatched` (#2577), `gesv!` and `gels!` (#2406)
- CUBLAS: added multiplication of transpose / adjoint matrices by diagonal matrices (#2518, #2538)
- Improve handle cache performance in the presence of many short-lived tasks (#2583)
- CUFFT: Pre-allocate the buffer required for complex-to-real FFTs only once (#2578)
- Improved batched pointer conversion for very large batches (#2608)
Bug fixes
- Fix `findall` with an empty CuArray (#2554)
- CUBLAS: Fix use of level 1 methods with strided arrays (#2528)
- CUSOLVER: Fix `Xgesvdr!` (#2556)
- Preserve the array buffer type with more linear algebra operations (#2534)
- Work around LinearAlgebra.jl breakage in Julia 1.11.2 concerning generic triangular `(l/r)mul!` (#2585)
- Fix ambiguity of `LinearAlgebra.dot` (#2569)
- Native RNG: Fixes when working with very large arrays (#2561)
- Avoid a deadlock due to union splitting in the `mapreduce` kernel (#2595)
- Fix pinning of resized CPU memory by automatically re-pinning (#2599)
Merged pull requests:
- [CUSOLVER] Interface gesv! and gels! (#2406) (@amontoison)
- Update wrappers for CUDA v12.6.2 (#2512) (@amontoison)
- [CUSOLVER] Interface Xgeev! (#2513) (@amontoison)
- Added multiplication of transpose / adjoint matrices by diagonal matrices (#2518) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 1, (keep existing compat) (#2521) (@github-actions[bot])
- Adapt to GPUArrays.jl transition to KernelAbstractions.jl. (#2524) (@maleadt)
- Switch CI to 1.11. (#2525) (@maleadt)
- CUTENSOR: Reduce amount of broadcasts compiled during tests. (#2527) (@maleadt)
- CUBLAS: Don't use BLAS1 wrappers for strided arrays, only vectors. (#2528) (@maleadt)
- Clarify the synchronize(ctx)/device_synchronize() docstrings (#2532) (@JamesWrigley)
- Issue #2533: Preserving the buffer type in linear algebra (#2534) (@kmp5VT)
- Clarify description of how LocalPreferences.toml is generated in the docs (#2535) (@glwagner)
- Adapt to JuliaGPU/GPUArrays.jl#567. (#2537) (@maleadt)
- Removed allocations for transpose/adjoint - diagonal multiplications (#2538) (@RedRussianBear)
- Consistent use of Nsight Compute (#2541) (@huiyuxie)
- Fix formatting in profiling docs page (#2543) (@efaulhaber)
- Fix typo in EnzymeCoreExt.jl (#2550) (@wsmoses)
- Enhance warning under a profiler (#2552) (@huiyuxie)
- Fix findall with an empty CuArray of Bool (#2554) (@amontoison)
- [CUSOLVER] Fix Xgesvdr! (#2556) (@amontoison)
- Test restore Enzyme.jl (#2557) (@wsmoses)
- Native RNG fixes for very large arrays (#2561) (@maleadt)
- [Enzyme] Mark launch_configuration as inactive (#2563) (@wsmoses)
- Update EnzymeCoreExt.jl (#2565) (@simenhu)
- Fix ambiguity of LinearAlgebra.dot (#2569) (@amontoison)
- [CUSOLVER] Add more tests for the dense SVD (#2574) (@amontoison)
- [CUSOLVER] Interface XsyevBatched (#2577) (@amontoison)
- [CUFFT] Preallocate a buffer for complex-to-real FFT (#2578) (@amontoison)
- Run the GC when failing to find a handle, but lots are active. (#2583) (@maleadt)
- Work around LinearAlgebra.jl breakage in 1.11.2. (#2585) (@maleadt)
- mapreduce: avoid deadlock by forcing the accumulator type. (#2596) (@maleadt)
- Switch to GitHub Actions-based benchmarks. (#2597) (@maleadt)
- Re-pin variable sized memory (#2599) (@jipolanco)
- Enzyme: add make_zero of cuarrays (#2600) (@wsmoses)
- Update cache.jl (#2604) (@jarbus)
- Enzyme: mark devicesync as non-differentiable only downstream (@wsmoses)
- Move strided batch pointer conversion to GPU (#2608) (@THargreaves)
- Split linalg tests into multiple files (#2609) (@kshyatt)
Closed issues:
- Inference failure with sort(::CuMatrix) after loading MLDatasets (#2258)
- Kron Support for CuSparseMatrixCSC (#2370)
- Broadcasting a function returning an anonymous function with a constructor over CUDA arrays fails to compile, "not isbits" (#2514)
- CuArray view has different variable type outside x inside the cuda kernel (#2516)
- Can't build cuDNN on centos7.8 (#2517)
- Precompile errors (#2519)
- Precompile errors (#2520)
- Error returned from CUDA function in CUDA-aware MPI multi-GPU test (#2522)
- Broadcasting over random static array errors on Julia 1.11 (#2523)
- gemm_strided_batched only using strided CUDA kernel when first matrix is transposed (#2529)
- CUDA runtime libraries are loaded from a system path due to LD_LIBRARY_PATH being set (#2530)
- [Bug] UnifiedMemory buffer changes during LinearAlgebra operations (#2533)
- Improve system library warning when running under profiler (#2540)
- Local CUDA settings not propagated to Pkg.test (#2545)
- Out of Memory when working with Distributed for Small Matrices (#2548)
- findall is not working with an empty vector of bool (#2553)
- CUDA code does not return when running under VSC Debugging mode (#2558)
- dot is quite slow in multinest Arrays (#2559)
- UndefVarError: backend not defined in GPUArrays (#2564)
- view() returns CuArray instead of view for 1-D CuArrays (#2566)
- dot ambiguity (#2568)
- InvalidIRError thrown only if critical function is not previously compiled (#2573)
- circular dependency during precompilation (#2579)
- Sparse MatVec Is Nondeterministic? (#2582)
- CUDA triggers long Circular dependency list (#2586)
- Release v5.5.3 for GPUArray v11? (#2587)
- 'dot' gives different answers when viewing rather than slicing multidimensional arrays (#2589)
- Scalar indexing when performing kron on two CuVectors (#2591)
- Faster strided-batched to batched wrapper (#2592)
- Error when copying data to pinned and resized CPU array (#2594)
- mapreducedim! size-dependent fail when narrowing float element types (#2595)
- Missing Enzyme.make_zero in Enzyme extension leads to incorrect behaviour (#2598)
- 'ArgumentError: array must be non-empty' when attempting to pop idle handles from HandleCache (#2603)
- Do a release as current one doesn't support GPUArrays v11 (#2606)
Published by github-actions[bot] over 1 year ago
CUDA - v5.5.2
CUDA v5.5.2
Merged pull requests:
- Fix type of AbstractFFTs.Plan for real-complex FFTs (#2504) (@jipolanco)
- Profiler: Demangle kernel names. (#2505) (@maleadt)
- Bump CUDNN. (#2507) (@maleadt)
- Restore Enzyme checks (#2508) (@wsmoses)
Published by github-actions[bot] over 1 year ago
CUDA - v5.5.1
What's Changed
- Update wrappers for CUDA v12.6.1 by @amontoison in https://github.com/JuliaGPU/CUDA.jl/pull/2499
- Enzyme: adapt to pending version breaking update by @wsmoses in https://github.com/JuliaGPU/CUDA.jl/pull/2490
Full Changelog: https://github.com/JuliaGPU/CUDA.jl/compare/v5.5.0...v5.5.1
Published by maleadt over 1 year ago
CUDA - v5.5.0
CUDA v5.5.0
Merged pull requests:
- Add support for arbitrary group sizes in gemm_grouped_batched! (#2334) (@lpawela)
- Add kernel compilation requirements to docs (#2416) (@termi-official)
- Enzyme: reverse mode kernels (#2422) (@wsmoses)
- CUFFT: Support Float16 (#2430) (@eschnett)
- Updated compute-sanitizer documentation (#2440) (@alexp616)
- Add troubleshooting section for NSight Compute (#2442) (@efaulhaber)
- Correct typo in documentation (#2445) (@eschnett)
- Bump minimal Julia requirement to v1.10. (#2447) (@maleadt)
- fix compute-sanitizer typo (#2448) (@alexp616)
- Address a corner case when establishing p2p access (#2457) (@findmyway)
- Implementation of spdiagm for CUSPARSE (#2458) (@walexaindre)
- Update to CUDA 12.6. (#2461) (@maleadt)
- CompatHelper: bump compat for GPUCompiler to 0.27, (keep existing compat) (#2462) (@github-actions[bot])
- Bump CUDA driver JLL. (#2463) (@maleadt)
- CUSOLVER (dense): cache workspace in fat handle (#2465) (@bjarthur)
- Revert "Run full GC when under very high memory pressure." (#2469) (@maleadt)
- Fix a method deprecation. (#2470) (@maleadt)
- Add Enzyme sum derivatives (#2471) (@wsmoses)
- Re-use pre-converted kernel arguments when launching kernels. (#2472) (@maleadt)
- Bump LLVM compat (#2473) (@maleadt)
- Bump subpackage compat. (#2475) (@maleadt)
- Enzyme: Reversemode cudaconvert (#2476) (@wsmoses)
- Ignore Enzyme.jl CI failures (#2479) (@maleadt)
- Re-enable enzyme testing (#2480) (@wsmoses)
- Add missing GC.@preserves. (#2487) (@maleadt)
- [CUSPARSE] Implement a sparse GEMV for CuSparseMatrixCSC * CuSparseVector (#2488) (@amontoison)
- [CUSPARSE] Add conversions between CuSparseVector and CuSparseMatrices (#2489) (@amontoison)
- Update to LLVM 9.1. (#2491) (@maleadt)
- Use at-consistent_overlay for 1.11 compatibility. (#2492) (@maleadt)
- Rework NNlib CI. (#2493) (@maleadt)
- CUSPARSE: Fix sparse constructor with duplicate elements. (#2495) (@maleadt)
Closed issues:
- LinearAlgebra.norm(x) falls back to generic implementation for x::Transpose and x::Adjoint (#1782)
- dlclose'ing the compatibility driver can fail (#1848)
- Creating a sparse diagonal matrix of CuArray(u) (#1857)
- Support for Julia 1.11 (#2241)
- CUDA 12.4 Update 1: CUPTI does not trace kernels anymore (#2328)
- Adding CUDA to a PackageCompiler sysimage causes segfault (#2428)
- Error using CUDA on Julia 1.10: Number of threads per block exceeds kernel limit (#2438)
- Error when I load my model (#2439)
- Driver JLL improvements (#2446)
- Deadlock when calling CUDA.jl in an adopted thread while blocking the main thread (#2449)
- CUDA.Mem.unregister fails with CUDA.jl 5.4 (not with 5.3) (#2452)
- Segmentation Fault on Loading CUDA (#2453)
- Invalid instruction error when using CUDA (#2454)
- Missing adapt for sparse and CUDABackend (#2459)
- CUDA precompile cannot find/load "cupti64_2024.2.1.dll" during precompilation (juliaup 1.10.4, Windows 11) (#2466)
- Request: Option to disable the "full GC when under very high memory pressure". (#2467)
- copyto! ambiguous (#2477)
- NeuralODE training failed on GPU with Enzyme (#2478)
- issue with atomic - when running standard test, @atomic modify expression missing field access (#2483)
- Support for creating a CuSparseMatrixCSC from a CuSparseVector (#2484)
- Issue with compiling CUDA and cuTENSOR using local libraries (#2486)
- Memory Access error in sparse array constructor (#2494)
- Forwards-compatible driver breaks CURAND (#2496)
- CUDA 12.6 Update 1 (#2497)
Published by github-actions[bot] over 1 year ago
CUDA - v5.4.3
CUDA v5.4.3
Merged pull requests:
- add cublas
Closed issues:
- Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053)
- Broadcasted multiplication with a rational doesn't work (#1926)
- Incorrect grid size in kron (#2410)
- GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412)
- Failure of Eigenvalue Decomposition for Large Matrices. (#2413)
- CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415)
- Recurrence of integer overflow bug (#1880) for a large matrix (#2427)
- CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429)
- CUDARuntimeDiscovery Did not find cupti on Arm system with nvhpc (#2433)
- CUDA.jl won't install/run on Jetson Orin NX (#2435)
Published by github-actions[bot] almost 2 years ago
CUDA - v5.4.0
CUDA v5.4.0
Merged pull requests:
- Support CUDA 12.5 (#2392) (@maleadt)
- Mark cuarray as noalias (#2395) (@wsmoses)
- Update Julia wrappers for CUDA v12.5 (#2396) (@amontoison)
- Enable correct pool access for cublasXt. (#2398) (@maleadt)
- More fine-grained CUPTI version checks. (#2399) (@maleadt)
Closed issues:
- CUTENSOR breaks after device_reset! (#2319)
- cuBLASXt's `xtgemm!` incompatible with stream-ordered allocated memory (#2320)
- Add helper function to recompile CUDA stack (#2364)
Published by github-actions[bot] about 2 years ago
CUDA - v5.3.5
CUDA v5.3.5
Merged pull requests:
- Avoid constructing MulAddMuls on Julia v1.12+ (#2277) (@dkarrasch)
- CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
- Enzyme: allocation functions (#2386) (@wsmoses)
- Tweaks to prevent context construction on some operations (#2387) (@maleadt)
- Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
- CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
- Backport: Enzyme allocation fns (#2393) (@wsmoses)
Closed issues:
- Indexing a view uses scalar indexing (#1472)
- EnzymeCore is an unconditional dependency. (#2380)
- cuBLASLt wrappers ccall into cuBLAS (#2388)
- generic_trimatmul! error (#2389)
Published by github-actions[bot] about 2 years ago
CUDA - v5.3.4
CUDA v5.3.4
Merged pull requests:
- Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
- Handle cache improvements (#2352) (@maleadt)
- Fix cuTensorNet compat (#2354) (@maleadt)
- Optimize array allocation. (#2355) (@maleadt)
- Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
- Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
- Suggest use of 32 bit types over 64 instead of just Float32 over Float64 skip ci (@Zentrik)
- Make generic_trimatmul more specific (#2359) (@tgymnich)
- Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
- Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
- Enzyme: Forward mode sync (#2369) (@wsmoses)
- Enzyme: support fill (#2371) (@wsmoses)
- unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
- Remove external_gvars. (#2373) (@maleadt)
- Tegra support with artifacts (#2374) (@maleadt)
- Backport Enzyme extension (#2375) (@wsmoses)
- Add note about --check-bounds=yes (#2378) (@Zinoex)
- Test Enzyme in a separate CI job. (#2379) (@maleadt)
- Fix tests for Tegra. (#2381) (@maleadt)
- Update Project.toml remove EnzymeCore unconditional dep (@wsmoses)
Closed issues:
- Native Softmax (#175)
- CUSOLVER: support eigendecomposition (#173)
- backslash with gpu matrices crashes julia (#161)
- at-benchmark captures GPU arrays (#156)
- Support kernels returning Union{} (#62)
- mul! falls back to generic implementation (#148)
- \ on qr factorization objects gives a method error (#138)
- Compiler failure if dependent module only contains a japi1 function (#49)
- copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
- Calling Flux.gpu on a view dumps core (#125)
- Creating CuArray{Tracker.TrackedReal{Float64},1} a few times causes segfaults (#121)
- Guard against exceeding maximum kernel parameter size (#32)
- Detect common API misuse in error handlers (#31)
- rand and friends default to Float64 (#108)
- \ does not work for least squares (#104)
- ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
- CuIterator assumes batches to consist of multiple arrays (#86)
- Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
- Document (un)supported language features for kernel programming (#13)
- Missing dispatch for indexing of reshaped arrays (#556)
- Track array ownership to avoid illegal memory accesses (#763)
- NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
- Support for sm_80 cp.async: asynchronous on-device copies (#850)
- Profiling Julia with Nsight Systems on Windows results in blank window (#862)
- sort! and partialsort! are considerably slower than CPU versions (#937)
- mul! does not dispatch on Adjoint (#1363)
- Cross-device copy of wrapped arrays fails (#1377)
- Memory allocation becomes very slow when reserved bytes is large (#1540)
- Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
- Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
- device_reset! does not seem to work anymore (#1579)
- device-side rand() are not random between successive kernel launches (#1633)
- Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
- `cusparseSetStreamv2` not defined (#1820)
- Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
- KernelAbstractions.jl-related issues (#1838)
- lock failing in multithreaded plan_fft() (#1921)
- CUSolver finalizer tries to take ReentrantLock (#1923)
- Testsuite could be more careful about parallel testing (#2192)
- Opportunistic GC collection (#2303)
- Unable to use local CUDA runtime toolkit (#2367)
- Enzyme prevents testing on 1.11 (#2376)
Published by github-actions[bot] about 2 years ago
CUDA - v5.3.3
CUDA v5.3.3
Merged pull requests:
- Rework context handling (#2346) (@maleadt)
- fix kernel launch logic (#2353) (@xaellison)
Closed issues:
- Excessive allocations when running on multiple threads (#1429)
- Fix and test multigpu support (#2218)
- Bitonic sort exceeds launch resources (#2331)
Published by github-actions[bot] about 2 years ago
CUDA - v5.3.2
CUDA v5.3.2
Merged pull requests:
- Add EnzymeCore extension for parent_job (#2281) (@vchuravy)
- Consider running GC when allocating and synchronizing (#2304) (@maleadt)
- Refactor memory wrappers (#2335) (@maleadt)
- Auto-detect external profilers. (#2339) (@maleadt)
- Fix performance of indexing unified memory. (#2340) (@maleadt)
- Improve exception output (#2342) (@maleadt)
- Test multigpu on CI (#2348) (@maleadt)
- cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt)
- cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)
Closed issues:
- CuArrays don't seem to display correctly in VS code (#875)
- Task scheduling can result in delays when synchronizing (#1525)
- Docs: add example on task-based parallelism with explicit synchronization (#1566)
- Exception output from many threads is not helpful (#1780)
- Autodetect external profiler (#2176)
- LazyInitialized is not GC-safe (#2216)
- Track CuArray stream usage (#2236)
- Improve cross-device usage (#2323)
- CUBLASLt wrapper for cublasLtMatmulDescSetAttribute can have device buffers as input (#2337)
- Improve error message when assigning real valued array with complex numbers (#2341)
- @device_code_sass broken (#2343)
- Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345)
- @gcsafe_ccall breaks inlining of ccall wrappers (#2347)
Published by github-actions[bot] about 2 years ago
CUDA - v5.3.1
CUDA v5.3.1
Merged pull requests:
- [CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison)
- Regenerate headers (#2324) (@maleadt)
- Add some installation tips to docs/README.md (#2326) (@jlchan)
- fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3)
- Diagnose kernel limits on launch failure. (#2329) (@maleadt)
- Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)
Closed issues:
- Missing CUBLASLt wrappers (#2322)
- error when switching device (#2323)
- v5.3.0: regression in Zygote performance (#2333)
Published by github-actions[bot] about 2 years ago
CUDA - v5.3.0
CUDA v5.3.0
Merged pull requests:
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make @device_code_sass work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining computetype as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Towards supporting Julia 1.11 (#2291) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE mm! functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- sortperm with dims (#2308) (@xaellison)
- [CUBLAS] Interface gemmgroupedbatched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
- Avoid capturing AbstractArrays in BoundsError (#2314) (@lcw)
- Clarify debug level hint. (#2316) (@maleadt)
Closed issues:
- Failed to compile PTX code when using NSight on Win11 (#1601)
- sortperm fails with dims keyword (#2061)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of sort on 2D CuArray (#2259)
- Multi-threaded code hanging forever with Julia 1.10 (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- StackOverflowError trying to throw OutOfGPUMemoryError, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying CuSparseMatrixCSC by CuMatrix results in Out of GPU memory (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)
- does not emit line info for debugging/profiling (#2312)
- Kernel using StaticArray compiles in julia v1.9.4 but not in v1.10.2 (#2313)
- Using copyto! with SharedArray trigger scalar indexing disallowed error (#2317)
Published by github-actions[bot] about 2 years ago
CUDA - v4.4.2
CUDA v4.4.2
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDARuntimeDiscovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a withworkspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- CUSPARSE: Eagerly combine duplicate element on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make @device_code_sass work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining computetype as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE mm! functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- [CUBLAS] Interface gemmgroupedbatched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- Failed to compile PTX code when using NSight on Win11 (#1601)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a withworkspaces function to allocate two buffers (Device / Host) (#1767)
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069)
- Support for LinearAlgebra.pinv (#2070)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
- Missing CUDARuntimeDiscovery as a dependency in cuDNN (#2094)
- Binaries for Jetson (#2105)
- Minimum/maximum of array of NaNs is infinity (#2111)
- Performance regression for multiple @sync copyto! on CUDA v5 (#2112)
- [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
- More informative errors when parameter size is too big (#2119)
- Unable to allocate unified memory buffers (#2120)
- CUDA 12.3 has been released (#2122)
- atomic min, max for Float32 and Float64 (#2129)
- Native profiler output is limited to around 100 columns when printing to a file (#2130)
- Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
- LLVM generates max.NaN which only works on sm_80 (#2148)
- Unified memory-related error on Tegra T194 (#2149)
- Errors on sm_61 (#2150)
- First test for Julia/CUDA with 15 failures (#2158)
- High CPU load during GPU synchronization (#2161)
- Modifying struct containing CuArray fails in threads in 5.0.0 and 5.1.0 (#2171)
- Update to CUTENSOR 2.0 (#2174)
- Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
- Support for combining duplicate elements in sparse matrices (#2185)
- Interactive sessions: periodically trim the memory pool (#2190)
- Broadcast does not preserve buffer type (#2191)
- CUDA doesn't precompile on Julia nightly/1.11 (#2195)
- Latest julia: UndefVarError: make_seed not defined in Random (#2198)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
- Most recent package versions not supported on CUDA.jl (#2212)
- Testing of CUDA fails (#2222)
- Tests fail for CUDA#master (#2223)
- --debug-info=2 makes NNlibCUDACUDNNExt precompilation run forever (#2225)
- Test failures on Nvidia GH200 (#2227)
- mul! should support strided outputs (#2230)
- Please add support for older cuda versions (cuda 8 and older) (#2231)
- NSight Compute: prevent API calls during precompilation (#2233)
- Integrated profiler: detect lack of permissions (#2237)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of sort on 2D CuArray (#2259)
- Multi-threaded code hanging forever with Julia 1.10 (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- StackOverflowError trying to throw OutOfGPUMemoryError, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying CuSparseMatrixCSC by CuMatrix results in Out of GPU memory (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)
Published by github-actions[bot] about 2 years ago
CUDA - v5.2.0
CUDA v5.2.0
Merged pull requests:
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
Closed issues:
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
- First test for Julia/CUDA with 15 failures (#2158)
- Update to CUTENSOR 2.0 (#2174)
- Tests fail for CUDA#master (#2223)
- Test failures on Nvidia GH200 (#2227)
- mul! should support strided outputs (#2230)
- Please add support for older cuda versions (cuda 8 and older) (#2231)
- NSight Compute: prevent API calls during precompilation (#2233)
- Integrated profiler: detect lack of permissions (#2237)
Published by github-actions[bot] over 2 years ago
CUDA - v5.1.2
CUDA v5.1.2
Merged pull requests:
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- CUSPARSE: Eagerly combine duplicate element on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
Closed issues:
- More informative errors when parameter size is too big (#2119)
- Modifying struct containing CuArray fails in threads in 5.0.0 and 5.1.0 (#2171)
- Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
- Support for combining duplicate elements in sparse matrices (#2185)
- Interactive sessions: periodically trim the memory pool (#2190)
- Broadcast does not preserve buffer type (#2191)
- CUDA doesn't precompile on Julia nightly/1.11 (#2195)
- Latest julia: UndefVarError: make_seed not defined in Random (#2198)
- CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
- Most recent package versions not supported on CUDA.jl (#2212)
- Testing of CUDA fails (#2222)
- --debug-info=2 makes NNlibCUDACUDNNExt precompilation run forever (#2225)
Published by github-actions[bot] over 2 years ago
CUDA - v5.1.1
CUDA v5.1.1
Merged pull requests:
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
Closed issues:
- High CPU load during GPU synchronization (#2161)
Published by github-actions[bot] over 2 years ago
CUDA - v5.1.0
CUDA v5.1.0
CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which offer a more modular approach to kernel programming. For more details, see the blog post.
Merged pull requests:
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDARuntimeDiscovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a withworkspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a withworkspaces function to allocate two buffers (Device / Host) (#1767)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
- Missing CUDARuntimeDiscovery as a dependency in cuDNN (#2094)
- Binaries for Jetson (#2105)
- Minimum/maximum of array of NaNs is infinity (#2111)
- Performance regression for multiple @sync copyto! on CUDA v5 (#2112)
- [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
- Unable to allocate unified memory buffers (#2120)
- CUDA 12.3 has been released (#2122)
- atomic min, max for Float32 and Float64 (#2129)
- Native profiler output is limited to around 100 columns when printing to a file (#2130)
- LLVM generates max.NaN which only works on sm_80 (#2148)
- Unified memory-related error on Tegra T194 (#2149)
- Errors on sm_61 (#2150)
Published by github-actions[bot] over 2 years ago
CUDA - v5.0.0
CUDA v5.0.0
Blog post: https://info.juliahub.com/cuda-jl-5-0-changes
This is a breaking release, but the breaking changes are minimal (see the blog post for details):
- Julia 1.8 is now required, and only CUDA 11.4+ is supported
- selection of local toolkits has changed slightly
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
Closed issues:
- StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069)
- Support for LinearAlgebra.pinv (#2070)
Published by github-actions[bot] over 2 years ago
CUDA - v4.4.1
CUDA v4.4.1
Closed issues:
- CUDA driver device support does not match toolkit (#70)
- Launching kernels should not allocate (#66)
- syncthreads() appears to not be sync'ing threads (#61)
- Exception when using CuArrays with Flux (#129)
- Kernel using MVector fails to compile or crashes at runtime due to heap allocation (#45)
- Performance regression on matrix multiplication between CUDA.jl 1.3.3 and 2.1.0/master (#538)
- Improve 'VS C++ redistributable' error message (#764)
- CUSPARSE does not support reductions (#1406)
- CUDA test failed (#1690)
- Type constructor in broadcast doesn't compile (#1761)
- accumulate(+) gives different results for CuArray compared to Array. (#1810)
- Compat driver: preload all libraries (#1859)
- Stream synchronization is slow when waiting on the event from CUDA (#1910)
- cuDNN: Store convolution algorithm choice to disk. (#1947)
- Disable 'No CUDA-capable device found' error log (#1955)
- CUDNN_STATUS_NOT_SUPPORTED using 1D CNN model (#1977)
- Memory allocations during in-place sparse matrix-vector multiplication (#1982)
- CUSPARSE.sum_dim1 sums the absolute values of elements (#1983)
- Update to CUDA 12.2 (#1984)
- unsafe_wrap fails on zero element CuArrays (#1985)
- rand in kernel works in a deterministic way (#2008)
- Scalar indexing with CuArray * ReshapedArray{SubArray{CuArray}}} (#2009)
- volumerhs performance regression (#2010)
- CuSparseMatrix constructors allocate too much memory? (#2015)
- Native profiler using CUPTI (#2017)
- libLLVM-15jl.so (#2018)
- "symbol multiply defined" error (#2021)
- Confusion on row major vs column major (#2023)
- Printing of CuArrays gives zeros or random numbers (#2033)
- sortperm! fails when output is UInt vector (#2046)
- Re-introduce spinning loop before nonblocking synchronization (#2057)
Merged pull requests:
- Check mathType only if not Float32 (#1943) (@RomeoV)
- 1.10 enablement (#1946) (@dkarrasch)
- Implement reverse lookup (Ptr->Tuple) for CUDNN descriptors. (#1948) (@RomeoV)
- Wrapper with tests for gemmBatchedEx! (#1975) (@lpawela)
- Add wrappers for gemv_batched! (#1981) (@lpawela)
- Update CUSPARSE.sum_dim<n> to allow for arbitrary function on elements (#1987) (@lpawela)
- Update manifest (#1988) (@github-actions[bot])
- Add vectorized cached loads (#1993) (@Zentrik)
- Update manifest (#1995) (@github-actions[bot])
- Fix typo in captured macro example (#1996) (@Zentrik)
- Adapt Type call broadcasting to a function (#2000) (@simonbyrne)
- [CUSPARSE] Added support for generalized dot product dot(x, A, y) = dot(x, A * y) without allocating A * y (#2001) (@albertomercurio)
- Update manifest (#2002) (@github-actions[bot])
- Support for printing types. (#2003) (@maleadt)
- Fix accumulate bug (#2005) (@chrstphrbrns)
- Update manifest (#2013) (@github-actions[bot])
- Add a raw mode to code_sass. (#2019) (@maleadt)
- Update manifest (#2022) (@github-actions[bot])
- Add a native profiler. (#2024) (@maleadt)
- Perform synchronization on a worker thread (#2025) (@maleadt)
- Remove broken video link in docs (#2028) (@christiangnrd)
- When freeing memory, use the high-level device getter. (#2029) (@maleadt)
- Add support for @cuda fastmath (#2030) (@maleadt)
- Make "CUDA.jl" a link on the doc entry page (#2031) (@carstenbauer)
- Add support for CUDA 12.2. (#2034) (@maleadt)
- rand: seed kernels from the host. (#2035) (@maleadt)
- Update wrappers for CUDA 12.2. (#2039) (@maleadt)
- On CUDA 12.2, have the memory pool enforce hard memory limits. (#2040) (@maleadt)
- Delay all initialization errors until run time. (#2041) (@maleadt)
- JLL/CI/Julia changes. (#2042) (@maleadt)
- Add support for NVTX events to the integrated profiler. (#2043) (@maleadt)
- Update cuStateVec to cuQuantum 23.6. (#2044) (@maleadt)
- Add some more fastmath functions (#2047) (@Zentrik)
- Fixup wrong key lookup. (#2048) (@RomeoV)
- Update manifest (#2049) (@github-actions[bot])
- Make sortperm! resilient to type mismatches. (#2051) (@maleadt)
- Disable tests that cause GC corruption on 1.10. (#2053) (@maleadt)
- enable dependabot for GitHub actions (#2054) (@ranocha)
- Bump actions/checkout from 2 to 3 (#2055) (@dependabot[bot])
- Bump peter-evans/create-pull-request from 3 to 5 (#2056) (@dependabot[bot])
- Rework how local toolkits are selected. (#2058) (@maleadt)
- Busy-wait before doing nonblocking synchronization. (#2059) (@maleadt)
Published by github-actions[bot] almost 3 years ago
CUDA - v4.4.0
CUDA v4.4.0
Closed issues:
- Unreachable control flow leads to illegal divergent barriers (#1746)
- CUBLAS fails on new CUDA.jl v4 (#1852)
- Sort fails on Lovelace (sm8.9) GPUs (#1874)
- gesvd! crashes on Pascal and v12.0 (#1932)
- No effect for calling "nsys launch" (#1938)
- Basic math operations with nested adjoint and transpose (#1940)
- CPU and GPU implementations return results at dissimilar scales, even in double precision arithmetics (#1950)
- Failed CUDA.jl initialization breaks Flux? (#1952)
- Recent mul! changes break multiplication with matrices that have StaticArray elements (#1953)
- Test infrastructure: define test groups (#1961)
- Strange rand errors when sampling large matrices (#1963)
- Add aqua tests (#1964)
- Support of Orin GPU from Nvidia ? (#1966)
- Crash in LLVM (#1971)
- Warning cuDNN Convolution (#1972)
- Strange behaviour when installed at system level (#1973)
Merged pull requests:
- Update benchmarks for 1.8 and 1.9 (#1933) (@maleadt)
- CUSOLVER: Explicitly pass NULL when not requesting svd outputs. (#1934) (@maleadt)
- Detect and complain about loading system libraries. (#1935) (@maleadt)
- Update manifest (#1936) (@github-actions[bot])
- Avoid stack overflow with early OOM reporting. (#1937) (@maleadt)
- [CUSPARSE] Improved support for UniformScaling and Diagonal (#1941) (@albertomercurio)
- Update manifest (#1949) (@github-actions[bot])
- Update GPUCompiler to fix unreachable control flow. (#1951) (@maleadt)
- Allow StaticArray eltype in matmat{vec,mul} (#1954) (@lcw)
- Bump CUDNN to v8.9. (#1959) (@maleadt)
- Bump CUTENSOR to v1.7. (#1960) (@maleadt)
- Add and fix some aqua tests (#1965) (@charleskawczynski)
- Fix compatibility of CUDA 11.4 to support Orin. (#1967) (@maleadt)
- Don't use Int32 indices in rand kernels. (#1969) (@maleadt)
- CI simplifications (#1970) (@maleadt)
- Use Base.pkgversion on 1.9. (#1974) (@maleadt)
- Update to LLVM.jl 6. (#1976) (@maleadt)
- fix launch config bug in bitonic sort (#1979) (@xaellison)
- Update manifest (#1980) (@github-actions[bot])
Published by github-actions[bot] almost 3 years ago
CUDA - v4.3.1
CUDA v4.3.1
Closed issues:
- Array testsuite compiles kernel with large types (#1902)
- CUDA.jl v4 installs CUDA runtime despite version=local (#1922)
- Occasional "CUSOLVERError: an internal operation failed (code 7, CUSOLVER_STATUS_INTERNAL_ERROR)" (#1924)
- Does cuDNN@v1.0.4 need CUDA@v4.3? (#1929)
Merged pull requests:
- Simplify libdevice linking. (#1927) (@maleadt)
- Add a show method for kernel objects. (#1928) (@maleadt)
- Update manifest (#1930) (@github-actions[bot])
- Pass a higher capability to ptxas. (#1931) (@maleadt)
Published by github-actions[bot] about 3 years ago
CUDA - v4.3.0
CUDA v4.3.0
Closed issues:
- Multidimensional reverse (#1126)
- Test errors on master (#1866)
- Integer overflow error with svd for large matrix (#1880)
- Erratic behaviour of CUDA.jl if used in the REPL of VSCode. (#1892)
- QR decomposition requires scalar indexing (#1893)
- BSOD during package tests (#1898)
- Insufficient coverage of CuArrays in the documentation (#1901)
- Failed to compile with Julia v1.9 on PowerPC (#1911)
- CUDA test failed in wmma.jl (#1914)
- Fix deprecation warnings (#1920)
Merged pull requests:
- CUSOLVER: Fix workspace size passing. (#1890) (@maleadt)
- Lovelace fixes (#1894) (@maleadt)
- Update manifest (#1897) (@github-actions[bot])
- Reverse with multiple dimensions (#1899) (@RainerHeintzmann)
- Restrict number of test jobs based on available memory. (#1900) (@maleadt)
- Avoid unneeded macros to cut down on generated code (#1905) (@maleadt)
- Avoid unneeded macros to cut down on generated code (#1906) (@maleadt)
- Update manifest (#1907) (@github-actions[bot])
- Bump GPUCompiler. (#1908) (@maleadt)
- Don't use Float64 atomics on unsupported platforms. (#1912) (@maleadt)
- Report package versions as part of versioninfo(). (#1913) (@maleadt)
- Align variables in constant memory by 256 bit (#1915) (@Zentrik)
- Add norm functions for 3 floats (#1916) (@Zentrik)
- cuDNN: only choose conv algorithms if they match descriptor mathType (#1917) (@ToucheSir)
- Update manifest (#1918) (@github-actions[bot])
- Skip Integer WMMA tests on older devices. (#1919) (@maleadt)
Published by github-actions[bot] about 3 years ago
CUDA - v4.2.0
CUDA v4.2.0
Closed issues:
- NVTX: consider using Start/End for ranges (#1485)
- Limitations of CuIterator (#1768)
- Testing fails on unsupported devices. (#1815)
- Local runtime discovery does not work for external libraries (CUDNN, CUTENSOR) (#1850)
- Passing tests using Github CI workflow errors with libcuda not defined (#1867)
- Cannot precompile GPU code with SnoopPrecompile (#1870)
- Incorrect kernel execution with bounds checking using Julia 1.9.0-rc2 (#1875)
- Fake CUDA library (#1879)
- Error thrown when launching Julia with Nsight systems or compute. (#1886)
- Cannot construct CuDeviceArray (#1887)
- Incorrect colVal array when using CuSparseMatrixCSR command on sparse matrix (#1888)
Merged pull requests:
- Use adapt symmetrically in CuIterator (#1769) (@mcabbott)
- Allow but warn when testing on not fully-supported devices. (#1818) (@maleadt)
- Support runtime discovery for non-toolkit libraries (CUTENSOR, CUDNN, CUQUANTUM) (#1858) (@mloubout)
- Add KernelAbstractions.jl unsafe_free! (#1863) (@pxl-th)
- Allow precompiling CUDA code. (#1865) (@maleadt)
- Assert CUDA.jl is functional when creating the TLS. (#1868) (@maleadt)
- Update manifest (#1871) (@github-actions[bot])
- Don't collect AbstractQ objects in tests (#1872) (@dkarrasch)
- Add compatibility entry for Lovelace (#1873) (@xaellison)
- remove some type-piracy from cusparse (#1876) (@vtjnash)
- Remove more unneeded ndims methods. (#1878) (@maleadt)
- Guard the initialization-time CUDA driver check in a try/catch. (#1881) (@maleadt)
- Update manifest (#1882) (@github-actions[bot])
- Update CUDA 12.1 to 12.1.1. (#1883) (@maleadt)
- Use atomics for allocation statistics. (#1884) (@maleadt)
- Fix atomic increment of alloc stats. (#1885) (@maleadt)
- Update manifest (#1889) (@github-actions[bot])
Published by github-actions[bot] about 3 years ago
CUDA - v4.1.4
CUDA v4.1.4
Closed issues:
- Buggy precompilation of __init__-defined symbols can break CUDA_Driver_jll initialization (#1798)
- Calling CUDA.set_runtime_version!() with float parameter makes CUDA.jl unusable. (#1831)
- Unexpected memory allocation when using randn! (#1856)
- The memory copy speed seems to exceed the hardware limit (#1860)
- PCG produces different output on GPU (via Krylov.jl) (#1864)
Merged pull requests:
- Fix system_driver_version on platforms not supported by CUDA_Driver_jll. (#1854) (@maleadt)
- Update manifest (#1861) (@github-actions[bot])
Published by github-actions[bot] about 3 years ago
CUDA - v4.1.3
CUDA v4.1.3
Closed issues:
- CUDA.versioninfo() triggers download of lazy artifacts (#1844)
Merged pull requests:
- Choose parallel tests based on CPUs, not threads. (#1842) (@maleadt)
- Adapt to LLVM.jl 5 and GPUCompiler.jl 0.19. (#1847) (@maleadt)
Published by github-actions[bot] about 3 years ago
CUDA - v4.1.0
CUDA v4.1.0
Closed issues:
- ERROR: LoadError: bin\cublas64_11.dll when installing CUDA (#1750)
- System-wide CUDA in LD_LIBRARY_PATH breaks CUBLAS (#1755)
- CuDeviceTexture getindex breaks when executed on the CPU (#1757)
- cuDNN.version can cause Julia to crash, missing `cudnn_ops_infer64_8.dll` (#1777)
- cuDNN compile error "ERROR: LoadError: ArgumentError: invalid version string: local" (#1783)
- "Error: No CUDA Runtime library found" for ≥v4.0.0 (#1808)
- sqrt broken in kernels 'Format of nvvmreflect function not recognized' (#1817)
Merged pull requests:
- Add support for CUDA 12.0. (#1742) (@maleadt)
- Add more fixes and tests for CUDA toolkit 12.0 (#1756) (@amontoison)
- Update manifest (#1758) (@github-actions[bot])
- Fix test/cusparse/interfaces.jl (#1762) (@amontoison)
- Simplify the function sig. (#1763) (@N5N3)
- Update manifest (#1770) (@github-actions[bot])
- Make versioninfo() resilient against NVML EPERM. (#1771) (@maleadt)
- Move CUDAKernels to CUDA.jl (#1772) (@vchuravy)
- [CUSPARSE] Improve conversion and tests between sparse matrices (#1774) (@amontoison)
- Use geam for + and - operations with CuMatrix{<:CublasFloat} (#1775) (@amontoison)
- Update manifest (#1776) (@github-actions[bot])
- Update manifest (#1781) (@github-actions[bot])
- Update manifest (#1784) (@github-actions[bot])
- [CUSPARSE] Update preconditioners.jl (#1785) (@amontoison)
- [CUSOLVER] Avoid the conversion to CSR format for reordering routines (#1786) (@amontoison)
- Bump GPUCompiler. (#1787) (@maleadt)
- Remove unneeded variable. (#1788) (@maleadt)
- [CUSPARSE] Update conversions.jl (#1791) (@amontoison)
- Update to CUDNN 8.8.1 for CUDA 12 compatibility. (#1792) (@maleadt)
- Add support for CUDA 12.1 (#1793) (@maleadt)
- [CUSPARSE] Interface color reordering (#1794) (@amontoison)
- [CUSPARSE] Interface gtsv2 (#1795) (@amontoison)
- Update manifest (#1796) (@github-actions[bot])
- Adapt to GPUCompiler 0.18 (#1799) (@maleadt)
- Follow Array's behavior when initializing (#1800) (@lcw)
- [CUSOLVER] Support A \ b for rectangular matrices (#1802) (@amontoison)
- Use symbols instead of values when emitting code, when possible. (#1804) (@maleadt)
- Refactor CI pipeline a little. (#1805) (@maleadt)
- [CUSOLVER] Improve the dispatch for LAPACK routines (#1806) (@amontoison)
- Diagonal for lower triangular of LU decomposition set incorrectly (#1813) (@tgymnich)
- CompatHelper: add new compat entry for "KernelAbstractions" at version "0.9" (#1824) (@github-actions[bot])
- Rebuild CUPTI API with support for STRUCT_SIZE (#1827) (@vchuravy)
- Release CUDA 4.1 (#1828) (@vchuravy)
Published by github-actions[bot] about 3 years ago
CUDA - v4.0.1
What's Changed
- Warn when using old devices by @maleadt in https://github.com/JuliaGPU/CUDA.jl/pull/1752
- Silence some errors to support conditional use. by @maleadt in https://github.com/JuliaGPU/CUDA.jl/pull/1754
Full Changelog: https://github.com/JuliaGPU/CUDA.jl/compare/v4.0.0...v4.0.1
Published by vchuravy over 3 years ago
CUDA - v4.0.0
CUDA v4.0.0
Closed issues:
- Missing implementation of right multiply for QR decomposition (#1738)
- [CUSPARSE] Type error with mm! (#1743)
Merged pull requests:
- Implement rmul for qr. (#1739) (@maleadt)
- Update manifest (#1741) (@github-actions[bot])
- Update CUSPARSE for CUDA v12.0 (#1744) (@amontoison)
- Fix nvprof command (#1745) (@lucifer1004)
- Update manifest (#1747) (@github-actions[bot])
- Fix grammar (#1748) (@lucifer1004)
Published by github-actions[bot] over 3 years ago
CUDA - v3.13.1
CUDA v3.13.1
Closed issues:
- CUDA.jl cuFFT underperforming against CuPy cuFFT (#1682)
- Is block-spmm supported? (#1736)
Merged pull requests:
- Introduce cuFFT plan cache; switch to auto-managed memory. (#1734) (@maleadt)
- Stop pirating GPUArrays' RNG methods. (#1735) (@maleadt)
Published by github-actions[bot] over 3 years ago
CUDA - v3.12.2
CUDA v3.12.2
Closed issues:
- CUDA.jl cuFFT underperforming against CuPy cuFFT (#1682)
- Error during CUDA test (#1718)
- Kernel error from bad broadcast (should be regular error?) (#1720)
- Freeze into StackOverflow when JULIA_DEBUG=CUDA set (#1721)
- Use of linear operators in CUDA.jl (#1727)
- Is block-spmm supported? (#1736)
Merged pull requests:
- Allow copy(::RNG) (#1719) (@mcabbott)
- Update manifest (#1722) (@github-actions[bot])
- Simplify CuError rendering before library initialization. (#1723) (@maleadt)
- Simplify CuError rendering before library initialization (master branch version) (#1724) (@maleadt)
- Make device RNG test more robust. (#1725) (@maleadt)
- Rely on LLVM.jl's typed_ccall for more intrinsics. (#1728) (@maleadt)
- Backports for 3.13 (#1729) (@maleadt)
- Simplify CUBLAS and CUSPARSE wrappers, reducing code generated. (#1730) (@maleadt)
- Add Julia 1.9 CI. (#1731) (@maleadt)
- Use released dependencies. (#1732) (@maleadt)
- Remove NVTX. (#1733) (@maleadt)
- Introduce cuFFT plan cache; switch to auto-managed memory. (#1734) (@maleadt)
- Stop pirating GPUArrays' RNG methods. (#1735) (@maleadt)
Published by github-actions[bot] over 3 years ago
CUDA - v3.13.0
CUDA v3.13.0
Closed issues:
- Error during CUDA test (#1718)
- Kernel error from bad broadcast (should be regular error?) (#1720)
- Freeze into StackOverflow when JULIA_DEBUG=CUDA set (#1721)
- Use of linear operators in CUDA.jl (#1727)
Merged pull requests:
- Allow copy(::RNG) (#1719) (@mcabbott)
- Update manifest (#1722) (@github-actions[bot])
- Simplify CuError rendering before library initialization. (#1723) (@maleadt)
- Simplify CuError rendering before library initialization (master branch version) (#1724) (@maleadt)
- Make device RNG test more robust. (#1725) (@maleadt)
- Rely on LLVM.jl's typed_ccall for more intrinsics. (#1728) (@maleadt)
- Backports for 3.13 (#1729) (@maleadt)
- Simplify CUBLAS and CUSPARSE wrappers, reducing code generated. (#1730) (@maleadt)
- Add Julia 1.9 CI. (#1731) (@maleadt)
- Use released dependencies. (#1732) (@maleadt)
- Remove NVTX. (#1733) (@maleadt)
Published by github-actions[bot] over 3 years ago
CUDA - v3.12.1
CUDA v3.12.1
Closed issues:
- Accumulate doesn't work on >=4 dim Arrays with dims <= ndims(A) - 3 (#1039)
- CUSPARSE does not support dense-sparse matrix multiplication (#1403)
- Scalar indexing when comparing a CuArray to the identity matrix (#1557)
- CUBLAS_STATUS_NOT_INITIALIZED (#1567)
- LinearAlgebra./ and LinearAlgebra.\ breaks CuArray (#1568)
- Window size in grid-stride loop (#1573)
- Matrix multiplication works for primitive and non-primitive custom number types on the CPU, but it fails for primitive custom number types on the GPU. (#1574)
- CuIterator doesn't specify IteratorSize but has no length() (#1583)
- Garbage collection doesn't work as shown in the documentation (#1586)
- Adding sparse adjoint results in kernel error (#1591)
- sparse - sparse matrix multiplication partially missing (#1599)
- FastMath sincos(), cis(), exp(im..) aren't as fast as C++ (#1606)
- wrong type in wrapper of a cusolver function (#1621)
- Adding CUDNN support for 3D convolutions/cross-correlations (#1631)
- copyto! does not work between a CuArray and a view(Array) (#1634)
- Minor issue with sparse function (#1641)
- Scalar indexing when displaying Diagonal{Int64, CuSparseVector{Int64, Int32}} (#1645)
- Many errors running test suite on GTX 960 4GB (#1650)
- Driver discovery broken on platforms without compat driver (#1653)
- Aliasing/Polluted Result from rfftplan for Float32 2^n 3D array (#1656)
- Re-instate memory limit (#1670)
- Split libnvToolsExt from CUDA_Runtime_jll? (#1672)
- accumulate(op, a) causes scalar indexing (#1680)
- CUSPARSE CI failures (#1692)
- axpy! for nested base types (reshapedarray/adjoint/view) (#1696)
- copyto! between a PermutedDimsArray view and a CuArray doesn't work (#1697)
- WMMA test failure (#1700)
- UndefVarError when a binary is not found (#1701)
- Is CUSPARSELT supported? (#1702)
- Best practices to reduce startup time (#1707)
- 1.9 compatibility (#1710)
- WARNING: unused variadic parameters. (#1712)
Merged pull requests:
- Remove/rework CuDeviceArray constructors (#1308) (@maleadt)
- Add always_inline kernel parameter (#1554) (@lcw)
- Update manifest (#1564) (@github-actions[bot])
- Update manifest (#1569) (@github-actions[bot])
- Update manifest (#1571) (@github-actions[bot])
- Fix native RNG window calculation. (#1575) (@maleadt)
- Use Base.active_project. (#1576) (@maleadt)
- Fixes for and tests using JET. (#1577) (@maleadt)
- Update manifest (#1578) (@github-actions[bot])
- Docs, remove global variables in intro benchmark (#1580) (@SteffenPL)
- Update manifest (#1581) (@github-actions[bot])
- Update manifest (#1582) (@github-actions[bot])
- Bugfixes when using \ operator with non square matrices (#1584) (@GVigne)
- remove unbound type parameters (#1585) (@nsajko)
- added --openacc-profiling off to the nvprof (#1587) (@mbeltagy)
- Update manifest (#1588) (@github-actions[bot])
- Wrap at-cuda's code in a let block. (#1589) (@maleadt)
- Revert: Use JET during test suite. (#1590) (@maleadt)
- [CUSPARSE] Update mv! and mm! functions for CuSparseMatrixCOO and CuSparseMatrixCSC (#1592) (@amontoison)
- [CUSPARSE] Add sv! and sm! routines (#1593) (@amontoison)
- CompatHelper: bump compat for "BFloat16s" to "0.3" (#1594) (@github-actions[bot])
- Update wrap.jl (#1595) (@amontoison)
- Provide more useful explanation why an eltype is unsupported. (#1596) (@maleadt)
- CompatHelper: bump compat for "BFloat16s" to "0.4" (#1597) (@github-actions[bot])
- Improve eltype error reporting. (#1598) (@maleadt)
- Add () at the end of the library name in all ccall (#1600) (@amontoison)
- Define length for CuIterator (#1602) (@mcabbott)
- Added more sparse functions like: kron, tril, triu, reshape, adjoint, transpose, sparse-sparse multiplication (#1603) (@albertomercurio)
- Fix rotate! and reflect! for the generic fallback in GPUArrays.jl (#1604) (@amontoison)
- Update manifest (#1605) (@github-actions[bot])
- Update manifest (#1609) (@github-actions[bot])
- [CUSPARSE] Interface generic routines (#1611) (@amontoison)
- [CUSPARSE] Update sparse-sparse GEMM (#1613) (@amontoison)
- [CUSPARSE] Add sddmm! and gemvi! routines (#1615) (@amontoison)
- Update manifest (#1616) (@github-actions[bot])
- Don't use isbitsunion to support structs of union types. (#1617) (@maleadt)
- Update CUDA driver compatibility package to 11.8. (#1618) (@maleadt)
- Update CUDA artifacts to 11.7 Update 1. (#1619) (@maleadt)
- Update to CUDA 11.8 (#1620) (@maleadt)
- Update to CUDNN 8.6. (#1622) (@maleadt)
- Move CUDNN and CUTENSOR into separate packages (#1624) (@maleadt)
- Bump BFloat16s. (#1625) (@maleadt)
- fix #1621 (#1626) (@jemiryguo)
- Restore functionality of FastMath.sincos. (#1627) (@maleadt)
- Update manifest (#1628) (@github-actions[bot])
- Switch from manual artifact handling to automated JLLs (#1629) (@maleadt)
- [CUSPARSE] Add CuMatrix * CuSparseMatrix products (#1632) (@amontoison)
- Silence some test warnings. (#1635) (@maleadt)
- Update CUTENSOR to v1.6 (#1636) (@maleadt)
- [CUSPARSE] Add SparseMatrix * SparseVector products (#1637) (@amontoison)
- Upgrade CUSTATEVEC to v1.1 (#1638) (@maleadt)
- Upgrade CUTENSORNET to v1.1 (#1639) (@maleadt)
- [CUSPARSE] Add CuSparseVector ± CuSparseVector (#1640) (@amontoison)
- CompatHelper: add new compat entry for "Preferences" at version "1" (#1642) (@github-actions[bot])
- Fix #1641 (#1643) (@amontoison)
- Update manifest (#1646) (@github-actions[bot])
- [CUSPARSE] Add dot(CuSparseVector,CuVector) and vice-versa (#1647) (@amontoison)
- [CUSPARSE] Add ldiv! for CuSparseMatrixCOO and geam for CuSparseMatrixCSC (#1648) (@amontoison)
- Update autogenerated headers (#1649) (@maleadt)
- Remove deprecations (#1651) (@maleadt)
- Don't warn about the old JULIA_CUDA_USE_BINARYBUILDER env var when using preferences (#1652) (@maleadt)
- Update CUTENSORNET to use new slice group (#1654) (@kshyatt)
- [CUSPARSE] Fix conversions between CuSparseMatrixCOO and CuSparseMatrixCSC (#1655) (@amontoison)
- Include compiler options in error log. (#1657) (@maleadt)
- Discover the system driver when CUDA_Driver_jll isn't available. (#1658) (@maleadt)
- Preserve buffer type when adapting to CuArray. (#1659) (@maleadt)
- Update manifest (#1661) (@github-actions[bot])
- Extend conversion of QRPackedQ object to CuArray (#1662) (@GVigne)
- [CUSPARSE] Add CuSparseMatrixCSC * CuSparseMatrixCSC (#1663) (@amontoison)
- Update manifest (#1665) (@github-actions[bot])
- [CUSPARSE] Add more tests (#1668) (@amontoison)
- Update manifest (#1671) (@github-actions[bot])
- Update manifest (#1676) (@github-actions[bot])
- Fix eigen when using Hermitian or Symmetric matrices (#1677) (@GVigne)
- Update manifest (#1679) (@github-actions[bot])
- adding defaults for accumulate(op, a) with modified code from Base.accumulate (#1681) (@leios)
- Add right division operator for Diagonal matrices (#1683) (@GVigne)
- Update manifest (#1686) (@github-actions[bot])
- Bump CUQUANTUM libraries (#1688) (@maleadt)
- typo (#1689) (@ArnoStrouwen)
- Retry CUSOLVER handle creation when encountering an internal error. (#1691) (@maleadt)
- Fix #1692 (#1693) (@amontoison)
- Update manifest (#1694) (@github-actions[bot])
- [CUSPARSE] Support kron with Diagonal arguments (#1695) (@albertomercurio)
- Re-introduce memory limits. (#1698) (@maleadt)
- Adapt to GPUCompiler changes. (#1699) (@maleadt)
- WMMA: Don't wrap fragments of size 1 in a struct. (#1704) (@maleadt)
- Update manifest (#1708) (@github-actions[bot])
- Use plain llvmcall calling convention for WMMA intrinsics. (#1709) (@maleadt)
- Reclaim in cuDNN conv algorithm search (#1711) (@ToucheSir)
- CUBLAS: test against generic axp(b)y, not the BLAS-specific one. (#1713) (@maleadt)
- Fix LU getproperty invoke. (#1714) (@maleadt)
- Backports for 3.12.1 (#1715) (@maleadt)
- Specialize cholcopy to avoid scalar indexing. (#1716) (@maleadt)
- Fix handling of inline-allocated structures with unions. (#1717) (@maleadt)
Published by github-actions[bot] over 3 years ago
CUDA - v3.12.0
CUDA v3.12.0
Closed issues:
- Implement Base.repeat (#177)
- repeat performs scalar indexing for multi-dimensional arrays (#1051)
- The GPU compiler fails on a call to maximum (#1548)
- versioninfo triggers artifact downloads (#1549)
- Error when broadcasting composed functions (#1550)
- overload Base.copy! for AbstractGPUArray{<:Any,1} (#1555)
Merged pull requests:
- Fix math quirk. (#1546) (@maleadt)
- Wrap cusolverRf.h and cusolverSp_LOWLEVEL_PREVIEW.h (#1547) (@frapac)
- Update manifest (#1551) (@github-actions[bot])
- tighten unsafe_wrap signature on scalar length (#1552) (@sjkelly)
- Update Documenter key. (#1553) (@maleadt)
- Update manifest (#1556) (@github-actions[bot])
- Import factorisation internal types from LinearAlgebra (#1558) (@theabhirath)
- Update manifest (#1560) (@github-actions[bot])
- add reshape for CuDeviceArray (#1561) (@omlins)
Published by github-actions[bot] almost 4 years ago
CUDA - v3.11.0
CUDA v3.11.0
Closed issues:
- CUSPARSE: Diagonal + CSC/CSR gives dense array (#1469)
- CUBLAS: Multiplication of UpperTriangular/LowerTriangular not supported (#1486)
- CUTENSOR tests consume lots of memory, breaking other tests (#1501)
- CUFFT doesn't work for ComplexF64 C2C in-place (#1519)
- Inconsistency of == and isequal for CuArray (#1524)
- Setting CUDA seed the first time changes Random's RNG non-deterministically (#1526)
- Undefined exported symbols (#1527)
- Could not load library libLLVMExtra-14.dll (#1535)
- Add an rrule for cholesky to CUDA.jl (#1541)
Merged pull requests:
- specialize +/- op for sparse diag (#1514) (@Roger-luo)
- Make sure instantiating RNGs doesn't affect the global CPU RNG. (#1530) (@maleadt)
- Update manifest (#1531) (@github-actions[bot])
- ldiv! for LU Decomposition (#1532) (@SBuercklin)
- Lower dmax for contraction tests (#1534) (@kshyatt)
- Fix convolution algorithm search (#1536) (@maxfreu)
- Update manifest (#1537) (@github-actions[bot])
- add specializations for some triangular-triangular multiplications (#1538) (@Red-Portal)
- Add a utility to download artifacts without a functional driver. (#1539) (@maleadt)
- Update manifest (#1543) (@github-actions[bot])
- Explicit tests for type conversion (#1544) (@kshyatt)
- Remove unused exports. (#1545) (@maleadt)
Published by github-actions[bot] almost 4 years ago
CUDA - v3.10.1
CUDA v3.10.1
Closed issues:
- Overflow in randn using CUDA.jl's native RNG (#1464)
- Segmentation fault with pre-compiled library importing CUDA (#1465)
- Julia freezes when using Polynomials with CuArray (#1497)
- Launch overhead regression (#1503)
- CUSOLVER: Matrix division requires identical types (#1512)
- Incorrect distribution for complex standard normals when using CUDA.default_rng() (#1515)
- loggamma (#1528)
Merged pull requests:
- CUSPARSE: Support mixed type mv (#1475) (@Roger-luo)
- Add method for LinearAlgebra.opnorm2 (#1516) (@danielwe)
- Promote to common eltype in matrix division (#1517) (@danielwe)
- Fix Box-Muller transformation for complex eltypes (#1518) (@danielwe)
- Update manifest (#1521) (@github-actions[bot])
- Use at-dispose for LLVM.jl resource cleanup. (#1523) (@maleadt)
- loggamma (#1529) (@cossio)
Published by github-actions[bot] about 4 years ago
CUDA - v3.10.0
CUDA v3.10.0
Closed issues:
- Error while freeing DeviceBuffer-warning when using multiple GPUs (#1454)
- CUDNN cache locking prevents finalizers resulting in OOMs (#1461)
- EOFError from pool_cleanup when closing REPL (#1495)
- TypeError in compiler with custom kernel (#1496)
Merged pull requests:
- expose sparse mv/mm algo selection (#1201) (@Roger-luo)
- Always inspect the task-local context when verifying before freeing. (#1462) (@maleadt)
- support sparse opnorm (#1466) (@Roger-luo)
- Move CUSTATEVEC and CUTENSORNET into lib/ (#1478) (@vchuravy)
- Adapt to GPUCompiler 0.15 changes (#1488) (@maleadt)
- Limit time held by CUDNN locks. (#1491) (@maleadt)
- Docstring for cu (#1493) (@mcabbott)
- Update manifest (#1499) (@github-actions[bot])
- Silence EOFError in pool_cleanup (#1502) (@Octogonapus)
- Adapt to GPUCompiler changes (#1504) (@maleadt)
- Fixes for CUSPARSE 11.7.1. (#1505) (@maleadt)
- Update artifacts (#1507) (@maleadt)
- Update manifest (#1509) (@github-actions[bot])
- Add a new cache for HostKernel objects. (#1510) (@maleadt)
Published by github-actions[bot] about 4 years ago
CUDA - v3.9.1
CUDA v3.9.1
Closed issues:
- Issue with copy_cublasfloat (#1476)
- Errors when broadcasting random number generators (#1480)
- CPU version of linear algebra routine is dispatched when using Zygote.gradient (#1481)
- scan! fails on vectors of structs (#1482)
- InexactError when getting CUDA version info (#1489)
Merged pull requests:
- Allow more integer argument types for byteperm (#1420) (@eschnett)
- support CuSparseMatrix(::Diagonal) (#1470) (@Roger-luo)
- Don't emit debug info until the next CUDA version. (#1473) (@maleadt)
- Update manifest (#1474) (@github-actions[bot])
- Update manifest (#1479) (@github-actions[bot])
- fix unsafe_wrap docstring and widen signature (#1483) (@piever)
- Update manifest (#1484) (@github-actions[bot])
- Check whether cudaRuntimeGetVersion succeeded. (#1490) (@maleadt)
- Update manifest (#1494) (@github-actions[bot])
- Fix #1476: Allow any container in copy_cublasfloat (#1498) (@danielwe)
Published by github-actions[bot] about 4 years ago
CUDA - v3.9.0
CUDA v3.9.0
Closed issues:
- Tests for showing (#35)
- Support LU factorizations (#1193)
- Int8 WMMA not working in 3.8.4 and 3.8.5 despite merged PR. Add more unit tests? (#1442)
- Optional CPU kernel call with @cuda (#1443)
- Add library/artifact management for NCCL (#1446)
- permutedims returns a lowertriangular matrix (#1451)
- New broadcast corrupts memory? (#1457)
- norm does not dispatch on CuSparseMatrixCSC (#1460)
- scalar * sparse multiplication (#1468)
Merged pull requests:
- CUTENSOR: axpy! and axpby! not mutating fixed (#1416) (@yapanuwan)
- Initial wrap of cuquantum (#1437) (@kshyatt)
- CompatHelper: bump compat for "GPUCompiler" to "0.14" (#1441) (@github-actions[bot])
- Fix return type of nrm2 for ComplexF16 (#1444) (@danielwe)
- Use a build matrix. (#1445) (@maleadt)
- Update manifest (#1447) (@github-actions[bot])
- Rework factorizations (#1449) (@maleadt)
- Add NCCL binaries. (#1450) (@maleadt)
- Support general eltypes in matrix division and SVD (#1453) (@danielwe)
- Update manifest (#1456) (@github-actions[bot])
- Look at more environment variables to find nsys. (#1459) (@maleadt)
- Fixes for 1.8 (#1463) (@maleadt)
Published by github-actions[bot] about 4 years ago
CUDA - v3.8.4
CUDA v3.8.4
Closed issues:
- sparse-sparse and sparse-constant multiplication lose sparsity (output dense matrix) (#1264)
- LLVMExtra fails to load on Julia 1.8 and PPC (#1387)
- compute-sanitizer CUDA_ERROR_INVALID_VALUE on CUDA.jl 3.0+ (#1415)
- @cudnnDescriptor is not threadsafe (#1421)
- Precompilation of CUDA 3.8.3 broken on 1.7.1 due to changes in Random123.jl (#1422)
- OOM error should include memory status (#1427)
- WMMA kernel works with Julia 1.7.2 but fails with illegal memory access for Julia 1.8.0-beta1 (#1431)
- Non Int64 local memory size leads to dynamic function invocation (#1434)
- "initialization" test failing (#1435)
- cuda with julia 1.8 not working on windows (working fine(?) on wsl2) (#1436)
Merged pull requests:
- Add Int8 WMMA Support (#1119) (@max-Hawkins)
- Wrap generic sparse-sparse GEMM (#1285) (@kshyatt)
- Fix sparse COO to CSR conversion. (#1412) (@maleadt)
- Drop support for CUDA 10.1 and below (#1414) (@maleadt)
- Update manifest (#1417) (@github-actions[bot])
- Report the OOM memory status at the time of the error. (#1428) (@maleadt)
- Lock CUDNN descriptor cache lookups. (#1430) (@maleadt)
- Switch to new LLVM context management for 1.9 compatibility. (#1432) (@maleadt)
- Update manifest (#1433) (@github-actions[bot])
- Backports for 3.8.4 (#1438) (@maleadt)
Published by github-actions[bot] about 4 years ago
CUDA - v3.8.3
CUDA v3.8.3
Closed issues:
- Sparse matrix addition not working (#528)
- Native implementation of sparse arrays (#829)
- CUSPARSE: Adding a value to the diagonal (#1372)
- Conversion by cu casts Float64 to Float32 but not Int64 to Int32 (#1388)
- CUDA.math_mode!(...; precision) option not working (#1392)
- cuIpcGetMemHandle failure resulting in CUDA-aware MPI to fail (#1398)
- axpby! support for BFloat16 (#1399)
- CUSPARSE does not support integer matrices, breaks printing (#1402)
- sparse(I, J, V) doesn't support unsorted inputs (#1407)
Merged pull requests:
- General purpose broadcast for sparse CSR matrices. (#1380) (@maleadt)
- Update manifest (#1389) (@github-actions[bot])
- Implement sparse operations with UniformScaling using broadcast. (#1390) (@maleadt)
- Prevent toplevel compilation. (#1391) (@maleadt)
- Fix and test math precision. (#1394) (@maleadt)
- Bump artifacts (#1397) (@maleadt)
- support BFloat16 for atomic_cas (#1400) (@bjarthur)
- Implement sparse broadcasting with CSC matrices. (#1401) (@maleadt)
- Always report issues with discovering CUDA. (#1404) (@maleadt)
- Fix sparse 1-argument broadcast output type. (#1405) (@maleadt)
- CUSPARSE BSR improvements (#1409) (@maleadt)
- Support limited sparse integer arrays by bitcasting to floating point. (#1410) (@maleadt)
- Support using sparse with unsorted inputs. (#1411) (@maleadt)
- Backports for 3.8.3 (#1413) (@maleadt)
Published by github-actions[bot] over 4 years ago
CUDA - v3.8.2
CUDA v3.8.2
Closed issues:
- CuSparseMatrixCSC missing lu and interactions with UniformScaling (#79)
- CUSPARSE typo (#1231)
- similar(A::CuSparse,eltype) returns an Array (#1316)
- "errormonitor" undefined in julia1.6 (#1375)
- Pool free can switch tasks (#1384)
Merged pull requests:
- Define a compatibility shim for errormonitor (#1378) (@vchuravy)
- Backport #1361 to 3.8 (#1379) (@vchuravy)
- Backports for 3.8.2 (#1381) (@maleadt)
- Remove broken errormonitor implementation, just don't use it on 1.6. (#1382) (@maleadt)
- Memory pool improvements (#1383) (@maleadt)
Published by github-actions[bot] over 4 years ago
CUDA - v3.8.1
CUDA v3.8.1
Closed issues:
- one(::CuMatrix) result on cpu (#142)
- Broadcasted setindex! triggers scalar setindex! (#101)
- OutOfGPUMemoryError With Available Memory (#1346)
- Distributions.jl with CuArrays (#1347)
- Views of Flux OneHotArrays (#1349)
- synchronize(blocking = false) hangs in julia 1.7 eventually (#1350)
- unsupported call through a literal pointer (call to log1pf) on Julia 1.6.5 (#1352)
- SpecialFunctions ^1.8 compat entry? (#1354)
- Performance deprecation using ^ on Float32 (#1358)
- Method definition setindex!(LinearAlgebra.Diagonal{T, V} ... overwritten in module CUDA (#1364)
- [PackageCompiler] Segmentation fault with CUDA.jl in multiversioning (#1365)
- Vectors in customary structs make julia stuck (#1366)
- sparseCSC-dense matrix multiplication yields unstable results (#1368)
- UndefVarError: parameters not defined on Windows10 (#1371)
Merged pull requests:
- Optimize memoization helpers. (#1345) (@maleadt)
- Update manifest (#1348) (@github-actions[bot])
- Update manifest (#1355) (@github-actions[bot])
- Fastmath improvements (#1356) (@maleadt)
- Make the default pool visible when doing P2P (#1357) (@maleadt)
- Fix resize of empty arrays. (#1359) (@maleadt)
- CUSPARSE: add COO ctors and similar with eltype. (#1360) (@maleadt)
- Add device_override for SpecialFunctions.gamma (#1361) (@vchuravy)
- Implement (limited) broadcast of sparse arrays (#1367) (@maleadt)
- Make nonblocking synchronization robust to errors. (#1369) (@maleadt)
- Update manifest (#1370) (@github-actions[bot])
- Backports for 3.8.1 (#1374) (@maleadt)
Published by github-actions[bot] over 4 years ago
CUDA - v3.7.1
CUDA v3.7.1
Closed issues:
- Moving data between devices (#1136)
- Repeated has_cuda_gpu errors when CUDA_VISIBLE_DEVICES is empty (#1331)
- Error when env var CUDA_VISIBLE_DEVICES is set but empty (#1336)
Merged pull requests:
- Wrap and test peer to peer memory copies (#1284) (@kshyatt)
- Update manifest (#1332) (@github-actions[bot])
- Have libcuda() fail repeatedly if anything (e.g. init) failed. (#1333) (@maleadt)
- Simplify workarounds. (#1334) (@maleadt)
- Properly detect a missing driver. (#1335) (@maleadt)
- Various small fixes (#1337) (@maleadt)
- Move CUDA.jl global state into CUDAdrv wrapper "submodule" (#1338) (@maleadt)
- Add CUDA.return_type (#1339) (@tkf)
- Compute-sanitizer QOL improvements and docs (#1340) (@maleadt)
- Fix regression in backwards CUFFT plans. (#1341) (@maleadt)
- Don't assume host pointers are directly usable on the device. (#1342) (@maleadt)
- Backports for 3.7.1 (#1343) (@maleadt)
Published by github-actions[bot] over 4 years ago
CUDA - v3.7.0
CUDA v3.7.0
Closed issues:
- mul! is missing for plan_fft! (#1311)
- Segfault with CUDA in a sysimage (#1314)
- CuSparse does not support broadcast (#1317)
- CUDA.functional(true) errors instead of printing "why" and returning false (#1318)
- Interesting timings (#1323)
- Synchronization how to? (#1324)
Merged pull requests: - Remove debug info hack. (#1259) (@maleadt) - Update manifest (#1312) (@github-actions[bot]) - CUFFT improvements (#1313) (@maleadt) - Add additional quirks. (#1315) (@maleadt) - Use pointer to async_send directly instead of a wrapper function (#1319) (@vchuravy) - Update manifest (#1325) (@github-actions[bot]) - Add support and test CUDA 11.6. (#1326) (@maleadt) - Bump CUTENSOR, expose libcutensorMg. (#1327) (@maleadt) - Bump CUDNN to v8.3.2. (#1328) (@maleadt) - Enable use of CUDA 11.6. (#1329) (@maleadt)
- Julia
Published by github-actions[bot] over 4 years ago
CUDA - v3.6.3
CUDA v3.6.3
Closed issues:
- CUDA.@atomic deadlocks when overwriting NaN (#1299)
- Unreasonably slow copy kernel (#1301)
- Passing a LogicalIndex(::CuArray) fails (#1304)
Merged pull requests:
- Allow sorting of tuples of numbers (#1196) (@mcabbott)
- Use === for generic atomic updates with compare-and-swap (#1300) (@guyvdbroeck)
- Update manifest (#1302) (@github-actions[bot])
- Store the array length next to its dimensions. (#1303) (@maleadt)
- Disallow calling CUDA device array intrinsics on the host. (#1305) (@maleadt)
- Support logical indexing with CPU sources. (#1306) (@maleadt)
- Activate a context when calling device!. (#1307) (@maleadt)
- Julia
Published by github-actions[bot] over 4 years ago
CUDA - v3.6.2
CUDA v3.6.2
Closed issues:
- Norm of complex-typed CuArray is not real (#1290)
- Calling @show on Symmetric of a CuArray triggers Scalar Indexing (#1294)
- CUSPARSE Error when solving a linear system (#1296)
Merged pull requests: - Correctly handle missing cached_memory. (#1295) (@maleadt) - Update manifest (#1297) (@github-actions[bot])
- Julia
Published by github-actions[bot] over 4 years ago
CUDA - v3.6.1
CUDA v3.6.1
Closed issues: - reduce_block error on Complex type (#1289) - cudnn_cnn_infer64_8 could not be loaded (#1291) - Support to find the first k eigenvalues of a sparse matrix (#1292)
Merged pull requests: - Bump CUDNN artifacts (#1293) (@maleadt)
- Julia
Published by github-actions[bot] over 4 years ago
CUDA - v3.6.0
CUDA v3.6.0
Closed issues:
- Conversion issue (#157)
- Extend new RNG to Complex numbers & normal distributions (#726)
- Fatal errors during sorting tests (#916)
- deepcopy failing (#1202)
- Kernel compilation fails when specifying shared memory array size as a tuple consisting of block dimension and kernel argument (#1205)
- ERROR: LoadError: The artifact at C:\Users\name.julia\artifacts\58bd87695e9ccdb508cb38be1ab717315ecc9152 is empty. (#1209)
- InvalidIRError when displaying a model which is on the GPU (#1212)
- CUDA.jl tries to load CUDA compat loaded via jll even though system package is installed (#1216)
- Synchronizing over blocks (#1220)
- assignment changes random seed (#1226)
- accumulate gives wrong answer when init != 0 (#1227)
- Generic dot kernel: use multiple kernels instead of atomics (#1244)
- integer division error creating CuVector of missing and nothing (#1251)
- unsupported dynamic function invocation with union type of more than 2 elements (#1252)
- three CUDA.@atomic in a row result in out-of-bounds error (#1254)
- Float16 CAS cannot use atom.cas.b16.global on sm_61 (#1258)
- cu(::SVector) gives SVector, cu(::MVector) gives CuArray (#1262)
- Get back `unsafe_copyto!` methods for unified<-unified and unified<->device (#1263)
- Passing and using a FFT plan in a CUDA kernel seems impossible (#1266)
- Inplace Complex FFT and Threads (#1268)
- `sort` returns nothing (#1270)
- Release a new version (#1276)
- `init_driver` not called in 3.5 (#1280)
- Shared memory does not support isbits unions. (#1281)
- NVIDIA Nsight Systems and `CUDA.@profile` error (#1282)
- nvprof with `using CUDA` crashes julia (#1283)
Merged pull requests:
- Addition over CuSparseMatrix (#1195) (@yuehhua)
- [CUSOLVER] Add ordering functions (#1198) (@amontoison)
- Correctly handle multi-GPU instances with NVML. (#1199) (@maleadt)
- CI improvements. (#1200) (@maleadt)
- fix FFT workarea typo leading to memory corruption (#1204) (@marius311)
- Update manifest (#1206) (@github-actions[bot])
- Minor improvements for library wrappers (#1207) (@maleadt)
- Various small improvements (#1210) (@maleadt)
- Extend CuDeviceArray ctors for mixed-int indices. (#1211) (@maleadt)
- Deprecate non-blocking sync, and always call the synchronization API. (#1213) (@maleadt)
- Generic CUSPARSE: use the index arguments. (#1214) (@maleadt)
- Add bitonic sort implementation (#1217) (@xaellison)
- Update manifest (#1218) (@github-actions[bot])
- Reverted deepcopy, added test (#1221) (@birkmichael)
- Use broadcast instead of copies to initialize mapreduce buffers. (#1223) (@maleadt)
- Remove some unneeded Base module prefixes. (#1224) (@maleadt)
- Update manifest (#1225) (@github-actions[bot])
- Cherry-picked improvements (#1228) (@maleadt)
- Update introduction.jl (#1232) (@aramirezreyes)
- Update manifest (#1233) (@github-actions[bot])
- Fix SpMV for CUDA 11.5 (#1234) (@amontoison)
- Add support for randn and randexp. (#1236) (@maleadt)
- Avoid double-initializing partial accumulate results. (#1237) (@maleadt)
- Fix cuTENSOR contractions not working for FP16 inputs (#1238) (@thomasfaingnaert)
- Bump CUTENSOR and fix on CUDA 11.5 (#1239) (@maleadt)
- Support dot product on GPU between CuArrays with inconsistent eltypes (#1240) (@findmyway)
- Update manifest (#1241) (@github-actions[bot])
- Optimize CUTENSOR contraction. (#1243) (@maleadt)
- Don't use nondeterministic atomics in dot when requested. (#1245) (@maleadt)
- Remove CUBLAS decomposition tests without pivoting. (#1246) (@maleadt)
- Update manifest (#1247) (@github-actions[bot])
- wrap CUBLAS spmv and spr (#1248) (@bjarthur)
- CompatHelper: bump compat for "SpecialFunctions" to "2" (#1249) (@github-actions[bot])
- Update manifest (#1250) (@github-actions[bot])
- Store array offset as elements to fix all-singleton case. (#1255) (@maleadt)
- Update CUDA to 11.5 Update 1. (#1256) (@maleadt)
- Use Base functionality for iteration Union type components. (#1257) (@maleadt)
- Bump CI to Julia 1.7. (#1260) (@maleadt)
- Update manifest (#1261) (@github-actions[bot])
- Use CUDA APIs for unoptimized copies. (#1265) (@maleadt)
- Bump CUDNN to 8.3.1, enable CUDA 11.5 by default. (#1267) (@maleadt)
- Adding stream update for inplace complex FFT (#1269) (@ovanvincq)
- Fix sort! return type. (#1272) (@maleadt)
- Add const keyword to type aliases declarations. (#1273) (@eliascarv)
- Update manifest (#1274) (@github-actions[bot])
- Avoid eager expansion of CUDAcompat artifact string. (#1275) (@maleadt)
- Allow copies between unified arrays in different contexts. (#1277) (@maleadt)
- fix zeros and ones for user defined types (#1278) (@GiggleLiu)
- Make CUDNN depend on CUBLAS. (#1279) (@maleadt)
- Update manifest (#1286) (@github-actions[bot])
- Restore call to init_driver. (#1287) (@maleadt)
- Improvements for isbits union shared memory (#1288) (@maleadt)
- Julia
Published by github-actions[bot] over 4 years ago
CUDA - v3.5.0
CUDA v3.5.0
Closed issues:
- Illegal memory access on 3.3 (#975)
- Forward compatibility (#1071)
- ambiguous sparse constructor (#1088)
- Map reduce with float 16 (#1124)
- Allow invalid GPU pointers not allowed in unsafe_wrap (#1125)
- Scalar Indexing error in the Introduction docs (#1127)
- stackoverflow when printing a custom subtype of AbstractCuSparseMatrix (#1128)
- missing rand methods (#1138)
- Error mapreducing over a 0 dimensional array (#1141)
- seed! is not thread safe (#1158)
- Simplify Int32-based indices (#1160)
- Concatenating a scalar to a CuArray gives an Array (#1162)
- Calling `byte_perm` with `Int32` values inserts sign checks (#1165)
- `sum!` does not compile for large arrays (#1169)
- Same random sequence on GPU and CPU? (#1170)
- Specifying eltype and buffer type when adapting to `CuArray`? (#1171)
- Inefficient `lop3.lut` instructions generated (#1172)
- Writing temporary PTX files can fail (#1173)
- Switching devices doesn't switch the REPL's output task (#1175)
- GC is not working for CuSparseMatrixCSR (#1178)
- sparse*dense operations shouldn't drop sparseness (#1188)
- Raises illegal memory access error randomly (#1189)
Merged pull requests:
- CI fixes (#950) (@maleadt)
- implement sparse (#1093) (@CarloLucibello)
- Use the kernel state object to pass the exception flag location. (#1110) (@maleadt)
- Update manifest (#1123) (@github-actions[bot])
- Improve show methods in sparse GPU arrays. (#1129) (@maleadt)
- Use warp intrinsics for a wider range of reductions. (#1130) (@maleadt)
- Support wrapping a host buffer with a CuArray (#1131) (@maleadt)
- support transpose CSC to CUDA CSR (#1132) (@Roger-luo)
- Small improvements to discovery of local toolkits. (#1134) (@maleadt)
- Rework device and context getters. (#1135) (@maleadt)
- Avoid memory operations during graph capture. (#1137) (@maleadt)
- Streamline the random number interface. (#1146) (@maleadt)
- Native device synchronization (#1147) (@maleadt)
- support interpret(reshape) (#1149) (@Roger-luo)
- add a gitignore (#1150) (@Roger-luo)
- Fix normalize on complex number (#1151) (@maleadt)
- Addition and multiplication over cuarray and cusparse (#1152) (@maleadt)
- Preserve Int32 hardware indices (#1153) (@maleadt)
- remove mutable to make device sparse type bitstype (#1154) (@Roger-luo)
- Update manifest (#1155) (@github-actions[bot])
- CompatHelper: bump compat for "BFloat16s" to "0.2" (#1156) (@github-actions[bot])
- Perform actual synchronization API calls when we need the memory (#1157) (@maleadt)
- Binary dependency changes (#1159) (@maleadt)
- Bump dependencies. (#1161) (@maleadt)
- Generalize Sparse Array Indices Type in Struct Def (#1163) (@Roger-luo)
- Use unchecked type conversions for byte_perm arguments (#1166) (@eschnett)
- Fix performance regressions (#1167) (@maleadt)
- Fix big mapreduce kernel for inputs without neutral element. (#1174) (@maleadt)
- Switch contexts before performing memory operations on arrays (#1176) (@maleadt)
- Improvements to stream-ordered memory management (#1177) (@maleadt)
- Update manifest (#1180) (@github-actions[bot])
- Consistently use chars instead of raw enums in CUSPARSE/CUSOLVER functions. (#1181) (@maleadt)
- Implement forward compatibility (#1182) (@maleadt)
- Bump GPUCompiler for 1.8 compat. (#1183) (@maleadt)
- Bump GPUArrays. (#1186) (@maleadt)
- Update documentation (#1187) (@maleadt)
- Julia
Published by github-actions[bot] over 4 years ago
CUDA - v3.4.2
CUDA v3.4.2
Closed issues: - Broadcasting a datatype does not work (#261) - CUDA error: invalid argument during Zygote/Flux gradient computation (#1107) - EXCEPTION_ACCESS_VIOLATION when using shared memory allocations. (#1116)
Merged pull requests: - add symmetric support for mul (#217) (@Roger-luo) - adds a device array type for CuSparseMatrixCSR to support using it in kernel functions (#1106) (@Roger-luo) - Update manifest (#1108) (@github-actions[bot]) - Specialize Ref{<:Type} for GPU compatibility. (#1109) (@maleadt) - Use the documented version of the enable_finalizers API. (#1111) (@maleadt) - Don't embed the method table in the AST. (#1112) (@maleadt) - Remove the hacky unique'ing of shmem GVs. (#1114) (@maleadt) - Introduce a macro for marking multiple functions as device-only. (#1117) (@maleadt) - Simplify library loading. (#1121) (@maleadt) - Backports for 3.4.2 (#1122) (@maleadt)
- Julia
Published by github-actions[bot] almost 5 years ago
CUDA - v3.4.1
CUDA v3.4.1
Closed issues: - cudnnFindConvolutionAlgorithmWorkspaceSize uses removed function cached_memory (#1101)
Merged pull requests: - Update manifest (#1102) (@github-actions[bot]) - Release hotfixes (#1103) (@maleadt) - Reverse CI for NNlibCUDA.jl (#1104) (@maleadt)
- Julia
Published by github-actions[bot] almost 5 years ago
CUDA - v3.3.6
CUDA v3.3.6
Closed issues: - LinearAlgebra.mul! with scalar arguments triggers scalar iteration (#790) - Kernel fails if input is struct with function (#1094) - cusparse: sparse matrix - matrix multiplication broken with transpose operation (#1095)
Merged pull requests: - lib cusparse: fix #1095 (broken sparse matrix-matrix multiplication with transpose operation) (#1096) (@frapac) - Only export the atomic macro on 1.6. (#1097) (@maleadt) - Support more inplace atomic operations. (#1098) (@maleadt) - Backports for 3.3.6 (#1099) (@maleadt)
- Julia
Published by github-actions[bot] almost 5 years ago
CUDA - v3.3.5
CUDA v3.3.5
Closed issues:
- Integer division error for the product of sparse times empty matrices (#962)
- Bad conversion from QR to CuArray (#969)
- Errors during installation test (#1004)
- Be explicit about imports (#1028)
- Exponentiation with constants can produce bad GPU code compared to the CPU (#1031)
- rem uses wrong intrinsic (#1040)
- test Cuda fails on gpuarrays\reductions/minimum maximum (#1043)
- Broadcasted type conversion on literal value doesn't work (#1044)
- CUDA overrides somehow screwing up customized printing? (#1055)
- Is it possible to copy any data into GPU via recursive CuDeviceArray construction? (#1057)
- CUDA doesn't compile after upgrade to Julia 1.6.2 (#1065)
- Timing discrepancy between CUDA.@time and Benchmarktools for Flux model (#1067)
- cannot convert range to CuArray (#1070)
- Thread safety issue with gemv! (#1072)
- CuSparseMatrixCSC conversion errors (#1075)
- cublasHgemmStridedBatched (#1076)
- ERROR: UndefKeywordError: keyword argument elements not assigned (#1077)
- Support for generating Float16 random numbers (#1081)
- Illegal memory access during complex exponential with large imaginary part as exponent (#1085)
- "Error: CUDA.jl does not yet support CUDA with ptxas 11.3.109" when using "JULIA_CUDA_USE_BINARYBUILDER=false" (#1089)
Merged pull requests: - Add support for unified arrays. (#1023) (@maleadt) - Look for libcuda in more places. (#1030) (@maleadt) - Detect common integer exponentiations and handle them directly. (#1033) (@maleadt) - Allow strided inputs to various library functions. (#1038) (@maleadt) - Use correct intrinsics for rem (#1041) (@simonbyrne) - update Package Manager link (#1052) (@ehgus) - Update manifest (#1054) (@github-actions[bot]) - Add test for math_mode (#1056) (@kshyatt) - Streamline atomics. (#1059) (@maleadt) - Add support for device capability-dependent code. (#1060) (@maleadt) - Adapt to GPUArrays changes. (#1061) (@maleadt) - Add special constructors to work around Base AbstractQ size weirdness. (#1063) (@maleadt) - Update manifest (#1064) (@github-actions[bot]) - Small allocator improvements (#1068) (@maleadt) - Latency improvements (bis) (#1069) (@maleadt) - lib: cusparse: fix #962 (#1073) (@thazhemadam) - Make handle cache thread-safe. (#1074) (@maleadt) - Bump GPUCompiler. (#1079) (@maleadt) - add support for half-precision gemm (#1080) (@bjarthur) - Extend and switch to the new CUDA RNG (#1082) (@maleadt) - cusparse: fix conversion from sparse matrix to dense matrix (#1083) (@maleadt) - Support/bump for CUDA 11.4.1 and CUDNN 8.2.2 (#1084) (@maleadt) - Use sincos from libdevice to prevent illegal global load. (#1086) (@maleadt) - Bump GPUCompiler; use our own opt pipeline. (#1087) (@maleadt) - Update manifest (#1090) (@github-actions[bot]) - Backports for 3.3.5 (#1091) (@maleadt)
- Julia
Published by github-actions[bot] almost 5 years ago
CUDA - v3.3.4
CUDA v3.3.4
Closed issues: - Cholesky on 1.8 doesn't dispatch correctly (#1046)
Merged pull requests: - restore lost tests (#1042) (@vchuravy) - Base.unsafe_length is deprecated on 1.8 (#1045) (@vchuravy) - Update manifest (#1048) (@github-actions[bot]) - Fix cholesky on 1.8, fix #1046 (#1049) (@kshyatt) - Backport changes for 3.3.4 (#1050) (@vchuravy)
- Julia
Published by github-actions[bot] almost 5 years ago
CUDA - v3.3.3
CUDA v3.3.3
Merged pull requests: - Adapt to LLVM changes. (#1022) (@maleadt) - Update manifest (#1029) (@github-actions[bot]) - just some simple printing tests (#1032) (@kshyatt) - Test for is_capturing (#1034) (@kshyatt) - Tests for buffer printing (#1035) (@kshyatt) - Make it possible to change the pool alloc and handle types. (#1036) (@maleadt) - Backports for 3.3 (#1037) (@maleadt)
- Julia
Published by github-actions[bot] almost 5 years ago
CUDA - v3.3.2
CUDA v3.3.2
Closed issues: - Missing artifacts errors (#1003) - Relax restriction on types allowed in kernels? (#1005) - PPC: Atomic{Float64} is not supported (#1008) - Unexpected result in combination with Zygote.gradient() (#1019) - Both ExprTools and LLVM export "parameters"; uses of it in module CUDA must be qualified (#1025)
Merged pull requests: - Fixes for artifact loading. (#1006) (@maleadt) - dlopen CUBLAS before CUTENSOR. (#1007) (@maleadt) - Use a plain integer to keep track of pool last use time. (#1009) (@maleadt) - More fixes to artifact discovery. (#1010) (@maleadt) - add custom structs tutorial (#1011) (@jw3126) - big mapreduce performance (#1012) (@xaellison) - Fixes for Julia 1.7 (#1013) (@maleadt) - Update manifest (#1014) (@github-actions[bot]) - Remove memory pools (#1015) (@maleadt) - Move refcounting to an array storage type (#1016) (@maleadt) - Remove unneeded disambiguation method. (#1017) (@maleadt) - Simplify context validity check. (#1018) (@maleadt) - Improve LazyInitialized (#1020) (@maleadt) - More allocator clean-ups (#1021) (@maleadt) - CUDA 11.4 (#1024) (@maleadt) - Only import from ExprTools what we need. (#1026) (@maleadt) - Backports release 3.3 (#1027) (@maleadt)
- Julia
Published by github-actions[bot] almost 5 years ago
CUDA - v3.3.1
CUDA v3.3.1
Closed issues:
- Reclaim with stream-ordered allocator (#952)
- possible hanging with CUDA.@profile? (#961)
- Upgrading from v3.2.1 to v3.3.0 broke my installation (#970)
- Calls to has_cudnn running on wrong CuDevice? (#978)
- Test does not run on MIT Supercloud after upgrading to 3.3.0 (#980)
- Performance issue with complicated loops in function (#984)
- Is it possible to set cache config in CUDA.jl? (#988)
- @atomic should perform type conversions (#989)
- Compatible NVIDIA driver but still got compatibility warning (#1001)
Merged pull requests: - Update manifest (#971) (@github-actions[bot]) - Fix disambiguation of CUDA 11.1 using CUSOLVER. (#972) (@maleadt) - Simplify initialization helper macro. (#973) (@maleadt) - Move at-typed_ccall to LLVM.jl. (#976) (@maleadt) - Replace workspace macro with function (#981) (@maleadt) - Implement and improve reclaim for the stream-ordered allocator (#983) (@maleadt) - Bump GPUCompiler to fix WMMA test issue. (#985) (@maleadt) - Rework memoization (#986) (@maleadt) - Fixes for CUBLAS/CUDNN logging (#987) (@maleadt) - Perform type conversions in at-atomic. (#990) (@maleadt) - Don't initialize the API when setting log callbacks. (#992) (@maleadt) - Create a helper for lazy, thread-safe initialization. (#993) (@maleadt) - Optimize library handles (#996) (@maleadt) - Optimize PerDevice for abstract element types. (#997) (@maleadt) - Update manifest (#999) (@github-actions[bot]) - Replace PerDevice with context-keyed dictionaries. (#1000) (@maleadt) - Improve launch latency (#1002) (@maleadt)
- Julia
Published by github-actions[bot] almost 5 years ago
CUDA - v3.3.0
CUDA v3.3.0
Closed issues:
- PTX code missing DWARF debug information (#72)
- Suggestion - Disable AbstractArray indexing fallback by default (#178)
- Support isbits Union Arrays (#103)
- Missing norm(x, p) kernel (#84)
- CUDA enhanced compatibility (#832)
- Support for CuSparseMatrixCSC{Float16} x CuVector{Float16} (#849)
- CuArray to zeroth power returns Matrix (#897)
- Fatal errors during sorting tests (#916)
- Error when computing reductions into a view with reduce_blocks > 1 (#919)
- CUDA FFT plan application runs Out of Memory in Pluto (#926)
- has_cuda() errors in CPU-only environments on master (#928)
- Race condition when computing mean! of large arrays? (#929)
- Supporting union bits types (#934)
- test failing in device/intrinsics (#942)
- Memory allocation fails for multi-GPU (#943)
- Scalar operations when using output of cu(::OffsetArray) (#954)
- Quicksort kernel does not cope with reduced threads (#955)
- CUDA.jl cannot find installed CUPTI libraries with local installation on linux (#956)
- Error for complex sparse-dense Matrix-vector multiplication (#958)
- "using CUDA" gives error in type inference of Ref{Bool} (#965)
Merged pull requests:
- Override outlined throw functions. (#874) (@maleadt)
- Enable location and debug info. (#891) (@maleadt)
- Compile using the toolkit, not the driver. (#892) (@maleadt)
- Rework timings (#898) (@maleadt)
- Fix #849, allow CUSPARSE to use F16 (#904) (@kshyatt)
- Add Windows CI. (#907) (@maleadt)
- Split test for better parallelization. (#908) (@maleadt)
- Update manifest (#909) (@github-actions[bot])
- Improve package latency. (#910) (@maleadt)
- Just some missing tests for CUBLAS (#911) (@kshyatt)
- Fix bug and add tests for iamax/iamin (#913) (@kshyatt)
- Fix profiler initialization and exception handling. (#914) (@maleadt)
- Add a show method for devices(). (#915) (@maleadt)
- Fix update of CUFFT handle. (#921) (@maleadt)
- Update manifest (#922) (@github-actions[bot])
- Reinstate compatibility with Kepler GPUs. (#923) (@maleadt)
- Use multiple GPUs on CI when available. (#924) (@maleadt)
- Fix two-step mapreduce with wrapped output. (#925) (@maleadt)
- Eagerly free the CUFFT workspace when generating a new one. (#927) (@maleadt)
- Fix CUDA.functional without throwing. (#930) (@maleadt)
- Fix the REPL synchronization hook. (#931) (@maleadt)
- Re-initialize the random seed every time. (#932) (@maleadt)
- Protect against race in iterating compute processes. (#933) (@maleadt)
- Helper function to get the device given a cu ptr. (#935) (@akashkgarg)
- Implement CUDA's Enhanced Compatibility when selecting a toolkit. (#936) (@maleadt)
- Update manifest (#939) (@github-actions[bot])
- Re-introduce specialization of cufunction. (#940) (@maleadt)
- Support isbits union element types with CuArray. (#941) (@maleadt)
- Try generating code with unreachable control flow. (#944) (@maleadt)
- Upgrade to CUDA 11.3 Update 1. (#945) (@maleadt)
- Always use exit instead of trap. (#947) (@maleadt)
- Select devices without NVML. (#948) (@maleadt)
- Fixes for Julia 1.7. (#949) (@maleadt)
- Query the CUBLAS version without requiring a handle. (#951) (@maleadt)
- Improve CUBLAS and CUDNN logging. (#953) (@maleadt)
- Update manifest (#957) (@github-actions[bot])
- Enable sorting with reduced block sizes (#959) (@xaellison)
- Adapt to GPUCompiler changes, bump GPUArrays. (#963) (@maleadt)
- Adapt to change in allowscalar. (#964) (@maleadt)
- Don't disable the CUDNN log callback on Windows. (#966) (@maleadt)
- Use released dependencies. (#968) (@maleadt)
- Julia
Published by github-actions[bot] almost 5 years ago
CUDA - v3.2.1
CUDA v3.2.1
Closed issues:
- adding constant to an array: performance regression compared to CUDAdrv (#838)
- CUDA.abs() on vector input: performance regression compared to CUDAdrv (#839)
- CUDA.@sync seems to be using a lot of CPU while waiting (#893)
- Memory leaks with repeated use of fft of a CUDA Array (#894)
- CUDA.jl v3.2 seems to download wrong version of CUDNN and CUTENSOR (#899)
Merged pull requests: - Rework synchronization: first spin, then yield, and finally block. (#896) (@maleadt) - Make cusolvermg really optional. (#900) (@maleadt) - Rebuild artifacts. (#901) (@maleadt) - Take back control over the CUFFT work area. (#902) (@maleadt)
- Julia
Published by github-actions[bot] about 5 years ago
CUDA - v3.2.0
CUDA v3.2.0
Closed issues: - Explore CUDA graph API (#65) - Runtime functions are missing debug information (#53) - Native RNGs do not pass SmallCrush (#803) - Remaining threads/FFT/mult-gpu error (#876)
Merged pull requests: - Add wrappers for the CUDA graph API. (#877) (@maleadt) - Use the profiler API to start capture. (#878) (@maleadt) - Duplicate RNG state across block to avoid need for synchronization (#879) (@maleadt) - Support for printing tuples. (#880) (@maleadt) - Support unsigned inputs to integer intrinsics. (#881) (@maleadt) - Switch to Philox2x32 for device-side RNG (#882) (@maleadt) - Update manifest (#884) (@github-actions[bot]) - Treat CartesianIndices in views as scalars. (#886) (@maleadt) - Robustly get variables from the environment during init. (#887) (@maleadt) - Move Statistics functionality to GPUArrays. (#888) (@maleadt) - Update artifacts and use sources from unified JLLs. (#889) (@maleadt) - Lazy initialization of CUDNN and CUTENSOR (#890) (@maleadt) - Update manifest (#895) (@github-actions[bot])
- Julia
Published by github-actions[bot] about 5 years ago
CUDA - v3.1.0
CUDA v3.1.0
Closed issues: - GPU Implementation of partialsort! (#93) - Document associativity requirements of scan/reduce operators (#819) - Problem in reduce_block? (#843) - CUDNN convolution incorrect for small images (#848) - Newly-spawned tasks should re-set the device (#851) - sort!(CUDA.zeros(2^25)) throws invalid configuration argument (code 9, cudaErrorInvalidConfiguration) (#852) - Type-preserving upload about cu in doc may be wrong (#855) - Memory corruption / segfault with Threads.@async and planned FFTs (#859) - Don't call nvmlErrorString (during init?) to prevent crashes on WSL (#860) - unsafe_copy3d! does not work with stream-ordered allocations (#863) - CUDA3 seems to have memory leak (#866)
Merged pull requests: - Implement statistics functions: correlation and covariance (#509) (@berquist) - @atomic support * and / (#842) (@yuehhua) - CUDNN docstring revisions. (#844) (@GunnarFarneback) - Sorting perf (again) (#845) (@xaellison) - Update manifest (#846) (@github-actions[bot]) - Remove extraneous apostrophe (#847) (@kshyatt) - reduce_block fixes. (#853) (@maleadt) - Fix sorting large arrays. (#854) (@maleadt) - Remove unsupported config launch keyword. (#856) (@maleadt) - Identify the buffer during unsafe_wrap to support unified free. (#857) (@maleadt) - Add support for CUDA 11.3. (#858) (@maleadt) - Work around buggy NVML initialization on WSL (#861) (@maleadt) - ae/partialsort (#864) (@xaellison) - Update manifest (#865) (@github-actions[bot]) - Improve multitasking with CUFFT. (#867) (@maleadt) - Introduce a HandleCache type. (#868) (@maleadt) - Improve multitasking with CURAND (#869) (@maleadt) - Document associativity requirement of accumulate (#870) (@HenriDeh) - Half-Precision Intrinsics (#871) (@iyaja) - Work around offset calculation bug in cuMemcpy3DAsync. (#872) (@maleadt) - fix #848: CUDNN convolution incorrect for small images (#873) (@denizyuret)
- Julia
Published by github-actions[bot] about 5 years ago
CUDA - v3.0.2
CUDA v3.0.2
Closed issues: - REPL display happens in different task, breaking synchronization (#831) - map! function raises an InvalidIRError (#833) - Compile error on shfl_down_sync (#834) - Error broadcasting some Base intrinsics (e.g. sqrt) over complex (but not real) CuArray (#836)
Merged pull requests: - Add an integration benchmark. (#835) (@maleadt) - Synchronize REPL expressions before returning. (#837) (@maleadt)
- Julia
Published by github-actions[bot] about 5 years ago
CUDA - v3.0.1
CUDA v3.0.1
Closed issues: - Sort overwriting values in target array (#822)
Merged pull requests: - sort bugfix (#823) (@xaellison) - Update manifest (#824) (@github-actions[bot]) - Test validation of GPU-only function. (#827) (@maleadt)
- Julia
Published by github-actions[bot] about 5 years ago
CUDA - v3.0.0
CUDA v3.0.0
Closed issues: - Driver crashes when running tests (#136) - dilation=0 causes CUDNN_STATUS_BAD_PARAM (#122) - CUBLASXT test errors (#112) - Program used external function '__nv_powi' which could not be resolved! (#109) - Wrapping Thrust for sorting (#107) - Prevent CUDA.@cufunc to transform the type's Int parameter to Int32. (#420) - CUBLASError: the GPU program failed to execute (#447) - Julia crashes on windows when using CUDA together with a system image (#479) - Consider link-time optimization (#505) - Segmentation fault on exiting Julia (#533) - Heisenbug in NNlib.conv! with nonzero beta (#736) - Benchmark suite segfaults on PRs (#794) - Performance issue with Pluto.jl (#815) - missing kernel for partialsort (#817)
Merged pull requests: - Improvements to try and fix benchmark OOMs (#809) (@maleadt) - Use the function type, not its instance, as the *Kernel typevar. (#816) (@maleadt) - More OOM fixes for the stream-ordered allocator (#818) (@maleadt) - Make reduce_warp exception free (#820) (@vchuravy) - Remove unused Adapt rule. (#821) (@maleadt)
- Julia
Published by github-actions[bot] about 5 years ago
CUDA - v2.4.3
CUDA v2.4.3
Closed issues:
- Cannot select __powidf2 while lowering powi.f64 (#76)
- Cannot select fpow while lowering pow.f32 (#71)
- Any chance x^.a, for (a> 2.0) will be supported at some point? (#171)
- support for vecnorm (#169)
- Accidentally calling GPU intrinsics on the host causes segfaults (#60)
- Partial support for Dual Numbers (#140)
- Bug involving broadcasting, sqrt, and Flux (#130)
- Add support for sincos, cis? (#42)
- at-extalloc should not try/catch (#99)
- Linalg support for non-contiguous views of arrays (#96)
- Segfault during handle finalization (#95)
- Error in julia': double free or corruption (out): 0x00007ffee6bd8d00 (#88)
- Kernel launch overhead regression from 1.7.3 (#80)
- Thread safety issue with free (#595)
- Bounds checking very slow with @views (#597)
- unspecified launch failure (code 719, ERROR_LAUNCH_FAILED) (#606)
- cusolver errors during `]test CUDA` (#616)
- argmax (i.e. findmax) fails with Bool arrays (#659)
- Set random seed is extremely slow (#685)
- Modulo operator is actually the remainder operator (#748)
- reinterpret on cuDynamicSharedMem throws ERROR_ILLEGAL_ADDRESS (#752)
- sampler randbinomial! for generating binomially distributed CuArrays directly on the GPU (#767)
- compute-sanitizer out-of-bounds failure (#780)
- Could not load the CUDA 11.1.0 artifact (#784)
- precompile error "ci_cache not defined" (#787)
- Conditional element-wise assignment still won't compile (#789)
- performance regression compared to CUDAdrv (#799)
- Slow contiguous view() on a CuArray (#802)
- ConvTranspose with negative padding fails on GPU (#810)
- Performance issue with Pluto.jl (#815)
- missing kernel for partialsort (#817)
Merged pull requests: - cuSolverMg wrappers (#308) (@kshyatt) - Use contextual dispatch for device functions. (#750) (@maleadt) - Implement reinterpret on CuDeviceArray (#755) (@tkf) - Quicksort performance update (#762) (@xaellison) - Add a basic implementation of rand() for use inside kernels (#772) (@S-D-R) - Deduplicate launch code. (#773) (@maleadt) - typo fixes - oliver (#774) (@kw-fn) - CompatHelper: add new compat entry for "SpecialFunctions" at version "1.3" (#775) (@github-actions[bot]) - Additional uses of contextual dispatch (#776) (@maleadt) - Use more inner constructors to ensure handle validity. (#777) (@maleadt) - Don't free memory asynchronously from finalizers. (#778) (@maleadt) - Set exception flag asynchronously. (#779) (@maleadt) - Debug memory pinning. (#781) (@maleadt) - Additional fixes for finalization when using the stream-ordered allocator (#782) (@maleadt) - Use legacy streams from finalizers to block on other streams (#783) (@maleadt) - Simplify thread/task state management. (#785) (@maleadt) - CompatHelper: add new compat entry for "RandomNumbers" at version "1.4" (#786) (@github-actions[bot]) - Speed-up rand: Tausworthe RNG with shared random state. (#788) (@maleadt) - Update manifest (#791) (@github-actions[bot]) - Optimize array construction and allocation (#792) (@maleadt) - Don't use the stream-ordered memory pool for PR benchmarks. (#795) (@maleadt) - Update introduction.jl (#796) (@Satvik) - Various improvements (#801) (@maleadt) - Speed-up view boundscheck. (#804) (@maleadt) - Don't use mod from libdevice. (#805) (@maleadt) - fix bounds check for reverse kwarg and add tests (#806) (@kshyatt) - Update manifest (#808) (@github-actions[bot]) - Improvements to try and fix benchmark OOMs (#809) (@maleadt) - Fix detection of need-for-cudadevrt. (#811) (@maleadt) - Backport #811 (#813) (@maleadt) - Use the function type, not its instance, as the *Kernel typevar. (#816) (@maleadt)
Published by github-actions[bot] about 5 years ago
CUDA - v2.6.3
CUDA v2.6.3
Closed issues:
- Cannot select __powidf2 while lowering powi.f64 (#76)
- Cannot select fpow while lowering pow.f32 (#71)
- Any chance x^.a, for (a> 2.0) will be supported at some point? (#171)
- support for vecnorm (#169)
- Accidentally calling GPU intrinsics on the host causes segfaults (#60)
- Partial support for Dual Numbers (#140)
- Bug involving broadcasting, sqrt, and Flux (#130)
- Add support for sincos, cis? (#42)
- at-extalloc should not try/catch (#99)
- Linalg support for non-contiguous views of arrays (#96)
- Segfault during handle finalization (#95)
- Error in julia': double free or corruption (out): 0x00007ffee6bd8d00 (#88)
- Kernel launch overhead regression from 1.7.3 (#80)
- Thread safety issue with free (#595)
- Bounds checking very slow with @views (#597)
- unspecified launch failure (code 719, ERROR_LAUNCH_FAILED) (#606)
- cusolver errors during `]test CUDA` (#616)
- argmax (i.e. findmax) fails with Bool arrays (#659)
- Set random seed is extremely slow (#685)
- Modulo operator is actually the remainder operator (#748)
- reinterpret on cuDynamicSharedMem throws ERROR_ILLEGAL_ADDRESS (#752)
- sampler randbinomial! for generating binomially distributed CuArrays directly on the GPU (#767)
- compute-sanitizer out-of-bounds failure (#780)
- Could not load the CUDA 11.1.0 artifact (#784)
- precompile error "ci_cache not defined" (#787)
- Conditional element-wise assignment still won't compile (#789)
- performance regression compared to CUDAdrv (#799)
- Slow contiguous view() on a CuArray (#802)
- ConvTranspose with negative padding fails on GPU (#810)
Merged pull requests: - cuSolverMg wrappers (#308) (@kshyatt) - Use contextual dispatch for device functions. (#750) (@maleadt) - Implement reinterpret on CuDeviceArray (#755) (@tkf) - Quicksort performance update (#762) (@xaellison) - Add a basic implementation of rand() for use inside kernels (#772) (@S-D-R) - Deduplicate launch code. (#773) (@maleadt) - typo fixes - oliver (#774) (@kw-fn) - CompatHelper: add new compat entry for "SpecialFunctions" at version "1.3" (#775) (@github-actions[bot]) - Additional uses of contextual dispatch (#776) (@maleadt) - Use more inner constructors to ensure handle validity. (#777) (@maleadt) - Don't free memory asynchronously from finalizers. (#778) (@maleadt) - Set exception flag asynchronously. (#779) (@maleadt) - Debug memory pinning. (#781) (@maleadt) - Additional fixes for finalization when using the stream-ordered allocator (#782) (@maleadt) - Use legacy streams from finalizers to block on other streams (#783) (@maleadt) - Simplify thread/task state management. (#785) (@maleadt) - CompatHelper: add new compat entry for "RandomNumbers" at version "1.4" (#786) (@github-actions[bot]) - Speed-up rand: Tausworthe RNG with shared random state. (#788) (@maleadt) - Update manifest (#791) (@github-actions[bot]) - Optimize array construction and allocation (#792) (@maleadt) - Don't use the stream-ordered memory pool for PR benchmarks. (#795) (@maleadt) - Update introduction.jl (#796) (@Satvik) - Various improvements (#801) (@maleadt) - Speed-up view boundscheck. (#804) (@maleadt) - Don't use mod from libdevice. (#805) (@maleadt) - fix bounds check for reverse kwarg and add tests (#806) (@kshyatt) - Update manifest (#808) (@github-actions[bot]) - Fix detection of need-for-cudadevrt. (#811) (@maleadt) - Backport #811 (#813) (@maleadt)
Published by github-actions[bot] about 5 years ago
CUDA - v2.4.2
CUDA v2.4.2
Closed issues:
- High allocations and getindex (#150)
- ResNet spending much time in CuArrays GC (#149)
- Broadcast inference failure results in scalar iteration (#145)
- Allocator very slow to reclaim memory after running for sufficiently long (#137)
- Assignment using logical indexing (#131)
- CUDNN convolution allocates outside of the memory pool (#111)
- Logical indexing per-dim (#106)
- Threading-related assertion failure in split allocator (#97)
- dims support for softmax (#226)
- Memory pinning needs more features (#242)
- External allocations fail under high memory pressure (#340)
- Incomplete CUDNN wrappers (#343)
- softmax(x) and logsoftmax(x) update their arguments (#592)
- Freeing large buffers takes a while (#594)
- softmax has problem with dim parameter (#599)
- CUDA 11.2 (#601)
- gemmEx on sm_52 results in CUBLAS_STATUS_ARCH_MISMATCH (#609)
- could not load cublas64_11.dll (#670)
- LLVM not found (#681)
- about the document of conditional use (#689)
- GPU run out of memory if 2 workers use the same GPU (#692)
- CURAND handles are collected early (#699)
- cudnnConvolutionForward fails memory checking (#702)
- Deadlock during OOM (#706)
- Segfault during trampoline allocation when querying occupancy from multiple threads (#707)
- Ballot intrinsics should use .sync variety (#711)
- cfunction $shmemcint use after free (#713)
- OOM when evaluating a small resnet (with both Flux and Knet) (#714)
- Support CUDA 11.2 Update 1 (#715)
- Base.mapreducedim returns wrong answer with non-zero target array (#720)
- CUBLAS_STATUS_ARCH_MISMATCH (#722)
- Test failures on linux (#727)
- Switching devices causes GC errors (#731)
- Pin CPU buffers when doing memory copies (#735)
- Memory free error with CUDA 11.2 and multi threads/GPUs (#737)
- Per-device memory pool (#742)
- Could not load library cudnn_ops_infer64_8.dll (#757)
- CUDA.lgamma(x) crashes Julia (#758)
Merged pull requests:
- New high level interface for cuDNN (#523) (@denizyuret)
- bilinear upsampling (#636) (@maxfreu)
- Automatic task-based concurrency using local streams (#662) (@maleadt)
- Fix version lookups. (#671) (@maleadt)
- add beta keyword to conv (#672) (@jw3126)
- Update manifest (#673) (@github-actions[bot])
- Protect the kernel closure from GC collection. (#674) (@maleadt)
- Track external globals, use it to avoid needless exception flags (#675) (@maleadt)
- Adapt to GPUCompiler changes. (#676) (@maleadt)
- Minor improvements (#677) (@maleadt)
- CompatHelper: add new compat entry for "Memoize" at version "0.4" (#678) (@github-actions[bot])
- Use CUDA 11.2's stream-ordered allocator (#679) (@maleadt)
- Support an additional nvdisasm version. (#680) (@maleadt)
- Add fast getristridedbatch (#682) (@cfranken)
- Use released GPUCompiler. (#683) (@maleadt)
- v2.6.1 (#684) (@maleadt)
- Fix race during multi-threaded init. (#687) (@maleadt)
- Update manifest (#690) (@github-actions[bot])
- Change to Buildkite v1 plugins. (#691) (@maleadt)
- CUPTI improvements for multithreading (#693) (@maleadt)
- Fix exception flag linkage for linking. (#694) (@maleadt)
- Update manifest (#695) (@github-actions[bot])
- Use simpler try/catch in show(CuError). (#696) (@maleadt)
- fix bug in CURAND.jl's setstream function. (#698) (@norci)
- Update CUDNN to 8.1. (#701) (@maleadt)
- Remove special-cased algorithm selection for CUDNN convolution (#703) (@denizyuret)
- Keep track of active handles to avoid early collection. (#704) (@maleadt)
- Backports for Julia 1.5 (#705) (@maleadt)
- Support for cushow-ing multiple values, including LLVMPtrs. (#709) (@maleadt)
- Make CUDNN tests eagerly invoke at-test for better error reporting. (#710) (@maleadt)
- Report JIT error log with linker errors. (#712) (@maleadt)
- Update manifest (#716) (@github-actions[bot])
- Flip exceptionflag filter! predicate (#717) (@S-D-R)
- Keep some memory reserved for external allocations. (#718) (@maleadt)
- Upgrade CUDA 11.2 to Update 1. (#719) (@maleadt)
- Add Abstract FFT compat (#721) (@DhairyaLGandhi)
- Add support for and switch test to warp-synchronous vote intrinsics. (#723) (@maleadt)
- Specialize Base.to_index for AnyCuArray{Bool} (#724) (@pabloferz)
- Update manifest (#728) (@github-actions[bot])
- Eagerly dlopen cublasLt to prevent a system library getting picked up. (#729) (@maleadt)
- Switch tests over to compute-sanitizer. (#730) (@maleadt)
- Perform pool operations in the correct context. (#732) (@maleadt)
- Streamline use of retryreclaim. (#733) (@maleadt)
- Threading fixes (#734) (@maleadt)
- copied the old rnn.jl->rnncompat.jl for Flux compatibility (#738) (@denizyuret)
- fix testmode batchnorm back (#739) (@CarloLucibello)
- Don't error out if failing to parse the local CUDA version. (#740) (@maleadt)
- Backport #739 (#741) (@maleadt)
- Add back an older artifact for CUDNN on PPC with CUDA 10.2. (#743) (@maleadt)
- Use the default memory pool. (#745) (@maleadt)
- Use a memory pool per device. (#746) (@maleadt)
- Test sort with at-test at the toplevel, for better reporting. (#749) (@maleadt)
- remove NNlib (#753) (@CarloLucibello)
- Update manifest (#754) (@github-actions[bot])
- Fix cuda-memcheck, don't use memory pools. (#756) (@maleadt)
- Update generated wrappers (#759) (@maleadt)
- Rework memory pinning and speed up async ops on unpinned memory (#760) (@maleadt)
- Improve context switching (#761) (@maleadt)
- Update manifest (#765) (@github-actions[bot])
- Docs on multitasking (#766) (@maleadt)
- Update to CUDA 11.2 Update 2. (#768) (@maleadt)
- Small backports for CUDA 2.4 / Julia 1.5 (#770) (@maleadt)
Published by github-actions[bot] about 5 years ago
CUDA - v2.6.2
CUDA v2.6.2
Closed issues:
- High allocations and getindex (#150)
- ResNet spending much time in CuArrays GC (#149)
- Broadcast inference failure results in scalar iteration (#145)
- Allocator very slow to reclaim memory after running for sufficiently long (#137)
- Assignment using logical indexing (#131)
- CUDNN convolution allocates outside of the memory pool (#111)
- Logical indexing per-dim (#106)
- Threading-related assertion failure in split allocator (#97)
- dims support for softmax (#226)
- Memory pinning needs more features (#242)
- External allocations fail under high memory pressure (#340)
- Incomplete CUDNN wrappers (#343)
- softmax(x) and logsoftmax(x) update their arguments (#592)
- Freeing large buffers takes a while (#594)
- softmax has problem with dim parameter (#599)
- gemmEx on sm_52 results in CUBLAS_STATUS_ARCH_MISMATCH (#609)
- about the document of conditional use (#689)
- GPU run out of memory if 2 workers use the same GPU (#692)
- CURAND handles are collected early (#699)
- cudnnConvolutionForward fails memory checking (#702)
- Deadlock during OOM (#706)
- Segfault during trampoline allocation when querying occupancy from multiple threads (#707)
- Ballot intrinsics should use .sync variety (#711)
- cfunction $shmemcint use after free (#713)
- OOM when evaluating a small resnet (with both Flux and Knet) (#714)
- Support CUDA 11.2 Update 1 (#715)
- Base.mapreducedim returns wrong answer with non-zero target array (#720)
- CUBLAS_STATUS_ARCH_MISMATCH (#722)
- Test failures on linux (#727)
- Switching devices causes GC errors (#731)
- Pin CPU buffers when doing memory copies (#735)
- Memory free error with CUDA 11.2 and multi threads/GPUs (#737)
- Per-device memory pool (#742)
- Could not load library cudnn_ops_infer64_8.dll (#757)
- CUDA.lgamma(x) crashes Julia (#758)
Merged pull requests:
- New high level interface for cuDNN (#523) (@denizyuret)
- bilinear upsampling (#636) (@maxfreu)
- Use CUDA 11.2's stream-ordered allocator (#679) (@maleadt)
- Support an additional nvdisasm version. (#680) (@maleadt)
- Add fast getristridedbatch (#682) (@cfranken)
- Fix race during multi-threaded init. (#687) (@maleadt)
- Update manifest (#690) (@github-actions[bot])
- Change to Buildkite v1 plugins. (#691) (@maleadt)
- CUPTI improvements for multithreading (#693) (@maleadt)
- Fix exception flag linkage for linking. (#694) (@maleadt)
- Update manifest (#695) (@github-actions[bot])
- Use simpler try/catch in show(CuError). (#696) (@maleadt)
- fix bug in CURAND.jl's setstream function. (#698) (@norci)
- Update CUDNN to 8.1. (#701) (@maleadt)
- Remove special-cased algorithm selection for CUDNN convolution (#703) (@denizyuret)
- Keep track of active handles to avoid early collection. (#704) (@maleadt)
- Backports for Julia 1.5 (#705) (@maleadt)
- Support for cushow-ing multiple values, including LLVMPtrs. (#709) (@maleadt)
- Make CUDNN tests eagerly invoke at-test for better error reporting. (#710) (@maleadt)
- Report JIT error log with linker errors. (#712) (@maleadt)
- Update manifest (#716) (@github-actions[bot])
- Flip exceptionflag filter! predicate (#717) (@S-D-R)
- Keep some memory reserved for external allocations. (#718) (@maleadt)
- Upgrade CUDA 11.2 to Update 1. (#719) (@maleadt)
- Add Abstract FFT compat (#721) (@DhairyaLGandhi)
- Add support for and switch test to warp-synchronous vote intrinsics. (#723) (@maleadt)
- Specialize Base.to_index for AnyCuArray{Bool} (#724) (@pabloferz)
- Update manifest (#728) (@github-actions[bot])
- Eagerly dlopen cublasLt to prevent a system library getting picked up. (#729) (@maleadt)
- Switch tests over to compute-sanitizer. (#730) (@maleadt)
- Perform pool operations in the correct context. (#732) (@maleadt)
- Streamline use of retryreclaim. (#733) (@maleadt)
- Threading fixes (#734) (@maleadt)
- copied the old rnn.jl->rnncompat.jl for Flux compatibility (#738) (@denizyuret)
- fix testmode batchnorm back (#739) (@CarloLucibello)
- Don't error out if failing to parse the local CUDA version. (#740) (@maleadt)
- Backport #739 (#741) (@maleadt)
- Add back an older artifact for CUDNN on PPC with CUDA 10.2. (#743) (@maleadt)
- Use the default memory pool. (#745) (@maleadt)
- Use a memory pool per device. (#746) (@maleadt)
- Test sort with at-test at the toplevel, for better reporting. (#749) (@maleadt)
- remove NNlib (#753) (@CarloLucibello)
- Update manifest (#754) (@github-actions[bot])
- Fix cuda-memcheck, don't use memory pools. (#756) (@maleadt)
- Update generated wrappers (#759) (@maleadt)
- Rework memory pinning and speed up async ops on unpinned memory (#760) (@maleadt)
- Improve context switching (#761) (@maleadt)
- Update manifest (#765) (@github-actions[bot])
- Docs on multitasking (#766) (@maleadt)
- Update to CUDA 11.2 Update 2. (#768) (@maleadt)
- Small backports for CUDA 2.4 / Julia 1.5 (#770) (@maleadt)
- Backports for CUDA 2.6 / Julia 1.6 (#771) (@maleadt)
Published by github-actions[bot] about 5 years ago
CUDA - v2.6.1
CUDA v2.6.1
Closed issues: - CUDA 11.2 (#601) - LLVM not found (#681)
Merged pull requests: - Automatic task-based concurrency using local streams (#662) (@maleadt) - add beta keyword to conv (#672) (@jw3126) - Protect the kernel closure from GC collection. (#674) (@maleadt) - Track external globals, use it to avoid needless exception flags (#675) (@maleadt) - Adapt to GPUCompiler changes. (#676) (@maleadt) - Minor improvements (#677) (@maleadt) - CompatHelper: add new compat entry for "Memoize" at version "0.4" (#678) (@github-actions[bot]) - Use released GPUCompiler. (#683) (@maleadt) - v2.6.1 (#684) (@maleadt)
Published by github-actions[bot] over 5 years ago
CUDA - v2.6.0
CUDA v2.6.0
Closed issues: - Invalid results due to shared memory + multiple function exits (?) mysteriously solved by @cuprintf (#43) - NVML-related segfault on Windows (#610) - @cuda with config keyword sometimes allocate lots of memory (#643) - Can someone with push access run the TagBot workflow? (#644) - Taking gradient with Flux results in NaNs when using CUDA arrays but not when using CPU arrays (#657) - Broadcasting fails in a special case (#658) - view causes KeyError in alias (#661) - PTXCompilerTarget error when creating a CuArray with Float64 (#664) - Complex dot product performance of CuArrays and of StructArrays of CuArrays (#667) - could not load cublas64_11.dll (#670)
Merged pull requests: - CUDA quicksort (#431) (@xaellison) - Bump Reexport to 1.0 (#640) (@DhairyaLGandhi) - Use newer NVML initialization method. (#641) (@maleadt) - README: add some information on viewing capabilities of your devices (#642) (@DilumAluthge) - Remove duplicate functions. (#645) (@maleadt) - Use released version of Adapt.jl (#646) (@maleadt) - Simplify list of tests to skip. (#647) (@maleadt) - Use a test-specific Project.toml. (#648) (@maleadt) - Use raw output for CUBLAS log message. (#649) (@maleadt) - Close the async condition used to call host functions. (#650) (@maleadt) - Backports for Julia 1.5 / CUDA 2.4 (#651) (@maleadt) - Allow running benchmarks outside of the master branch on other systems. (#652) (@maleadt) - Bump GPUCompiler. (#653) (@maleadt) - Reuse the compiler when generating SASS code. (#654) (@maleadt) - Run the tests from the current directory. (#655) (@maleadt) - Configure the PTX GPUCompiler codegen quirks. (#656) (@maleadt) - Update manifest (#660) (@github-actions[bot]) - Support view on unmanaged arrays. (#663) (@maleadt) - Retry CuModule creation when OOM. (#665) (@maleadt) - Make fill async. (#669) (@maleadt) - Fix version lookups. (#671) (@maleadt) - Update manifest (#673) (@github-actions[bot])
Published by github-actions[bot] over 5 years ago
CUDA - v2.4.1
CUDA v2.4.1
Closed issues:
- cudaconvert for closures (#67)
- Invalid results due to shared memory + multiple function exits (?) mysteriously solved by @cuprintf (#43)
- NVML-related segfault on Windows (#610)
- Update Reexport compat (#629)
- Incomplete CUDA device attributes list (#637)
- @cuda with config keyword sometimes allocate lots of memory (#643)
- Can someone with push access run the TagBot workflow? (#644)
- Taking gradient with Flux results in NaNs when using CUDA arrays but not when using CPU arrays (#657)
- Broadcasting fails in a special case (#658)
- view causes KeyError in alias (#661)
- PTXCompilerTarget error when creating a CuArray with Float64 (#664)
- Complex dot product performance of CuArrays and of StructArrays of CuArrays (#667)
Merged pull requests:
- CUDA quicksort (#431) (@xaellison)
- cudaconvert captured values in closures. (#625) (@maleadt)
- CompatHelper: only instantiate /Manifest.toml (the manifest file in the root of the repository) (#631) (@DilumAluthge)
- CompatHelper: bump compat for "Reexport" to "1.0" (#633) (@github-actions[bot])
- CompatHelper: bump compat for "AbstractFFTs" to "1.0" (#634) (@github-actions[bot])
- Update wrappers (#638) (@maleadt)
- Bump artifacts for Windows/Julia 1.6 compatibility. (#639) (@maleadt)
- Bump Reexport to 1.0 (#640) (@DhairyaLGandhi)
- Use newer NVML initialization method. (#641) (@maleadt)
- README: add some information on viewing capabilities of your devices (#642) (@DilumAluthge)
- Remove duplicate functions. (#645) (@maleadt)
- Use released version of Adapt.jl (#646) (@maleadt)
- Simplify list of tests to skip. (#647) (@maleadt)
- Use a test-specific Project.toml. (#648) (@maleadt)
- Use raw output for CUBLAS log message. (#649) (@maleadt)
- Close the async condition used to call host functions. (#650) (@maleadt)
- Backports for Julia 1.5 / CUDA 2.4 (#651) (@maleadt)
- Allow running benchmarks outside of the master branch on other systems. (#652) (@maleadt)
- Bump GPUCompiler. (#653) (@maleadt)
- Reuse the compiler when generating SASS code. (#654) (@maleadt)
- Run the tests from the current directory. (#655) (@maleadt)
- Configure the PTX GPUCompiler codegen quirks. (#656) (@maleadt)
- Update manifest (#660) (@github-actions[bot])
- Support view on unmanaged arrays. (#663) (@maleadt)
- Retry CuModule creation when OOM. (#665) (@maleadt)
- Make fill async. (#669) (@maleadt)
Published by github-actions[bot] over 5 years ago
CUDA - v2.4.0
CUDA v2.4.0
Closed issues:
- cublasXtStrmm test failures on Windows 10 Julia 1.1 (#124)
- CUSPARSE tests broken (#259)
- Make @cuda return a kernel object (#341)
- Depend on CompilerSupportLibraries (#359)
- CUBLAS and exceptions test failures on Windows (#536)
- argmax(::CuArray) returns nothing with NaN-values (#553)
- Multiple @cuDynamicSharedMem in kernel causes unexpected behavior (#555)
- Illegal memory access with atomic shared memory (#558)
- CUDA.sqrt will not found symbol "__nv_sqrt" (#559)
- Exception with CUDA.exp (#561)
- Use LazyArtifacts instead of Pkg (#570)
- Test runner: early bail out (#578)
- memory reporting issue (#579)
- c[3:4]=0 leads to exception (#580)
- Add math ops (including broadcast) for half types (#581)
- Dot product of Array and CuArray fails with CPU address error. (#586)
- Support for CUDA-capable GPU with compute capability 4.0 like GTX 1080 (#587)
- mapreducedim! not threadsafe (#588)
- Allow separate directories for cuda and cudnn (#590)
- Difficulties installing CUDA on Julia 1.6.0. (#591)
- Bug in Initialisation Error (#603)
- CUDA.jl initialisation fails after suspending Ubuntu 20.04 with CUDA 11.2 (#605)
- CUDA 11.2 CUBLASError and "CUDA.jl does not yet support CUDA with nvdisasm 11.2.67" (#607)
- This intrinsic must be compiled to be called (#611)
- OpenGL interop (#612)
- Add support for CuFFT callback functions (#614)
- I can’t multiply a CSR sparse matrix anymore (#615)
- Julia version requirement (#619)
Merged pull requests:
- Support all combinations of datatypes and transposes/adjoints in LinearAlgebra (#535) (@cqql)
- Use structs for texture intrinsic return types. (#554) (@maleadt)
- Backport some 1.6 fixes (#557) (@maleadt)
- Update manifest (#560) (@github-actions[bot])
- Correct dims error (#562) (@DhairyaLGandhi)
- Lock _shmem_cb (#564) (@vchuravy)
- Move to Julia 1.6 (#566) (@maleadt)
- Adapt to JuliaLang/julia#38487. (#568) (@maleadt)
- Support for 'delayed kernels' (#569) (@maleadt)
- Run cuda-memcheck as part of CI (#571) (@maleadt)
- Use at-sync instead of calls to synchronize in tests. (#572) (@maleadt)
- Update artifacts to include cuda-memcheck (#573) (@maleadt)
- Use LazyArtifacts instead of Pkg. (#574) (@maleadt)
- Improve LinearAlgebra impl methods for triangular types (#575) (@maleadt)
- New findmin/max implementation using single-pass reduction (#576) (@maleadt)
- Fix synchronization before testing cublasXt calls. (#577) (@maleadt)
- Fix used memory reporting. (#582) (@maleadt)
- Implement Statistics.varm/stdm instead of Statistics._var (#583) (@sdewaele)
- Test for #558. (#584) (@maleadt)
- Add a quick failure option to the test runner. (#585) (@maleadt)
- Add lock around cfunction lookup (#589) (@vchuravy)
- Catch all initialization errors. (#593) (@maleadt)
- Update dependencies. (#596) (@maleadt)
- Fix wrong initialisation error message (#604) (@qin-yu)
- Fixes wrong spacing in docstring admonition (#608) (@navidcy)
- Fix broadcasting with Base.angle (#618) (@marius311)
- Test with the 1.6 nightly, not 1.7. (#620) (@maleadt)
- Wrap cudaGL.h (#621) (@maleadt)
- Initial compatibility with CUDA 11.2. (#622) (@maleadt)
- 1.5 compatibility release (#623) (@maleadt)
- Add CUDA 11.2 artifacts. (#624) (@maleadt)
Published by github-actions[bot] over 5 years ago
CUDA - v2.3.0
CUDA v2.3.0
Closed issues:
- Misaligned address on load from Const (#548)
Merged pull requests:
- Allow PermutedDimsArray in gemm_strided_batched (#539) (@mcabbott)
- Fix broken checkbounds for CuSparseMatrixCSR and tests (#545) (@achuchmala)
- Emphasize rebooting option. (#547) (@xanfus)
- fix address calculation for ldg (#549) (@vchuravy)
- Don't use explicit per-stream threads. (#551) (@maleadt)
Published by github-actions[bot] over 5 years ago
CUDA - v2.2.0
CUDA v2.2.0
Closed issues: - cudnn missing after downloading artifact (#521) - Downloading artifact: CUDA110 when using DiffEqFlux (#542)
Merged pull requests: - Update manifest (#520) (@github-actions[bot]) - Try out Buildkite. (#522) (@maleadt) - Update manifest (#529) (@github-actions[bot]) - Support for / Upgrade to CUDA 11.1 update 1. (#530) (@maleadt) - Fix and test svd! (#531) (@maleadt) - Move more CI to Buildkite. (#532) (@maleadt) - Use type symbols to generate wrapper methods (#534) (@cqql) - Fully move to Buildkite. (#537) (@maleadt) - Add unit_diag option for sv2! functions (#540) (@amontoison) - Documentation fixes (#543) (@maleadt)
Published by github-actions[bot] over 5 years ago
CUDA - v2.1.0
CUDA v2.1.0
Closed issues:
- CUDNN convolution with Float16 always returns zeros (#92)
- axp(b)y! and mul! (scalar multiplication) with mixed argument types (#144)
- Dispatching to generic matmul instead of CUBLAS (#164)
- Support for Ints and Float16? (#165)
- Subarrays/views support (#172)
- Easy way to pick among multiple GPUs (#174)
- More prominently document JULIA_CUDA_USE_BINARYBUILDER (#204)
- ERROR_COOPERATIVE_LAUNCH_TOO_LARGE during tests (#247)
- Pkg.test error for cutensor test on Windows (#422)
- Runtime build improvements (#456)
- Fusing Wrappers (#467)
- Could not find nvToolsExt (libnvToolsExt.dylib.1.0 or libnvToolsExt.dylib.1) in /Users/imac/.julia/artifacts/b502baf54095dff4a69fd6aba8667124583f6929/lib (#482)
- mapreduce assumes commutative op (#484)
- SubArray Broadcast Bug in 2.0 (#488)
- Nested SubArray Scalar Indexing (#490)
- Sparse matrix * view(vector) regression in 2.0 (#493)
- Error transforming a reshaped 0-dimensional GPU array to a CPU array (#494)
- test cuda FAILURE (#496)
- Reshaped CuArray is not DenseCuArray (#511)
- assignment failure when using array slicing. (#516)
Merged pull requests: - Use the correct CUDNN scaling parameter type. (#454) (@maleadt) - Fix versioned dylib discovery. (#486) (@maleadt) - Move inv from GPUArrays. (#487) (@maleadt) - Use dense array types in sparse wrappers. (#495) (@maleadt) - Update manifest (#497) (@github-actions[bot]) - Revert array wrapper union changes (#498) (@maleadt) - Clean-up pointer field. (#499) (@maleadt) - mapreduce: change iteration for compatibility with non-commutative operators. (#500) (@maleadt) - Use versioned libcuda (#502) (@maleadt) - Dynamically choose versioned libcuda (#503) (@mustafaquraish) - Update multigpu.md (#504) (@efmanu) - Upgrade artifacts for CUDA 11 compatibility. (#506) (@maleadt) - Update dependencies. (#507) (@maleadt) - Convert unsigned short ints to Cint for printf. (#508) (@maleadt) - Update manifest (#510) (@github-actions[bot]) - Fix reshape with missing dimensions. (#512) (@maleadt) - Don't return a pointer from 'alias'. (#513) (@maleadt) - Add some docs (#514) (@maleadt) - Fix CUDNN-optimized activation broadcasts (#515) (@maleadt) - Fix cooperative launch test. (#517) (@maleadt) - Fixes for Windows (#518) (@maleadt) - CUTENSOR fixes on Windows (#519) (@maleadt)
Published by github-actions[bot] over 5 years ago
CUDA - v2.0.2
CUDA v2.0.2
Closed issues:
- cu() behavior for complex floating point numbers (#91)
- Error when following example on using multiple GPUs on multiple processes (#468)
- MacOS without nvidia GPU is trying to download CUDA111 on julia nightly (#469)
- Drop BinaryProvider? (#474)
- Latest version of master doesn't work on Windows (#477)
- sum(CUDA.rand(3,3)) broken (#480)
- copyto!() between cpu and gpu with subarrays (#491)
Merged pull requests: - Adapt to GPUCompiler changes. (#458) (@maleadt) - Fix initialization of global state (#471) (@maleadt) - Remove 'view' implementation. (#472) (@maleadt) - Workaround new artifact"" eagerness that prevents loading on unsupported platforms (#473) (@ianshmean) - Remove BinaryProvider dep. (#475) (@maleadt) - typo: libcuda.dll -> libcuda.so on Linux (#476) (@Alexander-Barth) - NFC array simplifications. (#481) (@maleadt) - Update manifest (#485) (@github-actions[bot]) - Convert AbstractArray{ComplexF64} to CuArray{ComplexF32} by default (#489) (@pabloferz)
Published by github-actions[bot] over 5 years ago
CUDA - v2.0.1
CUDA v2.0.1
Closed issues: - Can't update (#462)
Merged pull requests: - Remove duplicate comment (#464) (@blegat) - Add functionality to precompile the runtime library. (#465) (@maleadt) - Update manifest (#470) (@github-actions[bot])
Published by github-actions[bot] over 5 years ago
CUDA - v2.0.0
CUDA v2.0.0
Closed issues:
- Test failure during threading tests (#15)
- Bad allocations in memory pool after device_reset! (#16)
- CuArrays can lose Blas on reshaped views (#78)
- allowscalar performance (#87)
- Indexing with a CuArrays causes a 'scalar indexing disallowed' error from checkbounds (#90)
- 5-arg mul! for CUSPARSE (#98)
- copyto!(Device, Host) uses scalar iteration in case of type mismatch (#105)
- Array primitives broken for CUSPARSE arrays (#113)
- SplittingPool: CPU allocations (#117)
- error while concatenating to an empty CuArray (#139)
- Showing sparse arrays goes wrong (#146)
- Improve test coverage (#147)
- CuArrays allocates a lot of memory on the default GPU (#153)
- [Feature Request] Indexing CuArray with CuArray (#155)
- Reshaping CuArray throws error during backpropagation (#162)
- Match syntax and APIs against Julia 1.0 standard libraries (#163)
- CURAND_STATUS_PREEXISTING_FAILURE when setting seed multiple times. (#212)
- RFC: converts SparseMatrixCSC to CuSparseMatrixCSR via cu by default (#216)
- Add a CuSparseMatrixCOO type (#220)
- Test runner stumbles over path separators (#236)
- Error: Invalid bitcode signature when loading CUDA.jl after precompilation (#293)
- Atomic operations only work on global memory (#311)
- Performance: cudnn algorithm selection (#318)
- CUSPARSE is broken in CUDA.jl 1.2 (#322)
- Device-side broadcast regression on 1.5 (#350)
- API for fast math-like mode (#354)
- CUDA 11.0 Update 1: cublasSetWorkspace (#365)
- Can't precompile CUDA.jl on Kubuntu 20.04 (#396)
- CuPtr should be Ptr in cudnnGetDropoutDescriptor (#397)
- CUDA throws OOM error when initializing API on multiple devices (#398)
- Cannot launch kernel with > 5 args using Dynamic Parallelism (#401)
- Reverse performance regression (#410)
- Tag for LLVM 3? (#412)
- CUDA not working (#415)
- StatsBase.transform fails on CuArray (#426)
- Further unification of CUBLAS.axpy! and LinearAlgebra.BLAS.axpy! (#432)
- size(range), length(range) and range[end] fail inside CUDA kernels (#434)
- InitError: Cannot use memory pool 'binned' when CUDA.jl was precompiled for memory pool 'split'. (#446)
- Missing dispatch for matrix multiplication with views? (#448)
- New version not available yet? (#452)
- using CUDA or CUArray, output: UndefVarError: AddrSpacePtr not defined (#457)
- Unable to upgrade to the latest version (#459)
Merged pull requests:
- Performance improvements by calling cuDNN API (#321) (@gartangh)
- Use ccall wrapper for correct pointer type conversions (#392) (@maleadt)
- Simplify Statistics.var and fix dims=tuple. (#393) (@maleadt)
- Adapt to GPUArrays test change. (#394) (@maleadt)
- Default to per-thread stream semantics (#395) (@maleadt)
- Add a missing context argument for stateless codegen. (#399) (@maleadt)
- Keep track of package latency timings. (#400) (@maleadt)
- Update manifest (#402) (@github-actions[bot])
- Latency improvements (#403) (@maleadt)
- Fix bounds checking with GPU views. (#404) (@maleadt)
- Force specialization for dynamic_cudacall to support more arguments. (#407) (@maleadt)
- Fix some wrong pointer types in the CUDNN headers. (#408) (@maleadt)
- Refactor CUSPARSE (#409) (@maleadt)
- Fix typo (#411) (@yixingfu)
- Update manifest (#413) (@github-actions[bot])
- Simplify library wrappers by introducing a CUDA Ref (#414) (@maleadt)
- Simplify and update wrappers (#416) (@maleadt)
- GEMM improvements (#417) (@maleadt)
- CompatHelper: add new compat entry for "BFloat16s" at version "0.1" (#418) (@github-actions[bot])
- add CuSparseMatrixCOO (#421) (@marius311)
- Update manifest (#423) (@github-actions[bot])
- Global math mode for easy use of lower-precision functionality (#424) (@maleadt)
- Improve init error message (#425) (@maleadt)
- CUBLAS: wrap rot! to implement rotate! and reflect! (#427) (@maleadt)
- CUFFT-related optimizations (#428) (@maleadt)
- Fix reverse/view regression (#429) (@maleadt)
- Update packages (#433) (@maleadt)
- Introduce StridedCuArray (#435) (@maleadt)
- Retry curandGenerateSeeds when OOM. (#436) (@maleadt)
- Introduce DenseCuArray union (#437) (@maleadt)
- Array simplifications (#438) (@maleadt)
- Fix and test reverse on wrapped array. (#439) (@maleadt)
- Fixes after recent array wrapper changes (#441) (@maleadt)
- Adapt to GPUArrays changes. (#442) (@maleadt)
- Provide CUBLAS with a pool-backed workspace. (#443) (@maleadt)
- Fix finalization of copied arrays. (#444) (@maleadt)
- Support for/Add CUDA 11.1 (#445) (@maleadt)
- Update manifest (#449) (@github-actions[bot])
- Allow use of strided vectors with mul! (gemv! and gemm!) (#450) (@maleadt)
- Have convert call CuSparseArray's constructors. (#451) (@maleadt)
Published by github-actions[bot] over 5 years ago
CUDA - v1.3.3
CUDA v1.3.3
Closed issues: - Type changing Array conversions give error when allowscalar(false) (#344) - getindex(::CuArray, ::Adjoint, ::Colon) fails (#345) - View with array indices causes memory copy before broadcast (#384) - Regression with Julia 1.5 (#390)
Merged pull requests: - Replace DevicePtr with Core.LLVMPtr. (#199) (@maleadt) - Make sure view indices reside on the GPU too. (#388) (@maleadt) - CompatHelper: Update DataStructures to v0.18 (#389) (@ChrisRackauckas)
Published by github-actions[bot] almost 6 years ago