Recent Releases of elpa
elpa - ELPA 2023.11.001 release
- enable gpu-streams by default for NVIDIA and AMD GPUs
- Updated / improved documentation and man pages
- Fixed compilation error on AMD GPUs
- Fixed SVE 256 compute kernels
- Allow (currently in parts of ELPA) the use of NVIDIA NCCL for device-to-device communication
- Speed up of GPU version of hermitian_multiply by up to a factor of 4
- significantly faster full-to-tridiagonal step in ELPA 1stage GPU
- significantly faster ELPA 2stage solver on Intel GPUs
- Consistent enabling/disabling of SKEW_SYMMETRIC in header files
- new setup_gpu API function
- Fortran
Published by marekandreas about 2 years ago
elpa - ELPA 2023.05.001
- added CITATION.cff file
- allow test programs to be run with 1 MPI task
- correct a memory leak in the gpu stream setup
- better handling of GPU BLAS handles
- implement the execution of the AMD HIP code path on NVIDIA GPUs
- implement the execution of the SYCL GPU code path on CPUs (debugging)
- port generalized routines to SYCL GPU
- PoC to use NVIDIA NCCL instead of MPI (not production ready)
- some cleanup of documentation
- Fortran
Published by marekandreas over 2 years ago
elpa - ELPA 2023.05.001.rc1
- added CITATION.cff file
- allow test programs to be run with 1 MPI task
- correct a memory leak in the gpu stream setup
- better handling of GPU BLAS handles
- implement the execution of the AMD HIP code path on NVIDIA GPUs
- implement the execution of the SYCL GPU code path on CPUs (debugging)
- port generalized routines to SYCL GPU
- PoC to use NVIDIA NCCL instead of MPI (not production ready)
- some cleanup of documentation
- Fortran
Published by marekandreas almost 3 years ago
elpa - ELPA_2016.05_release
- fix problem with generated *.sh- check scripts
- name library differently if built without MPI support
- install only public modules
- support building without MPI for one node usage
- doxygen and man pages documentation for ELPA
- cleanup of documentation
- introduction of SSE gcc intrinsic kernels
- Remove errors due to unaligned memory
- removal of Fortran "contains functions"
- Fortran interfaces for assembly and C kernel
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2016.05.003_release
- fix a problem with the build of SSE kernels
- make some (internal) functions public, such that they can be used outside of ELPA
- add documentation and interfaces for new public functions
- shorten file names and directory names for test programs in order to bypass the "make argument list too long" error
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2016.05.004_release
- fix a problem with the private state of module precision
- distribute test_project with dist tarball
- generic driver routine for ELPA 1stage and 2stage
- test case for elpamultatbreal
- test case for elpamultahbcomplex
- test case for elpacholeskyreal
- test case for elpacholeskycomplex
- test case for elpainverttrm_real
- test case for elpainverttrm_complex
- fix building of static library
- better choice of AVX, AVX2, AVX512 kernels
- make assumed size Fortran arrays default
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2017.05.001_release
- faster GPU implementation, especially for ELPA 1stage
- the restriction of the block-cyclic distribution blocksize = 128 in the GPU case is relaxed
- Faster CPU implementation due to better blocking
- support of already banded matrices (new API only!)
- improved KNL support
- add missing script "manual_cpp"
- cleanup of code
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2017.05.002_release
Mainly bugfixes for ELPA 2017.05.001:
- fix memory leak of MPI communicators
- tests for hermitian_multiply, cholesky decomposition and
- deal with a problem on Debian (mawk)
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2017.05.003_release
- remove bug in invert_triangular, which had been introduced in ELPA 2017.05.002
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2017.11.001_release
- significant improvement of performance of GPU version
- added new compute kernels for IBM Power8 and Fujitsu Sparc64 processors
- a first implementation of autotuning capability
- correct some type statements in Fortran
- correct detection of PAPI in configure step
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2018.05.001_release
- significantly improved performance on K-computer
- added interface for the generalized eigenvalue problem
- extended autotuning functionality
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2018.11.001_release
- improved autotuning
- improved performance of generalized problem via Cannon's algorithm
- checkpointing functionality of elpa objects
- store/read/resume of autotuning
- Python interface for ELPA
- more ELPA functions have an optional error argument (Fortran) or required error argument (C) => ABI and API change
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2019.05.002_release
- repacking of the src since the legacy interface has been forgotten in the 2019.05.001 release
- elpaprintkernels supports GPU usage
- fix an error if PAPI measurements are activated
- new simple real kernels: block4 and block6
- C functions can be built with optional arguments if the compiler supports it (configure option)
- allow measurements with the likwid tool
- users can define the default-kernel at build time
- ELPA versioning number is provided in the C header files
- as announced a year ago, the following deprecated routines have finally been removed; see DEPRECATEDFEATURES for the replacement routines, which were introduced a year ago. Removed routines: multatbreal, multahbcomplex, inverttrmreal, inverttrmcomplex, choleskyreal, choleskycomplex, solvetridi
- new kernels for ARM aarch64 added
- fix an out-of-bound-error in elpa2
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2019.11.001_release
- solve a bug when using parallel make builds
- check the cpuid set during build time
- add experimental feature "heterogenous-cluster-support"
- add experimental feature for 64bit integer BLAS/LAPACK/SCALAPACK support
- add experimental feature for 64bit integer MPI support
- support of ELPA for real valued skew-symmetric matrices, please cite: https://arxiv.org/abs/1912.04062
- cleanup of the GPU version
- bugfix in the OpenMP version
- bugfix on the Power8/9 kernels
- bugfix on ARM aarch64 FMA kernels
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2021.05.002_release
- no feature changes
- correct the SO version which was wrong in ELPA 2021.05.001
- allow the user to set the mapping of MPI tasks to GPU id per set/get
- experimental feature: port to AMD GPUs; works correctly, performance not yet clear; only tested with --with-mpi=0
- On request, ELPA can print the pinning of MPI tasks and OpenMP threads
- support for FUGAKU: some minor issues still have to be fixed due to compiler issues
- BUG FIX: if matrix is already banded, check whether bandwidth >= 2. DO NOT ALLOW a bandwidth = 1, since this would imply that the input matrix is already diagonal which the ELPA algorithms do not support
- BUG FIX in internal test programs: do not consider a residual of 0.0 to be an error
- support for skew-symmetric matrices now enabled by default
- BUG FIX in generalized case: in setups like "mpiexec -np 4 ./validaterealdoublegeneralized1stage_random 90 90 45"
- ELPASETUPS now checks (in case of MPI runs) whether the user-provided BLACSGRID is reasonable (i.e. ELPA no longer relies on the user checking, prior to calling ELPA, whether the BLACSGRID is OK); if this check fails, ELPA returns with an error
- limit number of OpenMP threads to one, if MPI thread level is not at least MPI_THREAD_SERIALIZED
- allow checking of the supported threading level of the MPI library at build time
- Fortran
Published by marekandreas over 3 years ago
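Two of the rules in the ELPA 2021.05.002 notes above (the OpenMP thread limit and the bandwidth check for already banded matrices) can be read as plain decision logic. The following is a minimal Python sketch of that logic, not ELPA code: the function names, the numeric encoding of the MPI thread levels, and the exception type are assumptions made purely for illustration. Only the ordering of the thread levels is taken from the MPI standard.

```python
from enum import IntEnum


class MpiThreadLevel(IntEnum):
    """MPI thread support levels in increasing order (numeric values are illustrative)."""
    SINGLE = 0
    FUNNELED = 1
    SERIALIZED = 2
    MULTIPLE = 3


def effective_omp_threads(requested: int, provided: MpiThreadLevel) -> int:
    """Hypothetical helper: use a single OpenMP thread when the MPI library
    provides less than the SERIALIZED thread level."""
    if provided < MpiThreadLevel.SERIALIZED:
        return 1
    return requested


def check_bandwidth(bandwidth: int) -> None:
    """Hypothetical helper: reject an already banded input with bandwidth below 2."""
    if bandwidth < 2:
        raise ValueError("bandwidth must be >= 2 for an already banded matrix")
```

In this sketch a request for 8 threads is honoured only when the provided level is SERIALIZED or MULTIPLE, and a bandwidth of 1 is rejected before any solver step would run.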
elpa - ELPA_2020.05.001_release
- Enable compilation with gcc v10
- Fix a bug in elpamultiplya_b (GPU)
- improved documentation, including fixing of typos and errors in markdown
- Fix a bug in the calling of Cannon's algorithm which might lead to crashes for a square process grid
- improvements and bugfixes of the ELPA 2stage GPU version, see https://arxiv.org/abs/2002.10991
- bugfix for the build of AVX-512 KNL kernels
- clean separation of SIMD instructions for AVX and AVX2 kernels
- better error checking for allocations / deallocations of CPU and GPU memory
- experimental feature of matrix redistribution
- bugfix in the cpuid tests
- bugfix in elpa2printkernels
- bugfix when configuring --with-gpu-support-only
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2022.05.001_release
- implement OpenMP offloading to GPU for Intel GPU for ELPA 1 and 2 stage (except for "step triditoband")
- implement SYCL offloading to Intel GPUs for ELPA 1 and 2 stage
- AMD GPU offload has been tested on Mi200 (also with MPI)
- can use ELPA with one individual "gpu stream" per MPI task (Nvidia and AMD only)
- allow steps "cholesky", "inverttrm", and "multiplyab" to be called directly with GPU device pointers
- on error, ELPA returns rather than aborting, to give control to the calling application and to allow for error recovery and/or a graceful abort
- allow ELPA to build with OpenMP and GPU
- fix an FPE with the Intel compiler and AVX-512 instructions and optimization level > -O2
- better checking of user defined options in configure
- Fortran
Published by marekandreas over 3 years ago
elpa - ELPA_2021.11.002_release
- fix an error when choosing the Nvidia GPU kernel (fallback to CPU might have been selected)
- support of Nvidia cusolver library to accelerate some routines (needs CUDA >= 11.4)
- experimental Nvidia GPU versions for "elpainverttrm" and "elpacholesky" can be tested by setting elpaset("gpuinverttrm",1) and elpaset("gpucholesky",1). They are not used otherwise
- BUGFIX: error in resort_ev (also backported to 2021.05.002 and 2020.11.001)
- allow calling ELPA eigenvectors and eigenvalues with GPU device pointers for the input matrix, the vector of eigenvalues, and the output matrix for the eigenvectors
- EXPERIMENTAL feature: new real GPU kernel for Nvidia A100 (provided by Nvidia): can show a performance boost if the number of vectors per MPI task is > 20000. Most likely of most benefit in the non-MPI version
- as announced, dropping the legacy interface
- more autotuning features, for example using non-blocking MPI collectives
- new version of autotuning avoiding a combinatorial growth of possibilities (the old autotune version can still be used if elpa%autotunesetapiversion(APIVERSION, error) is set to API_VERSION < 20211125)
- Fortran
Published by marekandreas over 3 years ago
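The autotuning note above describes a version-gated switch: API versions below 20211125 keep the old autotuning behaviour, later ones get the new one. A minimal Python sketch of that gate follows. The function name, the constant name, and the returned labels are hypothetical; only the threshold value comes from the release note, and this is not the ELPA implementation.

```python
# Threshold quoted in the ELPA 2021.11.002 release note.
NEW_AUTOTUNE_API_VERSION = 20211125


def autotune_variant(api_version: int) -> str:
    """Hypothetical helper: pick the autotuning variant for a given API version."""
    if api_version < NEW_AUTOTUNE_API_VERSION:
        return "old"  # exhaustive search with combinatorial growth of possibilities
    return "new"      # reduced search introduced in this release
```

Gating on the version number lets applications that depend on the earlier behaviour keep it simply by requesting an older API version.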
elpa - ELPA_2020.11.001_release
- this release contains mostly bugfixes:
- fix determination whether a _ is needed to link Fortran to C
- fix an error in the real block4 kernel for arch64 NEON
- add missing testscalapacktemplate.F90 to EXTRA_DIST list
- fix error in the GPU kernel
- do not use MPI_COMM_WORLD but mpi_parent instead
- switch from python2 to python3
- experimental feature: complex kernels for arch64 NEON
- experimental feature: kernels for ARM SVE
- Fortran
Published by marekandreas about 5 years ago