Recent Releases of https://github.com/libxsmm/libxsmm

https://github.com/libxsmm/libxsmm - Version 1.17

This release ports the master/main build system back to the v1.16 code base. The necessary code changes have been minimized; however, since some non-trivial code changes were required, the release is labeled v1.17. The release became necessary due to the aging v1.16 line of code and the new compilers that have emerged since then. For example, issues like #562 arise when using, e.g., GNU GCC 10.x or 11.x.

Note: version v1.17 leverages the same code base as version v1.16.x. All new features, fixes, and development progress remain unreleased. As per LIBXSMM's policy to keep the master/main branch stable, one can rely on the latter to leverage new features, fixes, and development progress.

INTRODUCED * Validated with compilers released after the original v1.16 (GNU GCC 10.x, 11.x, and several Clang releases).

IMPROVEMENTS / CHANGES * Improved default for static code-paths using certain ISA extensions (no need to adjust INTRINSICS setting).


The build system controls several options, and the set of options has generally evolved since v1.16, which is the main reason for code changes. A positive side effect of more changes is thorough (re-)validation. This release was adjusted to LIBXSMM's evolved test environment (1.16.x cannot be revalidated). Code validation of v1.17 again reaches the level of the original v1.16 and further includes new compilers available since then.

- C
Published by hfp about 4 years ago

https://github.com/libxsmm/libxsmm - Version 1.16.3

This update promotes fixes from LIBXSMM's master/main branch and resolves two CVEs. Version 1.16.3 continues to leverage the same code base as versions 1.16.2 and 1.16.1. All new features, fixes, and development progress remain unreleased. As per LIBXSMM's policy to keep the master/main branch stable, one can rely on the latter to leverage new features, fixes, and development progress.

IMPROVEMENTS / CHANGES / FIXES * CVE-2021-39535 * CVE-2021-39536

- C
Published by hfp over 4 years ago

https://github.com/libxsmm/libxsmm - Version 1.16.2

This minor update resolves an issue where the OS installation (on a legacy system) does not signal support for saving the register state of contexts using instruction set extensions like SSE. The problem had already been resolved in LIBXSMM's main development branch a long time ago. The problem was discovered in certain Virtual Machine (VM) installations as well as on some OS installations (e.g., here).

INTRODUCED * New functionality and new features continue to remain with LIBXSMM's main revision (under development).

IMPROVEMENTS / CHANGES / FIXES * Adopt code-path even if OS does not properly signal support for an ISA extension.

Note: version 1.16.2 leverages the same code base as version 1.16.1 (except for a single line of code applying the above-mentioned fix). All new features, fixes, and development progress remain unreleased. As per LIBXSMM's policy to keep the master/main branch stable, one can rely on the latter to leverage new features, fixes, and development progress.

- C
Published by hfp over 4 years ago

https://github.com/libxsmm/libxsmm - Version 1.16.1

This (minor) release fixes the issues mentioned below as well as improves platform support.

THANK YOU to the Department of Chemistry at the University of Zurich for generously providing access to a Cray system.

IMPROVEMENTS / CHANGES * Muted compiler warnings caused by a separate OpenMP runtime (Clang based tool chains). * Sample code: prevent OpenBLAS' undefined type when including f77blas.h (issue)

FIXES * Fixed compilation and runtime issues with Clang-based Cray Compiler as well as Cray Classic Compiler. * Revised Fortran implementation of libxsmm_xdiff and removed _Bool dependency (issue).

- C
Published by hfp over 5 years ago

https://github.com/libxsmm/libxsmm - Version 1.16

This is a maintenance release which is meant to capture the project's continuous development in a stable release. A validated release allows our users to leverage several improvements and fixes (see below), especially in light of upcoming new features.

THANK YOU FOR YOUR CONTRIBUTION - your contribution matters! This project received several contributions, whether as a pull request, issue report, feature suggestion, or informal inquiry. We would like to thank you for the effort and time you spent on Open Source software!

INTRODUCED * Zero-config for all platforms with absolutely no configuration needed for header-only. Simplifies using Visual Studio as no up-front configuration or in-build custom steps are needed. Simplifies 3rd-party build systems incorporating LIBXSMM for both header-only and classic ABI. * Updated Hello LIBXSMM, and added code examples for C/C++ and Fortran, included minimal "support" for Bazel (request). The latter is not meant to change our Makefile based build setup but can rather help to get people started who prefer Bazel. * Fortran interface for user-data dispatch and a Fortran code sample using this interface to dispatch multiple kernels at once. The C interface was introduced earlier (v1.15). * Experimental: element-wise kernels with matrix elements (meltw), e.g., to scale, reduce, type-convert, etc.
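To illustrate the zero-config claim, below is a minimal sketch of using LIBXSMM from C: it JIT-dispatches a double-precision SMM kernel and calls it. The defaults assumed here (column-major layout, NULL arguments requesting default leading dimensions, alpha/beta, flags, and prefetch) follow the classic ABI; for header-only usage, libxsmm_source.h can be included instead of linking the library.

```c
#include <libxsmm.h>
#include <stdio.h>

int main(void) {
  const libxsmm_blasint m = 8, n = 8, k = 8;
  double a[8*8], b[8*8], c[8*8];
  libxsmm_blasint i;
  for (i = 0; i < m * k; ++i) a[i] = 1.0;
  for (i = 0; i < k * n; ++i) b[i] = 1.0;
  for (i = 0; i < m * n; ++i) c[i] = 0.0;
  {
    /* JIT-dispatch a double-precision kernel; NULL requests defaults
       (tight leading dimensions, alpha=1, beta=1, default flags/prefetch) */
    const libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(m, n, k,
      NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
      NULL/*alpha*/, NULL/*beta*/, NULL/*flags*/, NULL/*prefetch*/);
    if (NULL != kernel) {
      kernel(a, b, c); /* C += A * B (column-major) */
      printf("c[0] = %f\n", c[0]); /* expected: 8.0 */
    }
  }
  return 0;
}
```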

IMPROVEMENTS / CHANGES * Extended list of applications using LIBXSMM. Our documentation also lists applications among popular categories (at the bottom of the left-hand side menu). * Fixed performance bug in matcopy routine; added microbenchmarks. * Improved verbose output (watermarks, additional warnings). * Disabled memory wrapper at compile-time (opt-in only). * Fully moved to Python3 shebang (fallback to Python2). * Improved Fortran interface (overloads, etc.). * Further improved support for GNU GCC 10. * Extended sparse functionality.

FIXES * Avoid manipulating GNU's feature flags (improves header-only library). * Fixed detecting Intel VTune 2020 (SYM=1 with source'd profiler). * Consistently emit unaligned LD/ST (intrinsics based code).

- C
Published by hfp over 5 years ago

https://github.com/libxsmm/libxsmm - Version 1.15

Version 2.0 was anticipated to be our next release. With v1.15, the goal is to flawlessly upstream LIBXSMM into OS distributions that will soon start building packages with GNU GCC 10 (further details).

Beyond new compiler support, LIBXSMM received a slight but consistent performance improvement even for core functionality, namely SMM kernels including batch-reduce. The DNN domain received the most development and continues to deliver like a rolling release. The DNN backend broadened support for low/mixed-precision kernels and kernel fusion (batch-reduce plus X, as used by convolutional neural networks).

INTRODUCED * Small matrix multiplication and batch-reduce kernels are available for the following input types FP64, FP32, bfloat16, int16, and int8. Low-precision support exists in several type-combinations with respect to input and accumulation type leveraging AVX-512 extensions (VNNI and Bfloat16). * New C-APIs (Fortran to follow): (1) kernel introspection, takes kernel-function pointer, fills info-structure with FLOPS-count, code-size, and more; no search overhead, (2) register user-defined data with LIBXSMM's fast key-value database/query, e.g., to lower dispatch overhead for multiple kernels used in one task. * Fortran API: more flavors of certain generic procedures; can potentially avoid temporary values due to exact match (procedure overload). * Example vectorizing along finite elements (DGFEM) using LIBXSMM for sparse weight matrices. * Example showing sparse weight matrix multiplication (deep learning). * Reproducer for next-gen. CP2K/collocate implementation. * Module file generated during build (module av).

IMPROVEMENTS / CHANGES * Allow omitting the full configure step under Windows; improved Visual Studio build support. Note, the Windows calling convention is still pending but in the works. Necessary state is currently not call-preserved, which may or may not work (as a workaround it may help to use wrapper functions for LIBXSMM's kernels). * Dropped code generation for convolutions, which are now based on batch-reduce kernels, and revised the batch-reduce API to support (1) absolute addresses like in previous releases, (2) relative offsets/indexes, and (3) constant/identical offset/stride. * LIBXSMM/EXT: OpenMP support under macOS (w/ Apple's LLVM based compiler). * Entire code base of LIBXSMM uses SPDX-License-Identifier (BSD-3-Clause). * Verbose message about timer accuracy (virtualized platforms). * Generally improved verbosity (insight/detail, and accuracy). * New instructions supported in the backend. * Slightly lowered dispatch overhead. * NUMA-aware GxM framework.

FIXES * Issues #334, #347, #371, #368, and #369. * Zero defects as of Synopsys Coverity. * Rebuild issue (build system). * Library initialization.

- C
Published by hfp almost 6 years ago

https://github.com/libxsmm/libxsmm - Version 1.14

This release brings notable fixes and improvements (see below) prior to merging our reworked DL backend. This version is likely the last release of our 1.x series. For the upcoming major release of LIBXSMM, the API remains compatible for core functionality except for the DL domain. Even for the DL domain, there are only API adjustments rather than big changes (straightforward or minor).

THANK YOU FOR YOUR CONTRIBUTION: jewillco, yurivict, antoscha, breuera, jeremylt, HiSPEET, and legrosbuffle. We would like to thank all direct contributors as well as people who informally spent effort and time for this Open Source software!

INTRODUCED * Native PROCEDURE types for generic 3-/6-argument (arity) functions (Fortran interface). * Intercepted memory allocation for applications based on LIBXSMM's scratch memory. * LIBXSMM guarantees non-NULL kernels for valid requests since several versions. Empty shape requests are now considered valid (SMM, MCOPY, and TCOPY). * Getting Started section added to documentation ("Hello LIBXSMM").

IMPROVEMENTS / CHANGES * Termination statistic now distinguishes SMMs and degenerated SMMs (GEMV). * Support Immintrin-debug (https://github.com/intel/Immintrin-debug). * Emit warning if compiler support only enables low-resolution timers. * Support PGI Compiler based on GNU GCC settings; still some issues. * Generally enable ISA extensions even if not permitted by OS (XSAFE). * Enforce AVX-512 under OSX/iMac Pro (OSX: XSAFE/ZMM disabled). * VTUNE=0: disables profiler support (even if detected and SYM=1). * Memory info to handle foreign pointers (not allocated by library). * Scratch memory allocation: avoid unnecessary warning (verbose). * Improved scratch memory allocation statistics (watermark, etc.). * Implemented exit-handler for Fortran programs using STOP. * Avoid compiler warnings previously suppressed by flags. * Make: only permit matching static/shared library builds. * Accommodate Clang based compiler under Windows. * Improved RNG performance for very short sequences. * Updated Visual Studio projects and setup (VS2019). * Updated and revised documentation. * Updated articles and applications. * Contribution #355 incorporated. * Lowered dispatch overhead.

FIXES * Fixed issue (2019/02/24) dispatching compiler-generated code (affected SpMDM and DL). * Fixed casting literal -1 to an unsigned integer when 64-bits were intended. * Resolved issue related to structure alignment/padding/copy (CCE). * Potentially invalid kernel cache with concurrently finalized library. * Potentially treated non-OpenMP lock as OpenMP lock. * Avoid potentially recursive locking at termination. * Fixed potential hang with header-only. * Incorrect LDC for intercepted GEMV. * Issues fixed: #340 and #347.

- C
Published by hfp over 6 years ago

https://github.com/libxsmm/libxsmm - Version 1.13

This release delivers improvements made to the build system and internal structures. The main purpose is to continuously deliver a smooth build and run experience for the latest OS environments.

THANK YOU FOR YOUR CONTRIBUTION - your contribution matters! This project received direct (and indirect) contributions, whether as issue reports, feature suggestions, or involvement from people who came across the project. We would like to thank you all for the effort and time you spent on Open Source software!

IMPROVEMENTS / CHANGES * Fortran: enabled libxsmm_ptr* to eventually return C_NULL_PTR. * Avoid treating Spack environment as maintainer build (apply SSE4 flags). * Renamed structure-of-array (SOA) dense routines into "packed". * Internal preparation for upcoming features (memory allocation). * Improved build system (most recent OS environments).

FIXES * Precondition for working around missing _Float128 definition (#339). * Conceptually avoid accessing a zero-sized array (Fortran interface). * Corrected number of scratch-memory pools (LIBXSMM_VERBOSE).

- C
Published by hfp over 6 years ago

https://github.com/libxsmm/libxsmm - Version 1.12.1

This release fixes issues related to the prefix directory inside the pkg-config files, which affected maintainer builds (Linux and FreeBSD packages), the package manager Spack, and people using pkg-config to determine build/linker flags. In addition, some presets were added to smooth maintainer builds under FreeBSD.

IMPROVEMENTS / CHANGES * Building samples: detect Intel MKL (when installed by a package manager). * Improved build system under FreeBSD (detect BLAS library, etc).

FIXES * Issue #331, issue #333, issue #334, and spack/spack#11413. * OpenMP build issue in one of the code samples (GCC 9.1).

- C
Published by hfp almost 7 years ago

https://github.com/libxsmm/libxsmm - Version 1.12

This release aims to improve usability along with resolving several non-critical bugs. Beyond this, an implementation of the BLAS(-like) batched GEMM has been added (?GEMM_BATCH). The interface currently only supports the C/C++ language; however, it can be called implicitly (Fortran 77 style) or used by intercepting existing calls (static and dynamic linkage).

LIBXSMM has had an interface for batched GEMMs for several versions, supporting arrays of pointers as well as arrays of indexes plus byte-sized strides to extract data from arrays of structures (AoS). The new BLAS interface only supports straight arrays of pointers to operand matrices but allows multiple groups of homogeneous batches. All batch interfaces are implemented in sequential (ST) and multi-threaded (MT) form, plus synchronization in the MT case.
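As a hedged illustration of call interception (not LIBXSMM's own API), the plain BLAS call below can be redirected to LIBXSMM either by linking the static wrapper or by preloading the extension library at runtime; the program itself stays unchanged. The dgemm_ prototype shown is the usual Fortran-77 style symbol and is an assumption about the BLAS library in use.

```c
/* example.c: a plain BLAS call; when LIBXSMM's wrapper is linked statically
   or preloaded dynamically, small GEMMs like this one can be redirected to
   LIBXSMM's JIT kernels without changing the application. */
#include <stdio.h>

/* assumed Fortran-77 style BLAS prototype (column-major) */
void dgemm_(const char* transa, const char* transb,
            const int* m, const int* n, const int* k,
            const double* alpha, const double* a, const int* lda,
            const double* b, const int* ldb,
            const double* beta, double* c, const int* ldc);

int main(void) {
  const int m = 16, n = 16, k = 16;
  const double alpha = 1.0, beta = 0.0;
  static double a[16*16], b[16*16], c[16*16];
  int i;
  for (i = 0; i < m * k; ++i) a[i] = 1.0;
  for (i = 0; i < k * n; ++i) b[i] = 1.0;
  dgemm_("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);
  printf("c[0] = %f\n", c[0]); /* expected: 16.0 */
  return 0;
}
```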

INTRODUCED * Interface and implementation of batched GEMMs (GEMM_BATCH). * TensorFlow wrapper code for the LSTM operation. * Interceptor for GEMM_BATCH and GEMV.

IMPROVEMENTS / CHANGES * LSTM: enabled additional tensor formats for Bfloat16. * Validated with GNU GCC 9.1 release.

FIXES * Issue #331, issue #333, issue #334, and https://github.com/spack/spack/issues/11413 * Several other/minor fixes.

- C
Published by hfp almost 7 years ago

https://github.com/libxsmm/libxsmm - Version 1.11

This release accumulated more than 1200 changes since the last release and is a major preparation for our future v2 of the library. Besides stability improvements, refinements of existing functionality, and bug fixes, several pieces of new functionality were introduced: packed/compact data layout functions for solving linear equations, new flavors of SMM kernels along with relaxed limitations (transb), and overall support for low precision based on the Bfloat16 FP format.
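As a sketch of the relaxed TransB limitation, the following dispatches a kernel with a transposed B-operand via the GEMM flags argument; default leading dimensions, alpha, beta, and prefetch are assumed (NULL arguments).

```c
#include <libxsmm.h>

/* dispatch an SMM kernel with a transposed B-operand (TransB=T), which this
   release permits in addition to TransB=N */
libxsmm_dmmfunction dispatch_transb(libxsmm_blasint m, libxsmm_blasint n,
                                    libxsmm_blasint k)
{
  const int flags = LIBXSMM_GEMM_FLAG_TRANS_B;
  return libxsmm_dmmdispatch(m, n, k, NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
    NULL/*alpha*/, NULL/*beta*/, &flags, NULL/*prefetch*/);
}
```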

The Deep Learning (DL) domain is still under active research and development, including co-design. The API however is rather stable (DLv2 since v1.8), with an implementation that continues to receive major development. Towards LIBXSMM v2, the DL domain will undergo major code reduction (implementation) while providing the same or more functionality (a first sign is the removal of the Winograd code in this release).

THANK YOU FOR YOUR CONTRIBUTION - we had again several direct (and indirect) contributions, reports, and involvement from people who came across the project. We would like to thank you all for the effort and time you spent working on Open Source!

INTRODUCED * Packed function domain (compact data format) with GEMM, GETRF, TRMM, and TRSM functions. * Relaxed limitation of SMM kernels: TransB=T is now allowed (in addition to TransB=N). * Batch-reduce GEMM-kernel which is optimized for in-cache accumulation (Beta=1). * Included build setup in library (environment variable LIBXSMM_DUMP_BUILD=1). * CPU feature detection is updated for Cascadelake and Cooperlake (CLX and CPX). * Bfloat16 instruction support for Cooperlake (CPX). * Bfloat16 support for DL and SMM domain. * Fast RNGs for single-precision FP data.

IMPROVEMENTS * Cray Compiler (legacy and current versions) is supported with LIBXSMM's use of intrinsics, inline assembly, and CPUID detection, and therefore received major performance improvements. Previously, even JIT code was limited to AVX due to an unsupported CPUID flow. * Updated support for tensorflow::cpu_allocator for an API change in TensorFlow (v1.12.0 and beyond). * Guarantee JIT'ted function (non-NULL); see CHANGES about libxsmm_[get|set]_dispatch_trylock. * Call wrapper/interceptor (static/shared library) now always works, i.e., no special build required. * SpMDM/Bfloat16 interface to enable TensorFlow, which gained type-support for Bfloat16. * GxM framework updated for fused DL ops, Bfloat16, and a variety of new DL operators. * DL domain with LSTM and GRU cells, fully connected layer, and batch norm support. * Reduced unrolling and code size of transpose kernels (to fit the instruction cache). * Extended Fortran interface (matdiff, diff, hash, shuffle). * Purified some more routines (Fortran interface). * More statistical values (libxsmm_matdiff/info).

CHANGES * KNC support has been removed (maps to generic code). Offload infrastructure has been kept. * Winograd code has been removed from the DL domain (see also the introduction to this release). * Removed libxsmm_[get|set]_dispatch_trylock (demoted to a compile-time option). * Threshold criterion of libxsmm_gemm (optionally based on arithmetic intensity).

FIXES * Fixed corner case which eventually led to leaking memory (scratch). * Exhausted file handles (in ulimit'ed or restricted environments). * Fixed libxsmm_timer in case of lazy library initialization. * Flawed detection of restricted environments (SELinux). * Fixed buffer handling in case of incorrect input. * Fixed setup of AVX2 code path in SpMDM. * Ensure correct prefix in pkg-config files. * Guarantee JIT'ted function (non-NULL).

Note about platform support: an explicit compile error (error message) is generated on platforms other than Intel (or compatible) processors, since upstreamed code was reported to produce a "compilation failure". Apart from this introduced artificial error, any platform is supported with generic code (tested with an ARM cross-compiler). Of course, any Open Source contribution to add JIT support is welcome.

Note about binary compatibility: LIBXSMM's API for Small Matrix Multiplications (SMMs) is stable, and all major known applications (e.g., CP2K, EDGE, NEK5K, and SeisSol) either rely on SMMs or are able (and want) to benefit from an improved API of the other domains (e.g., DL). Until at least v2.0, binary compatibility is not maintained (SONAME version goes with the semantic version).

- C
Published by hfp almost 7 years ago

https://github.com/libxsmm/libxsmm - Version 1.10

Development accumulated many changes since the last release (v1.9) as this version (v1.10) kept slipping because validation was not able to keep up and started over several times. On the positive side, this may allow calling it the "Supercomputing 2018 Edition", which is complemented by an updated list of references including the SC'18 paper "Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures". Among several external articles, the Parallel Universe Magazine published "LIBXSMM: An Open Source-Based Inspiration for Hardware and Software Development at Intel".

The intense development of LIBXSMM brought many improvements and detailed features across domains as well as end-to-end support for Bfloat16 in LIBXSMM's Deep Learning domain (DL). The latter can already be exercised with the GxM framework, which was added to the collection of sample codes. Testing and validation were updated for the latest compilers and upcoming Linux distributions. FreeBSD is now formally supported (previously it was only tested occasionally). RPM, Debian, and FreeBSD package updates will benefit from the smoothed default build targets and compiler flags.

LIBXSMM supports "one build for all" while exploiting the existing instructions set extensions (CPUID based code-dispatch). Developers may enjoy support for pkg-config (.pc files in the lib folder) for easier linkage when using the Classic ABI (e.g., PKG_CONFIG_PATH=/path/to/libxsmm/lib pkg-config libxsmm --libs).

THANK YOU FOR YOUR CONTRIBUTION - we had several direct (and indirect) contributions, reports, and involvement from people who came across the project. We would like to thank you all for the effort and time you spent working on Open Source!

INTRODUCED * Removed the need to build LIBXSMM's static library in a special way for GEMM call-interception. * Moved some previously internal but generally useful code to the public interface (math etc.). * Initial support for handle-based "big" GEMM (revamped libxsmm_?gemm_omp). * Support transposed cases in libxsmm_?gemm_omp; not performance-competitive yet. * Code samples accompanying the article in the Parallel Universe magazine. * Fortran interface for some previously only C-exposed functions. * Support Intel C/C++ Compiler together with GNU Fortran. * Packed/SOA domain: expanded functionality (EDGE solver). * Deep Learning framework GxM (added as code sample). * RNNs, and LSTM/GRU-cell (driver code experimental). * End-to-end support for Bfloat16 (DL domain). * Fused batch-norm, and fully-connected layer. * Compact/packed TRSM kernels and interface. * Experimental TRMM code (no interface yet). * Support for pkg-config.

IMPROVEMENTS / CHANGES * Zero-mask unused register parts to avoid false positives with enabled FPEs (MM kernels). * Added libxsmm_ptr_x helper to the Fortran interface (works around a C_LOC portability issue). * Mapped TF low-precision to appropriate types, map unknowns to DATATYPE_UNSUPPORTED. * Build banner with platform name, info about Intel VTune (available but JIT-profiling disabled). * Smoothed code base for most recent compilers (incl. improved target attribution). * Official packages for Debian, and FreeBSD (incl. OpenMP in libxsmm/ext for BSD). * LIBXSMM_DUMP environment variable writes MHD-files if libxsmm_matdiff is called. * Warn when libxsmm_release_kernel is called for a registered kernel. * Consolidated Deep Learning sample codes into one folder. * Revised default for AVX=3 (MIC=0 is now implicitly set). * LIBXSMM_TARGET: more keys count for AVX512/Core. * Updated TF integration/documentation. * Included workarounds for flang (LLVM). * Attempt to enable OpenMP with Clang. * Install header-only form (make install). * SpMDM code dispatch for AVX2. * Improved CI/test infrastructure. * Show hint if compilation fails.

FIXES * Properly dispatch CRC32 instruction (support older CPUs). * Fixed fallback of statically generated MM kernels (rare). * Remove temporary files that were previously dangling. * Fixed termination message/statistic (code registry). * Fixed finalizing the library (corner case). * Fixed code portability of DNN domain.

- C
Published by hfp over 7 years ago

https://github.com/libxsmm/libxsmm - Version 1.9

This release enables JIT-code generation of small matrix multiplications for SSE3 targets. Previously, only AVX and beyond had been supported using JIT code. SSE JIT-code generation is only supported for the MM domain (matrix multiplication). The compatibility of the library has been further refined and fine-tuned. The application binary interface (ABI) narrowed from above 500 exported functions down to roughly half due to adjusted symbol visibility. This revision prepares for a smooth transition to v2.0 and internalizes low-level details (descriptor handling, etc.), and two deprecated functions have been removed. More prominently, prefetch enumerators have been renamed, e.g., LIBXSMM_PREFETCH_AL2 renamed to LIBXSMM_GEMM_PREFETCH_AL2.

INTRODUCED * ABI specification improved: exported functions are decorated for visibility/internal use (issue #205). * Math functions to eventually avoid a LIBM dependency, or to control specific requirements (libxsmm_math.h). * MM: enabled JIT-generation of SSE code for small matrix multiplications (BE and FE support). * MM: extended FE to handle multiple flavors of low-precision GEMMs (C and C++). * Detect maintainer build and avoid target flags (GCC toolchain, STATIC=0). * SMM: I16I32 and I16F32 WGEMM for SKX and future processors. * Hardening all builds by default (Linux package requirements).

IMPROVEMENTS / CHANGES * MM domain: renamed prefetch enumerators; kept "generic" names SIGONLY, NONE, and AUTO (FE). * Build system presents a final summary (similar to the initial summary); also mentions VTune (if enabled). * Adjusted TF scratch allocator to adopt the global rather than the context's allocator (limited memory). * Combined JIT-kernel samples with the respective higher-level samples (xgemm, transpose). * Enabled extra (even more pedantic) warnings, and adjusted the code base accordingly. * Adjusted Fortran samples for the PGI compiler (failed to deduce generic procedures). * Removed deprecated libxsmm_[create|release]_dgemm_descriptor functions. * Included validation and compatibility information into the PDF (Appendix). * MinGW: automatically apply certain compiler flags (workaround). * Internalized low-level descriptor setup (opaque type definitions). * Moved LIBXSMM_DNN_INTERNAL_API into the internal API. * Fixed dynamic linkage with CCE (Cray Compiler).

FIXES * Take prefetch requests in libxsmm_xmmdispatch (similar to libxsmm_[s|d|w]mmdispatch). * SpMM: prevent generating (unsupported) SP-kernels (incorrect condition). * Fixed code-gen. bug in GEMM/KNM, corrected K-check in WGEMM/KNM. * MinGW: correctly parse path of library requirements ("drive letter"). * Fixed VC projects to build DLLs if requested.

- C
Published by hfp almost 8 years ago

https://github.com/libxsmm/libxsmm - Version 1.8.3

Overview: while v1.9 is in the works, this release fixes two issues, and pushes for an improved (OSX w/ Intel Compiler) and wider OS/Compiler coverage (MinGW, BSD, see Compatibility). Among minor or exotic issues resolved in this release, the stand-alone JIT-generated matrix transposes (out-of-place) are now limited to matrix shapes such that only reasonable amounts of code are generated. There has been also a rare synchronization issue reproduced with CP2K/smp in LIBXSMM v1.8.1 (and likely earlier), which is resolved since the previous release (v1.8.2).

JIT code generation/dispatch performance: JIT-generating code (non-transposed GEMMs) is known to be blazingly fast, which this release (re-)confirms with the extended dispatch microbenchmark: single-threaded code generation (uncontended) of matrix kernels with M,N,K := 4...64 (equally distributed random numbers) takes less than 25 µs on typical systems; non-cached code dispatch takes less than 50x longer than calling a function that does nothing, whereas cached code dispatch takes less than 15x longer than an empty function (code dispatch is roughly three orders of magnitude faster than code generation, i.e., nanoseconds vs. microseconds).
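The dispatch microbenchmark can be approximated with LIBXSMM's own timer facility; the numbers printed by this sketch are, of course, system dependent, and the loop count and kernel shape below are arbitrary assumptions.

```c
#include <libxsmm.h>
#include <stdio.h>

int main(void) {
  const int nrepeat = 100000;
  const libxsmm_blasint m = 16, n = 16, k = 16;
  libxsmm_timer_tickint start;
  double duration;
  int i;
  libxsmm_init(); /* exclude lazy initialization from the measurement */
  /* the first dispatch triggers JIT code generation (microseconds) */
  start = libxsmm_timer_tick();
  (void)libxsmm_dmmdispatch(m, n, k, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
  duration = libxsmm_timer_duration(start, libxsmm_timer_tick());
  printf("code generation: %.0f us\n", 1e6 * duration);
  /* subsequent dispatches only query the code registry (nanoseconds) */
  start = libxsmm_timer_tick();
  for (i = 0; i < nrepeat; ++i) {
    (void)libxsmm_dmmdispatch(m, n, k, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
  }
  duration = libxsmm_timer_duration(start, libxsmm_timer_tick());
  printf("code dispatch: %.0f ns\n", 1e9 * duration / nrepeat);
  libxsmm_finalize();
  return 0;
}
```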

INTRODUCED * Support for mixing C and C++ code when using header-only based LIBXSMM. * Issue 202: reintroduced copy-update with LIBXSMM's install target (make). * Experimental: sketched Python support built into LIBXSMM (PYMOD=1).

IMPROVEMENTS / CHANGES * Completed revision of synchronization layer (started in v1.8.2); initial documentation. * Reduced TRACE output due to self-watching (internal) initialization/termination. * Wider OS validation incl. more exotic sets (MinGW in addition to Cygwin, BSD). * Prevent production code (non-debug) on 32-bit platforms (compilation error). * Increased test variety while staying within same turnaround time limit. * Continued to close implementation gaps (synchronization primitives). * Sparse SOA domain received fixes/improvements driven by EDGE. * More readable code snippets in documentation (reduced width). * Initial preparation for JIT-generating SSE code (disabled). * Improved detection of OpenBLAS library (Makefile.inc). * Updated (outdated) support for Intel Compiler (OSX). * Compliant soname under Linux and OSX.

FIXES * Fixed selection of statically generated code targeting Skylake server (SKX). * Sparse SOA domain: resolved issues pointed out by static analysis. * Fixed support for JIT-generated matrix transpose (code size). * Fixed selecting an incorrect prefetch strategy (BGEMM).

- C
Published by hfp about 8 years ago

https://github.com/libxsmm/libxsmm - Version 1.8.2

This last release of the 1.8.x line (before 1.9) accumulated a large number of changes to tweak interfaces and to generally improve usability. The documentation is vastly improved and extended, is more structured, and is also available via ReadtheDocs (with online full-text search). In preparation of a fully revised implementation of the DNN API (rewrite), the interface of the DNN domain (Tensor API) changed in an incompatible way (our policy should have delayed this to v1.9). However, the current main user of the DNN API has been updated (integration with TensorFlow). Also notable, v1.8.2 introduces JIT-code generation with the Windows calling convention (support limited to 4-argument kernels, i.e., no prefetch signature for the MM domain, and no support for DNN/convolution kernels).

INTRODUCED * Introduced kernel introspection/query API for registered code: full GEMM descriptor, and code size. * Introduced explicit batch interface (and an experimental auto-batch option); parallelized/sequential. * Introduced BGEMM interface for handle-based GEMM using an optimized format (copy-in/out). * More comprehensive sparse support (EDGE: Extreme Scale Fused Seismic Simulations). * More comprehensive collection of DNN test cases (DeepBench, ResNet50, etc.). * Implemented CI for the DNN domain, and infrastructure for validation (libxsmm_matdiff). * Support to schedule CI/tests into a Slurm based cluster environment (.travis.sh). * Introduced "make INTRINSICS=0" to allow building with outdated Binutils. * Generate preprocessor symbols for statically generated code (presence check). * Allow FORTRAN to access (static-)configuration values using the preprocessor. * FORTRAN 77 support for a much wider set of functionality (MM domain). * Introduced MHD file I/O to e.g., aid visual inspection and validation. * Cleaned up type-definitions and FE-macros (lower precision GEMM). * More comprehensive set of prefetch strategies (SMM domain). * Extended LIBXSMM_VERBOSE=2 to show library version, etc. * Wider use of QFMA across domains (MM, SpMM, DNN). * Updated application recipe for CP2K and TensorFlow. * Initial Eigen related code sample (batched SMMs). * CPUID for CPUs codenamed "Icelake".

CHANGES * Revised/unified API attribute decoration, and cleaned up the header-only header. * Removed script for regenerating documentation bits (README.sh); now only per make. * Changed matcopy kernels to have column-major semantics (similar to transpose). * Support const/non-const GEMM prototypes interfering with LIBXSMM's header-only. * Slightly revised and based all F2K3 interfaces on lower-level F77 (implicit) routines. * Incorporated/enabled new/additional instructions in the code generator (BE). * Reshuffled properties/sizes in the GEMM descriptor for future extensions. * Portable build-locks for improved turnaround time in parallel CI builds. * Comprehensive validation of the DNN domain (all major benchmarks). * Consistent use of libxsmm_blasint (libxsmm_dmmdispatch). * Revised error/warning messages (LIBXSMM_VERBOSE=1). * Initial support for some fused operations (DNN domain). * Removed support for small GEMM descriptors (BIG=0). * Removed libxsmm_timer_xtick (libxsmm_timer.h). * Improved turnaround time in Travis CI testing. * Thread-safe scratch memory allocation. * Support VS 2017 (startup script, etc.)

FIXES * Fixed potential issue with GEMM flags being incorrectly created (GEMM wrapper). * Several fixes for improved FORTRAN interface compatibility (optional arguments, etc.). * Disabled AVX-512 code generation with Intel Compiler 2013 (SP1 brings the req. bits). * Fixed code gen. issue with SOA sparse kernels; corrected precision of SOA sample code. * Fixed index calculation in tiled libxsmm_matcopy; updated test case accordingly. * Fixed a number of issues in several DNN code paths unveiled by better testing. * Several fixes in sparse SOA domain (unveiled by LIBXSMM's integration into PyFR). * Improved support for (legacy) Clang wrt AVX-512 code generation (intrinsics). * Ported bit-scan intrinsics abstraction to yield same result with all compilers. * Allow static code generation to target SKX and KNM (Makefile). * Fixed several code generation issues for SMMs on KNM.

- C
Published by hfp about 8 years ago

https://github.com/libxsmm/libxsmm - Version 1.8.1

This release brings some new features (matcopy/2d-copy and tcopy based on JIT-generated code) as well as a number of bug fixes (TGEMM), improvements (KNM), and refinements (LIBXSMM_GEMM_WRAP control, etc.). Given the completed copy/transpose support, this release prepares for complete stand-alone GEMM routines.
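A sketch of the JIT-backed out-of-place transpose is shown below; the argument order (out, in, typesize, m, n, ldi, ldo) and the EXIT_SUCCESS return convention are assumptions based on libxsmm.h of this era, so consult the header of the installed version.

```c
#include <libxsmm.h>
#include <stdlib.h>

/* out-of-place transpose of a column-major m-by-n matrix (double precision);
   LIBXSMM dispatches a JIT transpose kernel and tiles larger shapes */
void transpose_example(double* out, const double* in,
                       libxsmm_blasint m, libxsmm_blasint n,
                       libxsmm_blasint ldi, libxsmm_blasint ldo)
{
  if (EXIT_SUCCESS != libxsmm_otrans(out, in, sizeof(double), m, n, ldi, ldo)) {
    /* fallback: simple loop-based transpose */
    libxsmm_blasint i, j;
    for (j = 0; j < n; ++j) {
      for (i = 0; i < m; ++i) out[j+i*ldo] = in[i+j*ldi];
    }
  }
}
```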

INTRODUCED * Choice between tiled/small GEMM during call-interception (LIBXSMM_GEMM_WRAP=1|2). * Introduced JIT'ted transpose kernels including tiling for larger matrices. * Transpose routines now auto-dispatch JIT-kernels incl. auto-tuned tiles. * Introduced matcopy routines similar to the transpose routines (C/C++/F). * LIBXSMM_DNN_CONV_OPTION_OVERWRITE for faster initial forward convolution. * Implemented/documented named JIT routines in TF when using VTune. * Additional statistics about MCOPY/TCOPY (LIBXSMM_VERBOSE=2). * Lowered overhead of tiled/parallelized GEMM/MCOPY/TCOPY. * Made the libxsmm_hash function available (MEM/AUX module). * Initial support for lower precision (backward conv.)

CHANGES * AVX-512 based CPUID-dispatched input/output of the Winograd transformation (forward conv.). * Adjusted build system to pick up RPM_OPT_FLAGS (RPM based Linux distributions). * Moved extensive Q&A to a Wiki page and cleaned up the reference documentation. * Improved/extended Getting Started Guide for TensorFlow with LIBXSMM. * Improved general backend error propagation, and avoid duplicated messages. * Iterative subdivision of large matrix transposes (tcopy) and matcopy (mcopy). * Non-task based and (optional) task based parallelization of tcopy and mcopy. * Mentioned KNM target key ("knm") in the reference documentation. * Improved prefetches in the KNM code path of the weight update. * Adjusted initialization sequence during startup. * Improved parallelization grammar.

FIXES * Fixed pruned tile sizes and division-by-zero error in tiled GEMM. * Propagate backend errors in case of an insufficient JIT buffer. * CRC32 SW implementation issues unveiled by the CRAY Compiler. * Call parallelized transpose (C++ interface) when requested. * Fixed VTune support (named JIT code); broken in v1.8. * Fixed incorrect prefetch locations in KNM code path. * Fixed alignment condition in tcopy/mcopy code. * Fixed TF allocator integration with GCC 7.1.0. * Fixed some more warnings in sample codes.

- C
Published by hfp almost 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.8

This set of changes brings the Padding API to life and implements the necessary mechanisms to cover a wider range of cases. This may allow running a larger variety of TensorFlow workloads using LIBXSMM. The implementation also brings Winograd-based convolutions (chosen automatically when using LIBXSMM_DNN_CONV_ALGO_AUTO). Moreover, support for the Intel Xeon Phi processor code-named "Knights Mill" ("KNM") has been added (QFMA and VNNI instructions can be executed using the Intel SDE).

INTRODUCED - A summary of code samples has been added (pdf), and also a guide (mainly for contributors) to "Getting Started using TensorFlow with LIBXSMM" [PDF] - Additional sparse matrix primitives (fsspmdm domain); see "pyfr" and "edge" sample code - Support for the OpenMP SIMD directive on GCC (-fopenmp-simd) used in some translation units - Improved code path selection for legacy compiler versions (functions with multiple compilation targets) - DNN: Winograd based convolutions incl. a threshold to automatically select (LIBXSMM_DNN_CONV_ALGO_AUTO) between LIBXSMM_DNN_CONV_ALGO_DIRECT and LIBXSMM_DNN_CONV_ALGO_WINOGRAD - DNN: logically padded data incl. support for the Winograd based implementation - DNN: support for the Intel Knights Mill (KNM) instruction set extension (AVX-512) - DNN: support another custom format that blocks the minibatch dimension - SMM: support of FORTRAN 77 for manual JIT-dispatch (libxsmm_xmmdispatch, libxsmm_xmmcall) - SPMDM: narrowed scope of the "sum" array to improve optimization on LLVM - SMM/EXT/OMP: introduced a table of blocksizes depending on problem size; already yields improved performance for big(ger), i.e., tiled matrix multiplications (the xgemm sample now includes a hyperparameter tuning script) - SMM/DNN: JIT'ted matrix copy functions (already used in the CNN domain); both matcopy and the (upcoming) JIT'ted transpose will fully unlock performance of big(ger) GEMMs - AUX/MEM: scope-oriented multi-pool scratch memory allocator with a heuristic for buffers of different lifetime

CHANGES - Removed the LIBXSMM_MT and LIBXSMM_TASKS environment variables, and updated the documentation - COMPATIBLE=1 setting is now automatically applied (e.g., useful with the Cray Compiler) - LIBXSMM_TRYLOCK=1 now uses a single lock, and thereby reduces code duplication for the contended case; the trylock property is for user code that can handle a NULL-pointer as the result of the code dispatch, i.e., implementing a fallback code path (BLAS) - AUX/MEM: superseded the libxsmm_malloc_size function with libxsmm_get_malloc_info - Revised termination message wrt scratch memory allocation (LIBXSMM_VERBOSE) - Other: updated "spack" (HPC package manager) to use more reasonable build options - SPMDM: improved load balance

FIXES - Implemented FORTRAN dispatch interface (F2K) differently to get it working with CCE (Cray Compiler) - Worked around problem/crashes due to an outdated TCMALLOC replacement of malloc/free (CCE) - TMM: tiled MM fallback code path in multi-threaded tiled GEMM exposed an issue with LIBXSMM_TRYLOCK=1 - TMM: fixed incorrect OpenMP in task-based implementation; now always selected when in external par. region - SPMDM: bug fix for handling last block of k correctly and avoid out-of-bound accesses - Minor: fixed all flake8 complaints of our Python scripts, fixed code issues pointed out by static analysis - Fixed transpose FORTRAN sample code

- C
Published by hfp almost 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.7.1

This release finishes the memory allocation interface and documents the two memory allocation domains (default and scratch). Otherwise this release focuses on code quality (sample code) with no fixes or breaking changes when compared to version 1.7.

INTRODUCED - MEM: libxsmm_release_scratch has been introduced (unimplemented) - MEM: libxsmm_release_scratch now called during finalization - MEM: documented memory allocation domains - DNN: updated API documentation

CHANGES - More error/warning messages promoted to LIBXSMM_VERBOSE

FIXES - None

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.7

This version releases a revised DNN API to better suit an upcoming TensorFlow integration. There is also some foundation laid to distinguish scratch memory from regular/default memory buffers.

INTRODUCED - MEM: ability to change the allocation functions; two different domains: default and scratch - MEM: C++ scoped allocator ("syntactical sugar"); incl. TensorFlow-specific adapter - MEM: optional TBB scalable malloc in both default and scratch allocator domain - DNN: more general buffer and filter link/bind functionality - LIBXSMM_VERBOSE messages rather than debug build - Improved dispatch for legacy compilers

CHANGES - DNN: revised API (breaking changes)

FIXES - SPMDM: fixed disagreement between static/dynamic code path (on top of v1.6.6) - MEM: avoid CRC memory checks for header-only library (different code versions)

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.6

This is a bug-fix release with focus on the SPMDM domain. There are also a number of code quality improvements. This is potentially the last 1.6.x release with a number of API changes scheduled for the DNN domain (v1.7).

INTRODUCED - SPMDM: promoted error messages from debug-only builds to LIBXSMM_VERBOSE mode - README now documents on how to inspect the raw binary dumps

CHANGES - Improved code quality according to a code quality checker (potential issues)

FIXES - SPMDM: fixed setup of handle to correspond with CPUID-dispatched/available code path - SPMDM: fixed calculating the size of the scratch buffer (single-threaded case)

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.5

This is a bug-fix release, which resolves a severe issue with concurrently modifying the code registry. The related code did not receive much development in the past (macro based), but is now cleanly implemented and covered by a rigorous test case. There is also enough of an API to determine some basic registry properties (capacity, size), and a guarantee to receive JIT-code under reasonable conditions (e.g., if the registry is not exhausted). A routine allows relaxing the conditions under which no JIT-code is generated (libxsmm_[get|set]_dispatch_trylock allows returning no code if access to the code registry is contended).
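A hedged sketch of querying the registry properties mentioned above; the function name libxsmm_get_registry_info matches the INTRODUCED list below, while the exact field names of the info structure (size, capacity) are assumptions.

```c
#include <libxsmm.h>
#include <stdlib.h>
#include <stdio.h>

int main(void) {
  libxsmm_registry_info info;
  libxsmm_init();
  /* basic metrics of the (GEMM-)code registry; field names are assumptions */
  if (EXIT_SUCCESS == libxsmm_get_registry_info(&info)) {
    printf("registry: %u of %u slots in use\n",
      (unsigned int)info.size, (unsigned int)info.capacity);
  }
  libxsmm_finalize();
  return 0;
}
```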

INTRODUCED - Slightly improved multi-target functions, but GCC 4.9 (and later) is needed to avoid "legacy" support. - Introduced libxsmm_get_registry_info to receive basic metrics about the (GEMM-)code registry. - Implemented parallelization threshold for the libxsmm_otrans_omp routines. - Improved code generation of the sparse matrix domain and JIT-support (A-sparse/reg., CSR); libxsmm_create_dgemm_descriptor routine to ease language binding (pyfr sample code). - Cover a wider range of compiler versions, see our build status page. - More error/warnings covered in release builds (LIBXSMM_VERBOSE=1). - Build all possible sample codes as part of the CI tests. - Implemented sync/lock abstraction (Windows). - Optimized access to thread-local code cache.

CHANGES - TF configured THRESHOLD=0, but explicit JIT does not fall back, plus THRESHOLD is an upper limit; prevented a 0-threshold, and chose the default if THRESHOLD=0 is requested (128**3). - Suppress warnings about unused functions when our build system is not used. - Reduced lock-contention in JIT-code generation (more locks). - Use relaxed Atomics in JIT-code thread synchronization.

FIXES - Fixed severe issue with concurrent JIT-code generation (code registry); new/rigorous test case. - Fixed an issue when building the DNN sample code using GCC 6.3 (linker error). - SPMDM: avoid some duplicated symbols under Windows.

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.4

This is a maintenance release with improvements for GCC and Clang compilers (function-level target compilation, and intrinsics support). The function-level target compilation is a prerequisite for good performance due to CPUID-dispatched code paths. Moreover, in preparation of v1.7, there are breaking changes in the DNN domain (buffer management is now an external responsibility). An API for logical padding has been added (DNN domain). In addition to our Travis CI, improved test coverage for a variety of compiler versions is now in place.

INTRODUCED - SPMDM: introduced CPUID-dispatched code paths - SPMDM: support for transposing C

CHANGES - No distinction between SSE 4.1/4.2 (new enum LIBXSMM_X86_SSE4, removed LIBXSMM_X86_SSE4_*) - DNN: removed create_buffer and create_filter functions since buffers are provided externally - DNN: updated googlenetv1 script to match the googlenetv1 description - DNN: initial changes to support logical input padding - DNN: improved performance of the weight update - DNN: new padding frontend API

FIXES - Fixed intrinsic layer for reliable target compilation (function level), and clean switches for legacy compilers, included FMA flag when targeting AVX2 on GCC and Clang - DNN: fix in image parallel forward convolution when 2d register blocking is used - DNN: fixed physical input padding for backward and weight update (all format combinations) - DNN: fixed physical padding in the fallback code path - DNN: fixed some corner case prefetching bug - SPMDM: fixed library initialization

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.3

This is a maintenance release with minor improvements over 1.6.2.

INTRODUCED - Listed TensorFlow as an application that can make use of LIBXSMM - Environment variable LIBXSMM_TRYLOCK, and related API functions - Build key INIT=0 to omit lazy initialization overhead

CHANGES - Updated copyright banner for 2017

FIXES - Support for the Mainline version of the Clang compiler ("version 0.0.0") - Fixed non-prefetch JIT function names for AVX512 UPD code - Minor: some more target attributes for KNC (F interface)

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.2

This is a maintenance release, which focuses (again) on the DNN API. However, this version includes bug-fixes for a number of severe issues, which have been found in various domains (SMM, DNN, SPMDM, and in general).

INTRODUCED - Documented header-only implementation of LIBXSMM - DNN: introduced routine to check code gen. (libxsmm_dnn_get_codegen_success) - DNN: introduced routine for explicit transpose (libxsmm_dnn_transpose_filter) - DNN: introduced to query number of tasks (libxsmm_dnn_get_parallel_tasks) - DNN: support external filter reduction in case of parallelization over the minibatch - MEM: exposed routine to query size of buffer allocated by libxsmm_[aligned_]malloc - SPMDM: introduced support for beta, code optimizations

CHANGES - SPMDM: improved static code path selection (no CPUID dispatch) - SMM: raised THRESHOLD until which JIT code is automatically generated - Raised baseline code path to SSE4.2 to avoid CPUID-dispatched CRC32; fixed (again) controlling the static code path according to documentation - Adjusted separation between gen-library and main library - MEM/debug: checksum for internal bookkeeping structure - MEM: streamlined internal bookkeeping structures - Improved reliability of library initialization

FIXES - SMM: possibly wrong code version under concurrent dispatch under hash key collision - DNN: raised/fixed weight update performance to the expected level (AVX-512) - DNN: fixed a bug which was introduced by code refactoring (fwd. convolution) - DNN: fixed bug in DeepBench and refactored backward convolution code - DNN: corrected setting up the handle for the weight update convolution - MEM: fixed kernel-dump related console output (print correct address) - Avoid certain (pseudo-)AVX-512 intrinsics, which might be not present (GCC) - Avoid AVX-512/Core intrinsics prior to Clang 3.8 (3.9 brings them in) - Avoid applying AVX-512/Core flags with earlier versions of Clang (IDEs) - Updated C++ entry points for code dispatch (remainder of issue #105); this change fixed a performance issue with the CP2K/intel branch - SPMDM: fixed issue for N if not a multiple of 16

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.1

This is a maintenance and bug-fix release, which focuses on the recently introduced API for sparse matrix multiplication. There are also internal improvements to better cover different flavors of the Linux OS.

INTRODUCED - SPMDM API since v1.6 (still experimental) for sparse matrix multiplication

CHANGES - SMM: descriptor size setting (a.k.a. BIG=1) is now part of static configuration - SPMDM: adjusted API according to the received feedback

FIXES - SPMDM: fixed minor issues, and one severe issue (incorrectly sized internal buffer) - SMM: statically gen. kernels were not considered (non-matching prefetch strategy) - SSE=0 now behaves as documented; SSE=0 and AVX=0 (both!) selects "no code path"

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6

The revised DNN API now provides a more complete set of primitives for Convolutional Neural Networks. Forward and backward transformation as well as weight updates are supported for a variety of data formats including TensorFlow's native data format. The data/pixel types not only allow for single-precision floating point input and output layers, but also for 8 or 16-bit integer in- and outputs using integer kernels that operate on 16 or 32-bit integer data. In addition to AVX-512, AVX2 optimized kernels are now included.

INTRODUCED - LIBXSMM avail. for RPM based Linux distr. (Fedora, RHEL) via EPEL repository - Documented service functions ("secondary" API): timer, malloc, etc.; CPUID functionality now available as part of the service functions - SPMDM API (experimental) for sparse matrix multiplication - DNN: forward/backward, and weight update convolutions - DNN: Intel AVX2 support (in addition to AVX-512)

CHANGES - SMM: regular descriptor size (32-bit integer) is now default; BIG=1 (issue #109) - SMM: adjusted default prefetch strategy; more sophisticated (issue #105) - GNU Compiler Collection, Clang, and Intel Compiler fully supported/tested; CCE (CRAY) regularly checked and supported via the COMPATIBLE build key; PGI compiler occasionally checked (supported via COMPATIBLE=1) - DNN: revised API (breaking changes as announced per v1.5.x)

FIXES - Documented that LIBXSMM cannot be linked dynamically if BLAS is linked statically - SMM: fixed FORTRAN interface issue with older Intel Compiler (issue #104) - Tiled GEMM fixed (min. tile size might be selected larger than leading dim.) - Fixed unaligned mem. access in stand-alone out-of-place transpose - DNN: numerous fixes since v1.5.x; v1.6/onward will report fixes separately - SYNC: barrier can be gracefully released if it was not constructed

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.5.2

This release cherry-picked changes from the master revision that fix (minor) issues. The issues are mostly related to library-mechanics or infrastructure, and "to converge-out" with the 1.5 release. The overall objective is to support making this library available with regular Linux distributions. Thank you to all maintainers who are involved in the review of LIBXSMM!

INTRODUCED - The collection of changes does not break the (DNN-)API, and we encourage people to adopt the master revision for any integration work related to our DNN API (as it slightly changes in v1.6).

CHANGES - Issue #103 (question about 32-bit support): fixes the 32-bit build (as an exercise). There is no intent to support 32-bit architecture! - Adjusted default build target to avoid building additional targets as part of the installation. - Adjusted file extension/marker of build scripts (spmdm sample code).

FIXES - Issue #104 (ifort segfaults when compiling 1.5.1's libxsmm.f): workaround in place. Re-validated with our other supported Fortran compilers. - Fixed soname conformance needed for Linux package distribution. - Fixed build dependency when building in an out-of-tree fashion. - Fixed Fortran interface for some older CCE tool chains.

Note: the paper "LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation" will be presented at the Supercomputing Conference (SC'16); meanwhile people may ask us for a preprint of the publication.

- C
Published by hfp over 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.5.1

This (minor) release is mainly a bugfix release, which gains its urgency from a bug in the Fortran interface (SMM functionality), where requesting a JIT kernel never returned a suitable PROCEDURE POINTER (always NULL). The implemented fix now reaches v1.5's goal of supporting a wider variety of Fortran compilers (GNU, Intel, CRAY, and PGI) while the Fortran interface code still allows staying with GNU Fortran 4.5 (the oldest supported Fortran compiler).

Beyond the above bugfix, there are four fixes for the new DNN functionality, and an improved/fixed console output of the DNN sample code. Furthermore, the out-of-place transpose code now detects when the input and output matrix point to the same array (alias). Instead of returning an error code in general, the most common special case (M=N, LDin=LDout) is now implemented (a high-performance in-place transpose is still pending for a future release).

INTRODUCED - SC'16 paper "LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation" => Please consider attending the presentation! - Self-contained Linux perf support (see PR #100): removed dependency on the Linux kernel header - Additional sample code (spmdm) for sparse matrix multiplication (see PR #101)

CHANGES - Improved reliability of the out-of-place transpose, and support for in-place corner case - Additional test infrastructure e.g., allowing to test with Intel Compiler - New script (.travis.sh) to build/run Travis testset (.travis.yml; "script:" section) - DNN backend: expanded support for 8 and 16-bit integer instructions

FIXES - Fixed Fortran interface, where requesting a JIT kernel never returned a suitable PROCEDURE (NULL) => This issue was introduced by v1.5, which aimed to support a wider variety of compilers - DNN backend: fixed bug in int16 convolutions (2d register blocking) - DNN: fixed bug in nhwc/rsck fallback code (forward convolutions) - DNN: fixed bug in unrolling calculation for the int16 implementation - DNN: fixed case for less than 16 input channels (int16) - DNN sample code: fixed GOP and GFLOP output

- C
Published by hfp over 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.5

A major addition for LIBXSMM is the introduction of the DNN API, which can be used for e.g., Convolutional Neural Networks (CNNs). As a consequence, the banner description of LIBXSMM has been updated:

Library targeting Intel Architecture (x86) for small, dense or sparse matrix multiplications, and small convolutions.

The small convolutions are currently focused on Intel AVX-512, but compiler-generated fallback code is in place as well. Besides AVX-512, forward convolutions (along with support for different storage formats) are also covered with Intel AVX2. In addition to LIBXSMM's internal storage scheme, the library supports a variety of other popular data formats, one of which is TensorFlow's native NHWC storage scheme. With respect to the supported data types, single-precision convolution kernels (FP32) are fully supported by the JIT code generator. Moreover, initial code for Int16-based data is already in place. During the past development cycle, Google Inc. stated some interest in LIBXSMM, and also contributed the Linux perf support to confirm the commitment. For others who would like to join our efforts, a preliminary Wiki page about contributions has been added (https://github.com/hfp/libxsmm/wiki/Contribute).

INTRODUCED - New DNN API, sample code, and benchmarks (Googlenetv1, DeepBench, and Overfeat) - Enabled tiled GEMM support in static/dynamic wrapper; MT support via libxsmmext - More format variations of sparse matrix multiplication (dense/sparse etc.) - Sample code showing sparse matrix multiplication (PyFR examples collection) - Published synchronization layer (atomics, and simple/bare OS-thread/lock abstraction) - Introduced mini-API for optimized barrier implementation (general multicore support) - Introduced API for memory allocation (malloc interface); mostly exposed from internal API - Beside of Intel VTune, now Linux perf and jitdump are supported (Thank you Maciej D.!) - SPECFEM sample: received nicely written example contribution (Thank you Daniel P.!) - OSX (incl. "El Capitan") now supports Intel Compiler, Apple/Clang, and GNU GCC - CRAY's Compiling Environment (CCE) is now supported - PGI compiler is now supported
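The newly exposed memory allocation API can be used as in the following sketch (aligned allocation paired with the matching deallocator); the 2 MB size and 64-byte alignment are arbitrary example values.

```c
#include <libxsmm.h>
#include <stdlib.h>

int main(void) {
  /* allocate a 2 MB buffer aligned to a 64-byte boundary; an alignment of 0
     would request an automatic/default alignment */
  double* buffer = (double*)libxsmm_aligned_malloc((size_t)2 << 20, 64);
  if (NULL != buffer) {
    buffer[0] = 1.0; /* use the buffer */
    libxsmm_free(buffer); /* release with the matching deallocator */
  }
  return EXIT_SUCCESS;
}
```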

CHANGES - Solidified API/impl. for out-of-place (OOP) transposes; ST/MT support (MT via libxsmmext) - Type-optimized OOP-transpose implementations, and generic/full support for any element type - Shared OpenMP infrastructure/abstraction for transposes and GEMMs. - Introduced and documented the LIBXSMM_MT environment variable (ST/MT/sync control). - Performance enhancements for sparse matrix multiplication (code gen., prefetches) - Support for SMM kernels (BIG=1) with larger extent(s) in terms of M, N, K, LDA, LDB, or LDC - Support for "ease of use" APIs (internal multi-threading), and external MT runtimes - Include "secondary" APIs in the first place (libxsmm.h), i.e., malloc, timer, sync.h - Included statistic into the LIBXSMM_VERBOSE table for kernels which exceed the MNK threshold. - Updated documentation to cover the new DNN API; added sample code (samples/dnn) - Enhanced infrastructure and portability for Variable Length Arrays (VLAs) - Library infrastructure (templates) for different element/pixel types (F32, I16, I8) - Improved development infrastructure (merging version.txt, and commit msg. hook) - Improved Travis-CI turnaround time (due to commit msg. hook [skip ci], and upload timeout) - Improved support for Clang, and bleeding edge compilers/architectures (intrinsic layer, etc.) - CPUID distinction between AVX-512/Core, AVX-512/MIC, and AVX-512/Common - Better build-time support for AVX-512 (AVX=3 MIC=0|1, etc.) - Removed disabling of JIT-support under Windows (still, the calling convention is not in place) - Better intro-style/banner (license, Travis, etc.) for online documentation (README.sh, etc.) - Improved info message when building LIBXSMM (compiler, code path info, etc.) - Revised wrapper mechanism, static wrapper now requires a special build of libxsmmext (WRAP=1|2) - Improved dispatching of the LIBXSMM_PREFETCH strategy (common, GEMM, tiled GEMM) - Introduced LIBXSMM_GEMM_PREFETCH=-1|0...10 environment variable for tiled GEMM - Debug helpers (internal): libxsmm_meta_image_typeinfo, libxsmm_meta_image_write, libxsmm_gemm_dump - Renamed libxsmm_[get|set]_verbose_mode to libxsmm_[get|set]_verbosity (verbosity level) - Improved verbose mode: TRY-counter now collects rejected JIT requests (unsupported GEMM calls) - Verbose mode (>1) prints rejected GEMM calls (console), or dumps (<-1) data in MHD format - Meta Image (MHD) format for data dumps (inspection via ITK-SNAP, ParaView, or similar) - TSC-based (not about CPU cycles!) libxsmm_timer_xtick (in addition to libxsmm_timer_tick) - Improved calculation of tile sizes for tiled GEMM (LIBXSMM_CLMP, LIBXSMM_SQRT2) - Improved header-only support, and related/new CI test target (Travis CI)

FIXES
- Improvements and fixes of the backend support for sparse matrix multiplication
- Bug fixes wrt code dispatch, medium-sized GEMMs, and the wrapper mechanism
- Fixed issue where a certain GEMM API did not respect the JIT-bypass/BLAS-fallback
- Support for "no BLAS dependency" (which previously broke the static wrapper)
- Correctly handle the user-documented prefetch id vs. the internal prefetch flag/bits
- Disarm MKL_DIRECT_CALL/MKL_DIRECT_CALL_SEQ when determining the original BLAS symbol
- Adjusted/fixed support for dispatching statically generated SMM kernels
- Fixed issue where BIG SMM kernels returned the wrong code from the registry
- Fixed inline assembly for CPUID detection; the issue was only exposed with Clang
- Fixed/disabled LIBXSMM_ATOMIC_STORE_ZERO issue (may hang) for non-LIBXSMM_GCCATOMICS
- Fixed lazy initialization for certain cases/tool chains (related to c'tor/d'tor attributes)
- Fixed compiler warnings with older Intel Compiler (atomics layer)

- C
Published by hfp over 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.4.4

This release improves and stabilizes previously released features while containing the necessary changes (generalized VTune Profiling support, JIT buffer management, and changes to the code registry structure) for upcoming new functionality. It also contains a number of new (preview-)features (not yet documented) such as sparse SoA matrix multiplication in the frontend, and stand-alone out-of-place general matrix transposes.

CHANGES
- Introduced SONAME for shared objects (dynamic library) under Linux and OS X (see issue #79). This change may ease including the library in Linux distributions (package repositories). The Python utility script has been adjusted to output the various version number formats used to format the SONAME. This change also updated the installation target (Makefile) to install symbolic links rather than duplicating shared libraries.
- Made PREFETCH=1 the default as it already refers to auto-prefetch based on the CPUID. This change complements previous efforts to reduce the "need" for different compile-time configurations and specializations. Performance related needs are now mostly migrated to CPUID-dispatched code paths.
- LIBXSMM_VERBOSE mode now includes accurate heap memory consumption for the code registry and for the JIT'ted code buffers, and it also allows dumping the JIT code to files for manual inspection (issue #88).
- Improved FORTRAN 2003 conformance (larger set of warnings under the PEDANTIC=2 umbrella flag), and resolved an issue with the Intel Compiler 2011 SP1 (avoid the MERGE intrinsic in a PARAMETER declaration).
- Deprecated (actually removed) the ROWMAJOR support in preparation for including a regular CBLAS interface. This also removes the associated configuration flags in the interface while keeping some support for deployed applications which fortunately only check for COLMAJOR.
- Initial sparse matrix support arrived in the interface; such a kernel is not managed by the code registry, but rather created (libxsmm_create_dcsr_soa) and released (libxsmm_destroy) manually.
- Internal library services are ported in preparation for Windows support. This includes VTune support for executable buffers in general, which also includes manually managed kernels (sparse SoA kernels).
- Initial stand-alone support for out-of-place matrix transpose (libxsmm*transpose_oop) for C/C++ and FORTRAN. The CPUID-dispatched code and the implementation of the in-place transpose are still missing.
- Enabled JIT code generation under Windows (does not work yet due to the incorrect calling convention). In fact, all code previously preventing the JIT facility under Windows is now removed, and thus one may call into JIT code (and fail due to the different calling convention). Prefetch signatures are still avoided under Windows (although this does not help with the calling convention). Cygwin support still avoids JIT other than exercising the related code when building a DEBUG version.
- Improved Clang support, and in particular accounted somewhat better for the broken Intrinsic support in Clang (when the static code path is below the code path "needed" for the Intrinsics). This also played out as an improvement for the GCC-based tool chain, which somewhat better supports the Intrinsics use case (target attribute). Under OS X, the SSE 4.2 code is now enabled as the baseline/static code path (due to broken support with the CRC32 intrinsics in particular). Note that under Linux the CRC32 instructions are CPUID-dispatched.
- Allow for a header-only implementation of LIBXSMM to ease adoption with certain header-only C++ libraries (Eigen, etc.); see issue #86 and the sketch after this list. This facility also works for C (which is quite notable); however, the header-only implementation currently does not allow linking C and C++ objects into a single binary.
- Code which does not call any BLAS-related code in LIBXSMM (e.g., the sparse SoA kernels) may now link against libxsmmext in order to get rid of the BLAS dependency. For more details see issue #82.
- Updated documentation (it is still behind newer/development features); updated the CP2K guide (documentation folder).
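
The header-only usage mentioned in the list above (issue #86) roughly looks as follows; the header name (libxsmm_source.h) follows later LIBXSMM documentation and is an assumption for this particular release.

```c
/* Hedged sketch of header-only usage: no libxsmm library needs to be linked.
 * The header name (libxsmm_source.h) is assumed from later documentation. */
#include <libxsmm_source.h>

int main(void) {
  libxsmm_init();
  /* ... dispatch and call kernels as with the regularly linked library ... */
  libxsmm_finalize();
  return 0;
}
```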

FIXES
- libxsmm_xmmdispatch now properly falls back to BLAS if the requested kernel is not supported.
- There are numerous smaller improvements and CHANGES which can be perceived as fixes.

UPCOMING
- Initial support for convolutions as commonly used in Machine Learning
- High-performance stand-alone in-place transpose
- Windows JIT support

- C
Published by hfp over 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.4.3

This version releases minor improvements on top of version 1.4.2. None of the changes are critical or address issues affecting stable operation.

CHANGES
- Closed an open "todo" about using atomic operations when collecting verbose-mode counters.
- Fixed a compiler warning when compiling the library for AVX-512 (libxsmm_intrinsics_x86.h).
- Fixed the CACHE flag for adjusting the size of the thread-local cache (Makefile).
- Fixed TRACE to apply the necessary linker flag for the call trace (Makefile).

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.4.2

This release implements a number of features which are non-critical for the core functionality but either serve a request (API to get/set the target architecture, and MATMUL wrapper) or greatly improve usability for developers (JIT profiling, and verbose mode).

CHANGES
- MATMUL-style routines (58fcb41 and e055d65) as a thin wrapper around GEMM (FORTRAN only)
- Issue #75 (frontend function to bypass cpu-id and to set arch_id): API to get/set the target architecture
- Issue #76 (Support JIT-Profiling API): show JIT-kernel insights within Intel VTune Amplifier
- Issue #78 (Introduce verbose mode): extended termination message (kernel statistics)

Besides the new features, there are two non-critical fixes. Issue #77 in fact led to a non-working call-wrapper mechanism for statically linked GEMM routines when using Intel MKL. The resolution not only fixes the problem, but also unifies the static call interception for all BLAS libraries (the documentation is updated accordingly). The other issue was about failing to register statically generated kernels on systems which cannot JIT-generate code (pre-AVX era); the resolution includes fixes as well as an enhancement.

FIXES
- Issue #77 (statically wrapping GEMM calls now works as expected/documented)
- Fixed registering statically generated code (de0af05, b05a02c, and f093543)

There is also an enhancement which became possible in version 1.4.1 (2230568); however, the size of GEMM descriptor entries had not been reduced because the SIMD padding was not updated (this applies to the code registry and the thread-local cache). Another enhancement (3610639), addressed along with Issue #75, is the extension of the available code paths: AVX-512 is now handled in two flavors (MIC and CORE). This information is currently not used to generate different code (everything is AVX-512F, i.e., foundational instructions), but to eventually load different platform defaults.

Note: the new API for getting/setting the target architecture was partly present in previous releases (getter). However, this release not only adds the setter functionality but also slightly changes (in an incompatible fashion) the previously implemented wrapper. The renamed getter function also comes along with a renamed environment variable (c7ea23c: LIBXSMM_JIT has been renamed to LIBXSMM_TARGET).
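
For illustration, a hedged sketch of the get/set target-architecture API discussed above; the function names (libxsmm_get_target_arch, libxsmm_set_target_arch) follow the later documented API and are assumed for this release. The same effect can be achieved with the LIBXSMM_TARGET environment variable.

```c
#include <libxsmm.h>
#include <stdio.h>

int main(void) {
  libxsmm_init();
  /* Assumed API: query the currently targeted code path... */
  printf("current target: %s\n", libxsmm_get_target_arch());
  /* ...and override it at runtime (e.g., force AVX/Sandy Bridge code paths). */
  libxsmm_set_target_arch("snb");
  printf("new target: %s\n", libxsmm_get_target_arch());
  libxsmm_finalize();
  return 0;
}
```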

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.4.1

This release fixes an issue where a repeated init-finalize cycle likely caused the thread-local code caches to contain invalid data, i.e., pointing to (and calling) a code buffer which was already released. Besides the bug fix, this release also contains some preparation for issues #72 and #71, i.e., splitting the internal code registry in an SoA fashion and cleaning up unused code (unused compile-time alternatives).
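
The repeated init-finalize cycle described above can be sketched as follows; libxsmm_init and libxsmm_finalize are assumed to be the public lifecycle calls in this release line.

```c
#include <libxsmm.h>

int main(void) {
  int i;
  /* Prior to this fix, thread-local code caches could retain pointers to
   * already released JIT buffers across repeated init-finalize cycles. */
  for (i = 0; i < 3; ++i) {
    libxsmm_init();
    /* ... dispatch and call kernels ... */
    libxsmm_finalize();
  }
  return 0;
}
```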

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.4

This release settles the features previewed in version 1.3 by making the "medium-sized" matrix multiplication available for both the C/C++ and the FORTRAN interface as well as by including the topic in the reference documentation. Further, a potential performance enhancement arrived by dispatching the default prefetch strategy according to the CPUID (when building with PREFETCH=1 or via JIT). This change was mainly triggered by a performance regression with statically generated KNC kernels; however, it also introduced the infrastructure to take advantage of the dispatch if needed, and it is available for JIT'ted kernels requesting LIBXSMM_PREFETCH_AUTO. Another enhancement is the removal of the collected performance results from the reference documentation. Results are now moved into an orphaned branch called "results". This change was mainly triggered by the unreasonable size of what is supposed to be a source code archive, and it further triggered a definition of what goes into a Git-exported archive (tarball, ZIP file). This enhancement eases redistributing and re-hosting the archive files.

CHANGES
- Settle "medium-sized" matrix multiplication for C/C++ and FORTRAN (issue #65)
- Select the PREFETCH strategy according to CPUID for static code and JIT (issue #69)
- Move collateral results into an orphaned branch to reduce the size of archives (issue #70)
- Documented the LIBXSMM_GEMM and LIBXSMM_OMP environment variables
- Handle hash key collisions when registering static kernels (issue #73)

The last change about handling hash key collisions when registering statically generated code is not only a potential performance improvement when relying on static kernels, but it also triggered at least one important fix related to resolving hash key collisions.

FIXES
- An incorrectly resolved hash key collision returned an incorrect code version (a761220)
- An exhausted code registry potentially resulted in incorrect behavior (39cc809: next != i0)
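
To illustrate the kind of collision handling described above, here is a purely hypothetical sketch (not LIBXSMM's actual code): a hash over the GEMM descriptor selects a registry slot, and linear probing with a full descriptor comparison resolves collisions.

```c
#include <string.h>

/* Hypothetical illustration only; names and layout are not LIBXSMM's code. */
#define REGISTRY_SIZE 1024

typedef struct { int m, n, k, lda, ldb, ldc, flags; } gemm_descriptor;
typedef struct { gemm_descriptor key; void* code; int used; } registry_entry;
static registry_entry registry[REGISTRY_SIZE];

static unsigned int hash_descriptor(const gemm_descriptor* d) {
  /* stand-in for the CRC32-based hash mentioned elsewhere in these notes */
  return ((unsigned int)d->m * 31u + (unsigned int)d->n * 131u
        + (unsigned int)d->k * 1031u) % REGISTRY_SIZE;
}

void* lookup_or_register(const gemm_descriptor* d, void* generated_code) {
  unsigned int i = hash_descriptor(d), probed = 0;
  while (probed < REGISTRY_SIZE) {
    registry_entry* e = &registry[i];
    if (!e->used) { /* empty slot: register the generated code */
      e->key = *d; e->code = generated_code; e->used = 1;
      return e->code;
    }
    if (0 == memcmp(&e->key, d, sizeof(*d))) { /* full comparison resolves collisions */
      return e->code;
    }
    i = (i + 1) % REGISTRY_SIZE; ++probed; /* linear probing on collision */
  }
  return NULL; /* registry exhausted */
}
```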

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.3

This release of LIBXSMM introduces two new features, one of which is a preview feature intended to cover "medium-sized" GEMM routines, whereas the other (cache) feature initially targets the NEKBOX workload (but accelerates any repeated kernel dispatch).

FEATURES
- Small thread-local cache of the most recently dispatched kernels (Issue #62)
- Medium-sized GEMM routines (PREVIEW) (Issue #65)

Medium-sized GEMM routines (PREVIEW): The supposedly "medium-sized" GEMM routines (libxsmm_omps_?gemm) are OpenMP based ("omp"), but are meant to remain sequential ("s") unless requested otherwise (via the LIBXSMM_GEMM environment variable) or when incorporated into a parallel region. Due to the experimental status, the interface is not final (C-only at this point, i.e., not present in the Fortran interface). This feature is also known to rely on the Intel Compiler and OpenMP 4.0 tasks at this point. To try out the "omps" routines, one may follow the xgemm sample code and call the aforementioned routines. In addition, one may LD_PRELOAD the shared extension library (libxsmmext) and rely on `LIBXSMM_GEMM=0|1|2` (0/default: sequential SMM below the THRESHOLD, 1: sequential matrix multiplications that may participate in an already opened parallel region by using OpenMP tasks, and 2: internally parallelized matrix multiplications). Please note that the (original) idea of only aiming for "medium-sized" matrix multiplications is not necessarily true going forward (this also depends on feedback).

Termination message for developers (debug build): For developers aiming to know "what's going on", the library now emits a message when terminating (debug build only; DBG=1). The JIT based code path (according to the CPUID) as well as the number of JITted kernels is printed at termination time. For completeness, the number of registered static kernels is printed as well (this happens when no JIT/AVX based code path was available). Example (stderr): LIBXSMM_JIT=hsw NJIT=14 NSTATIC=0.

Renamed extension library: Renaming the 'libxsmmld' library (the former LD_PRELOAD bits) to 'libxsmmext' is neither a feature nor a fix, but it may help accommodate future extensions, in particular extensions which require additional runtime support. At this point, 'libxsmmext' depends on OpenMP (while keeping the main library independent from a particular threading runtime).

CHANGES
- Termination message in the debug build, which might be helpful during development
- Function (libxsmm_get_target_arch) which allows querying the target architecture
- Renamed the "libxsmmld" (LD_PRELOAD) library to "libxsmmext"

Since the code generator (backend) currently only supports homogeneously transposed matrices during GEMM ('NN' by default, 'TT' via the RowMajor storage scheme), it is necessary to filter any requested GEMM call before attempting to generate a kernel, which in turn allows properly forwarding to the fallback routine (BLAS). In addition, an issue in the backend related to long K-dimensions has been fixed. Also, capturing the build status now works even with excessively long command lines stemming from a large specification ("MNK") of kernels to be statically generated (make).

FIXES
- Call-forwarding based on supported (filtered) GEMM arguments (e.g., when using LD_PRELOAD)
- Issue with extreme unrolling in the K-dimension (AVX-512) (Issue #67)
- Capturing the build status could overflow the command line length
- Minor issue with the Makefile's install target ("include2")

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.2

This release focuses on OS portability, a more thorough set of tests carried out across Operating Systems and distributions, and some more applicable defaults when building the library for a wider audience. The latter is intended to suit maintainers of upcoming Linux distributions who have to think about an unpredictable audience and a wider set of use cases. The second focus point was about dispatching performance-critical functionality independent of the static code path which is selected when building the library. The third focus point was about features supporting developers who want to evaluate LIBXSMM, or incorporate the library into their application.

Here is a list of the main changes along with some more details:
- Validated against Linux and OS X using Travis Continuous Integration. As a side-effect of delivering OS X support (which was requested), the code should also work under FreeBSD when accounting for the specifics (using gmake rather than make, etc.). There is also limited support for Microsoft Windows (no JIT compilation).
- A new documentation section about installing LIBXSMM (https://github.com/hfp/libxsmm/#installation) has been written with package maintainers in mind. This is complemented by a removed link-time dependency on LAPACK/BLAS such that the decision about which BLAS library to link with is deferred to the point where the actual application is linked. LIBXSMM works with any (BLAS) library supplying ?gemm symbols.
- For developers who want to incorporate LIBXSMM, the documentation now mentions how to start with a library build (DBG=1) which emits messages about internal error/warning conditions discovered at runtime; normally the library does not perform any non-private or visible I/O. In addition, a TRACE facility has been implemented and documented to further support application developers.
- Evaluating and using LIBXSMM has been made very low effort by implementing an LD_PRELOAD mechanism (or DYLD_INSERT_LIBRARIES under OS X). In addition, another but similar mechanism has been implemented to help with applications which statically link against LAPACK/BLAS (link-time wrapper). There is a dedicated section about this feature (https://github.com/hfp/libxsmm/#call-wrapper).
- The dispatch mechanism of the internal code registry (which delivers the "dispatching" of JIT'ted code) now adapts according to CPUID (it checks whether the SSE 4.2 based CRC32 instructions are available). In addition to software-based CRC32 hash keys, an alternative hash key generator has been implemented to limit the performance penalty in case the CRC32 instructions are not available.
- Fortran applications can now rely on the generated module file and link against 'libxsmmf'. This complements the mechanism of simply including LIBXSMM's Fortran interface ('include/libxsmm.f'). In addition, there is limited support for Fortran 77 (`libxsmm?gemm` functions only). Some more details can be found at the end of the section about building the library (https://github.com/hfp/libxsmm/#build-instructions).
- Finally, the backend code generator has been tweaked for smaller instruction sizes emitted when generating Intel AVX-512 code (Intel Xeon Phi family of processors code-named Knights Landing, "KNL").

- C
Published by hfp about 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.1.1

This is a minor update release fixing the following issues: selecting the main code path at build-time of the library (SSE, AVX), and parsing the version number and branch name (version.txt).

- C
Published by hfp about 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.1

This release is settling our frontend (C/C++ and Fortran interface) by bringing all language interfaces to an equal level in terms of functionality and usability, e.g., the Fortran interface has been brought up to support the prefetch signatures. In terms of language capabilities, the C++ and the Fortran interfaces both support overloaded functions (generic procedures) as well as a functor/call mechanism to help calling the backend code. Generic procedures are now available without compromising the availability of assumed-shape array procedures. Moreover, the Fortran interface now fully settles on the C implementation, and therefore the previously needed glue code has become superfluous.

Our JIT backend has left the "experimental" state after being successfully deployed into several applications. Also, for our dispatch mechanism, the known issue about possible hash key collisions has been resolved. For deploying the library into an unknown or inhomogeneous environment, a huge leap has been made by JITting code according to the CPUID flags. The latter is accompanied by the option to include static SSE code (which is not supported by the JIT backend) in the library while still being able to JIT for the best available code path.

The next milestone, intercepting existing calls to GEMM, has already been addressed by settling the interface. What was previously known as the "simplified interface" has been removed, and binary-compatible GEMM routines are now available. The latter allow auto-dispatching every GEMM call in an attempt to harvest higher performance for suitable matrix multiplications. Providing call interception is now within reach of an upcoming update. In fact, statically linked GEMM calls can already be intercepted by, e.g., adding -Wl,--wrap=dgemm_ -L/path/to/libxsmm -lxsmm to the link line.
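
A hedged sketch of calling one of the binary-compatible GEMM routines mentioned above; the name and dgemm-style signature of libxsmm_dgemm follow later LIBXSMM documentation and are assumed for this release.

```c
#include <libxsmm.h>

/* C := alpha * A * B + beta * C for small, column-major matrices;
 * the dgemm-compatible signature is assumed from later documentation. */
void small_gemm(const double* a, const double* b, double* c) {
  const libxsmm_blasint m = 16, n = 16, k = 16;
  const double alpha = 1.0, beta = 1.0;
  libxsmm_dgemm("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);
}
```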

- C
Published by hfp about 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.0.2

This is an intermediate release which was validated for and is integrated into NekBox, https://github.com/maxhutch/nek.

- C
Published by alheinecke about 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.0.1

This release provides a small but important bugfix to ensure simple usage of LIBXSMM by re-enabling lazy initialization of the library when static code generation is used. Note that this only affects performance and not correctness. Version 1.0 always called the C fallback if libxsmm_build_static() was not explicitly called. JIT=1 was not affected by this issue.

- C
Published by alheinecke over 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.0

This release completes a major refactoring of our library backend while introducing additional capabilities in the frontend (interface). The major update is the ability to generate code Just-In-Time (JIT), i.e., to “compile” matrix-multiplication kernels at run-time of an application. This is achieved by leveraging our reworked code generator and directly emitting machine byte code into an executable buffer. Despite the ability to automatically generate any missed kernels, there is nearly no additional overhead: the set of routines in our "CP2K collection" of 386 kernels shows only ~3% slowdown on average, while LIBXSMM outperforms the Intel MKL counterparts by ~2X (MKL_DIRECT_CALL), and the Intel Compiler (ICC) generated inlinable code by ~1.5X (on average over the aforementioned 386 kernels). Please consult the README for further details on how to use JIT compilation.

In addition, we have reimplemented our code dispatch mechanism in order to prepare LIBXSMM for a full xGEMM interface: the assembly-kernel selection is based on a hash table using a CRC32 checksum over an argument structure which already covers all xGEMM arguments. Given Intel SSE 4.2 capabilities, the calculation is accelerated using CRC32 instructions (which are available on KNL as well). Over the course of the next minor releases, we will be bringing JIT compilation out of its experimental state (adjusting code cache eviction, resource cleanup, and portability).
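
A hedged sketch of the dispatch-then-call pattern behind the mechanism described above; the names libxsmm_dmmdispatch and libxsmm_dmmfunction follow later 1.x documentation and are assumed for this early release.

```c
#include <libxsmm.h>

void run_kernel(const double* a, const double* b, double* c) {
  /* NULL arguments request the defaults (tight leading dimensions,
   * alpha = beta = 1, default flags and prefetch strategy). */
  const libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(
    23/*m*/, 23/*n*/, 23/*k*/, NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
    NULL/*alpha*/, NULL/*beta*/, NULL/*flags*/, NULL/*prefetch*/);
  if (NULL != kernel) {
    kernel(a, b, c); /* JIT'ted (or statically generated) small GEMM */
  }
}
```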

- C
Published by hfp over 10 years ago

https://github.com/libxsmm/libxsmm - Version 0.9.1

This is mainly a bug fix release correcting the AVX-512 code for N=9 and K being a multiple of 16 (DP) or 32 (SP). In addition, the samples (blas, dispatched, inlined, and specialized) are consolidated into a single sample folder. The latter also comes with a performance evaluation script (run script and Gnuplot script). The more complex "cp2k" code sample has been renamed as well along with slightly improved Gnuplot scripts.

- C
Published by hfp over 10 years ago

https://github.com/libxsmm/libxsmm - Version 0.9

This release settles the assembly code generator as the default code generation mechanism. The library targets Intel SSE3, AVX, AVX2, IMCI/KNCni, and Intel AVX-512 (foundational) instructions using optimized assembly code. Restrictions for the shape of the generated kernels are relaxed or actually removed, and the documentation is updated accordingly. The build system now handles an empty code specialization request such that only an inlinable code path and the BLAS fallback code are generated. The build system also respects the problem size threshold when generating code according to the requested specialization. The former milestone item to report some performance results is also addressed in published documentation. Moreover, additional code samples have been collected, allowing an easier start compared to the more complex CP2K proxy sample code. The documentation now starts with a Q&A section (answering how to quickly check whether LIBXSMM is beneficial for an application). In short, this release attempts to deliver a stable and complete library according to the former specification, and prepares for upcoming roadmap items such as a full xGEMM interface, and other features.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.6

This release is correcting the way the assembly code generator is called as well as correctly implementing the code wrapping the generated assembly code paths. Moreover, the dispatch mechanism using a direct lookup table now correctly covers the possible problem space. At the same time, the direct lookup table has been effectively limited in size (M x N x K <= 65536) such that the table does not exceed 512 KB on a 64-bit architecture (65536 entries of 8-byte function pointers). The fall-back dispatch mechanism remains based on a binary search, which does not suffer from the size issue. Feature-wise, the assembly code generator has been enabled by default. Moreover, an additional index generator scheme has been implemented and documented (MNK variable).

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.5

As prepared in the previous release, the library now comes with an assembly code generator. The build system transparently supports the new code generator (GENASM=1) as an alternative to the Intrinsic code path. However, a future revision of the library will enable assembly code generation by default. In addition to the code generator, the documentation gives some more tuning background and adds a roadmap section guiding expectations for upcoming developments.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.4

This release uses a direct function pointer lookup for auto-dispatched matrix-matrix multiplications, and introduces a SPARSITY build-time flag to optionally rely on a binary search (which allows for a compact/sparse lookup table). Besides some tweaked code unrolling in the Intrinsic code path, the sample/benchmark program employs more optimized parallelization settings. The library now also comes with an adjusted build system in order to support the upcoming assembly code generator (GENASM=1). While we are working on adding a public version of the assembly code generator, the library's build system is already interoperable.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.3

This release fixes an assumption used to hint the compiler's code generation. It also lowers the required stack size to fit the defaults of the tested compilers, and implements an error message when exceeding the problem size that fits on the stack (code sample). Furthermore, this release also targets developer convenience by including some IDE support.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.2

This release reworked the code generation allowing a more flexible way of specializing the code (Intrinsic code path). The documentation not only covers the revised code generation but also explains some of the optimizations introduced earlier (implicitly aligned leading dimension optimization). Furthermore, it is now possible to conveniently generate AVX-512 foundational instructions (AVX-512F).

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.1

In addition to the stable interface and the Intel AVX-512 Intrinsic code path (which was validated using the Intel SDE), the code sample evolved into a benchmark program while still providing a clean and lean code sample. Moreover the entire code is a bit more tweaked and sophisticated thanks to the standalone benchmark program.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8

This initial release is mainly acknowledging the stable interface. Moreover, the library is well tested including the AVX-512 Intrinsic code path which was validated using the Intel SDE.

- C
Published by hfp almost 11 years ago