Recent Releases of https://github.com/libxsmm/libxsmm

https://github.com/libxsmm/libxsmm - Version 1.17

This release ports the master/main build system back to the v1.16 code base. The necessary code changes have been minimized; however, since some non-trivial code changes were required, the release is labeled v1.17. The release became necessary due to the aging v1.16 line of code and the new compilers that have emerged since then. For example, issues like #562 arise when using, e.g., GNU GCC 10.x or 11.x.

Note: version v1.17 leverages the same code base as version v1.16.x. All new features, fixes, and development progress remain unreleased. As per LIBXSMM's policy to keep the master/main branch stable, one can rely on the latter to leverage new features, fixes, and development progress.

INTRODUCED * Validated with compilers released after the original v1.16 (GNU GCC 10.x, 11.x, and several Clang releases).

IMPROVEMENTS / CHANGES * Improved default for static code-paths using certain ISA extensions (no need to adjust INTRINSICS setting).


The build system controls several options, and the set of options has generally evolved since v1.16, which is the main reason for code changes. A positive side effect of more changes is thorough (re-)validation. This release was adjusted to LIBXSMM's evolved test environment (1.16.x cannot be revalidated). Code validation of v1.17 again reaches the level of the original v1.16 and further includes new compilers available since then.

- C
Published by hfp about 4 years ago

https://github.com/libxsmm/libxsmm - Version 1.16.3

This update promotes fixes from LIBXSMM's master/main branch and resolves two CVEs. Version 1.16.3 continues to leverage the same code base as versions 1.16.2 and 1.16.1. All new features, fixes, and development progress remain unreleased. As per LIBXSMM's policy to keep the master/main branch stable, one can rely on the latter to leverage new features, fixes, and development progress.

IMPROVEMENTS / CHANGES / FIXES * CVE-2021-39535 * CVE-2021-39536

- C
Published by hfp over 4 years ago

https://github.com/libxsmm/libxsmm - Version 1.16.2

This minor update resolves an issue where the OS installation (on a legacy system) does not signal support for saving the register state of contexts using instruction set extensions like SSE. The problem had already been resolved in LIBXSMM's main development branch a long time ago. The problem was discovered in certain Virtual Machine (VM) installations as well as on some OS installations (e.g., here).

INTRODUCED * New functionality and new features continue to remain with LIBXSMM's main revision (under development).

IMPROVEMENTS / CHANGES / FIXES * Adopt code-path even if OS does not properly signal support for an ISA extension.

Note: version 1.16.2 leverages the same code base as version 1.16.1 (except for a single line of code applying the above-mentioned fix). All new features, fixes, and development progress remain unreleased. As per LIBXSMM's policy to keep the master/main branch stable, one can rely on the latter to leverage new features, fixes, and development progress.

- C
Published by hfp over 4 years ago

https://github.com/libxsmm/libxsmm - Version 1.16.1

This (minor) release fixes the issues mentioned below as well as improves platform support.

THANK YOU to the Department of Chemistry at the University of Zurich for generously providing access to a Cray system.

IMPROVEMENTS / CHANGES * Muted compiler warnings caused by a separate OpenMP runtime (Clang based tool chains). * Sample code: prevent OpenBLAS' undefined type when including f77blas.h (issue)

FIXES * Fixed compilation and runtime issues with Clang-based Cray Compiler as well as Cray Classic Compiler. * Revised Fortran implementation of libxsmm_xdiff and removed _Bool dependency (issue).

- C
Published by hfp over 5 years ago

https://github.com/libxsmm/libxsmm - Version 1.16

This is a maintenance release which is meant to capture the project's continuous development in a stable release. A validated release allows our users to leverage several improvements and fixes (see below), especially in light of upcoming new features.

THANK YOU FOR YOUR CONTRIBUTION - your contribution matters! This project received several contributions, whether as a pull request, issue report, feature suggestion, or informal inquiry. We would like to thank you for the effort and time you spent on Open Source software!

INTRODUCED * Zero-config for all platforms with absolutely no configuration needed for header-only. Simplifies using Visual Studio as no up-front configuration or in-build custom steps are needed. Simplifies 3rd-party build systems incorporating LIBXSMM for both header-only and classic ABI. * Updated Hello LIBXSMM, and added code examples for C/C++ and Fortran, included minimal "support" for Bazel (request). The latter is not meant to change our Makefile based build setup but can rather help to get people started who prefer Bazel. * Fortran interface for user-data dispatch and a Fortran code sample using this interface to dispatch multiple kernels at once. The C interface was introduced earlier (v1.15). * Experimental: element-wise kernels with matrix elements (meltw), e.g., to scale, reduce, type-convert, etc.
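To illustrate the zero-config claim, below is a minimal sketch of using LIBXSMM from C: it JIT-dispatches a double-precision SMM kernel and calls it. The defaults assumed here (column-major layout, NULL arguments requesting default leading dimensions, alpha/beta, flags, and prefetch) follow the classic ABI; for header-only usage, libxsmm_source.h can be included instead of linking the library.

```c
#include <libxsmm.h>
#include <stdio.h>

int main(void) {
  const libxsmm_blasint m = 8, n = 8, k = 8;
  double a[8*8], b[8*8], c[8*8];
  libxsmm_blasint i;
  for (i = 0; i < m * k; ++i) a[i] = 1.0;
  for (i = 0; i < k * n; ++i) b[i] = 1.0;
  for (i = 0; i < m * n; ++i) c[i] = 0.0;
  {
    /* JIT-dispatch a double-precision kernel; NULL requests defaults
       (tight leading dimensions, alpha=1, beta=1, default flags/prefetch) */
    const libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(m, n, k,
      NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
      NULL/*alpha*/, NULL/*beta*/, NULL/*flags*/, NULL/*prefetch*/);
    if (NULL != kernel) {
      kernel(a, b, c); /* C += A * B (column-major) */
      printf("c[0] = %f\n", c[0]); /* expected: 8.0 */
    }
  }
  return 0;
}
```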

IMPROVEMENTS / CHANGES * Extended list of applications using LIBXSMM. Our documentation also lists applications among popular categories (at the bottom of the left-hand side menu). * Fixed performance bug in matcopy routine; added microbenchmarks. * Improved verbose output (watermarks, additional warnings). * Disabled memory wrapper at compile-time (opt-in only). * Fully moved to Python3 shebang (fallback to Python2). * Improved Fortran interface (overloads, etc.). * Further improved support for GNU GCC 10. * Extended sparse functionality.

FIXES * Avoid manipulating GNU's feature flags (improves header-only library). * Fixed detecting Intel VTune 2020 (SYM=1 with source'd profiler). * Consistently emit unaligned LD/ST (intrinsics based code).

- C
Published by hfp over 5 years ago

https://github.com/libxsmm/libxsmm - Version 1.15

Version 2.0 was anticipated to be our next release. With v1.15, the goal is to flawlessly upstream LIBXSMM into OS distributions that will soon start building packages with GNU GCC 10 (further details).

Beyond new compiler support, LIBXSMM received a slight but consistent performance improvement even for core functionality, namely SMM kernels including batch-reduce. The DNN domain received the most development and continues to deliver like a rolling release. The DNN backend broadened support for low/mixed-precision kernels and kernel fusion (batch-reduce plus X, as used by convolutional neural networks).

INTRODUCED * Small matrix multiplication and batch-reduce kernels are available for the following input types FP64, FP32, bfloat16, int16, and int8. Low-precision support exists in several type-combinations with respect to input and accumulation type leveraging AVX-512 extensions (VNNI and Bfloat16). * New C-APIs (Fortran to follow): (1) kernel introspection, takes kernel-function pointer, fills info-structure with FLOPS-count, code-size, and more; no search overhead, (2) register user-defined data with LIBXSMM's fast key-value database/query, e.g., to lower dispatch overhead for multiple kernels used in one task. * Fortran API: more flavors of certain generic procedures; can potentially avoid temporary values due to exact match (procedure overload). * Example vectorizing along finite elements (DGFEM) using LIBXSMM for sparse weight matrices. * Example showing sparse weight matrix multiplication (deep learning). * Reproducer for next-gen. CP2K/collocate implementation. * Module file generated during build (module av).

IMPROVEMENTS / CHANGES * Allow omitting the full configure step under Windows; improved Visual Studio build support. Note, the Windows calling convention is still pending but in the works. Necessary state is currently not call-preserved, which may or may not work (as a workaround it may help to use wrapper functions for LIBXSMM's kernels). * Dropped code generation for convolutions, which are now based on batch-reduce kernels, and revised the batch-reduce API to support (1) absolute addresses like in previous releases, (2) relative offsets/indexes, and (3) constant/identical offset/stride. * LIBXSMM/EXT: OpenMP support under macOS (w/ Apple's LLVM based compiler). * Entire code base of LIBXSMM uses SPDX-License-Identifier (BSD-3-Clause). * Verbose message about timer accuracy (virtualized platforms). * Generally improved verbosity (insight/detail, and accuracy). * New instructions supported in the backend. * Slightly lowered dispatch overhead. * NUMA-aware GxM framework.

FIXES * Issues #334, #347, #371, #368, and #369. * Zero defects as of Synopsys Coverity. * Rebuild issue (build system). * Library initialization.

- C
Published by hfp almost 6 years ago

https://github.com/libxsmm/libxsmm - Version 1.14

This release brings notable fixes and improvements (see below) prior to merging our reworked DL backend. This version is likely the last release of our 1.x series. For the upcoming major release of LIBXSMM, the API remains compatible for core functionality except for the DL domain. Even for the DL domain, there are only API adjustments rather than big changes (straightforward or minor).

THANK YOU FOR YOUR CONTRIBUTION: jewillco, yurivict, antoscha, breuera, jeremylt, HiSPEET, and legrosbuffle. We would like to thank all direct contributors as well as people who informally spent effort and time for this Open Source software!

INTRODUCED * Native PROCEDURE types for generic 3-/6-argument (arity) functions (Fortran interface). * Intercepted memory allocation for applications based on LIBXSMM's scratch memory. * LIBXSMM guarantees non-NULL kernels for valid requests since several versions. Empty shape requests are now considered valid (SMM, MCOPY, and TCOPY). * Getting Started section added to documentation ("Hello LIBXSMM").

IMPROVEMENTS / CHANGES * Termination statistic now distinguishes SMMs and degenerated SMMs (GEMV). * Support Immintrin-debug (https://github.com/intel/Immintrin-debug). * Emit warning if compiler support only enables low-resolution timers. * Support PGI Compiler based on GNU GCC settings; still some issues. * Generally enable ISA extensions even if not permitted by OS (XSAFE). * Enforce AVX-512 under OSX/iMac Pro (OSX: XSAFE/ZMM disabled). * VTUNE=0: disables profiler support (even if detected and SYM=1). * Memory info to handle foreign pointers (not allocated by library). * Scratch memory allocation: avoid unnecessary warning (verbose). * Improved scratch memory allocation statistics (watermark, etc.). * Implemented exit-handler for Fortran programs using STOP. * Avoid compiler warnings previously suppressed by flags. * Make: only permit matching static/shared library builds. * Accommodate Clang based compiler under Windows. * Improved RNG performance for very short sequences. * Updated Visual Studio projects and setup (VS2019). * Updated and revised documentation. * Updated articles and applications. * Contribution #355 incorporated. * Lowered dispatch overhead.

FIXES * Fixed issue (2019/02/24) dispatching compiler-generated code (affected SpMDM and DL). * Fixed casting literal -1 to an unsigned integer when 64-bits were intended. * Resolved issue related to structure alignment/padding/copy (CCE). * Potentially invalid kernel cache with concurrently finalized library. * Potentially treated non-OpenMP lock as OpenMP lock. * Avoid potentially recursive locking at termination. * Fixed potential hang with header-only. * Incorrect LDC for intercepted GEMV. * Issues fixed: #340 and #347.

- C
Published by hfp over 6 years ago

https://github.com/libxsmm/libxsmm - Version 1.13

This release delivers improvements made to the build system and internal structures. The main purpose is to continuously deliver a smooth build and run experience for the latest OS environments.

THANK YOU FOR YOUR CONTRIBUTION - your contribution matters! This project received direct (and indirect) contributions, whether as issue reports, feature suggestions, or involvement from people who came across the project. We would like to thank you all for the effort and time you spent on Open Source software!

IMPROVEMENTS / CHANGES * Fortran: enabled libxsmm_ptr* to eventually return C_NULL_PTR. * Avoid treating Spack environment as maintainer build (apply SSE4 flags). * Renamed structure-of-array (SOA) dense routines into "packed". * Internal preparation for upcoming features (memory allocation). * Improved build system (most recent OS environments).

FIXES * Precondition for working around missing _Float128 definition (#339). * Conceptually avoid accessing a zero-sized array (Fortran interface). * Corrected number of scratch-memory pools (LIBXSMM_VERBOSE).

- C
Published by hfp over 6 years ago

https://github.com/libxsmm/libxsmm - Version 1.12.1

This release fixes issues related to the prefix directory inside the pkg-config files, which affected maintainer builds (Linux and FreeBSD packages), the package manager Spack, and people using pkg-config to determine build/linker flags. In addition, some presets were added to smooth maintainer builds under FreeBSD.

IMPROVEMENTS / CHANGES * Building samples: detect Intel MKL (when installed by a package manager). * Improved build system under FreeBSD (detect BLAS library, etc).

FIXES * Issue #331, issue #333, issue #334, and spack/spack#11413. * OpenMP build issue in one of the code samples (GCC 9.1).

- C
Published by hfp almost 7 years ago

https://github.com/libxsmm/libxsmm - Version 1.12

This release aims to improve usability along with resolving several non-critical bugs. Beyond this, an implementation of the BLAS(-like) batched GEMM has been added (?GEMM_BATCH). The interface currently only supports the C/C++ language; however, it can be called implicitly (Fortran 77 style) or used by intercepting existing calls (static and dynamic linkage).

LIBXSMM has had an interface for batched GEMMs for several versions, supporting arrays of pointers as well as arrays of indexes plus byte-sized strides to extract data from arrays of structures (AoS). The new BLAS interface only supports straight arrays of pointers to operand matrices but allows multiple groups of homogeneous batches. All batch interfaces are implemented in sequential (ST) and multi-threaded (MT) form, plus synchronization in the MT case.
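As a hedged illustration of call interception (not LIBXSMM's own API), the plain BLAS call below can be redirected to LIBXSMM either by linking the static wrapper or by preloading the extension library at runtime; the program itself stays unchanged. The dgemm_ prototype shown is the usual Fortran-77 style symbol and is an assumption about the BLAS library in use.

```c
/* example.c: a plain BLAS call; when LIBXSMM's wrapper is linked statically
   or preloaded dynamically, small GEMMs like this one can be redirected to
   LIBXSMM's JIT kernels without changing the application. */
#include <stdio.h>

/* assumed Fortran-77 style BLAS prototype (column-major) */
void dgemm_(const char* transa, const char* transb,
            const int* m, const int* n, const int* k,
            const double* alpha, const double* a, const int* lda,
            const double* b, const int* ldb,
            const double* beta, double* c, const int* ldc);

int main(void) {
  const int m = 16, n = 16, k = 16;
  const double alpha = 1.0, beta = 0.0;
  static double a[16*16], b[16*16], c[16*16];
  int i;
  for (i = 0; i < m * k; ++i) a[i] = 1.0;
  for (i = 0; i < k * n; ++i) b[i] = 1.0;
  dgemm_("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);
  printf("c[0] = %f\n", c[0]); /* expected: 16.0 */
  return 0;
}
```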

INTRODUCED * Interface and implementation of batched GEMMs (GEMM_BATCH). * TensorFlow wrapper code for the LSTM operation. * Interceptor for GEMM_BATCH and GEMV.

IMPROVEMENTS / CHANGES * LSTM: enabled additional tensor formats for Bfloat16. * Validated with GNU GCC 9.1 release.

FIXES * Issue #331, issue #333, issue #334, and https://github.com/spack/spack/issues/11413 * Several other/minor fixes.

- C
Published by hfp almost 7 years ago

https://github.com/libxsmm/libxsmm - Version 1.11

This release accumulated more than 1200 changes since the last release and is a major preparation for our future v2 of the library. Besides stability improvements, refinements of existing functionality, and bug fixes, several pieces of new functionality were introduced: packed/compact data layout functions for solving linear equations, new flavors of SMM kernels along with relaxed limitations (transb), and overall support for low precision based on the Bfloat16 FP format.
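As a sketch of the relaxed TransB limitation, the following dispatches a kernel with a transposed B-operand via the GEMM flags argument; default leading dimensions, alpha, beta, and prefetch are assumed (NULL arguments).

```c
#include <libxsmm.h>

/* dispatch an SMM kernel with a transposed B-operand (TransB=T), which this
   release permits in addition to TransB=N */
libxsmm_dmmfunction dispatch_transb(libxsmm_blasint m, libxsmm_blasint n,
                                    libxsmm_blasint k)
{
  const int flags = LIBXSMM_GEMM_FLAG_TRANS_B;
  return libxsmm_dmmdispatch(m, n, k, NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
    NULL/*alpha*/, NULL/*beta*/, &flags, NULL/*prefetch*/);
}
```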

The Deep Learning (DL) domain is still under active research and development, including co-design. The API however is rather stable (DLv2 since v1.8), with an implementation that continues to receive major development. Towards LIBXSMM v2, the DL domain will undergo major code reduction (implementation) while providing the same or more functionality (a first sign is the removal of the Winograd code in this release).

THANK YOU FOR YOUR CONTRIBUTION - we had again several direct (and indirect) contributions, reports, and involvement from people who came across the project. We would like to thank you all for the effort and time you spent working on Open Source!

INTRODUCED * Packed function domain (compact data format) with GEMM, GETRF, TRMM, and TRSM functions. * Relaxed limitation of SMM kernels: TransB=T is now allowed (in addition to TransB=N). * Batch-reduce GEMM-kernel which is optimized for in-cache accumulation (Beta=1). * Included build setup in library (environment variable LIBXSMM_DUMP_BUILD=1). * CPU feature detection is updated for Cascadelake and Cooperlake (CLX and CPX). * Bfloat16 instruction support for Cooperlake (CPX). * Bfloat16 support for DL and SMM domain. * Fast RNGs for single-precision FP data.

IMPROVEMENTS * Cray Compiler (legacy and current versions) is supported with LIBXSMM's use of intrinsics, inline assembly, and CPUID detection, and therefore received major performance improvements. Previously, even JIT code was limited to AVX due to an unsupported CPUID flow. * Updated support for tensorflow::cpu_allocator for an API change in TensorFlow (v1.12.0 and beyond). * Guarantee JIT'ted function (non-NULL); see CHANGES about libxsmm_[get|set]_dispatch_trylock. * Call wrapper/interceptor (static/shared library) now always works, i.e., no special build required. * SpMDM/Bfloat16 interface to enable TensorFlow, which gained type-support for Bfloat16. * GxM framework updated for fused DL ops, Bfloat16, and a variety of new DL operators. * DL domain with LSTM and GRU cells, fully connected layer, and batch norm support. * Reduced unrolling and code size of transpose kernels (to fit the instruction cache). * Extended Fortran interface (matdiff, diff, hash, shuffle). * Purified some more routines (Fortran interface). * More statistical values (libxsmm_matdiff/info).

CHANGES * KNC support has been removed (maps to generic code). Offload infrastructure has been kept. * Winograd code has been removed from the DL domain (see also the introduction to this release). * Removed libxsmm_[get|set]_dispatch_trylock (demoted to a compile-time option). * Threshold criterion of libxsmm_gemm (optionally based on arithmetic intensity).

FIXES * Fixed corner case which eventually led to leaking memory (scratch). * Exhausted file handles (in ulimit'ed or restricted environments). * Fixed libxsmm_timer in case of lazy library initialization. * Flawed detection of restricted environments (SELinux). * Fixed buffer handling in case of incorrect input. * Fixed setup of AVX2 code path in SpMDM. * Ensure correct prefix in pkg-config files. * Guarantee JIT'ted function (non-NULL).

Note about platform support: an explicit compile error (error message) is generated on platforms other than Intel (or compatible) processors, since upstreamed code was reported to produce a "compilation failure". Apart from this introduced artificial error, any platform is supported with generic code (tested with an ARM cross-compiler). Of course, any Open Source contribution to add JIT support is welcome.

Note about binary compatibility: LIBXSMM's API for Small Matrix Multiplications (SMMs) is stable, and all major known applications (e.g., CP2K, EDGE, NEK5K, and SeisSol) either rely on SMMs or are able (and want) to benefit from an improved API of the other domains (e.g., DL). Until at least v2.0, binary compatibility is not maintained (SONAME version goes with the semantic version).

- C
Published by hfp almost 7 years ago

https://github.com/libxsmm/libxsmm - Version 1.10

Development accumulated many changes since the last release (v1.9) as this version (v1.10) kept slipping because validation was not able to keep up and started over several times. On the positive side, this may allow calling it the "Supercomputing 2018 Edition", which is complemented by an updated list of references including the SC'18 paper "Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures". Among several external articles, the Parallel Universe Magazine published "LIBXSMM: An Open Source-Based Inspiration for Hardware and Software Development at Intel".

The intense development of LIBXSMM brought many improvements and detailed features across domains as well as end-to-end support for Bfloat16 in LIBXSMM's Deep Learning domain (DL). The latter can already be exercised with the GxM framework, which was added to the collection of sample codes. Testing and validation were updated for the latest compilers and upcoming Linux distributions. FreeBSD is now formally supported (previously it was only tested occasionally). RPM, Debian, and FreeBSD package updates will benefit from the smoothed default build targets and compiler flags.

LIBXSMM supports "one build for all" while exploiting the existing instructions set extensions (CPUID based code-dispatch). Developers may enjoy support for pkg-config (.pc files in the lib folder) for easier linkage when using the Classic ABI (e.g., PKG_CONFIG_PATH=/path/to/libxsmm/lib pkg-config libxsmm --libs).

THANK YOU FOR YOUR CONTRIBUTION - we had several direct (and indirect) contributions, reports, and involvement from people who came across the project. We would like to thank you all for the effort and time you spent working on Open Source!

INTRODUCED * Removed the need to build LIBXSMM's static library in a special way for GEMM call-interception. * Moved some previously internal but generally useful code to the public interface (math etc.). * Initial support for handle-based "big" GEMM (revamped libxsmm_?gemm_omp). * Support transposed cases in libxsmm_?gemm_omp; not performance-competitive yet. * Code samples accompanying the article in the Parallel Universe magazine. * Fortran interface for some previously only C-exposed functions. * Support Intel C/C++ Compiler together with GNU Fortran. * Packed/SOA domain: expanded functionality (EDGE solver). * Deep Learning framework GxM (added as code sample). * RNNs, and LSTM/GRU-cell (driver code experimental). * End-to-end support for Bfloat16 (DL domain). * Fused batch-norm, and fully-connected layer. * Compact/packed TRSM kernels and interface. * Experimental TRMM code (no interface yet). * Support for pkg-config.

IMPROVEMENTS / CHANGES * Zero-mask unused register parts to avoid false positives with enabled FPEs (MM kernels). * Added libxsmm_ptr_x helper to the Fortran interface (works around a C_LOC portability issue). * Mapped TF low-precision to appropriate types, map unknowns to DATATYPE_UNSUPPORTED. * Build banner with platform name, info about Intel VTune (available but JIT-profiling disabled). * Smoothed code base for most recent compilers (incl. improved target attribution). * Official packages for Debian, and FreeBSD (incl. OpenMP in libxsmm/ext for BSD). * LIBXSMM_DUMP environment variable writes MHD-files if libxsmm_matdiff is called. * Warn when libxsmm_release_kernel is called for a registered kernel. * Consolidated Deep Learning sample codes into one folder. * Revised default for AVX=3 (MIC=0 is now implicitly set). * LIBXSMM_TARGET: more keys count for AVX512/Core. * Updated TF integration/documentation. * Included workarounds for flang (LLVM). * Attempt to enable OpenMP with Clang. * Install header-only form (make install). * SpMDM code dispatch for AVX2. * Improved CI/test infrastructure. * Show hint if compilation fails.

FIXES * Properly dispatch CRC32 instruction (support older CPUs). * Fixed fallback of statically generated MM kernels (rare). * Remove temporary files that were previously dangling. * Fixed termination message/statistic (code registry). * Fixed finalizing the library (corner case). * Fixed code portability of DNN domain.

- C
Published by hfp over 7 years ago

https://github.com/libxsmm/libxsmm - Version 1.9

This release enables JIT-code generation of small matrix multiplications for SSE3 targets. Previously, only AVX and beyond had been supported using JIT code. SSE JIT-code generation is only supported for the MM domain (matrix multiplication). The compatibility of the library has been further refined and fine-tuned. The application binary interface (ABI) narrowed from above 500 exported functions down to roughly half due to adjusted symbol visibility. This revision prepares for a smooth transition to v2.0 and internalizes low-level details (descriptor handling, etc.), and two deprecated functions have been removed. More prominently, prefetch enumerators have been renamed, e.g., LIBXSMM_PREFETCH_AL2 renamed to LIBXSMM_GEMM_PREFETCH_AL2.

INTRODUCED * ABI specification improved: exported functions are decorated for visibility/internal use (issue #205). * Math functions to eventually avoid a LIBM dependency, or to control specific requirements (libxsmm_math.h). * MM: enabled JIT-generation of SSE code for small matrix multiplications (BE and FE support). * MM: extended FE to handle multiple flavors of low-precision GEMMs (C and C++). * Detect maintainer build and avoid target flags (GCC toolchain, STATIC=0). * SMM: I16I32 and I16F32 WGEMM for SKX and future processors. * Hardening all builds by default (Linux package requirements).

IMPROVEMENTS / CHANGES * MM domain: renamed prefetch enumerators; kept "generic" names SIGONLY, NONE, and AUTO (FE). * Build system presents a final summary (similar to the initial summary); also mentions VTune (if enabled). * Adjusted TF scratch allocator to adopt the global rather than the context's allocator (limited memory). * Combined JIT-kernel samples with the respective higher-level samples (xgemm, transpose). * Enabled extra (even more pedantic) warnings, and adjusted the code base accordingly. * Adjusted Fortran samples for the PGI compiler (failed to deduce generic procedures). * Removed deprecated libxsmm_[create|release]_dgemm_descriptor functions. * Included validation and compatibility information into the PDF (Appendix). * MinGW: automatically apply certain compiler flags (workaround). * Internalized low-level descriptor setup (opaque type definitions). * Moved LIBXSMM_DNN_INTERNAL_API into the internal API. * Fixed dynamic linkage with CCE (Cray Compiler).

FIXES * Take prefetch requests in libxsmm_xmmdispatch (similar to libxsmm_[s|d|w]mmdispatch). * SpMM: prevent generating (unsupported) SP-kernels (incorrect condition). * Fixed code-gen. bug in GEMM/KNM, corrected K-check in WGEMM/KNM. * MinGW: correctly parse path of library requirements ("drive letter"). * Fixed VC projects to build DLLs if requested.

- C
Published by hfp almost 8 years ago

https://github.com/libxsmm/libxsmm - Version 1.8.3

Overview: while v1.9 is in the works, this release fixes two issues, and pushes for an improved (OSX w/ Intel Compiler) and wider OS/Compiler coverage (MinGW, BSD, see Compatibility). Among minor or exotic issues resolved in this release, the stand-alone JIT-generated matrix transposes (out-of-place) are now limited to matrix shapes such that only reasonable amounts of code are generated. There has been also a rare synchronization issue reproduced with CP2K/smp in LIBXSMM v1.8.1 (and likely earlier), which is resolved since the previous release (v1.8.2).

JIT code generation/dispatch performance: JIT-generating code (non-transposed GEMMs) is known to be blazingly fast, which this release (re-)confirms with the extended dispatch microbenchmark: single-threaded code generation (uncontended) of matrix kernels with M,N,K := 4...64 (equally distributed random numbers) takes less than 25 µs on typical systems; non-cached code dispatch takes less than 50x longer than calling a function that does nothing, whereas cached code dispatch takes less than 15x longer than an empty function (code dispatch is roughly three orders of magnitude faster than code generation, i.e., nanoseconds vs. microseconds).
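The dispatch microbenchmark can be approximated with LIBXSMM's own timer facility; the numbers printed by this sketch are, of course, system dependent, and the loop count and kernel shape below are arbitrary assumptions.

```c
#include <libxsmm.h>
#include <stdio.h>

int main(void) {
  const int nrepeat = 100000;
  const libxsmm_blasint m = 16, n = 16, k = 16;
  libxsmm_timer_tickint start;
  double duration;
  int i;
  libxsmm_init(); /* exclude lazy initialization from the measurement */
  /* the first dispatch triggers JIT code generation (microseconds) */
  start = libxsmm_timer_tick();
  (void)libxsmm_dmmdispatch(m, n, k, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
  duration = libxsmm_timer_duration(start, libxsmm_timer_tick());
  printf("code generation: %.0f us\n", 1e6 * duration);
  /* subsequent dispatches only query the code registry (nanoseconds) */
  start = libxsmm_timer_tick();
  for (i = 0; i < nrepeat; ++i) {
    (void)libxsmm_dmmdispatch(m, n, k, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
  }
  duration = libxsmm_timer_duration(start, libxsmm_timer_tick());
  printf("code dispatch: %.0f ns\n", 1e9 * duration / nrepeat);
  libxsmm_finalize();
  return 0;
}
```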

INTRODUCED * Support for mixing C and C++ code when using header-only based LIBXSMM. * Issue 202: reintroduced copy-update with LIBXSMM's install target (make). * Experimental: sketched Python support built into LIBXSMM (PYMOD=1).

IMPROVEMENTS / CHANGES * Completed revision of synchronization layer (started in v1.8.2); initial documentation. * Reduced TRACE output due to self-watching (internal) initialization/termination. * Wider OS validation incl. more exotic sets (MinGW in addition to Cygwin, BSD). * Prevent production code (non-debug) on 32-bit platforms (compilation error). * Increased test variety while staying within same turnaround time limit. * Continued to close implementation gaps (synchronization primitives). * Sparse SOA domain received fixes/improvements driven by EDGE. * More readable code snippets in documentation (reduced width). * Initial preparation for JIT-generating SSE code (disabled). * Improved detection of OpenBLAS library (Makefile.inc). * Updated (outdated) support for Intel Compiler (OSX). * Compliant soname under Linux and OSX.

FIXES * Fixed selection of statically generated code targeting Skylake server (SKX). * Sparse SOA domain: resolved issues pointed out by static analysis. * Fixed support for JIT-generated matrix transpose (code size). * Fixed selecting an incorrect prefetch strategy (BGEMM).

- C
Published by hfp about 8 years ago

https://github.com/libxsmm/libxsmm - Version 1.8.2

This last release of the 1.8.x line (before 1.9) accumulated a large number of changes to tweak interfaces and to generally improve usability. The documentation is vastly improved and extended, is more structured, and is also available via ReadtheDocs (with online full-text search). In preparation of a fully revised implementation of the DNN API (rewrite), the interface of the DNN domain (Tensor API) changed in an incompatible way (our policy should have delayed this to v1.9). However, the current main user of the DNN API has been updated (integration with TensorFlow). Also notable, v1.8.2 introduces JIT-code generation with the Windows calling convention (support limited to 4-argument kernels, i.e., no prefetch signature for the MM domain, and no support for DNN/convolution kernels).

INTRODUCED * Introduced kernel introspection/query API for registered code: full GEMM descriptor, and code size. * Introduced explicit batch interface (and an experimental auto-batch option); parallelized/sequential. * Introduced BGEMM interface for handle-based GEMM using an optimized format (copy-in/out). * More comprehensive sparse support (EDGE: Extreme Scale Fused Seismic Simulations). * More comprehensive collection of DNN test cases (DeepBench, ResNet50, etc.). * Implemented CI for the DNN domain, and infrastructure for validation (libxsmm_matdiff). * Support to schedule CI/tests into a Slurm based cluster environment (.travis.sh). * Introduced "make INTRINSICS=0" to allow building with outdated Binutils. * Generate preprocessor symbols for statically generated code (presence check). * Allow FORTRAN to access (static-)configuration values using the preprocessor. * FORTRAN 77 support for a much wider set of functionality (MM domain). * Introduced MHD file I/O to e.g., aid visual inspection and validation. * Cleaned up type-definitions and FE-macros (lower precision GEMM). * More comprehensive set of prefetch strategies (SMM domain). * Extended LIBXSMM_VERBOSE=2 to show library version, etc. * Wider use of QFMA across domains (MM, SpMM, DNN). * Updated application recipe for CP2K and TensorFlow. * Initial Eigen related code sample (batched SMMs). * CPUID for CPUs codenamed "Icelake".

CHANGES * Revised/unified API attribute decoration, and cleaned up the header-only header. * Removed script for regenerating documentation bits (README.sh); now only per make. * Changed matcopy kernels to have column-major semantics (similar to transpose). * Support const/non-const GEMM prototypes interfering with LIBXSMM's header-only. * Slightly revised and based all F2K3 interfaces on lower-level F77 (implicit) routines. * Incorporated/enabled new/additional instructions in the code generator (BE). * Reshuffled properties/sizes in the GEMM descriptor for future extensions. * Portable build-locks for improved turnaround time in parallel CI builds. * Comprehensive validation of the DNN domain (all major benchmarks). * Consistent use of libxsmm_blasint (libxsmm_dmmdispatch). * Revised error/warning messages (LIBXSMM_VERBOSE=1). * Initial support for some fused operations (DNN domain). * Removed support for small GEMM descriptors (BIG=0). * Removed libxsmm_timer_xtick (libxsmm_timer.h). * Improved turnaround time in Travis CI testing. * Thread-safe scratch memory allocation. * Support VS 2017 (startup script, etc.)

FIXES * Fixed potential issue with GEMM flags being incorrectly created (GEMM wrapper). * Several fixes for improved FORTRAN interface compatibility (optional arguments, etc.). * Disabled AVX-512 code generation with Intel Compiler 2013 (SP1 brings the req. bits). * Fixed code gen. issue with SOA sparse kernels; corrected precision of SOA sample code. * Fixed index calculation in tiled libxsmm_matcopy; updated test case accordingly. * Fixed a number of issues in several DNN code paths unveiled by better testing. * Several fixes in sparse SOA domain (unveiled by LIBXSMM's integration into PyFR). * Improved support for (legacy) Clang wrt AVX-512 code generation (intrinsics). * Ported bit-scan intrinsics abstraction to yield same result with all compilers. * Allow static code generation to target SKX and KNM (Makefile). * Fixed several code generation issues for SMMs on KNM.

- C
Published by hfp about 8 years ago

https://github.com/libxsmm/libxsmm - Version 1.8.1

This release brings some new features (matcopy/2d-copy and tcopy based on JIT-generated code) as well as a number of bug fixes (TGEMM), improvements (KNM), and refinements (LIBXSMM_GEMM_WRAP control, etc.). Given the completed copy/transpose support, this release prepares for complete stand-alone GEMM routines.
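A sketch of the JIT-backed out-of-place transpose is shown below; the argument order (out, in, typesize, m, n, ldi, ldo) and the EXIT_SUCCESS return convention are assumptions based on libxsmm.h of this era, so consult the header of the installed version.

```c
#include <libxsmm.h>
#include <stdlib.h>

/* out-of-place transpose of a column-major m-by-n matrix (double precision);
   LIBXSMM dispatches a JIT transpose kernel and tiles larger shapes */
void transpose_example(double* out, const double* in,
                       libxsmm_blasint m, libxsmm_blasint n,
                       libxsmm_blasint ldi, libxsmm_blasint ldo)
{
  if (EXIT_SUCCESS != libxsmm_otrans(out, in, sizeof(double), m, n, ldi, ldo)) {
    /* fallback: simple loop-based transpose */
    libxsmm_blasint i, j;
    for (j = 0; j < n; ++j) {
      for (i = 0; i < m; ++i) out[j+i*ldo] = in[i+j*ldi];
    }
  }
}
```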

INTRODUCED * Choice between tiled/small GEMM during call-interception (LIBXSMM_GEMM_WRAP=1|2). * Introduced JIT'ted transpose kernels including tiling for larger matrices. * Transpose routines now auto-dispatch JIT-kernels incl. auto-tuned tiles. * Introduced matcopy routines similar to the transpose routines (C/C++/F). * LIBXSMM_DNN_CONV_OPTION_OVERWRITE for faster initial forward convolution. * Implemented/documented named JIT routines in TF when using VTune. * Additional statistics about MCOPY/TCOPY (LIBXSMM_VERBOSE=2). * Lowered overhead of tiled/parallelized GEMM/MCOPY/TCOPY. * Made the libxsmm_hash function available (MEM/AUX module). * Initial support for lower precision (backward conv.)

CHANGES * AVX-512 based CPUID-dispatched input/output of the Winograd transformation (forward conv.). * Adjusted build system to pick up RPM_OPT_FLAGS (RPM based Linux distributions). * Moved extensive Q&A to a Wiki page and cleaned up the reference documentation. * Improved/extended Getting Started Guide for TensorFlow with LIBXSMM. * Improved general backend error propagation, and avoid duplicated messages. * Iterative subdivision of large matrix transposes (tcopy) and matcopy (mcopy). * Non-task based and (optional) task based parallelization of tcopy and mcopy. * Mentioned KNM target key ("knm") in the reference documentation. * Improved prefetches in the KNM code path of the weight update. * Adjusted initialization sequence during startup. * Improved parallelization grammar.

FIXES * Fixed pruned tile sizes and division-by-zero error in tiled GEMM. * Propagate backend errors in case of an insufficient JIT buffer. * CRC32 SW implementation issues unveiled by the CRAY Compiler. * Call parallelized transpose (C++ interface) when requested. * Fixed VTune support (named JIT code); broken in v1.8. * Fixed incorrect prefetch locations in KNM code path. * Fixed alignment condition in tcopy/mcopy code. * Fixed TF allocator integration with GCC 7.1.0. * Fixed some more warnings in sample codes.

- C
Published by hfp almost 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.8

This set of changes brings the Padding API to life and implements the necessary mechanisms to cover a wider range of cases. This may allow running a larger variety of TensorFlow workloads using LIBXSMM. The implementation also brings Winograd-based convolutions (chosen automatically when using LIBXSMM_DNN_CONV_ALGO_AUTO). Moreover, support for the Intel Xeon Phi processor code-named "Knights Mill" ("KNM") has been added (QFMA and VNNI instructions can be executed using the Intel SDE).

INTRODUCED - A summary of code samples has been added (pdf), and also a guide (mainly for contributors) to "Getting Started using TensorFlow with LIBXSMM" [PDF] - Additional sparse matrix primitives (fsspmdm domain); see "pyfr" and "edge" sample code - Support for the OpenMP SIMD directive on GCC (-fopenmp-simd) used in some translation units - Improved code path selection for legacy compiler versions (functions with multiple compilation targets) - DNN: Winograd based convolutions incl. a threshold to automatically select (LIBXSMM_DNN_CONV_ALGO_AUTO) between LIBXSMM_DNN_CONV_ALGO_DIRECT and LIBXSMM_DNN_CONV_ALGO_WINOGRAD - DNN: logically padded data incl. support for the Winograd based implementation - DNN: support for the Intel Knights Mill (KNM) instruction set extension (AVX-512) - DNN: support another custom format that blocks the minibatch dimension - SMM: support of FORTRAN 77 for manual JIT-dispatch (libxsmm_xmmdispatch, libxsmm_xmmcall) - SPMDM: narrowed scope of the "sum" array to improve optimization on LLVM - SMM/EXT/OMP: introduced a table of blocksizes depending on problem size; already yields improved performance for big(ger), i.e., tiled matrix multiplications (the xgemm sample now includes a hyperparameter tuning script) - SMM/DNN: JIT'ted matrix copy functions (already used in the CNN domain); both matcopy and the (upcoming) JIT'ted transpose will fully unlock performance of big(ger) GEMMs - AUX/MEM: scope-oriented multi-pool scratch memory allocator with a heuristic for buffers of different lifetime

CHANGES - Removed the LIBXSMM_MT and LIBXSMM_TASKS environment variables, and updated the documentation - COMPATIBLE=1 setting is now automatically applied (e.g., useful with the Cray Compiler) - LIBXSMM_TRYLOCK=1 now uses a single lock, and thereby reduces code duplication for the contended case; the trylock property is for user code that can handle a NULL-pointer as the result of the code dispatch, i.e., implementing a fallback code path (BLAS) - AUX/MEM: superseded the libxsmm_malloc_size function with libxsmm_get_malloc_info - Revised termination message wrt scratch memory allocation (LIBXSMM_VERBOSE) - Other: updated "spack" (HPC package manager) to use more reasonable build options - SPMDM: improved load balance

FIXES - Implemented FORTRAN dispatch interface (F2K) differently to get it working with CCE (Cray Compiler) - Worked around problem/crashes due to an outdated TCMALLOC replacement of malloc/free (CCE) - TMM: tiled MM fallback code path in multi-threaded tiled GEMM exposed an issue with LIBXSMM_TRYLOCK=1 - TMM: fixed incorrect OpenMP in task-based implementation; now always selected when in external par. region - SPMDM: bug fix for handling last block of k correctly and avoid out-of-bound accesses - Minor: fixed all flake8 complaints of our Python scripts, fixed code issues pointed out by static analysis - Fixed transpose FORTRAN sample code

- C
Published by hfp almost 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.7.1

This release finishes the memory allocation interface and documents the two memory allocation domains (default and scratch). Otherwise this release focuses on code quality (sample code) with no fixes or breaking changes when compared to version 1.7.

INTRODUCED - MEM: libxsmm_release_scratch has been introduced (unimplemented) - MEM: libxsmm_release_scratch now called during finalization - MEM: documented memory allocation domains - DNN: updated API documentation

CHANGES - More error/warning messages promoted to LIBXSMM_VERBOSE

FIXES - None

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.7

This version releases a revised DNN API to better suit an upcoming TensorFlow integration. There is also some foundation laid to distinguish scratch memory from regular/default memory buffers.

INTRODUCED - MEM: ability to change the allocation functions; two different domains: default and scratch - MEM: C++ scoped allocator ("syntactical sugar"); incl. TensorFlow-specific adapter - MEM: optional TBB scalable malloc in both default and scratch allocator domain - DNN: more general buffer and filter link/bind functionality - LIBXSMM_VERBOSE messages rather than debug build - Improved dispatch for legacy compilers

CHANGES - DNN: revised API (breaking changes)

FIXES - SPMDM: fixed disagreement between static/dynamic code path (on top of v1.6.6) - MEM: avoid CRC memory checks for header-only library (different code versions)

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.6

This is a bug-fix release with focus on the SPMDM domain. There are also a number of code quality improvements. This is potentially the last 1.6.x release with a number of API changes scheduled for the DNN domain (v1.7).

INTRODUCED - SPMDM: promoted error messages from debug-only builds to LIBXSMM_VERBOSE mode - README now documents on how to inspect the raw binary dumps

CHANGES - Improved code quality according to a code quality checker (potential issues)

FIXES - SPMDM: fixed setup of handle to correspond with CPUID-dispatched/available code path - SPMDM: fixed calculating the size of the scratch buffer (single-threaded case)

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.5

This is a bug-fix release, which resolves a severe issue with concurrently modifying the code registry. The related code did not receive much development in the past (macro based), but is now cleanly implemented and covered by a rigorous test case. There is also enough of an API to determine some basic registry properties (capacity, size), and a guarantee to receive JIT-code under reasonable conditions (e.g., if the registry is not exhausted). A routine allows relaxing the conditions under which no JIT-code is generated (libxsmm_[get|set]_dispatch_trylock allows returning no code if access to the code registry is contended).
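A hedged sketch of querying the registry properties mentioned above; the function name libxsmm_get_registry_info matches the INTRODUCED list below, while the exact field names of the info structure (size, capacity) are assumptions.

```c
#include <libxsmm.h>
#include <stdlib.h>
#include <stdio.h>

int main(void) {
  libxsmm_registry_info info;
  libxsmm_init();
  /* basic metrics of the (GEMM-)code registry; field names are assumptions */
  if (EXIT_SUCCESS == libxsmm_get_registry_info(&info)) {
    printf("registry: %u of %u slots in use\n",
      (unsigned int)info.size, (unsigned int)info.capacity);
  }
  libxsmm_finalize();
  return 0;
}
```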

INTRODUCED - Slightly improved multi-target functions, but GCC 4.9 (and later) is needed to avoid "legacy" support. - Introduced libxsmm_get_registry_info to receive basic metrics about the (GEMM-)code registry. - Implemented parallelization threshold for the libxsmm_otrans_omp routines. - Improved code generation of the sparse matrix domain and JIT-support (A-sparse/reg., CSR); libxsmm_create_dgemm_descriptor routine to ease language binding (pyfr sample code). - Cover a wider range of compiler versions, see our build status page. - More error/warnings covered in release builds (LIBXSMM_VERBOSE=1). - Build all possible sample codes as part of the CI tests. - Implemented sync/lock abstraction (Windows). - Optimized access to thread-local code cache.

CHANGES - TF configured THRESHOLD=0, but explicit JIT does not fall back, plus THRESHOLD is an upper limit; prevented a 0-threshold, and chose the default if THRESHOLD=0 is requested (128**3). - Suppress warnings about unused functions when our build system is not used. - Reduced lock-contention in JIT-code generation (more locks). - Use relaxed Atomics in JIT-code thread synchronization.

FIXES - Fixed severe issue with concurrent JIT-code generation (code registry); new/rigorous test case. - Fixed an issue when building the DNN sample code using GCC 6.3 (linker error). - SPMDM: avoid some duplicated symbols under Windows.

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.4

This is a maintenance release with improvements for GCC and Clang compilers (function-level target compilation, and intrinsics support). The function-level target compilation is a prerequisite for good performance due to CPUID-dispatched code paths. Moreover, in preparation of v1.7, there are breaking changes in the DNN domain (buffer management is now an external responsibility). An API for logical padding has been added (DNN domain). In addition to our Travis CI, improved test coverage for a variety of compiler versions is now in place.

INTRODUCED - SPMDM: introduced CPUID-dispatched code paths - SPMDM: support for transposing C

CHANGES - No distinction between SSE 4.1/4.2 (new enum LIBXSMM_X86_SSE4, removed LIBXSMM_X86_SSE4_*) - DNN: removed create_buffer and create_filter functions since buffers are provided externally - DNN: updated googlenetv1 script to match the googlenetv1 description - DNN: initial changes to support logical input padding - DNN: improved performance of the weight update - DNN: new padding frontend API

FIXES - Fixed intrinsic layer for reliable target compilation (function level), and clean switches for legacy compilers, included FMA flag when targeting AVX2 on GCC and Clang - DNN: fix in image parallel forward convolution when 2d register blocking is used - DNN: fixed physical input padding for backward and weight update (all format combinations) - DNN: fixed physical padding in the fallback code path - DNN: fixed some corner case prefetching bug - SPMDM: fixed library initialization

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.3

This is a maintenance release with minor improvements over 1.6.2.

INTRODUCED - Listed TensorFlow as an application that can make use of LIBXSMM - Environment variable LIBXSMM_TRYLOCK, and related API functions - Build key INIT=0 to omit lazy initialization overhead

CHANGES - Updated copyright banner for 2017

FIXES - Support for the Mainline version of the Clang compiler ("version 0.0.0") - Fixed non-prefetch JIT function names for AVX512 UPD code - Minor: some more target attributes for KNC (F interface)

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.2

This is a maintenance release, which focuses (again) on the DNN API. However, this version includes bug-fixes for a number of severe issues, which have been found in various domains (SMM, DNN, SPMDM, and in general).

INTRODUCED - Documented header-only implementation of LIBXSMM - DNN: introduced routine to check code gen. (libxsmm_dnn_get_codegen_success) - DNN: introduced routine for explicit transpose (libxsmm_dnn_transpose_filter) - DNN: introduced to query number of tasks (libxsmm_dnn_get_parallel_tasks) - DNN: support external filter reduction in case of parallelization over the minibatch - MEM: exposed routine to query size of buffer allocated by libxsmm_[aligned_]malloc - SPMDM: introduced support for beta, code optimizations

CHANGES - SPMDM: improved static code path selection (no CPUID dispatch) - SMM: raised THRESHOLD until which JIT code is automatically generated - Raised baseline code path to SSE4.2 to avoid CPUID-dispatched CRC32; fixed (again) controlling the static code path according to documentation - Adjusted separation between gen-library and main library - MEM/debug: checksum for internal bookkeeping structure - MEM: streamlined internal bookkeeping structures - Improved reliability of library initialization

FIXES - SMM: possibly wrong code version under concurrent dispatch under hash key collision - DNN: raised/fixed weight update performance to the expected level (AVX-512) - DNN: fixed a bug which was introduced by code refactoring (fwd. convolution) - DNN: fixed bug in DeepBench and refactored backward convolution code - DNN: corrected setting up the handle for the weight update convolution - MEM: fixed kernel-dump related console output (print correct address) - Avoid certain (pseudo-)AVX-512 intrinsics, which might be not present (GCC) - Avoid AVX-512/Core intrinsics prior to Clang 3.8 (3.9 brings them in) - Avoid applying AVX-512/Core flags with earlier versions of Clang (IDEs) - Updated C++ entry points for code dispatch (remainder of issue #105); this change fixed a performance issue with the CP2K/intel branch - SPMDM: fixed issue for N if not a multiple of 16

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6.1

This is a maintenance and bug-fix release, which focuses on the recently introduced API for sparse matrix multiplication. There are also internal improvements to better cover different flavors of the Linux OS.

INTRODUCED - SPMDM API since v1.6 (still experimental) for sparse matrix multiplication

CHANGES - SMM: descriptor size setting (a.k.a. BIG=1) is now part of static configuration - SPMDM: adjusted API according to the received feedback

FIXES - SPMDM: fixed minor issues, and one severe issue (incorrectly sized internal buffer) - SMM: statically gen. kernels were not considered (non-matching prefetch strategy) - SSE=0 now behaves as documented; SSE=0 and AVX=0 (both!) selects "no code path"

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.6

The revised DNN API now provides a more complete set of primitives for Convolutional Neural Networks. Forward and backward transformation as well as weight updates are supported for a variety of data formats including TensorFlow's native data format. The data/pixel types not only allow for single-precision floating point input and output layers, but also for 8 or 16-bit integer in- and outputs using integer kernels that operate on 16 or 32-bit integer data. In addition to AVX-512, AVX2 optimized kernels are now included.

INTRODUCED - LIBXSMM avail. for RPM based Linux distr. (Fedora, RHEL) via EPEL repository - Documented service functions ("secondary" API): timer, malloc, etc.; CPUID functionality now available as part of the service functions - SPMDM API (experimental) for sparse matrix multiplication - DNN: forward/backward, and weight update convolutions - DNN: Intel AVX2 support (in addition to AVX-512)

CHANGES - SMM: regular descriptor size (32-bit integer) is now default; BIG=1 (issue #109) - SMM: adjusted default prefetch strategy; more sophisticated (issue #105) - GNU Compiler Collection, Clang, and Intel Compiler fully supported/tested; CCE (CRAY) regularly checked and supported via the COMPATIBLE build key; PGI compiler occasionally checked (supported via COMPATIBLE=1) - DNN: revised API (breaking changes as announced per v1.5.x)

FIXES - Documented that LIBXSMM cannot be linked dynamically if BLAS is linked statically - SMM: fixed FORTRAN interface issue with older Intel Compiler (issue #104) - Tiled GEMM fixed (min. tile size might be selected larger than leading dim.) - Fixed unaligned mem. access in stand-alone out-of-place transpose - DNN: numerous fixes since v1.5.x; v1.6/onward will report fixes separately - SYNC: barrier can be gracefully released if it was not constructed

- C
Published by hfp about 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.5.2

This release cherry-picked changes from the master revision that fix (minor) issues. The issues are mostly related to library-mechanics or infrastructure, and "to converge-out" with the 1.5 release. The overall objective is to support making this library available with regular Linux distributions. Thank you to all maintainers who are involved in the review of LIBXSMM!

INTRODUCED - The collection of changes does not break the (DNN-)API, and we encourage people to adopt the master revision for any integration work related to our DNN API (as it slightly changes in v1.6).

CHANGES - Issue #103 (question about 32-bit support): fixes the 32-bit build (as an exercise). There is no intent to support 32-bit architecture! - Adjusted default build target to avoid building additional targets as part of the installation. - Adjusted file extension/marker of build scripts (spmdm sample code).

FIXES - Issue #104 (ifort segfaults when compiling 1.5.1's libxsmm.f): workaround in place. Re-validated with our other supported Fortran compilers. - Fixed soname conformance needed for Linux package distribution. - Fixed build dependency when building in an out-of-tree fashion. - Fixed Fortran interface for some older CCE tool chains.

Note: the paper "LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation" will be presented at the Supercomputing Conference (SC'16); meanwhile people may ask us for a preprint of the publication.

- C
Published by hfp over 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.5.1

This (minor) release is mainly a bugfix release, which gains its urgency from a bug in the Fortran interface (SMM functionality), where requesting a JIT kernel never returned a suitable PROCEDURE POINTER (always NULL). The implemented fix now reaches v1.5's goal of supporting a wider variety of Fortran compilers (GNU, Intel, CRAY, and PGI) while the Fortran interface code still allows staying with GNU Fortran 4.5 (the oldest supported Fortran compiler).

Beyond the above bugfix, there are four fixes for the new DNN functionality, and an improved/fixed console output of the DNN sample code. Furthermore, the out-of-place transpose code now detects when the input and output matrix point to the same array (alias). Instead of returning an error code in general, the most common special case (M=N, LDin=LDout) is now implemented (a high-performance in-place transpose is still pending for a future release).

INTRODUCED - SC'16 paper "LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation" => Please consider attending the presentation! - Self-contained Linux perf support (see PR #100): removed dependency on the Linux kernel header - Additional sample code (spmdm) for sparse matrix multiplication (see PR #101)

CHANGES - Improved reliability of the out-of-place transpose, and support for in-place corner case - Additional test infrastructure e.g., allowing to test with Intel Compiler - New script (.travis.sh) to build/run Travis testset (.travis.yml; "script:" section) - DNN backend: expanded support for 8 and 16-bit integer instructions

FIXES - Fixed Fortran interface, where requesting a JIT kernel never returned a suitable PROCEDURE (NULL) => This issue was introduced by v1.5, which aimed to support a wider variety of compilers - DNN backend: fixed bug in int16 convolutions (2d register blocking) - DNN: fixed bug in nhwc/rsck fallback code (forward convolutions) - DNN: fixed bug in unrolling calculation for the int16 implementation - DNN: fixed case for less than 16 input channels (int16) - DNN sample code: fixed GOP and GFLOP output

- C
Published by hfp over 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.5

A major addition for LIBXSMM is the introduction of the DNN API, which can be used for e.g., Convolutional Neural Networks (CNNs). As a consequence, the banner description of LIBXSMM has been updated:

Library targeting Intel Architecture (x86) for small, dense or sparse matrix multiplications, and small convolutions.

The small convolutions are currently focused on Intel AVX-512, but compiler-generated fallback code is in place as well. Besides AVX-512, forward convolutions (along with support for different storage formats) are also covered with Intel AVX2. In addition to LIBXSMM's internal storage scheme, the library supports a variety of other popular data formats, one of which is TensorFlow's native NHWC storage scheme. With respect to the supported data types, single-precision convolution kernels (FP32) are fully supported by the JIT code generator. Moreover, initial code for Int16-based data is already in place. During the past development cycle, Google Inc. stated some interest in LIBXSMM, and also contributed the Linux perf support to confirm the commitment. For others who would like to join our efforts, a preliminary Wiki page about contributions has been added (https://github.com/hfp/libxsmm/wiki/Contribute).

INTRODUCED - New DNN API, sample code, and benchmarks (Googlenetv1, DeepBench, and Overfeat) - Enabled tiled GEMM support in static/dynamic wrapper; MT support via libxsmmext - More format variations of sparse matrix multiplication (dense/sparse etc.) - Sample code showing sparse matrix multiplication (PyFR examples collection) - Published synchronization layer (atomics, and simple/bare OS-thread/lock abstraction) - Introduced mini-API for optimized barrier implementation (general multicore support) - Introduced API for memory allocation (malloc interface); mostly exposed from internal API - Beside of Intel VTune, now Linux perf and jitdump are supported (Thank you Maciej D.!) - SPECFEM sample: received nicely written example contribution (Thank you Daniel P.!) - OSX (incl. "El Capitan") now supports Intel Compiler, Apple/Clang, and GNU GCC - CRAY's Compiling Environment (CCE) is now supported - PGI compiler is now supported
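The newly exposed memory allocation API can be used as in the following sketch (aligned allocation paired with the matching deallocator); the 2 MB size and 64-byte alignment are arbitrary example values.

```c
#include <libxsmm.h>
#include <stdlib.h>

int main(void) {
  /* allocate a 2 MB buffer aligned to a 64-byte boundary; an alignment of 0
     would request an automatic/default alignment */
  double* buffer = (double*)libxsmm_aligned_malloc((size_t)2 << 20, 64);
  if (NULL != buffer) {
    buffer[0] = 1.0; /* use the buffer */
    libxsmm_free(buffer); /* release with the matching deallocator */
  }
  return EXIT_SUCCESS;
}
```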

CHANGES - Solidified API/impl. for out-of-place (OOP) transposes; ST/MT support (MT via libxsmmext) - Type-optimized OOP-transpose implementations, and generic/full support for any element type - Shared OpenMP infrastructure/abstraction for transposes and GEMMs. - Introduced and documented the LIBXSMM_MT environment variable (ST/MT/sync control). - Performance enhancements for sparse matrix multiplication (code gen., prefetches) - Support for SMM kernels (BIG=1) with larger extent(s) in terms of M, N, K, LDA, LDB, or LDC - Support for "ease of use" APIs (internal multi-threading), and external MT runtimes - Include "secondary" APIs in the first place (libxsmm.h), i.e., malloc, timer, sync.h - Included statistic into the LIBXSMM_VERBOSE table for kernels which exceed the MNK threshold. - Updated documentation to cover the new DNN API; added sample code (samples/dnn) - Enhanced infrastructure and portability for Variable Length Arrays (VLAs) - Library infrastructure (templates) for different element/pixel types (F32, I16, I8) - Improved development infrastructure (merging version.txt, and commit msg. hook) - Improved Travis-CI turnaround time (due to commit msg. hook [skip ci], and upload timeout) - Improved support for Clang, and bleeding edge compilers/architectures (intrinsic layer, etc.) - CPUID distinction between AVX-512/Core, AVX-512/MIC, and AVX-512/Common - Better build-time support for AVX-512 (AVX=3 MIC=0|1, etc.) - Removed disabling of JIT-support under Windows (still, the calling convention is not in place) - Better intro-style/banner (license, Travis, etc.) for online documentation (README.sh, etc.) - Improved info message when building LIBXSMM (compiler, code path info, etc.) - Revised wrapper mechanism, static wrapper now requires a special build of libxsmmext (WRAP=1|2) - Improved dispatching of the LIBXSMM_PREFETCH strategy (common, GEMM, tiled GEMM) - Introduced LIBXSMM_GEMM_PREFETCH=-1|0...10 environment variable for tiled GEMM - Debug helpers (internal): libxsmm_meta_image_typeinfo, libxsmm_meta_image_write, libxsmm_gemm_dump - Renamed libxsmm_[get|set]_verbose_mode to libxsmm_[get|set]_verbosity (verbosity level) - Improved verbose mode: TRY-counter now collects rejected JIT requests (unsupported GEMM calls) - Verbose mode (>1) prints rejected GEMM calls (console), or dumps (<-1) data in MHD format - Meta Image (MHD) format for data dumps (inspection via ITK-SNAP, ParaView, or similar) - TSC-based (not about CPU cycles!) libxsmm_timer_xtick (in addition to libxsmm_timer_tick) - Improved calculation of tile sizes for tiled GEMM (LIBXSMM_CLMP, LIBXSMM_SQRT2) - Improved header-only support, and related/new CI test target (Travis CI)

FIXES
- Improvements and fixes of the backend support for sparse matrix multiplication
- Bug fixes wrt code dispatch, medium-sized GEMMs, and the wrapper mechanism
- Fixed issue where a certain GEMM API did not respect the JIT-bypass/BLAS-fallback
- Support for "no BLAS dependency" (which previously broke the static wrapper)
- Correctly handle the user-documented prefetch id vs. the internal prefetch flag/bits
- Disarm MKL_DIRECT_CALL/MKL_DIRECT_CALL_SEQ when determining the original BLAS symbol
- Adjusted/fixed support for dispatching statically generated SMM kernels
- Fixed issue where BIG SMM kernels returned the wrong code from the registry
- Fixed inline assembly for CPUID detection; the issue was only exposed with Clang
- Fixed/disabled LIBXSMM_ATOMIC_STORE_ZERO issue (may hang) for non-LIBXSMM_GCCATOMICS
- Fixed lazy initialization for certain cases/tool chains (related to c'tor/d'tor attributes)
- Fixed compiler warnings with older Intel Compiler (atomics layer)

- C
Published by hfp over 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.4.4

This release improves and stabilizes previously released features while containing the necessary changes (generalized VTune Profiling support, JIT buffer management, and changes to the code registry structure) for upcoming new functionality. It also contains a number of new (preview-)features (not yet documented) such as sparse SoA matrix multiplication in the frontend, and stand-alone out-of-place general matrix transposes.

CHANGES
- Introduced SONAME for shared objects (dynamic library) under Linux and OS X (see issue #79). This change may ease including the library in Linux distributions (package repositories). The Python utility script has been adjusted to output the various version number formats used to format the SONAME. This change also updated the installation target (Makefile) to install symbolic links rather than duplicating shared libraries.
- Made PREFETCH=1 the default as it already refers to auto-prefetch based on the CPUID. This change complements previous efforts to reduce the "need" for different compile-time configurations and specializations. Performance related needs are now mostly migrated to CPUID-dispatched code paths.
- LIBXSMM_VERBOSE mode now includes accurate heap memory consumption for the code registry and for the JIT'ted code buffers, and it also allows dumping the JIT code to files for manual inspection (issue #88).
- Improved FORTRAN 2003 conformance (larger set of warnings under the PEDANTIC=2 umbrella flag), and resolved an issue with the Intel Compiler 2011 SP1 (avoid the MERGE intrinsic in a PARAMETER declaration).
- Deprecated (actually removed) the ROWMAJOR support in preparation for including a regular CBLAS interface. This also removes the associated configuration flags in the interface while keeping some support for deployed applications which fortunately only check for COLMAJOR.
- Initial sparse matrix support arrived in the interface; such a kernel is not managed by the code registry, but rather created (libxsmm_create_dcsr_soa) and released (libxsmm_destroy) manually.
- Internal library services are ported in preparation for Windows support. This includes VTune support for executable buffers in general, which also includes manually managed kernels (sparse SoA kernels).
- Initial stand-alone support for out-of-place matrix transpose (libxsmm*transpose_oop) for C/C++ and FORTRAN. The CPUID-dispatched code and the implementation of the in-place transpose are still missing.
- Enabled JIT code generation under Windows (does not work yet due to the incorrect calling convention). In fact, all code previously preventing the JIT facility under Windows is now removed, and thus one may call into JIT code (and fail due to the different calling convention). Prefetch signatures are still avoided under Windows (although this does not help with the calling convention). Cygwin support still avoids JIT other than exercising the related code when building a DEBUG version.
- Improved Clang support, and in particular accounted somewhat better for the broken Intrinsic support in Clang (when the static code path is below the code path "needed" for the Intrinsics). This also played out as an improvement for the GCC-based tool chain, which somewhat better supports the Intrinsics use case (target attribute). Under OS X, the SSE 4.2 code is now enabled as the baseline/static code path (due to broken support with the CRC32 intrinsics in particular). Note that under Linux the CRC32 instructions are CPUID-dispatched.
- Allow for a header-only implementation of LIBXSMM to ease adoption with certain header-only C++ libraries (Eigen, etc.); see issue #86 and the sketch after this list. This facility also works for C (which is quite notable); however, the header-only implementation currently does not allow linking C and C++ objects into a single binary.
- Code which does not call any BLAS-related code in LIBXSMM (e.g., the sparse SoA kernels) may now link against libxsmmext in order to get rid of the BLAS dependency. For more details see issue #82.
- Updated documentation (it is still behind newer/development features); updated the CP2K guide (documentation folder).
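
The header-only usage mentioned in the list above (issue #86) roughly looks as follows; the header name (libxsmm_source.h) follows later LIBXSMM documentation and is an assumption for this particular release.

```c
/* Hedged sketch of header-only usage: no libxsmm library needs to be linked.
 * The header name (libxsmm_source.h) is assumed from later documentation. */
#include <libxsmm_source.h>

int main(void) {
  libxsmm_init();
  /* ... dispatch and call kernels as with the regularly linked library ... */
  libxsmm_finalize();
  return 0;
}
```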

FIXES
- libxsmm_xmmdispatch now properly falls back to BLAS if the requested kernel is not supported.
- There are numerous smaller improvements and CHANGES which can be perceived as fixes.

UPCOMING
- Initial support for convolutions as commonly used in Machine Learning
- High-performance stand-alone in-place transpose
- Windows JIT support

- C
Published by hfp over 9 years ago

https://github.com/libxsmm/libxsmm - Version 1.4.3

This version releases minor improvements on top of version 1.4.2. None of the changes are critical or address issues affecting stable operation.

CHANGES
- Closed an open "todo" about using atomic operations when collecting verbose-mode counters.
- Fixed a compiler warning when compiling the library for AVX-512 (libxsmm_intrinsics_x86.h).
- Fixed the CACHE flag for adjusting the size of the thread-local cache (Makefile).
- Fixed TRACE to apply the necessary linker flag for the call trace (Makefile).

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.4.2

This release implements a number of features which are non-critical for the core functionality but either serve a request (API to get/set the target architecture, and MATMUL wrapper) or greatly improve usability for developers (JIT profiling, and verbose mode).

CHANGES
- MATMUL-style routines (58fcb41 and e055d65) as a thin wrapper around GEMM (FORTRAN only)
- Issue #75 (frontend function to bypass cpu-id and to set arch_id): API to get/set the target architecture
- Issue #76 (Support JIT-Profiling API): show JIT-kernel insights within Intel VTune Amplifier
- Issue #78 (Introduce verbose mode): extended termination message (kernel statistics)

Besides the new features, there are two non-critical fixes. Issue #77 in fact led to a non-working call-wrapper mechanism for statically linked GEMM routines when using Intel MKL. The resolution not only fixes the problem, but also unifies the static call interception for all BLAS libraries (the documentation is updated accordingly). The other issue was about failing to register statically generated kernels on systems which cannot JIT-generate code (pre-AVX era); the resolution includes fixes as well as an enhancement.

FIXES
- Issue #77 (statically wrapping GEMM calls now works as expected/documented)
- Fixed registering statically generated code (de0af05, b05a02c, and f093543)

There is also an enhancement which became possible in version 1.4.1 (2230568); however, the size of GEMM descriptor entries had not been reduced because the SIMD padding was not updated (this applies to the code registry and the thread-local cache). Another enhancement (3610639), addressed along with Issue #75, is the extension of the available code paths: AVX-512 is now handled in two flavors (MIC and CORE). This information is currently not used to generate different code (everything is AVX-512F, i.e., foundational instructions), but to eventually load different platform defaults.

Note: the new API for getting/setting the target architecture was partly present in previous releases (getter). However, this release not only adds the setter functionality but also slightly changes (in an incompatible fashion) the previously implemented wrapper. The renamed getter function also comes along with a renamed environment variable (c7ea23c: LIBXSMM_JIT has been renamed to LIBXSMM_TARGET).
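
For illustration, a hedged sketch of the get/set target-architecture API discussed above; the function names (libxsmm_get_target_arch, libxsmm_set_target_arch) follow the later documented API and are assumed for this release. The same effect can be achieved with the LIBXSMM_TARGET environment variable.

```c
#include <libxsmm.h>
#include <stdio.h>

int main(void) {
  libxsmm_init();
  /* Assumed API: query the currently targeted code path... */
  printf("current target: %s\n", libxsmm_get_target_arch());
  /* ...and override it at runtime (e.g., force AVX/Sandy Bridge code paths). */
  libxsmm_set_target_arch("snb");
  printf("new target: %s\n", libxsmm_get_target_arch());
  libxsmm_finalize();
  return 0;
}
```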

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.4.1

This release fixes an issue where a repeated init-finalize cycle likely caused the thread-local code caches to contain invalid data, i.e., pointing to (and calling) a code buffer which was already released. Besides the bug fix, this release also contains some preparation for issues #72 and #71, i.e., splitting the internal code registry in an SoA fashion and cleaning up unused code (unused compile-time alternatives).
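
The repeated init-finalize cycle described above can be sketched as follows; libxsmm_init and libxsmm_finalize are assumed to be the public lifecycle calls in this release line.

```c
#include <libxsmm.h>

int main(void) {
  int i;
  /* Prior to this fix, thread-local code caches could retain pointers to
   * already released JIT buffers across repeated init-finalize cycles. */
  for (i = 0; i < 3; ++i) {
    libxsmm_init();
    /* ... dispatch and call kernels ... */
    libxsmm_finalize();
  }
  return 0;
}
```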

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.4

This release settles the features previewed in version 1.3 by making the "medium-sized" matrix multiplication available for both the C/C++ and the FORTRAN interface as well as by including the topic in the reference documentation. Further, a potential performance enhancement arrived by dispatching the default prefetch strategy according to the CPUID (when building with PREFETCH=1 or via JIT). This change was mainly triggered by a performance regression with statically generated KNC kernels; however, it also introduced the infrastructure to take advantage of the dispatch if needed, and it is available for JIT'ted kernels requesting LIBXSMM_PREFETCH_AUTO. Another enhancement is the removal of the collected performance results from the reference documentation. Results are now moved into an orphaned branch called "results". This change was mainly triggered by the unreasonable size of what is supposed to be a source code archive, and it further triggered a definition of what goes into a Git-exported archive (tarball, ZIP file). This enhancement eases redistributing and re-hosting the archive files.

CHANGES
- Settle "medium-sized" matrix multiplication for C/C++ and FORTRAN (issue #65)
- Select the PREFETCH strategy according to CPUID for static code and JIT (issue #69)
- Move collateral results into an orphaned branch to reduce the size of archives (issue #70)
- Documented the LIBXSMM_GEMM and LIBXSMM_OMP environment variables
- Handle hash key collisions when registering static kernels (issue #73)

The last change about handling hash key collisions when registering statically generated code is not only a potential performance improvement when relying on static kernels, but it also triggered at least one important fix related to resolving hash key collisions.

FIXES
- An incorrectly resolved hash key collision returned an incorrect code version (a761220)
- An exhausted code registry potentially resulted in incorrect behavior (39cc809: next != i0)
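
To illustrate the kind of collision handling described above, here is a purely hypothetical sketch (not LIBXSMM's actual code): a hash over the GEMM descriptor selects a registry slot, and linear probing with a full descriptor comparison resolves collisions.

```c
#include <string.h>

/* Hypothetical illustration only; names and layout are not LIBXSMM's code. */
#define REGISTRY_SIZE 1024

typedef struct { int m, n, k, lda, ldb, ldc, flags; } gemm_descriptor;
typedef struct { gemm_descriptor key; void* code; int used; } registry_entry;
static registry_entry registry[REGISTRY_SIZE];

static unsigned int hash_descriptor(const gemm_descriptor* d) {
  /* stand-in for the CRC32-based hash mentioned elsewhere in these notes */
  return ((unsigned int)d->m * 31u + (unsigned int)d->n * 131u
        + (unsigned int)d->k * 1031u) % REGISTRY_SIZE;
}

void* lookup_or_register(const gemm_descriptor* d, void* generated_code) {
  unsigned int i = hash_descriptor(d), probed = 0;
  while (probed < REGISTRY_SIZE) {
    registry_entry* e = &registry[i];
    if (!e->used) { /* empty slot: register the generated code */
      e->key = *d; e->code = generated_code; e->used = 1;
      return e->code;
    }
    if (0 == memcmp(&e->key, d, sizeof(*d))) { /* full comparison resolves collisions */
      return e->code;
    }
    i = (i + 1) % REGISTRY_SIZE; ++probed; /* linear probing on collision */
  }
  return NULL; /* registry exhausted */
}
```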

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.3

This release of LIBXSMM introduces two new features, one of which is a preview feature intended to cover "medium-sized" GEMM routines, whereas the other (cache) feature initially targets the NEKBOX workload (but accelerates any repeated kernel dispatch).

FEATURES
- Small thread-local cache of the most recently dispatched kernels (Issue #62)
- Medium-sized GEMM routines (PREVIEW) (Issue #65)

Medium-sized GEMM routines (PREVIEW): The supposedly "medium-sized" GEMM routines (libxsmm_omps_?gemm) are OpenMP based ("omp"), but are meant to remain sequential ("s") unless requested otherwise (via the LIBXSMM_GEMM environment variable) or when incorporated into a parallel region. Due to the experimental status, the interface is not final (C-only at this point, i.e., not present in the Fortran interface). This feature is also known to rely on the Intel Compiler and OpenMP 4.0 tasks at this point. To try out the "omps" routines, one may follow the xgemm sample code and call the aforementioned routines. In addition, one may LD_PRELOAD the shared extension library (libxsmmext) and rely on `LIBXSMM_GEMM=0|1|2` (0/default: sequential SMM below the THRESHOLD, 1: sequential matrix multiplications that may participate in an already opened parallel region by using OpenMP tasks, and 2: internally parallelized matrix multiplications). Please note that the (original) idea of only aiming for "medium-sized" matrix multiplications is not necessarily true going forward (this also depends on feedback).

Termination message for developers (debug build): For developers aiming to know "what's going on", the library now emits a message when terminating (debug build only; DBG=1). The JIT based code path (according to the CPUID) as well as the number of JITted kernels is printed at termination time. For completeness, the number of registered static kernels is printed as well (this happens when no JIT/AVX based code path was available). Example (stderr): LIBXSMM_JIT=hsw NJIT=14 NSTATIC=0.

Renamed extension library: Renaming the 'libxsmmld' library (the former LD_PRELOAD bits) to 'libxsmmext' is neither a feature nor a fix, but it may help accommodate future extensions, in particular extensions which require additional runtime support. At this point, 'libxsmmext' depends on OpenMP (while keeping the main library independent from a particular threading runtime).

CHANGES
- Termination message in the debug build, which might be helpful during development
- Function (libxsmm_get_target_arch) which allows querying the target architecture
- Renamed the "libxsmmld" (LD_PRELOAD) library to "libxsmmext"

Since the code generator (backend) currently only supports homogeneously transposed matrices during GEMM ('NN' by default, 'TT' via the RowMajor storage scheme), it is necessary to filter any requested GEMM call before attempting to generate a kernel, which in turn allows properly forwarding to the fallback routine (BLAS). In addition, an issue in the backend related to long K-dimensions has been fixed. Also, capturing the build status now works even with excessively long command lines stemming from a large specification ("MNK") of kernels to be statically generated (make).

FIXES
- Call-forwarding based on supported (filtered) GEMM arguments (e.g., when using LD_PRELOAD)
- Issue with extreme unrolling in the K-dimension (AVX-512) (Issue #67)
- Capturing the build status could overflow the command line length
- Minor issue with the Makefile's install target ("include2")

- C
Published by hfp almost 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.2

This release focuses on OS portability, a more thorough set of tests carried out across Operating Systems and distributions, and some more applicable defaults when building the library for a wider audience. The latter is intended to suit maintainers of upcoming Linux distributions who have to think about an unpredictable audience and a wider set of use cases. The second focus point was about dispatching performance-critical functionality independent of the static code path which is selected when building the library. The third focus point was about features supporting developers who want to evaluate LIBXSMM, or incorporate the library into their application.

Here is a list of the main changes along with some more details:
- Validated against Linux and OS X using Travis Continuous Integration. As a side-effect of delivering OS X support (which was requested), the code should also work under FreeBSD when accounting for the specifics (using gmake rather than make, etc.). There is also limited support for Microsoft Windows (no JIT compilation).
- A new documentation section about installing LIBXSMM (https://github.com/hfp/libxsmm/#installation) has been written with package maintainers in mind. This is complemented by a removed link-time dependency on LAPACK/BLAS such that the decision about which BLAS library to link with is deferred to the point where the actual application is linked. LIBXSMM works with any (BLAS) library supplying ?gemm symbols.
- For developers who want to incorporate LIBXSMM, the documentation now mentions how to start with a library build (DBG=1) which emits messages about internal error/warning conditions discovered at runtime; normally the library does not perform any non-private or visible I/O. In addition, a TRACE facility has been implemented and documented to further support application developers.
- Evaluating and using LIBXSMM has been made very low effort by implementing an LD_PRELOAD mechanism (or DYLD_INSERT_LIBRARIES under OS X). In addition, another but similar mechanism has been implemented to help with applications which statically link against LAPACK/BLAS (link-time wrapper). There is a dedicated section about this feature (https://github.com/hfp/libxsmm/#call-wrapper).
- The dispatch mechanism of the internal code registry (which delivers the "dispatching" of JIT'ted code) now adapts according to CPUID (it checks whether the SSE 4.2 based CRC32 instructions are available). In addition to software-based CRC32 hash keys, an alternative hash key generator has been implemented to limit the performance penalty in case the CRC32 instructions are not available.
- Fortran applications can now rely on the generated module file and link against 'libxsmmf'. This complements the mechanism of simply including LIBXSMM's Fortran interface ('include/libxsmm.f'). In addition, there is limited support for Fortran 77 (`libxsmm?gemm` functions only). Some more details can be found at the end of the section about building the library (https://github.com/hfp/libxsmm/#build-instructions).
- Finally, the backend code generator has been tweaked for smaller instruction sizes emitted when generating Intel AVX-512 code (Intel Xeon Phi family of processors code-named Knights Landing, "KNL").

- C
Published by hfp about 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.1.1

This is a minor update release fixing the following issues: selecting the main code path at build-time of the library (SSE, AVX), and parsing the version number and branch name (version.txt).

- C
Published by hfp about 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.1

This release is settling our frontend (C/C++ and Fortran interface) by bringing all language interfaces to an equal level in terms of functionality and usability, e.g., the Fortran interface has been brought up to support the prefetch signatures. In terms of language capabilities, the C++ and the Fortran interfaces both support overloaded functions (generic procedures) as well as a functor/call mechanism to help calling the backend code. Generic procedures are now available without compromising the availability of assumed-shape array procedures. Moreover, the Fortran interface now fully settles on the C implementation, and therefore the previously needed glue code has become superfluous.

Our JIT backend has left the "experimental" state after being successfully deployed into several applications. Also, for our dispatch mechanism, the known issue about possible hash key collisions has been resolved. For deploying the library into an unknown or inhomogeneous environment, a huge leap has been made by JITting code according to the CPUID flags. The latter is accompanied by the option to include static SSE code (which is not supported by the JIT backend) in the library while still being able to JIT for the best available code path.

The next milestone, intercepting existing calls to GEMM, has already been addressed by settling the interface. What was previously known as the "simplified interface" has been removed, and binary-compatible GEMM routines are now available. The latter allow auto-dispatching every GEMM call in an attempt to harvest higher performance for suitable matrix multiplications. Providing call interception is now within reach of an upcoming update. In fact, statically linked GEMM calls can already be intercepted by, e.g., adding -Wl,--wrap=dgemm_ -L/path/to/libxsmm -lxsmm to the link line.
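
A hedged sketch of calling one of the binary-compatible GEMM routines mentioned above; the name and dgemm-style signature of libxsmm_dgemm follow later LIBXSMM documentation and are assumed for this release.

```c
#include <libxsmm.h>

/* C := alpha * A * B + beta * C for small, column-major matrices;
 * the dgemm-compatible signature is assumed from later documentation. */
void small_gemm(const double* a, const double* b, double* c) {
  const libxsmm_blasint m = 16, n = 16, k = 16;
  const double alpha = 1.0, beta = 1.0;
  libxsmm_dgemm("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);
}
```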

- C
Published by hfp about 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.0.2

This is an intermediate release which was validated for and is integrated into NekBox, https://github.com/maxhutch/nek.

- C
Published by alheinecke about 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.0.1

This release provides a small but important bugfix to ensure simple usage of LIBXSMM by re-enabling lazy initialization of the library when static code generation is used. Note that this only affects performance and not correctness. Version 1.0 always called the C fallback if libxsmm_build_static() was not explicitly called. JIT=1 was not affected by this issue.

- C
Published by alheinecke over 10 years ago

https://github.com/libxsmm/libxsmm - Version 1.0

This release completes a major refactoring of our library backend while introducing additional capabilities in the frontend (interface). The major update is the ability to generate code Just-In-Time (JIT), i.e., to “compile” matrix-multiplication kernels at run-time of an application. This is achieved by leveraging our reworked code generator and directly emitting machine byte code into an executable buffer. Despite the ability to automatically generate any missed kernels, there is nearly no additional overhead: the set of routines in our "CP2K collection" of 386 kernels shows only ~3% slowdown on average, while LIBXSMM outperforms the Intel MKL counterparts by ~2X (MKL_DIRECT_CALL), and the Intel Compiler (ICC) generated inlinable code by ~1.5X (on average over the aforementioned 386 kernels). Please consult the README for further details on how to use JIT compilation.

In addition, we have reimplemented our code dispatch mechanism in order to prepare LIBXSMM for a full xGEMM interface: the assembly-kernel selection is based on a hash table using a CRC32 checksum over an argument structure which already covers all xGEMM arguments. Given Intel SSE 4.2 capabilities, the calculation is accelerated using CRC32 instructions (which are available on KNL as well). Over the course of the next minor releases, we will be bringing JIT compilation out of its experimental state (adjusting code cache eviction, resource cleanup, and portability).
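
A hedged sketch of the dispatch-then-call pattern behind the mechanism described above; the names libxsmm_dmmdispatch and libxsmm_dmmfunction follow later 1.x documentation and are assumed for this early release.

```c
#include <libxsmm.h>

void run_kernel(const double* a, const double* b, double* c) {
  /* NULL arguments request the defaults (tight leading dimensions,
   * alpha = beta = 1, default flags and prefetch strategy). */
  const libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(
    23/*m*/, 23/*n*/, 23/*k*/, NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
    NULL/*alpha*/, NULL/*beta*/, NULL/*flags*/, NULL/*prefetch*/);
  if (NULL != kernel) {
    kernel(a, b, c); /* JIT'ted (or statically generated) small GEMM */
  }
}
```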

- C
Published by hfp over 10 years ago

https://github.com/libxsmm/libxsmm - Version 0.9.1

This is mainly a bug fix release correcting the AVX-512 code for N=9 and K being a multiple of 16 (DP) or 32 (SP). In addition, the samples (blas, dispatched, inlined, and specialized) are consolidated into a single sample folder. The latter also comes with a performance evaluation script (run script and Gnuplot script). The more complex "cp2k" code sample has been renamed as well along with slightly improved Gnuplot scripts.

- C
Published by hfp over 10 years ago

https://github.com/libxsmm/libxsmm - Version 0.9

This release settles the assembly code generator as the default code generation mechanism. The library targets Intel SSE3, AVX, AVX2, IMCI/KNCni, and Intel AVX-512 (foundational) instructions using optimized assembly code. Restrictions for the shape of the generated kernels are relaxed or actually removed, and the documentation is updated accordingly. The build system now handles an empty code specialization request such that only an inlinable code path and the BLAS fallback code are generated. The build system also respects the problem size threshold when generating code according to the requested specialization. The former milestone item to report some performance results is also addressed in published documentation. Moreover, additional code samples have been collected, allowing an easier start compared to the more complex CP2K proxy sample code. The documentation now starts with a Q&A section (answering how to quickly check whether LIBXSMM is beneficial for an application). In short, this release attempts to deliver a stable and complete library according to the former specification, and prepares for upcoming roadmap items such as a full xGEMM interface, and other features.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.6

This release is correcting the way the assembly code generator is called as well as correctly implementing the code wrapping the generated assembly code paths. Moreover, the dispatch mechanism using a direct lookup table now correctly covers the possible problem space. At the same time, the direct lookup table has been effectively limited in size (M x N x K <= 65536) such that the table does not exceed 512 KB on a 64-bit architecture (65536 entries of 8-byte function pointers). The fall-back dispatch mechanism remains based on a binary search, which does not suffer from the size issue. Feature-wise, the assembly code generator has been enabled by default. Moreover, an additional index generator scheme has been implemented and documented (MNK variable).

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.5

As prepared in the previous release, the library now comes with an assembly code generator. The build system transparently supports the new code generator (GENASM=1) as an alternative to the Intrinsic code path. However, a future revision of the library will enable assembly code generation by default. In addition to the code generator, the documentation gives some more tuning background and adds a roadmap section guiding expectations for upcoming developments.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.4

This release uses a direct function pointer lookup for auto-dispatched matrix-matrix multiplications, and introduces a SPARSITY build-time flag to optionally rely on a binary search (which allows for a compact/sparse lookup table). Besides some tweaked code unrolling in the Intrinsic code path, the sample/benchmark program employs more optimized parallelization settings. The library now also comes with an adjusted build system in order to support the upcoming assembly code generator (GENASM=1). While we are working on adding a public version of the assembly code generator, the library's build system is already interoperable.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.3

This release fixes an assumption used to hint the compiler's code generation. It also lowers the required stack size to fit the defaults of the tested compilers, and implements an error message when exceeding the problem size that fits on the stack (code sample). Furthermore, this release also targets developer convenience by including some IDE support.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.2

This release reworked the code generation allowing a more flexible way of specializing the code (Intrinsic code path). The documentation not only covers the revised code generation but also explains some of the optimizations introduced earlier (implicitly aligned leading dimension optimization). Furthermore, it is now possible to conveniently generate AVX-512 foundational instructions (AVX-512F).

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8.1

In addition to the stable interface and the Intel AVX-512 Intrinsic code path (which was validated using the Intel SDE), the code sample evolved into a benchmark program while still providing a clean and lean code sample. Moreover the entire code is a bit more tweaked and sophisticated thanks to the standalone benchmark program.

- C
Published by hfp almost 11 years ago

https://github.com/libxsmm/libxsmm - Version 0.8

This initial release is mainly acknowledging the stable interface. Moreover, the library is well tested including the AVX-512 Intrinsic code path which was validated using the Intel SDE.

- C
Published by hfp almost 11 years ago