https://github.com/amdresearch/omniprobe

Simply log all kernel durations

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (11.2%) to scientific vocabulary

Last synced: 6 months ago · JSON representation

Repository

Simply log all kernel durations

Basic Info

Host: GitHub
Owner: AMDResearch
License: mit
Language: C++
Default Branch: main
Size: 604 KB

Statistics

Stars: 7
Watchers: 13
Forks: 2
Open Issues: 6
Releases: 0

Created over 1 year ago · Last pushed 7 months ago

Metadata Files

Readme License

Omniprobe

[!IMPORTANT]
This project is in an alpha state. We are making it available early because of significant interest in having access to it now. There is still some productization and packaging to do. And many more tests need to be added. It works, but if you use it enough, you will undoubtedly find corner cases where things go wrong. The good news is that you can mostly have far more performance visibility inside kernels running on AMD Instinct GPUs than has ever been possible before.

Omniprobe was originally called 'logduration' and was begun simply to provide a quick and easy way to observe all kernel durations within any ROCm application, without having to run the profiler or being saddled with all of the application perturbation profiling introduces (e.g. kernels are often serialized). It turned into something more feature-rich, however. (Because Omniprobe was originally named 'logduration', as you snoop around the code, you will invariably see references to 'logduration', including some of its naming conventions for environment variables.)

One of the longstanding challenges doing software performance optimization on AMD GPUs has been the lack of visibility into intra-kernel performance. Hardware performance counters are only attributable to specific kernel dispatches when kernels are serialized and counters are gathered on kernel dispatch boundaries (i.e. before a kernel is dispatched and after it completes.) This means that developers typically only have aggregate visibility into performance - a kind of average - but pinpointing specific bottlenecks in code can be problematic. Developers have to infer from aggregate performance what might be the source of a bottleneck. It isn't that this can't be done, it just makes the whole business of performance optimization harder and take longer. And it sometimes imposes on developers the need to reason from various aspects of specific hardware micro-architectures back to software and compiler implementations.

Omniprobe is a vehicle to facilitate attributing many common bottlenecks inside kernels to specific lines of kernel source code. It accomplishes this by injecting code at compile-time into targeted kernels. The code that it injects is selectively placed and results in instrumented kernels that stream context-laden messages to the host while they are running. logduration processes and analyzes these messages with one or multiple host-side "message handlers". From the information contained in these messages, it is possible to isolate many common-case bottlenecks that can inadvertently be written into code.

Not every possible bottleneck can be identified and isolated in this way. Instrumenting code necessarily perturbs the behavior of a kernel. But there are many common bottlenecks for which this perturbation is not a problem. Some bottleneck detection examples we have already implemented are:

Memory Access Inefficiencies
- Bank Conflicts
- Non-coalesced memory accesses
- Non-aligned memory accesses
Branchiness

We have also implemented analytics to provide fine-grained intra-kernel performance measurement (e.g. at basic block granularity), detailed instruction counting by instruction type, memory heatmap analysis, and others.

logduration is a platform for implementing new intra-kernel observation and analysis functionality. We are just getting started with new analytics and have additional useful capabilities both in development and planned.

omniprobe

omniprobe is a command-line python wrapper around the functionality provided by liblogDuration. It simplifies the process of setting up the environment and launching instrumented applications. The various environment variables are documented below, though they only need to be explicitly set by the user if logduration is needed in a context for which running the python wrapper is not feasible.

usage: omniprobe [options] -- application

Command-line interface for running intra-kernel analytics on AMD Instinct GPUs

Help: -h, --help show this help message and exit

General omniprobe arguments: -v, --verbose Verbose output -k , --kernels Kernel filters to define which kernels are instrumented. Valid ECMAScript regular expressions are supported. (cf. https://cplusplus.com/reference/regex/ECMAScript/) -i, --instrumented, --no-instrumented Run instrumented kernels (default: False) -e, --env-dump, --no-env-dump Dump all the environment variables for running liblogDuration64.so. (default: False) -d , --dispatches The dispatches for which to capture instrumentation output. This only applies when running with --instrumented. Valid options: [all, random, 1] -c , --cache-location The location of the file system cache for instrumented kernels. For Triton this is typically found at $HOME/.triton/cache -t , --log-format The format for logging results. Default is 'csv'. Valid options: [csv|json] -l , --log-location The location where all your data should be logged. By default, it will be to the console. -a [ ...], --analyzers [ ...] The analyzer(s) to use for processing data being streamed from instrumented kernels. Valid values are ['MessageLogger', 'Heatmap', 'MemoryAnalysis', 'BasicBlockAnalysis'] or a reference to any shared library that implements an omniprobe message handler. -- [ ...] Provide command for instrumenting after a double dash. ```

Environment Variables

LOGDURLOGLOCATION
- console
- file name
- /dev/null
LOGDURKERNELCACHE
- The kernel cache should be pointed at a directory containing .hsaco files which represented alternative candidates to the kernels being dispatched by the application. If running "instrumented kernels" (see the next environment variable description), logDuration will look for an identically named kernel with the same parameter list and types, but with a single additional void * parameter (needed for the data streaming to the host from instrumented kernels.) If logDuration is not running in instrumented mode (e.g. LOGDUR_INSTRUMENTED = "false"), when the kernel cache is enabled it will look for kernels in the cache having identical names and parameters. This can be useful when wanting to compare different versions of the same kernel for overall duration.
LOGDUR_INSTRUMENTED
- Value can be either "true" or "false". If set to "true", the kernel cache will replace dispatched kernels with an instrumented alternative.
LOGDUR_DISPATCHES=all | random | 1
- Default is to capture data on all dispatches. Setting to 'random' will (unsurprisingly) capture data on random dispatches. Setting to '1' will capture a single dispatch for each unique kernel in the application.
LOGDUR_INSTRUMENTED=true
LOGDUR_HANDLERS=<Message Handler for processing messages from instrumented kernels.> e.g. libLogMessages64.so
LOGDURLOGFORMAT=json
TRITONLOGGERLEVEL=3
TRITONALWAYSCOMPILE=1
TRITONDISABLELINE_INFO=0
TRITONHIPLLD_PATH=/opt/rocm-6.3.1/llvm/bin/ld.lld
LLVMPASSPLUGIN_PATH=/work1/amd/klowery/logduration/build/external/instrument-amdgpu-kernels-triton/build/lib/libAMDGCNSubmitBBStart-triton.so
HSATOOLSLIB
- Set to path of liblogDuration64.so - this causes the ROCm runtime to find and load this library.
LOGDUR_HANDLERS=libBasicBlocks64.so
- Set to the message handler(s) that will process the messages streaming out of instrumented kernels.
LDLIBRARYPATH
- Set to logduration/omniprobe along with wherever else you need the loader to search.

Building

Quick start (container)

We provide containerized execution environments for users to get started with omniprobe right away. Leverage the containers/run.sh script to jump into a container with the project and all of its dependencies pre-installed. Use the --docker or --apptainer flags to build the image for your preferred container runtime.

Example: console $ ./containers/run.sh Error: Must specify either --docker or --apptainer. Usage: ./containers/run.sh [--docker] [--apptainer] [--rocm VERSION] --docker Run using Docker container --apptainer Run using Apptainer container --rocm ROCm version (default: 6.3, supported: 6.3 6.4)

That's it! If a container matching your detected VERSION of omniprobe doesn't exist already, one will be built automatically.

Build from source

This project has several dependencies that are included as submodules. By default, logduration builds with ROCm instrumentation support.

Override the default ROCm LLVM search path via ROCM_PATH. To build with support for Triton instrumentation, we require you set TRITON_LLVM.

```shell git clone https://github.com/AARInternal/logduration.git cd logduration git submodule update --init --recursive mkdir build cd build cmake -DTRITON_LLVM=$HOME/.triton/llvm/llvm-a66376b0-ubuntu-x64 .. make

Optionally, install the program

make install ```

[!TIP] See FAQ for reccomended Triton installation procedure.

Dependencies

logDuration is a new kind of performance analysis tool. It combines many of the attributes of profilers, compilers, debuggers, and runtimes into a single tool. Because of that, logDuration is now dependent on three other libraries that provide various aspects of the functionality it needs.

kerneldb

kernelDB provides support for extracting kernel codes from HSA code objects. This can be an important capability for processing instrumented kernel output. The omniprobe memory efficiency analyzer relies on this because sometimes code optimizations are made downstream in the compiler from where instrumentation occurred. And proper analysis of, say, memory traces requires understanding how the code may have been optimized (e.g. ganging together individual loads into dwordx4)

dh_comms

dhcomms provides buffered I/O functionality for propagating messages from instrumented kernels to host code for consuming and analyzing messages from instrumented code at runtime. Because logDuration can run in either instrumented or non-instrumented mode, dhcomms functionality needs to be built into logDuration.

instrument-amdgpu-kernels

Unlike either dhcomms or kerneldb, instrument-amdgpu-kernel does not get linked into logDuration, but the llvm plugins provided by this library do the instrumentation of GPU kernels that logDuration relies on when running in instrumented mode. For now, when you build instrument-amdgpu-kernels for logDuration, you need to use the dhcommssubmitaddress branch.

FAQ

How do you recommend I install Triton?

To build with Triton instrumentation support, we require you provide the path to Triton's LLVM install (TRITON_LLVM). We recommend using a virtual Python environment to avoid clobbering your other packages. See docker/triton_install.sh for help creating this virtual environment automatically.

Where can I find more information on using Omniprobe?

We are creating some (very) informal tutorial videos that will walk you through things. An introductory tutorial video can be found here: All videos that we create will be posted at this Youtube channel: Omniprobe Youtube

Owner

Name: AMDResearch
Login: AMDResearch
Kind: organization

Repositories: 4
Profile: https://github.com/AMDResearch

GitHub Events

Total

Watch event: 2
Issue comment event: 2
Push event: 22
Pull request event: 4
Fork event: 2
Create event: 2

Last Year

Watch event: 2
Issue comment event: 2
Push event: 22
Pull request event: 4
Fork event: 2
Create event: 2

Dependencies

.github/workflows/build-triton-redhat.yml actions

actions/upload-artifact v4 composite
nick-fields/retry v3 composite

.github/workflows/build-triton-ubuntu.yml actions

actions/upload-artifact v4 composite
nick-fields/retry v3 composite

.github/workflows/redhat.yml actions

actions/checkout v4 composite
actions/download-artifact v4 composite
actions/github-script v7 composite
nick-fields/retry v3 composite

.github/workflows/ubuntu.yml actions

actions/checkout v4 composite
actions/download-artifact v4 composite
actions/github-script v7 composite
nick-fields/retry v3 composite

omniprobe/requirements.txt pypi

pyfiglet *

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/amdresearch/omniprobe

Science Score: 26.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

Omniprobe

omniprobe

Environment Variables

Building

Quick start (container)

Build from source

Optionally, install the program

Dependencies

kerneldb

dh_comms

instrument-amdgpu-kernels

FAQ

How do you recommend I install Triton?

Where can I find more information on using Omniprobe?

Owner

GitHub Events

Total

Last Year

Dependencies