discostic-sim

A cross-architecture resource-based parallel simulation framework that can efficiently predict the performance of real or hypothetical massively parallel MPI programs on current and future heterogeneous systems.

https://github.com/rrze-hpc/discostic-sim

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 22 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.7%) to scientific vocabulary

Keywords

current-clusters future-clusters open-source parallel-computing performance-analysis performance-engineering performance-simulation
Last synced: 6 months ago

Repository

A cross-architecture resource-based parallel simulation framework that can efficiently predict the performance of real or hypothetical massively parallel MPI programs on current and future heterogeneous systems.

Basic Info
Statistics
  • Stars: 7
  • Watchers: 9
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Topics
current-clusters future-clusters open-source parallel-computing performance-analysis performance-engineering performance-simulation
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Zenodo

README.md

DisCostiC: Distributed Cost in Cluster


💡 Description

A cross-architecture resource-based parallel simulation framework that can efficiently predict the performance of real or hypothetical massively parallel MPI programs on current and future highly parallel heterogeneous systems.✨

The runtime predictions are carried out in a controlled environment through straightforward, portable, scalable, and efficient simulations.⭐

The simulation framework is an automated version of analytical first-principle performance models at full-scale scope, covering cores, chips, nodes, networks, and clusters, their mutual interactions, and inherent bottlenecks such as memory bandwidth.🌟

The application blueprint, created using a Domain-Specific Embedded Language (DSEL) built on modern C++, keeps the key concepts and elements readable for domain experts and accurately encodes inter-process dependencies without introducing unknown consequences from the actual systems. 💥

⚙️ Framework workflow💬

  • Application model

    • Domain knowledge of the application, expressed in the DisCostiC DSEL language, provides information on call dependencies and on the properties of computation (data access pattern, flop count) and communication (volume, mode and protocol)
  • Analytic first-principle performance models

    • Computation models: fundamental analytic Roofline and refined ECM models, including the memory-bandwidth bottleneck concept at the system level
    • Communication models: fundamental latency-bandwidth and refined LogGP models
  • System models

    • Cluster, network and node models for the full topology of the target system, including multiple chips, nodes, network interfaces and clusters

DisCostiC combines all of these to simulate the runtime cost of distributed applications; a minimal sketch of the idea follows below.
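As a rough, self-contained sketch of how such first-principle models compose (illustrative only, not DisCostiC code): a Roofline-type computation model bounds a kernel's runtime by the slower of its flop and data-transfer demands, a latency-bandwidth (Hockney-type) communication model charges latency plus volume over bandwidth per message, and a simulated time step combines the two. All numbers below are hypothetical.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative composition of first-principle models (not DisCostiC code).
double roofline_seconds(double flops, double bytes,
                        double peak_flops, double mem_bw) {
    // Runtime is bounded by the slower of compute and data transfer.
    return std::max(flops / peak_flops, bytes / mem_bw);
}

double hockney_seconds(double msg_bytes, double latency_s, double inv_bw_s_per_byte) {
    // Latency-bandwidth (Hockney) model: T = latency + bytes / bandwidth.
    return latency_s + msg_bytes * inv_bw_s_per_byte;
}

int main() {
    double t_comp = roofline_seconds(1e9, 4e9, 300e9, 80e9); // one step's kernel (hypothetical)
    double t_comm = hockney_seconds(8e6, 2e-6, 1.0 / 12e9);  // one halo message (hypothetical)
    std::printf("per-step cost estimate: %g s\n", t_comp + t_comm);
}
```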

🖍 Advantages over existing tools💬

  • DSEL approach for creating the application blueprint

    • Accurately encodes inter-process dependencies without introducing unknown consequences from the actual systems
  • A full-scale first-principle-model-based simulator

    • Automates analytical first-principle performance models across the entire hierarchy of parallel systems
  • Efficient speed

    • No intermediate trace files required, unlike offline, trace-driven tools
    • No high memory requirement, unlike online tools that use the host architecture to execute code
  • Last but not least, an open-source, low-entry-cost, lightweight simulator enabling model-based design-space exploration

:octocat: Compilation and build

First, use the following command to clone the git repository:

git clone git@github.com:RRZE-HPC/DisCostiC-Sim.git && cd DisCostiC-Sim

Installation:thought_balloon:

Before proceeding, make sure the environment is prepared for the compilation. The installation steps are listed below:

    module load python git intel intelmpi cmake itac vampir
    conda create --name XYZ
    conda activate XYZ
    conda install pip
    cmake -DCMAKE_INSTALL_PREFIX=~/.local . && make all install

One way to check the installation is to print the version and the help of DisCostiC using ./discostic --version and ./discostic --help.

⏱️ Configuration settings:thought_balloon:

The test folder in DisCostiC offers multiple MPI-parallelized programs (benchmark_kernel) with distinct settings for computation (kernel_mode) and communication (exchange_mode). For illustration, a few examples are given below:

| benchmark_kernel | Description |
| --------------------- | ------------- |
| HEAT | Two-dimensional five-point Jacobi kernel with communication |
| SOR | Gauss-Seidel Successive Over-Relaxation solver with communication |
| DMMM | Dense Matrix Matrix Multiplication kernel with communication |
| DMVM | Dense Matrix Vector Multiplication kernel with communication |
| DMVM-TRANSPOSE | Dense Matrix Transpose Vector Multiplication kernel with communication |
| HEATHEAT | Back-to-back two HEAT kernels with communication |
| HEATSOR | Back-to-back HEAT and SOR kernels with communication |
| HEATDIVIDE | Back-to-back HEAT and DIVIDE kernels with communication |
| HPCG | High Performance Conjugate Gradient |
| STENCIL-3D-7PT | Three-dimensional seven-point stencil kernel with communication |
| STENCIL-3D-27PT | Three-dimensional twenty-seven-point stencil kernel with communication |
| STENCIL-UXX | UXX stencil kernel with communication |
| STENCIL-3D-LONGRANGE | Three-dimensional long-range stencil kernel with communication |
| STENCIL-1D-3PT | One-dimensional three-point stencil kernel with communication |
| STREAM | STREAM Triad kernel with communication |
| SCHOENAUER | SCHOENAUER Triad kernel with communication |
| SCHOENAUER-DIV | SCHOENAUER divide Triad kernel with communication |
| WAXPY | WAXPY kernel with communication |
| DAXPY | DAXPY kernel with communication |
| SUM | SUM kernel with communication |
| VECTOR-SUM | Vector SUM kernel with communication |
| ADD | ADD kernel with communication |
| DIVIDE | DIVIDE kernel with communication |
| SCALE | SCALE kernel with communication |
| COPY | COPY kernel with communication |
| KAHAN-DOT | KAHAN-DOT kernel with communication |
| SCALAR-PRODUCT | Scalar Product kernel with communication |

The following kernel_mode options are offered purely for flexibility; the simulator's prediction is unaffected by the choice of kernel_mode. Further explanation of this mode can be found in the Essential DisCostiC routines section.

| kernel_mode | Integrated tool | Description |
| --------------------- | ------------- | ------------- |
| COMP | no external tool integrated | This directly embeds the single-core pre-recorded ECM performance model data of the computational kernel into the simulator. |
| LBL | no external tool integrated | This reads the pre-recorded ECM performance model data for the computational kernel from an external file located in the nodelevel/configs folder. |
| SRC | Kerncraft integrated | This directly embeds the source code of the computational kernel into the simulator. |
| FILE | Kerncraft integrated | This reads the source code of the computational kernel from an external file located in the nodelevel/kernels folder. |

The following exchange_mode is provided in LBL mode to enable model-based exploration through experimenting with various communication patterns in MPI applications.

| exchange_mode | Description |
| --------------------- | ------------- |
| message_size | Size of the message to be exchanged, in bytes. |
| step_size | Step size of the message exchange as an int (1: distance-one communication, 2: distance-two communication, ...). |
| direction | Direction of the message exchange as an int (1: uni-directional upwards shift in the positive direction only, 2: bi-directional upwards and downwards shift in both positive and negative directions). |
| periodic | Enables or disables communication periodicity as a bool (0: false, 1: true). |
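To make these knobs concrete, here is a minimal sketch (not DisCostiC code; the partners function, np, and the 0-based ranks are assumptions for illustration) of how step_size, direction, and periodic could determine a rank's communication partners:

```cpp
#include <vector>

// Hypothetical illustration of the exchange_mode parameters above.
// Returns the ranks that `rank` exchanges messages with among `np` processes.
std::vector<int> partners(int rank, int np, int step_size, int direction, bool periodic) {
    std::vector<int> out;
    int up = rank + step_size;                   // upwards (positive) shift
    if (periodic) out.push_back(up % np);
    else if (up < np) out.push_back(up);
    if (direction == 2) {                        // bi-directional exchange
        int down = rank - step_size;             // downwards (negative) shift
        if (periodic) out.push_back((down % np + np) % np);
        else if (down >= 0) out.push_back(down);
    }
    return out;
}
```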

The config.cfg file can be edited to select the benchmark_kernel and kernel_mode. More information about this config.cfg file is available in the DisCostiC help documentation.

🥅 Compilation:thought_balloon:

The compilation offers the following choices for the output report's data format and for tracing the simulator's own implementation. An executable is generated after compilation.

| Command | Description |
| --------------------- | ------------- |
| make | This enables JSON data format without ITAC profiling of the simulator implementation. |
| make otf2 | This enables both JSON and OTF2 data formats without ITAC profiling of the simulator implementation. |
| make trace_MPI | This enables JSON data format and the standard ITAC tracing mode of the simulator implementation, with information about MPI call functions (enabled flag: -trace). |
| make trace_all | This enables JSON data format and the verbose ITAC tracing mode of the simulator implementation, with information on both user-defined and MPI call functions (enabled flags: -trace -tcollect). |

:green_circle: Run:thought_balloon:

In the batch script, the number of simulator processes is configured as the number of simulated processes plus one. When running the batch script on any system, ITAC profiling can be enabled or disabled, which determines whether the simulator's own trace is dumped in ITAC:

| Command | Description |
| --------------------- | ------------- |
| sbatch Run_Simulation.sh | This performs the simulation without tracing the simulator's own implementation using ITAC. |
| sbatch Run_Simulation_ITAC.sh | This dumps the simulator's own trace into ITAC to investigate the implementation of the simulator. |

The Run_Simulation_ITAC.sh script only generates a single STF file (simulation.single.stf) in the main directory due to the export of the following variables:

    export VT_LOGFILE_FORMAT=SINGLESTF
    export VT_LOGFILE_NAME=simulation
    export VT_LOGFILE_PREFIX=$SLURM_SUBMIT_DIR
    export VT_FLUSH_PREFIX=/tmp

The formats, names, and locations of the output files set by these environment variables can be adjusted as desired.

🔁 Clean and uninstall:thought_balloon:

| Command | Description |
| --------------------- | ------------- |
| make clear | This cleans up the working directory by removing all unnecessary DisCostiC files, such as *.dms, *.otf, and *.csv files. |
| make uninstall | This uninstalls the DisCostiC framework, including installed files and CMake-specific files. |

:signal_strength: DisCostiC output

🌐 1. Web interface with Google Chrome browser (DMS file format):speech_balloon:

Upon completion of the run, DisCostiC generates a report referred to as DisCostiC.dms. DisCostiC.dms is a straightforward JSON file that can be viewed in the Google Chrome browser. Load the generated JSON file at the following address:

chrome://tracing
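For orientation, DisCostiC.dms follows the Chrome trace-event JSON format that chrome://tracing accepts; the sketch below (hypothetical event names and times, not DisCostiC's actual writer) shows the general shape of such a file:

```cpp
#include <fstream>

// Emit two complete ("ph":"X") events in the Chrome trace-event format.
// The standard fields are name, cat, ph, ts (microseconds), dur, pid, tid;
// event names and timestamps here are made up.
int main() {
    std::ofstream out("example.dms");
    out << R"({"traceEvents":[)"
        << R"({"name":"SEND","cat":"comm","ph":"X","ts":0,"dur":120,"pid":0,"tid":0},)"
        << R"({"name":"COMP","cat":"comp","ph":"X","ts":120,"dur":880,"pid":0,"tid":0})"
        << "]}\n";
}
```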

📊 2. Graphical user interface with Vampir (OTF2 file format):speech_balloon:

Upon completion of a run that was compiled with make otf2, DisCostiC generates a report called DisCostiC/traces.otf2. This OTF2 file can be viewed using third-party tools such as ITAC and Vampir. It is produced by the ChromeTrace2Otf2 converter, which translates the JSON report (DisCostiC.dms) into the OTF2 format (DisCostiC/traces.otf2).

Use the following command to open the (small) OTF2 trace file (DisCostiC/traces.otf2) in Vampir:

vampir DisCostiC/traces.otf2

📊 3. Graphical user interface with ITAC (STF file format):speech_balloon:

To convert an OTF2 file to a single Structured Trace File (STF), DisCostiC.single.stf, and open it in ITAC, invoke the Intel Trace Analyzer GUI and follow the steps below:

1. Run traceanalyzer &.
2. Go to File > Open; in the Files of type field, select Open Trace Format, navigate to the `DisCostiC/traces.otf2` file, and double-click to open it.
3. The OTF2-to-STF conversion dialog appears. Review the available fields and checkboxes, and click Start to begin the conversion. The OTF2 file is converted to STF (DisCostiC/traces.otf2.single.stf) and can then be viewed in the Intel Trace Analyzer.

🗄️ 4. Statistical or log data:thought_balloon:

    ----------------------------------------------------------------
    DisCostiC
    ----------------------------------------------------------------
    Full form: Distributed Cost in Cluster
    Version: v1.0.0
    Timestamp: Tue Jan 20 11:16:42 2024
    Author: Ayesha Afzal <ayesha.afzal@fau.de>
    Copyright: © 2024 HPC, FAU Erlangen-Nuremberg. All rights reserved

    HEAT program in FILE mode of kernel is simulated for 100000 iterations and 72 processes

    ----------------------------------------------------------------
    COMPUTATIONAL PERFORMANCE MODEL
    ----------------------------------------------------------------
    Code simulating on IcelakeSP_Platinum-8360Y chip of cores: {1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 }
    Single-precision maximum chip performance [GF/s]: {614.4 , 1228.8 , 1843.2 , 2457.6 , 3072 , 3686.4 , 4300.8 , 4915.2 , 5529.6 , 6144 , 6758.4 , 7372.8 , 7987.2 , 8601.6 , 9216 , 9830.4 , 10444.8 , 11059.2 , 11673.6 , 12288 , 12902.4 , 13516.8 , 14131.2 , 14745.6 , 15360 , 15974.4 , 16588.8 , 17203.2 , 17817.6 , 18432 , 19046.4 , 19660.8 , 20275.2 , 20889.6 , 21504 , 22118.4 }
    Double-precision maximum chip performance [GF/s]: {307.2 , 614.4 , 921.6 , 1228.8 , 1536 , 1843.2 , 2150.4 , 2457.6 , 2764.8 , 3072 , 3379.2 , 3686.4 , 3993.6 , 4300.8 , 4608 , 4915.2 , 5222.4 , 5529.6 , 5836.8 , 6144 , 6451.2 , 6758.4 , 7065.6 , 7372.8 , 7680 , 7987.2 , 8294.4 , 8601.6 , 8908.8 , 9216 , 9523.2 , 9830.4 , 10137.6 , 10444.8 , 10752 , 11059.2 }

    ----------------------------------------------------------------
    COMMUNICATION PERFORMANCE MODEL
    ----------------------------------------------------------------
    Code simulating on InfiniBand with IntelMPI library
    Hockney model parameters (intra chip): 41185.7 [ns] latency, 0.079065 [s/GB] inverse bandwidth 
    Hockney model parameters (inter chip): 64567.6 [ns] latency, 0.124122 [s/GB] inverse bandwidth 
    Hockney model parameters (inter node): 47628.7 [ns] latency, 0.091516 [s/GB] inverse bandwidth 
    Hockney model parameters (inter cluster): 47628.7 [ns] latency, 0.091516 [s/GB] inverse bandwidth 
    LogGP model additional parameters (intra chip): 24429.9 [ns] o, 199.008 [ns] g, 0.065 [ns] O
    LogGP model additional parameters (inter chip): 39748.3 [ns] o, 338.249 [ns] g, 0.085 [ns] O
    LogGP model additional parameters (inter node): 47938.9 [ns] o, 4980.51 [ns] g, 0.084 [ns] O
    LogGP model additional parameters (inter cluster): 47938.9 [ns] o, 4980.51 [ns] g, 0.084 [ns] O
    Eager threshold: 65535 bytes

    ----------------------------------------------------------------
    Simulation runtime in DisCosTiC: 15.7239 [s]
    ----------------------------------------------------------------

    ----------------------------------------------------------------
    FULL APPLICATION PERFORMANCE (for all MPI processes)
    rank                         runtime [s]
    ----------------------------------------------------------------
    0                              2154.83
    1                              2154.83
    .                              .......
    .                              .......
    .                              .......
    71                             2154.67
    ----------------------------------------------------------------
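Reading the communication-model parameters in this log, the predicted cost of a single message follows directly from the stated models; the sketch below plugs in the logged intra-chip values (an illustrative reading, not DisCostiC code; the LogGP combination shown is simplified, and the eager/rendezvous switch at the 65535-byte threshold is only hinted at):

```cpp
#include <cstdio>

// Sketch: single-message cost from the logged intra-chip parameters.
// Note that an inverse bandwidth of 0.079065 s/GB is numerically equal
// to 0.079065 ns/byte.
double hockney_ns(double bytes) {
    const double latency_ns = 41185.7;
    const double inv_bw_ns_per_byte = 0.079065;
    return latency_ns + bytes * inv_bw_ns_per_byte;
}

double loggp_ns(double bytes) {
    // Simplified LogGP reading: overhead o + gap g + per-byte overhead O.
    const double o = 24429.9, g = 199.008, O = 0.065;
    return o + g + bytes * O;
}

int main() {
    for (double n : {1024.0, 65535.0, 1048576.0}) // eager limit: 65535 bytes
        std::printf("%9.0f B: Hockney %10.1f ns, LogGP %10.1f ns\n",
                    n, hockney_ns(n), loggp_ns(n));
}
```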

💻 System model

The system parameters, either hypothetical or actual, can be tuned by editing the config.cfg file available in the current directory. For the detailed documentation of the system model, please take a look at the config.cfg file and the YAML files available in the machine files and network files directories.


🔌 Cluster model:thought_balloon:

The cluster model covers resource allocation, inter-cluster characteristics, and runtime modalities.

| Metadata information | Description |
| --------------------- | ------------- |
| number_of_nodes | Number of nodes used on a cluster |
| task_per_node | Number of cores used per node of a cluster |
| number_of_processes | Number of processes running on the cluster |
| inter_cluster or heterogeneous | Communication across clusters (0: inter-cluster disabled; 1: inter-cluster enabled) |
| number_of_iterations | Number of iterations for the program run |
| dim_x, dim_y, dim_z | Problem size for the program run; high-dimensional parameters are disregarded for low-dimensional problems |


🔌 Interconnect model:thought_balloon:

The network model comprises the YAML-formatted network files together with the choice of performance model and mode of communication.

| Metadata information | Description |
| ------------------------------------- | ------------- |
| name | Name of the YAML-formatted interconnect file |
| intra_chip | Communication inside a chip |
| inter_chip | Communication across chips |
| inter_node | Communication across nodes |
| latency | Latency in seconds for the various kinds of interconnect |
| bandwidth | Bandwidth in GB/s for the various kinds of interconnect |
| eager_limit | Data size in bytes at which the communication mode changes from the eager to the rendezvous protocol |
| waitio_mode | Mode of the WaitIO MPI library (socket, file or hybrid) |
| comm_model | Performance model for communication (0: LogGP, 1: HOCKNEY) |

For illustration, a few examples of available network files are given below:

| Network files (YAML format) | Description |
| --------------------- | ------------- |
| InfiniBand_WaitIO_intercluster | Communication across clusters for the WaitIO MPI library and InfiniBand interconnect |
| InfiniBand_WaitIO_internode | Communication across nodes for the WaitIO MPI library and InfiniBand interconnect |
| InfiniBand_WaitIO_interchip | Communication across chips for the WaitIO MPI library and InfiniBand interconnect |
| InfiniBand_WaitIO_intrachip | Communication inside the chip for the WaitIO MPI library and InfiniBand interconnect |
| Tofu-D_WaitIO_internode | Communication across nodes for the WaitIO MPI library and Tofu-D interconnect |
| Tofu-D_WaitIO_interchip | Communication across chips for the WaitIO MPI library and Tofu-D interconnect |
| Tofu-D_WaitIO_intrachip | Communication inside the chip for the WaitIO MPI library and Tofu-D interconnect |
| InfiniBand_IntelMPI_internode | Communication across nodes for the Intel MPI library and InfiniBand interconnect |
| InfiniBand_IntelMPI_interchip | Communication across chips for the Intel MPI library and InfiniBand interconnect |
| InfiniBand_IntelMPI_intrachip | Communication inside the chip for the Intel MPI library and InfiniBand interconnect |
| OmniPath_IntelMPI_internode | Communication across nodes for the Intel MPI library and Omni-Path interconnect |
| OmniPath_IntelMPI_interchip | Communication across chips for the Intel MPI library and Omni-Path interconnect |
| OmniPath_IntelMPI_intrachip | Communication inside the chip for the Intel MPI library and Omni-Path interconnect |


🔌 Node model:thought_balloon:

The node model comprises the machine files in YAML format, the selection of compiler settings, and the performance model of computation.

| Metadata information | Description |
| --------------------- | ------------- |
| name | Name of the YAML-formatted processor file |
| sockets_per_node | Number of sockets in one node |
| ccNUMA_domains_per_socket | Number of ccNUMA domains in one socket |
| cores_per_ccNUMA_domain | Number of cores in one ccNUMA domain |
| FP_instructions_per_cycle | Floating-point instructions (ADD, MUL) per cycle |
| FP_operations_per_instruction_(SP/DP) | Single- or double-precision floating-point operations per instruction |
| clock_frequency | Clock frequency in GHz |
| memory_bandwidth | Memory bandwidth in GB/s |
| compiler-flags | STD and SIMD optimizations of the compiler |
| pmodel | Performance model for computation (Roofline, ECM) |
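These entries determine the chip peak performance reported in the simulator log above; a worked sketch of the relation (hypothetical parameter values, not taken from any particular machine file):

```cpp
#include <cstdio>

// Peak performance from node-model parameters (illustrative values only):
// peak(n) = n * FP_instructions_per_cycle * FP_operations_per_instruction * clock
int main() {
    const double fp_inst_per_cycle = 2;   // hypothetical
    const double fp_ops_per_inst_dp = 16; // hypothetical (double precision)
    const double clock_GHz = 2.4;
    for (int cores = 1; cores <= 4; ++cores)
        std::printf("%d core(s): %.1f GF/s (DP)\n", cores,
                    cores * fp_inst_per_cycle * fp_ops_per_inst_dp * clock_GHz);
}
```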

For illustration, a few examples of available machine files are given below:

| Machine files (YAML format) | Description |
| --------------------- | ------------- |
| A64FX | 48 core Fujitsu A64FX FX1000 CPU @ 1.8 GHz |
| IcelakeSP_Platinum-8360Y | 36 core Intel(R) Xeon(R) Icelake SP Platinum 8360Y CPU @ 2.4 GHz |
| CascadelakeSP_Gold-6248_SNC | 20 core Intel(R) Xeon(R) Cascadelake SP Gold 6248 CPU @ 2.5 GHz |
| SkylakeSP_Gold-5122 | 4 core Intel(R) Xeon(R) Skylake SP Gold 5122 CPU @ 3.6 GHz |
| SkylakeSP_Gold-6148 | 20 core Intel(R) Xeon(R) Skylake SP Gold 6148 CPU @ 2.4 GHz |
| SkylakeSP_Gold-6148_SNC | 20 core Intel(R) Xeon(R) Skylake SP Gold 6148 CPU with SNC enabled @ 2.4 GHz |
| SkylakeSP_Platinum-8174 | 24 core Intel(R) Xeon(R) Skylake SP Platinum 8174 CPU @ 3.1 GHz (2.7 GHz used) |
| BroadwellEP_E5-2697v4 | 18 core Intel(R) Xeon(R) Broadwell EN/EP/EX E5-2697 v4 CPU @ 2.3 GHz |
| BroadwellEP_E5-2697v4_CoD | 18 core Intel(R) Xeon(R) Broadwell EN/EP/EX E5-2697 v4 CPU with CoD enabled @ 2.3 GHz |
| HaswellEP_E5-2695v3 | 14 core Intel(R) Xeon(R) Haswell EN/EP/EX E5-2695 v3 CPU @ 2.3 GHz |
| HaswellEP_E5-2695v3_CoD | 14 core Intel(R) Xeon(R) Haswell EN/EP/EX E5-2695 v3 CPU with CoD enabled @ 2.3 GHz |
| IvyBridgeEP_E5-2660v2 | 10 core Intel(R) Xeon(R) IvyBridge EN/EP/EX E5-2660 v2 CPU @ 2.2 GHz |
| IvyBridgeEP_E5-2690v2 | 10 core Intel(R) Xeon(R) IvyBridge EN/EP/EX E5-2690 v2 CPU @ 3.0 GHz |
| SandyBridgeEP_E5-2680 | 8 core Intel(R) Xeon(R) SandyBridge EN/EP E5-2680 CPU @ 2.7 GHz |


DisCostiC help:thought_balloon:

The help of DisCostiC (./discostic --help) reads as follows:

```
Distributed Cost in Cluster (DisCostiC)
Version   : v1.0.0
Author    : Ayesha Afzal <ayesha.afzal@fau.de>
Copyright : 2024 HPC, FAU Erlangen-Nuremberg. All rights reserved


Arguments for ./discostic

 --version, -v          show simulator's version information and exit
 --help, -h             show this help message and exit

Application model for config.cfg

 benchmark_kernel       name of the kernel used in the program
 kernel_mode            mode of the kernel (FILE, SRC, LBL, COMP)

Cluster model for config.cfg

 heterogeneous          a bool flag to enable or disable the second system (1: enabled, 0: disabled)
 number_of_iterations   number of iterations of the program
 dim_x, dim_y, dim_z    problem size; high-dimensional parameters will be disregarded for low-dimensional problems.
 task_per_node          number of running processes on one node
 number_of_processes    total number of running processes

Interconnect model for config.cfg

 interconnect           name of the interconnect
 MPI_library            name of the MPI library for the first system (IntelMPI, WaitIO)
 comm_model             performance model of communication (0: LogGP, 1: HOCKNEY)
 waitio_mode            mode of the WaitIO MPI library (socket, file or hybrid)

Node model for config.cfg

 micro_architecture     name of the YAML machine file
 compiler-flags         STD and SIMD optimizations (-O3 -xCORE-AVX512 -qopt-zmm-usage=high, -O3 -xHost -xAVX, -Kfast -DTOFU); if not set, flags are taken from the YAML formatted machine file.
 pmodel                 performance model of computation (ECM, Roofline)
 vvv                    a bool flag to enable or disable the verbose node output (0: disabled, 1: enabled)
 cache-predictor        cache prediction with layer conditions or cache simulation with pycachesim (LC, SIM)
 penalty                runtime penalty for the computation model in nanoseconds, used only in LBL or COMP mode

Delay injection mode for config.cfg

 delay                  a bool flag to enable or disable the delay injection (0: disabled, 1: enabled)
 delay_intensity        intensity of delay as a multiple of computation time of one iteration
 delay_rank             process rank of the injected delay
 delay_timestep         iteration for the occurrence of the injected delay

Noise injection mode for config.cfg

 noise                  a bool flag to enable or disable the noise injection (0: disabled, 1: enabled)
 noise_intensity        intensity of random noise, i.e., rand() % noise_intensity 
 noise_start_timestep   starting iteration for the injected noise
 noise_stop_timestep    ending iteration for the injected noise

Output for config.cfg

 filename               output file name that contains details for the debug purpose
 chromevizfilename      output file name that contains all time-rank tracing data for visualization with Chrome tracing browser
 Verbose                a bool flag to enable or disable the verbose output (0: disabled, 1: enabled)

```
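As one plausible reading of the delay and noise knobs above (a sketch, not DisCostiC's implementation; injected_ns and its argument list are made up for illustration):

```cpp
#include <cstdlib>

// Hypothetical reading of the delay/noise injection parameters above.
// Returns the extra time (ns) injected into `rank` at iteration `step`,
// given one iteration's computation time `t_comp_ns`.
double injected_ns(int rank, int step, double t_comp_ns,
                   bool delay, double delay_intensity, int delay_rank, int delay_timestep,
                   bool noise, int noise_intensity, int noise_start, int noise_stop) {
    double extra = 0.0;
    // One-off delay: a multiple of the computation time of one iteration.
    if (delay && rank == delay_rank && step == delay_timestep)
        extra += delay_intensity * t_comp_ns;
    // Random noise, i.e., rand() % noise_intensity, inside the noise window.
    if (noise && noise_intensity > 0 && step >= noise_start && step <= noise_stop)
        extra += std::rand() % noise_intensity;
    return extra;
}
```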

👩‍💻 Automating DSEL generation through static analysis

To perform the static analysis, run the following commands before running the DisCostiC batch script:

    pip install -r staticanalysis/requirements.txt
    python staticanalysis/Convert-<benchmark_kernel>.py

| Files | Description |
| --------------------- | ------------- |
| Convert-<benchmark_kernel>.py | A helper script that takes the original code, locates code loops and communication through annotation, identifies user-defined variables such as dimx and dimy in the Cartesian stencil heat.c code, and ultimately generates the DSEL code. |
| requirements.txt | Each (sub)dependency is listed and pinned using "==" to a particular package version; see requirements.txt. This project uses the lightweight Python tree data structure anytree==2.8.0 and type hints for Python 3.7+ via typing_extensions==4.4.0. These dependencies are installed (normally in a virtual environment) through pip using the pip install -r staticanalysis/requirements.txt command. The generated tree's syntax is the same as the original C/C++ code. |

For the specified program, this will produce the following files:

| Files | Location | Description |
| --------------------- | ------------- | ------------- |
| <benchmark_kernel>.c | nodelevel/kernels | It comprises only the generated computational loop kernel of the MPI-parallel program. |
| <benchmark_kernel>_<kernel_mode>.hpp | test | It includes the entire generated code expressed in the DisCostiC DSEL language. |

📝 Essential DisCostiC routines:thought_balloon:

The goal is to offer convenient, compact, and practically usable application programming interfaces (APIs) with appropriate abstractions. The DSEL provides information about the call tree and attributes for communication (volume, mode, and protocol) and computation (data access pattern, flop count).

```cpp
1. DisCostiC->Rank_Init(DisCostiC::Indextype rank);

2. DisCostiC->SetNumRanks(DisCostiC::Datatype num_rank);

3. DisCostiC->Exec("LBL:STREAM_TRIAD", DisCostiC::Event depending_operations);
   DisCostiC->Exec("COMP:TOL=2.0||TnOL=1.0|TL1L2=3.0|TL2L3=6.0|TL3Mem=14.2", DisCostiC::Event depending_operations);
   DisCostiC->Exec("FILE:copy.c//BREAK:COPY//./BroadwellEP_E5-2697v4_CoD.yml//18//-D N 100000", DisCostiC::Event depending_operations);
   DisCostiC->Exec("SRC:double s, a[N],b[N],c[N];\n\nfor(int i=0; i<N; i++)\n\ta[i]=b[i]+s*c[i];\n//BREAK:DAXPY//./BroadwellEP_E5-2697v4_CoD.yml//18//-D N 100000", DisCostiC::Event depending_operations);

4. DisCostiC->Send(const void *message, DisCostiC::Datatype sending_message_size_in_bytes, DisCostiC::Datatype sending_message_datatype, DisCostiC::Indextype destination_rank, DisCostiC::Commtype communicator, DisCostiC::Event depending_operations);

5. DisCostiC->Isend(const void *message, DisCostiC::Datatype sending_message_size_in_bytes, DisCostiC::Datatype sending_message_datatype, DisCostiC::Indextype destination_rank, DisCostiC::Commtype communicator, DisCostiC::Request *request, DisCostiC::Event depending_operations);

6. DisCostiC->Recv(const void *message, DisCostiC::Datatype receiving_message_size_in_bytes, DisCostiC::Datatype receiving_message_datatype, DisCostiC::Indextype source_rank, DisCostiC::Commtype communicator, DisCostiC::Event depending_operations);

7. DisCostiC->Irecv(const void *message, DisCostiC::Datatype receiving_message_size_in_bytes, DisCostiC::Datatype receiving_message_datatype, DisCostiC::Indextype source_rank, DisCostiC::Commtype communicator, DisCostiC::Request *request, DisCostiC::Event depending_operations);

8. DisCostiC->Rank_Finalize();
```

📝 Code blueprint as DisCostiC input:thought_balloon:

The following DSEL code example shows the simplest illustration of how the domain knowledge of the applications is expressed in the DisCostiC language:

```cpp
DisCostiC::Event recv, send, comp;

for (auto rank : DisCostiC::getRange(systemsize)) {
    DisCostiC->Rank_Init(rank);
    DisCostiC->SetNumRanks(systemsize); /* initialization */

    comp = DisCostiC->Exec("STREAM_TRIAD", recv);
    send = DisCostiC->Send(res, 8, ((rank + 1) % NP), MPI_COMM_WORLD, comp);
    if (rank != 0) {
        recv = DisCostiC->Recv(res, 8, MPI_DOUBLE, rank - 1, MPI_COMM_WORLD, comp);
    } else {
        recv = DisCostiC->Recv(res, 8, MPI_DOUBLE, systemsize - 1, MPI_COMM_WORLD, comp);
    }

    DisCostiC->Rank_Finalize();
}
```


📝 Code documentation: suite of C++ data structures and enumerated types:thought_balloon:

The HTML and LaTeX documentation can be generated by doxygen. To build the documentation, navigate to the Doxyfile file available in the current directory and run the command:

    doxygen Doxyfile

To read the HTML and LaTeX documentation, point a web browser at doc/html/index.html and open doc/DisCostiC.pdf, respectively.

  • Single Operation: accessors for local operations and their individual information at a certain grid point

| Metadata information | Description |
| ------------------------------------- | ------------- |
| bufSize | Number of bytes (data size) transmitted in the communication operation (no real buffer size for comp, just added for completeness) |
| DepOperations | Dependencies for blocking routines, i.e., other operations that depend on this current operation |
| IdepOperations | Dependencies for non-blocking routines, i.e., other operations that depend on the current operation |
| depCount | Number of dependencies for this current operation |
| label | Index/identifier of this current operation for each rank |
| target | Rank of the target/partner (source for recv, destination for send; no real target for comp, added for completeness) |
| rank | Owning rank of this current operation |
| tag | Tag of this current operation (no real tag for comp, just added for completeness) |
| node | Node or processing element for this current operation |
| network | Type of network for this current operation |
| time | Ending time of this current operation |
| starttime | Starting time of this current operation |
| type | Type of this current operation |

```cpp
enum Operation_t {  //!< Operation_t defines different operation types of entities
    SEND = 1,       //!< Send operation type
    RECV = 2,       //!< Recv operation type
    COMP = 3,       //!< Computation operation type
    MSG  = 4        //!< Message operation type
};
```

| Metadata information | Description |
| --------------------- | ------------- |
| mode | Mode of this current calling operation |

```cpp
enum Mode_t {    //!< Mode_t defines the operation mode of communication
    NONBLOCKING, //!< Routines returning with the start of an operation, so the next operation does not execute before the previous one has started; such as Isend, Irecv
    BLOCKING     //!< Routines returning only on completion of an operation, so the next operation does not execute before the previous one has finished; such as Send, Recv
};
```

  • Performance model

| Metadata information | Description |
| --------------------- | ------------- |
| model | Analytic first-principle performance model for computation and communication |

```cpp
enum Model_t {  //!< Model_t defines the used performance model
    Roofline,   //!< Simple computation model type
    ECM,        //!< Advanced computation model type
    LOGGP,      //!< Simple communication model type
    HOCKNEY     //!< Advanced communication model type
};
```

  • Custom data types, keywords and high-level classes functionality:thought_balloon:

    • accessors for local vectors and matrices and their individual components without the need for index accesses
    • abstract base classes for the AST-generated grid to access the operations and their associated features
    • solver-specific data types, e.g., time steps, operations, identifiers, etc., declared globally (or as part of a special global declaration block):

```cpp
DisCostiC::Timetype
DisCostiC::Datatype
DisCostiC::Indextype
DisCostiC::Networktype
```

🧐 MPI-parallelized DisCostiC implementation

The underlying principle of parallel simulation is that each operation's entire data is transmitted via blocking or non-blocking MPI routines to the processes it communicates with.

| Terms | Description |
| --------------------- | ------------- |
| simulated processes P_i | Processes from the application point of view |
| simulator processes Q_i | Processes from the parallel simulation framework point of view |
| master simulator process Q_0 | Master process from the parallel simulation framework point of view |

Initialization:thought_balloon: In the MPI implementation, only the master process of the default communicator makes print calls for debug and verbose output reporting, while all other processes, which belong to the new communicator, initialize root operations only once per run.

MPI-parallelized:

```cpp
int ierr, process_Rank, size_Of_Cluster, sucesss, N_process_Rank, N_size_Of_Cluster;
MPI_Comm newcomm;
ierr = MPI_Init(NULL, NULL);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);
MPI_Comm_split(MPI_COMM_WORLD, (process_Rank) < 1 ? 1 : 2, 1, &newcomm);
ierr = MPI_Comm_rank(newcomm, &N_process_Rank);
ierr = MPI_Comm_size(newcomm, &N_size_Of_Cluster);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &size_Of_Cluster);
std::queue test;
MPI_Status status;
MPI_Request request;
DisCostiC_Timetype starttimeProcess;
DisCostiC::DisCostiC_OP operation, operation_og;
int iter = 0;
iter = ((rankLocalIt->getNumOps()) / 3);
if (rank == N_process_Rank) {
    operation_og = copy(op, operation_og);
    operation = op;
    starttimeProcess = operation.starttime;
}
```

Serialized:

```cpp
op.numOpsInQueue = numOpsInQueue++;
queue.emplace(op);
```

Start of simulation:thought_balloon: Do loops run over the number of iterations and the number of steps per iteration. Only the processes attached to the new communicator, "newcomm", take part in the main simulation.

MPI-parallelized:

```cpp
rank = N_process_Rank;
bool cycle = true;
int iter_num = 1;
do {        //!< over the number of iterations
    do {    //!< over the number of steps per iteration
```

Serialized:

```cpp
do {
    queue.pop();
```

Within each ccNUMA domain, runtime corrections are made by monitoring the number of processes concurrently engaged in each computation and communication step.
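One way to picture such a correction (a sketch of the generic bandwidth-sharing idea under a Roofline-type model, not the exact scheme used here): when several processes in one ccNUMA domain are simultaneously memory-bound, each effectively sees only a fraction of the domain's memory bandwidth.

```cpp
#include <algorithm>

// Illustrative bandwidth-sharing correction (not DisCostiC's exact scheme).
// With n_active processes in one ccNUMA domain, each memory-bound process
// effectively sees mem_bw / n_active of the domain's bandwidth.
double corrected_seconds(double flops, double bytes,
                         double peak_flops, double mem_bw, int n_active) {
    const double bw_per_process = mem_bw / n_active;
    return std::max(flops / peak_flops, bytes / bw_per_process);
}
```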

SEND operation:thought_balloon: The currently active process sends the operation object or array to the process listed as "operation.target", as dictated by the communication pattern of the simulated program.

MPI-parallelized:

```cpp
operation.type = DisCostiC::Operation_t::RECV;
operation.starttime = operation.time;
int temp = oSuccessor[operation.rank][operation.node];
double arr[13];
*serialize(operation, temp, arr);
if (operation.mode == DisCostiC::Mode_t::NONBLOCKING) {
    MPI_Isend(&arr, 13, MPI_DOUBLE, operation.target, operation.tag, newcomm, &request);
} else {
    MPI_Send(&arr, 13, MPI_DOUBLE, operation.target, operation.tag, newcomm);
}
finishedRankList.push_back(operation.rank);
operation.type = DisCostiC::Operation_t::RECV;
cycle = true;
```

Serialized:

```cpp
std::swap(operation.rank, operation.target);
queue.emplace(operation);
```
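The serialize/deserialize helpers themselves are not listed in this README. From their uses here (brr[12] is read back as the successor time after deserialize) and in the finalization step further below (brr[0] = time, brr[2] = numOpsInQueue, brr[3] = rank, brr[4] = node), one can sketch a plausible packing, assuming serialize and finalize share a layout; the remaining slots and the Op stand-in struct are assumptions:

```cpp
// Sketch of the 13-double packing exchanged via MPI above (assumptions marked).
// Op is a stand-in for DisCostiC::DisCostiC_OP.
struct Op {
    double time, starttime;
    int numOpsInQueue, rank, node, bufSize, tag;
};

void serialize_sketch(const Op &op, double successor_time, double (&arr)[13]) {
    arr[0]  = op.time;           // ending time (read back as brr[0])
    arr[1]  = op.starttime;      // assumption
    arr[2]  = op.numOpsInQueue;  // read back as brr[2]
    arr[3]  = op.rank;           // read back as brr[3]
    arr[4]  = op.node;           // read back as brr[4]
    arr[5]  = op.bufSize;        // assumption
    arr[6]  = op.tag;            // assumption
    // slots 7-11: further operation fields (assumptions)
    arr[12] = successor_time;    // read back as brr[12]
}
```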

RECV operation:thought_balloon: The currently running process receives the operation object or array from the process listed as "operation.target".

MPI-parallelized:

```cpp
double brr[13];
if (operation.mode == DisCostiC::Mode_t::NONBLOCKING) {
    MPI_Irecv(&brr, 13, MPI_DOUBLE, MPI_ANY_SOURCE, operation.tag, newcomm, &request);
} else {
    MPI_Recv(&brr, 13, MPI_DOUBLE, MPI_ANY_SOURCE, operation.tag, newcomm, &status);
}
*deserialize(operation, temp, brr);
oSuccessor[operation.rank][operation.node] = brr[12];
cycle = true;
operation.numOpsInQueue = numOpsInQueue++;
finishedRankList.push_back(operation.rank);
operation.type = DisCostiC::Operation_t::MSG;
```

Serialized:

```cpp
finishedRankList.push_back(operation.rank);
DisCostiC::DisCostiC_queueOP matchedOP;
if (M.listmatch(operation, &UMQ[operation.rank], &matchedOP)) {
    ...
} else {
    DisCostiC::DisCostiC_queueOP nOp;
    nOp.bufSize = operation.bufSize;
    nOp.src = operation.target;
    nOp.tag = operation.tag;
    nOp.label = operation.label;
    recvQueue[operation.rank].push_back(nOp);
}
```

MSG operation:thought_balloon: If MSG is not found, the cycle will continue (cycle = true) until it is found.

MPI-parallelized:

```cpp
if ((earliestFinish = operation.time) <= ...) {
    ...
    operation.numOpsInQueue = numOpsInQueue++;
    ...
    iter_num++;
    cycle = false;
} else {
    ...
    cycle = true;
}
```

Serialized:

```cpp
if ((earliestFinish = operation.time) <= ...) {
    ...
    DisCostiC::DisCostiC_queueOP matchedOP;
    if (M.listmatch(operation, &recvQueue[operation.rank], &matchedOP)) {
        ...
    } else {
#ifdef USE_VERBOSE
        std::cout << "[MSG ==> NOT FOUND] in receive queue - add to unexpected queue" << std::endl;
#endif
        DisCostiC::DisCostiC_queueOP nOp;
        nOp.bufSize = operation.bufSize;
        nOp.src = operation.target;
        nOp.tag = operation.tag;
        nOp.label = operation.label;
        nOp.starttime = operation.time; // when it was started
        UMQ[operation.rank].push_back(nOp);
    }
} else {
    ...
    queue.emplace(operation);
}
```

Adding new sorted operations to the queue in order:thought_balloon:

MPI-parallelized:

```cpp
// (no code for this step in the MPI-parallelized version)
```

Serialized:

```cpp
newOps = false;
...
newOps = true;
op.numOpsInQueue = numOpsInQueue++;
...
queue.emplace(op);
```

End of while statement:thought_balloon: The do loops over the number of iterations and the number of steps per iteration end here. The necessary information gathered from all processes of the new communicator is then communicated to the default communicator's master process.

MPI-parallelized:

```cpp
    } while (cycle);
    cycle = true;
    operation_og.time = oSuccessor[operation.rank][operation.node];
    operation = copy(operation_og, operation);
    } while (iter_num <= iter);
    double arr[13];
    double brr[13];
    *serialize(operation, oSuccessor[rank][operation.node], arr);
    *finalize(oSuccessor[rank][operation.node], numRanks, numOpsInQueue, rank, operation.node, brr);
    MPI_Send(&brr, 13, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
} else if (process_Rank == 0) {
    bool cycle = true;
    double brr[13];
    while (test.size() < numRanks) {
        MPI_Recv(&brr, 13, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == 1) {
            test.emplace(status.MPI_SOURCE);
            oSuccessor[brr[3]][brr[4]] = brr[0];
            numOpsInQueue = brr[2];
        }
    }
}
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);
MPI_Barrier(MPI_COMM_WORLD);
```

Serialized:

```cpp
} while (!queue.empty() || newOps);
```

🚀 Planned features for further development

  • Threading model beyond message passing
  • Networking-level contention model
  • Energy consumption model

📚 References about the theory of potential application scenarios for DisCostiC

[1]: A. Afzal et al.: Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study. DOI:10.1109/CLUSTER.2019.8890995

[2]: A. Afzal et al.: Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs. DOI:10.1007/978-3-030-50743-5_20

[3]: A. Afzal et al.: Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact. DOI:10.1007/978-3-030-78713-4_19

[4]: A. Afzal et al.: The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs. DOI:10.1109/TPDS.2022.3221085

[5]: A. Afzal et al.: Analytic performance model for parallel overlapping memory-bound kernels. DOI:10.1002/cpe.6816

[6]: A. Afzal et al.: Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications. DOI:10.1007/978-3-031-30442-2_12

[7]: A. Afzal et al.: Making applications faster by asynchronous execution: Slowing down processes or relaxing MPI collectives. DOI:10.1016/j.future.2023.06.017

🔒 License

AGPL-3.0

:warning: Disclaimer

> [!NOTE]
> A note to the reader: please report any bugs to the issue tracker or contact ayesha.afzal@fau.de.

🔗 Acknowledgement

This work is funded by the KONWHIR project OMI4PAPPS.

📱Contact


@AyeshaAfzal91


Ayesha Afzal, Erlangen National High Performance Computing Center (NHR@FAU)

mailto: ayesha.afzal@fau.de

Owner

  • Name: RRZE-HPC
  • Login: RRZE-HPC
  • Kind: organization
  • Email: hpc@fau.de
  • Location: Erlangen, Germany

GitHub Events

Total
  • Watch event: 4
Last Year
  • Watch event: 4

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 28
  • Total Committers: 2
  • Avg Commits per committer: 14.0
  • Development Distribution Score (DDS): 0.25
Past Year
  • Commits: 28
  • Committers: 2
  • Avg Commits per committer: 14.0
  • Development Distribution Score (DDS): 0.25
Top Committers
Name Email Commits
AyeshaAfzal91 a****l@f****e 21
Ayesha Afzal 6****1 7
Committer Domains (Top 20 + Academic)
fau.de: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

staticanalysis/requirements.txt pypi
  • anytree ==2.8.0
  • typing_extensions ==4.4.0