cpu-performance-tests
This repository contains the code to benchmark CPU cache miss latency and branch misprediction penalty
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (7.1%) to scientific vocabulary
Keywords
Repository
This repository contains the code to benchmark CPU cache miss latency and branch misprediction penalty
Basic Info
Statistics
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
cpu-performance-tests
This repository contains the code to benchmark CPU cache miss latency and branch misprediction penalty.
Branch penalty
The pseudocode for the benchmark is as follows:
```cpp
// Fill an array with uniformly distributed integers in [0, 100)
// (needs <random> for the generator and <x86intrin.h> for __rdtsc)
std::mt19937 gen(std::random_device{}());
std::uniform_int_distribution<int> dist(0, 99);
int random[N]; // N is a compile-time constant
for (auto &value : random) { value = dist(gen); }

volatile int sum = 0; // volatile prevents vectorization and other optimisations

// read the timestamp counter before the loop
const auto start = __rdtsc();
for (int i = 0; i < N; ++i) {
    // the branch is taken only when random[i] < P
    if (random[i] < P) {
        sum += random[i];
    }
}
// read the timestamp counter after the loop
const auto stop = __rdtsc();
// divide the total by the number of iterations to get cycles per iteration
const auto cyclesPerIteration = double(stop - start) / N;
```
The code executes a simple addition in a loop. The addition is performed only if the random value is smaller than P, which makes it possible to tune the branching probability.
The volatile keyword prevents the compiler from optimizing away or vectorizing the loop. The elapsed clock cycles are measured with rdtsc, and the total is divided by the number of iterations to obtain the cycles per iteration. Clock cycles are used as the metric instead of wall-clock time because, over such short intervals, fluctuations in clock frequency caused by power saving and boost behavior can skew the results dramatically.
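
For reference, here is a minimal sketch of this measurement pattern (illustrative, not code from the repository). It assumes GCC or Clang on x86-64, where __rdtsc is available through <x86intrin.h> (MSVC provides it in <intrin.h>), and wraps the timed loop in a small helper that returns cycles per iteration.

```cpp
#include <x86intrin.h>  // __rdtsc on GCC/Clang (use <intrin.h> with MSVC)
#include <cstddef>
#include <cstdint>

// Bracket a loop with rdtsc and return the average clock cycles per iteration.
template <typename Body>
double cycles_per_iteration(Body &&body, std::size_t iterations) {
    const std::uint64_t start = __rdtsc();
    for (std::size_t i = 0; i < iterations; ++i) {
        body(i);  // the loop body under test
    }
    const std::uint64_t stop = __rdtsc();
    return double(stop - start) / double(iterations);
}
```

For example, `cycles_per_iteration([&](std::size_t i) { if (random[i] < P) sum += random[i]; }, N)` times the branching loop above.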

The results show a somewhat expected behavior: when the probability is not exactly 1/2, the branch predictor can guess the most likely outcome, reducing the cycles per iteration. When the probability is exactly 0 or 1, there is no stochastic behavior and each iteration executes in two or three clock cycles.
Branchless programming can be applied here to avoid the branch entirely. The result of a comparison evaluates to 0 or 1, so random[i] can simply be multiplied by the result of the comparison random[i] < P. The resulting code looks as follows:
```cpp
// Fill an array with uniformly distributed integers in [0, 100)
std::mt19937 gen(std::random_device{}());
std::uniform_int_distribution<int> dist(0, 99);
int random[N]; // N is a compile-time constant
for (auto &value : random) { value = dist(gen); }

volatile int sum = 0; // volatile prevents vectorization and other optimisations

// read the timestamp counter before the loop
const auto start = __rdtsc();
for (int i = 0; i < N; ++i) {
    // (random[i] < P) evaluates to 0 or 1, so no branch is needed
    sum += random[i] * (random[i] < P);
}
// read the timestamp counter after the loop
const auto stop = __rdtsc();
// divide the total by the number of iterations to get cycles per iteration
const auto cyclesPerIteration = double(stop - start) / N;
```

The results show that the branchless version is faster than the branching version in almost all cases; only towards a 100% branching probability does the branching version retain a slight advantage.
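
The following self-contained sketch compares the two variants for a single probability; N and P are hypothetical values chosen for illustration (the repository sweeps the branching probability instead), and the volatile/rdtsc methodology matches the description above.

```cpp
#include <x86intrin.h>  // __rdtsc (GCC/Clang on x86-64)
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t N = 1u << 24;  // number of elements (illustrative)
    constexpr int P = 50;                // branch taken when value < P, i.e. ~50%

    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 99);
    std::vector<int> values(N);
    for (auto &v : values) v = dist(gen);

    // Branching version: volatile sum prevents vectorization.
    volatile int branchy_sum = 0;
    auto start = __rdtsc();
    for (std::size_t i = 0; i < N; ++i) {
        if (values[i] < P) branchy_sum += values[i];
    }
    auto stop = __rdtsc();
    std::printf("branching:  %.2f cycles/iteration\n", double(stop - start) / double(N));

    // Branchless version: the comparison result (0 or 1) replaces the branch.
    volatile int branchless_sum = 0;
    start = __rdtsc();
    for (std::size_t i = 0; i < N; ++i) {
        branchless_sum += values[i] * (values[i] < P);
    }
    stop = __rdtsc();
    std::printf("branchless: %.2f cycles/iteration\n", double(stop - start) / double(N));
    return 0;
}
```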
Memory latency
The pseudocode for the benchmark is as follows:
```cpp
// A node is an index padded to a full cache line
struct alignas(64) Node { std::size_t index; };

// Build the array with nodes[i].index = i, then shuffle it with a Mersenne Twister
std::vector<Node> nodes(N);
for (std::size_t i = 0; i < N; ++i) { nodes[i].index = i; }
std::shuffle(nodes.begin(), nodes.end(), std::mt19937_64{std::random_device{}()});

auto index = 0UL;
for (std::size_t i = 0; i < num_ops; ++i) {
    // chase the next index: dependent random accesses defeat the prefetcher
    index = nodes[index].index;
}
```
The code begins by creating an array where array[i] = i, padded to a cache line, and then shuffling it with a Mersenne Twister engine.
The array is then traversed in a loop, with each element providing the index of the next access. This produces uniformly random accesses that the prefetcher cannot predict. It is a variation of the linked-list approach found here, which chases pointers through an actual linked list. The linked-list method did not work on my machine, produced unreliable results, and is moreover much slower.
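
Below is a minimal compilable sketch of this pointer-chasing idea. The padded node, the shuffled index array, and the dependent loads follow the description above; the working-set sizes, the number of operations, and the use of std::chrono for timing are assumptions for illustration and may differ from the repository.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// One cache line per element: an index plus implicit padding.
struct alignas(64) Node {
    std::size_t index;
};

// Average access latency in nanoseconds for a working set of `count` nodes.
double chase_latency_ns(std::size_t count, std::size_t num_ops) {
    std::vector<Node> nodes(count);
    for (std::size_t i = 0; i < count; ++i) nodes[i].index = i;
    std::shuffle(nodes.begin(), nodes.end(), std::mt19937_64{12345});

    volatile std::size_t sink = 0;  // keeps the chase from being optimized away
    std::size_t index = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < num_ops; ++i) {
        index = nodes[index].index;  // each load depends on the previous one
    }
    const auto stop = std::chrono::steady_clock::now();
    sink = index;
    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / double(num_ops);
}

int main() {
    // Sweep from 16 KiB to 256 MiB to cross the L1, L2, L3 and DRAM regimes.
    for (std::size_t bytes = 16u << 10; bytes <= (256u << 20); bytes <<= 1) {
        const std::size_t count = bytes / sizeof(Node);
        std::printf("%8zu KiB: %6.2f ns/access\n", bytes >> 10, chase_latency_ns(count, 10'000'000));
    }
    return 0;
}
```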

The results show:
- L1 latency = 1ns
- L2 latency = 10ns
- L3 latency = 25-30ns
- DRAM latency = 100-200ns
The DRAM latency varies between 100 and 200 ns because the tests are executed on a NUMA machine, where accesses to a remote memory node are slower than local ones.
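
As a side note (not part of the repository), one way to reduce this NUMA-related spread is to pin both the benchmark thread and its buffer to a single node. A hedged sketch using libnuma, assuming the library is installed and the binary is linked with -lnuma:

```cpp
#include <numa.h>    // libnuma; link with -lnuma
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    const int node = 0;                             // hypothetical: use node 0 only
    numa_run_on_node(node);                         // run this thread on the node's CPUs
    const std::size_t bytes = 256ull << 20;         // 256 MiB buffer (illustrative)
    void *buffer = numa_alloc_onnode(bytes, node);  // allocate memory on the same node
    if (buffer == nullptr) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    // ... run the pointer-chasing benchmark on `buffer` here ...
    numa_free(buffer, bytes);
    return 0;
}
```

With the thread and its memory bound to the same node, every access is local, which should keep the measured DRAM latency near the lower end of the range.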
Owner
- Name: Marco Barbone
- Login: DiamonDinoia
- Kind: user
- Location: New York
- Company: Simons Foundation
- Repositories: 2
- Profile: https://github.com/DiamonDinoia
Software Engineer | High Performance Computing | Monte Carlo simulations
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: CPU Performance Tests
message: >-
  This software contains the code to benchmark CPU cache
  miss latency and branch misprediction penalty.
type: software
authors:
  - given-names: Marco
    family-names: Barbone
    email: m.barbone19@imperial.ac.uk
    affiliation: Imperial College London
    orcid: 'https://orcid.org/0000-0003-3495-7997'
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Marco | m****9@i****k | 4 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0