cpu-performance-tests
This repository contains the code to benchmark CPU cache miss latency and branch misprediction penalty
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (7.1%) to scientific vocabulary
Keywords
Repository
This repository contains the code to benchmark CPU cache miss latency and branch misprediction penalty
Basic Info
Statistics
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
cpu-performance-tests
This repository contains the code to benchmark CPU cache miss latency and branch misprediction penalty.
Branch penalty
The pseudocode for the benchmark is as follows:
```cpp
// Fill an array with uniformly distributed integers in [0, 100)
// (needs <random> for the generator and <x86intrin.h> for __rdtsc)
std::mt19937 gen(std::random_device{}());
std::uniform_int_distribution<int> dist(0, 99);
int random[N]; // N is a compile-time constant
for (auto &value : random) { value = dist(gen); }

volatile int sum = 0; // volatile prevents vectorization and other optimisations

// read the timestamp counter before the loop
const auto start = __rdtsc();
for (int i = 0; i < N; ++i) {
    // the branch is taken only when random[i] < P
    if (random[i] < P) {
        sum += random[i];
    }
}
// read the timestamp counter after the loop
const auto stop = __rdtsc();
// divide the total by the number of iterations to get cycles per iteration
const auto cyclesPerIteration = double(stop - start) / N;
```
The code executes a simple addition in a loop. The addition is performed only if the random value is smaller than P, which makes it possible to tune the branching probability.
The volatile keyword prevents the compiler from optimizing away or vectorizing the loop. The elapsed clock cycles are measured with rdtsc, and the total is divided by the number of iterations to obtain the cycles per iteration. Clock cycles are used as the metric instead of wall-clock time because, over such short intervals, fluctuations in clock frequency caused by power saving and boost behavior can skew the results dramatically.
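
For reference, here is a minimal sketch of this measurement pattern (illustrative, not code from the repository). It assumes GCC or Clang on x86-64, where __rdtsc is available through <x86intrin.h> (MSVC provides it in <intrin.h>), and wraps the timed loop in a small helper that returns cycles per iteration.

```cpp
#include <x86intrin.h>  // __rdtsc on GCC/Clang (use <intrin.h> with MSVC)
#include <cstddef>
#include <cstdint>

// Bracket a loop with rdtsc and return the average clock cycles per iteration.
template <typename Body>
double cycles_per_iteration(Body &&body, std::size_t iterations) {
    const std::uint64_t start = __rdtsc();
    for (std::size_t i = 0; i < iterations; ++i) {
        body(i);  // the loop body under test
    }
    const std::uint64_t stop = __rdtsc();
    return double(stop - start) / double(iterations);
}
```

For example, `cycles_per_iteration([&](std::size_t i) { if (random[i] < P) sum += random[i]; }, N)` times the branching loop above.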

The results show a somewhat expected behavior: when the probability is not exactly 1/2, the branch predictor can guess the most likely outcome, reducing the cycles per iteration. When the probability is exactly 0 or 1, there is no stochastic behavior and each iteration executes in two or three clock cycles.
Branchless programming can be applied here to avoid the branch entirely. The result of a comparison evaluates to 0 or 1, so random[i] can simply be multiplied by the result of the comparison random[i] < P. The resulting code looks as follows:
```cpp
// Fill an array with uniformly distributed integers in [0, 100)
std::mt19937 gen(std::random_device{}());
std::uniform_int_distribution<int> dist(0, 99);
int random[N]; // N is a compile-time constant
for (auto &value : random) { value = dist(gen); }

volatile int sum = 0; // volatile prevents vectorization and other optimisations

// read the timestamp counter before the loop
const auto start = __rdtsc();
for (int i = 0; i < N; ++i) {
    // (random[i] < P) evaluates to 0 or 1, so no branch is needed
    sum += random[i] * (random[i] < P);
}
// read the timestamp counter after the loop
const auto stop = __rdtsc();
// divide the total by the number of iterations to get cycles per iteration
const auto cyclesPerIteration = double(stop - start) / N;
```

The results show that the branchless version is faster than the branching version in almost all cases; only towards a 100% branching probability does the branching version retain a slight advantage.
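
The following self-contained sketch compares the two variants for a single probability; N and P are hypothetical values chosen for illustration (the repository sweeps the branching probability instead), and the volatile/rdtsc methodology matches the description above.

```cpp
#include <x86intrin.h>  // __rdtsc (GCC/Clang on x86-64)
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t N = 1u << 24;  // number of elements (illustrative)
    constexpr int P = 50;                // branch taken when value < P, i.e. ~50%

    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 99);
    std::vector<int> values(N);
    for (auto &v : values) v = dist(gen);

    // Branching version: volatile sum prevents vectorization.
    volatile int branchy_sum = 0;
    auto start = __rdtsc();
    for (std::size_t i = 0; i < N; ++i) {
        if (values[i] < P) branchy_sum += values[i];
    }
    auto stop = __rdtsc();
    std::printf("branching:  %.2f cycles/iteration\n", double(stop - start) / double(N));

    // Branchless version: the comparison result (0 or 1) replaces the branch.
    volatile int branchless_sum = 0;
    start = __rdtsc();
    for (std::size_t i = 0; i < N; ++i) {
        branchless_sum += values[i] * (values[i] < P);
    }
    stop = __rdtsc();
    std::printf("branchless: %.2f cycles/iteration\n", double(stop - start) / double(N));
    return 0;
}
```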
Memory latency
The pseudocode for the benchmark is as follows:
```cpp
// A node is an index padded to a full cache line
struct alignas(64) Node { std::size_t index; };

// Build the array with nodes[i].index = i, then shuffle it with a Mersenne Twister
std::vector<Node> nodes(N);
for (std::size_t i = 0; i < N; ++i) { nodes[i].index = i; }
std::shuffle(nodes.begin(), nodes.end(), std::mt19937_64{std::random_device{}()});

auto index = 0UL;
for (std::size_t i = 0; i < num_ops; ++i) {
    // chase the next index: dependent random accesses defeat the prefetcher
    index = nodes[index].index;
}
```
The code begins by creating an array where array[i] = i, padded to a cache line, and then shuffling it with a Mersenne Twister engine.
The array is then traversed in a loop, with each element providing the index of the next access. This produces uniformly random accesses that the prefetcher cannot predict. It is a variation of the linked-list approach found here, which chases pointers through an actual linked list. The linked-list method did not work on my machine, produced unreliable results, and is moreover much slower.
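
Below is a minimal compilable sketch of this pointer-chasing idea. The padded node, the shuffled index array, and the dependent loads follow the description above; the working-set sizes, the number of operations, and the use of std::chrono for timing are assumptions for illustration and may differ from the repository.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// One cache line per element: an index plus implicit padding.
struct alignas(64) Node {
    std::size_t index;
};

// Average access latency in nanoseconds for a working set of `count` nodes.
double chase_latency_ns(std::size_t count, std::size_t num_ops) {
    std::vector<Node> nodes(count);
    for (std::size_t i = 0; i < count; ++i) nodes[i].index = i;
    std::shuffle(nodes.begin(), nodes.end(), std::mt19937_64{12345});

    volatile std::size_t sink = 0;  // keeps the chase from being optimized away
    std::size_t index = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < num_ops; ++i) {
        index = nodes[index].index;  // each load depends on the previous one
    }
    const auto stop = std::chrono::steady_clock::now();
    sink = index;
    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / double(num_ops);
}

int main() {
    // Sweep from 16 KiB to 256 MiB to cross the L1, L2, L3 and DRAM regimes.
    for (std::size_t bytes = 16u << 10; bytes <= (256u << 20); bytes <<= 1) {
        const std::size_t count = bytes / sizeof(Node);
        std::printf("%8zu KiB: %6.2f ns/access\n", bytes >> 10, chase_latency_ns(count, 10'000'000));
    }
    return 0;
}
```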

The results show:
- L1 latency = 1ns
- L2 latency = 10ns
- L3 latency = 25-30ns
- DRAM latency = 100-200ns
The DRAM latency varies between 100 and 200 ns because the tests are executed on a NUMA machine, where accesses to a remote memory node are slower than local ones.
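
As a side note (not part of the repository), one way to reduce this NUMA-related spread is to pin both the benchmark thread and its buffer to a single node. A hedged sketch using libnuma, assuming the library is installed and the binary is linked with -lnuma:

```cpp
#include <numa.h>    // libnuma; link with -lnuma
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    const int node = 0;                             // hypothetical: use node 0 only
    numa_run_on_node(node);                         // run this thread on the node's CPUs
    const std::size_t bytes = 256ull << 20;         // 256 MiB buffer (illustrative)
    void *buffer = numa_alloc_onnode(bytes, node);  // allocate memory on the same node
    if (buffer == nullptr) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    // ... run the pointer-chasing benchmark on `buffer` here ...
    numa_free(buffer, bytes);
    return 0;
}
```

With the thread and its memory bound to the same node, every access is local, which should keep the measured DRAM latency near the lower end of the range.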
Owner
- Name: Marco Barbone
- Login: DiamonDinoia
- Kind: user
- Location: New York
- Company: Simons Foundation
- Repositories: 2
- Profile: https://github.com/DiamonDinoia
Software Engineer | High Performance Computing | Monte Carlo simulations
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: CPU Performance Tests
message: >-
  This software contains the code to benchmark CPU cache
  miss latency and branch misprediction penalty.
type: software
authors:
  - given-names: Marco
    family-names: Barbone
    email: m.barbone19@imperial.ac.uk
    affiliation: Imperial College London
    orcid: 'https://orcid.org/0000-0003-3495-7997'
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Marco | m****9@i****k | 4 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0