https://github.com/cs683-iitb-autumn-2023/pa1-the-matrix-opcodeoutlaws
pa1-the-matrix-opcodeoutlaws created by GitHub Classroom
https://github.com/cs683-iitb-autumn-2023/pa1-the-matrix-opcodeoutlaws
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (6.3%) to scientific vocabulary
Repository
pa1-the-matrix-opcodeoutlaws created by GitHub Classroom
Basic Info
- Host: GitHub
- Owner: cs683-iitb-autumn-2023
- Language: C
- Default Branch: master
- Size: 29.3 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
PA1-The-Matrix
Perform matrix multiplication using various optimization techniques: - blocked matrix multiplication - SIMD instructions - software prefetching
and combination of these techniques.
Task 1: Blocked Matrix Multiplication
Implementation:
Basic Idea of blocking/tilling
- We have divided the matrices into small tiles and trying to reuse the blocks fetched in the cache
- To implement it we are using 6 for loops, where first three loops select which tiles to multiply and last three loops use normal matrix multiplication to multiply the selected tiles. #### Comparison with Normal Matrix Multiplication
- In normal matrix multiplication, while performing C = A * B, we access B in column-major order. If the size of a column is greater than the cache size whatever block of matrix B that will be brought into the cache while fetching elements of matrix B will be replaced and we will be facing a cache miss for every element access of matrix B
- The block that were brought into the cache while accessing B are not used expect for the 1st element. So the idea of Blocked matrix multiplication is to use the other elements of that block before that block is evicted from the cache.
#### Miss rate reduction using Blocking - Now in blocking the martix will be divided into small tiles (lets say of size B) and these tiles will be multiplied with each other instead of whole matrix.
- Now to compute a tile in the result matrix C, consider a tile Ci,j ,
We multipy ith row tiles of matrix A with jth column tiles of matrix B
As seen from the above image, to compute tile C1,1,
C<sub>1,1</sub> = A<sub>1,1</sub> x B<sub>1,1</sub> + A<sub>1,2</sub> x B<sub>2,1</sub>
- Now while accessing a tile of B in column major order, the blocks that are fetched in the cache won't be evicted till the whole tile is accessed. This is because the tile size will be small and fit completely into the cache.
- So while accessing the later elements (2nd column onwards) of that tile in column major order we won't get a miss as the blocks are already in the cache. This is how the blocks brought into the cache while accessing the tile of B are reused and will reduce the cache miss rate.
Observations:
- We have recorded the execution time, data cache miss rate and speed up for different matrix sizes and represented in the table below:
Analysis:
- It can seen from the table that we have a significant decrease in the miss rate specially for matrix of dimension 800. This is because the single column of martix will be very large. And each element access will bring a block (of size 64B) into the cache. So accessing just one column of the matrix B will be very large (800 * 64B) and won't fit completely in the cache. So the blocks fetched will be evicted and would result in unnecessary cache misses.
- This won't happen for matrix of dimension 100 and 200 as their coloumns will be of smaller size. Because of this we can see a high miss rate (4.5 %) for normal matrix multiplication in 800 dimension matrix. Blocking reduces the miss rate to 0.1 % in each case by reusing the fetched blocks and gives a good speedup.
Task 2: SIMD instructions
Implementation:
SIMD Registers
- We are using 128 bits wide registers for implementing SIMD instructions. One double is of 64 bits in size. Hence we can store two doubles adjacent to each other in a 64 bit register. Here's how it looks:
SIMD Instructions
- Here we'll see how SIMD multiplication works when we have two
_m128dregisters containing a total of four doubles.
Here is the list of SIMD functions we are using for our implementation:
_mm_setzero_pd_mm_loadu_pd_mm_mul_pd_mm_add_pd_mm_shuffle_pd_mm_storeu_pd
Informaton about each of them can be found here: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
Working
- Below is a snippet of how C1 and C2 is calculated in one step.
- We first load A11, A12 in one
_m128dregister. Alongside we also load A21, A22 in another_m128dregister. This is done using the_mm_loadu_pd. - We load 16 elements from matrix B into 8
_m128dregisters. - We need to shuffle the registers to align them for multiplication as shown in above figure. We use
_mm_shuffle_pdfor the same. - Next step is to multiply the aligned registers. This gives us multiplication of
A11×B11andA12×B12,A11×B21andA12×B22and so on. - We realign the registers using
_mm_shuffle_pdfor addition so that we can getA11 × B11+A12 × B12and so on. - We have applied loop unrolling to get some additional boost in performance.
- Finally we store the result to C11, C12, C13 & C14 as two registers using
_mm_storeu_pd
Observations:
- We have recorded the execution time, data cache miss rate and speed up for different matrix sizes and represented in the table below:
Analysis:
- First we tried to use more prefetch instruction in the 3rd for loop just before their access
Limitations:
- Because of some loop unrolling and the way we have accessed elements using
_mm_storeu_pd, the matrix multiplication works for dimensions which are multiples of 4.
Task 3: Software Prefetching
Software prefetching uses explicit prefetch instruction that needs to be inserted in the code. It can be used to minimize cache miss latency by prefetching the data blocks before their access.
Implementation:
__builtin_prefetchis a function provided by GCC that needs 3 arguments(memory address,rw,locality) among which rw & locality options are optional.- First we did loop unrolling such that more prefetch instruction can be inserted in the code.
- We have used
__builtin_prefetchinstruction in such a way that needed blocks should be brought to l1 cache ahead of their access. - To bring blocks ahead of their access we inserted prefetch instructions 1 iteration before the block access.
Observation:
- We have recorded the execution time, data cache miss rate and speed up for different matrix sizes and represented in the table below:
Analysis:
- There is a decent speedup, however the effect is majorly beacause of loop unrolling.
- Loop unrolling reduces the number of instructions being executed, thus reduing the total cycles spent while doing so. All this at an expense of slight reduction in code readability.
- We see the number of I refs has decreased in the prefetching implementation despite the fact that we aare adding the prefetch function calls.
- Upon refering multiple sources we have found that the
__builtin_prefetchdoes not affect the performance that much (although not using it properly can reduce the performance) because of the default optimizations of the compiler. (https://stackoverflow.com/questions/8460563/builtin-prefetch-how-much-does-it-read/8460606#8460606)
Bonus Task 1: Blocked Matrix Multiplication + SIMD instructions
This implementation takes advantage of both: blocking and simd instructions. We aim to reduce the number of multiplication operations and also reduce the miss rate with this implementation.
Implementation:
- We have started with the approach that we used for blocking
- And we changed the actual matrix multiplication taking place in the innermost for loop of blocking implementation with SIMD based multiplication.
- Some amount of loop unrolling is also done to acheive greater performance.
Observations:
- We have recorded the execution time, data cache miss rate and speed up for different matrix sizes and represented in the table below:
Limitations:
- Here we have limitations of SIMD implementation intersected with limitations of blocking implementation.
- The dimension of matrix should be divisible by 4 and also be divisible by
BLOCK_SIZE. In order to satisfy the requirement of optimizing matrix multiplication of dimensions 100, 200 and 800, we have decided theBLOCK_SIZEto be 20.
Bonus Task 2: Blocked Matrix Multiplication + Software Prefetching
Implementation:
Basic Idea
- Here we are using software prefetching along with blocked matrix multiplication. In blocked multiplication we already reduced the miss rate by reusing the blocks of matrix B.
- Here we try to prefetch the blocks in a tile of both matrices A and B to have a decrease in the miss penalty.
Miss rate reduction using blocking & prefetching
- Now to access a tile (lets say of size B x B) of either matrix A or matrix B, we get B2/8 misses.
- The cache line size of our machine is 64 Bytes and each element (type double) is 8 bytes, so every cache line will contain 8 elements.
- We prefetch the first element of each block so we get the whole block into the cache before we try to access the elements. We do this by unrolling the loop and use
__builtin_prefetchfor the 8th element in the current iteration and fetch 0th to 7th element normally. - So in the next iteration we already have the 8th element and so here we prefetch the 16th element and fetch elements 9th to 15th (as they are already loaded in the cache beacause of the prefetch in the previous iteration) and so on to the next iteration.
- So we should get less miss penalty for other blocks except for the first 1st block.
Observation:
- We have recorded the execution time, data cache miss rate and speed up for different matrix sizes and represented in the table below:
Analysis:
- Because of blocking we get a good improvement in the miss rate compared to normal matrix multiplication and prefetch improves miss penalty.
- As prefetch adds two extra prefetch instructions in the loop, we get an increase in total instruction references.
- Assuming that either becasue of compiler optimization or hardware prefetching we don't see that much effect of
__builtin_prefetchand the performance we see is not so different from that of blocking.
Bonus Task 3: SIMD instructions + Software Prefetching
Implementation:
- This combines the idea of software prefetching and SIMD instructions.
- We aim to fetch the blocks that are accessed in current loop at the begin before the elements are actually accessed. We expect this would improve the cache hits when performing the multiplications.
Observation:
- We have recorded the execution time, data cache miss rate and speed up for different matrix sizes and represented in the table below:
Analysis:
- The performance is not all that different from SIMD multiplications.
- Given our assumptions in the previous sections, we have not seen a significant improvement after adding software prefetching.
Limitations:
- Given this uses the logic from SIMD matrix multiplication code, this implementation only works for matrix dimensions which are multiples of 4.
Bonus Task 4: Bloced Matrix Multiplication + SIMD instructions + Software Prefetching
Implementation:
- Here we combine all the ideas for optimization
- We aim to reduce the miss rates significantly beacause of blocking while prefetching the blocks which are about to be accessed all the while reducing the total number of actual multiplication operations.
Observation:
- We have recorded the execution time, data cache miss rate and speed up for different matrix sizes and represented in the table below:
Analysis:
- The performance is not very different from SIMD+Blocking as we have seen this recurring pattern of software prefetching not providing a significant boost.
- The speedup decreases as the matrix dimension increases.
Limitations:
- Here we have limitations of SIMD implementation intersected with limitations of blocking implementation.
- The dimension of matrix should be divisible by 4 and also be divisible by
BLOCK_SIZE. In order to satisfy the requirement of optimizing matrix multiplication of dimensions 100, 200 and 800, we have decided theBLOCK_SIZEto be 20.
All optimizations comparison
- The below chart is plotted cosidering the highest speedup we acheived using all techniques and cache block size of 64 Bytes
- Here we can observe that we have acheived the best speedup at matrix dimension 100 for BLOCKING+SIMD.
- Even though we expected the BLOCKING+SIMD+PREFETCHING to provide the best speedup, we have observed that the PREFETCH has been a limiting factor because of pre-optimizations from compiler.
- The speedup decreases as the matrix dimension increases. We estimate this to be beacause of the fact that as the dimension increases we start receiving more and more page faults as the data that needs to be brought into the cache is so huge. Hence the speedup factor falls short against the large penalties of page faults.
Owner
- Name: cs683-iitb-autumn-2023
- Login: cs683-iitb-autumn-2023
- Kind: organization
- Repositories: 1
- Profile: https://github.com/cs683-iitb-autumn-2023