In parallel programs it is very important to consider cache size and hit rates on a single CPU, but it’s even more important to consider how the caches of multiple processors/cores interact. Let’s look at a single representative example, which demonstrates an important cache optimisation and emphasizes the value of good tools when it comes to performance optimisation in general.
Let’s first examine the sequential method. It performs the rudimentary task of summing all the elements in a two-dimensional array of integers and returns the result:
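    public static int MatrixSumSequential(int[,] matrix)
    {
        int sum = 0;
        // GetLength returns the element count of a dimension; GetUpperBound
        // returns the last valid index and would skip the final row and column here.
        int rows = matrix.GetLength(0);
        int cols = matrix.GetLength(1);
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                sum += matrix[i, j];
            }
        }
        return sum;
    }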
We could have used TPL, but let’s set aside the huge arsenal of tools it provides for this simple example. The following attempt at parallelisation may appear reasonable enough to harvest the fruits of multi-core execution, and it even implements a crude aggregation to avoid synchronisation on a shared sum variable:
    // Requires: using System.Linq; using System.Threading;
    public static int MatrixSumParallel(int[,] matrix)
    {
        int rows = matrix.GetLength(0);
        int cols = matrix.GetLength(1);
        const int THREADS = 4;
        int chunk = rows / THREADS; // assumes rows is divisible by THREADS
        int[] localSums = new int[THREADS];
        Thread[] threads = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++)
        {
            int start = chunk * i;
            int end = chunk * (1 + i);
            int threadNum = i; // copy the loop variable so each lambda captures its own value
            threads[i] = new Thread(() =>
            {
                for (int row = start; row < end; row++)
                {
                    for (int col = 0; col < cols; col++)
                    {
                        localSums[threadNum] += matrix[row, col];
                    }
                }
            });
            threads[i].Start();
        }
        // Join only after every thread has been started.
        foreach (var thread in threads) thread.Join();
        return localSums.Sum();
    }
Executing each of the two methods several times on an i7 machine with 6 cores produced the following results for a 2,000 x 2,000 matrix of integers:
* 325ms average for the sequential method
* 935ms for the parallel method. Almost three times as slow as the sequential method!
The obvious question is: why?
This is not an example of too fine-grained parallelism, because the number of threads is only 4. However, if you accept the premise that the problem is somehow cache related, it makes sense to measure the number of cache misses introduced by the two methods above.
The Visual Studio profiler, sampling the execution of each method with a 2,000 x 2,000 matrix, reported 963 exclusive samples in the parallel version and only 659 exclusive samples in the sequential version, with the vast majority of samples falling on the inner-loop line that reads from the matrix.
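Why would a line of code writing to localSums introduce so many more cache misses than writing to the local variable sum? The four elements of localSums sit next to each other in memory (16 bytes in total), so they very likely share a single cache line. Each write by one thread therefore invalidates that cache line at the other processors/cores, even though every thread only touches its own element, causing practically every += operation to be a cache miss. This effect is commonly known as false sharing.
When a processor writes to a memory location that is also held in the cache of another processor/core, the hardware performs a cache invalidation that marks the cache line as invalid in the other cache. The next access to that line from the other processor/core results in a cache miss.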
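One possible way to sidestep the invalidation traffic is sketched here, assuming each thread can keep its running total in a variable of its own and publish it to the shared array only once, when it finishes. The names FalseSharingFix and MatrixSumParallelLocal are made up for illustration, and whether this recovers the lost time on a given machine is something to measure rather than assume.

```csharp
using System;
using System.Linq;
using System.Threading;

public static class FalseSharingFix
{
    // Hypothetical variant of MatrixSumParallel: each thread sums into its own
    // local variable and writes to the shared array exactly once, at the end.
    public static int MatrixSumParallelLocal(int[,] matrix, int threadCount = 4)
    {
        int rows = matrix.GetLength(0);
        int cols = matrix.GetLength(1);
        int chunk = rows / threadCount;
        int[] localSums = new int[threadCount];
        Thread[] threads = new Thread[threadCount];

        for (int i = 0; i < threadCount; i++)
        {
            int start = chunk * i;
            // The last thread also takes any rows left over by the division.
            int end = (i == threadCount - 1) ? rows : chunk * (i + 1);
            int threadNum = i; // each lambda captures its own copy
            threads[i] = new Thread(() =>
            {
                int local = 0; // declared inside the lambda, so private to this thread
                for (int row = start; row < end; row++)
                {
                    for (int col = 0; col < cols; col++)
                    {
                        local += matrix[row, col];
                    }
                }
                localSums[threadNum] = local; // single write to the shared array
            });
            threads[i].Start();
        }

        // Wait for all threads after all of them have been started.
        foreach (var thread in threads) thread.Join();
        return localSums.Sum();
    }

    public static void Main()
    {
        // Assumed test data: a 2,000 x 2,000 matrix filled with ones.
        var matrix = new int[2000, 2000];
        for (int i = 0; i < 2000; i++)
            for (int j = 0; j < 2000; j++)
                matrix[i, j] = 1;

        Console.WriteLine(MatrixSumParallelLocal(matrix)); // prints 4000000
    }
}
```

The shared array is still there, but it is now written threadCount times in total instead of once per matrix element, so the cache line holding it should rarely bounce between cores. Padding the per-thread slots so that each lands on its own cache line would be another option.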
The moral of the story: do not blindly introduce parallelisation in the hope that it will also result in a performance increase. Always test both versions; you might be surprised at the results!
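Testing both versions can be as simple as the following sketch, which assumes a Stopwatch-based wall-clock average is good enough for a first comparison. The names TimingSketch, AverageMilliseconds and SumSequential are hypothetical, and this is not necessarily how the figures quoted above were obtained.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper: calls 'work' once as a warm-up (so JIT compilation
    // is not timed), then returns the average wall-clock time in milliseconds
    // over 'runs' further calls.
    public static double AverageMilliseconds(Func<int> work, int runs)
    {
        work(); // warm-up call, not timed
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            work();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / runs;
    }

    // Stand-in workload for the sketch: a plain sequential matrix sum.
    public static int SumSequential(int[,] matrix)
    {
        int sum = 0;
        int rows = matrix.GetLength(0);
        int cols = matrix.GetLength(1);
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                sum += matrix[i, j];
        return sum;
    }

    public static void Main()
    {
        var matrix = new int[2000, 2000]; // all zeros is fine for timing
        double ms = AverageMilliseconds(() => SumSequential(matrix), 10);
        Console.WriteLine($"Sequential: {ms:F1} ms on average");
    }
}
```

Passing each candidate method to the same helper, with the same matrix and the same number of runs, gives numbers that can be compared directly. A profiler is still needed to explain why one version is slower.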