Loop Parallelism
In this lecture, we learned different ways of expressing parallel loops. The most general way is to think of each iteration of a parallel loop as an async task, with a finish construct encompassing all iterations. This approach can support general cases such as parallelization of the following pointer-chasing while loop (in pseudocode):
    finish {
        while (p != null) {
            async compute(p);
            p = p.next;
        }
    }
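Although the notes express this pattern with the abstract async/finish notation, a rough translation into standard Java can use an ExecutorService: each submit call plays the role of async, and waiting on all of the returned futures plays the role of finish. This is only a sketch; the Node class and compute method below are hypothetical placeholders for the real list type and per-node work.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical list node; compute() stands in for the real per-node work.
    class Node {
        int value;
        Node next;
    }

    class PointerChasing {
        static void compute(Node p) {
            p.value *= 2; // placeholder computation
        }

        // Each submit() acts like async; waiting on every Future acts like finish.
        static void computeAll(Node head) throws Exception {
            ExecutorService pool = Executors.newCachedThreadPool();
            List<Future<?>> tasks = new ArrayList<>();
            for (Node p = head; p != null; p = p.next) {
                final Node node = p; // effectively final copy for the lambda
                tasks.add(pool.submit(() -> compute(node)));
            }
            for (Future<?> f : tasks) {
                f.get(); // "finish": block until each spawned task completes
            }
            pool.shutdown();
        }
    }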
However, further efficiencies can be gained by paying attention to counted-for loops for which the number of iterations is known on entry to the loop (before the loop executes its first iteration). We then learned the forall notation for expressing parallel counted-for loops, such as in the following vector addition statement (in pseudocode):
    forall (i : [0:n-1])
        a[i] = b[i] + c[i]
We also discussed the fact that Java streams can be an elegant way of specifying parallel loop computations that produce a single output array, e.g., by rewriting the vector addition statement as follows:
    a = IntStream.rangeClosed(0, N-1).parallel().toArray(i -> b[i] + c[i]);
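Note that IntStream.rangeClosed(0, N-1) iterates over the inclusive range 0..N-1, which is equivalent to IntStream.range(0, N); either form expresses the same forall iteration space. When the result should be written into an existing array rather than a newly created one, the same forall can also be approximated with forEach, as in this minimal sketch (the class and method names are my own):

    import java.util.stream.IntStream;

    class VectorAdd {
        // Parallel counted loop: each index i is an independent iteration,
        // mirroring forall (i : [0:n-1]) a[i] = b[i] + c[i].
        static void add(double[] a, double[] b, double[] c) {
            IntStream.range(0, a.length)
                     .parallel()
                     .forEach(i -> a[i] = b[i] + c[i]);
        }
    }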
Parallel Matrix Multiplication
In this lecture, we reminded ourselves of the formula for multiplying two n × n matrices, a and b, to obtain a product matrix, c, whose elements are defined by \[ c_{ij} = \sum^{n-1}_{k=0} a_{ik} \cdot b_{kj} \]
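For example, with n = 2, the top-left element of the product is \( c_{00} = a_{00} \cdot b_{00} + a_{01} \cdot b_{10} \).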
This formula can be easily translated to a simple sequential algorithm for matrix multiplication as follows (with pseudocode for counted-for loops):
    for (i : [0:n-1]) {
        for (j : [0:n-1]) {
            c[i][j] = 0;
            for (k : [0:n-1]) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
Upon closer inspection, we can see that it is safe to convert the for-i and for-j loops into forall loops, but the for-k loop must remain sequential to avoid data races on the accumulation into c[i][j].
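A minimal Java sketch of this parallelization, assuming preallocated square double[][] matrices (only the i-loop is made parallel here; the j-loop could be handled the same way):

    import java.util.stream.IntStream;

    class MatMul {
        // The i-loop runs as a parallel stream; the k-loop stays sequential
        // because it accumulates into c[i][j].
        static void multiply(double[][] a, double[][] b, double[][] c, int n) {
            IntStream.range(0, n).parallel().forEach(i -> {
                for (int j = 0; j < n; j++) {
                    c[i][j] = 0;
                    for (int k = 0; k < n; k++) {
                        c[i][j] += a[i][k] * b[k][j];
                    }
                }
            });
        }
    }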
Barriers in Parallel Loops
In this lecture, we learned the barrier construct through a simple example that began with the following forall parallel loop (in pseudocode):
    forall (i : [0:n-1]) {
        myId = lookup(i); // convert int to a string
        print HELLO, myId;
        print BYE, myId;
    }
We discussed the fact that the HELLO’s and BYE’s from different forall iterations may be interleaved in the printed output, e.g., some HELLO’s may follow some BYE’s. Then, we showed how inserting a barrier between the two print statements could ensure that all HELLO’s would be printed before any BYE’s.
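In standard Java, this barrier can be approximated with java.util.concurrent.CyclicBarrier, using one thread per forall iteration; a minimal sketch (the choice of n = 4 is arbitrary):

    import java.util.concurrent.CyclicBarrier;

    class HelloBye {
        public static void main(String[] args) throws InterruptedException {
            int n = 4; // arbitrary number of "forall iterations"
            CyclicBarrier barrier = new CyclicBarrier(n);
            Thread[] threads = new Thread[n];
            for (int i = 0; i < n; i++) {
                final int myId = i;
                threads[i] = new Thread(() -> {
                    System.out.println("HELLO " + myId);
                    try {
                        barrier.await(); // no BYE can start until every HELLO has printed
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                    System.out.println("BYE " + myId);
                });
                threads[i].start();
            }
            for (Thread t : threads) t.join();
        }
    }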
Thus, barriers extend a parallel loop by dividing its execution into a sequence of phases. While it may be possible to write a separate forall loop for each phase, it is both more convenient and more efficient to instead insert barriers (using the next notation) in a single forall loop; for example, if we split the above forall into two loops, we would need to create an intermediate data structure to communicate the myId values from the first forall to the second. Barriers are a fundamental construct for parallel loops and are used in a majority of real-world parallel applications.
One-Dimensional Iterative Averaging
In this lecture, we discussed a simple stencil computation to solve the recurrence, \(x_{i} = \frac{x_{i-1} + x_{i+1}}{2}\) with boundary conditions, \(x_{0} = 0\) and \(x_{n} = 1\). This recurrence can be solved iteratively with the following sequential pseudocode, in which each iteration computes a new value for every interior element from the previous iteration's values:

    for (iter : [0:nsteps-1]) {
        for (i : [1:n-1]) {
            newX[i] = (oldX[i-1] + oldX[i+1]) / 2;
        }
        swap pointers newX and oldX;
    }
A naive approach to parallelizing this method would result in the following pseudocode:
    for (iter : [0:nsteps-1]) {
        forall (i : [1:n-1]) {
            newX[i] = (oldX[i-1] + oldX[i+1]) / 2;
        }
        swap pointers newX and oldX;
    }
Though easy to understand, this approach creates nsteps × (n − 1) tasks, far more than the number of available processor cores. Barriers can help reduce the number of tasks created as follows:
    forall (i : [1:n-1]) {
        localNewX = newX; localOldX = oldX;
        for (iter : [0:nsteps-1]) {
            localNewX[i] = (localOldX[i-1] + localOldX[i+1]) / 2;
            next; // barrier
            swap pointers localNewX and localOldX;
        }
    }
In this case, only (n − 1) tasks are created, and there are nsteps barrier (next) operations, each of which involves all (n − 1) tasks. This is a significant improvement since creating tasks is usually more expensive than performing barrier operations.
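One way to realize this barrier-based version in standard Java is with CyclicBarrier standing in for next, and double buffering standing in for the pointer swap; a sketch, assuming x[0] and x[n] hold the boundary values:

    import java.util.concurrent.CyclicBarrier;

    class IterativeAveraging {
        // One thread per interior element; a CyclicBarrier separates iterations
        // so every task finishes its time step before any task starts the next.
        static void run(double[] x, int nsteps) throws InterruptedException {
            int n = x.length - 1;              // x[0] and x[n] are boundary values
            double[][] buf = { x, x.clone() }; // double buffering instead of pointer swaps
            CyclicBarrier barrier = new CyclicBarrier(n - 1);
            Thread[] threads = new Thread[n - 1];
            for (int t = 0; t < n - 1; t++) {
                final int i = t + 1;
                threads[t] = new Thread(() -> {
                    try {
                        for (int iter = 0; iter < nsteps; iter++) {
                            double[] oldX = buf[iter % 2], newX = buf[(iter + 1) % 2];
                            newX[i] = (oldX[i - 1] + oldX[i + 1]) / 2;
                            barrier.await(); // the "next" operation
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
                threads[t].start();
            }
            for (Thread th : threads) th.join();
            // After an odd number of steps, the latest values are in the clone.
            if (nsteps % 2 == 1) System.arraycopy(buf[1], 0, x, 0, x.length);
        }
    }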
Iteration Grouping: Chunking of Parallel Loops
In this lecture, we revisited the parallel vector addition example:

    forall (i : [0:n-1])
        a[i] = b[i] + c[i]

We observed that this approach creates n tasks, one per forall iteration, which is wasteful when (as is common in practice) n is much larger than the number of available processor cores.
To address this problem, we learned a common tactic used in practice referred to as loop chunking or iteration grouping, which reduces the number of tasks created to be closer to the number of processor cores, so as to reduce the overhead of parallel execution.
With iteration grouping/chunking, the parallel vector addition example above can be rewritten as follows:
    forall (g : [0:ng-1])
        for (i : mygroup(g, ng, [0:n-1], STRATEGY))
            a[i] = b[i] + c[i]
Note that we have reduced the degree of parallelism from n to the number of groups, ng, which now equals the number of iterations/tasks in the forall construct.
There are two well-known approaches for iteration grouping: block and cyclic. The former approach (block) maps consecutive iterations to the same group, whereas the latter approach (cyclic) maps iterations in the same congruence class (mod ng) to the same group, as illustrated in the sketch below.
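As a concrete illustration, here is a minimal Java sketch of both strategies for the vector addition example (the class and method names are my own, and ng would typically be chosen close to Runtime.getRuntime().availableProcessors()):

    import java.util.stream.IntStream;

    class ChunkedVectorAdd {
        // Block chunking: group g handles a contiguous range of iterations.
        static void addBlocked(double[] a, double[] b, double[] c, int ng) {
            int n = a.length;
            IntStream.range(0, ng).parallel().forEach(g -> {
                int lo = (g * n) / ng, hi = ((g + 1) * n) / ng; // boundaries for group g
                for (int i = lo; i < hi; i++) a[i] = b[i] + c[i];
            });
        }

        // Cyclic chunking: group g handles iterations g, g+ng, g+2*ng, ...
        static void addCyclic(double[] a, double[] b, double[] c, int ng) {
            int n = a.length;
            IntStream.range(0, ng).parallel().forEach(g -> {
                for (int i = g; i < n; i += ng) a[i] = b[i] + c[i];
            });
        }
    }

With these concepts, you should now have a better understanding of how to execute forall loops in practice with lower overhead.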