Table of Contents
- Split-phase Barriers with Java Phasers
- Point-to-Point Synchronization with Phasers
- One-Dimensional Iterative Averaging with Phasers
- Pipeline Parallelism
- Data Flow Parallelism
Split-phase Barriers with Java Phasers
In this lecture, we examined a variant of the barrier example that we studied earlier:

forall (i : [0:n-1]) {
print HELLO, i;
myId = lookup(i); // convert int to a string
print BYE, myId;
}
We learned about Java's Phaser class, and that the operation ph.arriveAndAwaitAdvance() can be used to implement a barrier through phaser object ph. We also observed that there are two possible positions for inserting a barrier between the two print statements above: before or after the call to lookup(i). However, upon closer examination, we can see that the call to lookup(i) is local to iteration i and that there is no specific need to either complete it before the barrier or to complete it after the barrier. In fact, the call to lookup(i) can be performed in parallel with the barrier. To facilitate this split-phase barrier (also known as a fuzzy barrier), we use two separate APIs from Java's Phaser class, ph.arrive() and ph.awaitAdvance(). Together these two APIs form a barrier, but we now have the freedom to insert a computation such as lookup(i) between the two calls as follows:
// Initialize phaser ph for use by n tasks ("parties")
Phaser ph = new Phaser(n);
forall (i : [0:n-1]) {
  print HELLO, i;
  int phase = ph.arrive();  // signal arrival; returns the current phase number
  myId = lookup(i);         // convert int to a string, overlapped with the barrier
  ph.awaitAdvance(phase);   // block until all parties have arrived
  print BYE, myId;
}
A "party" is Phaser terminology for a participating thread (task); when a party arrives, it means that the thread has reached the synchronization barrier.
Phaser Understanding
Each registered party must arrive at the phaser before the current phase can advance; once all registered parties have arrived, the phase number is incremented and any waiting parties are released.
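The behavior of parties and the split-phase barrier can be illustrated with a minimal, self-contained Java sketch; the class name, the choice of four parties, and the use of an ArrayList of worker threads are illustrative assumptions rather than code from the lecture.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Phaser;

public class SplitPhaseBarrierDemo {
    public static void main(String[] args) throws InterruptedException {
        final int n = 4;                  // number of parties (worker threads)
        final Phaser ph = new Phaser(n);  // all n parties registered up front

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                System.out.println("HELLO " + id);
                int phase = ph.arrive();            // signal arrival without blocking
                String myId = Integer.toString(id); // local work overlapped with the barrier
                ph.awaitAdvance(phase);             // block until all parties have arrived
                System.out.println("BYE " + myId);
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) t.join();
    }
}

Because each thread performs its lookup-style local work between arrive() and awaitAdvance(phase), that work overlaps with the barrier instead of being serialized before or after it.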
Splitting the barrier in this way enables the barrier processing to occur in parallel with the call to lookup(i), which was our desired outcome.
Point-to-Point Synchronization with Phasers
In this lecture, we looked at a parallel program example in which the span (critical path length) would be 6 units of time if we used a barrier, but is reduced to 5 units of time if we use individual phasers as shown in the following table:
| Task0 | Task1 | Task2 |
| --- | --- | --- |
| 1a: X=A(); //cost=1 | 1b: Y=B(); //cost=2 | 1c: Z=C(); //cost=3 |
| 2a: ph0.arrive(); | 2b: ph1.arrive(); | 2c: ph2.arrive(); |
| 3a: ph1.awaitAdvance(0); | 3b: ph0.awaitAdvance(0); | 3c: ph1.awaitAdvance(0); |
| 4a: D(X,Y); //cost=3 | 4b: ph2.awaitAdvance(0); | 4c: F(Y,Z); //cost=1 |
|  | 5b: E(X,Y,Z); //cost=2 |  |
Each column in the table represents the execution of a separate task, and the calls to arrive() and awaitAdvance(0) represent synchronization across different tasks via phaser objects ph0, ph1, and ph2, each of which is initialized with a party count of 1 (only one signalling task). (The parameter 0 in awaitAdvance(0) represents a transition from phase 0 to phase 1.)
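The table can be transcribed into plain Java threads as a sketch; the thread-per-task mapping and the placeholder comments standing in for the calls to A() through F() are assumptions made for illustration.

import java.util.concurrent.Phaser;

public class PointToPointPhasers {
    public static void main(String[] args) throws InterruptedException {
        // One phaser per task, each with a single signalling party
        Phaser ph0 = new Phaser(1), ph1 = new Phaser(1), ph2 = new Phaser(1);

        Thread task0 = new Thread(() -> {
            /* X = A();  cost=1 */
            ph0.arrive();
            ph1.awaitAdvance(0);  // wait for Task1 to complete phase 0
            /* D(X, Y);  cost=3 */
        });
        Thread task1 = new Thread(() -> {
            /* Y = B();  cost=2 */
            ph1.arrive();
            ph0.awaitAdvance(0);  // wait for Task0 to complete phase 0
            ph2.awaitAdvance(0);  // wait for Task2 to complete phase 0
            /* E(X, Y, Z);  cost=2 */
        });
        Thread task2 = new Thread(() -> {
            /* Z = C();  cost=3 */
            ph2.arrive();
            ph1.awaitAdvance(0);  // wait for Task1 to complete phase 0
            /* F(Y, Z);  cost=1 */
        });

        task0.start(); task1.start(); task2.start();
        task0.join();  task1.join();  task2.join();
    }
}

Because Task0 only waits on ph1 (not on ph2), it can start D(X,Y) as soon as B() finishes rather than waiting for C(), which is what reduces the span from 6 to 5 units of time.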
One-Dimensional Iterative Averaging with Phasers
In this lecture, we revisited the barrier-based Iterative Averaging example that we studied earlier, and observed that a full barrier is not necessary since forall iteration i only needs to wait for iterations i − 1 and i + 1 to complete their current phase before iteration i can move to its next phase. This idea can be captured by phasers, if we allocate an array of phasers as follows:
// Allocate array of phasers
Phaser[] ph = new Phaser[n + 2];  // one phaser per array element (including boundaries)
for (int i = 0; i < ph.length; i++) ph[i] = new Phaser(1);  // single signalling party each

// Main computation
forall (i : [1:n]) {
  for (iter : [0:nsteps-1]) {
    newX[i] = (oldX[i-1] + oldX[i+1]) / 2.0;
    ph[i].arrive();                          // signal that newX[i] is available
    if (i > 1) ph[i-1].awaitAdvance(iter);   // wait for left neighbor's current phase
    if (i < n) ph[i+1].awaitAdvance(iter);   // wait for right neighbor's current phase
    swap pointers newX and oldX;
  }
}
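As a concrete sketch, the pseudocode above can be mapped onto plain Java threads, one per interior element; the array size, the number of steps, the boundary values, and the double-buffering scheme below are assumptions made for the example.

import java.util.Arrays;
import java.util.concurrent.Phaser;

public class OneDimAveragingPhasers {
    public static void main(String[] args) throws InterruptedException {
        final int n = 8, nsteps = 100;
        final double[][] x = { new double[n + 2], new double[n + 2] };
        x[0][n + 1] = 1.0;
        x[1][n + 1] = 1.0;   // fixed boundary condition, present in both buffers

        final Phaser[] ph = new Phaser[n + 2];
        for (int i = 0; i < ph.length; i++) ph[i] = new Phaser(1);  // one signalling party each

        Thread[] threads = new Thread[n];
        for (int i = 1; i <= n; i++) {
            final int idx = i;
            threads[i - 1] = new Thread(() -> {
                for (int iter = 0; iter < nsteps; iter++) {
                    double[] oldX = x[iter % 2], newX = x[(iter + 1) % 2];
                    newX[idx] = (oldX[idx - 1] + oldX[idx + 1]) / 2.0;
                    ph[idx].arrive();                             // signal newX[idx] is ready
                    if (idx > 1) ph[idx - 1].awaitAdvance(iter);  // wait for left neighbor
                    if (idx < n) ph[idx + 1].awaitAdvance(iter);  // wait for right neighbor
                }
            });
            threads[i - 1].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(Arrays.toString(x[nsteps % 2]));
    }
}

Note that each thread only synchronizes with its two neighbors, so in a given phase a slow element delays at most its immediate neighbors rather than every element, which is the advantage over a full barrier.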
As we learned earlier, grouping/chunking of parallel iterations in a forall can be an important consideration for performance (due to reduced overhead). The idea of grouping of parallel iterations can be extended to forall loops with phasers as follows:
// Allocate array of phasers proportional to number of chunked tasks
Phaser[] ph = new Phaser[tasks + 2];
for (int i = 0; i < ph.length; i++) ph[i] = new Phaser(1);  // single signalling party each

// Main computation
forall (i : [0:tasks-1]) {
  for (iter : [0:nsteps-1]) {
    // Compute the leftmost and rightmost boundary elements of this chunk first
    int left = i * (n / tasks) + 1;
    newX[left] = (oldX[left-1] + oldX[left+1]) / 2.0;
    int right = (i + 1) * (n / tasks);
    newX[right] = (oldX[right-1] + oldX[right+1]) / 2.0;

    // Signal that this chunk's boundary elements are available
    int index = i + 1;
    ph[index].arrive();

    // Compute the chunk's interior elements in parallel with the split-phase barrier
    for (int j = left + 1; j <= right - 1; j++)
      newX[j] = (oldX[j-1] + oldX[j+1]) / 2.0;

    // Wait for the neighboring chunks to complete their current phase
    if (index > 1) ph[index-1].awaitAdvance(iter);
    if (index < tasks) ph[index+1].awaitAdvance(iter);
    swap pointers newX and oldX;
  }
}
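The index arithmetic that maps a chunk to its range of elements can be checked with a small sketch; the helper names are assumptions, and the last chunk is given any remainder elements since the pseudocode above implicitly assumes that n is divisible by tasks.

public class ChunkBounds {
    // Leftmost element owned by chunk i (elements are numbered 1..n)
    static int left(int i, int n, int tasks) {
        return i * (n / tasks) + 1;
    }
    // Rightmost element owned by chunk i; the last chunk absorbs the remainder
    static int right(int i, int n, int tasks) {
        return (i == tasks - 1) ? n : (i + 1) * (n / tasks);
    }
    public static void main(String[] args) {
        int n = 10, tasks = 3;
        for (int i = 0; i < tasks; i++)
            System.out.println("chunk " + i + " owns elements [" + left(i, n, tasks)
                    + ", " + right(i, n, tasks) + "] and signals phaser ph[" + (i + 1) + "]");
    }
}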
Pipeline Parallelism
In this lecture, we studied how point-to-point synchronization can be used to build a one-dimensional pipeline with p tasks (stages), T0, ..., Tp−1. For example, three important stages in a medical imaging pipeline are denoising, registration, and segmentation.
We performed a simplified analysis of the WORK and SPAN for pipeline parallelism as follows. Let n be the number of input items and p the number of stages in the pipeline. Then WORK = n × p is the total work that must be done for all data items, and CPL = n + p − 1 is the span or critical path length for the pipeline. Thus, the ideal parallelism is PAR = WORK / CPL = np / (n + p − 1). This formula can be validated by considering a few boundary cases. When p = 1, the ideal parallelism degenerates to PAR = 1, which confirms that the computation is sequential when only one stage is available. Likewise, when n = 1, the ideal parallelism again degenerates to PAR = 1, which confirms that the computation is sequential when only one data item is available. When n is much larger than p (n ≫ p), the ideal parallelism approaches PAR = p in the limit, which is the best possible case.
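As a quick numerical instance of this formula, with n = 100 input items and p = 4 stages we get WORK = 400, CPL = 100 + 4 − 1 = 103, and PAR = 400/103 ≈ 3.9, which is already close to the upper bound of p = 4.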
The synchronization required for pipeline parallelism can be implemented using phasers by allocating an array of phasers, such that phaser ph[i] is "signalled" in iteration i by a call to ph[i].arrive() as follows:
// Code for pipeline stage i
while (there is an input to be processed) {
  if (i > 0) ph[i-1].awaitAdvance();  // wait for the previous stage, if any
  process the input;
  ph[i].arrive();                     // signal the next stage
}
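A runnable sketch of this scheme with plain Java threads follows; the stage and item counts are arbitrary, and passing the item index as the explicit phase argument to awaitAdvance() is an assumption that makes the per-item handshake concrete.

import java.util.concurrent.Phaser;

public class PhaserPipeline {
    public static void main(String[] args) throws InterruptedException {
        final int p = 3;   // number of stages
        final int n = 5;   // number of input items
        final Phaser[] ph = new Phaser[p];
        for (int i = 0; i < p; i++) ph[i] = new Phaser(1);  // one signalling party per stage

        Thread[] stages = new Thread[p];
        for (int i = 0; i < p; i++) {
            final int stage = i;
            stages[i] = new Thread(() -> {
                for (int j = 0; j < n; j++) {
                    if (stage > 0) ph[stage - 1].awaitAdvance(j);  // wait for the previous stage to finish item j
                    System.out.println("stage " + stage + " processed item " + j);
                    ph[stage].arrive();                            // signal that this stage finished item j
                }
            });
            stages[i].start();
        }
        for (Thread t : stages) t.join();
    }
}

After a startup delay of p − 1 items, all p stages work on different items at the same time, which is where the np / (n + p − 1) parallelism comes from.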
Data Flow Parallelism
Thus far, we have studied computation graphs as structures that are derived from parallel programs. In this lecture, we studied a dual approach advocated in the data flow parallelism model, which is to specify parallel programs as computation graphs. The simple data flow graph studied in the lecture consisted of five nodes and four edges: A → C, A → D, B → D, B → E. While futures can be used to generate such a computation graph, e.g., by including calls to A.get() and B.get() in task D, the computation graph edges are implicit in the get() calls when using futures. Instead, we introduced the asyncAwait notation to specify a task along with an explicit set of preconditions (events that the task must wait for before it can start execution). With this approach, the program can be generated directly from the computation graph as follows:
async( () -> {/* Task A */; A.put(); } );     // Complete task and trigger event A
async( () -> {/* Task B */; B.put(); } );     // Complete task and trigger event B
asyncAwait(A, () -> {/* Task C */} );         // Only execute task after event A is triggered
asyncAwait(A, B, () -> {/* Task D */} );      // Only execute task after events A and B are triggered
asyncAwait(B, () -> {/* Task E */} );         // Only execute task after event B is triggered
Interestingly, the order of the above statements is not significant. Just as a graph can be defined by enumerating its edges in any order, the above data flow program can be rewritten as follows, without changing its meaning:
asyncAwait(A, () -> {/* Task C */} );         // Only execute task after event A is triggered
asyncAwait(A, B, () -> {/* Task D */} );      // Only execute task after events A and B are triggered
asyncAwait(B, () -> {/* Task E */} );         // Only execute task after event B is triggered
async( () -> {/* Task A */; A.put(); } );     // Complete task and trigger event A
async( () -> {/* Task B */; B.put(); } );     // Complete task and trigger event B
Finally, we observed that the power and elegance of data flow parallel programming is accompanied by the possibility of a lack of progress that can be viewed as a form of "deadlock" if the program omits a put() call for signalling an event.
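The same data flow graph can be sketched with the standard java.util.concurrent.CompletableFuture API in place of the asyncAwait notation used in the lecture; here aEvent.complete(null) plays the role of A.put(), and omitting such a call would leave the dependent tasks blocked forever, mirroring the "deadlock" just described. The class and variable names are illustrative.

import java.util.concurrent.CompletableFuture;

public class DataFlowDemo {
    public static void main(String[] args) {
        CompletableFuture<Void> aEvent = new CompletableFuture<>();  // event A
        CompletableFuture<Void> bEvent = new CompletableFuture<>();  // event B

        // Complete the tasks and trigger their events (analogous to A.put() and B.put())
        CompletableFuture.runAsync(() -> { System.out.println("Task A"); aEvent.complete(null); });
        CompletableFuture.runAsync(() -> { System.out.println("Task B"); bEvent.complete(null); });

        // Declare each task's preconditions explicitly, mirroring the edges A->C, A->D, B->D, B->E
        CompletableFuture<Void> c = aEvent.thenRunAsync(() -> System.out.println("Task C"));
        CompletableFuture<Void> d = aEvent.runAfterBothAsync(bEvent, () -> System.out.println("Task D"));
        CompletableFuture<Void> e = bEvent.thenRunAsync(() -> System.out.println("Task E"));

        CompletableFuture.allOf(c, d, e).join();  // wait for the whole graph to complete
    }
}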