## MPI (Message Passing Interface)

• With MPI, coordinating processes running on different hosts requires installing MPI software (e.g., MPICH).
• MPI is a standard rather than a specific implementation; there are many concrete implementations, such as MPICH and OpenMPI.
• It is a message-passing programming model and, as the name suggests, is dedicated to inter-process communication.

The way MPI works is easy to understand. We launch a group of processes at the same time; within the same communicator, each process has its own rank, and the programmer can use the interfaces MPI provides to assign different tasks to processes with different ranks and to help the processes communicate with one another, so that together they complete a single job. It is like a foreman who gives his workers ID numbers and then, following a plan, assigns tasks by worker number and has the workers coordinate with each other to finish the job.

mpi4py is a powerful library that implements many of the interfaces in the MPI standard, including:

1. Point-to-point communication
2. Collective communication within a group
3. Non-blocking communication
4. Persistent (repeated non-blocking) communication
5. Inter-group communication, and more

### How MPI Works

`-np 5` tells `mpiexec` to launch 5 MPI processes to run the program that follows. This is equivalent to making 5 copies of the script, with each process running one copy without interfering with the others. At run time, the only difference between the copies is each process's rank, i.e., its ID. So the code prints "hello world" 5 times, together with 5 different rank values, from 0 to 4.

### Collective Communication

#### Gather

The root process distributes the data evenly to all processes via `scatter`. Once every process has finished its work (here, simply multiplying by 2), the root process collects their results via `gather`, which, following the same ordering used for distribution, assembles them into a list.

## Single Program Multiple Data (SPMD) Model

Lecture Summary: In this lecture, we studied the Single Program Multiple Data (SPMD) model, which can enable the use of a cluster of distributed nodes as a single parallel computer. Each node in such a cluster typically consists of a multicore processor, a local memory, and a network interface card (NIC) that enables it to communicate with other nodes in the cluster. One of the biggest challenges that arises when trying to use the distributed nodes as a single parallel computer is that of data distribution. In general, we would want to allocate large data structures that span multiple nodes in the cluster; this logical view of data structures is often referred to as a global view. However, a typical physical implementation of this global view on a cluster is obtained by distributing pieces of the global data structure across different nodes, so that each node has a local view of the piece of the data structure allocated in its local memory. In many cases in practice, the programmer has to undertake the conceptual burden of mapping back and forth between the logical global view and the physical local views. Since there is one logical program that executes on the individual pieces of data, this abstraction of a cluster is referred to as the Single Program Multiple Data (SPMD) model.

In this module, we will focus on a commonly used implementation of the SPMD model, referred to as the Message Passing Interface (MPI). When using MPI, you designate a fixed set of processes that will participate for the entire lifetime of the global application. It is common for each node to execute one MPI process, but it is also possible to execute more than one MPI process per multicore node so as to improve the utilization of processor cores within the node. Each process starts executing its own copy of the MPI program by calling the mpi.MPI_Init() method, where mpi is the instance of the MPI class used by the process. After that, each process can call the MPI_Comm_rank(mpi.MPI_COMM_WORLD) method to determine its own rank within the range 0…(S-1), where S = MPI_Comm_size() is the total number of processes.

In this lecture, we studied how a global view, XG, of array X can be implemented by S local arrays (one per process) of size XL.length = XG.length / S. For simplicity, assume that XG.length is a multiple of S. Then, if we logically want to set $XG[i] := i$ for all logical elements of XG, we can instead set $XL[i] := L \cdot R + i$ in each local array, where L = XL.length and R = MPI_Comm_rank(). Thus, process 0's copy of XL will contain logical elements XG[0…L-1], process 1's copy of XL will contain logical elements XG[L…2L-1], and so on. We see, then, that the SPMD approach is very different from client-server programming, where each process can be executing a different program.
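Because this mapping is pure arithmetic, it can be checked without MPI at all. The following plain-Python sketch simulates S processes and verifies that stitching the local views together in rank order reproduces the global view (the constants here are illustrative):

```python
S = 4                 # number of simulated processes
GLOBAL_LEN = 16       # XG.length, assumed to be a multiple of S
L = GLOBAL_LEN // S   # XL.length on every process

def local_array(R):
    """Process R's local view XL, where XL[i] holds logical element XG[L*R + i]."""
    return [L * R + i for i in range(L)]

# Concatenating the local views in rank order recovers XG[i] == i.
XG = [v for R in range(S) for v in local_array(R)]
assert XG == list(range(GLOBAL_LEN))

print(local_array(1))  # process 1 holds XG[L…2L-1], i.e. [4, 5, 6, 7]
```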

## Point-to-Point Communication

Lecture Summary: In this lecture, we studied how to perform point-to-point communication in MPI by sending and receiving messages. In particular, we worked out the details for a simple scenario in which process 0 sends a string, "ABCD", to process 1. Since MPI programs follow the SPMD model, we have to ensure that the same program behaves differently on processes 0 and 1. This was achieved by using an if-then-else statement that checks the value of the rank of the process that it is executing on. If the rank is zero, we include the necessary code for calling MPI_Send(); otherwise, we include the necessary code for calling MPI_Recv() (assuming that this simple program is only executed with two processes). Both calls include a number of parameters. The MPI_Send() call specifies the substring to be sent as a subarray by providing the string, offset, and data type, as well as the rank of the receiver, and a tag to assist with matching send and receive calls (we used a tag value of 99 in the lecture). The MPI_Recv() call (in the else part of the if-then-else statement) includes a buffer in which to receive the message, along with the offset and data type, as well as the rank of the sender and the tag. Each send/receive operation waits (or is blocked) until its dual operation is performed by the other process. Once a pair of parallel and compatible MPI_Send() and MPI_Recv() calls is matched, the actual communication is performed by the MPI library. This approach to matching pairs of send/receive calls in SPMD programs is referred to as two-sided communication.

As indicated in the lecture, the current implementation of MPI only supports communication of (sub)arrays of primitive data types. However, since we have already learned how to serialize and deserialize objects into/from bytes, the same approach can be used in MPI programs by communicating arrays of bytes.
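In Python this serialization step is typically done with pickle. In fact, mpi4py's lowercase comm.send()/comm.recv() methods pickle and unpickle Python objects automatically, while the uppercase comm.Send()/comm.Recv() variants operate on raw buffers such as the bytes produced below:

```python
import pickle

# An arbitrary (picklable) Python object we want to communicate.
message = {"coords": [1.0, 2.5], "label": "ABCD"}

payload = pickle.dumps(message)   # serialize: object -> bytes suitable for MPI
restored = pickle.loads(payload)  # deserialize on the receiving side

assert isinstance(payload, bytes)
assert restored == message
```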

Lecture Summary: In this lecture, we studied some important properties of the message-passing model with send/receive operations, namely message ordering and deadlock. For message ordering, we discussed a simple example with four MPI processes, R0, R1, R2, R3 (with ranks 0…3 respectively). In our example, process R1 sends message A to process R0 and process R2 sends message B to process R3. We observed that there was no guarantee that process R1’s send request would complete before process R2’s request, even if process R1 initiated its send request before process R2 did. Thus, there is no guarantee of the temporal ordering of these two messages. In MPI, the only guarantee of message ordering is when multiple messages are sent with the same sender, receiver, data type, and tag — these messages will all be ordered in accordance with when their send operations were initiated.

We learned that send and receive operations can lead to an interesting parallel programming challenge called deadlock. There are many ways to create deadlocks in MPI programs. In the lecture, we studied a simple example in which process R0 attempts to send message X to process R1, and process R1 attempts to send message Y to process R0. Since both sends are attempted in parallel, processes R0 and R1 remain blocked indefinitely as they wait for matching receive operations, thus resulting in a classical deadlock cycle.

We also learned two ways to fix such a deadlock cycle. The first is by interchanging the two statements in one of the processes (say process R1). As a result, the send operation in process R0 will match the receive operation in process R1, and both processes can move forward with their next communication requests. Another approach is to use MPI’s sendrecv() operation which includes all the parameters for the send and for the receive operations. By combining send and receive into a single operation, the MPI runtime ensures that deadlock is avoided because a sendrecv() call in process R0 can be matched with a sendrecv() call in process R1 instead of having to match individual send and receive operations.

## Non-Blocking Communications

In this lecture, we studied non-blocking communications, which are implemented via the MPI_Isend() and MPI_Irecv() API calls.
The I in MPI_Isend() and MPI_Irecv() stands for "Immediate" because these calls return immediately instead of blocking until completion. Each such call returns an object of type MPI_Request which can be used as a handle to track the progress of the corresponding send/receive operation.

The main benefit of this approach is that the amount of idle time spent waiting for communications to complete is reduced when using non-blocking communications, since the Isend and Irecv operations can be overlapped with local computations. Also, while it is common for Isend and Irecv operations to be paired with each other, it is also possible for a nonblocking send/receive operation in one process to be paired with a blocking receive/send operation in another process.

## Collective Communication

For a broadcast operation, all MPI processes execute an MPI_Bcast() API call with a specified root process that is the source of the data to be broadcasted. A key property of collective operations is that each process must wait until all processes reach the same collective operation, before the operation can be performed. This form of waiting is referred to as a barrier. After the operation is completed, all processes can move past the implicit barrier in the collective call. In the case of MPI_Bcast(), each process will have obtained a copy of the value broadcasted by the root process.
