In computer science, Cannon's algorithm is a distributed algorithm for matrix multiplication for two-dimensional meshes first described in 1969 by Lynn Elliot Cannon.[1][2]

It is especially suitable for computers laid out in an N × N mesh.[3] While Cannon's algorithm works well in homogeneous 2D grids, extending it to heterogeneous 2D grids has been shown to be difficult.[4]

The main advantage of the algorithm is that its storage requirements remain constant and are independent of the number of processors.[2]

The Scalable Universal Matrix Multiplication Algorithm (SUMMA)[5] is a more practical algorithm that requires less workspace and overcomes the need for a square 2D grid. It is used by the ScaLAPACK, PLAPACK, and Elemental libraries.

Algorithm overview

edit

When multiplying two n×n matrices A and B, we need n×n processing nodes p arranged in a 2D grid.

// PE(i , j)
k := (i + j) mod N;
a := a[i][k];
b := b[k][j];
c[i][j] := 0;
for (l := 0; l < N; l++) {
    c[i][j] := c[i][j] + a * b;
        concurrently {
            send a to PE(i, (j + N − 1) mod N);
            send b to PE((i + N − 1) mod N, j);
        } with {
            receive a' from PE(i, (j + 1) mod N);
            receive b' from PE((i + 1) mod N, j );
        }
    a := a';
    b := b';
}

We need to select k in every iteration for every Processor Element (PE) so that processors don't access the same data for computing .

Therefore processors in the same row / column must begin summation with different indexes. If for example PE(0,0) calculates in the first step, PE(0,1) chooses first. The selection of k := (i + j) mod n for PE(i,j) satisfies this constraint for the first step.

In the first step we distribute the input matrices between the processors based on the previous rule.

In the next iterations we choose a new k' := (k + 1) mod n for every processor. This way every processor will continue accessing different values of the matrices. The needed data is then always at the neighbour processors. A PE(i,j) needs then the from PE(i,(j + 1) mod n) and the from PE((i + 1) mod n,j) for the next step. This means that has to be passed cyclically to the left and also cyclically upwards. The results of the multiplications are summed up as usual. After n steps, each processor has calculated all once and its sum is thus the searched .

After the initial distribution of each processor, only the data for the next step has to be stored. These are the intermediate result of the previous sum, a and a . This means that all three matrices only need to be stored in memory once evenly distributed across the processors.

Generalisation

edit

In practice we have much fewer processors than the matrix elements. We can replace the matrix elements with submatrices, so that every processor processes more values. The scalar multiplication and addition become sequential matrix multiplication and addition. The width and height of the submatrices will be .

The runtime of the algorithm is , where is the time of the initial distribution of the matrices in the first step, is the calculation of the intermediate results and and stands for the time it takes to establish a connection and transmission of byte respectively.

A disadvantage of the algorithm is that there are many connection setups, with small message sizes. It would be better to be able to transmit more data in each message.

See also

edit

References

edit
  1. ^ Cannon, Lynn Elliot (14 July 1969). A cellular computer to implement the Kalman Filter Algorithm (PhD). Montana State University.
  2. ^ a b Gupta, H.; Sadayappan, P. (1994). Communication Efficient Matrix-Multiplication on Hypercubes (Technical report). Stanford Infolab.
  3. ^ "4.2 Matrix Multiplication on a Distributed Memory Machine". Numerical Linear Algebra. Computational Science Education Project. 1991–1995. Archived from the original on 1 April 2018.
  4. ^ Pineau, Jean-François (October 2010). Communication-aware scheduling on heterogeneous master-worker platforms (PhD). Ecole normale supérieure de lyon. tel-00530131.
  5. ^ van de Geijn, Robert A.; Watts, Jerrell (April 1997). "SUMMA: scalable universal matrix multiplication algorithm". Concurrency: Practice and Experience. 9 (4): 255–274. doi:10.1002/(SICI)1096-9128(199704)9:4<255::AID-CPE250>3.0.CO;2-2.
edit

📚 Artikel Terkait di Wikipedia

List of algorithms

An algorithm is a fundamental set of rules or defined procedures that are typically designed and used to be a simpler way to solve a specific problem

Matrix multiplication algorithm

central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix

List of numerical analysis topics

zero matrix Algorithms for matrix multiplication: Strassen algorithm Coppersmith–Winograd algorithm Cannon's algorithm — a distributed algorithm, especially

Void (astronomy)

results of large-scale surveys of the universe. Of the many different algorithms, virtually all fall into one of three general categories. The first class

40CT cannon

surplus worth £70 million of 40CTC cannons was revealed by the UK MoD following this cancellation. A total of 515 cannon had been ordered by the MoD in 2015

Summa (disambiguation)

baseball player Scalable Universal Matrix Multiplication Algorithm; see Cannon's algorithm Somma (disambiguation) Summa potestas (law) This disambiguation

List of Xbox One games (A–L)

Home Interactive Sep 10, 2019 Sep 10, 2019 Sep 10, 2019 Green: The Life Algorithm Side-scrolling action, Metroidvania Estacion Pi Y Diseno Estacion Pi Y

Karmakar

Miah, Md. Abul Hashem (June 1991). "Cannons in the Subcontinent with a Special Reference to the Historical Cannons of Bengal". Journal of the Asiatic Society