A central problem in algorithmic graph theory is the shortest path problem. The problem of finding the shortest path between every pair of nodes is known as the all-pair-shortest-paths (APSP) problem. As sequential algorithms for this problem often yield long runtimes, parallelization has been shown to be beneficial in this field. In this article two efficient algorithms solving this problem are introduced.
Another variation of the problem is the single-source-shortest-paths (SSSP) problem, which also has parallel approaches: Parallel single-source shortest path algorithm.
Let G = (V, E, w) be a directed graph with the set of nodes V and the set of edges E ⊆ V × V, where each edge e ∈ E has a weight w(e).
In the remainder of the article it is assumed that the graph is represented using an adjacency matrix. We expect the output of the algorithm to be a distancematrix D, in which every entry d_{i,j} is the weight of the shortest path in G from node i to node j.
The Floyd algorithm presented later can handle negative edge weights, whereas the Dijkstra algorithm requires all edges to have a positive weight.
The Dijkstra algorithm originally was proposed as a solver for the single-source-shortest-paths problem. However, the algorithm can easily be used for solving the All-Pair-Shortest-Paths problem by executing the Single-Source variant with each node in the role of the root node.
In pseudocode such an implementation could look as follows:

 1    func DijkstraSSSP(G, v) {
 2        ... //standard SSSP implementation here
 3        return dv;
 4    }
 5
 6    func DijkstraAPSP(G) {
 7        D := |V| x |V| matrix
 8        for i from 1 to |V| {
 9            //D[i] denotes the i-th row of D
10            D[i] := DijkstraSSSP(G, i)
11        }
12    }
In this example we assume that DijkstraSSSP takes the graph G and the root node v as input. The result of the execution is the distancelist dv: its i-th element stores the distance from the root node v to node i. Therefore dv corresponds exactly to the v-th row of the APSP distancematrix D. For this reason, DijkstraAPSP iterates over all nodes of the graph G and executes DijkstraSSSP with each as root node while storing the results in D.
The runtime of DijkstraSSSP is O(|V|²), as we assume the graph to be represented by an adjacency matrix. Therefore DijkstraAPSP has a total sequential runtime of O(|V|³).
A trivial parallelization can be obtained by parallelizing the loop of DijkstraAPSP in line 8. However, when using the sequential DijkstraSSSP, the number of usable processors is limited by the number of iterations executed in the loop. Therefore, |V| is an upper bound for the number of processors in this trivial parallelization.

For example, let the number of processors p be equal to the number of nodes |V|. In this case, each processor executes DijkstraSSSP exactly once in parallel. When only p = |V|/2 processors are available, however, each processor has to execute DijkstraSSSP twice. In total this yields a runtime of O(|V|² ⋅ ⌈|V|/p⌉). If |V| is a multiple of p, employing p processors therefore reduces the runtime by exactly the factor p.
Another benefit of this parallelization is that no communication between the processors is required. However, it is required that every processor has enough local memory to store the entire adjacency matrix of the graph.
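Because the rows are independent and no communication is needed, this loop-level parallelization can be sketched with an ordinary worker pool. This is an illustrative sketch (the names and the pool choice are assumptions); in CPython, threads mainly illustrate the structure rather than yield real speedup:

```python
# Sketch of the trivial parallelization: the loop over root nodes is
# distributed over p workers, each running the sequential SSSP. With
# p <= |V| workers, each worker executes about ceil(|V|/p) SSSP runs.
from concurrent.futures import ThreadPoolExecutor

INF = float("inf")

def dijkstra_sssp(G, v):
    n = len(G)
    d = [INF] * n
    d[v] = 0
    done = [False] * n
    for _ in range(n):
        x = min((i for i in range(n) if not done[i]), key=lambda i: d[i])
        done[x] = True
        for j in range(n):
            d[j] = min(d[j], d[x] + G[x][j])
    return d

def dijkstra_apsp_parallel(G, p):
    # Every worker only reads the shared adjacency matrix G, so no
    # communication between workers is required.
    with ThreadPoolExecutor(max_workers=p) as pool:
        return list(pool.map(lambda v: dijkstra_sssp(G, v), range(len(G))))
```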
If more than |V| processors shall be used for the parallelization, multiple processors have to take part in a single DijkstraSSSP computation. For this reason, the parallelization is split into two levels. For the first level the processors are split into |V| partitions. Each partition is responsible for the computation of a single row of the distancematrix D, i.e. for one DijkstraSSSP execution with a fixed root node. With this definition each partition has a size of k = p/|V| processors. The partitions can perform their computations in parallel, as their results are independent of each other. Therefore, the parallelization presented in the previous section corresponds to a partition size of 1 with p = |V| processors.
The main difficulty is the parallelization of multiple processors executing DijkstraSSSP for a single root node. The idea for this parallelization is to distribute the management of the distancelist dv within the partition: each processor in the partition is exclusively responsible for |V|/k elements of dv. For example, let |V| = 4 and p = 8, which yields a partition size of k = 2. In this case, the first processor of each partition is responsible for dv,1 and dv,2, and the second processor is responsible for dv,3 and dv,4. Hereby, the total distancelist is dv = [dv,1, dv,2, dv,3, dv,4].
The DijkstraSSSP algorithm mainly consists of the repetition of two steps: First, the nearest node x in the distancelist dv has to be found. Second, the entries of dv have to be updated according to x.

Because dv has been distributed across the partition, these steps have to be altered as follows for the parallelization:

1. Find the nearest node x in dv: Every processor first scans its own part of dv and determines its local nearest node x̃. The global nearest node x is then computed from all local results x̃ by a reduce operation. Finally, x is broadcast to every processor of the partition.
2. Update the entries of dv according to x: Every processor updates only its own part of dv.
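These two altered steps can be simulated sequentially. The following sketch models the k processors of one partition as slices of dv, with the reduce and broadcast modelled by ordinary Python operations; all names are illustrative assumptions:

```python
# Simulation of the partitioned DijkstraSSSP, assuming a partition of k
# "processors" that each own a contiguous slice of the distancelist d_v
# (requires k to divide |V| in this sketch).
INF = float("inf")

def partitioned_sssp(G, v, k):
    n = len(G)
    d = [INF] * n
    d[v] = 0
    done = [False] * n
    size = n // k                      # elements of d_v owned per processor
    chunks = [range(c * size, (c + 1) * size) for c in range(k)]
    for _ in range(n):
        # step 1a: every processor scans only its own slice (local optimum)
        local = [min((i for i in ch if not done[i]),
                     key=lambda i: d[i], default=None) for ch in chunks]
        # step 1b: the global nearest node x is found by a reduce operation
        x = min((i for i in local if i is not None), key=lambda i: d[i])
        # step 1c: x is broadcast to all processors (implicit here)
        done[x] = True
        # step 2: every processor updates only its own slice of d_v
        for ch in chunks:
            for j in ch:
                d[j] = min(d[j], d[x] + G[x][j])
    return d
```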
In a single iteration of the DijkstraSSSP performed by a partition of size k, finding the local nearest node x̃ takes O(|V|/k) time, the reduce- and broadcast-operations take O(log k) time, and updating dv again takes O(|V|/k) time. Since there are |V| iterations, a single partition computes its row of D in O(|V|(|V|/k + log k)) time. Inserting the definition k = p/|V| yields the total runtime for DijkstraAPSP: O(|V|³/p + |V| log p).
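Written out, the substitution of k = p/|V| into the per-partition runtime proceeds as follows (a worked derivation consistent with the formulas above):

```latex
O\Bigl(|V|\bigl(\tfrac{|V|}{k} + \log k\bigr)\Bigr)
  = O\Bigl(|V|\cdot\tfrac{|V|^2}{p} + |V|\log\tfrac{p}{|V|}\Bigr)
  \subseteq O\Bigl(\tfrac{|V|^3}{p} + |V|\log p\Bigr)
```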
The main benefit of this parallelization is that it is no longer required that every processor stores the entire adjacency matrix. Instead, it is sufficient when each processor within a partition only stores the columns of the adjacency matrix of the nodes for which it is responsible. Given a partition size of k, each processor thus only has to store |V|/k columns of the adjacency matrix.
The graph used in this example is the one presented in the image with four nodes.
The goal is to compute the distancematrix with p = 8 processors; with four nodes this yields partitions of size k = 2. The computation of the distancelist dA, the distances from root node A, across the different iterations is visualized in the second image: the top row in the image corresponds to dA after the initialization, the bottom one to dA after the termination of the algorithm.
The Floyd–Warshall algorithm solves the All-Pair-Shortest-Paths problem for directed graphs. With the adjacency matrix of a graph as input, it calculates shorter paths iteratively. After |V| iterations the distance-matrix contains all the shortest paths. The following describes a sequential version of the algorithm in pseudo code:
 1    func Floyd_All_Pairs_SP(A) {
 2        D^(0) = A;
 3        for k := 1 to n do
 4            for i := 1 to n do
 5                for j := 1 to n do
 6                    d^(k)_{i,j} := min(d^(k-1)_{i,j}, d^(k-1)_{i,k} + d^(k-1)_{k,j})
 7    }
Where A is the adjacency matrix, n = |V| the number of nodes and D the distance matrix.
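A direct Python transcription of this pseudocode; the naming and the INF convention for missing edges are illustrative assumptions:

```python
# Sequential Floyd-Warshall, assuming the adjacency matrix A uses INF for
# missing edges and 0 on the diagonal.
INF = float("inf")

def floyd_all_pairs_sp(A):
    n = len(A)
    D = [row[:] for row in A]          # D^(0) = A
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # line 6 of the pseudocode: relax the path i -> k -> j
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D
```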
The basic idea to parallelize the algorithm is to partition the matrix and split the computation between the processes. Each process is assigned to a specific part of the matrix. A common way to achieve this is 2-D Block Mapping. Here the matrix is partitioned into squares of the same size and each square gets assigned to a process. For an
n × n matrix and p processes, each process calculates an n/√p × n/√p sized part of the distance matrix. For p = n² processes, each process would be assigned exactly one element of the matrix; because of that, the parallelization scales up to a maximum of n² processes. In the following we refer with p_{i,j} to the process that is assigned to the square in the i-th row and the j-th column.
As the calculation of the parts of the distance matrix depends on results from other parts, the processes have to communicate with each other and exchange data. In the following we refer with d^(k)_{i,j} to the entry in the i-th row and j-th column of the distance matrix after the k-th iteration. To calculate d^(k)_{i,j} we need the entries d^(k-1)_{i,j}, d^(k-1)_{i,k} and d^(k-1)_{k,j}, as specified in line 6 of the algorithm. d^(k-1)_{i,j} is available to each process, as it was calculated by the process itself in the previous iteration.
Additionally, each process needs a part of the k-th row and the k-th column of the D^(k-1) matrix: the d^(k-1)_{i,k} entry is held by a process in the same row, and the d^(k-1)_{k,j} entry by a process in the same column as the process that wants to compute d^(k)_{i,j}. Therefore, each process that has calculated a part of the k-th row of the D^(k-1) matrix has to send this part to all processes in its column, and each process that has calculated a part of the k-th column of the D^(k-1) matrix has to send this part to all processes in its row.
For the 2-D block mapping we have to modify the algorithm as follows:
 1    func Floyd_All_Pairs_Parallel(D^(0)) {
 2        for k := 1 to n do {
 3            Each process p_{i,j} that has a segment of the k-th row of D^(k-1) broadcasts it to the p_{*,j} processes;
 4            Each process p_{i,j} that has a segment of the k-th column of D^(k-1) broadcasts it to the p_{i,*} processes;
 5            Each process waits to receive the needed segments;
 6            Each process computes its part of the D^(k) matrix;
 7        }
 8    }
In line 5 of the algorithm there is a synchronisation step that ensures that all processes have the data necessary to compute the next iteration. To improve the runtime of the algorithm, this synchronisation step can be removed without affecting correctness: each process then starts its computation as soon as it has the data necessary to compute its part of the matrix. This version of the algorithm is called pipelined 2-D block mapping.
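The synchronous 2-D block mapping can be simulated in plain Python. In the sketch below the process grid is emulated sequentially, the broadcasts are modelled by snapshot copies of the k-th row and column, and it is assumed for simplicity that p is a perfect square and √p divides n; all names are illustrative:

```python
# Simulation of Floyd-Warshall with 2-D block mapping: the n x n matrix is
# partitioned into a sqrt(p) x sqrt(p) grid of blocks, one per "process".
INF = float("inf")

def floyd_2d_block(A, p):
    n = len(A)
    q = int(p ** 0.5)                  # process grid is q x q
    b = n // q                         # block size n / sqrt(p)
    D = [row[:] for row in A]
    for k in range(n):
        # broadcast step: the processes owning segments of row k and
        # column k make them available along their grid columns/rows
        row_k = D[k][:]
        col_k = [D[i][k] for i in range(n)]
        # computation step: each process p_{r,c} updates only its own block
        for r in range(q):
            for c in range(q):
                for i in range(r * b, (r + 1) * b):
                    for j in range(c * b, (c + 1) * b):
                        D[i][j] = min(D[i][j], col_k[i] + row_k[j])
    return D
```

Taking snapshots of row k and column k before the computation step mirrors the fact that every block update in iteration k only reads values of D^(k-1).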
The runtime of the sequential algorithm is determined by the triple nested for loop. The computation in line 6 can be done in constant time (O(1)), so the runtime of the sequential algorithm is O(n³).
The runtime of the parallelized algorithm consists of two parts: the time for the computation and the time for communication and data transfer between the processes. As there is no additional computation in the algorithm and the computation is split equally among the p processes, the computational part has a runtime of O(n³/p).
In each of the n iterations of the algorithm there is a one-to-all broadcast operation performed along the row and column of the processes, in which n/√p elements are broadcast, followed by a synchronisation step. How long these operations take is highly dependent on the architecture of the parallel system used; the time needed for communication and data transfer in the algorithm is T_comm = n(T_synch + T_broadcast).
For the whole algorithm we have the following runtime:
T = O(n³/p) + n(T_synch + T_broadcast)
For the runtime of the data transfer between the processes in the pipelined version of the algorithm we assume that a process can transfer k elements to a neighbouring process in O(k) time. In every step, the n/√p elements of a row or column segment are sent to a neighbouring process, which takes O(n/√p) time. After √p such steps, i.e. after O(n/√p ⋅ √p) = O(n) time, the data of the first row and column reach the process p_{√p,√p} in the opposite corner of the process grid. The values of successive rows and columns follow after time O(n²/p) in a pipelined mode, so process p_{√p,√p} finishes its last computation after O(n³/p) + O(n) time. Therefore, the additional time needed for communication in the pipelined version is O(n).
The overall runtime for the pipelined version of the algorithm is:
T = O(n³/p) + O(n)