Using 3D Tensors on GPUs for Shortest-Path Finding
3D Augmented Cost Matrix


The Floyd-Warshall (FW) algorithm is an old but still hugely important algorithm for solving the all-pairs shortest-paths (APSP) problem. While the algorithm is available in open-source graph optimization libraries such as NetworkX, those implementations do not take advantage of modern parallel processing hardware such as Graphics Processing Units (GPUs), which can cut compute time to a fraction of that of iterative or recursive CPU implementations. In this work, a re-implementation of the Floyd-Warshall algorithm using open-source GPU frameworks such as PyTorch is presented, along with an implementation of R-Kleene, a somewhat newer algorithm that solves the APSP problem with a recursive, divide-and-conquer structure that parallelizes well. In addition, a random graph generator is contributed that produces graphs across a wide range of scales, with densities and connectivities controlled by simple heuristics. The run-times of the GPU-accelerated FW algorithm and R-Kleene on these heuristically generated graphs are evaluated against each other and against the widely used NetworkX implementation. The code for the GPU implementations of the algorithms, the random graph generator, and the Blender animation file are available here.



What is APSP?

The All-Pairs Shortest Path (APSP) problem requires determining the shortest-path distance between every pair of nodes in a network. One approach to solving this problem is a general all-pairs label-correcting algorithm. Such algorithms are guaranteed to yield a solution on dense networks, and hence generalize to sparse networks as well. A popular example is the Floyd-Warshall algorithm, which runs in O(n^3) time.
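To ground this, here is a minimal textbook sketch of Floyd-Warshall in plain Python; the triple loop is exactly where the O(n^3) cost comes from:

```python
import math

def floyd_warshall(dist):
    # dist: n x n matrix (list of lists); dist[i][j] is the edge cost,
    # math.inf where no edge exists, and 0 on the diagonal.
    n = len(dist)
    for k in range(n):          # allow node k as an intermediate hop
        for i in range(n):
            for j in range(n):  # relax every pair (i, j) through k
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```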

The most freely available implementation of the Floyd-Warshall algorithm is in NetworkX, a popular Python network-analysis library that lets academics and practitioners create, manipulate, and study the structure, dynamics, and functions of complex networks. However, the package runs only on Central Processing Units (CPUs), which makes it hard to scale to larger graphs in domains such as logistics, space travel, and internet routing, where real-time performance may be demanded. Modern Graphics Processing Units (GPUs) are an excellent powerhouse for computation. Originally built for the gaming industry, they have recently been adopted in academic research to run chemical simulations, fold proteins, and train deep neural networks in the artificial-intelligence domain. This is only possible because of the massively parallel cores exposed through hardware acceleration: large operations on higher-dimensional matrices (termed tensors) complete in a fraction of the time a CPU would need.


I conjecture that the APSP cost matrix can be cleverly augmented, and that a custom matrix-operation kernel known as min-plus can be implemented in a similar way to matrix-matrix multiplication. This allows the APSP problem to be processed on GPUs, yielding results in a fraction of the time of NetworkX's CPU implementation. Further, GPU-friendly APSP algorithms such as R-Kleene are also explored and benchmarked. I show that the GPU implementation of Floyd-Warshall runs significantly faster than the CPU implementation, and further show that R-Kleene is even faster at scale. Graphs of a wide variety of scales and densities are used for the benchmarking and reveal the computational-complexity trends. Furthermore, the implementations are made publicly available here and are written in PyTorch, an open-source automatic-differentiation framework that is freely available and runs in Python environments.
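To make the min-plus idea concrete, here is a minimal PyTorch sketch (simplified relative to the benchmarked code). min_plus replaces the multiply-accumulate of ordinary matrix multiplication with add-minimum, and the batched Floyd-Warshall loop relaxes all n^2 node pairs through each pivot in one tensor operation; move the cost matrix to the GPU (D.cuda()) before calling them to get the speedup:

```python
import torch

def min_plus(A, B):
    # Min-plus "matrix product": C[i, j] = min_k (A[i, k] + B[k, j]).
    # Broadcasting materializes an (n, n, n) tensor of candidate path
    # costs, then reduces over the intermediate node k. Fast on a GPU,
    # but the cubic intermediate is memory-hungry.
    return (A.unsqueeze(2) + B.unsqueeze(0)).amin(dim=1)

def floyd_warshall_gpu(D):
    # Batched Floyd-Warshall on a dense (n, n) cost tensor: inf where
    # there is no edge, 0 on the diagonal. One broadcasted relaxation
    # over all pairs per pivot k.
    D = D.clone()
    for k in range(D.shape[0]):
        D = torch.minimum(D, D[:, k].unsqueeze(1) + D[k, :].unsqueeze(0))
    return D
```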



How Much Faster is the GPU?

Well, to quantify this, I created a random graph generator that produces 1,000 unique graphs, with densities varying from extremely sparse to fully connected.
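A simplified sketch of the generator looks like the following (the real one layers extra connectivity heuristics on top of this); density is roughly the fraction of node pairs that receive an edge:

```python
import torch

def random_graph(n, density, device="cuda"):
    # Draw uniform edge weights, then keep roughly `density` of the
    # possible directed edges; missing edges get an infinite cost.
    weights = torch.rand(n, n, device=device)
    mask = torch.rand(n, n, device=device) < density
    D = torch.where(mask, weights, torch.full_like(weights, float("inf")))
    D.fill_diagonal_(0.0)  # zero cost from each node to itself
    return D
```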

Then, I ran the path-finder using the readily available CPU baseline method against my two GPU implementations and recorded the time costs; a quick note on GPU timing follows, and then the results:
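One detail worth knowing when timing GPU code: kernel launches are asynchronous, so the clock must only be read after torch.cuda.synchronize(). A simplified, illustrative harness (names here are for illustration, not the exact benchmark code):

```python
import time
import torch

def time_apsp(fn, D):
    # Synchronize before and after so the measured interval covers the
    # actual GPU work, not just the asynchronous kernel launches.
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(D)
    torch.cuda.synchronize()
    return result, time.perf_counter() - start
```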

Time costs of the three APSP algorithms: Floyd-Warshall on the GPU (FW-GPU), R-Kleene on the GPU (R-Kleene-GPU), and NetworkX's CPU implementation of Floyd-Warshall (FW-CPU-NX).

What the above diagram shows is that as the graph gets larger and denser, the CPU version quickly becomes infeasible, while the GPU versions stay rather stable. Let's take a closer look at the GPU versions for a better comparison:

Time costs of the two GPU APSP algorithms: Floyd-Warshall on the GPU (FW-GPU) and R-Kleene on the GPU (R-Kleene-GPU).

Here, we see something interesting. Despite running on the GPU, FW-GPU's runtime still grows steeply (its cubic work is parallelized, not reduced), while R-Kleene-GPU grows close to linearly over this range. This implies that simply parallelizing the APSP problem is not sufficient for feasibility at large scales. Therefore, recursive methods that divide and conquer are a force to be reckoned with.
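For the curious, here is a sketch of R-Kleene's recursive structure as I read it from the literature, reusing min_plus and floyd_warshall_gpu from the earlier sketch (min_plus handles the rectangular off-diagonal blocks via broadcasting):

```python
import torch

def r_kleene(D, base=256):
    # Divide-and-conquer min-plus closure: split the cost matrix into
    # quadrants, close the top-left block, propagate paths through it,
    # close the bottom-right block, then propagate back.
    n = D.shape[0]
    if n <= base:
        return floyd_warshall_gpu(D)  # small blocks: plain batched FW
    m = n // 2
    A, B = D[:m, :m], D[:m, m:]
    C, E = D[m:, :m], D[m:, m:]
    A = r_kleene(A, base)                 # paths within the first half
    B = min_plus(A, B)                    # first half -> second half
    C = min_plus(C, A)                    # second half -> first half
    E = torch.minimum(E, min_plus(C, B))  # detours through the first half
    E = r_kleene(E, base)                 # paths within the second half
    B = min_plus(B, E)
    C = min_plus(E, C)
    A = torch.minimum(A, min_plus(B, C))  # detours through the second half
    return torch.cat([torch.cat([A, B], dim=1),
                      torch.cat([C, E], dim=1)], dim=0)
```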

Note that fully connected graphs larger than 1,000 nodes were difficult to run despite the 24 GB of VRAM on an RTX 3090. To solve this issue, I have created a novel method called RODNoZ (Recursive Omni-Dimensional Non-Zero). I will write about this in a future article. The basic idea is to further divide and conquer the memory blocks on the GPU to avoid saturating the cores. Another feature the implementation will support is exploiting multiple GPUs.



Background and Technical Details

If you'd like to dive deeper into this topic, I recommend this paper. It gently introduces the problem and gives the background on how it is formulated. It further contains links to even more technical documents for those who like to get into the nitty-gritty implementation details.


Cheers :)
