Using 3D Tensors on GPUs for Shortest-Path Finding
3D Augmented Cost Matrix


The Floyd-Warshall (FW) algorithm is an old but still hugely important algorithm for solving the all-pairs shortest-paths (APSP) problem. While the algorithm is available in open-source graph optimization libraries such as NetworkX, those implementations do not take advantage of modern parallel processing hardware such as Graphics Processing Units (GPUs), which can cut compute time to a fraction of that of iterative or recursive CPU implementations. In this work, a re-implementation of the Floyd-Warshall algorithm using open-source GPU frameworks such as PyTorch is presented, along with an implementation of R-Kleene, a somewhat newer algorithm that solves the APSP problem with a recursive, divide-and-conquer structure that parallelizes well. In addition, a random graph generator is contributed that produces graphs across a wide range of scales, with densities and connectivities controlled by simple heuristics. The run-times of the GPU-accelerated FW algorithm and R-Kleene on these heuristically generated graphs are evaluated against each other and against the widely used NetworkX implementation. The code for the GPU implementations of the algorithms, the random graph generator, and the Blender animation file are available here.



What is APSP?

The All-Pairs Shortest Path (APSP) problem requires determining the shortest-path distance between every pair of nodes in a network. One approach to solving this problem is a general all-pairs label-correcting algorithm. Such algorithms are guaranteed to yield a solution on dense networks, and hence generalize to sparse networks as well. A popular example is the Floyd-Warshall algorithm, which runs in O(n^3) time.
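To ground this, here is a minimal textbook sketch of Floyd-Warshall in plain Python; the triple loop is exactly where the O(n^3) cost comes from:

```python
import math

def floyd_warshall(dist):
    # dist: n x n matrix (list of lists); dist[i][j] is the edge cost,
    # math.inf where no edge exists, and 0 on the diagonal.
    n = len(dist)
    for k in range(n):          # allow node k as an intermediate hop
        for i in range(n):
            for j in range(n):  # relax every pair (i, j) through k
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```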

The most freely available implementation of the Floyd-Warshall algorithm is in NetworkX, a popular Python network-analysis library that lets academics and practitioners create, manipulate, and study the structure, dynamics, and functions of complex networks. However, the package runs only on Central Processing Units (CPUs), which makes it hard to scale to larger graphs in domains such as logistics, space travel, and internet routing, where real-time performance may be demanded. Modern Graphics Processing Units (GPUs) are an excellent powerhouse for computation. Originally built for the gaming industry, they have recently been adopted in academic research to run chemical simulations, fold proteins, and train deep neural networks in the artificial-intelligence domain. This is only possible because of the massively parallel cores exposed through hardware acceleration: large operations on higher-dimensional matrices (termed tensors) complete in a fraction of the time a CPU would need.


I conjecture that the APSP cost matrix can be cleverly augmented, and that a custom matrix-operation kernel known as min-plus can be implemented in a similar way to matrix-matrix multiplication. This allows the APSP problem to be processed on GPUs, yielding results in a fraction of the time of NetworkX's CPU implementation. Further, GPU-friendly APSP algorithms such as R-Kleene are also explored and benchmarked. I show that the GPU implementation of Floyd-Warshall runs significantly faster than the CPU implementation, and further show that R-Kleene is even faster at scale. Graphs of a wide variety of scales and densities are used for the benchmarking and reveal the computational-complexity trends. Furthermore, the implementations are made publicly available here and are written in PyTorch, an open-source automatic-differentiation framework that is freely available and runs in Python environments.
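To make the min-plus idea concrete, here is a minimal PyTorch sketch (simplified relative to the benchmarked code). min_plus replaces the multiply-accumulate of ordinary matrix multiplication with add-minimum, and the batched Floyd-Warshall loop relaxes all n^2 node pairs through each pivot in one tensor operation; move the cost matrix to the GPU (D.cuda()) before calling them to get the speedup:

```python
import torch

def min_plus(A, B):
    # Min-plus "matrix product": C[i, j] = min_k (A[i, k] + B[k, j]).
    # Broadcasting materializes an (n, n, n) tensor of candidate path
    # costs, then reduces over the intermediate node k. Fast on a GPU,
    # but the cubic intermediate is memory-hungry.
    return (A.unsqueeze(2) + B.unsqueeze(0)).amin(dim=1)

def floyd_warshall_gpu(D):
    # Batched Floyd-Warshall on a dense (n, n) cost tensor: inf where
    # there is no edge, 0 on the diagonal. One broadcasted relaxation
    # over all pairs per pivot k.
    D = D.clone()
    for k in range(D.shape[0]):
        D = torch.minimum(D, D[:, k].unsqueeze(1) + D[k, :].unsqueeze(0))
    return D
```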



How Much Faster is the GPU?

Well, to quantify this, I created a random graph generator that produces 1,000 unique graphs, with densities varying from extremely sparse to fully connected.
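A simplified sketch of the generator looks like the following (the real one layers extra connectivity heuristics on top of this); density is roughly the fraction of node pairs that receive an edge:

```python
import torch

def random_graph(n, density, device="cuda"):
    # Draw uniform edge weights, then keep roughly `density` of the
    # possible directed edges; missing edges get an infinite cost.
    weights = torch.rand(n, n, device=device)
    mask = torch.rand(n, n, device=device) < density
    D = torch.where(mask, weights, torch.full_like(weights, float("inf")))
    D.fill_diagonal_(0.0)  # zero cost from each node to itself
    return D
```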

Then, I ran the path-finder using the readily available CPU baseline method against my two GPU implementations and recorded the time costs; a quick note on GPU timing follows, and then the results:
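One detail worth knowing when timing GPU code: kernel launches are asynchronous, so the clock must only be read after torch.cuda.synchronize(). A simplified, illustrative harness (names here are for illustration, not the exact benchmark code):

```python
import time
import torch

def time_apsp(fn, D):
    # Synchronize before and after so the measured interval covers the
    # actual GPU work, not just the asynchronous kernel launches.
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(D)
    torch.cuda.synchronize()
    return result, time.perf_counter() - start
```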

Time costs of the three APSP algorithms: Floyd-Warshall on the GPU (FW-GPU), R-Kleene on the GPU (R-Kleene-GPU), and NetworkX's CPU implementation of Floyd-Warshall (FW-CPU-NX).

What the above diagram shows is that as the graph gets larger and denser, the CPU version quickly becomes infeasible, while the GPU versions stay rather stable. Let's take a closer look at the GPU versions for a better comparison:

Time costs of the two GPU APSP algorithms: Floyd-Warshall on the GPU (FW-GPU) and R-Kleene on the GPU (R-Kleene-GPU).

Here, we see something interesting. Despite running on the GPU, FW-GPU's runtime still grows steeply (its cubic work is parallelized, not reduced), while R-Kleene-GPU grows close to linearly over this range. This implies that simply parallelizing the APSP problem is not sufficient for feasibility at large scales. Therefore, recursive methods that divide and conquer are a force to be reckoned with.
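For the curious, here is a sketch of R-Kleene's recursive structure as I read it from the literature, reusing min_plus and floyd_warshall_gpu from the earlier sketch (min_plus handles the rectangular off-diagonal blocks via broadcasting):

```python
import torch

def r_kleene(D, base=256):
    # Divide-and-conquer min-plus closure: split the cost matrix into
    # quadrants, close the top-left block, propagate paths through it,
    # close the bottom-right block, then propagate back.
    n = D.shape[0]
    if n <= base:
        return floyd_warshall_gpu(D)  # small blocks: plain batched FW
    m = n // 2
    A, B = D[:m, :m], D[:m, m:]
    C, E = D[m:, :m], D[m:, m:]
    A = r_kleene(A, base)                 # paths within the first half
    B = min_plus(A, B)                    # first half -> second half
    C = min_plus(C, A)                    # second half -> first half
    E = torch.minimum(E, min_plus(C, B))  # detours through the first half
    E = r_kleene(E, base)                 # paths within the second half
    B = min_plus(B, E)
    C = min_plus(E, C)
    A = torch.minimum(A, min_plus(B, C))  # detours through the second half
    return torch.cat([torch.cat([A, B], dim=1),
                      torch.cat([C, E], dim=1)], dim=0)
```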

Note that fully connected graphs larger than 1,000 nodes were difficult to run despite the 24 GB of VRAM on an RTX 3090. To solve this issue, I have created a novel method called RODNoZ (Recursive Omni-Dimensional Non-Zero). I will write about this in a future article. The basic idea is to further divide and conquer the memory blocks on the GPU to avoid saturating the cores. Another feature the implementation will support is exploiting multiple GPUs.



Background and Technical Details

If you'd like to dive deeper into this topic, I recommend this paper. It gently introduces the problem and gives the background on how it is formulated. It further contains links to even more technical documents for those who like to get into the nitty-gritty implementation details.


Cheers :)
