Vectorized Quick Sort In JDK21

This article explores the potential of the Vector API, still incubating as of JDK 21, to accelerate the classic QuickSort algorithm. For CPU-bound applications that perform frequent sorting, the results have direct practical implications.

QuickSort works by selecting a pivot element and rearranging the array such that elements less than the pivot are placed to its left and elements greater than the pivot are placed to its right. This effectively partitions the array, allowing for a divide-and-conquer approach where the left and right sub-arrays can be sorted recursively.

The Vector API leverages Single Instruction Multiple Data (SIMD) instructions, enabling the comparison and manipulation of multiple data elements simultaneously. Modern CPUs, like those supporting AVX-512, can compare and swap up to 16 integer or floating-point elements at a time, significantly exceeding the capabilities of scalar (single-element) operations.
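As a minimal illustration of what a single SIMD comparison buys, here is a standalone sketch (class and variable names are my own; it needs --add-modules jdk.incubator.vector to compile and run):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class SimdLanesDemo {
    //widest species the hardware supports, e.g. 16 int lanes with AVX-512
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_MAX;

    public static void main(String[] args) {
        int[] data = new int[SPECIES.length()];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        //one load plus one compare covers every lane at once
        IntVector vec = IntVector.fromArray(SPECIES, data, 0);
        VectorMask<Integer> lessThan = vec.lt(SPECIES.length() / 2);
        System.out.println(SPECIES.length() + " lanes compared at once; "
                + lessThan.trueCount() + " are below the pivot");
    }
}
```

The printed lane count depends on the hardware the JVM runs on.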

However, applying the Vector API to QuickSort presents some challenges:

  • When the data chunk size falls below the vector register size (e.g., less than 16 elements), the algorithm needs special handling. Padding the data and using masks to isolate relevant elements can significantly increase complexity.
  • Traditionally, QuickSort operates in-place, swapping elements within the original array. However, vectorized operations might require writing sets of elements to temporary buffers for optimal performance.

To overcome these challenges and simplify implementation, we propose the following approach:

  • Utilize a separate buffer to store elements greater than the pivot. This avoids in-place modifications to the original array and facilitates vectorized writes.
  • For sub-arrays shorter than the vector register size, a standard scalar Quick Sort implementation can be used to maintain efficiency.
  • Since Quick Sort is recursive, consider allocating the buffer outside the sorting function to avoid unnecessary memory allocation on each recursive call. Additionally, allocate a buffer slightly larger than the anticipated maximum sub-array size to prevent overflows.

I will skip the scalar portion of the algorithm since you can easily google it. I annotated the code and wrote a simple unit test to make sure the regular Arrays.sort() and my code produce the same result.

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class VectorizedQuickSort {
    //Maximum available width of the vector
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_MAX;
    private int[] tmp;

    public void sort(int[] arr, int low, int high) {
        //sort is recursive, so need to move buffer allocation out of it
        tmp = new int[arr.length];
        //actual sort
        innerSort(arr, low, high);
    }
    void innerSort(int[] arr, int low, int high) {
        if (low < high) {
            int pivot = 0;
            //if the sub-array spans at least one full vector
            if ((high - low + 1) / SPECIES.length() >= 1) {
                pivot = vectorPartition(arr, low, high);
            } else {
                //otherwise regular scalar partitioner
                pivot = scalarPartition(arr, low, high);
            }
            innerSort(arr, low, pivot - 1);
            innerSort(arr, pivot + 1, high);
        }
    }
    private int vectorPartition(int[] arr, int low, int high) {
        int pivot = arr[high];
        //That's the position of the bigger elements in the buffer
        int highPos = 0;
        //the tail of the smaller elements in the original array
        int lowPos = low;
        //current position we check
        int cPos = low;
        //while we have more strides
        while (cPos + SPECIES.length() < high) {
            //populate vector
            IntVector vec = IntVector.fromArray(SPECIES, arr, cPos);
            //compare with pivot and get a mask
            VectorMask<Integer> lessThan = vec.lt(pivot);
            //compress smaller elements into separate vector
            IntVector lt = vec.compress(lessThan);
            //compress bigger elements into separate vector
            IntVector gte = vec.compress(lessThan.not());
            //flush smaller elements back to array after current tail
            lt.intoArray(arr, lowPos);
            //call to trueCount causes quite a lot of logic to be executed
            int trueCount = lessThan.trueCount();
            //increase the tail by the number of elements flushed
            lowPos += trueCount;
            //flush bigger elements into a buffer
            gte.intoArray(tmp, highPos);
            //update buffer counter
            highPos += SPECIES.length() - trueCount;
            //update current position
            cPos += SPECIES.length();
        }
        //handle the small tail that does not fill an entire vector;
        //the buffer is still used to simplify the last step
        for (int j = cPos; j < high; j++) {
            if (arr[j] < pivot) {
                arr[lowPos++] = arr[j];
            } else {
                tmp[highPos++] = arr[j];
            }
        }
        //since bigger elements are in buffer we don't need to swap
        arr[lowPos] = arr[high];
        int result = lowPos;

        int j = 0;
        //flush buffer back to array
        while (j < highPos) {
            arr[++lowPos] = tmp[j++];
        }
        return result;
    }
}        
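For reference, the scalarPartition I skipped can be the textbook Lomuto scheme, which uses the same last-element-pivot convention as vectorPartition. Here it is wrapped in its own class so it can run standalone (in the sorter it would be a private method of VectorizedQuickSort):

```java
public class ScalarPartition {
    //textbook Lomuto partition: the pivot is the last element
    static int scalarPartition(int[] arr, int low, int high) {
        int pivot = arr[high];
        int i = low - 1;
        for (int j = low; j < high; j++) {
            if (arr[j] < pivot) {
                i++;
                int swap = arr[i];
                arr[i] = arr[j];
                arr[j] = swap;
            }
        }
        //move the pivot between the two halves
        int swap = arr[i + 1];
        arr[i + 1] = arr[high];
        arr[high] = swap;
        return i + 1;
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 2, 3};
        int p = scalarPartition(a, 0, a.length - 1);
        System.out.println("pivot index " + p + ", array " + java.util.Arrays.toString(a));
    }
}
```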

Surprisingly, calling trueCount() on a VectorMask is quite a heavy operation, and the JIT does not optimize away repeated invocations, so storing the result in a temporary variable improves performance a bit.

To compare the performance of the QuickSort implementations, I wrote a simple JMH benchmark. Please keep in mind that the algorithm sorts in place, so the setup method has to run per invocation, which adds quite a lot of noise when the array is small. Here are the results:
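The harness looks roughly like this (a sketch reconstructed from the names in the results table below; only the vector variant is shown, and it needs the JMH dependency on the classpath). @Setup(Level.Invocation) is what forces a fresh unsorted array before every single call:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class QSBench {
    @Param({"1000"})
    int size;

    int[] data;
    final VectorizedQuickSort sorter = new VectorizedQuickSort();

    //the sort mutates the array, so every invocation needs fresh input
    @Setup(Level.Invocation)
    public void setup() {
        data = new Random(42).ints(size).toArray();
    }

    @Benchmark
    public int[] vector() {
        sorter.sort(data, 0, data.length - 1);
        return data;
    }
}
```

The arrays and scalar variants differ only in the sort call inside the @Benchmark method.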

Benchmark results
Benchmark        (size)  Mode   Score  Units
QSBench.arrays     1000  avgt  19.602  us/op
QSBench.scalar     1000  avgt  27.693  us/op
QSBench.vector     1000  avgt  15.349  us/op

This is raw data for the smallest array size. As you can see, my simple and straightforward implementation is about 22% faster than the state-of-the-art dual-pivot algorithm from the Arrays class, even on relatively small arrays.

But the most astonishing revelation is that the biggest bottleneck of this algorithm is not the buffer or the memory copy. It is the scalar portion of the algorithm.

....[Hottest Regions]...............................................................................
  20.60%  c2, level 4  test.VectorizedQuickSort::vectorPartition, version 7, compile id 1161 
  17.78%  c2, level 4  test.VectorizedQuickSort::vectorPartition, version 7, compile id 1161 
  16.72%  c2, level 4  test.VectorizedQuickSort::innerSort, version 2, compile id 1166 
  16.55%  c2, level 4  test.VectorizedQuickSort::innerSort, version 2, compile id 1166 
  10.60%  c2, level 4  test.VectorizedQuickSort::innerSort, version 2, compile id 1166 
   9.14%  c2, level 4  test.VectorizedQuickSort::vectorPartition, version 7, compile id 1161 
   1.38%  c2, level 4  test.QSBench::setup, version 6, compile id 1170         

I inspected the assembly printed in perfasm mode, and items 3, 4, and 5 are actually the inlined scalar sorting of the smaller sub-arrays.

The next step could be implementing a masked version for smaller subarrays to improve performance.
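Such a masked tail can be sketched as a standalone demo like this (an untested idea, not measured code; names are my own, it assumes the tail fits into one vector, i.e. SPECIES.length() is at least the tail length, and it needs --add-modules jdk.incubator.vector). Whether it actually beats the scalar loop for a handful of elements is exactly what would need benchmarking:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class MaskedTailDemo {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_MAX;

    //partition a tail shorter than one vector without any scalar loop;
    //returns the number of elements below the pivot
    static int maskedTail(int[] arr, int from, int to, int pivot, int[] tmp) {
        //only lanes that fall inside [from, to) are active
        VectorMask<Integer> inRange = SPECIES.indexInRange(from, to);
        IntVector vec = IntVector.fromArray(SPECIES, arr, from, inRange);
        //an element goes left only if it is both in range and below the pivot
        VectorMask<Integer> lt = vec.lt(pivot).and(inRange);
        VectorMask<Integer> gte = lt.not().and(inRange);
        int ltCount = lt.trueCount();
        //compress packs the selected lanes to the front; store just that prefix
        vec.compress(lt).intoArray(arr, from, SPECIES.indexInRange(0, ltCount));
        vec.compress(gte).intoArray(tmp, 0, SPECIES.indexInRange(0, gte.trueCount()));
        return ltCount;
    }

    public static void main(String[] args) {
        int[] arr = {7, 2, 9, 1};
        int[] tmp = new int[arr.length];
        int below = maskedTail(arr, 0, arr.length, 5, tmp);
        System.out.println(below + " elements below the pivot");
    }
}
```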
