Vectorized QuickSort in JDK 21
This article explores the potential of the Vector API, available as an incubating API in JDK 21, to accelerate the classic QuickSort algorithm. Sorting is a frequent task in CPU-bound applications, so any speedup here has real practical value.
QuickSort works by selecting a pivot element and rearranging the array such that elements less than the pivot are placed to its left and elements greater than the pivot are placed to its right. This effectively partitions the array, allowing for a divide-and-conquer approach where the left and right sub-arrays can be sorted recursively.
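For reference, a minimal scalar partition in the Lomuto style looks like the sketch below. This is my sketch, not the author's code, but it matches the scalarPartition helper referenced later in the vectorized class and shows the baseline the vectorized version competes against.

    private int scalarPartition(int[] arr, int low, int high) {
        int pivot = arr[high];          // last element as the pivot, same choice as the vectorized path
        int lowPos = low;               // next free slot for elements smaller than the pivot
        for (int j = low; j < high; j++) {
            if (arr[j] < pivot) {
                int t = arr[lowPos];    // swap the smaller element into the "less than" region
                arr[lowPos] = arr[j];
                arr[j] = t;
                lowPos++;
            }
        }
        int t = arr[lowPos];            // move the pivot into its final position
        arr[lowPos] = arr[high];
        arr[high] = t;
        return lowPos;
    }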
The Vector API leverages Single Instruction Multiple Data (SIMD) instructions, enabling the comparison and manipulation of multiple data elements simultaneously. Modern CPUs with AVX-512 support can compare, and with a few extra instructions rearrange, up to 16 32-bit integer or floating-point lanes at a time, far exceeding what scalar (single-element) operations can do.
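A quick way to see what vector width the Vector API will use on a given machine is to query the species directly. This small probe is not part of the article's code and needs the incubator module enabled (for example with --add-modules jdk.incubator.vector):

    import jdk.incubator.vector.IntVector;

    public class VectorWidthProbe {
        public static void main(String[] args) {
            // On an AVX-512 machine SPECIES_MAX typically reports 512 bits = 16 int lanes;
            // on AVX2 it reports 256 bits = 8 lanes.
            System.out.println("vector bits: " + IntVector.SPECIES_MAX.vectorBitSize());
            System.out.println("int lanes:   " + IntVector.SPECIES_MAX.length());
        }
    }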
However, applying the Vector API to QuickSort presents some challenges: partitioning is data-dependent, so the elements selected by each comparison have to be compacted within the vector, the partition must work in place, and the tail that does not fill a whole vector still needs scalar handling.
To deal with these challenges and keep the implementation simple, the approach below compresses each vector against the pivot, writes the smaller elements back into the array immediately, and flushes the "greater or equal" elements into a temporary buffer that is copied back at the end.
I will skip the scalar portion of the algorithm since you can easily google it (a standard Lomuto partition, like the sketch above, works fine). I annotated the code and wrote a simple unit test to make sure the regular "Arrays.sort()" and my code produce the same result; a sketch of such a test follows the class below.
    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class VectorizedQuickSort {

        // maximum available vector width on this machine
        private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_MAX;

        private int[] tmp;

        public void sort(int[] arr, int low, int high) {
            // the sort is recursive, so the buffer allocation is moved out of it
            tmp = new int[arr.length];
            // actual sort
            innerSort(arr, low, high);
        }

        void innerSort(int[] arr, int low, int high) {
            if (low < high) {
                int pivot = 0;
                // if the subarray is at least one vector long
                if ((high - low + 1) / SPECIES.length() >= 1) {
                    pivot = vectorPartition(arr, low, high);
                } else {
                    // otherwise the regular scalar partitioner
                    pivot = scalarPartition(arr, low, high);
                }
                innerSort(arr, low, pivot - 1);
                innerSort(arr, pivot + 1, high);
            }
        }

        private int vectorPartition(int[] arr, int low, int high) {
            int pivot = arr[high];
            // next write position for the bigger elements in the buffer
            int highPos = 0;
            // the tail of the smaller elements in the original array
            int lowPos = low;
            // current position we check
            int cPos = low;
            // while we have more full strides
            while (cPos + SPECIES.length() < high) {
                // populate the vector
                IntVector vec = IntVector.fromArray(SPECIES, arr, cPos);
                // compare with the pivot and get a mask
                VectorMask<Integer> lessThan = vec.lt(pivot);
                // compress the smaller elements into a separate vector
                IntVector lt = vec.compress(lessThan);
                // compress the bigger elements into a separate vector
                IntVector gte = vec.compress(lessThan.not());
                // flush the smaller elements back to the array after the current tail
                lt.intoArray(arr, lowPos);
                // calling trueCount causes quite a lot of logic to be executed,
                // so the result is cached in a local variable
                int trueCount = lessThan.trueCount();
                // advance the tail by the number of elements flushed
                lowPos += trueCount;
                // flush the bigger elements into the buffer
                gte.intoArray(tmp, highPos);
                // update the buffer counter
                highPos += SPECIES.length() - trueCount;
                // update the current position
                cPos += SPECIES.length();
            }
            // this is the small tail which does not fill an entire vector;
            // the buffer is still used to simplify the last step
            for (int j = cPos; j < high; j++) {
                if (arr[j] < pivot) {
                    arr[lowPos++] = arr[j];
                } else {
                    tmp[highPos++] = arr[j];
                }
            }
            // since the bigger elements are in the buffer we don't need to swap
            arr[lowPos] = arr[high];
            int result = lowPos;
            int j = 0;
            // flush the buffer back to the array
            while (j < highPos) {
                arr[++lowPos] = tmp[j++];
            }
            return result;
        }
    }
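The unit test itself is not shown in the article; a minimal version, assuming JUnit 5 and the class above, could look like this:

    import java.util.Arrays;
    import java.util.Random;
    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.assertArrayEquals;

    class VectorizedQuickSortTest {

        @Test
        void sortsLikeArraysSort() {
            Random rnd = new Random(42);
            int[] actual = rnd.ints(10_000).toArray();
            int[] expected = actual.clone();

            Arrays.sort(expected);
            new VectorizedQuickSort().sort(actual, 0, actual.length - 1);

            assertArrayEquals(expected, actual);
        }
    }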
Surprisingly, calling trueCount() on a vector mask is quite a heavy operation, and the JIT does not optimize away repeated invocations, so storing the result in a local variable improves performance a bit.
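In other words, the loop body avoids the naive pattern below (a before/after illustration, not a snippet from the article):

    // calling trueCount() twice per stride - slower in practice:
    lowPos  += lessThan.trueCount();
    highPos += SPECIES.length() - lessThan.trueCount();

    // hoisting the call into a local - what vectorPartition actually does:
    int trueCount = lessThan.trueCount();
    lowPos  += trueCount;
    highPos += SPECIES.length() - trueCount;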
To compare the performance of the QuickSort implementations I wrote a simple JMH benchmark. Keep in mind that the algorithm sorts in place, so the setup method has to run per invocation, which adds quite a lot of noise when the array is small. Here are the results (a sketch of the benchmark follows below the table):
Benchmark        (size)  Mode   Score  Units
QSBench.arrays     1000  avgt  19.602  us/op
QSBench.scalar     1000  avgt  27.693  us/op
QSBench.vector     1000  avgt  15.349  us/op
These are the raw numbers for the smallest array size. As you can see, this simple and straightforward implementation is about 22% faster than the state-of-the-art dual-pivot algorithm behind Arrays.sort(), even on relatively small arrays.
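The benchmark itself is not shown in the article; a harness consistent with the reported benchmark names and the per-invocation setup could look roughly like this (the exact JMH options, the random seed, and the omitted scalar-only benchmark are assumptions):

    import java.util.Random;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @State(Scope.Thread)
    public class QSBench {

        @Param({"1000"})
        int size;

        int[] data;

        // the sort mutates the array in place, so a fresh copy is needed for every call
        @Setup(Level.Invocation)
        public void setup() {
            data = new Random(42).ints(size).toArray();
        }

        @Benchmark
        public int[] arrays() {
            java.util.Arrays.sort(data);
            return data;
        }

        @Benchmark
        public int[] vector() {
            new VectorizedQuickSort().sort(data, 0, data.length - 1);
            return data;
        }

        // the "scalar" benchmark would call a plain scalar quicksort the same way (omitted here)
    }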
But the most astonishing revelation is that the biggest bottleneck of this algorithm is not the buffer or the memory copies: it is the scalar portion of the algorithm.
....[Hottest Regions]...............................................................................
20.60% c2, level 4 test.VectorizedQuickSort::vectorPartition, version 7, compile id 1161
17.78% c2, level 4 test.VectorizedQuickSort::vectorPartition, version 7, compile id 1161
16.72% c2, level 4 test.VectorizedQuickSort::innerSort, version 2, compile id 1166
16.55% c2, level 4 test.VectorizedQuickSort::innerSort, version 2, compile id 1166
10.60% c2, level 4 test.VectorizedQuickSort::innerSort, version 2, compile id 1166
9.14% c2, level 4 test.VectorizedQuickSort::vectorPartition, version 7, compile id 1161
1.38% c2, level 4 test.QSBench::setup, version 6, compile id 1170
I inspected the assembly printed in perfasm mode, and regions 3, 4, and 5 (the innerSort entries) are actually the inlined scalar sorting of the smaller subarrays.
The next step could be implementing a masked version of the partition for the smaller subarrays to improve performance further.
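For illustration, a masked variant of the tail handling inside vectorPartition could look like the sketch below. This is not the article's code, just one way the idea might be expressed with the JDK 21 Vector API:

    // handle the final partial stride (arr[cPos..high-1]) with a mask instead of a scalar loop
    VectorMask<Integer> inRange = SPECIES.indexInRange(cPos, high);        // lanes that map to valid indices
    IntVector vec = IntVector.fromArray(SPECIES, arr, cPos, inRange);      // masked load, unset lanes read as 0
    VectorMask<Integer> lessThan = vec.lt(pivot).and(inRange);
    VectorMask<Integer> gteMask = lessThan.not().and(inRange);

    int ltCount = lessThan.trueCount();
    int gteCount = gteMask.trueCount();

    // compress packs the selected lanes to the front, so only that prefix is stored
    vec.compress(lessThan).intoArray(arr, lowPos, SPECIES.indexInRange(0, ltCount));
    vec.compress(gteMask).intoArray(tmp, highPos, SPECIES.indexInRange(0, gteCount));

    lowPos += ltCount;
    highPos += gteCount;

Whether this actually beats the tight scalar tail loop would need to be measured; the profiler output above suggests the inlined scalar work on small subarrays is where the time goes, which is exactly what a masked path would target.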