The original description of the k-d tree recognized that rebalancing techniques, such as are used to build an AVL tree or a red-black tree, are not applicable to a k-d tree. Hence, in order to build a balanced k-d tree, it is necessary to find the median of the data for each recursive subdivision of those data. The sort or selection that is used to find the median for each subdivision strongly influences the computational complexity of building a k-d tree. This paper discusses an alternative algorithm that builds a balanced k-d tree by presorting the data in each of k dimensions prior to building the tree. It then preserves the order of these k sorts during tree construction and thereby avoids the requirement for any further sorting. Moreover, this algorithm is amenable to parallel execution via multiple threads. Compared to an algorithm that finds the median for each recursive subdivision, this presorting algorithm has equivalent performance for four dimensions and better performance for three or fewer dimensions.
Importance of k-d Trees: The k-d tree is an important data structure introduced by Bentley in 1975 for storing k-dimensional data, with widespread applications in multidimensional search, nearest neighbor queries, range queries, and other scenarios.
Challenges of Balancing: Unlike standard binary search trees, k-d trees partition on a different key at each level. A rotation of the kind used in AVL or red-black trees would move nodes to levels that partition on a different dimension and so break the ordering invariant, which makes these rebalancing techniques inapplicable to k-d trees.
Limitations of Existing Methods:
Traditional methods must find the median at each recursive subdivision
Quicksort-based median finding: O(n) in the best case, O(n²) in the worst case
Merge sort or heapsort: guaranteed O(n log n) per median, but an overall build complexity of O(n log² n)
Blum et al.'s O(n) median-of-medians algorithm gives an O(n log n) build in theory but is complex to implement
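For comparison, the median-by-sorting baseline listed above can be sketched as follows. This is a minimal illustration rather than code from the paper: the `Node` class and the `build_by_sorting` name are hypothetical, points are assumed to be equal-length tuples, and Python's built-in `sorted` stands in for merge sort or heapsort.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    """Hypothetical k-d tree node used only for this sketch."""
    point: Point
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_by_sorting(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a balanced k-d tree by re-sorting at every subdivision."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the k dimensions
    # An O(n log n) sort at each of the O(log n) levels is what
    # produces the O(n log^2 n) overall build cost.
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        build_by_sorting(ordered[:mid], depth + 1),
        build_by_sorting(ordered[mid + 1:], depth + 1),
    )
```

Replacing the `sorted` call with a linear-time selection would remove the extra logarithmic factor, at the cost of a more involved implementation.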
Input: A set of n k-dimensional data points
Output: A balanced k-d tree supporting efficient multidimensional search operations
Constraints: Maintain tree balance and avoid duplicate data points
Algorithm Flow:
1. Select the median element of the index array sorted on the current dimension as the partition point
2. Partition the index arrays of the other dimensions about this partition point
3. Perform the partition so that the sorted order within each array is preserved
4. Recurse on the left and right sub-arrays, cycling through the dimensions
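The four steps above can be sketched in Python as follows. This is an illustrative reading of the summary, not the paper's implementation: the names `Node`, `super_key`, `build_presorted`, and `_build` are hypothetical, duplicates are dropped with `set`, and ties on one coordinate are assumed to be broken by comparing the remaining coordinates in cyclic order.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    """Hypothetical k-d tree node used only for this sketch."""
    point: Point
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def super_key(p: Point, axis: int) -> Point:
    """Coordinates rotated to start at `axis`, used to break ties."""
    k = len(p)
    return tuple(p[(axis + i) % k] for i in range(k))


def build_presorted(points: Sequence[Point]) -> Optional[Node]:
    """Sort once per dimension, then build without any further sorting."""
    pts = sorted(set(points))  # assumption: duplicate points are discarded
    if not pts:
        return None
    k = len(pts[0])
    # One index array per dimension, each sorted on that dimension's key.
    orders = [
        sorted(range(len(pts)), key=lambda i, a=a: super_key(pts[i], a))
        for a in range(k)
    ]
    return _build(pts, orders, 0)


def _build(pts: List[Point], orders: List[List[int]], depth: int) -> Optional[Node]:
    axis = depth % len(orders)
    current = orders[axis]
    if not current:
        return None
    # Step 1: the median of the current dimension's index array.
    median = current[len(current) // 2]
    mkey = super_key(pts[median], axis)
    # Steps 2 and 3: split every index array about the median. Filtering
    # in a single pass keeps each array in its existing sorted order.
    lefts = [[i for i in o if super_key(pts[i], axis) < mkey] for o in orders]
    rights = [[i for i in o if super_key(pts[i], axis) > mkey] for o in orders]
    # Step 4: recurse on both halves using the next dimension.
    return Node(
        pts[median],
        _build(pts, lefts, depth + 1),
        _build(pts, rights, depth + 1),
    )
```

Because the points are unique, no two of them share a super key, so the strict comparisons place every point other than the median in exactly one half.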
This paper cites 21 important references, covering:
Bentley's original k-d tree paper [4]
Blum et al.'s linear-time median algorithm [6]
Classical sorting algorithm literature [8, 12, 20]
Related work on parallel computing and performance modeling [2, 10]
Applications in nearest neighbor search and reverse nearest neighbor search [7, 13]
Overall Assessment: This is a high-quality algorithmic paper that proposes a pre-sorting method for constructing balanced k-d trees. The paper combines rigorous theoretical analysis with a comprehensive experimental design and has high practical value. Its advantage is confined to low dimensionality: it matches median-finding algorithms at four dimensions and outperforms them only at three or fewer. Within that range it provides an effective solution for spatial data processing and holds significant reference value for related fields.