A dive into spatial search algorithms

geophile · on April 28, 2017

This discussion is incomplete without mentioning space-filling curves. These curves map the data into a 1-d space and then you can use any conventional search structure -- sorted array with binary search, balanced tree, skiplist, b-tree, etc. This allows you great flexibility, including the use of 3rd party libraries of data structures.

(A quadtree is basically the z-order space-filling curve combined with a trie of degree 4. Octtree -- degree 8.)

This approach also handles non-point data very cleanly. A non-point object gives rise to multiple index entries. (You can tune the number of index entries quite easily.)

Space-filling curves also leads to a very nice kNN algorithm: To find the k nearest neighbors of p, first search the index for p. Find k neighbors along the space-filling curve, (these are adjacent in the index), and of these, pick the point q maximizing distance(p, q). All k nearest neighbors of p must be in a circle centered at p with radius distance(p, q). Now search your index again using this circle and pick the k points n1, ..., nk minimizing distance(p, ni).

robrenaud · on April 28, 2017

Yeah, the effectiveness of that idea mad me a little sad when I worked on google maps. I spent all this time studying computational geometry in college and it went out the window. Use space filling curves and then standard text indexing techniques and you don't need to deal with any fancy geometric algorithms.

amelius · on April 28, 2017

Space filling curves, although elegant, are not a solution for every problem in computational geometry of course.

bane · on April 29, 2017

I'd really enjoy reading anything you're aware of on this topic.

tps5 · on April 28, 2017

Do you have any references to recommend on this subject?

geophile · on April 29, 2017

I've been out of it for a while, but the wikipedia page on z-order is probably a good place to start. https://en.wikipedia.org/wiki/Z-order_curve. This discusses the basic idea of using a space-filling curve to map data to 1-d, and then using a conventional search structure. Hilbert curves should preserve proximity better.

Sorry to toot my own horn, but I wrote some papers that should be useful to someone learning the topic:

1) Algorithms based on space-filling curves for spatial join, a generalization of range search, points in a polygon, etc.: Orenstein & Manola, IEEE Transactions on Software Engineering, Volume 14.

2) How to tune the index for non-point spatial objects: Orenstein, ACM SIGMOD 1989.

You might also find my software, which implements these ideas, useful: https://github.com/geophile

There have been many papers elaborating on the idea, but I haven't kept up with the area. Given that you can get such good results using space-filling curves and standard data structures, I think it's nuts to do anything else. I mean look at Postgres and MySQL. The spatial indexes in these systems have never been fully integrated because they are distinct index structures, always lagging behind progress with the main index structures.

anarazel · on April 29, 2017

In postgres not much has happened around geospatial in core, because there's a lot happening in postgis.

FWIW, you can define additional types of index access methods in postgres (from scratch), or you can gist / gin / spgist access methods where you need to care about a lot less. All are used in the wild.

vvanders · on April 28, 2017

Along similar lines Real-Time Collision Detection[1] is a tour-de-force in spatial partitioning algorithms. It's basically data-structures for 2D/3D along with some really good coverage of common floating point numerical issues and cache friendly algorithms.

Easily the in the top 3 technical books I've purchased. Each page is solid gold and very little space is wasted.

[1] - https://www.amazon.com/Real-Time-Collision-Detection-Interac...

mattnewport · on April 28, 2017

Coming from a games background, I've not heard of an R tree before. Quadtrees (or octtrees for 3D) are probably the most common spatial query data structure used in game engines. I'm surprised they weren't mentioned in the article. Gonna have to do some research on what the pros and cons of an R tree are vs a quadtree now.

jandrewrogers · on April 28, 2017

R-trees are relatively slow, particularly if you have a very high mutation rate. Their sole redeeming feature is that they can handle non-point geometries with constant space complexity whereas quad-trees have pathological space complexity for geometries like polygons.

If you are just dealing with points, an R-tree is a poor algorithm choice. Quad-trees are pretty close to ideal in terms of performance for that use case.

dlubarov · on April 28, 2017

Octrees can be used for non-point geometries too. You just have to store some geometries at non-leaf nodes. When inserting, you search the tree for the deepest voxel which encapsulates the entire geometry (or, equivalently, its bounding box). At least that's how I've always done it.

The advantage of R-trees and kd-trees is that you can partition the space more intelligently. With octrees, a voxel is always divided into eight equal children, even if one of the eight children will contain all the points/geometries. It's usually not a big problem, but if you have a few small objects with lots of detail and far from the origin, you might get a poorly-balanced octree with some very deep branches.

As you mention, R-trees and kd-trees have slow insertion, but if you're path tracing a static scene and the rendering is going to take hours, you don't really care about that.

norkakn · on April 28, 2017

I imagine that you are working in a domain where it is true that R-Trees are slow and only have the one redeeming quality. Anything that I've done involving things on a screen have performed better with quadtrees for me, so I'm not discounting your experience at all.

In other domains though, R-Trees have many other advantages, and often perform better.

http://www.dpi.inpe.br/livros/bdados/artigos/oracle_r_tree.p...

"For these datasets, R-trees consistently outperformed Quadtrees by 2-3 times" ... "In short, a quadtree could be recommended for update-intensive applications using simple poly-gon geometries, high concurrency update databases, or when specialized masks such as touch are frequently used in queries. However, users have to fine-tune the tiling level to obtain best performance. R-trees, which do not require any such tuning, could be used in all other cases to obtain nearly equivalent or better performance."

phkahler · on April 28, 2017

I've found octtrees quite useful for geometric primitives. There is the pathological case of long thin objects, but I place those in a large number of small nodes rather than one large node. Insert and delete are generally O(log n).

santaclaus · on April 28, 2017

> quad-trees have pathological space complexity for geometries like polygons

Any idea what is used in the wild for closest point queries against non-point data (triangle meshes or otherwise)?

geophile · on April 28, 2017

Quadtrees are fine. Just don't approximate the polygon to the limit of resolution of the space. I wrote a paper many years ago on a related problem. The conclusion was that a very small number of quads would be optimal. Too many is obviously a problem; too few and you have to filter out a lot of false negatives.

The paper was in ACM SIGMOD 89, Redundancy in Spatial Databases.

greg7mdp · on April 28, 2017

You should check out the Churchill Challenge (http://churchillnavigation.com/challenge/) and the 54 solutions (source code available at https://github.com/churchillnavigation/challenge1). I came in at #7.

Beyond the classic spatial search data structures, the fastest solutions all used SSE instructions (see https://github.com/sDessens/churchill-challange for nice writeup). People did try pretty hard to win the grand prize of $5000, and the fastest solution was 7 times faster than the best optimized code that Churchill's team had previously.

Boothroid · on April 28, 2017

In my experience most of the work in this area has been done in that you will rarely find yourself having to think seriously about spatial indexing, and even when the task forces you to, most of the tools one would commonly choose between have different indexing options implemented as standard.

neel8986 · on April 28, 2017

Nice algorithm. I am curious what happens when the dimension becomes really large (more than 100). I have heard about kd tree but as far as I know the performance becomes really bad for d > 16. What should be the strategy of finding nearest neighbours for such high dimensions?

snovv_crash · on April 28, 2017

It often depends on what the actual dimensionality of the problem is. If it can be embedded in a lower dimensional space, as many problems are, then a well tuned kD-Tree can still do a good job up to about 16 dimensions of used space, as you say. Cover trees also work well in this case, and are better at extracting just the salient dimensions than kD-Trees are, especially as the discrepancy between the subspace and data space grows.

If the dimensionality gets larger then a better approach is to instead remove overhead and improve parallelism, for example by pushing the problem to the GPU or discretizing the measurement space down to 8-bits or even 1-bit per dimension. I know of products that do both, and solve O(N^2) problems 20x larger than a naïve approach could in the same time. POPCNT runs really fast on GPUs.

MrQuincle · on April 29, 2017

Euclidean?

Then nearest neighbor itself is rarely what you want: https://link.springer.com/chapter/10.1007%2F3-540-49257-7_15. Quote from abstract:

"We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point"

In the end - in the physical world - everything maps to 3D give or take a few dimensions that are very small if we have to believe some physicists. You just have to find the mapping. :-)

argonaut · on April 29, 2017

For large dimensions, a normal linear search.