Question
I want to use a data structure for sorting space-time data (x,y,z,time).
Currently a processing algorithm searches a set of 4D (x,y,z,time) points, given a spherical (3D) spatial radius and a linear (1D) time radius, marking for each point which other points are within those radii. The reason is that after processing, I can ask any 4D point for all of its neighbours in O(1) time.
However in some common configurations of space and time radii, the first run of the algorithm takes about 12 hours. Believe it or not, that's actually fast compared to what exists in our industry. Nevertheless, I want to help speed up the initial runs and so I want to know: Is a kd-tree suitable for 4D space-time data?
Note that I am not looking for implementations of nearest-neighbour search or k-nearest-neighbours search.
More Info:
An example dataset has 450,000 4D points.
Some datasets are time-dense so ordering by time certainly saves processing, but still leads to many distance checks.
Time is represented by Excel-style dates, with values typically ranging from about 30,000 to 39,000. The space coordinates are sometimes higher values, sometimes lower values, but the extent of each space coordinate is similar to that of time (e.g. maxX-minX ~ maxT-minT).
Even more info:
I thought I'd add some more slightly irrelevant data in case anybody has dealt with a similar dataset.
Basically I'm working with data that represents space-time events that are recorded and corroborated by multiple sensors. Error is involved, so only events that meet an error threshold are included.
The time span of these datasets ranges between 5-20 years of data.
For the really old data (>8 years old), the events were often very spatially dense for two reasons: 1) there were relatively few sensors available back then, and 2) the sensors were placed close together so that nearby events could be properly corroborated with low error. Events further away could be recorded, but they had too high an error.
For the newer data (<8 years old), the events are often very time dense, for the inverse reasons: 1) there are usually many sensors available, and 2) the sensors are placed at regular intervals over a larger distance.
As a result, the datasets cannot typically be said to be only time-dense or only spatially dense (except in the case of datasets that contain only new data).
Conclusion
I clearly should be asking more questions on this site.
I will be testing out several solutions over the next while, which will include the 4D kd-tree, a 3D kd-tree followed by a time distance check (suggested by Drew Hall), and the current algorithm I have. Another data structure has also been suggested to me, the TSP (time space partitioning) tree, which uses an octree for space and a BSP on each node for time, so I may test that as well.
Assuming I remember, I'll be sure to post some profiling benchmarks on different time/space radii configurations.
Thanks all
Answer 1:
To expand a little bit on my comments to an answer above:
According to the literature, kd-trees require data with Euclidean coordinates. They are probably not strictly necessary, but they're certainly sufficient: guaranteeing that all coordinates are Euclidean ensures that the normal rules of space apply, and makes it possible to easily partition points by their location and build up the tree structure.
Time is a little bit strange. Under the rules of special relativity, you use a Minkowski metric, not a standard Euclidean metric, when you're working with time coordinates. This causes all kinds of problems (most severe among them destroying the meaning of "simultaneity"), and generally makes people afraid of time coordinates. That fear is not well-founded, though, because unless you know you're working on physics, your time coordinate almost certainly actually will be Euclidean in practice.
What does it mean for a coordinate to be Euclidean? It should be independent of all the other coordinates. Saying time is a Euclidean coordinate means that you can answer the question "Are these two points close together in time?" by looking only at their time coordinates, and ignoring any extra information. It's easy to see why not having that property might break a scheme that partitions points by the values of their coordinates; if two points can have radically different time coordinates but still be considered "close in time", then a tree which sorts them by time coordinate is not going to work very well.
An example of a Euclidean time coordinate would be any time specified in a single, consistent time zone (like UTC times). If you have two clocks, one in New York and one in Tokyo, you know that if you have two measurements labelled "12:00 UTC" then they were taken at the same time. But if the measurements are taken in local time, so one says "12:00 New York time" and one is "12:00 Tokyo time", you have to use extra information about the locations and time zones of the cities to figure out how much time elapsed between the two measurements.
So as long as your time coordinate is consistently measured and sane, it will be Euclidean, and that means it will work just fine in a kd-tree or similar data structure.
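The New York and Tokyo clock example can be sketched in a few lines. This is only an illustration: the date and the fixed UTC offsets (-5 and +9 hours) are assumptions made for the example, and daylight saving time is ignored.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for illustration only (New York standard time, Tokyo).
# Real code would need a time zone database to handle daylight saving.
NEW_YORK = timezone(timedelta(hours=-5))
TOKYO = timezone(timedelta(hours=9))

# Two measurements with the same local label "12:00" on a hypothetical date.
ny_noon = datetime(2009, 1, 15, 12, 0, tzinfo=NEW_YORK)
tokyo_noon = datetime(2009, 1, 15, 12, 0, tzinfo=TOKYO)

# Timezone-aware datetimes subtract as instants, so the hidden gap shows up.
gap = ny_noon - tokyo_noon

# Converting to UTC gives a time coordinate that can be compared on its own,
# without knowing where each measurement was taken.
ny_utc = ny_noon.astimezone(timezone.utc)
tokyo_utc = tokyo_noon.astimezone(timezone.utc)
```

Once every timestamp is expressed in one consistent zone like this, the time coordinate can be split and compared like any spatial axis.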
Answer 2:
If you stored an index to your points sorted in the time dimension, couldn't you first perform an initial pruning in the 1D time dimension, thus reducing the number of distance calculations? (Or is that an oversimplification?)
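A minimal sketch of that pruning idea, assuming the points are plain (x, y, z, t) tuples held in memory. The function name `neighbours_by_time_index` and the sample data are made up for illustration and are not the asker's actual code.

```python
import bisect
import math

def neighbours_by_time_index(points, r_space, r_time):
    """points: list of (x, y, z, t). Returns one list of neighbour indices per point."""
    # Index of the points sorted by time; the original order is left untouched.
    order = sorted(range(len(points)), key=lambda i: points[i][3])
    times = [points[i][3] for i in order]
    result = [[] for _ in points]
    for i, (x, y, z, t) in enumerate(points):
        # Binary search for the slice of the index inside [t - r_time, t + r_time].
        lo = bisect.bisect_left(times, t - r_time)
        hi = bisect.bisect_right(times, t + r_time)
        for j in order[lo:hi]:
            if j == i:
                continue
            # Only the survivors of the time pruning pay for a distance check.
            if math.dist((x, y, z), points[j][:3]) <= r_space:
                result[i].append(j)
    return result

points = [
    (0.0, 0.0, 0.0, 100.0),
    (1.0, 0.0, 0.0, 101.0),   # close to point 0 in space and time
    (1.0, 0.0, 0.0, 500.0),   # close in space, far in time
    (50.0, 0.0, 0.0, 100.5),  # close in time, far in space
]
neighbours = neighbours_by_time_index(points, r_space=2.0, r_time=5.0)
```

On time-dense data the time window still contains many points, so this reduces the work but every point inside the window still gets a distance check.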
Answer 3:
You haven't really given enough information to answer this.
But sure, in general kd-trees are perfectly suitable for 4 (or 5 or 6 or...) dimensional data --- if the spatial (or in your case space/time-ial) distribution lends itself to kd-tree decomposition. In other words, it depends (sound familiar?).
kd-trees are just one method of spatial decomposition that lends itself to certain localized searches. As you go to higher dimensions, the curse of dimensionality rears its head, of course, but 4D isn't too bad (you probably want at least several hundred points, though).
In order to know if this will work for you, you have to analyse some other criteria. Is approximate NN search good enough? (This can help a lot.) Is tree balancing likely to be expensive? And so on.
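As a rough illustration of a 4D kd-tree applied to the query in the question, the sketch below splits on x, y, z and time in turn and prunes with a per-axis radius, which is one way to handle the separate space and time radii. The names `build` and `query` and the sample points are hypothetical. The build re-sorts at every level for brevity rather than speed.

```python
import math

def build(points, idx, depth=0):
    """Build a kd-tree over point indices, cycling through the 4 axes."""
    if not idx:
        return None
    axis = depth % 4
    idx = sorted(idx, key=lambda i: points[i][axis])
    mid = len(idx) // 2
    return (idx[mid], axis,
            build(points, idx[:mid], depth + 1),
            build(points, idx[mid + 1:], depth + 1))

def query(node, points, q, r_space, r_time, out):
    """Append to out every index within r_space (3D) and r_time (1D) of q."""
    if node is None:
        return
    i, axis, left, right = node
    p = points[i]
    if abs(p[3] - q[3]) <= r_time and math.dist(p[:3], q[:3]) <= r_space:
        out.append(i)
    # The pruning radius depends on which axis this node splits on.
    radius = r_time if axis == 3 else r_space
    diff = q[axis] - p[axis]
    if diff <= radius:       # the query range reaches the lower half
        query(left, points, q, r_space, r_time, out)
    if diff >= -radius:      # the query range reaches the upper half
        query(right, points, q, r_space, r_time, out)

points = [
    (0.0, 0.0, 0.0, 10.0),
    (0.5, 0.5, 0.0, 11.0),
    (0.5, 0.5, 0.0, 40.0),
    (9.0, 9.0, 9.0, 10.0),
    (0.0, 1.0, 0.0, 9.0),
]
tree = build(points, list(range(len(points))))
found = []
query(tree, points, points[0], r_space=1.5, r_time=2.0, out=found)
found = sorted(i for i in found if i != 0)  # drop the query point itself
```

Whether this beats a simpler scheme depends on the data distribution, so it would need profiling on real datasets.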
Answer 4:
If your data is relatively time-dense (and relatively space-sparse), it might work best to use a 3d kd-tree on the spatial dimensions, then simply reject the points that are outside the time window of interest. That would get around your mixed space/time metric problem, at the expense of a slightly more complex point struct.
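One possible reading of this suggestion, as a sketch: build the tree on x, y, z only, run a spatial ball query, then filter the candidates by time. The helper names (`build_3d`, `spatial_ball`, `neighbours`) and the sample data are invented for the example.

```python
import math

def build_3d(points, idx, depth=0):
    """kd-tree over x, y, z only; the time coordinate is never split on."""
    if not idx:
        return None
    axis = depth % 3
    idx = sorted(idx, key=lambda i: points[i][axis])
    mid = len(idx) // 2
    return {"i": idx[mid], "axis": axis,
            "lo": build_3d(points, idx[:mid], depth + 1),
            "hi": build_3d(points, idx[mid + 1:], depth + 1)}

def spatial_ball(node, points, q, r_space, out):
    """Append to out every index whose (x, y, z) lies within r_space of q."""
    if node is None:
        return
    p = points[node["i"]]
    if math.dist(p[:3], q[:3]) <= r_space:
        out.append(node["i"])
    diff = q[node["axis"]] - p[node["axis"]]
    if diff <= r_space:
        spatial_ball(node["lo"], points, q, r_space, out)
    if diff >= -r_space:
        spatial_ball(node["hi"], points, q, r_space, out)

def neighbours(points, r_space, r_time):
    tree = build_3d(points, list(range(len(points))))
    result = []
    for i, q in enumerate(points):
        candidates = []
        spatial_ball(tree, points, q, r_space, candidates)
        # Reject the point itself and candidates outside the time window.
        result.append(sorted(j for j in candidates
                             if j != i and abs(points[j][3] - q[3]) <= r_time))
    return result

points = [
    (0.0, 0.0, 0.0, 100.0),
    (1.0, 0.0, 0.0, 100.2),   # near point 0 in space and time
    (0.0, 1.0, 0.0, 130.0),   # near in space, outside the time window
    (20.0, 0.0, 0.0, 100.1),  # inside the time window, far in space
]
result = neighbours(points, r_space=2.0, r_time=1.0)
```

The time check here is a plain comparison per candidate, so the space and time radii never have to be combined into one metric.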
Source: https://stackoverflow.com/questions/788005/is-a-kd-tree-suitable-for-4d-space-time-data-x-y-z-time