Efficient KNN implementation which allows inserts

假装没事ソ 提交于 2019-12-13 18:24:20

问题


Suppose I have multi-dimensional datasets, which have many vectors as data. I am writing an algorithm which needs to do k nearest neighbour searches for all those vectors - classical KNN. However, during my algorithm I add new vectors to the overall dataset and need to include those new vectors into my KNN search. I want to do that efficiently. I looked into KD tree and ball tree of scikit-learn, but they don't allow inserts (by the nature of the concepts). I am not sure whether SR tree or R tree would provide inserts, but in any case, I was not able to find a python implementation for data beyond 3D.

Regarding the search I am fine with either the query "give me the closest vector" (so 1-NN) or "give me all vectors that are closer then radius".


回答1:


General comment: I don't quite understand why KD-Trees are so popular for high-dimensional kNN queries. In my experience, other trees scale much better with high dimensionality or large datasets (I tested up to 25Million points and (only) up to 40 dimensions). Some more details:

  • KD-Trees: As far as I know, KD-Trees should support insertion at any time, but there is a chance that they get imbalanced. I don't use python, so I don't know why your KD-tree does not support insertion/deletion on the fly.
  • Quadtree: Depending on the dimensionality, you could also use quadtree/octrees, but standard implementations are not good for more than 10 dimensions or so. In the reference above I tested a quadtree with a special 'hypecube' navigation approach. That requires a lot of memory but scales much better with dimensionality in terms of performance.
  • R-Tree/R*Tree: The original R-Trees are not very good with insertion on the fly. However, if you look at R+Trees, (R-Plus-Tree), they are quite fast with reinsertion and kNN queries.
  • PH-Trees have basically the same kNN performance as R+Trees, but much better insertion time, because PH-Trees do not need rebalancing, while having inherently limited depth and nodesize. Unfortunately, implementations gets a lot more complicated for >=64 dimensions (the tree uses one bit of a long integer for each dimensions). I'm not aware of an implementation that supports more than 63 dimensions.

Python:

  • R+Plus trees should be available for Python. If not, you could adapt a normal R-Tree (only the insertion algorithm is different)
  • I heard once of someone starting to implement a PH-Tree in Python, but I haven't seen any open-source variant yet.
  • If you have some time/interest to do your own implementation, you could look at the Java implementations here and translate them to Python. The library contains various multidimensional indexes, except KD-Trees. KD-Tree implementations that allow on-the-fly insertion can be found here and here.


来源:https://stackoverflow.com/questions/45887680/efficient-knn-implementation-which-allows-inserts

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!