Efficient comparison of 100.000 vectors

后端 未结 10 2190
礼貌的吻别
礼貌的吻别 2021-01-31 21:20

I save 100.000 Vectors of in a database. Each vector has a dimension 60. (int vector[60])

Then I take one and want present vectors to the user in order of decreasing sim

10条回答
  •  無奈伤痛
    2021-01-31 21:59

    Update:

    After you made clear that 60 is the dimension of your space, not the length of the vectors, the answer below is not applicable for you, so I'll keep it just for history.


    Since your vectors are normalized, you can employ kd-tree to find the neighbors within an MBH of incremental hypervolume.

    No database I'm aware of has native support of kd-tree, so you can try to implement the following solution in MySQL, if you are searching for a limited number of closest entries:

    • Store the projections of the vectors to each of 2-dimensional space possible (takes n * (n - 1) / 2 columns)
    • Index each of these columns with a SPATIAL index
    • Pick a square MBR of a given area within any projection. The product of these MBR's will give you a hypercube of a limited hypervolume, which will hold all vectors with a distance not greater than a given one.
    • Find all projections within all MBR's using MBRContains

    You'll still need to sort within this limited range of values.

    For instance, you have a set of 4-dimensional vectors with magnitude of 2:

    (2, 0, 0, 0)
    (1, 1, 1, 1)
    (0, 2, 0, 0)
    (-2, 0, 0, 0)
    

    You'll have to store them as follows:

    p12  p13  p14  p23  p24  p34
    ---  ---  ---  ---  ---  ---
    2,0  2,0  2,0  0,0  0,0  0,0
    1,1  1,1  1,1  1,1  1,1  1,1
    0,2  0,0  0,0  2,0  2,0  0,0
    -2,0 -2,0 -2,0 0,0  0,0  0,0
    

    Say, you want similarity with the first vector (2, 0, 0, 0) greater than 0.

    This means having the vectors inside the hypercube: (0, -2, -2, -2):(4, 2, 2, 2).

    You issue the following query:

    SELECT  *
    FROM    vectors
    WHERE   MBRContains('LineFromText(0 -2, 4 2)', p12)
            AND MBRContains('LineFromText(0 -2, 4 2)', p13)
            …
    

    , etc, for all six columns

提交回复
热议问题