Hausdorff distance for a large dataset in the fastest way

Submitted by 前提是你 on 2021-01-22 06:48:11

Question


My dataset has 500,000+ rows. I need the Hausdorff distance between every id and every other id, repeated over the whole dataset.

I have a huge data set. Here is the small part:

df = 

id_easy ordinal latitude    longitude            epoch  day_of_week
0   aaa     1.0  22.0701       2.6685   01-01-11 07:45       Friday
1   aaa     2.0  22.0716       2.6695   01-01-11 07:45       Friday
2   aaa     3.0  22.0722       2.6696   01-01-11 07:46       Friday
3   bbb     1.0  22.1166       2.6898   01-01-11 07:58       Friday
4   bbb     2.0  22.1162       2.6951   01-01-11 07:59       Friday
5   ccc     1.0  22.1166       2.6898   01-01-11 07:58       Friday
6   ccc     2.0  22.1162       2.6951   01-01-11 07:59       Friday

I want to calculate the Hausdorff distance:

import pandas as pd
import numpy as np

from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.distance import pdist, squareform

u = np.array([(2.6685,22.0701),(2.6695,22.0716),(2.6696,22.0722)]) # coordinates of `id_easy` of taxi `aaa`
v = np.array([(2.6898,22.1166),(2.6951,22.1162)]) # coordinates of `id_easy` of taxi `bbb`
directed_hausdorff(u, v)[0]

Output is 0.05114626086039758


Now I want to calculate this distance for the whole dataset, for all id_easys. The desired output is a matrix with 0 on the diagonal (because the distance between aaa and aaa is 0):

     aaa      bbb    ccc
aaa    0  0.05114   ...
bbb    ...   0
ccc             0

Answer 1:


First I define a method that provides some sample data. It would be a lot easier if you provided something like that in the question. In most performance-related problems the size of the real problem is needed to find an optimal solution.

In the following answer I will assume that the average size of id_easy is 17 and that there are 30,000 different ids, which results in a dataset of 510,000 rows.

Create sample data

import numpy as np
import numba as nb

N_ids=30_000
av_id_size=17

#create_data (pre sorting according to id assumed)
lat_lon=np.random.rand(N_ids*av_id_size,2)

#create_ids (sorted array with ids)
ids=np.empty(N_ids*av_id_size,dtype=np.int64)
ind=0
for i in range(N_ids):
    for j in range(av_id_size):
        ids[i*av_id_size+j]=ind
    ind+=1
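
As a side note, the id array above can also be built without the explicit double loop; a vectorized equivalent that produces the same sorted layout is:

#same result as the loop above
ids = np.repeat(np.arange(N_ids, dtype=np.int64), av_id_size)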

Hausdorff function

The following function is a slightly modified version of the scipy source. These modifications were made:

  • For very small input arrays I commented out the shuffling part (enable shuffling on larger arrays and try on your real data what works best)
  • At least on Windows, the Anaconda scipy function appears to have some performance issues (much slower than on Linux); the LLVM-based Numba version looks to be consistent
  • The indices of the Hausdorff pair were removed
  • The distance loop is unrolled for the (N,2) case

    #Modified Code from Scipy-source
    #https://github.com/scipy/scipy/blob/master/scipy/spatial/_hausdorff.pyx
    #Copyright (C)  Tyler Reddy, Richard Gowers, and Max Linke, 2016
    #Copyright © 2001, 2002 Enthought, Inc.
    #All rights reserved.
    
    #Copyright © 2003-2013 SciPy Developers.
    #All rights reserved.
    
    #Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
    #Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
    #Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following 
    #disclaimer in the documentation and/or other materials provided with the distribution.
    #Neither the name of Enthought nor the names of the SciPy Developers may be used to endorse or promote products derived 
    #from this software without specific prior written permission.
    
    #THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, 
    #BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
    #IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, 
    #OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; 
    #OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
    #(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
    @nb.njit()
    def directed_hausdorff_nb(ar1, ar2):
        N1 = ar1.shape[0]
        N2 = ar2.shape[0]
        data_dims = ar1.shape[1]
    
        # Shuffling for very small arrays disabled
        # Enable it for larger arrays
        #resort1 = np.arange(N1)
        #resort2 = np.arange(N2)
        #np.random.shuffle(resort1)
        #np.random.shuffle(resort2)
    
        #ar1 = ar1[resort1]
        #ar2 = ar2[resort2]
    
        cmax = 0
        for i in range(N1):
            no_break_occurred = True
            cmin = np.inf
            for j in range(N2):
                # faster performance with square of distance
                # avoid sqrt until very end
                # Simplification (loop unrolling) for (n,2) arrays
                d = (ar1[i, 0] - ar2[j, 0])**2+(ar1[i, 1] - ar2[j, 1])**2
                if d < cmax: # break out of `for j` loop
                    no_break_occurred = False
                    break
    
                if d < cmin: # always true on first iteration of for-j loop
                    cmin = d
    
            # always true on first iteration of for-j loop, after that only
            # if d >= cmax
            if cmin != np.inf and cmin > cmax and no_break_occurred == True:
                cmax = cmin
    
        return np.sqrt(cmax)
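
As a quick sanity check, the Numba version can be compared against scipy's directed_hausdorff on the small arrays from the question; both calls should return the same value:

from scipy.spatial.distance import directed_hausdorff

u = np.array([(2.6685, 22.0701), (2.6695, 22.0716), (2.6696, 22.0722)])
v = np.array([(2.6898, 22.1166), (2.6951, 22.1162)])

print(directed_hausdorff_nb(u, v))    # ≈ 0.05114626086039758 (the value from the question)
print(directed_hausdorff(u, v)[0])    # ≈ 0.05114626086039758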
    

Calculating Hausdorff distance on subsets

@nb.njit(parallel=True)
def get_distance_mat(def_slice,lat_lon):
    Num_ids=def_slice.shape[0]-1
    out=np.empty((Num_ids,Num_ids),dtype=np.float64)
    for i in nb.prange(Num_ids):
        ar1=lat_lon[def_slice[i]:def_slice[i+1],:]  # all rows belonging to id i
        for j in range(i,Num_ids):
            ar2=lat_lon[def_slice[j]:def_slice[j+1],:]  # all rows belonging to id j
            dist=directed_hausdorff_nb(ar1, ar2)
            out[i,j]=dist
            out[j,i]=dist
    return out

Example and Timings

#def_slice defines the start and end of the slices
_,def_slice=np.unique(ids,return_index=True)
def_slice=np.append(def_slice,ids.shape[0])

%timeit res_1=get_distance_mat(def_slice,lat_lon)
#1min 2s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
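
To get the labeled matrix the question asks for, the result can be wrapped in a DataFrame. A minimal sketch, assuming the question's df is available, sorted by id_easy (so the slice boundaries line up with np.unique's sorted output) and using the (longitude, latitude) column order from the question:

import pandas as pd

df = df.sort_values(['id_easy', 'ordinal'])
lat_lon = df[['longitude', 'latitude']].to_numpy()

unique_ids, def_slice = np.unique(df['id_easy'].to_numpy(), return_index=True)
def_slice = np.append(def_slice, lat_lon.shape[0])

res = get_distance_mat(def_slice, lat_lon)
res_df = pd.DataFrame(res, index=unique_ids, columns=unique_ids)  # e.g. res_df.loc['aaa', 'bbb']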



Answer 2:


You're talking about calculating 500000^2+ distances. If you calculate 1000 of these distances every second, it will take you 7.93 years to complete your matrix. I'm not sure whether the Hausdorff distance is symmetric, but even if it is, that only saves you a factor of two (3.96 years).

The matrix will also take on the order of a terabyte of memory (500,000² ≈ 2.5·10¹¹ entries, about 2 TB as 8-byte floats).

I recommend calculating this only when needed, or if you really need the whole matrix, you'll need to parallelize the calculations. On the bright side, this problem can easily be broken up. For example, with four cores, you can split the problem thusly (in pseudocode):

n = len(u) // 2
m = len(v) // 2
A = hausdorff_distance_matrix(u[:n], v[:m])
B = hausdorff_distance_matrix(u[:n], v[m:])
C = hausdorff_distance_matrix(u[n:], v[:m])
D = hausdorff_distance_matrix(u[n:], v[m:])
results = [[A, B],
           [C, D]]

Here hausdorff_distance_matrix(u, v) returns all pairwise distances between the elements of u and v. You'll probably need to split the problem into a lot more than four segments, though.
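
One way to spread those blocks over several cores is the multiprocessing module. The sketch below is illustrative (hausdorff_block and the toy tracks are not from the answer above); it computes the four blocks from the pseudocode in parallel, using the symmetric Hausdorff distance (the maximum of the two directed distances):

import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import directed_hausdorff

def hausdorff_block(args):
    """Compute one block of the distance matrix between two lists of point sets."""
    tracks_a, tracks_b = args
    block = np.empty((len(tracks_a), len(tracks_b)))
    for i, ta in enumerate(tracks_a):
        for j, tb in enumerate(tracks_b):
            # symmetric Hausdorff distance = max of the two directed distances
            block[i, j] = max(directed_hausdorff(ta, tb)[0],
                              directed_hausdorff(tb, ta)[0])
    return block

if __name__ == '__main__':   # guard needed for process-based parallelism on Windows/macOS
    # tracks: one (n_points, 2) array per id_easy (toy data here)
    rng = np.random.default_rng(0)
    tracks = [rng.random((rng.integers(2, 6), 2)) for _ in range(40)]

    half = len(tracks) // 2
    jobs = [(tracks[:half], tracks[:half]), (tracks[:half], tracks[half:]),
            (tracks[half:], tracks[:half]), (tracks[half:], tracks[half:])]

    with Pool(4) as pool:
        A, B, C, D = pool.map(hausdorff_block, jobs)

    full = np.block([[A, B], [C, D]])   # assembled (40, 40) distance matrix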

What is the application? Can you get away with only calculating these piece-wise as needed?




Answer 3:


Try computing the distance matrix with scipy's cdist:

import pandas as pd
from scipy.spatial.distance import cdist, directed_hausdorff

hausdorff_distance = cdist(df[['latitude', 'longitude']],
                           df[['latitude', 'longitude']],
                           lambda u, v: directed_hausdorff(u, v)[0])

hausdorff_distance_df = pd.DataFrame(hausdorff_distance)
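
Note that cdist passes individual rows (1-D vectors) to the metric, while directed_hausdorff expects 2-D point sets, so the snippet above compares single points rather than whole id_easy tracks. If one distance per id_easy pair is wanted (as in the question), the rows can be grouped first; a simple, unoptimized sketch assuming the question's df:

from itertools import combinations

import pandas as pd
from scipy.spatial.distance import directed_hausdorff

# one (n_points, 2) array per id_easy
groups = {k: g[['longitude', 'latitude']].to_numpy()
          for k, g in df.groupby('id_easy')}

ids = list(groups)
hausdorff_df = pd.DataFrame(0.0, index=ids, columns=ids)
for a, b in combinations(ids, 2):
    # symmetric Hausdorff distance = max of the two directed distances
    d = max(directed_hausdorff(groups[a], groups[b])[0],
            directed_hausdorff(groups[b], groups[a])[0])
    hausdorff_df.loc[a, b] = hausdorff_df.loc[b, a] = d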

As a note though, whatever method you end up using, it will take a lot of time to calculate, simply due to the sheer volume of the data. Ask yourself if you genuinely need every single pair of distances.

Practically, these kinds of problems are solved by restricting the number of pairings to a manageable number. For example, slice your data frame into smaller sets, each restricted to a geographical area, and then find the pairwise distances within that geographical area.

The above approach is used by supermarkets to identify spots for their new stores. They are not calculating pairwise distances between every single store they own and every store their competitors own. First they restrict the area, which will contain only 5-10 stores in total, and only then do they proceed to calculate the distances.
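
A minimal sketch of that idea, assuming the question's df and an arbitrary grid cell size of 0.1 degrees (tune it so each cell holds a manageable number of ids):

import numpy as np

cell_deg = 0.1  # arbitrary cell size in degrees

# use the first recorded point of each id as its representative location
start = df.groupby('id_easy')[['latitude', 'longitude']].first()
start['cell'] = list(zip((start['latitude'] // cell_deg).astype(int),
                         (start['longitude'] // cell_deg).astype(int)))

# only ids that share a grid cell become candidate pairs
# (tracks near a cell border may also need the neighbouring cells)
for cell, members in start.groupby('cell'):
    ids_in_cell = members.index.tolist()
    # ...compute the Hausdorff distances among ids_in_cell only...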




Answer 4:


It comes down to your exact use case:

  • If what you want is to pre-compute values because you need ballpark answers quickly, you can 'bin' your answers. In the roughest case, imagine you have 1.2343 * 10^7 and 9.2342 * 10^18. You drop the resolution markedly and do a straight lookup; that is, you look up '1E7 and 1E19' in your precomputed lookup table. You get the correct answer to within a power of 10.
  • If you want to characterize your distribution of answers globally, pick a statistically sufficient number of random pairs and use that instead.
  • If you want to graph your answers, run successive passes at increasing precision. For example, make your graph from 50x50 points equally spaced. After they are computed, if the user is still there, start filling in 250x250 resolution, and so on. There are cute tricks for deciding where to focus more computational power, e.g. just follow the user's mouse.
  • If you may want the answers for some pairs, but just don't know which ones until a later computation, make the evaluation lazy. That is, attempting to __getitem__() a value actually computes it (see the sketch after this list).
  • If you think you actually need all the answers now, think harder: no problem really has that many independent moving parts. You should find some exploitable constraint that already exists in the data; otherwise the data would already be overwhelming.
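
A minimal sketch of such a lazy, cached lookup (the class and names are illustrative; tracks is assumed to be a dict mapping each id_easy to its (n, 2) coordinate array):

from scipy.spatial.distance import directed_hausdorff

class LazyHausdorff:
    """Compute pairwise Hausdorff distances only when they are first requested."""

    def __init__(self, tracks):
        self.tracks = tracks   # dict: id_easy -> (n_points, 2) array
        self.cache = {}

    def __getitem__(self, pair):
        a, b = pair
        key = (a, b) if a <= b else (b, a)   # the symmetric distance is order-independent
        if key not in self.cache:
            u, v = self.tracks[a], self.tracks[b]
            self.cache[key] = max(directed_hausdorff(u, v)[0],
                                  directed_hausdorff(v, u)[0])
        return self.cache[key]

# usage: dist = LazyHausdorff(tracks); dist['aaa', 'bbb']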

Keep thinking! Keep hacking! Keep notes.



Source: https://stackoverflow.com/questions/58856749/hausdorff-distance-for-large-dataset-in-a-fastest-way
