Vectorizing a function in pandas

前端 未结 3 836
日久生厌
日久生厌 2020-12-08 23:33

I have a dataframe that contains a list of lat/lon coordinates:

d = {\'Provider ID\': {0: \'10001\',
  1: \'10005\',
  2: \'10006\',
  3: \'10007\',
  4: \'1         


        
相关标签:
3条回答
  • 2020-12-09 00:13

    You don't need anything fancy, just a few mods to your function.

    First, don't use the math library. If you're doing real math or science, you're likely better off with numpy.

    Second, we're going to use the dataframe method apply. What apply does is it takes a function and runs every row (axis=1) or column (axis=0) through it, and builds a new pandas object with all of the returned values. So we need to set up haversine totake row of a dataframe and unpack the values. It becomes:

    def haversine(row):
        """
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees)
        """
        import numpy as np
    
        # convert all of the row to radians
        row = np.radians(row)
    
        # unpack the values for convenience
        lat1 = row['lat1']
        lat2 = row['lat2']
        lon1 = row['lon1']
        lon2 = row['lon2']
    
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a)) 
    
        # 6367 km is the radius of the Earth
        km = 6367 * c
        return km
    

    Ok, now we need to get your dataframe in shape. In your question, everything is a string and that's not good for doing math. So using your variable d, I said:

    df = pandas.DataFrame(d).set_index('Provider ID').astype(float)
    

    So that created the dataframe of strings, set the provider as the index, and then converted all of the columns to floats, since we're doing math.

    Now we need to make rows with two sets of coords. For that we'll use the shift method and join the result to the original dataframe. Doing that all at once looks like this:

    df = df.join(df.shift(), lsuffix='1', rsuffix='2')
    print(df.head())
    
                      lat1       lon1       lat2       lon2
    Provider ID                                            
    10001        31.215379 -85.361466        NaN        NaN
    10005        34.221335 -86.159375  31.215379 -85.361466
    10006        34.795040 -87.685075  34.221335 -86.159375
    10007        31.292160 -86.255399  34.795040 -87.685075
    10008        31.693116 -86.265495  31.292160 -86.255399
    

    rsuffix and lsuffix are what append "1" and "2" to the column names during the join operation.

    The "2" columns are from df.shift() and you'll notice that they are equal to the "1" columns of the previous row. Also you'll see that the first row of the "2" columns are NaN since there's nothing previous to the first row.

    So now we can apply the Haversine function:

    distance = df.apply(haversine, axis=1)
    print(distance)
    Provider ID
    10001                 NaN
    10005          342.261590
    10006          153.567591
    10007          411.393751
    10008           44.566642
    10011          214.661170
    10012          125.775583
    10016          163.973219
    10018           27.659157
    10019          160.901128
    dtype: float64
    
    0 讨论(0)
  • 2020-12-09 00:25

    To vectorize this code, you will need to operate on complete dataframe and not on the individual lats and longs. I have made an attempt at this. I need the result df and a new function h2,

    import numpy as np
    def h2(df, p):
        inrad = df.applymap(radians)
        dlon = inrad.longitude-inrad.longitude[p]
        dlat = inrad.latitude-inrad.latitude[p]
        lat1 = pd.Series(index = df.index, data = [df.latitude[p] for i in range(len(df.index))])
        a = np.sin(dlat/2)*np.sin(dlat/2) + np.cos(df.latitude) * np.cos(lat1) * np.sin(dlon/2)**2
        c = 2 * 1/np.sin(np.sqrt(a))
        km = 6367 * c
        return km
    
    df = df.set_index('Provider ID')
    df = df.astype(float)
    df2 = pd.DataFrame(index = df.index, columns = df.index)
    for c in df2.columns:
        df2[c] = h2(df, c)
    
    print (df2)
    

    This should yield, (I can't be sure if I have the correct answer... my goal was to vectorize the code)

    Provider ID         10001         10005         10006         10007  \
    Provider ID                                                           
    10001                 inf  5.021936e+05  5.270062e+05  1.649088e+06   
    10005        5.021936e+05           inf  9.294868e+05  4.985233e+05   
    10006        5.270062e+05  9.294868e+05           inf  4.548412e+05   
    10007        1.649088e+06  4.985233e+05  4.548412e+05           inf   
    10008        1.460299e+06  5.777248e+05  5.246954e+05  3.638231e+06   
    10011        6.723581e+05  2.004199e+06  1.027439e+06  6.394402e+05   
    10012        4.559090e+05  3.265536e+06  7.573411e+05  4.694125e+05   
    10016        7.680036e+05  1.429573e+06  9.105474e+05  7.517467e+05   
    10018        7.096548e+05  1.733554e+06  1.020976e+06  6.701920e+05   
    10019        5.436342e+05  9.278739e+05  2.891822e+07  4.638858e+05   
    
    Provider ID         10008         10011         10012         10016  \
    Provider ID                                                           
    10001        1.460299e+06  6.723581e+05  4.559090e+05  7.680036e+05   
    10005        5.777248e+05  2.004199e+06  3.265536e+06  1.429573e+06   
    10006        5.246954e+05  1.027439e+06  7.573411e+05  9.105474e+05   
    10007        3.638231e+06  6.394402e+05  4.694125e+05  7.517467e+05   
    10008                 inf  7.766998e+05  5.401081e+05  9.496953e+05   
    10011        7.766998e+05           inf  1.341775e+06  4.220911e+06   
    10012        5.401081e+05  1.341775e+06           inf  1.119063e+06   
    10016        9.496953e+05  4.220911e+06  1.119063e+06           inf   
    10018        8.236437e+05  1.242451e+07  1.226941e+06  5.866259e+06   
    10019        5.372119e+05  1.051748e+06  7.514774e+05  9.362341e+05   
    
    Provider ID         10018         10019  
    Provider ID                              
    10001        7.096548e+05  5.436342e+05  
    10005        1.733554e+06  9.278739e+05  
    10006        1.020976e+06  2.891822e+07  
    10007        6.701920e+05  4.638858e+05  
    10008        8.236437e+05  5.372119e+05  
    10011        1.242451e+07  1.051748e+06  
    10012        1.226941e+06  7.514774e+05  
    10016        5.866259e+06  9.362341e+05  
    10018                 inf  1.048895e+06  
    10019        1.048895e+06           inf  
    
    [10 rows x 10 columns]
    
    0 讨论(0)
  • 2020-12-09 00:33

    You should be able to operate on the whole thing. Im not very familiar with Pandas so I'll just work with the underlying numpy arrays. Using your data d:

    df = pd.DataFrame(d)
    df1 = df.astype(float)
    a = np.radians(df1.values[:,1:])
    # a.shape is 10,2, it contains the Lat/Lon only
    
    # transpose and subtract
    # add a new axes so they can be broadcast
    diff = a[...,np.newaxis] - a.T
    # diff.shape is (10,2,10): dLat is diff[:,0,:], dLon is diff[:,1,:]
    
    b = np.square(np.sin(diff / 2))
    # b.shape is (10,2,10): sin^2(dLat/2) is b[:,0,:], sin^2(dLon/2) is b[:,1,:]
    
    # make this term: cos(Lat1) * cos(Lat2)
    cos_Lat = np.cos(a[:,0])
    c = cos_Lat * cos_Lat[:, np.newaxis]    # shape 10x10
    
    # sin^2(dLon/2) is b[:,1,:]
    b[:,1,:] = b[:,1,:] * c
    g = b.sum(axis = 1)
    h = 6367000 * 2 * np.arcsin((np.sqrt(g)))   # meters
    

    Back to a pandas.DataFrame

    df2 = pd.DataFrame(h, index = df['Provider ID'].values, columns = df['Provider ID'].values)
    

    I didn't try any performance tests. There is a lot of intermediate array creation going on and it can be expensive - using the optional output argument of the ufuncs might alleviate that.

    Same thing with in-place operations:

    df = pd.DataFrame(d)
    df_A = df.astype(float)
    z = df_A.values[:,1:]
    
    # cos(Lat1) * cos(Lat2)
    w = np.cos(z[:,0])
    w = w * w[:, np.newaxis]    # w.shape is (10,10)
    
    # sin^2(dLat/2) and sin^2(dLon/2)
    np.radians(z, z)
    z = z[...,np.newaxis] - z.T
    np.divide(z, 2, z)
    np.sin(z, z)
    np.square(z,z)
    # z.shape is now (10,2,10): sin^2(dLat/2) is z[:,0,:], sin^2(dLon/2) is z[:,1,:]
    
    # cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
    np.multiply(z[:,1,:], w, z[:,1,:])
    
    # sin^2(dLat/2) + cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
    z = z.sum(axis = 1)
    
    np.sqrt(z, z)
    np.arcsin(z,z)
    np.multiply(z, 6367000 * 2, z)   #meters
    
    df_B = pd.DataFrame(z, index = df['Provider ID'].values, columns = df['Provider ID'].values)
    
    0 讨论(0)
提交回复
热议问题