Conditional Nearest Neighbor in Python

筅森魡賤 submitted on 2020-02-22 04:09:10

Question


I’m trying to do some nearest neighbour type analysis in Python using Pandas/Numpy/Scipy etc. and having tried a few different approaches, I’m stumped.

I have 2 dataframes as follows:

df1

Lon1    Lat1    Type
10      10      A
50      50      A
20      20      B

df2

Lon2    Lat2    Type    Data-1  Data-2  
11      11      A       Eggs    Bacon       
51      51      A       Nuts    Bread   
61      61      A       Beef    Lamb    
21      21      B       Chips   Chicken
31      31      B       Sauce   Pasta
71      71      B       Rice    Oats
81      81      B       Beans   Peas

For each row in df1, I’m trying to identify the 2 nearest neighbours in df2 (based on the Lon / Lat values, using Euclidean distance) and then merge the corresponding Data-1 and Data-2 values onto df1 so it looks like this:

Lon1    Lat1    Type    Data-1a     Data-2a     Data-1b     Data-2b
10      10      A       Eggs        Bacon       Nuts        Bread
50      50      A       Nuts        Bread       Beef        Lamb
20      20      B       Chips       Chicken     Sauce       Pasta

I’ve tried both long and wide form approaches and am leaning toward using cKDTree from scipy. However, is there a way to do this so that it only looks at rows in df2 with the same Type as the df1 row?

Thanks in advance.

** Edit **

I've made some progress as follows:

from scipy.spatial import cKDTree

Typelist = df2['Type'].unique().tolist()
# One sub-frame of df2 per Type, keyed by the Type value
df_dict = {'{}'.format(x): df2[(df2['Type'] == x)] for x in Typelist}

def treefunc(row):
    if row['Type'] == 'A':
        row_type = row['Type']  # avoid shadowing the built-in `type`
        location = row[['Lon1','Lat1']].values.astype(float)
        tree = cKDTree(df_dict[row_type][['Lon2','Lat2']].values)
        dists, indexes = tree.query(location, k=2)
        return dists,indexes

dftest = df1.apply(treefunc,axis=1)

This gives me a list of the distances and indexes of the 2 nearest neighbours which is great! However I still have some issues:

  1. I tried to test the row['Type'] column for membership of the Typelist using .isin but this didn't work - are there any other ways to do this?

  2. How can I get Pandas to create new columns for the dists and indexes produced by the kdtree?

  3. Also how can I return Data-1 and Data-2 using the indexes?

Thanks in advance.
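
On point 1, a likely cause is that row['Type'] inside apply(..., axis=1) is a plain scalar (a string here), which has no .isin method; .isin belongs to a whole Series. A minimal sketch of the two membership tests, using small made-up frames and the hypothetical names type_list, scalar_ok and mask:

```python
import pandas as pd

# Small illustrative frames (not the full data from the question)
df2 = pd.DataFrame({"Type": ["A", "A", "B"]})
type_list = df2["Type"].unique().tolist()

# A single row, as seen inside df.apply(func, axis=1)
row = pd.Series({"Lon1": 10, "Lat1": 10, "Type": "A"})

# Scalar membership: use Python's `in` operator
scalar_ok = row["Type"] in type_list

# Column-wise membership: Series.isin returns a boolean mask
df1 = pd.DataFrame({"Type": ["A", "C"]})
mask = df1["Type"].isin(type_list)
```

Inside treefunc, the first form would replace the hard-coded comparison with 'A'.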


Answer 1:


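This is pretty messy, but I think it might be a good starting point. I've used scikit-learn's implementation, only because I'm more comfortable with it (though very green myself). Note that it finds each df2 row's nearest neighbour of the same Type within df2 itself.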

import pandas as pd
from io import StringIO

s1 = StringIO(u'''Lon2,Lat2,Type,Data-1,Data-2
11,11,A,Eggs,Bacon
51,51,A,Nuts,Bread
61,61,A,Beef,Lamb
21,21,B,Chips,Chicken
31,31,B,Sauce,Pasta
71,71,B,Rice,Oats
81,81,B,Beans,Peas''')

df2 = pd.read_csv(s1)

# Start here

from sklearn.neighbors import NearestNeighbors

frames = []
for t in pd.unique(df2['Type']):
    # Restrict the search to rows of the current Type
    dftype = df2[df2['Type'] == t].reset_index(drop=True)
    X = dftype[['Lon2', 'Lat2']].values
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='kd_tree').fit(X)
    # X is queried against itself, so column 0 is normally the point
    # itself and column 1 is its nearest other point
    distances, indices = nbrs.kneighbors(X)
    nearest = indices[:, 1]
    dftype['Data-1b'] = dftype['Data-1'].values[nearest]
    dftype['Data-2b'] = dftype['Data-2'].values[nearest]
    dftype['Distance'] = distances[:, 1]
    frames.append(dftype)

# DataFrame.append was removed in pandas 2.0, so collect and concat instead
dfNN = pd.concat(frames, ignore_index=True)
dfNN = dfNN[['Lat2', 'Lon2', 'Type', 'Data-1', 'Data-2', 'Data-1b', 'Data-2b', 'Distance']]
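
To tie this back to the original question (df1 rows matched against df2 rows of the same Type), here is a rough sketch that uses scipy's cKDTree, building one tree per Type. This is not the method from the answer above; the helper name nearest_by_type and the Dist-a / Dist-b columns are made up for illustration, and the frames are rebuilt inline from the tables in the question.

```python
import pandas as pd
from scipy.spatial import cKDTree

df1 = pd.DataFrame({"Lon1": [10, 50, 20], "Lat1": [10, 50, 20],
                    "Type": ["A", "A", "B"]})
df2 = pd.DataFrame({
    "Lon2": [11, 51, 61, 21, 31, 71, 81],
    "Lat2": [11, 51, 61, 21, 31, 71, 81],
    "Type": ["A", "A", "A", "B", "B", "B", "B"],
    "Data-1": ["Eggs", "Nuts", "Beef", "Chips", "Sauce", "Rice", "Beans"],
    "Data-2": ["Bacon", "Bread", "Lamb", "Chicken", "Pasta", "Oats", "Peas"],
})

def nearest_by_type(df1, df2, k=2):
    """Hypothetical helper: attach the k nearest same-Type df2 rows to df1."""
    pieces = []
    for t, left in df1.groupby("Type", sort=False):
        # Candidates are only the df2 rows with the same Type
        right = df2[df2["Type"] == t].reset_index(drop=True)
        if len(right) < k:
            continue  # assumption: skip Types with too few candidates
        tree = cKDTree(right[["Lon2", "Lat2"]].to_numpy(dtype=float))
        dists, idx = tree.query(left[["Lon1", "Lat1"]].to_numpy(dtype=float), k=k)
        # Force a (rows, k) shape so k=1 behaves like k>1
        dists = dists.reshape(len(left), -1)
        idx = idx.reshape(len(left), -1)
        out = left.copy()
        for n in range(k):
            suffix = chr(ord("a") + n)  # a, b, c, ...
            out["Data-1" + suffix] = right["Data-1"].to_numpy()[idx[:, n]]
            out["Data-2" + suffix] = right["Data-2"].to_numpy()[idx[:, n]]
            out["Dist-" + suffix] = dists[:, n]
        pieces.append(out)
    # Restore the original df1 row order
    return pd.concat(pieces).sort_index()

result = nearest_by_type(df1, df2)
print(result)
```

Querying a whole group at once avoids rebuilding the tree for every row, which is what the apply-based version in the question does. The indexes returned by the tree are positions within the per-Type subset, which is why the subset's index is reset before the lookup.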



Source: https://stackoverflow.com/questions/33394238/conditional-nearest-neighbor-in-python
