How do I find the closest values in a Pandas series to an input number?

The related questions I have seen deal with vanilla Python, not pandas.

If I have the series:

And I input 3, how can I (efficiently) find:

The index of 3, if it is found in the series.
The indexes of the values just below and just above 3, if it is not found in the series.

I.e., with the above series {1,6,4,5,2} and input 3, I should get the values (4,2) with indexes (2,4).

You could use argsort(), like this.

Say input = 3:

In [198]: input = 3

In [199]: df.ix[(df['num']-input).abs().argsort()[:2]]
Out[199]:
   num
2    4
4    2

df_sort is the dataframe with the 2 closest values.

In [200]: df_sort = df.ix[(df['num']-input).abs().argsort()[:2]]

For index,

In [201]: df_sort.index.tolist()
Out[201]: [2, 4]

For values,

In [202]: df_sort['num'].tolist()
Out[202]: [4, 2]

For reference, the df used in the solution above was:

In [197]: df
Out[197]:
   num
0    1
1    6
2    4
3    5
4    2

I recommend using iloc instead of ix in John Galt's answer, since iloc works even with an unsorted integer index, whereas .ix looks at the index labels first. Note also that .ix was deprecated in pandas 0.20 and removed in pandas 1.0.

df.iloc[(df['num']-input).abs().argsort()[:2]]
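
To illustrate why the positional lookup matters, here is a minimal sketch. The reversed integer index and the name target (used in place of input, to avoid shadowing the built-in) are assumptions made for this example only.

```python
import pandas as pd

# Same values as in the question, but with a reversed integer index
# (an assumption for illustration).
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]}, index=[4, 3, 2, 1, 0])
target = 3

# argsort on the underlying NumPy array yields positions, not labels.
# A stable sort keeps ties in their original order.
positions = (df["num"] - target).abs().to_numpy().argsort(kind="stable")[:2]

closest = df.iloc[positions]    # positional lookup: the rows holding 4 and 2
by_label = df.loc[positions]    # label lookup: treats 2 and 4 as index labels

print(closest["num"].tolist())   # values found by position
print(by_label["num"].tolist())  # values found by label, which differ here
```

With this index, the label-based lookup picks the rows labelled 2 and 4, which are not the two closest values, while iloc returns the intended rows.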

A disadvantage of the other algorithms discussed here is that they have to sort the entire list. This results in a complexity of O(N log N).

However, it is possible to achieve the same result in O(N). This approach splits the dataframe into two subsets, one with values smaller than the desired value and one with values larger. The lower neighbour is then the largest value in the smaller subset, and the upper neighbour is the smallest value in the larger subset.

This gives the following code snippet:

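def find_neighbours(value):
    # Uses the global dataframe df with a numeric column 'num'.
    exactmatch = df[df.num == value]
    if not exactmatch.empty:
        return exactmatch.index[0]
    else:
        # Largest value below and smallest value above the target.
        lowerneighbour_ind = df[df.num < value].num.idxmax()
        upperneighbour_ind = df[df.num > value].num.idxmin()
        return lowerneighbour_ind, upperneighbour_ind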

This approach is similar to using partition in pandas, which can be really useful when dealing with large datasets and complexity becomes an issue.
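
As a rough sketch of the partition idea, NumPy's argpartition can select the k smallest distances without sorting the whole array, which runs in roughly linear time on average. The helper name k_closest and its parameters are hypothetical, not part of pandas or of the answer above.

```python
import numpy as np
import pandas as pd

def k_closest(df, col, value, k=2):
    """Return the k rows of df whose `col` is nearest to `value` (hypothetical helper)."""
    dist = (df[col] - value).abs().to_numpy()
    k = min(k, len(dist))
    if k == 0:
        return df.iloc[[]]
    # argpartition moves the k smallest distances into the first k slots
    # (in arbitrary order) without fully sorting the array.
    nearest = np.argpartition(dist, k - 1)[:k]
    # Sort only those k positions so the result is ordered by distance.
    nearest = nearest[np.argsort(dist[nearest], kind="stable")]
    return df.iloc[nearest]

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
result = k_closest(df, "num", 3)
print(sorted(result["num"].tolist()))
```

Unlike find_neighbours, this returns the k nearest rows regardless of which side of the value they fall on, so it matches the argsort answers rather than the below/above split.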

If your series is already sorted, you could use something like this.

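import pandas as pd

def closest(df, col, val, direction):
    # Number of rows with a value <= val; assumes df[col] is sorted ascending.
    n = len(df[df[col] <= val])
    if(direction < 0):
        n -= 1
    if(n < 0 or n >= len(df)):
        print('err - value outside range')
        return None
    # Positional lookup (.ix is no longer available in current pandas).
    return df[col].iloc[n]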

df = pd.DataFrame(pd.Series(range(0,10,2)), columns=['num'])
for find in range(-1, 2):
    lc = closest(df, 'num', find, -1)
    hc = closest(df, 'num', find, 1)
    print('Closest to {} is {}, lower and {}, higher.'.format(find, lc, hc))


df:     num
    0   0
    1   2
    2   4
    3   6
    4   8
err - value outside range
Closest to -1 is None, lower and 0, higher.
Closest to 0 is 0, lower and 2, higher.
Closest to 1 is 0, lower and 2, higher.

If the series is already sorted, an efficient method of finding the indexes is by using bisect. An example:

idx = bisect_right(df['num'].values, 3)

So for the problem cited in the question, considering that the column "col" of the dataframe "df" is sorted:

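from bisect import bisect_right, bisect_left

def get_closests(df, col, val):
    lower_idx = bisect_left(df[col].values, val)
    higher_idx = bisect_right(df[col].values, val)
    if higher_idx == lower_idx:
        # val is not in the column: return the positions on either side
        return lower_idx - 1, lower_idx
    else:
        # val is in the column: return the position of its first occurrence
        return lower_idx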

This efficiently finds the position of the value "val" in the dataframe column "col", or the positions of its closest neighbours, but it requires the column to be sorted.
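
A similar binary search is available through NumPy's searchsorted. The sketch below is one possible variant, assuming an ascending sorted series; the helper name neighbours_sorted is hypothetical. It returns index labels rather than positions and reports a missing neighbour as None.

```python
import pandas as pd

def neighbours_sorted(series, val):
    """Return the index label of val, or the labels of its (lower, upper) neighbours.

    Assumes `series` is sorted in ascending order (hypothetical helper).
    """
    values = series.to_numpy()
    left = values.searchsorted(val, side="left")
    right = values.searchsorted(val, side="right")
    if left != right:
        # val is present: left is the position of its first occurrence.
        return series.index[left]
    # val is absent: left is where it would be inserted.
    lower = series.index[left - 1] if left > 0 else None
    upper = series.index[left] if left < len(series) else None
    return lower, upper

s = pd.Series([1, 2, 4, 5, 6], index=list("abcde"))
print(neighbours_sorted(s, 3))  # neighbours on either side of 3
print(neighbours_sorted(s, 4))  # exact match
print(neighbours_sorted(s, 0))  # below the smallest value
```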

Source: https://stackoverflow.com/questions/30112202/how-do-i-find-the-closest-values-in-a-pandas-series-to-an-input-number
