I have seen:
- How do I find the closest value to a given number in an array?
- How do I find the closest array element to an arbitrary (non-member) number?
These relate to vanilla Python and not pandas.
If I have the series:
ix num
0 1
1 6
2 4
3 5
4 2
And I input 3, how can I (efficiently) find:
- The index of 3 if it is found in the series
- The indexes of the values below and above 3 if it is not found in the series.
I.e., with the above series {1,6,4,5,2} and input 3, I should get values (4,2) with indexes (2,4).
You could use argsort(), like this. Say, input = 3:
In [198]: input = 3
In [199]: df.ix[(df['num']-input).abs().argsort()[:2]]
Out[199]:
num
2 4
4 2
df_sort is the dataframe with the 2 closest values.
In [200]: df_sort = df.ix[(df['num']-input).abs().argsort()[:2]]
For index,
In [201]: df_sort.index.tolist()
Out[201]: [2, 4]
For values,
In [202]: df_sort['num'].tolist()
Out[202]: [4, 2]
For reference, df in the above solution was:
In [197]: df
Out[197]:
num
0 1
1 6
2 4
3 5
4 2
In addition to John Galt's answer, I recommend using iloc, since it works even with an unsorted integer index: argsort() returns positions, whereas .ix looks at the index labels first. (.ix was also deprecated and then removed in pandas 1.0.)
df.iloc[(df['num']-input).abs().argsort()[:2]]
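To make the difference concrete, here is a minimal sketch assuming the question's values but with shuffled integer labels (the labels and the name `target` are made up for illustration; `target` is used instead of `input` only to avoid shadowing the built-in):

```python
import pandas as pd

# Hypothetical data: the question's values with a shuffled integer index.
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]}, index=[3, 0, 4, 1, 2])
target = 3

# argsort() yields positions, so work on the NumPy array and index with iloc.
# kind="stable" keeps ties in their original order.
order = (df["num"] - target).abs().to_numpy().argsort(kind="stable")
closest = df.iloc[order[:2]]

print(closest.index.tolist())   # labels of the two closest rows
print(closest["num"].tolist())  # the two closest values
```

Here the two closest values sit at positions 2 and 4, and iloc picks exactly those rows regardless of what their labels happen to be.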
A disadvantage of the other algorithms discussed here is that they have to sort the entire list. This results in a complexity of ~N log(N).
However, it is possible to achieve the same results in ~N. This approach splits the dataframe into two subsets, one with values smaller and one with values larger than the desired value. The lower neighbour is then the largest value in the smaller subset, and the upper neighbour is the smallest value in the larger subset.
This gives the following code snippet:
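def find_neighbours(value):
    # Uses the dataframe df from above; returns index labels.
    exactmatch = df[df.num == value]
    if not exactmatch.empty:
        return exactmatch.index[0]
    else:
        # Largest value below the target and smallest value above it.
        # idxmax()/idxmin() raise ValueError if that side is empty.
        lowerneighbour_ind = df[df.num < value].num.idxmax()
        upperneighbour_ind = df[df.num > value].num.idxmin()
        return lowerneighbour_ind, upperneighbour_ind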
This approach is similar in spirit to a partition step (as in NumPy's partition), which can be really useful when dealing with large datasets where complexity becomes an issue.
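As a rough sketch of the partition idea, reading "partition" as NumPy's `argpartition` (that reading is my assumption, and `target` and `k` are names made up here), you can pick the k closest values without a full sort. Note that this returns the k values nearest by absolute distance, which is not necessarily one below and one above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
target = 3
k = 2  # number of closest values wanted; must not exceed len(df)

diff = (df["num"] - target).abs().to_numpy()

# argpartition puts the positions of the k smallest differences in the
# first k slots (in no particular order) without fully sorting the array.
nearest_pos = np.argpartition(diff, k - 1)[:k]

# Sort just those k positions so the rows come back in their original order.
nearest = df.iloc[np.sort(nearest_pos)]
print(nearest)
```

Only the k selected positions get sorted at the end, so the cost stays close to linear in the length of the column for small k.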
If your series is already sorted, you could use something like this.
import pandas as pd

def closest(df, col, val, direction):
    # Number of rows with df[col] <= val; in a sorted column this is the
    # position of the first value above val.
    n = len(df[df[col] <= val])
    if direction < 0:
        n -= 1  # step back to the last value <= val
    if n < 0 or n >= len(df):
        print('err - value outside range')
        return None
    return df[col].iloc[n]  # positional lookup (.ix no longer exists in pandas)

df = pd.DataFrame({'num': range(0, 10, 2)})
for find in range(-1, 2):
    lc = closest(df, 'num', find, -1)
    hc = closest(df, 'num', find, 1)
    print('Closest to {} is {}, lower and {}, higher.'.format(find, lc, hc))
df: num
0 0
1 2
2 4
3 6
4 8
err - value outside range
Closest to -1 is None, lower and 0, higher.
Closest to 0 is 0, lower and 2, higher.
Closest to 1 is 0, lower and 2, higher.
If the series is already sorted, an efficient method of finding the indexes is by using bisect. An example:
idx = bisect_right(df['num'].values, 3)
So for the problem cited in the question, considering that the column "col" of the dataframe "df" is sorted:
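from bisect import bisect_left, bisect_right

def get_closests(df, col, val):
    values = df[col].values
    lower_idx = bisect_left(values, val)    # first position with values[i] >= val
    higher_idx = bisect_right(values, val)  # first position with values[i] > val
    if higher_idx != lower_idx:
        # val is present: return the position of its first occurrence
        return lower_idx
    # val is absent: both calls give the insertion point, so the neighbours
    # sit just before and at that position (either may fall outside the column)
    return lower_idx - 1, higher_idx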
This finds the position of the value "val" in the dataframe column "col", or the positions of its closest neighbours, efficiently (it is a binary search), but it requires the column to be sorted.
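If the series is not sorted, as in the question, one option is to sort it once and then run the binary search, mapping the positions back to the original index labels. The following is only a sketch under that assumption; `closest_labels` is a made-up helper name, and the up-front sort costs ~N log(N), so it mainly pays off when you do many lookups against the same series:

```python
import numpy as np
import pandas as pd

def closest_labels(series, val):
    """Return the label of val if present, else the labels of the
    neighbours below and above (None where there is no neighbour)."""
    ordered = series.sort_values()  # labels travel with their values
    values = ordered.to_numpy()
    left = np.searchsorted(values, val, side="left")
    right = np.searchsorted(values, val, side="right")
    if left != right:
        # val is present; return the label of its first occurrence
        return ordered.index[left]
    below = ordered.index[left - 1] if left > 0 else None
    above = ordered.index[left] if left < len(values) else None
    return below, above

s = pd.Series([1, 6, 4, 5, 2], name="num")
print(closest_labels(s, 3))  # labels of the neighbours below and above 3
print(closest_labels(s, 4))  # label of the exact match
```

For the question's series and input 3 this gives the labels of the values 2 and 4, in below/above order.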
Source: https://stackoverflow.com/questions/30112202/how-do-i-find-the-closest-values-in-a-pandas-series-to-an-input-number