Finding k closest numbers to a given number

老子叫甜甜 提交于 2020-04-29 05:37:37

问题


Say I have a list [1,2,3,4,5,6,7]. I want to find the 3 closest numbers to, say, 6.5. Then the returned value would be [5,6,7].

Finding one closest number is not that tricky in python, which can be done using

min(myList, key=lambda x:abs(x-myNumber))

But I am trying not to put a loop around this to find k closest numbers. Is there a pythonic way to achieve the above task?


回答1:


The short answer

The heapq.nsmallest() function will do this neatly and efficiently:

>>> from heapq import nsmallest
>>> s = [1,2,3,4,5,6,7]
>>> nsmallest(3, s, key=lambda x: abs(x-6.5))
[6, 7, 5]

Essentially this says, "Give me the three input values that have the lowest absolute difference from the number 6.5".

The algorithm and its running time

The algorithm for nsmallest makes a single pass over the data, keeping no more than the n best values in memory at any time (that means that it works with any input iterator, is cache-efficient, and space-efficient).

The algorithm only adds new values to the heap when a new "best" value is found. Accordingly, it minimizes the number of comparisons made. For example, if you are looking for the 100 best values out of 1,000,000 random inputs, it typically makes fewer than 1,008,000 comparisons (about 0.8% more compares than using min() to find the single best value).

The key functions for min(), nsmallest(), and sorted() all guarantee that the key function is called exactly once per value in the input iterable. That means that this technique will be efficient for even more complex and interesting examples of the n-closest value problem (i.e. words that sound the most alike, closest colors, smallest diffs, fewest genetic mutations, Euclidean distance, etc).

Both nsmallest() and sorted() will return a list rank ordered by nearness (ties are settled by which value was seen first).

For those who are interested, there is a somewhat involved analysis of expected number of comparisons here and here. Quick summary:

  • Average case for random inputs: n + k * (log(k, 2) * log(n/k) + log(k, 2) + log(n/k))
  • Best case for ascending inputs: n + k * log(k, 2)
  • Worst case for descending inputs: n * log(k, 2)

Optimizing for repeated lookups

In the comments, @Phylliida, asked how to optimize for repeated lookups with differing start points. The key is to pre-sort the data and then use bisect to locate the center of a small search segment:

from bisect import bisect

def k_nearest(k, center, sorted_data):
    'Return *k* members of *sorted_data* nearest to *center*'
    i = bisect(sorted_data, center)
    segment = sorted_data[max(i-k, 0) : i+k]
    return nsmallest(k, segment, key=lambda x: abs(x - center))

For example:

>>> s.sort()
>>> k_nearest(3, 6.5, s)
[6, 7, 5]
>>> k_nearest(3, 0.5, s)
[1, 2, 3]
>>> k_nearest(3, 4.5, s)    
[4, 5, 3]
>>> k_nearest(3, 5.0, s)
[5, 4, 6]

Both bisect() and nsmallest() take advantage of the sorted data. The former runs O(log2 k) time and the latter runs in O(n) time.




回答2:


You could compute distances, and sort:

[n for d, n in sorted((abs(x-myNumber), x) for x in myList)[:k]]

This does the following:

  1. Create a sequence of tuples (d, x) where d is the distance to your target
  2. Select the first k elements of that list
  3. Extract just the number values from the result, discarding the distance



回答3:


Both answers were good, and Greg was right, Raymond's answer is more high level and easier to implement, but I built upon Greg's answer because it was easier to manipulate to fit my need.

In case anyone is searching for a way to find the n closest values from a list of dicts.

My dict looks like this, where npi is just an identifier that I need along with the value:

mydict = {u'fnpi': u'1982650024',
 u'snpi': {u'npi': u'1932190360', u'value': 2672},
 u'snpis': [{u'npi': u'1831289255', u'value': 20},
  {u'npi': u'1831139799', u'value': 20},
  {u'npi': u'1386686137', u'value': 37},
  {u'npi': u'1457355257', u'value': 45},
  {u'npi': u'1427043645', u'value': 53},
  {u'npi': u'1477548675', u'value': 53},
  {u'npi': u'1851351514', u'value': 57},
  {u'npi': u'1366446171', u'value': 60},
  {u'npi': u'1568460640', u'value': 75},
  {u'npi': u'1326046673', u'value': 109},
  {u'npi': u'1548281124', u'value': 196},
  {u'npi': u'1912989989', u'value': 232},
  {u'npi': u'1336147685', u'value': 284},
  {u'npi': u'1801894142', u'value': 497},
  {u'npi': u'1538182779', u'value': 995},
  {u'npi': u'1932190360', u'value': 2672},
  {u'npi': u'1114020336', u'value': 3264}]}

value = mydict['snpi']['value'] #value i'm working with below
npi = mydict['snpi']['npi'] #npi (identifier) i'm working with below
snpis = mydict['snpis'] #dict i'm working with below

To get an [id, value] list (not just a list of values) , I use this:

[[id,val] for diff, val, id in sorted((abs(x['value']-value), x['value'], x['npi']) for x in snpis)[:6]]

Which produces this:

[[u'1932190360', 2672],
 [u'1114020336', 3264],
 [u'1538182779', 995],
 [u'1801894142', 497],
 [u'1336147685', 284],
 [u'1912989989', 232]]

EDIT

I actually found it pretty easy to manipulate Raymond's answer too, if you're dealing with a dict (or list of lists).

from heapq import nsmallest
[[i['npi'], i['value']] for i in nsmallest(6, snpis, key=lambda x: abs(x['value']-value))]

This will produce the same as the above output.

And this

nsmallest(6, snpis, key=lambda x: abs(x['value']-value)) will produce a dict instead.



来源:https://stackoverflow.com/questions/24112259/finding-k-closest-numbers-to-a-given-number

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!