Question
I have an int[] array that contains values with the following properties:
- They are sorted
- They are unique (no duplicates)
- They are in a known range [0..MAX)
- MAX is typically quite a lot larger than the length of the array (say 10-100x)
- Sometimes the numbers are evenly distributed across the range, but at other times there are quite long sequences of consecutive numbers. I estimate it is about 50/50 between the two cases.
Given this list, I want to efficiently find the index of a specific value in the array (or if the value is not present, find the next higher value).
I've already implemented a straight binary search with interval bisection that works fairly well, but I have a suspicion that the nature/distribution of the data can be exploited to converge to a solution faster.
I'm interested in optimising the average case search time, but it is important that the worst case is never worse than O(log n) as the arrays are sometimes very large.
Question: is it possible to do much better than a plain binary search in the average case?
EDIT (to clarify additional questions / comments)
- The constant in O(log n) definitely matters. In fact, assuming that better algorithmic complexity than O(log n) isn't possible, the constant is probably the only thing that matters.
- It's often a one-off search, so while preprocessing is possible it's probably not going to be worth it.
Answer 1:
Let's call the array x and the searched number z.
Since you expect the values to be evenly distributed, you can use interpolation search. This is similar to binary search, but splits the index range at start + ((z - x[start]) * (end - start)) / (x[end] - x[start]).
To guarantee a running time of O(log n), you have to combine interpolation search with binary search, alternating one interpolation-search step with one binary-search step:
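public int search(int[] values, int z) {
    int start = 0;
    int end = values.length - 1;
    boolean interpolation = true;
    // Loop while z can still lie inside values[start..end].
    // The condition also covers an empty array and a z outside the stored range.
    while (start <= end && z >= values[start] && z <= values[end]) {
        int mid;
        if (interpolation && values[end] != values[start]) {
            // Interpolation step; long arithmetic keeps the product from overflowing.
            long offset = ((long) z - values[start]) * (end - start)
                    / ((long) values[end] - values[start]);
            mid = start + (int) offset;
        } else {
            // Binary search step: the midpoint must be offset by start.
            mid = start + (end - start) / 2;
        }
        int v = values[mid];
        if (v == z)
            return mid;
        else if (v > z)
            end = mid - 1;   // exclude mid so the range always shrinks
        else
            start = mid + 1; // exclude mid so the range always shrinks
        interpolation = !interpolation;
    }
    return -1; // z is not in the array
}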
Since every second iteration of the while loop is a binary search step, it uses at most twice the number of iterations a binary search would use (O(log n)). Since every other step is an interpolation search step, the algorithm should shrink the interval quickly if the input has the desired properties.
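The question also asks for the index of the next higher value when z is missing. A minimal sketch of how the same alternating idea could return that index is below. The class and method names (HybridSearch, lowerBound) are made up for illustration, and returning values.length when every element is smaller than z is my assumption, not something the question specifies.

```java
final class HybridSearch {
    /**
     * Returns the index of z, or of the next higher value if z is absent.
     * Returns values.length when every element is smaller than z (assumed convention).
     */
    static int lowerBound(int[] values, int z) {
        int lo = 0;              // every index below lo holds a value < z
        int hi = values.length;  // every index from hi upward holds a value >= z
        boolean interpolate = true;
        while (lo < hi) {
            int last = hi - 1;
            int mid;
            if (interpolate && z > values[lo] && z < values[last]) {
                // Interpolation step; long arithmetic avoids int overflow.
                long num = ((long) z - values[lo]) * (last - lo);
                mid = lo + (int) (num / ((long) values[last] - values[lo]));
            } else {
                // Bisection step (also used when z is at or outside the ends).
                mid = lo + (hi - lo) / 2;
            }
            if (values[mid] < z) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
            interpolate = !interpolate;
        }
        return lo;
    }
}
```

Because at least every second iteration is a bisection step and no step ever widens the range, the iteration count should stay within roughly twice that of a plain binary search.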
Answer 2:
This is in the comments and should be an answer. It's a joint effort, so I'm making it a CW answer:
You may want to look at an interpolation search. In the worst case it can be worse than O(log n) (a pure interpolation search can degrade to O(n) on badly skewed data), so if that's a hard requirement, this wouldn't apply. But if your interpolation is decent, depending on the data distribution an interpolation search can beat a straight binary search.
To know, you'd have to implement the interpolation search with a reasonably smart interpolation algorithm, and then run several representative data sets through both to see whether the interpolation or the binary is better suited. I'd think it'd be one of the two, but I'm not au fait with truly cutting edge searching algorithms.
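As a rough idea of what such a comparison might look like, here is a small sketch that counts loop iterations for a bisection search and a pure interpolation search. Everything in it (the StepCount class, both sample data generators, the choice to count loop iterations rather than wall-clock time) is an assumption for illustration; real measurements would need your own representative arrays and proper timing.

```java
final class StepCount {
    /** Bisection lower bound; returns {index, loop iterations}. */
    static int[] binary(int[] a, int z) {
        int lo = 0, hi = a.length, steps = 0;
        while (lo < hi) {
            steps++;
            int mid = lo + (hi - lo) / 2;
            if (a[mid] < z) lo = mid + 1; else hi = mid;
        }
        return new int[] {lo, steps};
    }

    /** Pure interpolation lower bound; returns {index, loop iterations}. */
    static int[] interpolation(int[] a, int z) {
        int lo = 0, hi = a.length, steps = 0;
        while (lo < hi) {
            steps++;
            int last = hi - 1;
            if (z <= a[lo]) return new int[] {lo, steps};
            if (z > a[last]) return new int[] {hi, steps};
            // Here a[lo] < z <= a[last], so the divisor is positive.
            long num = ((long) z - a[lo]) * (last - lo);
            int mid = lo + (int) (num / ((long) a[last] - a[lo]));
            if (a[mid] < z) lo = mid + 1; else hi = mid;
        }
        return new int[] {lo, steps};
    }

    /** Made-up sample: 0, 10, 20, ... (evenly spread). */
    static int[] evenlySpaced(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = 10 * i;
        return a;
    }

    /** Made-up sample: 0, 1, ..., n-2 followed by one large outlier. */
    static int[] runThenOutlier(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n - 1; i++) a[i] = i;
        a[n - 1] = 1_000_000;
        return a;
    }
}
```

On the evenly spaced sample the interpolation version needs far fewer iterations than bisection, while on the run-plus-outlier sample it crawls forward one element at a time, which is the kind of difference the comparison is meant to expose.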
Answer 3:
If the int[] is
- sorted
- made up of unique values
- in a range you know in advance
then instead of searching, why not store each value at its own index?
Say the number is 243; then store it as array[243] = 243.
That way lookup is simple and fast. The only thing left is finding the next higher value.
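One possible way to handle the "next higher value" part with a table like this is to store, for every possible value, the index of the first array element that is at least that large. The sketch below assumes this variation (it is not spelled out in the answer), and the names DirectTable and build are hypothetical. Note that it costs O(MAX) memory and build time, which may be hard to justify for the one-off searches described in the question.

```java
final class DirectTable {
    /**
     * Builds a table with max + 1 entries where table[v] is the index in
     * values of the first element >= v (values.length if there is none).
     * Assumes values is sorted and every element lies in [0, max).
     */
    static int[] build(int[] values, int max) {
        int[] table = new int[max + 1];
        int i = 0;
        for (int v = 0; v <= max; v++) {
            // Advance past elements that are smaller than v.
            while (i < values.length && values[i] < v) {
                i++;
            }
            table[v] = i;
        }
        return table;
    }
}
```

After the build, a lookup is a single array read: table[z] gives the index of z or of the next higher value.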
Answer 4:
I have one solution.
You say the array can be one of two types:
1) the numbers are evenly distributed across the range
2) there are quite long sequences of consecutive numbers
So first we run a simple test to determine whether the array is of type 1 or type 2.
To test for type 1:
length = array.length;
range = array[length-1] - array[0];
Now consider the values of the array at indices
{ length*(1/5), length*(2/5), length*(3/5), length*(4/5) }
If the distribution is of type 1, we know approximately what the value at array[i] must be (roughly array[0] + range * i / (length-1)), so we check whether the values at those 4 positions are close to the values expected under an even distribution.
If they are close, the distribution is even, and we can find any element by computing its expected position directly. If we can't find the element with this approach, we treat the array as type 2.
If the test fails, the array is of type 2, which means it contains a few places where long sequences of consecutive numbers are present.
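The type 1 test could be sketched roughly as follows. The tolerance parameter (a fraction of the range) and the names UniformityCheck and looksUniform are my own assumptions; the answer does not say how close "close" has to be.

```java
final class UniformityCheck {
    /**
     * Samples the array at 1/5, 2/5, 3/5 and 4/5 of its length and compares
     * each value with the one expected under a perfectly even spread.
     * tolerance is a fraction of the total range, e.g. 0.05 (assumed knob).
     */
    static boolean looksUniform(int[] a, double tolerance) {
        int n = a.length;
        if (n < 5) {
            return false; // too short for four interior sample points
        }
        long range = (long) a[n - 1] - a[0];
        for (int k = 1; k <= 4; k++) {
            int i = (int) ((long) (n - 1) * k / 5);
            double expected = a[0] + (double) range * i / (n - 1);
            if (Math.abs(a[i] - expected) > tolerance * range) {
                return false;
            }
        }
        return true;
    }
}
```

Four sample points are cheap to check but can miss a long run that falls between them, so a result of true is only a hint, not a guarantee.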
In that case we solve it with a modified binary search, explained below.
We first probe the middle of the array (index i = length/2):
left = 0, right = length;
BEGIN:
i = (left+right)/2;
case a.1: the search number is greater than array[i]
left = i;
Now we check whether a long consecutive sequence starts at that position, i.e. whether
array[i], array[i+1], array[i+2] are consecutive ints.
case a.1.1: they are consecutive
Since they are consecutive and the sequence might be long, we probe directly at the index implied by the search value.
For example, if the search int is 10, the sequence is 5,6,7,8,9,10,11,15,100,103
and array[i] = 5, then we probe array[i+10-5].
If we find the search int there, return it; otherwise the value there must be greater than the search int (the values are sorted and unique), so continue from BEGIN after setting
right = i+10-5
case a.1.2: they are not consecutive
continue from BEGIN;
case a.2: the search number is less than array[i]
Case a.2 mirrors a.1 (with right = i instead of left = i):
check whether there is a backward sequence, i.e. whether array[i-2], array[i-1], array[i] are consecutive.
If they are, jump back to the index implied by the search value, as in case a.1.1.
If they are not, continue from BEGIN as in case a.1.2.
case a.3: array[i] is the search int
return it.
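The cases above could be put together roughly like this. It is only a sketch under my own reading of the steps: the names RunJumpSearch and lowerBound are made up, the search returns the index of the next higher value when the number is absent (as the question asks), and the jump is skipped whenever the implied index falls outside the current range.

```java
final class RunJumpSearch {
    /** Returns the index of z, or of the next higher value (a.length if none). */
    static int lowerBound(int[] a, int z) {
        int lo = 0, hi = a.length; // candidates are the indices in [lo, hi)
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid] == z) {
                return mid; // case a.3
            }
            if (a[mid] < z) { // case a.1
                long j = (long) mid + ((long) z - a[mid]); // index z would have in a run
                boolean run = mid + 2 < hi
                        && a[mid + 1] == a[mid] + 1 && a[mid + 2] == a[mid] + 2;
                lo = mid + 1;
                if (run && j < hi) { // case a.1.1
                    if (a[(int) j] == z) return (int) j;
                    hi = (int) j; // sorted and unique, so a[j] > z here
                }
            } else { // case a.2
                long j = (long) mid - ((long) a[mid] - z);
                boolean run = mid - 2 >= lo
                        && a[mid - 1] == a[mid] - 1 && a[mid - 2] == a[mid] - 2;
                hi = mid;
                if (run && j >= lo) {
                    if (a[(int) j] == z) return (int) j;
                    lo = (int) j + 1; // sorted and unique, so a[j] < z here
                }
            }
        }
        return lo;
    }
}
```

Every iteration still performs a normal bisection, so the jump can only shrink the range further and the worst case should remain O(log n).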
Hope this helps.
Source: https://stackoverflow.com/questions/20934412/efficient-search-of-sorted-numerical-values