In a very tight loop I need to access tens of thousands of values in an array containing millions of elements. The key can be undefined: in that case it shall be legal to return null.
I did some benchmarking with the following code:
set_time_limit(100);
$count = 2500000;
$search_index_end = $count * 1.5;
$search_index_start = $count * .5;
$array = array();
for ($i = 0; $i < $count; $i++)
$array[md5($i)] = $i;
$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
$key = md5($i);
$test = isset($array[$key]) ? $array[$key] : null;
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";
$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
$key = md5($i);
$test = array_key_exists($key, $array) ? $array[$key] : null;
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";
$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
$key = md5($i);
$test = @$array[$key];
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";
$error_reporting = error_reporting();
error_reporting(0);
$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
$key = md5($i);
$test = $array[$key];
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";
error_reporting($error_reporting);
$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
$key = md5($i);
$tmp = &$array[$key];
$test = isset($tmp) ? $tmp : null;
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";
and I found that the fastest running test was the one that uses isset($array[$key]) ? $array[$key] : null,
followed closely by the solution that simply disables error reporting.
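Worth noting: on PHP 7 and later, the null coalescing operator ?? expresses the same "value or null" lookup as the winning isset ternary, evaluates the array access only once, and raises no notice for a missing key. A minimal sketch:

```php
<?php
// PHP 7+: $array[$key] ?? null is equivalent to
// isset($array[$key]) ? $array[$key] : null,
// but shorter and without the double lookup.
$array = array('a' => 1, 'b' => 2);

$hit  = $array['a'] ?? null;       // existing key: returns the value
$miss = $array['missing'] ?? null; // missing key: returns null, no notice
```

It should be worth re-running the benchmark above with this variant on your own workload.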
First, re-organize the data for performance: build a new array that is still sorted by the original keys but uses a regular numeric index.
This part will be time consuming, but only done once.
// first sort the array by its keys
// (SORT_STRING avoids PHP's numeric-string comparison surprises on hex keys)
ksort($data, SORT_STRING);
// second create a new array with a numeric index
$tmp = array();
foreach ($data as $key => $value)
{
    $tmp[] = array('key' => $key, 'value' => $value);
}
// now save and use this data instead
save_to_file($tmp);
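save_to_file() is left as a placeholder above; a minimal sketch of what it might look like, assuming plain serialize()-based storage (both function names here are illustrative, not a fixed API):

```php
<?php
// Hypothetical helpers sketching the placeholder save_to_file(),
// using PHP's built-in serialize()/unserialize() for persistence.
function save_to_file($data, $file = 'sorted_data.bin')
{
    file_put_contents($file, serialize($data));
}

function load_from_file($file = 'sorted_data.bin')
{
    return unserialize(file_get_contents($file));
}
```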
Once that is done, finding a key with a binary search should be quick. Later you can use a function like this.
function findKey($key, $data, $start, $end)
{
    if ($end < $start)
    {
        return null;
    }
    $mid = (int)(($end - $start) / 2) + $start;
    // compare as strings so hex keys such as md5 hashes are not
    // subject to PHP's numeric-string comparison rules
    $cmp = strcmp($data[$mid]['key'], $key);
    if ($cmp > 0)
    {
        return findKey($key, $data, $start, $mid - 1);
    }
    else if ($cmp < 0)
    {
        return findKey($key, $data, $mid + 1, $end);
    }
    return $data[$mid]['value'];
}
To perform a search for a key you would do this.
// $end must be the last valid index, hence count($data) - 1
$result = findKey($key, $data, 0, count($data) - 1);
if($result === null)
{
// key not found.
}
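Since the lookup runs in a very tight loop, the per-probe function-call overhead of the recursion can matter. The same search can be written iteratively; a sketch equivalent to the recursive findKey (findKeyIter is my name for it, not part of the original):

```php
<?php
// Iterative equivalent of the recursive binary search:
// same contract (value on hit, null on miss), no recursion.
function findKeyIter($key, $data)
{
    $start = 0;
    $end = count($data) - 1;
    while ($start <= $end)
    {
        $mid = (int)(($end - $start) / 2) + $start;
        $cmp = strcmp($data[$mid]['key'], $key);
        if ($cmp > 0) {
            $end = $mid - 1;
        } elseif ($cmp < 0) {
            $start = $mid + 1;
        } else {
            return $data[$mid]['value'];
        }
    }
    return null; // key not found
}
```

Usage, against a small array already sorted by 'key':

```php
$data = array(
    array('key' => 'apple', 'value' => 1),
    array('key' => 'mango', 'value' => 2),
    array('key' => 'zebra', 'value' => 3),
);
$hit  = findKeyIter('mango', $data); // 2
$miss = findKeyIter('kiwi', $data);  // null
```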
If count($data) is evaluated on every search, you could cache that value in the file where you stored the array data.
I suspect this method will be a lot faster than a regular linear search repeated against $data. I can't promise it's faster; only an octree would be quicker, but the time to build the octree might cancel out the search gain (I've experienced that before). It depends on how much searching of the data you have to do.