Question
I recently built a PHP-based application that typically requires several (>10) seconds to parse a target string (>10 seconds because there are many thousands of checks on a typically 100kB+ string). I am looking for ways to reduce the execution time.
I started to wonder how each of PHP's "built-in" functions is written. For example, if you go to the strpos() reference in the manual (this link), there is a lot of info but not the algorithm.
Who knows, maybe I can write a function that is faster than the built-in function for my particular application? But I have no way of knowing the algorithm for e.g. strpos(). Does the algorithm use a method such as this one:
function strposHypothetical($haystack, $needle) {
    $haystackLength = strlen($haystack);
    $needleLength = strlen($needle); // for this question let's assume > 0
    $pos = false;
    for ($i = 0; $i < $haystackLength; $i++) {
        for ($j = 0; $j < $needleLength; $j++) {
            $thisSum = $i + $j;
            // >= (not >) so that $haystack[$thisSum] is never read out of range
            if (($thisSum >= $haystackLength) || ($needle[$j] !== $haystack[$thisSum])) break;
        }
        if ($j === $needleLength) {
            $pos = $i;
            break;
        }
    }
    return $pos;
}
or would it use a much slower method, say a combination of substr_count() to count occurrences of the needle and, if occurrences > 0, a for loop, or some other method?
I have profiled the functions and methods in my application and made significant progress in this way. Also, note that this post doesn't really help much. Where can I find out the algorithm used for each built-in function in PHP, or is this information proprietary?
Answer 1:
The built-in PHP functions can be found in /ext/standard/ in the PHP source code.
In the case of strpos, you can find the implementation in /ext/standard/string.c. At its core, this function uses php_memnstr, which is an alias of zend_memnstr:
found = (char*)php_memnstr(ZSTR_VAL(haystack) + offset,
                           Z_STRVAL_P(needle),
                           Z_STRLEN_P(needle),
                           ZSTR_VAL(haystack) + ZSTR_LEN(haystack));
And if we read the source of zend_memnstr, we can find the algorithm itself used to implement strpos:
while (p <= end) {
    if ((p = (const char *)memchr(p, *needle, (end-p+1))) && ne == p[needle_len-1]) {
        if (!memcmp(needle, p, needle_len-1)) {
            return p;
        }
    }
    if (p == NULL) {
        return NULL;
    }
    p++;
}
ne here represents the last character of needle, and p is a pointer which is incremented to scan through the haystack.
The function memchr is a C standard library function that searches a sequence of bytes for the first occurrence of a given byte / character (conceptually a simple linear scan). memcmp is a C standard library function that compares two byte / character ranges, which can lie within larger strings, byte by byte.
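To make the two library calls concrete, here is a small sketch of how each one behaves. The helper names first_byte_offset and ranges_equal are hypothetical wrappers made up for this illustration; only memchr and memcmp themselves come from the C standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of the first byte equal to c in buf[0..len),
 * or -1 if no such byte exists. */
long first_byte_offset(const char *buf, size_t len, char c)
{
    assert(buf != NULL);
    /* memchr returns a pointer to the first matching byte, or NULL if none. */
    const char *hit = memchr(buf, c, len);
    return hit ? (long)(hit - buf) : -1;
}

/* Hypothetical helper: 1 if the first n bytes of a and b are identical,
 * 0 otherwise. */
int ranges_equal(const char *a, const char *b, size_t n)
{
    assert(a != NULL && b != NULL);
    /* memcmp returns 0 when both ranges hold exactly the same bytes. */
    return memcmp(a, b, n) == 0;
}
```

For example, first_byte_offset("hello world", 11, 'o') finds the 'o' at offset 4, and ranges_equal("world", "worlds", 5) reports a match because only the first 5 bytes are compared.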
A pseudo-code version of this function is as follows:
while (p <= end) {
    find the next occurrence of the first character of needle;
    if (occurrence is found) {
        set `p` to point to this new location in the string;
        if ((character at `p` + `length of needle` - 1) == last character of needle) {
            if ((the `length of needle` - 1 characters starting at `p`) == (needle without its last character)) {
                return p; // Found position `p` of needle in haystack!
            }
        }
    } else {
        return NULL; // Needle does not exist in haystack.
    }
    p++;
}
This is a fairly efficient algorithm for finding the index of a substring in a string. It is pretty much the same algorithm as your strposHypothetical, and should be just as efficient complexity-wise, unless memcmp doesn't return early as soon as it sees that the strings differ by one character. And of course, being implemented in C, it will be leaner and faster.
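For comparison with strposHypothetical, here is a minimal standalone sketch of the same memchr + memcmp loop. It is not the PHP source: the function names find_substring and find_offset, the explicit length parameters, and the way end is derived from the two lengths are assumptions made for illustration, and a non-empty needle is assumed, as in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical standalone sketch (not the PHP source): returns a pointer to
 * the first occurrence of needle in haystack, or NULL if there is none.
 * Assumes needle_len > 0. */
const char *find_substring(const char *haystack, size_t haystack_len,
                           const char *needle, size_t needle_len)
{
    assert(needle_len > 0);
    if (needle_len > haystack_len) {
        return NULL; /* needle cannot fit in haystack */
    }

    const char *p = haystack;
    /* Last position at which a full needle still fits. */
    const char *end = haystack + (haystack_len - needle_len);
    const char ne = needle[needle_len - 1]; /* last byte of needle */

    while (p <= end) {
        /* Jump to the next byte that equals the first byte of needle. */
        p = memchr(p, needle[0], (size_t)(end - p) + 1);
        if (p == NULL) {
            return NULL; /* no candidate start position left */
        }
        /* Cheap last-byte check first, then compare the remaining bytes. */
        if (p[needle_len - 1] == ne && memcmp(needle, p, needle_len - 1) == 0) {
            return p;
        }
        p++;
    }
    return NULL;
}

/* Hypothetical wrapper mimicking strpos(): byte offset, or -1 if not found. */
long find_offset(const char *haystack, const char *needle)
{
    const char *hit = find_substring(haystack, strlen(haystack),
                                     needle, strlen(needle));
    return hit ? (long)(hit - haystack) : -1;
}
```

Calling find_offset("hello world", "world") yields 6, and a needle that does not occur yields -1. Compared with the nested PHP loops, the outer scan is delegated to memchr and the inner comparison to memcmp, so the work per candidate position is the same in principle but runs in native code.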
Source: https://stackoverflow.com/questions/38571807/where-can-i-find-the-algorithm-used-to-write-each-php-built-in-function