Where can I find the algorithm used to write each PHP “built-in” function?

Question

I recently built a PHP-based application that typically needs more than 10 seconds to parse a target string, because it runs many thousands of checks on a string that is typically 100kB or larger. I am looking for ways to reduce the execution time.

I started to wonder how each of PHP's "built-in" functions is written. For example, the strpos() reference page in the manual has a lot of information, but not the algorithm.

Who knows, maybe I can write a function that is faster than the built-in function for my particular application? But I have no way of knowing the algorithm for e.g. strpos(). Does the algorithm use a method such as this one:

function strposHypothetical($haystack, $needle) {

    $haystackLength = strlen($haystack);
    $needleLength   = strlen($needle);//for this question let's assume > 0

    $pos = false;

    for($i = 0; $i < $haystackLength; $i++) {
        for($j = 0; $j < $needleLength; $j++) {
            $thisSum = $i + $j;
            if (($thisSum >= $haystackLength) || ($needle[$j] !== $haystack[$thisSum])) break;
        }
        if ($j === $needleLength) {
            $pos = $i;
            break;
        }
    }
    return $pos;
}

or would it use a much slower method, say a combination of substr_count() to count occurrences of the needle and, if there are any, a for loop, or some other method?

I have profiled the functions and methods in my application and made significant progress in this way. Also, note that this post doesn't really help much. Where can I find out the algorithm used for each built-in function in PHP, or is this information proprietary?

Answer 1:

The built-in PHP functions can be found in /ext/standard/ in the PHP source code.

In the case of strpos, you can find its implementation in /ext/standard/string.c. At its core, this function uses php_memnstr, which is an alias of zend_memnstr:

found = (char*)php_memnstr(ZSTR_VAL(haystack) + offset,
                           Z_STRVAL_P(needle),
                           Z_STRLEN_P(needle),
                           ZSTR_VAL(haystack) + ZSTR_LEN(haystack));

And if we read the source of zend_memnstr, we can find the algorithm itself used to implement strpos:

while (p <= end) {
    if ((p = (const char *)memchr(p, *needle, (end-p+1))) && ne == p[needle_len-1]) {
        if (!memcmp(needle, p, needle_len-1)) {
            return p;
        }
    }

    if (p == NULL) {
        return NULL;
    }
    p++;
}

ne here represents the last character of needle, and p is a pointer which is incremented to scan through the haystack.

memchr is a C function that searches a sequence of bytes for the first occurrence of a given byte / character; a straightforward implementation does a simple linear scan. memcmp is a C function that compares two byte / character ranges, which may lie inside larger strings, byte by byte.
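As a small illustration of those two calls, here is a sketch in C. The wrapper names first_byte_offset and same_bytes are hypothetical and exist only for this example; memchr and memcmp themselves are the standard <string.h> functions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: offset of the first byte equal to c within the
 * first n bytes of buf, or -1 if that byte does not occur. */
long first_byte_offset(const char *buf, size_t n, char c)
{
    const char *hit = memchr(buf, c, n);
    return hit ? (long)(hit - buf) : -1;
}

/* Hypothetical wrapper: 1 if the first n bytes of a and b are identical,
 * 0 otherwise. memcmp returns 0 exactly when the two ranges are equal. */
int same_bytes(const char *a, const char *b, size_t n)
{
    return memcmp(a, b, n) == 0;
}
```

For instance, first_byte_offset("haystack", 8, 's') gives 3, and same_bytes("needle", "needed", 4) gives 1 because the two strings only differ from the fifth byte onward.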

A pseudo-code version of this function is as follows:

while (p <= end) {
    find the next occurrence of the first character of needle;
    if (occurrence is found) {
        set `p` to point to this new location in the string;
        if ((character at `p` + `length of needle` - 1) == last character of needle) {
            if ((`length of needle` - 1 characters starting at `p`) == start of needle) {
                return p; // Found position `p` of needle in haystack!
            }
        }
    } else {
        return NULL; // Needle does not exist in haystack.
    }
    p++;
}

This is a fairly efficient algorithm for finding the position of a substring in a string. It is pretty much the same algorithm as your strposHypothetical, and should be just as efficient complexity-wise, unless memcmp does not return as soon as it sees that the two ranges differ. Being implemented in C, it will of course be leaner and faster.
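To make the quoted loop easier to experiment with, here is a minimal self-contained sketch in C of the same idea. It is not the Zend source: the names find_bytes and find_offset are hypothetical, the setup of p, end and ne is my assumption about what surrounds the loop, and treating an empty needle as a match at offset 0 is an arbitrary choice made for the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the Zend source: returns a pointer to the first
 * occurrence of needle within haystack_len bytes of haystack, or NULL. */
const char *find_bytes(const char *haystack, size_t haystack_len,
                       const char *needle, size_t needle_len)
{
    if (needle_len == 0) {
        return haystack;   /* assumption: an empty needle matches at the start */
    }
    if (needle_len > haystack_len) {
        return NULL;       /* the needle cannot fit */
    }

    const char *p = haystack;
    /* Last position at which a complete needle could still start. */
    const char *end = haystack + (haystack_len - needle_len);
    const char ne = needle[needle_len - 1];   /* last byte of the needle */

    while (p <= end) {
        /* Jump to the next byte equal to the needle's first byte. */
        p = memchr(p, needle[0], (size_t)(end - p) + 1);
        if (p == NULL) {
            return NULL;   /* the first byte never occurs again */
        }
        /* Cheap check on the last byte, then compare the remaining bytes. */
        if (p[needle_len - 1] == ne && memcmp(needle, p, needle_len - 1) == 0) {
            return p;
        }
        p++;               /* false candidate: resume one byte further on */
    }
    return NULL;
}

/* strpos-like wrapper for NUL-terminated strings: byte offset of the
 * first match, or -1 if the needle is absent. */
long find_offset(const char *haystack, const char *needle)
{
    const char *hit = find_bytes(haystack, strlen(haystack),
                                 needle, strlen(needle));
    return hit ? (long)(hit - haystack) : -1;
}
```

With this sketch, find_offset("hello world", "world") evaluates to 6 and find_offset("hello", "xyz") to -1. The last-byte check before memcmp is only a shortcut for rejecting candidates early; it does not change the worst case, which still grows with haystack length times needle length.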

Source: https://stackoverflow.com/questions/38571807/where-can-i-find-the-algorithm-used-to-write-each-php-built-in-function

Tags

php

built-in