Does PHP have no function for XML-safe entity decoding? Is there no xml_entity_decode?

旧时难觅i 2020-12-17 02:20

THE PROBLEM: I need an XML file "fully encoded" in UTF-8; that is, with no entities representing symbols, and all symbols encoded as UTF-8, except the only 3 ones that are

6 answers
  •  庸人自扰
    2020-12-17 02:44

    I had the same problem because someone used HTML templates to create XML instead of using SimpleXML. *sigh*... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, however unlikely its presence in the source XML.

    Note: I'm assuming default encoding is UTF-8

    // Search for named entities (strings like "&abc1;").
    echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode as XML entities. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" becomes "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "&euro;&amp;foo &Ccedil;") . "\n";
    
    /* €&amp;foo Ç */
    

    Also, if you want to replace special characters with numbered entities (in case you don't want a UTF-8 XML), you can easily add a function to the above code:

    // Search for named entities (strings like "&abc1;").
    $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode as XML entities. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" becomes "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "&euro;&amp;foo &Ccedil;") . "\n";
    
    // Encodes every character from U+0080 to U+FFFF as a numbered entity.
    echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
    
    /* &#8364;&amp;foo &#199; */
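
    One caveat with the map used above: it stops at U+FFFF, so characters outside the Basic Multilingual Plane (emoji, for instance) are left as raw UTF-8. The sketch below widens the range to U+10FFFF and uses a mask that keeps all 21 bits of the code point. The helper name `to_ascii_xml` is made up for illustration, and the code assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not a PHP built-in): turn UTF-8 XML text into
// ASCII-only text by using numeric character references.
function to_ascii_xml(string $xml): string
{
    // Map format: [start, end, offset, mask].
    // The range covers U+0080..U+10FFFF; the mask 0x1FFFFF keeps every bit
    // of the code point, so values above U+FFFF are not truncated.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($xml, $map, 'UTF-8');
}

// "&amp;" is plain ASCII, so it passes through unchanged.
echo to_ascii_xml("\u{20AC}&amp;foo \u{C7} \u{1F600}"), "\n";
```

    Note that a mask of 0x10FFFF would be wrong here: the mask is ANDed with the code point, so it would clear bits 16 to 19 and corrupt characters such as U+1F600.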
    

    In your case you want it the other way around: decode numbered entities to UTF-8.

    // Search for named entities (strings like "&abc1;").
    $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode as XML entities. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" becomes "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "&#8364;&amp;foo &Ccedil;") . "\n";
    
    // Decodes (uncaught) numbered entities to UTF-8.
    echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
    
    /* €&amp;foo Ç */
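
    If mbstring is not available, named and numbered entities can probably be handled in a single pass, because `html_entity_decode` also understands decimal and hexadecimal references. This is only a sketch under a few assumptions: the function name `xml_entity_decode_all` is hypothetical, the HTML5 entity table is used for decoding (so `&apos;` is recognized), and all five XML predefined entities are re-escaped, not just the three the question lists, so that quotes inside attribute values stay safe.

```php
<?php
// Hypothetical helper (not a PHP built-in): decode named and numeric
// entities to UTF-8 while keeping XML's predefined entities escaped.
function xml_entity_decode_all(string $xml): string
{
    // Named (&name;), decimal (&#123;) and hexadecimal (&#x1F;) references.
    $pattern = '/&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);/i';

    return preg_replace_callback($pattern, function (array $m): string {
        $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded === $m[0]) {
            // Not a known entity: leave the text exactly as it was.
            return $m[0];
        }
        // Re-escape only what XML 1.0 reserves: & < > " '
        return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
    }, $xml);
}

echo xml_entity_decode_all('&#8364;&amp;foo &Ccedil; &#x222C; &#60;b&#62; &bogus;'), "\n";
```

    The early return matters: without it, an unrecognized `&bogus;` would come back from `htmlspecialchars` as `&amp;bogus;`.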
    

    Benchmark

    I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.

    &euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;
    

    Your method

    php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    
    €&amp;foo Ç é &amp; ∬
    =====
    Time taken: 2.0397531986237
    

    My method

    php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    
    €&amp;foo Ç é #_x_amp#; &#8748;
    =====
    Time taken: 4.045273065567
    

    My method (with unicode to numbered entity):

    php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    
    &#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;
    =====
    Time taken: 5.4407880306244
    

    My method (with numbered entity to unicode):

    php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    
    €&amp;foo Ç é #_x_amp#; ∬
    =====
    Time taken: 5.5400078296661
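
    The one-liners are hard to read, so here is a rough script version of the first two measurements. Treat it as a sketch: the function names and the `time_it` helper are invented for illustration, the short sample input is an assumption, and the placeholder variant assumes the question protects `&amp;`, `&gt;` and `&lt;` before decoding. Timings will differ per machine, so only the outputs are meaningful.

```php
<?php
// Placeholder approach (as in the question): protect the reserved
// entities, decode everything else, then restore the placeholders.
function decode_with_placeholders(string $s): string
{
    $reserved = ['&amp;', '&gt;', '&lt;'];
    $marks    = ['#_x_amp#;', '#_x_gt#;', '#_x_lt#;'];
    $protected = str_replace($reserved, $marks, $s);
    $decoded   = html_entity_decode($protected, ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');

    return str_replace($marks, $reserved, $decoded);
}

// Callback approach (as in this answer): decode each named entity and
// re-escape only what XML reserves.
function decode_with_callback(string $s): string
{
    return preg_replace_callback('#&[A-Z0-9]+;#i', function (array $m): string {
        return htmlentities(html_entity_decode($m[0]), ENT_XML1);
    }, $s);
}

// Hypothetical timing helper: runs $fn $n times, returns [last result, seconds].
function time_it(callable $fn, string $input, int $n = 100000): array
{
    $result = '';
    $start  = microtime(true);
    for ($i = 0; $i < $n; $i++) {
        $result = $fn($input);
    }

    return [$result, microtime(true) - $start];
}

// Assumed sample input containing a literal placeholder string.
$input = '&euro;&amp;foo #_x_amp#;';
foreach (['decode_with_placeholders', 'decode_with_callback'] as $fn) {
    [$out, $secs] = time_it($fn, $input);
    printf("%-26s %s  (%.3fs)\n", $fn, $out, $secs);
}
```

    With this input the placeholder version rewrites the literal `#_x_amp#;` to `&amp;`, while the callback version leaves it alone, which is the flaw described at the top of this answer.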
    
