Is there a way to keep entities intact while parsing HTML with DOMDocument?

孤街浪徒 2020-12-10 06:56

I have this function to ensure every img tag has an absolute URL:

function absoluteSrc($html, $encoding = 'utf-8')
{
    $dom = new DOMDocume         


        
3 Answers
  • 2020-12-10 07:36

    I'd like to know the answer to this as well.

    I ended up converting &..; entities to **ENTITY-...-ENTITY** before parsing and converting back after it is done.
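
The placeholder round trip described in this answer could look roughly like the following. This is only a minimal sketch: the helper names `protectEntities()`, `restoreEntities()` and `roundTripKeepingEntities()` are made up for illustration, and it assumes the literal text `ENTITY-` never occurs in the input.

```php
<?php
// Hypothetical helpers; the names are illustrative, not from the answer.

// Replace each entity (&copy;, &#169;, &#xA9;) with a placeholder that
// contains no "&" or ";", so the HTML parser treats it as plain text.
function protectEntities(string $html): string
{
    return preg_replace('/&(#?[A-Za-z0-9]+);/', 'ENTITY-$1-ENTITY', $html);
}

// Turn the placeholders back into the entities they stand for.
function restoreEntities(string $html): string
{
    return preg_replace('/ENTITY-(#?[A-Za-z0-9]+)-ENTITY/', '&$1;', $html);
}

// Parse and re-serialize while keeping entities as written.
// Assumes the DOM extension is loaded; DOM edits would go where marked.
function roundTripKeepingEntities(string $html): string
{
    $dom = new DOMDocument();
    // The @ silences warnings libxml raises for imperfect markup.
    @$dom->loadHTML(protectEntities($html));
    // ... modify $dom here ...
    return restoreEntities($dom->saveHTML());
}
```

Because `&amp;`, `&lt;` and `&gt;` are protected as well, they come back exactly as they were typed rather than being re-encoded by the serializer.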

  • 2020-12-10 07:38

    An alternative solution is to use DOMDocument->saveHTMLFile() (which doesn't convert HTML entities) and read the contents of the saved file back into a string.

    It's not super pretty, but it has the advantage of not having to manually find-and-replace entity codes yourself (twice) as per some other solutions proffered here.
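
A sketch of the temporary-file round trip this answer describes, assuming the DOM extension is available. The function name `saveHtmlViaTempFile()` is hypothetical, and whether entities actually survive is the answer's claim, so it is worth checking against your own PHP and libxml versions.

```php
<?php
// Hypothetical helper: serialize a DOMDocument with saveHTMLFile()
// and read the result back into a string.
function saveHtmlViaTempFile(DOMDocument $dom): string
{
    // tempnam() creates an empty file and returns its path.
    $path = tempnam(sys_get_temp_dir(), 'dom');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file.');
    }

    try {
        if ($dom->saveHTMLFile($path) === false) {
            throw new RuntimeException('saveHTMLFile() failed.');
        }
        $html = file_get_contents($path);
        if ($html === false) {
            throw new RuntimeException('Could not read the temporary file.');
        }
        return $html;
    } finally {
        // Always clean up, even if an exception was thrown.
        @unlink($path);
    }
}
```

The `finally` block keeps temporary files from piling up when serialization fails partway through.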

  • 2020-12-10 07:56

    The following code seems to work (the start of the enclosing method and the getInnerHTML() helper are not shown):

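        $dom = new DOMDocument('1.0', 'UTF-8');
        // Swap entities for placeholders so the parser leaves them alone.
        $dom->loadHTML($this->htmlentities2stringcode(rawurldecode($content)));
        $dom->preserveWhiteSpace = true;

        // Strip the wrapper markup and empty paragraphs, and encode "+" as "%2B".
        // str_replace() applies the pairs in order, one after another.
        $innerHTML = str_replace(
            array("<p></p>", "+", "</body></html>", "<html></html><html><body>"),
            array("", "%2B", "", ""),
            $this->getInnerHTML($dom)
        );

        // Turn the placeholders back into the original entities.
        return $this->stringcode2htmlentities($innerHTML);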
    }
    // ----------------------------------------------------------
    function htmlentities2stringcode($string) {
        // Convert HTML entities such as &euro; into their pseudo-string form, e.g. ^euro^
        $map = $this->getHTMLEntityStringCodeArray();
        return str_replace(array_keys($map), array_values($map), $string);
    }
    // ----------------------------------------------------------
    function stringcode2htmlentities($string) {
        // Convert pseudo-strings such as ^euro^ back into the original HTML entity, e.g. &euro;
        $map = $this->getHTMLEntityStringCodeArray();
        return str_replace(array_values($map), array_keys($map), $string);
    }
    // ----------------------------------------------------------
    function getHTMLEntityStringCodeArray() {
    
          return array('&Alpha;'=>'^Alpha^', 
                                        '&Beta;'=>'^Beta^', 
                                        '&Chi;'=>'^Chi^', 
                                        '&Dagger;'=>'^Dagger^', 
                                        '&Delta;'=>'^Delta^', 
                                        '&Epsilon;'=>'^Epsilon^', 
                                        '&Eta;'=>'^Eta^', 
                                        '&Gamma;'=>'^Gamma^', 
                                        '&Iota;'=>'^Iota^',
                                        '&Kappa;'=>'^Kappa^', 
                                        '&Lambda;'=>'^Lambda^', 
                                        '&Mu;'=>'^Mu^', 
                                        '&Nu;'=>'^Nu^', 
                                        '&OElig;'=>'^OElig^', 
                                        '&Omega;'=>'^Omega^', 
                                        '&Omicron;'=>'^Omicron^',
                                        '&Phi;'=>'^Phi^', 
                                        '&Pi;'=>'^Pi^', 
                                        '&Prime;'=>'^Prime^', 
                                        '&Psi;'=>'^Psi^', 
                                        '&Rho;'=>'^Rho^', 
                                        '&Scaron;'=>'^Scaron^',
                                        '&Sigma;'=>'^Sigma^',
                                        '&Tau;'=>'^Tau^',
                                        '&Theta;'=>'^Theta^',
                                        '&Upsilon;'=>'^Upsilon^',
                                        '&Xi;'=>'^Xi^',
                                        '&Yuml;'=>'^Yuml^',
                                        '&Zeta;'=>'^Zeta^',
                                        '&alefsym;'=>'^alefsym^',
                                        '&alpha;'=>'^alpha^',
                                        '&and;'=>'^and^',
                                        '&ang;'=>'^ang^',
                                        '&asymp;'=>'^asymp^',
                                        '&bdquo;'=>'^bdquo^',
                                        '&beta;'=>'^beta^',
                                        '&bull;'=>'^bull^',
                                        '&cap;'=>'^cap^',
                                        '&chi;'=>'^chi^',
                                        '&circ;'=>'^circ^',
                                        '&clubs;'=>'^clubs^',
                                        '&cong;'=>'^cong^',
                                        '&crarr;'=>'^crarr^',
                                        '&cup;'=>'^cup^',
                                        '&dArr;'=>'^dArr^',
                                        '&dagger;'=>'^dagger^',
                                        '&darr;'=>'^darr^',
                                        '&delta;'=>'^delta^',
                                        '&diams;'=>'^diams^',
                                        '&empty;'=>'^empty^',
                                        '&emsp;'=>'^emsp^',
                                        '&ensp;'=>'^ensp^',
                                        '&epsilon;'=>'^epsilon^',
                                        '&equiv;'=>'^equiv^',
                                        '&eta;'=>'^eta^',
                                        '&euro;'=>'^euro^',
                                        '&exist;'=>'^exist^',
                                        '&fnof;'=>'^fnof^',
                                        '&forall;'=>'^forall^',
                                        '&frasl;'=>'^frasl^',
                                        '&gamma;'=>'^gamma^',
                                        '&ge;'=>'^ge^',
                                        '&hArr;'=>'^hArr^',
                                        '&harr;'=>'^harr^',
                                        '&hearts;'=>'^hearts^',
                                        '&hellip;'=>'^hellip^',
                                        '&image;'=>'^image^',
                                        '&infin;'=>'^infin^',
                                        '&int;'=>'^int^',
                                        '&iota;'=>'^iota^',
                                        '&isin;'=>'^isin^',
                                        '&kappa;'=>'^kappa^',
                                        '&lArr;'=>'^lArr^',
                                        '&lambda;'=>'^lambda^',
                                        '&lang;'=>'^lang^',
                                        '&larr;'=>'^larr^',
                                        '&lceil;'=>'^lceil^',
                                        '&ldquo;'=>'^ldquo^',
                                        '&le;'=>'^le^',
                                        '&lfloor;'=>'^lfloor^',
                                        '&lowast;'=>'^lowast^',
                                        '&loz;'=>'^loz^',
                                        '&lrm;'=>'^lrm^',
                                        '&lsaquo;'=>'^lsaquo^',
                                        '&lsquo;'=>'^lsquo^',
                                        '&mdash;'=>'^mdash^',
                                        '&minus;'=>'^minus^',
                                        '&mu;'=>'^mu^',
                                        '&nabla;'=>'^nabla^',
                                        '&ndash;'=>'^ndash^',
                                        '&ne;'=>'^ne^',
                                        '&ni;'=>'^ni^',
                                        '&notin;'=>'^notin^',
                                        '&nsub;'=>'^nsub^',
                                        '&nu;'=>'^nu^',
                                        '&oelig;'=>'^oelig^',
                                        '&oline;'=>'^oline^',
                                        '&omega;'=>'^omega^',
                                        '&omicron;'=>'^omicron^',
                                        '&oplus;'=>'^oplus^',
                                        '&or;'=>'^or^',
                                        '&otimes;'=>'^otimes^',
                                        '&part;'=>'^part^',
                                        '&permil;'=>'^permil^',
                                        '&perp;'=>'^perp^',
                                        '&phi;'=>'^phi^',
                                        '&pi;'=>'^pi^', 
                                        '&piv;'=>'^piv^',
                                        '&prime;'=>'^prime^',
                                        '&prod;'=>'^prod^',
                                        '&prop;'=>'^prop^',
                                        '&psi;'=>'^psi^',
                                        '&rArr;'=>'^rArr^',
                                        '&radic;'=>'^radic^',
                                        '&rang;'=>'^rang^',
                                        '&rarr;'=>'^rarr^',
                                        '&rceil;'=>'^rceil^',
                                        '&rdquo;'=>'^rdquo^',
                                        '&real;'=>'^real^',
                                        '&rfloor;'=>'^rfloor^',
                                        '&rho;'=>'^rho^',
                                        '&rlm;'=>'^rlm^',
                                        '&rsaquo;'=>'^rsaquo^',
                                        '&rsquo;'=>'^rsquo^',
                                        '&sbquo;'=>'^sbquo^',
                                        '&scaron;'=>'^scaron^',
                                        '&sdot;'=>'^sdot^',
                                        '&sigma;'=>'^sigma^',
                                        '&sigmaf;'=>'^sigmaf^',
                                        '&sim;'=>'^sim^',
                                        '&spades;'=>'^spades^',
                                        '&sub;'=>'^sub^',
                                        '&sube;'=>'^sube^',
                                        '&sum;'=>'^sum^',
                                        '&sup;'=>'^sup^',
                                        '&supe;'=>'^supe^',
                                        '&tau;'=>'^tau^',
                                        '&there4;'=>'^there4^',
                                        '&theta;'=>'^theta^',
                                        '&thetasym;'=>'^thetasym^',
                                        '&thinsp;'=>'^thinsp^',
                                        '&tilde;'=>'^tilde^',
                                        '&trade;'=>'^trade^',
                                        '&uArr;'=>'^uArr^',
                                        '&uarr;'=>'^uarr^',
                                        '&upsih;'=>'^upsih^',
                                        '&upsilon;'=>'^upsilon^',
                                        '&weierp;'=>'^weierp^',
                                        '&xi;'=>'^xi^',
                                        '&yuml;'=>'^yuml^',
                                        '&zeta;'=>'^zeta^',
                                        '&zwj;'=>'^zwj^',
                                        '&zwnj;'=>'^zwnj^');
        }
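
A hand-typed table like the one above is easy to get wrong, so one alternative is to derive the same kind of map from PHP's built-in HTML 4.01 entity table. The sketch below is an assumption about how that could be done, not part of the answer: the function names are hypothetical, the generated map also covers entities the list above leaves out (for example `&amp;` and `&eacute;`), and it assumes caret-delimited names such as `^euro^` do not already occur in the input.

```php
<?php
// Hypothetical helpers; names are illustrative, not from the answer.

// Build an '&name;' => '^name^' map from PHP's own entity table.
function entityPlaceholderMap(): array
{
    $map = [];
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    foreach ($table as $entity) {
        // Keep named entities only; numeric ones such as &#039; are skipped.
        if (preg_match('/^&([A-Za-z][A-Za-z0-9]*);$/', $entity, $m)) {
            $map[$entity] = '^' . $m[1] . '^';
        }
    }
    return $map;
}

// '&euro;' becomes '^euro^' before the HTML is handed to the parser.
function entitiesToPlaceholders(string $html): string
{
    $map = entityPlaceholderMap();
    return str_replace(array_keys($map), array_values($map), $html);
}

// '^euro^' becomes '&euro;' again after the HTML has been serialized.
function placeholdersToEntities(string $html): string
{
    $map = entityPlaceholderMap();
    return str_replace(array_values($map), array_keys($map), $html);
}
```

Since every entity gets a placeholder built from its own name, two different entities can never collapse into the same placeholder on the way back.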
    