Problems Indenting HTML(5) with PHP

Question

Disclaimer: Please bear with the length of this question. This is a recurring question for a real-world problem that I've seen asked hundreds of times with no clear, working solution ever being presented.

I have hundreds of HTML files I want to mass indent using PHP. At first I thought of using Tidy but, as you may know, it isn't compatible with HTML5 tags and attributes by default. After some research and even more tests, I came up with the following implementation that "fakes" HTML5 support:

function Tidy5($string, $options = null, $encoding = 'utf8')
{
    $tags = array();
    $default = array
    (
        'anchor-as-name' => false,
        'break-before-br' => true,
        'char-encoding' => $encoding,
        'decorate-inferred-ul' => false,
        'doctype' => 'omit',
        'drop-empty-paras' => false,
        'drop-font-tags' => true,
        'drop-proprietary-attributes' => false,
        'force-output' => true,
        'hide-comments' => false,
        'indent' => true,
        'indent-attributes' => false,
        'indent-spaces' => 2,
        'input-encoding' => $encoding,
        'join-styles' => false,
        'logical-emphasis' => false,
        'merge-divs' => false,
        'merge-spans' => false,
        'new-blocklevel-tags' => ' article aside audio details dialog figcaption figure footer header hgroup menutidy nav section source summary track video',
        'new-empty-tags' => 'command embed keygen source track wbr',
        'new-inline-tags' => 'btidy canvas command data datalist embed itidy keygen mark meter output progress time wbr',
        'newline' => 0,
        'numeric-entities' => false,
        'output-bom' => false,
        'output-encoding' => $encoding,
        'output-html' => true,
        'preserve-entities' => true,
        'quiet' => true,
        'quote-ampersand' => true,
        'quote-marks' => false,
        'repeated-attributes' => 1,
        'show-body-only' => true,
        'show-warnings' => false,
        'sort-attributes' => 1,
        'tab-size' => 4,
        'tidy-mark' => false,
        'vertical-space' => true,
        'wrap' => 0,
    );

    $doctype = $menu = null;

    if ((strncasecmp($string, '<!DOCTYPE', 9) === 0) || (strncasecmp($string, '<html', 5) === 0))
    {
        $doctype = '<!DOCTYPE html>'; $options['show-body-only'] = false;
    }

    $options = (is_array($options) === true) ? array_merge($default, $options) : $default;

    foreach (array('b', 'i', 'menu') as $tag)
    {
        if (strpos($string, '<' . $tag . ' ') !== false)
        {
            $tags[$tag] = array
            (
                '<' . $tag . ' ' => '<' . $tag . 'tidy ',
                '</' . $tag . '>' => '</' . $tag . 'tidy>',
            );

            $string = str_replace(array_keys($tags[$tag]), $tags[$tag], $string);
        }
    }

    $string = tidy_repair_string($string, $options, $encoding);

    if (empty($string) !== true)
    {
        foreach ($tags as $tag)
        {
            $string = str_replace($tag, array_keys($tag), $string);
        }

        if (isset($doctype) === true)
        {
            $string = $doctype . "\n" . $string;
        }

        return $string;
    }

    return false;
}

It works, but has two flaws. The first is that HTML comments, script and style tags are not correctly indented:

<link href="/_/style/form.css" rel="stylesheet" type="text/css"><!--[if lt IE 9]>
    <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<!--<script type="text/javascript" src="//raw.github.com/kevinburke/tecate/master/tecate.js"></script>-->

</script><script charset="UTF-8" src="//cdnjs.cloudflare.com/ajax/libs/bootstrap-datepicker/1.0.0/js/locales/bootstrap-datepicker.pt.js" type="text/javascript">
</script><!--<script src="/3rd/parsley/i18n/messages.pt_br.js"></script>-->
    <!--<script src="//cdnjs.cloudflare.com/ajax/libs/parsley.js/1.1.10/parsley.min.js"></script>-->
    <script src="/3rd/select2/locales/select2_locale_pt-PT.js" type="text/javascript">
</script><script src="/3rd/tcrosen/bootstrap-typeahead.js" type="text/javascript">

And the other flaw, which is way more critical: Tidy converts all menu tags to ul and insists on dropping any empty inline tag, forcing me to hack my way around it. To make that absolutely clear, here are some examples:

<br /> empty tag
<i>text</i> inline tag
<i class="icon-home"></i> empty inline tag (example from Font Awesome)

If you inspect the code, you'll notice that I've accounted for b, i and menu tags using a not-perfect str_replace hack - I could have used a more robust regular expression or even str_ireplace to accomplish the same thing, but for my purposes str_replace is faster and good enough. However, that still leaves behind any other empty inline tags that I haven't accounted for, which sucks.
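
The "more robust regular expression" mentioned above could look roughly like the sketch below. It is an illustration only, not the code used in Tidy5(): the helper names protectTags() and restoreTags() and the $suffix parameter are assumptions. Unlike the str_replace() version it ignores case and also renames tags that carry no attributes, while the word boundary keeps it away from <br>, <body> and similar names.

```php
<?php
// Hypothetical helpers: rename <b>, <i> and <menu> so Tidy sees them as
// custom tags, then undo the rename after tidy_repair_string() has run.
function protectTags(string $html, array $tags = ['b', 'i', 'menu'], string $suffix = 'tidy'): string
{
    $names = implode('|', array_map('preg_quote', $tags));

    // Matches "<b", "</b" and "<I class=..." but not "<br" or "<body",
    // because \b requires the tag name to end right there.
    return preg_replace('~<(/?)(' . $names . ')\b~i', '<$1${2}' . $suffix, $html);
}

function restoreTags(string $html, array $tags = ['b', 'i', 'menu'], string $suffix = 'tidy'): string
{
    $names = implode('|', array_map('preg_quote', $tags));

    // Reverse of protectTags(): strip the suffix from opening and closing tags.
    return preg_replace('~<(/?)(' . $names . ')' . preg_quote($suffix, '~') . '\b~i', '<$1$2', $html);
}

$input     = '<i class="icon-home"></i><b>x</b><br><BODY>';
$protected = protectTags($input);
$restored  = restoreTags($protected);
```

Any tag that has to survive Tidy still needs to be listed by hand, so this narrows the hack rather than removing it.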

So I turned to DOMDocument, but I soon discovered that in order for formatOutput to work I have to:

strip all whitespace between tags (using a regex of course: replace '~>[[:space:]]++<~m' with '><')
convert all newline combinations to \n so it doesn't encode \r as &#13; for instance
load the input string as HTML, output as XML

To my surprise, DOMDocument also has problems with empty inline tags: whenever it sees <i></i> or similar, it will turn that into <i/>, which will completely mess up the browser rendering of the page. To overcome that, I've found that using LIBXML_NOEMPTYTAG along with DOMDocument::saveXML() will turn any tag without content (including truly empty tags such as <br />) into an explicit opening and closing tag pair, so for instance:

<i></i> stays the same (as it should)
<br /> becomes <br></br>, messing up the browser rendering (yet again)
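
The two serialization behaviours described above can be reproduced with a few lines. This is a minimal sketch that assumes the dom extension is loaded; the sample markup and the variable names are made up for illustration.

```php
<?php
// Assumes the dom extension (libxml) is available.
libxml_use_internal_errors(true); // keep parser warnings out of the output

$dom = new DOMDocument();
$dom->loadHTML('<p><i class="icon-home"></i><br></p>');
$body = $dom->getElementsByTagName('body')->item(0);

// Default XML serialization collapses every childless element.
$collapsed = $dom->saveXML($body);

// LIBXML_NOEMPTYTAG expands every childless element, void ones included.
$expanded = $dom->saveXML($body, LIBXML_NOEMPTYTAG);

echo $collapsed, "\n", $expanded, "\n";
```

Neither output is valid HTML on its own, which is why the function below patches the void elements back with a regex.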

function DOM5($html)
{
    $dom = new \DOMDocument();

    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }

    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);

    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;

        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            $regex = array
            (
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}

Seems like the two most recommended and validated methods of indenting HTML don't produce correct or reliable results for HTML5 in-the-wild, and I have to succumb to the dark god Cthulhu.

I did try other libraries, such as:

html5lib - couldn't get DOMDocument::$formatOutput to work
tidy-html5 - same problems as normal tidy, except it supports HTML5 tags / attributes

At this point, I'm considering writing something that works only with regexes if no better solution exists. But I thought that perhaps DOMDocument could be forced to work with HTML5 and script / style tags by using a custom XSLT. I've never played around with XSLTs before so I don't know if this is realistic or not, perhaps one of you XML experts could tell me and perhaps provide a starting point.
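
As a possible starting point for the XSLT idea, the sketch below runs an identity transform whose only job is to request the HTML output method. It assumes the dom and xsl extensions are installed, and xsltSerialize() is a hypothetical helper name. With method="html" the serializer is expected to keep empty inline elements as open/close pairs and to write void elements without a closing tag; how much indentation indent="yes" actually produces depends on the underlying libxslt/libxml2 serializer, so that part would need testing against real files.

```php
<?php
// Identity stylesheet: copy everything, let <xsl:output> choose HTML serialization.
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Hypothetical helper, not part of any library.
function xsltSerialize(string $html, string $xsl)
{
    libxml_use_internal_errors(true);

    $source = new DOMDocument();
    $source->loadHTML($html);

    $style = new DOMDocument();
    $style->loadXML($xsl);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($style);

    return $processor->transformToXml($source); // string on success
}

$result = xsltSerialize('<p><i class="icon-home"></i><br>text</p>', $xsl);
echo $result;
```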

Answer 1:

You have not mentioned whether your intention is to transform pages for production purposes or for development, e.g. when debugging HTML output.

If it is the latter, and since you have already mentioned writing a regex-based solution, I have written Dindent for that purpose.

You have not included a sample of the input and the expected output. You can test my implementation using the sandbox.

Answer 2:

To beautify my HTML5 code I wrote a small PHP class. It's not perfect, but it basically does the job for my purposes, and relatively quickly. Maybe it's useful.

<?php
namespace LBR\LbrService;

/**
 * This script has no licensing-model - do what you want to do with it.
 * 
 * This script is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *  
 * @author 2014 sunixzs <sunixzs@gmail.com>
 *
 * What does this script do?
 * Take unlovely HTML-sourcecode, remove temporarily any sections that should not 
 * be processed (e.g. textarea, pre and script), then remove all spaces and linebreaks
 * to define them anew by referencing some tag-lists. After this, indent the newly created
 * lines, also by reference to tag-lists. At the end put the temporary stuff back into the
 * newly generated, hopefully beautiful sourcecode.
 *
 */
class BeautifyMyHtml {

    /**
     * HTML-Tags which should not be processed.
     * Only tags with an opening and a closing tag work: <example some="attributes">some content</example>
     * <img src="some.source" alt="" /> does not work because of the short end.
     * 
     * @var array
     */
    protected $tagsToIgnore = array (
            'script',
            'textarea',
            'pre',
            'style' 
    );

    /**
     * Code-Blocks which should not be processed are temporarily stored in this array.
     * 
     * @var array
     */
    protected $tagsToIgnoreBlocks = array ();

    /**
     * The tag to ignore at currently used runtime.
     * I had to define this in class and not local in method to get the
     * possibility to access this on anonymous function in preg_replace_callback.
     * 
     * @var string
     */
    protected $currentTagToIgnore;

    /**
     * Remove white-space before and after each line of blocks, which should not be processed?
     *
     * @var boolean
     */
    protected $trimTagsToIgnore = false;

    /**
     * Character used for indentation
     * 
     * @var string
     */
    protected $spaceCharacter = "\t";

    /**
     * Remove html-comments?
     *
     * @var boolean
     */
    protected $removeComments = false;

    /**
     * preg_replace()-Pattern which define opening tags to wrap with newlines.
     * <tag> becomes \n<tag>\n
     * 
     * @var array
     */
    protected $openTagsPattern = array (
            "/(<html\b[^>]*>)/i",
            "/(<head\b[^>]*>)/i",
            "/(<body\b[^>]*>)/i",
            "/(<link\b[^>]*>)/i",
            "/(<meta\b[^>]*>)/i",
            "/(<div\b[^>]*>)/i",
            "/(<section\b[^>]*>)/i",
            "/(<nav\b[^>]*>)/i",
            "/(<table\b[^>]*>)/i",
            "/(<thead\b[^>]*>)/i",
            "/(<tbody\b[^>]*>)/i",
            "/(<tr\b[^>]*>)/i",
            "/(<th\b[^>]*>)/i",
            "/(<td\b[^>]*>)/i",
            "/(<ul\b[^>]*>)/i",
            "/(<li\b[^>]*>)/i",
            "/(<figure\b[^>]*>)/i",
            "/(<select\b[^>]*>)/i" 
    );

    /**
     * preg_replace()-Pattern which define tags prepended with a newline.
     * <tag> becomes \n<tag>
     * 
     * @var array
     */
    protected $patternWithLineBefore = array (
            "/(<p\b[^>]*>)/i",
            "/(<h[0-9]\b[^>]*>)/i",
            "/(<option\b[^>]*>)/i" 
    );

    /**
     * preg_replace()-Pattern which define closing tags to wrap with newlines.
     * </tag> becomes \n</tag>\n
     * 
     * @var array
     */
    protected $closeTagsPattern = array (
            "/(<\/html>)/i",
            "/(<\/head>)/i",
            "/(<\/body>)/i",
            "/(<\/link>)/i",
            "/(<\/meta>)/i",
            "/(<\/div>)/i",
            "/(<\/section>)/i",
            "/(<\/nav>)/i",
            "/(<\/table>)/i",
            "/(<\/thead>)/i",
            "/(<\/tbody>)/i",
            "/(<\/tr>)/i",
            "/(<\/th>)/i",
            "/(<\/td>)/i",
            "/(<\/ul>)/i",
            "/(<\/li>)/i",
            "/(<\/figure>)/i",
            "/(<\/select>)/i" 
    );

    /**
     * preg_match()-Pattern with tag-names to increase indention.
     * 
     * @var string
     */
    protected $indentOpenTagsPattern = "/<(html|head|body|div|section|nav|table|thead|tbody|tr|th|td|ul|figure|li)\b[ ]*[^>]*[>]/i";

    /**
     * preg_match()-Pattern with tag-names to decrease indention.
     * 
     * @var string
     */
    protected $indentCloseTagsPattern = "/<\/(html|head|body|div|section|nav|table|thead|tbody|tr|th|td|ul|figure|li)>/i";

    /**
     * Constructor
     */
    public function __construct() {
    }

    /**
     * Adds a Tag which should be returned as the way in source.
     * 
     * @param string $tagToIgnore
     * @throws \RuntimeException
     * @return void
     */
    public function addTagToIgnore($tagToIgnore) {
        if (! preg_match( '/^[a-zA-Z]+$/', $tagToIgnore )) {
            throw new \RuntimeException( "Only characters from a to z are allowed as tag.", 1393489077 );
        }

        if (! in_array( $tagToIgnore, $this->tagsToIgnore )) {
            $this->tagsToIgnore[] = $tagToIgnore;
        }
    }

    /**
     * Setter for trimTagsToIgnore.
     *
     * @param boolean $bool
     * @return void
     */
    public function setTrimTagsToIgnore($bool) {
        $this->trimTagsToIgnore = $bool;
    }

    /**
     * Setter for removeComments.
     *  
     * @param boolean $bool
     * @return void
     */
    public function setRemoveComments($bool) {
        $this->removeComments = $bool;
    }

    /**
     * Callback function used by preg_replace_callback() to store the blocks which should be ignored and set a marker to replace them later again with the blocks.
     * 
     * @param array $e
     * @return string
     */
    private function tagsToIgnoreCallback($e) {
        // build key for reference
        $key = '<' . $this->currentTagToIgnore . '>' . sha1( $this->currentTagToIgnore . $e[0] ) . '</' . $this->currentTagToIgnore . '>';

        // trim each line
        if ($this->trimTagsToIgnore) {
            $lines = explode( "\n", $e[0] );
            array_walk( $lines, function (&$n) {
                $n = trim( $n );
            } );
            $e[0] = implode( PHP_EOL, $lines );
        }

        // add block to storage
        $this->tagsToIgnoreBlocks[$key] = $e[0];

        return $key;
    }

    /**
     * The main method.
     * 
     * @param string $buffer The HTML-Code to process
     * @return string The nice looking sourcecode
     */
    public function beautify($buffer) {
        // remove blocks, which should not be processed and add them later again using keys for reference 
        foreach ( $this->tagsToIgnore as $tag ) {
            $this->currentTagToIgnore = $tag;
            $buffer = preg_replace_callback( '/<' . $this->currentTagToIgnore . '\b[^>]*>([\s\S]*?)<\/' . $this->currentTagToIgnore . '>/mi', array (
                    $this,
                    'tagsToIgnoreCallback' 
            ), $buffer );
        }

        // temporarily remove comments to keep original linebreaks
        $this->currentTagToIgnore = 'htmlcomment';
        $buffer = preg_replace_callback( "/<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->/ms", array (
                $this,
                'tagsToIgnoreCallback' 
        ), $buffer );

        // cleanup source
        // ... all in one line
        // ... remove double spaces
        // ... remove tabulators
        $buffer = preg_replace( array (
                "/\s\s+|\n/",
                "/ +/",
                "/\t+/" 
        ), array (
                "",
                " ",
                "" 
        ), $buffer );

        // remove comments, if requested
        if ($this->removeComments) {
            $buffer = preg_replace( "/<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->/ms", "", $buffer );
        }

        // add newlines for several tags
        $buffer = preg_replace( $this->patternWithLineBefore, "\n$1", $buffer ); // tags with line before tag
        $buffer = preg_replace( $this->openTagsPattern, "\n$1\n", $buffer ); // opening tags
        $buffer = preg_replace( $this->closeTagsPattern, "\n$1\n", $buffer ); // closing tags


        // get the html each line and do indention
        $lines = explode( "\n", $buffer );
        $indentionLevel = 0;
        $cleanContent = array (); // storage for indented lines
        foreach ( $lines as $line ) {
            // continue loop on empty lines
            if (! $line) {
                continue;
            }

            // test for closing tags (never drop below zero on unbalanced markup)
            if (preg_match( $this->indentCloseTagsPattern, $line )) {
                $indentionLevel = max( 0, $indentionLevel - 1 );
            }

            // push content
            $cleanContent[] = str_repeat( $this->spaceCharacter, $indentionLevel ) . $line;

            // test for opening tags
            if (preg_match( $this->indentOpenTagsPattern, $line )) {
                $indentionLevel ++;
            }
        }

        // write indented lines back to buffer
        $buffer = implode( PHP_EOL, $cleanContent );

        // add blocks, which should not be processed
        $buffer = str_replace( array_keys( $this->tagsToIgnoreBlocks ), $this->tagsToIgnoreBlocks, $buffer );

        return $buffer;
    }
}

$BeautifyMyHtml = new \LBR\LbrService\BeautifyMyHtml();
$BeautifyMyHtml->setTrimTagsToIgnore( true );
//$BeautifyMyHtml->setRemoveComments(true);
echo $BeautifyMyHtml->beautify( file_get_contents( 'http://example.org' ) );
?>
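
The protect-and-restore step at the heart of this class can also be written without the $currentTagToIgnore property, by letting a closure capture the tag name. The following is a minimal standalone sketch of that idea, not a drop-in replacement: protectBlocks() and restoreBlocks() are hypothetical names, and restoring in reverse order is an added assumption so that a placeholder stored inside a later block (for example a script inside a textarea) is put back as well.

```php
<?php
// Hypothetical standalone version of the protect/restore step.
function protectBlocks(string $html, array $tags, array &$store): string
{
    foreach ($tags as $tag) {
        $name    = preg_quote($tag, '~');
        $pattern = '~<' . $name . '\b[^>]*>.*?</' . $name . '>~is';

        // The closure captures $tag and $store directly, so no class property is needed.
        $html = preg_replace_callback($pattern, function (array $m) use ($tag, &$store) {
            $key = '<' . $tag . '>' . sha1($tag . $m[0]) . '</' . $tag . '>';
            $store[$key] = $m[0]; // remember the untouched block

            return $key; // a short placeholder takes its place
        }, $html);
    }

    return $html;
}

function restoreBlocks(string $html, array $store): string
{
    // Restore the blocks stored last first, so nested placeholders resolve too.
    $store = array_reverse($store, true);

    return str_replace(array_keys($store), array_values($store), $html);
}

$store     = [];
$original  = "<div><pre>  a\n  b</pre><script>if (a<b) {}</script></div>";
$protected = protectBlocks($original, ['script', 'pre'], $store);
$restored  = restoreBlocks($protected, $store);
```

Between the two calls, $protected can be collapsed and re-indented freely, because the whitespace-sensitive blocks are no longer in it.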

Source: https://stackoverflow.com/questions/17172824/problems-indenting-html5-with-php

Tags: php, html, domdocument, indentation, tidy