Parse HTML user input

Question

Let's say I have a string from the user ($input). I can strip tags so that only allowed tags remain. I can convert it to text with htmlspecialchars(). I can even replace all the tags I don't want with text.

function html($input) {
    $input = '<bl>'.htmlspecialchars($input).'</bl>'; // bl is a custom tag that I style (stands for block)
    global $open;
    $open = []; //Array of open tags
    for ($i = 0; $i < strlen($input); $i++) {
        if (!in_array('code', $open) && !in_array('codebl', $open)) { //If we are parsing
            $input = preg_replace_callback('#^(.{'.$i.'})&lt;(em|i|del|sub|sup|sml|code|kbd|pre|codebl|quote|bl|sbl)&gt;\s*#s', function($match) {
                global $open; //...then add new tags to the array
                array_push($open,$match[2]);
                return $match[1].'<'.$match[2].'>'; //And replace them
            }, $input);
            $input = preg_replace_callback('#^(.{'.$i.'})(https?):\/\/([^\s"\(\)<>]+)#', function($m) {
                return $m[1].'<a href="'.$m[2].'://'.$m[3].'" target="_blank">'.$m[3].'</a>';
            }, $input, -1, $num); //Simple linking
            $i += $num * 9;
            $input = preg_replace_callback('#^(.{'.$i.'})\n\n#', function($m) {
                return $m[1].'</bl><bl>';
            }, $input); // More of this bl element
        }
        if (end($open)) { //Close tags
            $input = preg_replace_callback('#^(.{'.$i.'})&lt;/('.end($open).')&gt;#s', function($match) {
                global $open;
                array_pop($open);
                return trim($match[1]).'</'.$match[2].'>';
            }, $input);
        }
    }
    while ($open) { //Handle unclosed tags
        $input .= '</'.end($open).'>';
        array_pop($open);
    }
    return $input;
}

The problem is that after that, there is no way to write a literal <i>, because it will automatically be parsed into either <i></i> (if you write <i></i>), or &amp;lt;i&amp;gt;&amp;lt;/i&amp;gt; (if you write &lt;i&gt;&lt;/i&gt;). I want the user to be able to enter &lt; (or any other HTML entity) and get < back. If I just sent the input straight to the browser unparsed, it would (obviously) be vulnerable to whatever sorcery hackers try to put on my site. So, how can I let the user use any of a predefined set of HTML tags, while still letting them use HTML entities?
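
To make the double-encoding problem concrete, here is a minimal sketch using only the built-in htmlspecialchars(). The sample string is made up for illustration and is not from the original post. The fourth parameter, $double_encode, is one possible building block for keeping entities intact; nothing here says the answer below relies on it.

```php
<?php
// Illustrative input: entities the user typed on purpose, plus a raw tag.
$typed = '&lt;i&gt; &amp; <b>';

// Default behaviour: every "&" is escaped again, so existing entities
// end up double-encoded.
$double = htmlspecialchars($typed, ENT_QUOTES, 'UTF-8');

// With $double_encode = false, existing entities are left untouched,
// while the raw "<" and ">" are still escaped.
$single = htmlspecialchars($typed, ENT_QUOTES, 'UTF-8', false);

echo $double, "\n"; // &amp;lt;i&amp;gt; &amp;amp; &lt;b&gt;
echo $single, "\n"; // &lt;i&gt; &amp; &lt;b&gt;
```

This only covers the entity side of the question. Re-enabling a whitelist of tags afterwards still needs a separate step.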

Answer 1:

This is what I eventually used:

function html($input) {
    $input = preg_replace(["#&([^A-Za-z])#","#<([^A-Za-z/])#","#&$#","#<$#"], ['&amp;$1','&lt;$1','&amp;','&lt;'], $input); //Fix single "<"s and "&"s
    $open = []; //Array of open tags
    $close = false; //Is the current tag a close tag?
    $tag = false; //Are we inside a tag name? (initialized to avoid an undefined-variable warning)
    for ($i = 0; $i < strlen($input); $i++) { //Start the loop (stop before the out-of-range offset)
        if ($tag) { //Are we in a tag?
            if (preg_match("/[^a-z]/", $input[$i])) { //The tag has ended
                if ($close) {
                    $close = false;
                    $sPos = strrpos(substr($input,0,$i), '<') + 2; //start position of tag
                    $tag = substr($input,$sPos,$i-$sPos); //tag name
                    if (end($open) == $tag) {
                        array_pop($open); //Good, it's a valid XML closing
                    } else {
                        $input = substr($input, 0, $sPos-2) . '&lt;/' . $tag . substr($input, $i); //BAD! Convert tag to text (open tag will be handled later)
                    }
                } else {
                    $sPos = strrpos(substr($input,0,$i), '<') + 1; //start position of tag
                    $tag = substr($input,$sPos,$i-$sPos); //tag name
                    if (in_array($tag, ['em','i','del','sub','sup','sml','code','kbd','pre','codebl','bl','sbl'])) { //Is it an acceptable tag?
                        array_push($open, $tag); //Add it to the array
                        $j = $i + 1;
                        while (preg_match("/\s/", $input[$j])) { //Get rid of whitespace
                            $j++;
                        }
                        $input = substr($input, 0, $sPos - 1) . '<' . $tag . '>' . substr($input, $j); //Seems legit
                    } else {
                        $input = substr($input, 0, $sPos - 1) . '&lt;' . $tag . substr($input, $i); //BAD! Convert tag to text
                    }
                }
                $tag = false;
            }
        } else if (!in_array('code', $open) && !in_array('codebl', $open) && !in_array('pre', $open)) { //Standard parsing of text
            if ($input[$i] == '<') { //Is it a tag?
                $tag = true;
                if ($input[$i+1] == '/') { //Is it a close tag?
                    $i++;
                    $close = true;
                }
            } else if (substr($input, $i, 4) == 'http') { //Link
                if (preg_match('#^.{'.$i.'}(https?):\/\/([^\s"\(\)<>]+)#', $input, $m)) {
                    $insert = '<a href="'.$m[1].'://'.$m[2].'" target="_blank">'.$m[2].'</a>';
                    $input = substr($input, 0, $i) . $insert . substr($input, $i + strlen($m[1].'://'.$m[2]));
                    $i += strlen($insert);
                }
            } else if ($input[$i] == "\n" && $input[$i+1] == "\n") { //Insert <bl> tag? (I use this to separate sections of text)
                $input = substr($input, 0, $i + 1) . '</bl><bl>' . substr($input, $i + 1);
            }
        } else { // We're in a code tag
            if (substr($input, $i+1, strlen(end($open)) + 3) == '</'.current($open).'>') {
                array_pop($open);
                $i += 2;
            } elseif ($input[$i] == '<') {
                $input = substr($input, 0, $i) . '&lt;' . substr($input, $i + 1);
                $i += 3; //Code tags have raw text
            } elseif (in_array('code', $open) && $input[$i] == "\n") { //No linebreaks are allowed in inline tags, convert to <codebl>
                $open[count($open) - 1] = 'codebl';
                $input = substr($input, 0, strrpos($input,'<code>')) . '<codebl>' . substr($input, strrpos($input,'<code>') + 6, strpos(substr($input, strrpos($input,'<code>')),'</code>') - 6) . '</codebl>' . substr($input, strpos(substr($input, strrpos($input,'<code>')),'</code>') + strrpos($input,'<code>') + 7);
                $i += 4;
            }
        }
    }
    while ($open) { //Handle open tags
        $input .= '</'.end($open).'>';
        array_pop($open);
    }
    return '<bl>'.$input.'</bl>';
}

I know it's a bit more risky, but you can first assume the input's good, then filter out the stuff explicitly found as bad.
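
As a small illustration of the first step in html() above, the sketch below isolates the preg_replace() call that escapes stray "<" and "&" characters. The helper name escapeStray() is hypothetical, and the patterns are adapted from that first line with the letter class spelled out as [A-Za-z].

```php
<?php
// Hypothetical helper: escapes "<" and "&" that cannot start a tag or a
// named entity, and leaves everything else for the later parsing loop.
function escapeStray(string $input): string {
    return preg_replace(
        ['#&([^A-Za-z])#', '#<([^A-Za-z/])#', '#&$#', '#<$#'],
        ['&amp;$1', '&lt;$1', '&amp;', '&lt;'],
        $input
    );
}

echo escapeStray('a < b & c <i>x</i> &lt;'), "\n"; // a &lt; b &amp; c <i>x</i> &lt;
echo escapeStray('x <'), "\n";                     // x &lt;
echo escapeStray('&#60;'), "\n";                   // &amp;#60;
```

Named entities such as &lt; pass through unchanged, which is what the question asked for. Numeric entities such as &#60; start with a non-letter after the "&", so these patterns escape them.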

Source: https://stackoverflow.com/questions/21792609/parse-html-user-input

Tags

php

html

validation

parsing

user-generated-content