(PHP) regex to remove comments but ignore occurrences within strings

Question

I am writing a comment stripper and trying to accommodate all needs. The code below removes pretty much all comments, but it goes too far. I spent a lot of time trying, testing, and researching the regex patterns, but I don't claim that each one is the best it could be.

My problem is that I also have situations where 'PHP comments' (that aren't really comments) appear in standard code, or even in PHP strings, and I don't want those removed.

Example:

<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>

What ends up happening is that it strips them out religiously, which is fine, but it leaves certain problems:

<?php  $Var = "Blah blah  ?>

Also, a // comment that shares a line with the closing ?> will cause problems, as the comment removal takes the rest of the line with it, including the ending ?>

See the problem? So this is what I need...

Comment characters within '' or "" need to be ignored
For PHP comments on the same line that use double slashes, either only the comment itself should be removed, or the entire PHP code block should be removed.

Here are the patterns I use at the moment. Feel free to tell me if there are improvements I can make to my existing patterns. :)

$CompressedData = $OriginalData;
$CompressedData = preg_replace('!/\*.*?\*/!s', '', $CompressedData);  // removes /* comments */
$CompressedData = preg_replace('!//.*?\n!', '', $CompressedData); // removes //comments
$CompressedData = preg_replace('!#.*?\n!', '', $CompressedData); // removes # comments
$CompressedData = preg_replace('/<!--(.*?)-->/', '', $CompressedData); // removes HTML comments

Any help that you can give me would be greatly appreciated! :)

Answer 1:

If you want to parse PHP, you can use token_get_all to get the tokens of a given piece of PHP code. Then you just need to iterate over the tokens, remove the comment tokens, and put the rest back together.
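The token approach described above can be sketched roughly as follows. The function name stripPhpComments is made up for this illustration; token_get_all, T_COMMENT and T_DOC_COMMENT are the real tokenizer API. Depending on the PHP version, a one-line comment token may or may not include its trailing newline, so the sketch puts the newline back when it finds one. Treat it as a starting point, not a finished tool.

```php
<?php
// Hypothetical helper (not from the original answer): rebuilds PHP source
// from its tokens while dropping the comment tokens.
function stripPhpComments(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Single-character tokens such as ";" or "=" are plain strings.
            $out .= $token;
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            // Older PHP versions keep the line break inside the comment token;
            // preserve it so the following code stays on its own line.
            if (substr($text, -1) === "\n") {
                $out .= "\n";
            }
            continue;
        }
        $out .= $text;
    }
    return $out;
}

echo stripPhpComments('<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>');
```

Because the tokenizer already knows what is a string and what is a comment, the // inside the quoted string is left alone and the closing tag survives.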

But you would need a separate procedure for the HTML comments, preferably a real parser too (like DOMDocument provides with DOMDocument::loadHTML).
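For the HTML side, a minimal sketch using DOMDocument might look like this. It assumes the input is plain HTML with no embedded PHP blocks, and the function name stripHtmlComments is invented for the example. Note that saveHTML() may normalise the markup (for instance by adding a doctype), so the output is not guaranteed to be byte-identical to the input apart from the comments.

```php
<?php
// Hypothetical helper: removes comment nodes from an HTML document.
// Assumes pure HTML input; embedded PHP tags would confuse the parser.
function stripHtmlComments(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list first so removals do not disturb the iteration.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    return $doc->saveHTML();
}

echo stripHtmlComments('<html><body><!-- gone --><p>kept</p></body></html>');
```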

Answer 2:

You should first think carefully about whether you actually want to do this. Though what you're doing may seem simple, in the worst case it becomes an extremely complex problem (to solve with just a few regular expressions). Let me illustrate just a few of the problems you would face when trying to strip both HTML and PHP comments from a file.

You can't simply strip HTML comments outright, because you may have PHP inside the HTML comments, like:

<!-- HTML comment <?php echo 'Actual PHP'; ?> -->

You can't simply deal with the content inside the <?php and ?> tags separately either, since the ending tag ?> can be inside strings or even comments, like:

<?php /* ?> This is still a PHP comment <?php */ ?>

Let's not forget that ?> actually ends the PHP block if it's preceded by a one-line comment. For example:

<?php // ?> This is not a PHP comment <?php ?>
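One way to check this behaviour for yourself is to ask the tokenizer how it splits that line. The helper name tokenNames is hypothetical; token_get_all and token_name are real functions. On the example above, the text after the first closing tag should come back as T_INLINE_HTML and not as part of the T_COMMENT token.

```php
<?php
// Hypothetical helper: lists the token names the PHP tokenizer reports
// for a source string.
function tokenNames(string $source): array
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as ";" come back as plain strings.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

print_r(tokenNames('<?php // ?> This is not a PHP comment <?php ?>'));
```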

Of course, as you already illustrated, there will be plenty of problems with comment indicators inside strings. Parsing out strings to ignore them isn't that simple either, since you have to remember that quotes can be escaped. Like:

<?php
$foo = ' /* // None of these start a comment ';
$bar = ' \' // Remember escaped quotes ';
$orz = " ' \" \' /* // Still not a comment ";
?>

Parsing order will also cause you headaches. You can't simply choose to parse either the one-line comments first or the multi-line comments first. They both have to be parsed at the same time (i.e. in the order they appear in the document). Otherwise you may end up with broken code. Let me illustrate:

<?php
/* // Multiline comment */
// /* Single Line comment
$omg = 'This is not in a comment */';
?>

If you parse multi line comments first, the second /* will eat up part of the string destroying the code. If you parse the single line comments first, you will end up eating the first */, which will also destroy the code.

As you can see, there are many complex scenarios you'd have to account for if you intend to solve your problem with regular expressions. The only correct solution is to use some sort of PHP parser, like token_get_all(), to tokenize the entire source code, strip the comment tokens, and rebuild the file. Which, I'm afraid, isn't entirely simple either. It also won't help with HTML comments, since the HTML is left untouched. You can't use XML parsers to get the HTML comments either, because the HTML is rarely well formed once PHP is mixed in.

In short, the idea of what you're doing is simple, but the actual implementation is much harder than it seems. Thus, I would recommend avoiding this unless you have a very good reason to do it.

Answer 3:

One way to do this with regex is to use one compound expression and preg_replace_callback.

I was going to post a poor example but the best place to look is at the source code to the PHP port of Dean Edwards' JS packer script - you should see the general idea.

http://joliclic.free.fr/php/javascript-packer/en/
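To give a rough idea of the compound-expression technique (this is my own sketch, not code taken from the packer): a single pattern matches either a quoted string or a comment, and the callback puts strings back unchanged while replacing comments with nothing. The name stripCommentsKeepStrings is made up. The sketch assumes it is fed PHP code only (no surrounding HTML) and it does not handle heredoc/nowdoc or PHP 8 attributes, so it is an illustration of the idea and not a replacement for a tokenizer.

```php
<?php
// Hypothetical helper: one pass, one pattern. Strings and comments are matched
// in the order they appear, so a comment marker inside a string is consumed
// as part of the string and never seen as a comment.
function stripCommentsKeepStrings(string $code): string
{
    // Double- or single-quoted string, allowing backslash-escaped characters.
    $string = '"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\'';
    // Block comment, or a one-line comment that stops before a closing tag,
    // a newline, or the end of the input.
    $comment = '/\*.*?\*/|(?://|#)[^\n]*?(?=\?>|\n|$)';
    $pattern = '~(' . $string . ')|' . $comment . '~s';

    $result = preg_replace_callback(
        $pattern,
        // Group 1 is only set when a string matched: keep it. Otherwise a
        // comment matched: drop it.
        function (array $m): string {
            return $m[1] ?? '';
        },
        $code
    );

    // preg_replace_callback returns null on a regex engine error.
    return $result ?? $code;
}

echo stripCommentsKeepStrings('$Var = "Blah blah //this must not comment"; // this must comment. ?>');
```

Matching both kinds of comment and both kinds of string in one alternation is what avoids the ordering problem: whichever construct starts first wins, which running several separate preg_replace calls cannot guarantee.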

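Answer 4:

Try this:

// Meant to live inside a class. Like any plain regex approach, it does not
// protect comment markers that appear inside strings.
private function removeComments( $content ){
    $content = preg_replace( '!/\*.*?\*/!s' , '', $content );     // /* block comments */
    $content = preg_replace( "/\n\s*\n/" , "\n", $content );      // collapse blank lines
    $content = preg_replace( '#^\s*//.+$#m' , '', $content );     // whole-line // comments
    $content = preg_replace( '!\s//.*?\n!' , "\n", $content );    // trailing // comments preceded by whitespace
    $content = preg_replace( '/<!--.*?-->/s' , "\n", $content );  // HTML comments, non-greedy, may span lines
    return $content;
}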

Source: https://stackoverflow.com/questions/2475876/php-regexto-remove-comments-but-ignore-occurances-within-strings