How can I adapt my regex to allow for escaped quotes?

百般思念 submitted on 2019-12-04 17:58:14

The following tested script first checks that a given string is valid, consisting solely of single-quoted, double-quoted and unquoted chunks. The $re_valid regex performs this validation task. If the string is valid, it then parses the string one chunk at a time using preg_replace_callback() and the $re_parse regex. The callback function processes the unquoted chunks using preg_replace(), and returns all quoted chunks unaltered. The only tricky part of the logic is passing the $replace and $with argument values from the main function to the callback function. (Note that PHP procedural code makes this variable passing a bit awkward.) Here is the script:

<?php // test.php Rev:20121113_1500
function str_replace_outside_quotes($replace, $with, $string){
    $re_valid = '/
        # Validate string having embedded quoted substrings.
        ^                           # Anchor to start of string.
        (?:                         # Zero or more string chunks.
          "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"  # Either a double quoted chunk,
        | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'  # or a single quoted chunk,
        | [^\'"\\\\]+               # or an unquoted chunk (no escapes).
        )*                          # Zero or more string chunks.
        \z                          # Anchor to end of string.
        /sx';
    if (!preg_match($re_valid, $string)) // Exit if string is invalid.
        exit("Error! String not valid.");
    $re_parse = '/
        # Match one chunk of a valid string having embedded quoted substrings.
          (                         # Either $1: Quoted chunk.
            "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"  # Either a double quoted chunk,
          | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'  # or a single quoted chunk.
          )                         # End $1: Quoted chunk.
        | ([^\'"\\\\]+)             # or $2: an unquoted chunk (no escapes).
        /sx';
    _cb(null, $replace, $with); // Pass args to callback func.
    return preg_replace_callback($re_parse, '_cb', $string);
}
function _cb($matches, $replace = null, $with = null) {
    // Only set local static vars on first call.
    static $_replace, $_with;
    if (!isset($matches)) { 
        $_replace = $replace;
        $_with = $with;
        return; // First call is done.
    }
    // Return quoted string chunks (in group $1) unaltered.
    if ($matches[1]) return $matches[1];
    // Process only unquoted chunks (in group $2).
    return preg_replace('/'. preg_quote($_replace, '/') .'/',
        $_with, $matches[2]);
}
$data = file_get_contents('testdata.txt');
$output = str_replace_outside_quotes('?', '%s', $data);
file_put_contents('testdata_out.txt', $output);
?>
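
The awkward argument passing described above can be avoided with a closure, available since PHP 5.3. The following is only a sketch, not the original author's code: the function name str_replace_outside_quotes_closure is made up for this example, it skips the $re_valid validation step, and it swaps in str_replace() for the literal replacement.

```php
<?php
// Sketch only: closure-based variant of the parsing step (PHP 5.3+).
// The function name is hypothetical; validation with $re_valid is omitted.
function str_replace_outside_quotes_closure($replace, $with, $string) {
    // Group 1: a quoted chunk; group 2: an unquoted chunk without backslashes.
    $re_parse = '/("[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\')|([^\'"\\\\]+)/s';
    return preg_replace_callback(
        $re_parse,
        function ($m) use ($replace, $with) {
            // Quoted chunks come back unaltered.
            if ($m[1] !== '') return $m[1];
            // Unquoted chunks get a literal (non-regex) replacement.
            return str_replace($replace, $with, $m[2]);
        },
        $string
    );
}

echo str_replace_outside_quotes_closure('?', '%s', 'a=? AND b="\"?" AND c=\'?\''), "\n";
// prints: a=%s AND b="\"?" AND c='?'
```

Because the validation step is left out here, text after an unterminated quote is treated as unquoted, so the check from the script above is still worth running first.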

» Code has been updated to address ALL issues raised in the comments and is now working properly «


Given $s as the input, $p as the phrase to replace and $v as the replacement value, use preg_replace as follows:

$r = '/\G((?:(?:[^\x5C"\']|\x5C(?!["\'])|\x5C["\'])*?(?:\'(?:[^\x5C\']|\x5C(?!\')' .
     '|\x5C\')*\')*(?:"(?:[^\x5C"]|\x5C(?!")|\x5C")*")*)*?)' . preg_quote($p, '/') . '/';
// '${1}' keeps the group reference intact even when $v starts with a digit.
$s = preg_match($r, $s) ? preg_replace($r, '${1}' . $v, $s) : $s;



Note: In regex, \x5C represents a \ character.

This regex matches valid quoted strings. This means it is aware of escaped quotes.

^(?:("[^\"\\]*(?:\\.[^\"\\]*)*(?![^\\]\\)")|('[^\'\\]*(?:\\.[^\'\\]*)*(?![^\\]\\)'))$

Ready for PHP use:

$pattern = '/^((?:"([^"\\\\]*(?:\\\\.[^"\\\\]*)*(?![^\\\\]\\\\))")|(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*(?![^\\\\]\\\\))\'))$/';

Adapted for str_replace_outside_quotes():

$pattern = '/((?:"(?:[^"\\\\]*(?:\\\\.[^"\\\\]*)*(?![^\\\\]\\\\))")|(?:\'(?:[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*(?![^\\\\]\\\\))\'))/';
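
To see what the adapted pattern picks up, here is a small sketch that lists the quoted chunks it finds. The sample input is made up for illustration; only the pattern itself comes from the answer.

```php
<?php
// Sketch: list the quoted chunks that the adapted pattern recognises.
$pattern = '/((?:"(?:[^"\\\\]*(?:\\\\.[^"\\\\]*)*(?![^\\\\]\\\\))")|(?:\'(?:[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*(?![^\\\\]\\\\))\'))/';

// Hypothetical sample input with escaped quotes inside both quote styles.
$input = 'say "hi \\"you\\"" and \'it\\\'s\' done';

// $matches[1] holds every quoted chunk, including its surrounding quotes.
preg_match_all($pattern, $input, $matches);
print_r($matches[1]);
```

Everything the pattern does not capture is text outside quotes, which is the part a replacement function is allowed to touch.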

Edit: I changed my answer. It no longer uses a regex to find the quotes (only the search term $what is a regex now; I thought it would be better to use preg_replace instead of str_replace, but you can change that):

function replace_special($what, $with, $str) {
   $res = '';
   $currPos = 0;
   $doWork = true;

   while (true) {
     $doWork = false; //pessimistic approach

     $pos = get_quote_pos($str, $currPos, $quoteType);
     if ($pos !== false) {
       $posEnd = get_specific_quote_pos($str, $quoteType, $pos + 1);
       if ($posEnd !== false) {
           $doWork = $posEnd !== strlen($str) - 1; //do not break if not end of string reached

           $res .= preg_replace($what, $with, 
                                substr($str, $currPos, $pos - $currPos));
           $res .= substr($str, $pos, $posEnd - $pos + 1);                      

           $currPos = $posEnd + 1;
       }
     }

     if (!$doWork) {
        // Process whatever is left after the last quoted chunk.
        $res .= preg_replace($what, $with, substr($str, $currPos));
        break;
     }

   }   

   return $res;
}

function get_quote_pos($str, $currPos, &$type) {
   $pos1 = get_specific_quote_pos($str, '"', $currPos);
   $pos2 = get_specific_quote_pos($str, "'", $currPos);
   if ($pos1 !== false) {
      if ($pos2 !== false && $pos1 > $pos2) {
        $type = "'";
        return $pos2;
      }
      $type = '"';
      return $pos1;
   }
   else if ($pos2 !== false) {
      $type = "'";
      return $pos2;
   }

   return false;
}

function get_specific_quote_pos($str, $type, $currPos) {
   $pos = $currPos - 1; //because $fromPos = $pos + 1 and initial $fromPos must be currPos
   do {
     $fromPos = $pos + 1;
     $pos = strpos($str, $type, $fromPos);
   }
   //iterate again if quote is escaped!
   while ($pos !== false && $pos > $currPos && $str[$pos-1] == '\\');
   return $pos;
}

Example:

   $str = 'hello ? ="is it me your are looking for\\"?" AND mist="???" WHERE test=? AND dzo=?';
   echo replace_special('/\?/', '#', $str);

returns

hello # ="is it me your are looking for\"?" AND mist="???" WHERE test=# AND dzo=#
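
One limitation worth knowing: get_specific_quote_pos() above treats a quote as escaped whenever the character before it is a backslash, so a closing quote that follows an escaped backslash (\\") is misread. A possible refinement, sketched here with a made-up helper name is_escaped_at, is to count the consecutive backslashes before the quote; an odd count means the quote is escaped.

```php
<?php
// Hypothetical helper (not part of the answer): is the character at $pos escaped?
function is_escaped_at($str, $pos) {
    $backslashes = 0;
    // Walk left from the character and count the unbroken run of backslashes.
    for ($i = $pos - 1; $i >= 0 && $str[$i] === '\\'; --$i) {
        ++$backslashes;
    }
    // An odd number of backslashes escapes the character; an even number does not.
    return $backslashes % 2 === 1;
}

var_dump(is_escaped_at('a\\"', 2));   // one backslash before the quote: bool(true)
var_dump(is_escaped_at('a\\\\"', 3)); // two backslashes before the quote: bool(false)
```

Under that assumption, the loop condition in get_specific_quote_pos() could call such a helper instead of comparing $str[$pos-1] directly.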

----

Old answer (I leave it here because it solves part of the problem, although not the full question):

<?php
function str_replace_outside_quotes($replace, $with, $string){
    $result = '';
    // Split on double quotes that are not preceded by a backslash.
    $pattern = '/(?<!\\\\)"/';
    $pieces = preg_split($pattern, $string);
    $last = count($pieces) - 1;
    for ($i = 0; $i <= $last; ++$i) {
       if ($i % 2 == 1) {
          // Odd-numbered pieces sit between quotes: put the quotes back and leave the text alone.
          // An unterminated final piece gets no closing quote, so the input is reproduced faithfully.
          $result .= '"'.$pieces[$i].($i < $last ? '"' : '');
       } else {
          // Even-numbered pieces are outside quotes.
          $result .= str_replace($replace, $with, $pieces[$i]);
       }
    }
   return $result;
}
echo str_replace_outside_quotes('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?');
Treffynnon

As @ridgerunner mentions in the comments on the question there is another possible regex solution:

function str_replace_outside_quotes($replace, $with, $string){
    $result = '';
    // Both quote styles share ONE capture group, so preg_split() returns
    // strictly alternating pieces: outside, quoted, outside, ...
    $pattern = '/("[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' // a double quoted chunk
             . "|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')/s"; // or a single quoted chunk
    $pieces = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($pieces as $i => $piece) {
        $result .= ($i % 2 === 0)
                 ? str_replace($replace, $with, $piece) // outside quotes
                 : $piece; // inside quotes
    }
    return $result;
}

Note the single capture group around both alternatives. With one group per quote style, preg_split returns empty entries for the group that did not match; stripping them with array_filter also removes the empty piece in front of a leading quote (and any piece that is just "0"), which breaks the alternating nature of this function.


A no regex approach that I knocked up quickly. It works, but I am sure there are some optimisations that could be done.

function str_replace_outside_quotes($replace, $with, $string){
    $string = str_split($string);
    $accumulation = '';
    $current_unquoted_string = null;
    $inside_quote = false;
    $quotes = array("'", '"');
    foreach($string as $char) {
        if ($char == $inside_quote && "\\" != substr($accumulation, -1)) {
            $inside_quote = false;
        } else if(false === $inside_quote && in_array($char, $quotes)) {
            $inside_quote = $char;
        }

        if(false === $inside_quote) {
            $current_unquoted_string .= $char;
        } else {
            if(null !== $current_unquoted_string) {
                $accumulation .= str_replace($replace, $with, $current_unquoted_string);
                $current_unquoted_string = null;
            }
            $accumulation .= $char;
        }
    }
    if(null !== $current_unquoted_string) {
        $accumulation .= str_replace($replace, $with, $current_unquoted_string);
        $current_unquoted_string = null;
    }
    return $accumulation;
}

In my benchmarking it takes about double the time of the regex approach above. When the string length is increased, the regex option's resource use doesn't go up by much, whereas the no-regex approach's grows linearly with the length of the text fed to it.
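
A comparison like this could be reproduced with a small timing harness. The sketch below is an assumption about how one might measure it, not the benchmark actually used: the helper name time_calls, the iteration count and the stand-in workloads are all made up. Since every variant on this page is named str_replace_outside_quotes(), they would need distinct names before being timed side by side.

```php
<?php
// Hypothetical timing helper; name and default iteration count are assumptions.
function time_calls($callable, array $args, $iterations = 1000) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; ++$i) {
        call_user_func_array($callable, $args);
    }
    return microtime(true) - $start; // elapsed wall-clock seconds
}

// Stand-in workloads: swap in the renamed str_replace_outside_quotes() variants to compare them.
$text = str_repeat('a=? AND b="?" ', 100);
printf("str_replace:  %.6f s\n", time_calls('str_replace', array('?', '%s', $text)));
printf("preg_replace: %.6f s\n", time_calls('preg_replace', array('/\?/', '%s', $text)));
```

Repeating the run with longer inputs (for example a larger str_repeat count) would show how each approach scales with text length.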
