Parsing command arguments in PHP

萌比男神i 2020-12-11 01:19

Is there a native "PHP way" to parse command arguments from a string? For example, given the following string:

foo "bar \"baz


        
11 Answers
  • 2020-12-11 01:22

    Well, you could also build this parser with a recursive regex:

    $regex = "([a-zA-Z0-9.-]+|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";
    

    Now that's a bit long, so let's break it out:

    $identifier = '[a-zA-Z0-9.-]+';
    $doubleQuotedString = "\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"";
    $singleQuotedString = "'([^'\\\\]+(?2)|\\\\.(?2)|)'";
    $regex = "($identifier|$doubleQuotedString|$singleQuotedString)s";
    

    So how does this work? The identifier is straightforward: one or more letters, digits, dots, or hyphens.

    The two quoted sub-patterns are essentially the same, so let's look at the single-quoted string:

    '([^'\\\\]+(?2)|\\\\.(?2)|)'
    

    Really, that's a quote character followed by a recursive sub-pattern, followed by an end quote.

    The magic happens in the sub-pattern.

    [^'\\\\]+(?2)
    

    That part basically consumes any non-quote and non-escape character. We don't care about them, so eat them up. Then, if we encounter either a quote or a backslash, trigger an attempt to match the entire sub-pattern again.

    \\\\.(?2)
    

    If we can consume a backslash, then consume the next character (without caring what it is), and recurse again.

    Finally, we have an empty component (if the escaped character is last, or if there's no escape character).
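
    As a minimal sketch of how the pattern might be applied (the sample input and variable names are assumptions, not part of the original answer), preg_match_all() collects every argument into $matches[0]:

    ```php
    <?php
    // The recursive pattern from above; "(" and ")" act as the regex delimiters.
    $regex = "([a-zA-Z0-9.-]+|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";

    // Illustrative input: in a single-quoted PHP string, \" stays a backslash plus a quote.
    $input = 'foo "bar \"baz\"" hello';

    preg_match_all($regex, $input, $matches);

    // $matches[0] holds the full matches, with quotes and backslashes still included.
    print_r($matches[0]);
    ```

    Quoted arguments come back with their enclosing quotes and escapes intact, so a separate unescaping step is still needed.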

    Running this on the test input @HamZa provided returns the same result:

    array(8) {
      [0]=>
      string(3) "foo"
      [1]=>
      string(13) ""bar \"baz\"""
      [2]=>
      string(10) "'\'quux\''"
      [3]=>
      string(9) "'foo"bar'"
      [4]=>
      string(9) ""baz'boz""
      [5]=>
      string(5) "hello"
      [6]=>
      string(16) ""regex
    
    world\"""
      [7]=>
      string(18) ""escaped escape\\""
    }
    

    The main difference is efficiency. This pattern should backtrack less (since it is recursive, there should be next to no backtracking for a well-formed string), whereas the other regex is non-recursive and backtracks on every single character (that's what the ? after the * forces: non-greedy pattern consumption).

    For short inputs this doesn't matter. On the test case provided, the two run within a few percent of each other (the margin of error is greater than the difference). But with a single long string with no escape sequences:

    "with a really long escape sequence match that will force a large backtrack loop"
    

    The difference is significant (100 runs):

    • Recursive: float(0.00030398368835449)
    • Backtracking: float(0.00055909156799316)

    Of course, we can partially lose this advantage with a lot of escape sequences:

    "This is \" A long string \" With a\lot \of \"escape \sequences"
    
    • Recursive: float(0.00040411949157715)
    • Backtracking: float(0.00045490264892578)

    But note that the length still dominates. That's because the backtracker scales as O(n^2), whereas the recursive solution scales as O(n). However, since the recursive pattern always needs to recurse at least once, it's slower than the backtracking solution on short strings:

    "1"
    
    • Recursive: float(0.0002598762512207)
    • Backtracking: float(0.00017595291137695)

    The tradeoff appears to happen around 15 characters. Both are fast enough that it won't make a difference unless you're parsing several KB or MB of data, but it's worth discussing.

    On sane inputs, it won't make a significant difference. But if you're matching more than a few hundred bytes, it may start to add up.
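
    Timings like these could be collected with a small harness along the following lines. This is only a sketch: the helper name timePattern is hypothetical, it reports the total elapsed time for all runs, and it is not necessarily how the figures above were measured.

    ```php
    <?php
    // Hypothetical helper: total wall-clock seconds spent on $runs calls to preg_match_all().
    function timePattern(string $regex, string $input, int $runs = 100): float
    {
        $start = microtime(true);
        for ($i = 0; $i < $runs; $i++) {
            preg_match_all($regex, $input, $matches);
        }
        return microtime(true) - $start;
    }

    // The recursive pattern and the long test string from above.
    $recursive = "([a-zA-Z0-9.-]+|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";
    $input = '"with a really long escape sequence match that will force a large backtrack loop"';

    var_dump(timePattern($recursive, $input));
    ```

    Passing a second pattern to the same helper gives a like-for-like comparison; absolute numbers will vary with the machine and PHP version.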

    Edit

    If you need to handle arbitrary "bare words" (unquoted strings), then you can change the original regex to:

    $regex = "([^\s'\"]\S*|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";
    

    However, it really depends on your grammar and what you consider a command or not. I'd suggest formalizing the grammar you expect...
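
    For illustration, here is the bare-word variant applied to an input containing characters the identifier pattern would reject. The input string is an assumption chosen for the example:

    ```php
    <?php
    // Bare-word variant from above: any run that does not start with whitespace or a quote.
    $regex = "([^\s'\"]\S*|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";

    // Illustrative input with "=", "-" and "/" in the unquoted words.
    $input = '--flag=value "quoted arg" path/to/file';

    preg_match_all($regex, $input, $matches);

    print_r($matches[0]);
    ```

    Note that \S* also swallows quotes in the middle of a word, so an input like foo"bar is treated as a single bare word.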

  • 2020-12-11 01:26

    I've worked out the following expression to match the various enclosures and escapes:

    $pattern = <<<REGEX
    /
    (?:
      " ((?:(?<=\\\\)"|[^"])*) "
    |
      ' ((?:(?<=\\\\)'|[^'])*) '
    |
      (\S+)
    )
    /x
    REGEX;
    
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);
    

    It matches:

    1. A string enclosed in double quotes, inside of which a double quote may be escaped
    2. Same as #1 but for single quotes
    3. Unquoted string

    Afterwards, you need to (carefully) unescape the escaped characters:

    $args = array();
    foreach ($matches as $match) {
        if (isset($match[3])) {
            $args[] = $match[3];
        } elseif (isset($match[2])) {
            $args[] = str_replace(['\\\'', '\\\\'], ["'", '\\'], $match[2]);
        } else {
            $args[] = str_replace(['\\"', '\\\\'], ['"', '\\'], $match[1]);
        }
    }
    print_r($args);
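
    To try both steps end to end, they can be wrapped in a single function. This is a sketch: the function name parseArgs and the sample input are assumptions, and the pattern is the same one written on a single line without the x flag.

    ```php
    <?php
    // Hypothetical wrapper combining the match and unescape steps shown above.
    function parseArgs(string $input): array
    {
        // Same three alternatives: double-quoted, single-quoted, unquoted.
        $pattern = '/(?:"((?:(?<=\\\\)"|[^"])*)"|\'((?:(?<=\\\\)\'|[^\'])*)\'|(\S+))/';
        preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

        $args = [];
        foreach ($matches as $match) {
            if (isset($match[3])) {            // unquoted word
                $args[] = $match[3];
            } elseif (isset($match[2])) {      // single-quoted string
                $args[] = str_replace(['\\\'', '\\\\'], ["'", '\\'], $match[2]);
            } else {                           // double-quoted string
                $args[] = str_replace(['\\"', '\\\\'], ['"', '\\'], $match[1]);
            }
        }
        return $args;
    }

    print_r(parseArgs('foo "bar \"baz\"" \'single quoted\''));
    ```

    The isset() checks rely on preg_match_all() omitting trailing groups that did not participate in a match, which is why the unquoted case is tested first.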
    

    Update

    For the fun of it, I've written a more formal parser, outlined below. It won't give you better performance; it's about three times slower than the regular expression, mostly due to its object-oriented nature. I suppose the advantage is more academic than practical:

    class ArgvParser2 extends StringIterator
    {
        const TOKEN_DOUBLE_QUOTE = '"';
        const TOKEN_SINGLE_QUOTE = "'";
        const TOKEN_SPACE = ' ';
        const TOKEN_ESCAPE = '\\';
    
        public function parse()
        {
            $this->rewind();
    
            $args = [];
    
            while ($this->valid()) {
                switch ($this->current()) {
                    case self::TOKEN_DOUBLE_QUOTE:
                    case self::TOKEN_SINGLE_QUOTE:
                        $args[] = $this->QUOTED($this->current());
                        break;
    
                    case self::TOKEN_SPACE:
                        $this->next();
                        break;
    
                    default:
                        $args[] = $this->UNQUOTED();
                }
            }
    
            return $args;
        }
    
        private function QUOTED($enclosure)
        {
            $this->next();
            $result = '';
    
            while ($this->valid()) {
                if ($this->current() == self::TOKEN_ESCAPE) {
                    $this->next();
                    if ($this->valid() && $this->current() == $enclosure) {
                        $result .= $enclosure;
                    } elseif ($this->valid()) {
                        $result .= self::TOKEN_ESCAPE;
                        if ($this->current() != self::TOKEN_ESCAPE) {
                            $result .= $this->current();
                        }
                    }
                } elseif ($this->current() == $enclosure) {
                    $this->next();
                    break;
                } else {
                    $result .= $this->current();
                }
                $this->next();
            }
    
            return $result;
        }
    
        private function UNQUOTED()
        {
            $result = '';
    
            while ($this->valid()) {
                if ($this->current() == self::TOKEN_SPACE) {
                    $this->next();
                    break;
                } else {
                    $result .= $this->current();
                }
                $this->next();
            }
    
            return $result;
        }
    
        public static function parseString($input)
        {
            $parser = new self($input);
    
            return $parser->parse();
        }
    }
    

    It's based on StringIterator to walk through the string one character at a time:

    class StringIterator implements Iterator
    {
        private $string;
    
        // Start at offset 0 so valid() and current() work before rewind() is called.
        private $current = 0;
    
        public function __construct($string)
        {
            $this->string = $string;
        }
    
        public function current()
        {
            return $this->string[$this->current];
        }
    
        public function next()
        {
            ++$this->current;
        }
    
        public function key()
        {
            return $this->current;
        }
    
        public function valid()
        {
            return $this->current < strlen($this->string);
        }
    
        public function rewind()
        {
            $this->current = 0;
        }
    }
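
    The same character-by-character walk can also be sketched without classes. The function below is a simplified, hypothetical variant rather than a port of ArgvParser2: it only unescapes the enclosing quote character, and it lets adjacent quoted and unquoted parts join into one argument.

    ```php
    <?php
    // Hypothetical helper: a procedural, quote-aware tokenizer.
    function tokenizeArgs(string $input): array
    {
        $args = [];
        $current = '';
        $inToken = false;   // true once the current argument has started
        $quote = null;      // the active quote character, or null outside quotes
        $length = strlen($input);

        for ($i = 0; $i < $length; $i++) {
            $char = $input[$i];

            if ($quote !== null) {
                if ($char === '\\' && $i + 1 < $length && $input[$i + 1] === $quote) {
                    $current .= $quote; // escaped enclosure: keep the quote, drop the backslash
                    $i++;
                } elseif ($char === $quote) {
                    $quote = null;      // closing quote
                } else {
                    $current .= $char;
                }
            } elseif ($char === '"' || $char === "'") {
                $quote = $char;         // opening quote starts (or continues) an argument
                $inToken = true;
            } elseif ($char === ' ') {
                if ($inToken) {         // a space outside quotes ends the argument
                    $args[] = $current;
                    $current = '';
                    $inToken = false;
                }
            } else {
                $current .= $char;
                $inToken = true;
            }
        }

        if ($inToken) {
            $args[] = $current;         // flush the last argument
        }
        return $args;
    }

    print_r(tokenizeArgs('foo "bar \"baz\"" \'a b\''));
    ```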
    
  • 2020-12-11 01:26

    To my knowledge there is no native function for parsing commands. However, I have written a function that does the trick in plain PHP. By applying str_replace several times, you can convert the string into something that can be split into an array. I don't know how fast you consider fast, but over 400 runs, the slowest run took under 34 microseconds.

    function get_array_from_commands($string) {
        /*
        **  Turns a command string into an array
        **  of arguments through several passes of
        **  str_replace, until we have a single
        **  string to split using explode().
        **  Returns an array.
        */
    
        // Replace escaped single quotes with their
        // HTML entity so they survive the later passes
        $string = str_replace("\'","&#x27;",$string);
        // Do the same with double quotes
        $string = str_replace("\\\"","&quot;",$string);
        // Now turn all remaining single quotes into double quotes
        $string = str_replace("'","\"",$string);
        // Turn " " into " so we don't replace it too many times
        $string = str_replace("\" \"","\"",$string);
        // Turn the remaining double quotes into @@@ or some other value
        $string = str_replace("\"","@@@",$string);
        // Explode by @@@ or value listed above
        $string = explode("@@@",$string);
        return $string;
    }
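
    The pieces returned this way still contain the HTML entities, stray spaces, and possibly empty strings. A minimal clean-up sketch follows; the helper name clean_command_parts is hypothetical, and the sample array is what the function above yields for foo "bar \"baz\"":

    ```php
    <?php
    // Hypothetical post-processing for the pieces returned by get_array_from_commands().
    function clean_command_parts(array $parts): array
    {
        $cleaned = [];
        foreach ($parts as $part) {
            // Restore the quotes that were protected as &#x27; and &quot;, then trim spaces.
            $part = trim(html_entity_decode($part, ENT_QUOTES));
            if ($part !== '') {
                $cleaned[] = $part;
            }
        }
        return $cleaned;
    }

    print_r(clean_command_parts(['foo ', 'bar &quot;baz&quot;', '']));
    ```

    One caveat: html_entity_decode() also decodes any other entities that were literally present in the input, such as &amp;.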
    
  • 2020-12-11 01:30

    You can simply use str_getcsv and do a little cosmetic cleanup with stripslashes and trim.

    Example :

    $str =<<<DATA
    "bar \"baz\"" '\'quux\''
    "foo"
    'foo'
    "foo'foo"
    'foo"foo'
    "foo\"foo"
    'foo\'foo'
    "foo\foo"
    "foo\\foo"
    "foo foo"
    'foo foo' "foo\\foo" \'quux\' \"baz\" "foo'foo"
    DATA;
    
    
    $str = explode("\n", $str);
    
    foreach($str as $line) {
        $line = array_map("stripslashes",str_getcsv($line," "));
        print_r($line);
    }
    

    Output

    Array
    (
        [0] => bar "baz"
        [1] => ''quux''
    )
    Array
    (
        [0] => foo
    )
    Array
    (
        [0] => 'foo'
    )
    Array
    (
        [0] => foo'foo
    )
    Array
    (
        [0] => 'foo"foo'
    )
    Array
    (
        [0] => foo"foo
    )
    Array
    (
        [0] => 'foo'foo'
    )
    Array
    (
        [0] => foooo
    )
    Array
    (
        [0] => foofoo
    )
    Array
    (
        [0] => foo foo
    )
    Array
    (
        [0] => 'foo
        [1] => foo'
        [2] => foofoo
        [3] => 'quux'
        [4] => "baz"
        [5] => foo'foo
    )
    

    Caution

    There is no universal format for arguments, so it is best to specify one particular format; the easiest I have seen is CSV.

    Example

     app.php arg1 "arg 2" "'arg 3'" > 4 
    

    Using CSV you can simply get this output:

    Array
    (
        [0] => app.php
        [1] => arg1
        [2] => arg 2
        [3] => 'arg 3'
        [4] => >
        [5] => 4
    )
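
    A call along these lines should produce the array above. It is a sketch: the variable names are assumptions, and the escape character is passed explicitly rather than relying on the default.

    ```php
    <?php
    // The example command line, without surrounding spaces.
    $line = 'app.php arg1 "arg 2" "\'arg 3\'" > 4';

    // Space as delimiter, double quote as enclosure, backslash as escape character.
    $args = str_getcsv($line, ' ', '"', '\\');

    print_r($args);
    ```

    Single quotes are not treated as an enclosure here, which is why 'arg 3' keeps them.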
    
  • 2020-12-11 01:32

    I suggest something like:

    $str = <<<EOD
    foo "bar \"baz\"" '\'quux\''
    EOD;
    
    $match = preg_split("/('(?:.*)(?<!\\\\)(?>\\\\\\\\)*'|\"(?:.*)(?<!\\\\)(?>\\\\\\\\)*\")/U", $str, -1, PREG_SPLIT_DELIM_CAPTURE);
    
    var_dump(array_filter(array_map('trim', $match)));
    

    The regexp is adapted from the question "string to array, split by single and double quotes".

    You still have to unescape the strings in the array afterwards.

    array(3) {
      [0]=>
      string(3) "foo"
      [1]=>
      string(13) ""bar \"baz\"""
      [3]=>
      string(10) "'\'quux\''"
    }
    

    But you get the picture.
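
    The unescaping step mentioned above could look something like this. It is a sketch under assumptions: the helper name unquoteArg is hypothetical, and only the enclosing quote character and the backslash are treated as escapable.

    ```php
    <?php
    // Hypothetical helper: strip the enclosing quotes from one token and unescape its contents.
    function unquoteArg(string $token): string
    {
        $length = strlen($token);
        if ($length >= 2) {
            $quote = $token[0];
            if (($quote === '"' || $quote === "'") && $token[$length - 1] === $quote) {
                $inner = (string) substr($token, 1, -1);
                // Turn \<quote> into <quote> and \\ into \ in a single pass.
                return preg_replace('/\\\\([\\\\' . $quote . '])/', '$1', $inner);
            }
        }
        return $token; // unquoted tokens are returned unchanged
    }

    print_r(array_map('unquoteArg', ['foo', '"bar \"baz\""', '\'\\\'quux\\\'\'']));
    ```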

  • 2020-12-11 01:34

    Since you asked for a native way to do this, and PHP doesn't provide any function that mirrors how $argv is created, you could work around this gap like this:

    Create an executable PHP script foo.php :

    <?php
    
    // Skip this file name
    array_shift( $argv );
    
    // Output valid PHP code
    echo 'return '. var_export( $argv, true ).';';
    
    ?>
    

    And use it to retrieve the arguments the way PHP actually would if you executed $command:

    function parseCommand( $command )
    {
        return eval(
            shell_exec( "php foo.php ".$command )
        );
    }
    
    
    $command = <<<CMD
    foo "bar \"baz\"" '\'quux\''
    CMD;
    
    
    $args = parseCommand( $command );
    
    var_dump( $args );
    

    Advantages:

    • Very simple code
    • Should be faster than any regular expression
    • Matches PHP's actual behavior exactly

    Drawbacks:

    • Requires execution privileges on the host
    • Shell exec + eval on the same $var, let's party! You have to trust the input, or do so much filtering that a simple regexp may be faster (I didn't dig deep into that).
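
    The eval half of that risk can be avoided by having the child process print JSON instead of PHP code. The sketch below rests on several assumptions: the function name parseCommandViaJson is hypothetical, it runs the child inline with php -r instead of a separate foo.php, and it expects a POSIX-style shell. The command string is still interpreted by the shell, so the input must still be trusted.

    ```php
    <?php
    // Hypothetical variant: the child process prints JSON, so nothing is eval()'d.
    function parseCommandViaJson(string $command): ?array
    {
        // Inline child script: drop $argv[0], then print the remaining arguments as JSON.
        $child = 'array_shift($argv); echo json_encode($argv);';

        // WARNING: $command is passed to the shell unescaped on purpose, so that the
        // shell splits it into arguments. Never use this with untrusted input.
        $output = shell_exec(
            escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($child) . ' -- ' . $command
        );

        $args = is_string($output) ? json_decode($output, true) : null;
        return is_array($args) ? $args : null;
    }

    var_dump(parseCommandViaJson('foo "bar baz"'));
    ```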