Regular Expressions - Matching IRC-like parameters?

野性不改 2020-12-17 06:21

I am looking to create an IRC-like command format:

/commandname parameter1 "parameter 2" "parameter \"3\"" parameter"4 parameter\"5

2 Answers
  • 2020-12-17 06:42

    You have shown your code - that's good, but it seems that you haven't thought about whether it is reasonable to parse the command like that:

    • Firstly, your code allows newline characters inside the command name and parameters. It would be more reasonable to assume that a newline can never appear there.
    • Secondly, \ needs to be escaped just like ". Otherwise there is no unambiguous way to specify a single \ at the end of a quoted parameter.
    • Thirdly, it is a bit odd to parse the command name the same way as the parameters. Command names are usually predetermined and fixed, so there is no need to allow flexible ways of specifying them.
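
    To illustrate the second point, here is a minimal sketch of a quoting scheme in which both \ and " are escaped, so that a parameter ending in a backslash survives a round trip. The names quoteParam and unquoteParam are made up for this example and are not part of the question's code.

```javascript
// Hypothetical helpers, for illustration only.

// Wrap a value in quotes, escaping every backslash and double quote.
function quoteParam(value) {
    return '"' + value.replace(/[\\"]/g, "\\$&") + '"';
}

// Reverse quoteParam; returns null if the text is not a well-formed
// quoted parameter.
function unquoteParam(text) {
    var m = /^"((?:[^\\"]|\\[\\"])*)"$/.exec(text);
    return m === null ? null : m[1].replace(/\\([\\"])/g, "$1");
}

var original = "C:\\dir\\";          // the 7 characters C:\dir\
var quoted = quoteParam(original);   // every \ is doubled inside the quotes
console.log(unquoteParam(quoted) === original); // true
```

    Because the trailing \ is itself escaped, the closing quote can no longer be mistaken for an escaped quote.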

    I cannot think of a general one-line solution in JavaScript. JavaScript regex lacks \G, which asserts the position where the last match ended. So my solution has to make do with the beginning-of-string assertion ^ and chop off the front of the string each time a token is matched.

    (There is not much code here, mostly comments)

    function parseCommand(str) {
        /*
         * Trim() in C# trims off all whitespace characters.
         * \s in a JavaScript regex also matches any whitespace character.
         * However, the sets of characters considered whitespace might not be
         * equivalent.
         * But you can be sure that \r, \n, \t, space (ASCII 32) are included.
         * 
         * However, allowing all those whitespace characters in the command
         * is questionable.
         */
        str = str.replace(/^\s*\//, "");
    
        /* The look-ahead (?!") prevents a quoted parameter with a missing
         * closing quote from being matched as a non-quoted token.
         * It is needed because your code does not backtrack, while the regex
         * engine does. A possessive quantifier could prevent the
         * backtracking, but JavaScript RegExp does not support it.
         *
         * We emulate the effect of \G by using ^ and repeatedly chomping off
         * the string.
         *
         * The regex will match 2 cases:
         * (?!")([^ ]+)
         * This will match non-quoted tokens, which are not allowed to 
         * contain spaces
         * The token is captured into capturing group 1
         *
         * "((?:[^\\"]|\\[\\"])*)"
         * This will match quoted tokens, which consist of 0 or more:
         * non-quote-or-backslash [^\\"] OR escaped quote \"
         * OR escaped backslash \\
         * The text inside the quote is captured into capturing group 2
         */
        var regex = /^ *(?:(?!")([^ ]+)|"((?:[^\\"]|\\[\\"])*)")/;
        var tokens = [];
        var arr;
    
        while ((arr = str.match(regex)) !== null) {
            if (arr[1] !== void 0) {
                // Non-space token
                tokens.push(arr[1]);
            } else {
                // Quoted token, needs extra processing to
                // convert escaped character back
                tokens.push(arr[2].replace(/\\([\\"])/g, '$1'));
            }
    
            // Remove the matched text
            str = str.substring(arr[0].length);
        }
    
        // Test that the leftover consists of only space characters
        if (/^ *$/.test(str)) {
            return tokens;
        } else {
            // We reach here only when a quoted token is malformed: the closing
            // quote is missing, or a \ is followed by something other than \ or ".
            // Your code returns the tokens successfully parsed
            // but I think it is better to show an error here.
            return null;
        }
    }
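
    Engines that support the ES2015 sticky flag y offer another way to emulate \G: with that flag, exec() only matches exactly at regex.lastIndex. The following is a sketch under that assumption, not a drop-in replacement; the name parseCommandSticky is made up here, and the token grammar is copied from parseCommand above.

```javascript
// Sketch: same token grammar as parseCommand, but the string is never
// chopped; regex.lastIndex tracks the current position instead.
function parseCommandSticky(str) {
    // Skip optional leading whitespace and the slash.
    var start = /^\s*\//.exec(str);
    var pos = start ? start[0].length : 0;

    // No ^ here: the "y" flag anchors every match at lastIndex.
    var regex = / *(?:(?!")([^ ]+)|"((?:[^\\"]|\\[\\"])*)")/y;
    var tokens = [];
    var arr;

    regex.lastIndex = pos;
    while ((arr = regex.exec(str)) !== null) {
        tokens.push(arr[1] !== undefined
            ? arr[1]                                  // non-quoted token
            : arr[2].replace(/\\([\\"])/g, "$1"));    // unescape quoted token
        pos = regex.lastIndex; // remember where the last match ended
    }

    // A failed sticky exec() resets lastIndex to 0, so use the saved position.
    return /^ *$/.test(str.slice(pos)) ? tokens : null;
}

console.log(parseCommandSticky('/join #channel "secret key"'));
// [ 'join', '#channel', 'secret key' ]
console.log(parseCommandSticky('/say "oops')); // null (unclosed quote)
```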
    
  • 2020-12-17 07:05

    I created a simple regex that matches the command line you wrote.

    /\w+\s((("([^\\"]*\\")*[^\\"]*")|[^ ]+)(\b|\s+))+$
    
    • /\w+\s matches the slash, the command name, and the whitespace character after it
    • ((( opens three groups: one parameter plus its trailing separator, the parameter itself, and the quoted alternative
    • "([^\\"]*\\")* matches the opening " and then, zero or more times, a run of characters other than \ and " followed by an escaped quote \" (thus allowing "something\", "some\"thing\" and so on)
    • [^\\"]*" then matches a run of characters other than \ and ", and finally the closing "
    • )|[^ ]+ is the alternative: any sequence of non-space characters
    • )
    • (\b|\s+) each parameter is followed by whitespace or a word boundary
    • )+$ the whole group repeats one or more times, once per parameter, until the end of the string

    I'm afraid this can fail in some cases, but I posted it to show that arguments sometimes have a structure based on repetition. For example, in "something\"something\"something\"end" the repeated unit is something\", and you can use this idea to build your regex.
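
    As a quick check, the pattern can be tried in JavaScript. This is only a sketch: the variable names are made up, and the leading slash has to be written as \/ inside a regex literal. It also shows one input on which the pattern fails.

```javascript
// The regex from this answer as a JavaScript literal.
var commandRegex = /\/\w+\s((("([^\\"]*\\")*[^\\"]*")|[^ ]+)(\b|\s+))+$/;

// String.raw keeps the backslashes exactly as typed.
var line = String.raw`/commandname parameter1 "parameter 2" "parameter \"3\"" parameter"4 parameter\"5`;
console.log(commandRegex.test(line)); // true

// One case where it fails: after a closing quote at the very end of the
// line, neither \b nor \s+ can match, so the whole pattern is rejected.
console.log(commandRegex.test('/say "hello world"')); // false
```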
