Regular expression to match escaped characters (quotes)

Submitted by 雨燕双飞 on 2019-11-30 19:06:08

The problem with all the other answers is that they pass the obvious initial tests but fall short under further scrutiny. For example, all of them assume that the very first quote is not escaped. More importantly, escaping involves more than a single backslash, because that backslash can itself be escaped. Imagine trying to match a string that ends with a backslash. How would that be possible?

This is the pattern you are looking for. It doesn't assume that the first quote is the working (unescaped) one, and it allows backslashes themselves to be escaped.

(?<!\\)(?:\\{2})*"(?:(?<!\\)(?:\\{2})*\\"|[^"])+(?<!\\)(?:\\{2})*"
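As a rough illustration, the pattern can be tried from Java as sketched below. The class name `LookbehindQuoteDemo` and the method `findQuoted` are hypothetical names chosen for this example, and the pattern is simply the one above rewritten as a Java string literal (each regex backslash doubled, each quote escaped).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class LookbehindQuoteDemo {
    // Opening quote, body, closing quote. Each quote may be preceded
    // only by an even number of backslashes (i.e. escaped backslashes).
    static final Pattern QUOTED = Pattern.compile(
        "(?<!\\\\)(?:\\\\{2})*\""
        + "(?:(?<!\\\\)(?:\\\\{2})*\\\\\"|[^\"])+"
        + "(?<!\\\\)(?:\\\\{2})*\"");

    // Returns every substring of the input that the pattern matches.
    static List<String> findQuoted(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Quoted text containing escaped quotes.
        System.out.println(findQuoted("say \"a \\\"b\\\" c\" end"));
        // A string whose content ends with an escaped backslash.
        System.out.println(findQuoted("\"this\\\\\""));
        // An unclosed string: the last quote is escaped.
        System.out.println(findQuoted("\"this\\\""));
    }
}
```

The third input has no unescaped closing quote, so under this sketch it yields no match at all.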

Here is one that I've used in the past:

("[^"\\]*(?:\\.[^"\\]*)*")

This will capture quoted strings, along with any escaped quote characters, and exclude anything that doesn't appear in enclosing quotes.

For example, the pattern will capture "This is valid" and "This is \" also \" valid" from this string:

"This is valid" this won't be captured "This is \" also \" valid"

This pattern will not match the string "I don't \"have\" a closing quote, and will allow for additional escape codes in the string (e.g., it will match "hello world!\n").
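The behaviour described above can be checked with a short Java sketch. The class name `QuotedStringFinder` and the method `findAll` are hypothetical; the pattern is written as a Java string literal, so its backslashes and quotes are escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class QuotedStringFinder {
    // A quote, then runs of ordinary characters separated by
    // backslash-escaped characters, then the closing quote.
    static final Pattern QUOTED =
        Pattern.compile("(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");

    // Collects the first capture group of every match in the input.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "\"This is valid\" this won't be captured "
            + "\"This is \\\" also \\\" valid\"";
        for (String s : findAll(sample)) {
            System.out.println(s);
        }
        // No closing quote, so nothing is found here.
        System.out.println(findAll("\"I don't \\\"have\\\" a closing quote"));
    }
}
```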

Of course, you'll have to escape the pattern to use it in your code, like so:

"(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")"

Try this one. The alternation tries \" first; if that matches, it is consumed as an escaped quote, otherwise any single character other than " is matched.

"((?:\\"|[^"])*)"

Once you have matched the string, you'll need to take the first captured group's value and replace \" with ".
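A minimal Java sketch of these two steps (match, then unescape) might look like the following. The class name `UnescapeQuoted` and the method `firstUnescaped` are hypothetical, and returning an `Optional` is just one way to signal that nothing matched.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class UnescapeQuoted {
    // Group 1 holds the text between the quotes, escapes still intact.
    static final Pattern QUOTED =
        Pattern.compile("\"((?:\\\\\"|[^\"])*)\"");

    // Finds the first quoted string and turns each backslash-quote
    // pair in its content into a plain quote.
    static Optional<String> firstUnescaped(String input) {
        Matcher m = QUOTED.matcher(input);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).replace("\\\"", "\""));
    }

    public static void main(String[] args) {
        System.out.println(firstUnescaped("say \"He said \\\"hi\\\"\" now"));
        System.out.println(firstUnescaped("no quotes here"));
    }
}
```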

Edit: Fixed grouping logic.

The code below uses regular expressions to validate comma-separated lists of quoted strings, decimals, and integers.

public static void commaSeparatedStrings() {        
    String value = "'It\\'s my world', 'Hello World', 'What\\'s up', 'It\\'s just what I expected.'";

    if (value.matches("'([^\'\\\\]*(?:\\\\.[^\'\\\\])*)[\\w\\s,\\.]+'(((,)|(,\\s))'([^\'\\\\]*(?:\\\\.[^\'\\\\])*)[\\w\\s,\\.]+')*")) {
        System.out.println("Valid...");
    } else {
        System.out.println("Invalid...");
    }
}

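/**
 * Checks that the value is a comma-separated list of decimals,
 * each with an optional leading minus sign.
 */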
public static void commaSeparatedDecimals() {
    String value = "-111.00, 22111.00, -1.00";
    // "\\d+([,]|[,\\s]\\d+)*"
    if (value.matches(
            "^([-]?)\\d+\\.\\d{1,10}?(((,)|(,\\s))([-]?)\\d+\\.\\d{1,10}?)*")) {
        System.out.println("Valid...");
    } else {
        System.out.println("Invalid...");
    }
}

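/**
 * Checks that the value is a comma-separated list of integers,
 * each with an optional leading minus sign.
 */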
public static void commaSeparatedNumbers() {
    String value = "-11, 22, -31";      
    if (value.matches("^([-]?)\\d+(((,)|(,\\s))([-]?)\\d+)*")) {
        System.out.println("Valid...");
    } else {
        System.out.println("Invalid...");
    }
}

This

("((?:[^"\\])*(?:\\\")*(?:\\\\)*)*")

will capture all strings (within double quotes), including \" and \\ escape sequences. (Note that this answer assumes that the only escape sequences in your string are \" or \\ sequences -- no other backslash characters or escape sequences will be captured.)

("(?:         # begin with a quote and capture...
  (?:[^"\\])* # any non-\, non-" characters
  (?:\\\")*   # any combined \" sequences
  (?:\\\\)*   # and any combined \\ sequences
  )*          # any number of times
")            # then, close the string with a quote

Also, note that maksymiuk's accepted answer contains an "edge case" ("Imagine trying to actually match a string which ends with a backslash") which is actually just a malformed string. Something like

"this\"

...is not a "string ending on a backslash", but an unclosed string ending on an escaped quotation mark. A string which truly ends on a backslash would look like

"this\\"

...and the above solution handles this case.
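The two cases can be compared with a small Java sketch. The class name `BackslashEdgeCase` and the method `isQuotedString` are hypothetical; the pattern is the single-line one from this answer, escaped for a Java string literal, and `matches()` is used so that the whole input has to be one quoted string.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class BackslashEdgeCase {
    // Ordinary characters, escaped quotes and escaped backslashes,
    // repeated any number of times between two quotes.
    static final Pattern QUOTED = Pattern.compile(
        "(\"((?:[^\"\\\\])*(?:\\\\\\\")*(?:\\\\\\\\)*)*\")");

    // True only if the entire input is a single well-formed string.
    static boolean isQuotedString(String input) {
        return QUOTED.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Content ends with an escaped backslash: well formed.
        System.out.println(isQuotedString("\"this\\\\\""));
        // Final quote is escaped: the string is never closed.
        System.out.println(isQuotedString("\"this\\\""));
    }
}
```

Because the pattern nests one repetition inside another, inputs that fail to match can take a long time to reject; the sketch is meant for short strings like the ones above.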


If you want to expand a bit, this...

(\\(?:b|t|n|f|r|\"|\\)|\\(?:(?:[0-2][0-9]{1,2}|3[0-6][0-9]|37[0-7]|[0-9]{1,2}))|\\(?:u(?:[0-9a-fA-F]{4})))

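...captures all common escape sequences (including escaped quotes). Note that the octal branch is written with [0-9], so it also accepts the digits 8 and 9, which are not valid octal digits: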

(\\                       # get the preceding slash (for each section)
  (?:b|t|n|f|r|\"|\\)     # capture common sequences like \n and \t

  |\\                     # OR (get the preceding slash and)...
  # capture variable-width octal escape sequences like \02, \13, or \377
  (?:(?:[0-2][0-9]{1,2}|3[0-6][0-9]|37[0-7]|[0-9]{1,2}))

  |\\                     # OR (get the preceding slash and)...
  (?:u(?:[0-9a-fA-F]{4})) # capture fixed-width Unicode sequences like \u0242 or \uFFAD
)

See this Gist for more information on the second point.
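To illustrate, the escape-sequence pattern can be run over a string from Java roughly as follows. The class name `EscapeSequenceFinder` and the method `findEscapes` are hypothetical; the pattern is the single-line version from this answer, split across three string literals only for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class EscapeSequenceFinder {
    static final Pattern ESCAPE = Pattern.compile(
        // single-character escapes, escaped quote, escaped backslash
        "(\\\\(?:b|t|n|f|r|\\\"|\\\\)"
        // numeric escapes of one to three digits
        + "|\\\\(?:(?:[0-2][0-9]{1,2}|3[0-6][0-9]|37[0-7]|[0-9]{1,2}))"
        // Unicode escapes: the letter u plus exactly four hex digits
        + "|\\\\(?:u(?:[0-9a-fA-F]{4})))");

    // Returns every escape sequence found in the input, in order.
    static List<String> findEscapes(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ESCAPE.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The input contains literal backslashes, not control characters.
        for (String e : findEscapes("a\\nb\\u0242c\\377d\\\"e")) {
            System.out.println(e);
        }
    }
}
```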
