RegExp - find all occurrences, but not inside quotes

Submitted by 冷暖自知 on 2021-01-28 03:25:07

Question


I have this text (it's a string value, not a language expression):

hello = world + 'foo bar' + gizmo.hoozit + "escaped \"quotes\"";

And I would like to find all words ([a-zA-Z]+) which are not enclosed in double or single quotes. The quotes can be escaped (\" or \'). The result should be:

hello, world, gizmo, hoozit

Can I do this using regular expressions in JavaScript?


Answer 1:


You can use this pattern; the words you need are in the second capturing group.

EDIT: a slightly shorter version, using a negative lookahead:

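// Alternative 1 consumes a whole quoted string; alternative 2 captures a word.
var re = /(['"])(?:[^"'\\]+|(?!\1)["']|\\{2}|\\[\s\S])*\1|([a-z]+)/ig;

var mystr = 'hello = world + \'foo bar\' + gizmo.hoozit + "escaped \\"quotes\\"";';

var result = [];
var match; // declared explicitly so it does not leak as an implicit global
while ((match = re.exec(mystr)) !== null) {
    // group 2 is only set when a word outside quotes was matched
    if (match[2]) result.push(match[2]);
}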

console.log(mystr);
console.log(result);

The idea is to match (and so consume) any quoted content before trying the target, so that the second alternative never sees words inside quotes.

Enclosed content details: (["'])(?:[^"'\\]+|(?!\1)["']|\\{2}|\\[\s\S])*\1

(["'])         # opening quote, single or double (captured as group 1)
(?:            # open a non-capturing group
    [^"'\\]+   # anything that is not a quote or a backslash
  |            # OR
    (?!\1)["'] # a quote, but not the captured (opening) quote
  |            # OR
    \\{2}      # 2 backslashes (to compose all even numbers of backslashes)*
  |            # OR
    \\[\s\S]   # an escaped character (to allow escaped quotes)
)*             # repeat the group zero or more times
\1             # the closing quote, same kind as the opening one (backreference)

(* an even number of backslashes doesn't escape anything)
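As a minimal sketch, the same pattern can be wrapped in a reusable function. The helper name wordsOutsideQuotes is hypothetical (it is not part of the original answer), and String.prototype.matchAll is assumed to be available (ES2020 or later). String.raw is used only so the sample text can be written without doubling the backslashes.

```javascript
// Hypothetical helper; the regular expression is the one from this answer.
function wordsOutsideQuotes(text) {
  // Alternative 1 consumes a whole quoted string (group 1 = the quote used).
  // Alternative 2 captures a word that sits outside any quoted string.
  const re = /(['"])(?:[^"'\\]+|(?!\1)["']|\\{2}|\\[\s\S])*\1|([a-z]+)/gi;
  const words = [];
  for (const m of text.matchAll(re)) {
    // Group 2 is undefined whenever a quoted string was matched instead.
    if (m[2] !== undefined) words.push(m[2]);
  }
  return words;
}

const sample = String.raw`hello = world + 'foo bar' + gizmo.hoozit + "escaped \"quotes\"";`;
console.log(wordsOutsideQuotes(sample)); // [ 'hello', 'world', 'gizmo', 'hoozit' ]
```

Because a fresh regular expression is created on every call, the function carries no lastIndex state between calls, which is a common pitfall when a global regex is shared.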




Answer 2:


You might want to use several regular expression methods one after the other for simplicity and clarity of function (large Regexes may be fast, but they're hard to construct, understand and edit): first remove all escaped quotes, then remove all quoted strings, then run your search.

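var matches = string
  .replace( /\\'|\\"/g,         '' )  // 1. drop escaped quotes
  .replace( /'[^']*'|"[^"]*"/g, '' )  // 2. drop quoted strings (single or double)
  .match( /\w+/g );                   // 3. collect the remaining words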

A few notes on the regular expressions involved:

  • The central construct in the 2nd replacement is a quote character, followed by zero or more (*) characters from a negated set ([^...]) that excludes that same quote, followed by the closing quote
  • | means or, meaning either the part before or after the pipe can be matched
  • '\w' means 'any word character', i.e. '[A-Za-z0-9_]'; this is broader than the '[a-zA-Z]' asked for in the question, so use '[a-zA-Z]+' if digits and underscores must not count as part of a word
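To make the difference between the two character classes concrete, here is a small sketch. The sample text and variable names are made up for illustration; only the two patterns come from the thread.

```javascript
// Illustrative input containing a digit and an underscore.
const text = "total_2 = x1 + count";

// \w also accepts digits and the underscore.
const wordChars = text.match(/\w+/g);
// [a-zA-Z] accepts ASCII letters only, so digits and underscores split words.
const lettersOnly = text.match(/[a-zA-Z]+/g);

console.log(wordChars);   // [ 'total_2', 'x1', 'count' ]
console.log(lettersOnly); // [ 'total', 'x', 'count' ]
```

For the text in the question both patterns give the same result, since it contains no digits or underscores outside the quoted parts.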





Answer 3:


  1. Replace each escaped quote with an empty string;
  2. Replace each pair of quotes and the string between with an empty string:
    • If you use a capture group for the opening quote (["']) then you can use a back-reference \1 to match the same style quote at the other end of the quoted string;
    • Matching with a back reference means you need to use a non-greedy (match as few characters as possible) wildcard match .*? to get the minimum possible quoted string.
  3. Finally, find the matches using your regular expression [a-zA-Z]+.

Like this:

var text = "hello = world + 'foo bar' + gizmo.hoozit + \"escaped \\\"quotes\\\"\";";

var matches = text.replace( /\\["']/g,      '' )
                  .replace( /(["']).*?\1/g, '' )
                  .match(   /[a-zA-Z]+/g );

console.log( matches );
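One caveat worth checking before relying on the strip-then-match approach: it treats every backslash followed by a quote as an escaped quote, even when the backslash is itself escaped. The sketch below compares it with the single-pass pattern from Answer 1 on such an input. The helper names stripThenMatch and singlePass and the tricky sample are assumptions made for illustration, not part of the original answers.

```javascript
// Answer 3's pipeline, wrapped in a hypothetical helper.
function stripThenMatch(text) {
  const words = text
    .replace(/\\["']/g, "")      // 1. drop escaped quotes
    .replace(/(["']).*?\1/g, "") // 2. drop quoted strings
    .match(/[a-zA-Z]+/g);        // 3. collect words
  return words === null ? [] : words;
}

// Answer 1's single-pass pattern, wrapped in a hypothetical helper.
function singlePass(text) {
  const re = /(['"])(?:[^"'\\]+|(?!\1)["']|\\{2}|\\[\s\S])*\1|([a-z]+)/gi;
  const words = [];
  for (const m of text.matchAll(re)) {
    if (m[2] !== undefined) words.push(m[2]);
  }
  return words;
}

// The text from the question: both approaches agree.
const plain = String.raw`hello = world + 'foo bar' + gizmo.hoozit + "escaped \"quotes\"";`;
console.log(stripThenMatch(plain)); // [ 'hello', 'world', 'gizmo', 'hoozit' ]
console.log(singlePass(plain));     // [ 'hello', 'world', 'gizmo', 'hoozit' ]

// A quoted string that ends in an escaped backslash: 'dir\\'
const tricky = String.raw`a = 'dir\\' + b`;
console.log(stripThenMatch(tricky)); // [ 'a', 'dir', 'b' ]
console.log(singlePass(tricky));     // [ 'a', 'b' ]
```

In the second case, step 1 removes the closing quote together with the backslash before it, so step 2 finds no complete quoted string to remove and the word inside leaks into the result. If such input cannot occur, the simpler pipeline is probably sufficient.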


Source: https://stackoverflow.com/questions/20351377/regexp-find-all-occurences-but-not-inside-quotes
