std::regex, to match begin/end of string

江枫思渺然 submitted on 2020-01-22 13:42:51

Question


In JavaScript regular expressions, the symbols ^ and $ designate the start and end of the string. Only with the /m modifier (multiline mode) do they match the start and end of a line, i.e. the positions before and after CR/LF.

But in std::regex in ECMAScript mode, the symbols ^ and $ always match the start and end of a line.

Is there any way in std::regex to define start-of-string and end-of-string match points? In other words: to support JavaScript multiline mode ...


Answer 1:


By default, ECMAScript mode already treats ^ as both beginning-of-input and beginning-of-line, and $ as both end-of-input and end-of-line. There is no way to make them match only beginning-of-input or end-of-input, but it is possible to make them match only beginning-of-line or end-of-line:

When invoking std::regex_match, std::regex_search, or std::regex_replace, there is an argument of type std::regex_constants::match_flag_type that defaults to std::regex_constants::match_default.

  • To specify that ^ matches only beginning-of-line (and no longer beginning-of-input), pass std::regex_constants::match_not_bol
  • To specify that $ matches only end-of-line (and no longer end-of-input), pass std::regex_constants::match_not_eol
  • As these values are bit flags, to specify both, bitwise-or them together (std::regex_constants::match_not_bol | std::regex_constants::match_not_eol)
  • Note that beginning-of-input can be required without using ^, and regardless of the presence of std::regex_constants::match_not_bol, by passing std::regex_constants::match_continuous

This is explained well in the ECMAScript grammar documentation on cppreference.com, which I highly recommend over cplusplus.com in general.

Caveat: I've tested with MSVC, Clang + libc++, and Clang + libstdc++, and only MSVC has the correct behavior at present.
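
As a minimal sketch of how the match flags from this answer are passed to std::regex_search, the block below wraps the call in a helper. The helper name found is hypothetical and only for illustration. It checks single-line inputs only, since the behavior on multi-line input differs between standard libraries, as the caveat above notes.

```cpp
#include <regex>
#include <string>

namespace rc = std::regex_constants;

// Hypothetical helper: true if `pattern` is found somewhere in `text`
// when std::regex_search is called with the given match flags.
bool found(const std::string& text, const std::string& pattern,
           rc::match_flag_type flags = rc::match_default) {
    return std::regex_search(text, std::regex(pattern), flags);
}
```

With this helper, found("abc", "^abc") is true, while found("abc", "^abc", rc::match_not_bol) is false because the first character is no longer treated as a line start. Likewise found("abc", "abc$", rc::match_not_eol) is false, and found("xab", "ab", rc::match_continuous) is false because the match would have to begin at the first character.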




Answer 2:


TL;DR

  • MSVC: the ^ and $ already match start and end of lines
  • C++17: use std::regex_constants::multiline option
  • Other compilers only match the start of the string with ^ and the end of the string with $, with no possibility of redefining their behavior.

In all std::regex implementations other than MSVC, and before C++17, ^ and $ match the beginning and end of the string, not of a line. For example, the regex ^\d+$ finds no match in "1\n2\n3". When you add alternations (see below), there are 3 matches.

However, in MSVC and C++17, the ^ and $ may match start/end of the line.

C++17

Use the std::regex_constants::multiline option.
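
A minimal sketch of passing that option to the std::regex constructor follows. It assumes a standard library that actually implements std::regex_constants::multiline (support varies, so check your toolchain) and that the code is compiled as C++17 or later. The function name match_lines is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every match of ^\d+$ in `text`, with the regex compiled
// using the C++17 multiline syntax option.
std::vector<std::string> match_lines(const std::string& text) {
    const std::regex r("^\\d+$",
                       std::regex_constants::ECMAScript |
                       std::regex_constants::multiline);
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), r), end; it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

On a library that implements the option, match_lines("1\n2\n3") should return the three lines "1", "2" and "3".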

MSVC compiler

In a C++ project in Visual Studio, the following

std::regex r("^\\d+$");
std::string st("1\n2\n3");
for (std::sregex_iterator i = std::sregex_iterator(st.begin(), st.end(), r);
    i != std::sregex_iterator();
    ++i)
{
    std::smatch m = *i;
    std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
}

will output

Match value: 1 at Position 0
Match value: 2 at Position 2
Match value: 3 at Position 4

Workarounds that work across C++ compilers

There is no universal option in std::regex to make the anchors match start/end of the line across all compilers. You need to emulate it with alternations:

^ -> (^|\n)
$ -> (?=\n|$)

Note that $ can be "emulated" fully with (?=\n|$) (where you may add more line terminator symbols or symbol sequences, like (?=\r?\n|\r|$)), but with ^, you cannot find a 100% workaround.

Since there is no lookbehind support, you might have to adjust other parts of your regex pattern because of (^|\n), for example by using capturing groups more often than you would if lookbehind were supported.
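
The alternation workaround can be sketched as follows. Because (?:^|\n) may consume the preceding newline, the line content is read from capture group 1 rather than from the whole match. The function name digit_lines is hypothetical, and only \n is treated as a line terminator here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the content of every line that consists only of digits.
std::vector<std::string> digit_lines(const std::string& text) {
    // (?:^|\n) emulates a line-start anchor, (?=\n|$) a line-end anchor.
    // The whole match may begin with '\n', so the digits are taken from group 1.
    const std::regex r("(?:^|\\n)(\\d+)(?=\\n|$)");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), r), end; it != end; ++it)
        out.push_back((*it)[1].str());
    return out;
}
```

For example, digit_lines("1\n2\n3") yields "1", "2", "3", and digit_lines("1a\n22\nb3") yields only "22".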




Answer 3:


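The following code snippet matches email addresses that start with one or more letters a-z, followed by zero or more dots, then by zero or more letters a-z, and that end with "@gmail.com". I tested it.

#include <iostream>
#include <regex>
#include <string>

int main() {
    // ^ and $ anchor the pattern to the whole input; icase ignores letter case
    std::regex reg1("^[a-z]+\\.*[a-z]*@gmail\\.com$", std::regex_constants::icase);
    std::string email;
    std::cin >> email;
    if (std::regex_search(email, reg1))
        std::cout << "match\n";
}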


Answer 4:


You can emulate Perl/Python/PCRE \A, which matches at the beginning of the string but not after a newline, with the JavaScript regex ^(?<!(.|\n)), which translates to English as "match the beginning of a line which has no preceding character".

You can emulate Perl/Python/PCRE \z, which matches only at end-of-string, using (?!(.|\n))$. To get the effect of \Z, which matches only at end-of-string but allows a single newline just before that end-of-string, just add an optional newline: \n?(?!(.|\n))$.
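
A minimal sketch of the \z emulation with std::regex follows. Only the lookahead-based pattern is shown, because the ECMAScript grammar used by std::regex has no lookbehind, so the \A pattern above does not carry over. The function name ends_with_digits is hypothetical.

```cpp
#include <regex>
#include <string>

// True if `text` has a run of digits at the very end of the input.
// (?!(.|\n)) rejects any position that is followed by another character,
// so $ can only succeed at the true end of the string.
bool ends_with_digits(const std::string& text) {
    const std::regex r("\\d+(?!(.|\\n))$");
    return std::regex_search(text, r);
}
```

Here ends_with_digits("1\n2\n3") is true because of the final "3", while ends_with_digits("1\n2\n") is false because every run of digits is followed by a newline.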



Source: https://stackoverflow.com/questions/39645660/stdregex-to-match-begin-end-of-string
