Why does this regex return true?

Submitted by 戏子无情 on 2020-01-24 10:06:20

Question


Why does this regex return true?

Regex.IsMatch("العسكرية", "العسكري")

I googled and nothing came up.


Answer 1:


I suspect what you posted is actually reversed: the shorter text is in fact the pattern, and the longer text is the input being matched against. In that case, this would return true, since the pattern matches everything but the last letter of the input word.

To clarify, العسكري is the pattern, and العسكرية is the input. Since I know Arabic, I can tell you that the pattern does match part of the input, so the result would be true if the values were actually reversed. If you refer to this table of the Arabic alphabet, you can see that the letter yā’ (at the bottom of the table) is the letter in question. Its appearance depends on where it occurs in a word. In the former word it appears at the end, and in the latter it is the second-to-last letter.

When I copy/paste from your post, the values get reversed, resulting in a true value. To make this easier to follow, we can put each word in its own variable and see the expected result in both scenarios:

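using System;
using System.Text.RegularExpressions;

string first = "العسكري";
string second = "العسكرية";
// Regex.IsMatch takes (input, pattern): the longer word is not found inside the shorter one.
Console.WriteLine(Regex.IsMatch(first, second)); // False
Console.WriteLine(Regex.IsMatch(second, first)); // True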
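If the intent was for the whole input to match rather than just part of it, one option is to anchor the pattern. The following is only a minimal sketch: IsWholeMatch is a hypothetical helper name that is not part of .NET, and it assumes the pattern is a valid regex that can safely be wrapped in a non-capturing group.

```csharp
using System;
using System.Text.RegularExpressions;

string first = "العسكري";
string second = "العسكرية";

// Hypothetical helper (not a .NET API): true only when the pattern
// consumes the entire input, from \A (start) to \z (very end).
static bool IsWholeMatch(string input, string pattern) =>
    Regex.IsMatch(input, @"\A(?:" + pattern + @")\z");

// Unanchored: the shorter word is found inside the longer one.
Console.WriteLine(Regex.IsMatch(second, first)); // True
// Anchored: the final letter of the longer word is left over.
Console.WriteLine(IsWholeMatch(second, first));  // False
// Anchored, identical strings.
Console.WriteLine(IsWholeMatch(first, first));   // True
```

With the anchors in place, the partial match described above no longer counts as a match.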



Answer 2:


This is an interesting result of text rendering rules designed for prose, not code.

The first argument in your method call as written above is "العسكرية", the argument that is rendered(*) on the right-hand side. This longer argument is the input, and the shorter substring rendered on the left is actually the pattern, hence the match.

(*: this is assuming that your browser knows how to do right-to-left rendering. If you paste the code snippet into an editor or console that doesn't have complex text layout support, you'll see it for what it really is... although the Arabic will then be broken.)

The trick is that punctuation characters like quote marks and the comma are directionless, so they can render left-to-right or right-to-left depending on their surroundings. The logical order of the snippet is:

>>>>>>>>>>>>>>>
               <<<<<<<<<<<<<<<<<<<
                                  >>
Regex.IsMatch("العسكرية", "العسكري")

(Which has the further confusing property that the quotes that appear to be around each separate parameter, actually aren't.)

This makes some sort of arguable sense for stretches of readable mixed language, but makes code very confusing! You can stop it happening by breaking up the run of directionless characters with something that has left-to-right directionality:

Regex.IsMatch("العسكرية", /* foo */ "العسكري")

This is functionally the same code as the original, but it displays quite differently. You can watch the positions of the arguments swap places as you type the first Latin letter.
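Another way to sidestep the rendering issue, not mentioned in the question, is to write the Arabic letters as \u escapes, so the source line contains only left-to-right characters and the argument order reads the same in any editor. This is a sketch under the assumption that the two words consist of exactly the code points listed here (U+0627 U+0644 U+0639 U+0633 U+0643 U+0631 U+064A, plus U+0629 at the end of the longer word); the variable names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Each \uXXXX escape is one Arabic letter, listed in logical (typing) order.
// The longer, 8-letter word:
string input = "\u0627\u0644\u0639\u0633\u0643\u0631\u064A\u0629";
// The shorter, 7-letter word (the same letters without the final U+0629):
string pattern = "\u0627\u0644\u0639\u0633\u0643\u0631\u064A";

// The first argument is the input, the second is the pattern.
Console.WriteLine(Regex.IsMatch(input, pattern)); // True
Console.WriteLine(Regex.IsMatch(pattern, input)); // False
```

The escapes are harder to read than the words themselves, so this is mainly useful for checking what a confusing line really contains.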




Answer 3:


It seems that Regex.IsMatch() tells you whether there is an occurrence of the regex anywhere in the string, not whether the whole string matches the regex (according to the docs, it "Indicates whether the specified regular expression finds a match in the specified input string."). According to the docs, the first argument is the input and the second is the pattern, but here it seems to be the other way around. The last (left-most) character looks different in the two strings, but that is probably because of the way ligatures are rendered. When dumped as UTF-8 bytes, the strings are:

d8 a7 d9 84 d8 b9 d8 b3 d9 83 d8 b1 d9 8a

and

d8 a7 d9 84 d8 b9 d8 b3 d9 83 d8 b1 d9 8a d8 a9

so the first is actually a substring of the second, which would explain the match (though it does require the actual argument order to be the reverse of how the call reads on screen).
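The byte dumps above can be reproduced with something like the following. It is a rough sketch: HexDump is a hypothetical helper written for this example, and the two words are spelled out as \u escapes on the assumption that they contain exactly the code points implied by the dumps.

```csharp
using System;
using System.Text;

// Hypothetical helper: UTF-8 bytes of a string as lowercase hex pairs
// separated by spaces (BitConverter.ToString yields "D8-A7-...").
static string HexDump(string s) =>
    BitConverter.ToString(Encoding.UTF8.GetBytes(s))
                .Replace("-", " ")
                .ToLowerInvariant();

string shorter = "\u0627\u0644\u0639\u0633\u0643\u0631\u064A";
string longer = "\u0627\u0644\u0639\u0633\u0643\u0631\u064A\u0629";

Console.WriteLine(HexDump(shorter));
Console.WriteLine(HexDump(longer));

// Ordinal comparison checks code units only, with no culture rules involved.
Console.WriteLine(longer.StartsWith(shorter, StringComparison.Ordinal)); // True
```

The ordinal StartsWith check confirms that the shorter string is a prefix of the longer one, independently of how either is rendered.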



Source: https://stackoverflow.com/questions/9794858/why-does-this-regex-return-true
