Regex won't match whitespace character with [\r\n\t\f\s]

Posted by 一笑奈何 on 2019-12-20 04:15:32

Question


This is likely a very simple fix but I can't figure it out!

I'm trying to match (up to) 3 capitalized words in a row given the following text.

Russell Lake West. The match should include all 3 words.

This regex will match the first 2 words but not the third (demo here):

(([A-Z][a-z]+)\s{0,2}([A-Z][a-z]+)?\s{0,2}([A-Z][a-z]+)?)

This regex will match all 3 words, but I had to copy/paste the whitespace between Lake and West for it to work (demo here):

(([A-Z][a-z'-]+)\s{0,2}([A-Z][a-z'-]+)? \s{0,2}([A-Z][a-z'-]+)?)

                                       ^ pasted it here

So I assumed that maybe the whitespace isn't being treated as whitespace, but perhaps a newline character or similar, so I tried this (demo here):

[\r\n\t\f\s]West

But it doesn't recognize any of those characters before West, thus returning no results.

Why can't regex101 or Java recognize this apparent whitespace between Lake and West? What's a reliable way to handle this?


Answer 1:


There are many kinds of space characters. The one in your demo is a non-breaking space (code point 160, i.e. U+00A0, in the Unicode table). It does not belong to \s (the whitespace character class) because it does not mark a place where text may be split into separate parts such as lines.
BTW, \s already covers \r, \n, \t and \f, so listing them next to \s adds nothing.
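
One way to confirm which character actually sits between two words is to print the code points of the text. The following is only a sketch; the class name CodePointDump and the helper describe are made up for illustration and are not part of the original question.

```java
public class CodePointDump {
    // Hypothetical helper: lists every code point of the text as U+XXXX.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // \u00A0 stands in for the character copied from the demo text.
        System.out.println(describe("e\u00A0W")); // U+0065 U+00A0 U+0057
    }
}
```

A regular space would show up as U+0020 instead of U+00A0.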

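To match it, you can use the \p{Zs} class (the Unicode "space separator" category).
You can also combine \s and \p{Zs} in a single character class: [\\p{Zs}\\s] (backslashes doubled, as in a Java string literal).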
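
As a rough illustration, the sketch below runs the pattern from the question against the sample text twice, once with \s and once with [\p{Zs}\s]. It assumes the character between Lake and West is U+00A0, as described above; the class and field names are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NbspMatchDemo {
    // Pattern from the question: plain \s does not match U+00A0 here.
    static final Pattern ASCII_WS = Pattern.compile(
        "([A-Z][a-z]+)\\s{0,2}([A-Z][a-z]+)?\\s{0,2}([A-Z][a-z]+)?");

    // Same pattern with \s widened to [\p{Zs}\s].
    static final Pattern UNICODE_WS = Pattern.compile(
        "([A-Z][a-z]+)[\\p{Zs}\\s]{0,2}([A-Z][a-z]+)?[\\p{Zs}\\s]{0,2}([A-Z][a-z]+)?");

    // Returns the first match of the pattern in the text, or null if none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumed input: a non-breaking space between "Lake" and "West".
        String text = "Russell Lake\u00A0West";
        System.out.println(firstMatch(ASCII_WS, text));   // stops after "Russell Lake"
        System.out.println(firstMatch(UNICODE_WS, text)); // all three words
    }
}
```

Java also offers the Pattern.UNICODE_CHARACTER_CLASS flag, which makes \s follow the Unicode White_Space property; that is a different technique from the one in the answer and may change how other predefined classes behave.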



Source: https://stackoverflow.com/questions/34710972/regex-wont-match-whitespace-character-with-r-n-t-f-s
