Ignoring invisible characters in RegEx

Question

I've run into a bit of a conundrum.

I am currently trying to build a regex to filter out some particularly nasty scam emails. I'm sure you've seen them before, using a data dump from a compromised website to threaten to reveal intimate videos.

That's all well and good, except I noticed while testing the regex that some of these messages insert special invisible characters in the middle of words. Like you might see here (I've found it especially hard to find a place that keeps these special characters): Regexr link

I find myself looking for a way to create a regex that ignores these characters altogether, as some emails have them and some don't. In the end, I'm trying to create a match with something like

/all (.*)your contacts/

Answer 1:

If there's a particular string you're trying to flag, you could do something like this:

Detect "email" with optional invisible characters: /e[^\w]?m[^\w]?a[^\w]?i[^\w]?l/

[^\w]? will match at most one character that's not a letter, digit, or underscore. You could also use [^\w]* if you're seeing more than one invisible character being used between letters.
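
Rather than typing the filler between every letter by hand, the pattern can be generated. This is a minimal sketch in Python, chosen only for illustration since the question doesn't name a regex engine. `tolerant_pattern` is a hypothetical helper name, and the zero-width characters in the sample are an assumption about what the scam emails contain.

```python
import re

def tolerant_pattern(word: str, filler: str = r"[^\w]*") -> re.Pattern:
    """Build a regex for `word` that allows `filler` between each character.

    Hypothetical helper: the default filler is the [^\\w]* variant
    suggested in the answer above.
    """
    escaped = (re.escape(ch) for ch in word)
    return re.compile(filler.join(escaped), re.IGNORECASE)

pattern = tolerant_pattern("email")

# Assumed sample: zero-width space, non-joiner, and joiner hidden in "email".
sample = "your e\u200bm\u200ca\u200dil account"

print(bool(pattern.search(sample)))               # hidden characters are skipped
print(bool(pattern.search("your email account"))) # plain text still matches
```

Because the filler also accepts visible punctuation and spaces, text such as "the mail" matches as well ("e", a space, then "mail"), so some false positives are likely with this approach.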

Answer 2:

Most invisible characters are just whitespace.
Whichever character set they're rendered in,
they're probably invisible.

If using a Unicode aware regex engine, you could probably just stick
in the whitespace class between the characters you're looking for.

If not, you could try spelling out the equivalent class [ ] explicitly.

\s =

 [\x{9}-\x{D}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]

Same, but without CR and LF:
[^\S\r\n] =

[\x{9}\x{B}-\x{C}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
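
As a rough illustration of sticking the whitespace class between the characters, here is a sketch in Python (an assumed choice; the answer doesn't name an engine). Python's `re` module has no `\x{...}` syntax, so the explicit class is rewritten with `\uXXXX` escapes. `whitespace_tolerant` and the sample strings are hypothetical.

```python
import re

# The explicit \s class from the answer, rewritten with Python's \uXXXX escapes.
WS = (r"[\u0009-\u000d\u001c-\u0020\u0085\u00a0\u1680"
      r"\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000]")

def whitespace_tolerant(phrase: str, ws_class: str = r"\s") -> re.Pattern:
    """Allow any run of `ws_class` between the non-space characters of `phrase`.

    Hypothetical helper, not part of the original answer.
    """
    letters = [re.escape(ch) for ch in phrase if not ch.isspace()]
    return re.compile(f"{ws_class}*".join(letters), re.IGNORECASE)

# Assumed sample: thin space, no-break space, and hair space hidden in the text.
sample = "all of yo\u2009ur\u00a0con\u200atacts"

print(bool(whitespace_tolerant("your contacts").search(sample)))      # using \s
print(bool(whitespace_tolerant("your contacts", WS).search(sample)))  # explicit class

# The zero-width space U+200B is not whitespace, so neither class skips it.
print(bool(whitespace_tolerant("your contacts").search("yo\u200bur contacts")))
```

If the emails turn out to use zero-width characters such as U+200B, the class would need to be extended with them, or the broader `[^\w]*` filler from Answer 1 used instead.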

Source: https://stackoverflow.com/questions/52936265/ignoring-invisible-characters-in-regex

Tags

regex

office365