Ignoring invisible characters in RegEx

北战南征 submitted on 2021-01-29 16:59:38

Question


I've run into a bit of a conundrum.

I am currently trying to build a regex to filter out some particularly nasty scam emails. I'm sure you've seen them before: they use a data dump from a compromised website to threaten to reveal intimate videos.

That's all well and good, except I noticed while testing the regex that some of these messages insert special invisible characters in the middle of words. You can see an example here (I've found it especially hard to find a place that preserves these special characters): Regexr link

I'm looking for a way to write a regex that ignores these characters altogether, as some emails have them and some don't. In the end, I'm trying to create a match with something like

/all (.*)your contacts/

Answer 1:


If there's a particular string you're trying to flag, you could do something like this:

Detect "email" with optional invisible characters: /e[^\w]?m[^\w]?a[^\w]?i[^\w]?l/

[^\w]? optionally matches one character that is not a letter, digit, or underscore, so it also lets through visible punctuation and spaces, not only invisible characters. You could use [^\w]* instead if you're seeing more than one invisible character between letters.
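As a minimal sketch of the idea above, assuming Python's re module and a hypothetical helper named tolerant_pattern (neither comes from the original answer), the per-letter pattern can be generated instead of typed by hand. The sample string uses zero-width spaces (U+200B) as the assumed invisible character; real scam emails may use others.

```python
import re

def tolerant_pattern(word, filler=r"[^\w]*"):
    """Build a regex source that matches `word` with optional filler between letters.

    `tolerant_pattern` and `filler` are illustrative names, not part of any library.
    """
    # re.escape guards against characters that are regex metacharacters.
    return filler.join(re.escape(ch) for ch in word)

pattern = re.compile(tolerant_pattern("email"))

plain = "your email account"
# Assumed obfuscation: a zero-width space (U+200B) between every letter.
obfuscated = "your e\u200bm\u200ba\u200bi\u200bl account"

print(bool(pattern.search(plain)))
print(bool(pattern.search(obfuscated)))
```

Because the filler is any non-word character, this also matches visibly separated spellings such as "e-m-a-i-l", which may or may not be what you want in a spam filter.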




Answer 2:


Most invisible characters are just whitespace.
Whichever character set they are rendered in,
they are probably invisible.

If you are using a Unicode-aware regex engine, you could probably just put
the whitespace class \s between the characters you're looking for.

If not, you could try the equivalent explicit character class, listed here.

\s =

 [\x{9}-\x{D}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]

The same, but without CR and LF:
[^\S\r\n] =

[\x{9}\x{B}-\x{C}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
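The whitespace-class approach can be sketched as follows, assuming Python's re module. Python does not accept the \x{...} notation, so the explicit class is rewritten here with \xhh and \uXXXX escapes; WS_CLASS and with_gaps are hypothetical names used only for illustration.

```python
import re

# The explicit whitespace class from the answer, in Python escape syntax.
WS_CLASS = (
    r"[\x09-\x0D\x1C-\x20\x85\xA0\u1680\u2000-\u200A"
    r"\u2028-\u2029\u202F\u205F\u3000]"
)

def with_gaps(word, gap):
    """Allow zero or more `gap` characters between the letters of `word`."""
    return (gap + "*").join(re.escape(ch) for ch in word)

unicode_ws = re.compile(with_gaps("contacts", r"\s"))
explicit_ws = re.compile(with_gaps("contacts", WS_CLASS))

sample_nbsp = "con\u00a0tacts"  # no-break space, a whitespace character
sample_zwsp = "con\u200btacts"  # zero-width space, NOT whitespace in Python's re

print(bool(unicode_ws.search(sample_nbsp)))
print(bool(explicit_ws.search(sample_nbsp)))
print(bool(unicode_ws.search(sample_zwsp)))
```

One caveat worth checking against your own samples: in Python, zero-width characters such as U+200B are not matched by \s and are not in the explicit class either, so if the emails use those, a wider gap class (for example the [^\w] from Answer 1) may be needed.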


Source: https://stackoverflow.com/questions/52936265/ignoring-invisible-characters-in-regex
