How to escape unicode string for regular expressions?

风格不统一 submitted on 2019-12-22 10:42:34

Question


I need to build an re pattern from a Unicode string (e.g. I have "word", and I need something like ^"word"| "word"). However, "word" can contain special regex characters. To match "word" literally, I need to escape those special characters in the Unicode string. The basic re.escape() function does the job for ASCII strings. How can I do this for Unicode?


Answer 1:


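re.escape() works on Unicode strings too. In Python versions before 3.7 it inserts a backslash before every character that is not an ASCII alphanumeric (or, since 3.3, an underscore); from 3.7 on it escapes only characters that can have special meaning in a regular expression. The older behaviour can insert a multitude of unnecessary backslashes, but the regex engine treats a backslash followed by a non-alphanumeric character as that literal character, so there is no big harm done (except possibly some performance penalty).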
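As a rough illustration of the above, here is a minimal sketch that builds the pattern from the question with re.escape() on a non-ASCII word. The helper name build_pattern and the sample word are made up for this example, and `(?:^| )` is just a compact way of writing the `^"word"| "word"` alternation.

```python
import re

def build_pattern(word):
    # Hypothetical helper: match `word` literally, either at the start
    # of the text or right after a space (the pattern from the question).
    return re.compile(r"(?:^| )" + re.escape(word))

# Made-up sample word mixing non-ASCII letters with regex metacharacters.
word = "naïve+(café)"
pattern = build_pattern(word)

print(bool(pattern.search("naïve+(café) is here")))  # True: word at the start
print(bool(pattern.search("say naïve+(café)")))      # True: word after a space
print(bool(pattern.search("xnaïve+(café)")))         # False: no start or space before it
```

The `+`, `(` and `)` in the word are matched as plain characters, and the accented letters need no special handling.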

But if you want to build a stricter escape(), you can:

import re

def escape(s):
    # Backslash-escape only the regex metacharacters; \g<0> is the matched character.
    return re.sub(r"[(){}\[\].*?|^$\\+-]", r"\\\g<0>", s)

which only touches the actual regex metacharacters. I sure hope I didn't miss any :)
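One way to gain some confidence in a hand-rolled escape() is to check that the escaped form of a string matches exactly that string. The sketch below repeats the function so it runs on its own; the round_trips helper and the sample strings are invented for the check. It also shows one case the character list does not cover: under re.VERBOSE, whitespace in a pattern is ignored unless escaped, which re.escape() handles and this escape() does not.

```python
import re

def escape(s):
    # Same function as in the answer, repeated so this snippet is self-contained.
    return re.sub(r"[(){}\[\].*?|^$\\+-]", r"\\\g<0>", s)

def round_trips(s):
    # Hypothetical check: the escaped pattern must match the original string in full.
    return re.fullmatch(escape(s), s) is not None

# Made-up samples covering each metacharacter plus some non-ASCII text.
samples = ["a.b", "price: $5 (approx.)", "C:\\temp\\x", "[日本語]+?", "x{2}|y^z-w"]
for s in samples:
    assert round_trips(s), s

# The escaped dot must no longer act as a wildcard.
assert re.fullmatch(escape("a.b"), "axb") is None

# Limitation: in verbose mode the unescaped space is dropped from the pattern.
assert re.compile(escape("a b"), re.VERBOSE).fullmatch("a b") is None
assert re.compile(re.escape("a b"), re.VERBOSE).fullmatch("a b") is not None
```

So the stricter version looks fine for ordinary patterns, but re.escape() is the safer choice if the result might end up inside a verbose-mode pattern.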



Source: https://stackoverflow.com/questions/38505706/how-to-escape-unicode-string-for-regular-expressions
