Escaping regex unicode string in Python

Submitted by 我是研究僧i on 2021-02-04 21:22:02

Question


I have a user-defined string. I want to use it in a regex with one small improvement: match any of three apostrophe characters instead of just one. For example:

APOSTROPHES = re.escape('\'\u2019\u02bc')
word = re.escape("п'ять")
word = ''.join([s if s not in APOSTROPHES else '[%s]' % APOSTROPHES for s in word])

It works well for Latin text, but for Unicode the list comprehension gives the following string: "[\\'\\\\u2019\\\\u02bc]\xd0[\\'\\\\u2019\\\\u02bc]\xbf[\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8f[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x82[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8c"

It looks like it finds the backslashes in both strings and then substitutes APOSTROPHES for them.

Also, print(list(w for w in APOSTROPHES)) gives ['\\', "'", '\\', '\\', 'u', '2', '0', '1', '9', '\\', '\\', 'u', '0', '2', 'b', 'c'].

How can I avoid it? I want to get "\п[\'\u2019\u02bc]\я\т\ь"


Answer 1:


As I understand it, you want to create a regular expression that matches a given word written with any kind of apostrophe.

A regex that matches any apostrophe can be defined as a character class:

APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'
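A quick way to check this character class is to try it against each apostrophe on its own. This is a minimal sketch assuming Python 3.3 or later, where the re module itself interprets the \uXXXX escapes inside a raw str pattern; the `candidates` list and the backtick counter-example are only illustrative.

```python
import re

# Character class from the answer. The raw string keeps the backslashes,
# and the re module resolves \u2019 and \u02bc when compiling the pattern.
APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'

# Three apostrophes that should match, plus a backtick that should not.
candidates = ["'", "\u2019", "\u02bc", "`"]
matches = [bool(re.fullmatch(APOSTROPHES_REGEX, c)) for c in candidates]
print(matches)  # [True, True, True, False]
```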

For instance, you have this (Ukrainian?) word which contains a single quote:

word = "п'ять"

EDIT: If your word contains another kind of apostrophe, you can normalize it, like this:

# Replace with a plain "'"; a replacement of r"\'" would insert a literal backslash.
word = re.sub(APOSTROPHES_REGEX, "'", word, flags=re.UNICODE)

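To build the regex, escape the word with re.escape, because a user-defined string can contain regex metacharacters such as "." or "*". How the single quote "'" comes out depends on the Python version: before Python 3.7, re.escape turns it into r"\'"; from Python 3.7 on, it is left unescaped.

You can replace the apostrophe, in whichever form re.escape produced it, with your apostrophe regex:

import re
word_regex = re.escape(word)
# re.escape("'") is r"\'" before Python 3.7 and "'" from 3.7 on
escaped_apostrophe = re.escape("'")
word_regex = word_regex.replace(escaped_apostrophe, APOSTROPHES_REGEX)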

The new RegEx can then be used to match the same word with any apostrophe:

assert re.match(word_regex, "п'ять")  # '
assert re.match(word_regex, "п’ять")  # \u2019
assert re.match(word_regex, "пʼять")  # \u02bc
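The list comprehension from the question can also be repaired directly. It went wrong because it tested membership against the already-escaped APOSTROPHES string, which contains backslashes, while iterating over an escaped word that also contains backslashes. The sketch below tests against the raw apostrophe characters instead and escapes each remaining character individually. The names `APOSTROPHE_CHARS`, `APOSTROPHE_CLASS` and `build_apostrophe_regex` are hypothetical, not from the original thread, and Python 3 is assumed.

```python
import re

# Raw (unescaped) apostrophes, used only for membership tests.
APOSTROPHE_CHARS = "'\u2019\u02bc"
# Character class that matches any one of them.
APOSTROPHE_CLASS = "[" + re.escape(APOSTROPHE_CHARS) + "]"


def build_apostrophe_regex(word):
    """Return a pattern matching `word` with any apostrophe variant."""
    return "".join(
        APOSTROPHE_CLASS if ch in APOSTROPHE_CHARS else re.escape(ch)
        for ch in word  # iterate over the raw word, not the escaped one
    )


pattern = build_apostrophe_regex("п'ять")
for candidate in ("п'ять", "п\u2019ять", "п\u02bcять"):
    assert re.fullmatch(pattern, candidate)
```

Because every apostrophe in the input is replaced by the class, this variant needs no normalization step, and it does not depend on how a given Python version's re.escape treats the single quote.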

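Note: on Python 2, don't forget to pass the re.UNICODE flag; it makes character classes such as r"\w" match Unicode characters. On Python 3 this is already the default for str patterns, so the flag is redundant there.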
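To see what the flag changes for r"\w", here is a small check, assuming Python 3, where str patterns are Unicode-aware unless re.ASCII is passed. The variable names are only illustrative.

```python
import re

text = "п'ять"

# Default on Python 3: \w matches Unicode word characters, so the
# Cyrillic letters match and the apostrophe splits the word in two.
default_words = re.findall(r"\w+", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so nothing matches.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(default_words)  # ['п', 'ять']
print(ascii_words)    # []
```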



Source: https://stackoverflow.com/questions/40626458/escaping-regex-unicode-string-in-python
