Regex to match all unicode quotation marks

爱一瞬间的悲伤 2020-12-12 04:54

Is there a simple regular expression to match all unicode quotes? Or does one have to hand-code it like this:

quotes = ur"[\"'\u2018\u2019\u201c\u201


        
2 Answers
  • 2020-12-12 05:17

    Quotation marks will often have the Unicode category Pi (punctuation, initial quote) or Pf (Punctuation, final quote). You'll have to handle the "neutral" quotation marks ' and " manually.
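The category idea above can be sketched in Python 3 without property support in the regex engine, by asking the standard `unicodedata` module for each code point's general category and building the character class from the result. This is only a sketch: the names `quote_chars` and `QUOTE_RE` are made up for illustration, and Pi/Pf are an approximation of "quotation mark" (low-9 quotes such as U+201E fall in category Ps and are not picked up, while a few Pi/Pf characters are brackets rather than quotes).

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is Pi (initial quote)
# or Pf (final quote).
quote_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) in ("Pi", "Pf")
]

# The neutral ASCII quotes are category Po, so add them by hand.
quote_chars += ['"', "'"]

# re.escape keeps every character literal inside the character class.
QUOTE_RE = re.compile("[" + "".join(re.escape(c) for c in quote_chars) + "]")

print(QUOTE_RE.findall('\u201cHi\u201d, \u2018there\u2019 "you"'))
```

Scanning the whole code space once at import time is cheap enough for most scripts; the compiled pattern can then be reused.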

  • 2020-12-12 05:28

    Python's built-in re module doesn't support Unicode property classes such as \p{Pi} and \p{Pf}, so you can't use those categories directly in a pattern. I guess your solution is as good as it gets.

    You might also want to consider the "false quotation marks" that are sadly being used: the acute and grave accents (´ and `), i.e. \u00B4 and \u0060.

    Then there are guillemets (« » ‹ ›). Do you want those, too? Use \u00BB\u203A\u00AB\u2039 for those.

    Also, your pattern has a little bug: because it is a raw string, the backslash in \" is kept and ends up in the quotes string. Use a triple-quoted raw string instead, so the double quote needs no escaping.

    >>> quotes = ur"[\"'\u2018\u2019\u201c\u201d\u0060\u00b4]"
    >>> "\\" in quotes
    True
    >>> quotes
    u'[\\"\'\u2018\u2019\u201c\u201d`\xb4]'
    >>> quotes = ur"""["'\u2018\u2019\u201c\u201d\u0060\u00b4]"""
    >>> "\\" in quotes
    False
    >>> quotes
    u'["\'\u2018\u2019\u201c\u201d`\xb4]'
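The session above is Python 2; the `ur` prefix is rejected by Python 3. A rough Python 3 equivalent, assuming you want the curly quotes, the two accents, and the guillemets mentioned in this answer, might look like the following. The name `QUOTES` and the sample text are only illustrative.

```python
import re

# A normal (non-raw) string: Python itself turns each \uXXXX escape into
# the character, so no stray backslash ends up in the pattern.
QUOTES = re.compile(
    "["
    "\"'"                        # neutral ASCII quotes
    "\u2018\u2019\u201c\u201d"   # curly single and double quotes
    "\u0060\u00b4"               # grave and acute accents used as quotes
    "\u00ab\u00bb\u2039\u203a"   # guillemets
    "]"
)

# Example use: replace every quote-like character with a plain double quote.
sample = "\u00abBonjour\u00bb, \u201chello\u201d, `hi\u00b4"
print(QUOTES.sub('"', sample))
```

Listing the characters explicitly keeps the set small and predictable, at the cost of missing quotation marks you did not think of.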
    