Python regex, remove all punctuation except hyphen for unicode string

南旧 2020-12-05 05:14

I have this code for removing all punctuation from a Unicode string:

import regex as re  # third-party regex module; the stdlib re module has no \p{...}
re.sub(r"\p{P}+", "", txt)  # written as ur"\p{P}+" in Python 2

How would I chan

3 Answers
  • 2020-12-05 05:17
    [^\P{P}-]+
    

    \P is the complement of \p, so \P{P} matches anything that is not punctuation. The negated class [^\P{P}-] therefore matches any character that is neither non-punctuation nor a hyphen, which leaves every punctuation character except the hyphen.

    Example: http://www.rubular.com/r/JsdNM3nFJ3

    If you want a less convoluted pattern, an alternative is \p{P}(?<!-): match any punctuation character, then use a negative lookbehind to check that the character just matched was not a hyphen.
    Working example: http://www.rubular.com/r/5G62iSYTdk
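The double negation in that character class can be spelled out with the standard library alone. This is a minimal sketch, not the answer's own code: the helper names `is_punct` and `strip_punct_keep_hyphen` are made up for illustration, and it assumes that "punctuation" means the Unicode general categories starting with `P`, which is what `\p{P}` refers to.

```python
import unicodedata

def is_punct(ch):
    # Unicode punctuation categories are Pc, Pd, Ps, Pe, Pi, Pf and Po.
    return unicodedata.category(ch).startswith("P")

def strip_punct_keep_hyphen(text):
    # Same logic as the class [^\P{P}-]: a character survives if it is
    # "not punctuation" or if it is the hyphen-minus itself.
    return "".join(ch for ch in text if not is_punct(ch) or ch == "-")

print(strip_punct_keep_hyphen("«héllo», wörld - foo-bar!"))
# héllo wörld - foo-bar
```

Characters such as `$`, `+` and `^` are classified as symbols (categories `Sc`, `Sm`, `Sk`) rather than punctuation, so a filter based on category `P` leaves them in place.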

  • 2020-12-05 05:24

    Here's how to do it with the re module, in case you have to stick with the standard libraries:

    # works in python 2 and 3
    import re
    import string
    
    remove = string.punctuation
    remove = remove.replace("-", "") # don't remove hyphens
    pattern = r"[{}]".format(re.escape(remove)) # escape so that ], \ and ^ stay literal inside the class
    
    txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
    re.sub(pattern, "", txt) 
    # >>> 'this - is - a - test'
    

    If performance matters, you may want to use str.translate, since it's faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
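A minimal sketch of the `str.translate` variant in Python 3, reusing the sample string from the answer above. It uses `str.maketrans` with a third argument (the characters to delete) instead of the dict comprehension; the two should be equivalent. Note that `string.punctuation` only lists ASCII punctuation, so non-ASCII marks such as curly quotes are left untouched.

```python
import string

# Characters to delete: ASCII punctuation without the hyphen.
remove = string.punctuation.replace("-", "")

# The third argument of str.maketrans lists characters mapped to None.
table = str.maketrans("", "", remove)

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
cleaned = txt.translate(table)
print(cleaned)
# this - is - a - test
```

Building the table once and reusing it for many strings is where this approach tends to pay off.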

  • 2020-12-05 05:42

    You could either list the punctuation you want to remove by hand, as in [._,], or pass a function instead of a replacement string:

    # re is the regex module here (import regex as re); the stdlib re has no \p{P}
    re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)
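The replacement-function idea can also be tried with the standard `re` module, which does not support `\p{P}`. The sketch below rests on two assumptions that are not in the answer: the candidate pattern `[^\w\s]|_` is only a rough stand-in for "might be punctuation", and the hypothetical callback `keep_hyphen` makes the real decision by checking the Unicode category.

```python
import re
import unicodedata

def keep_hyphen(match):
    ch = match.group(0)
    # Delete Unicode punctuation other than the hyphen-minus;
    # return anything else (hyphens, symbols such as "$") unchanged.
    if ch != "-" and unicodedata.category(ch).startswith("P"):
        return ""
    return ch

# Assumed candidate pattern: non-word, non-space characters, plus the underscore.
CANDIDATE = re.compile(r"[^\w\s]|_")

text = "well-known: “quoted” text, $5 only!"
result = CANDIDATE.sub(keep_hyphen, text)
print(result)
# well-known quoted text $5 only
```

Because the callback returns the matched character when it should be kept, the hyphen and the dollar sign pass through unchanged.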
    