Python regex issue with Unicode (Japanese) characters

南旧 2020-12-18 03:41

I want to remove part of the string below (the part to remove was shown in bold in the original post). The string is stored in the variable oldString.

[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY

2 Answers
  • 2020-12-18 04:21

    You can use the following snippet to solve the issue:

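    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import re

    # Named `text` rather than `str` so the built-in type is not shadowed.
    text = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
    # Japanese characters plus a space, only when a romanized name and an en dash follow.
    regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)'
    p = re.compile(regex, re.U)
    match = p.sub(u"", text)
    # Python 2: encode the unicode result explicitly before printing.
    print match.encode("UTF-8")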
    

    See IDEONE demo

    Besides the # -*- coding: utf-8 -*- declaration, I have added @nhahtdh's character class to match Japanese characters.

    Note that the result (match) needs to be encoded as a UTF-8 byte string "manually" before printing, since Python 2 needs to be "reminded" that we are working with Unicode all the time and may otherwise fall back to the ASCII codec on output.
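For readers on Python 3, the same approach might look like the sketch below. It is not part of the original answer: the names JAPANESE, strip_leading_japanese, old_string and new_string are made up for illustration, and it assumes the string always has the shape shown in the question (Japanese name, space, romanized name, en dash). In Python 3 every str is already Unicode, so the u prefix, the re.U flag and the manual encode step are no longer needed.

```python
import re

# Character class taken from the answer: CJK punctuation, Hiragana, Katakana,
# full-width/half-width forms, CJK Unified Ideographs and Extension A.
JAPANESE = r'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]'

# A run of Japanese characters plus one space, matched only when it is
# followed by a romanized name and an en dash. The lookahead is not consumed,
# so the romanized name stays in the result.
pattern = re.compile(JAPANESE + r'+ (?=[A-Za-z ]+–)')


def strip_leading_japanese(text):
    """Hypothetical helper: drop the Japanese name that precedes the romanized one."""
    return pattern.sub('', text)


old_string = '[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
new_string = strip_leading_japanese(old_string)
print(new_string)  # [DMSM-8433] Kago Ai – 加護亜依 vs. FRIDAY
```

The second 加護亜依 is kept because the text after it ("vs. FRIDAY") contains a period before any en dash, so the lookahead fails there.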

  • 2020-12-18 04:41

    I think you should use a regular expression like this one:

    ([\p{Hiragana}\p{Katakana}\p{Han}]+)
    

    Please also refer to this documentation.

    EDIT: I also tested it here.
