Python \\ufffd after replacement with Chinese content

匿名 (未验证) 提交于 2019-12-03 01:45:01

问题:

After we found the answer to this question we are faced with next unusual replacement behavior:

Our regex is:

[\\((\\[{【]+(\\w+|\\s+|\\S+|\\W+)?[)\\)\\]}】]+ 

We are trying to match all content inside any type of brackets including the brackets The original text is:

物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类)) 

The result is:

The code for the replacement is:

 delimiter = ' '  if localization == 'CN':         delimiter = ''   p = re.compile(codecs.encode(unicode(regex), "utf-8"), flags=re.I)   columnString = (p.sub(delimiter, columnString).strip() 

Same problem we are faced when we used regex:

(\\d*[满|元])  print repr(columnString)='\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'  print repr(regex)=u'[\\(\uff08\\[{\u3010]+(\\w+|\\s+|\\S+|\\W+)?[\uff09\\)\\]}\u3011]+'  print repr(p.pattern)='[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+' 

回答1:

You should not mix UTF-8 and regular expressions. Process all your text as Unicode. Make sure you decoded both the regex and the input string to unicode values first:

When using UTF-8 in your pattern consists of separate bytes for the codepoint:

>>> '【' '\xe3\x80\x90' 

which means your character class matches any of those bytes; \xe3, or \x80 or \x90 are each separately valid bytes in that character class.



回答2:

In [1]: import re    ...: subject = '物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))'.decode('utf-8')    ...: reobj = re.compile(r"[\((\[{【]+(\w+|\s+|\S+|\W+)?[)\)\]}】]+", re.IGNORECASE | re.MULTILINE)    ...: result = reobj.sub("", subject)    ...: print result    ...:  物理化学名校考研真题详解  


标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!