Replace all hyphen types by the ascii hyphen “-”

淺唱寂寞╮ 提交于 2019-12-10 14:54:16

问题


Is there a way to replace all types of hyphens by the simple ascii "-"? I am looking for something like this that works for spaces:

txt = re.sub(r'[\s]+',' ',txt)

I believe that some non-ascii "-" hyphens are avoiding the correct process of removing some specific stopwords (name of projects that are connected by hyphens):

I want to replace this AR–L1003' for instance by AR-L1003, but I want to do this for the entire text.


回答1:


You can just list those hyphens in a class. Here is one possible list -- extend it to your needs:

txt = re.sub(r'[‐᠆﹣-⁃−]+','-',txt)

The standard re library does not support the \p syntax for matching unicode categories, but if you can import regex, then it is possible:

import regex

txt = regex.sub(r'\p{Pd}+', '-', txt)


来源:https://stackoverflow.com/questions/53750961/replace-all-hyphen-types-by-the-ascii-hyphen

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!