Regex for a (twitter-like) hashtag that allows non-ASCII characters

Question

I want a regex that matches a simple hashtag like those on Twitter (e.g. #someword). It should also recognize non-ASCII characters (such as those in Spanish, Hebrew, or Chinese).

This was my initial regex: (^|\s|\b)(#(\w+))\b
It doesn't recognize non-ASCII characters, because \w in JavaScript matches only [A-Za-z0-9_].
Then I tried XRegExp.js, which worked but ran too slowly.
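
For illustration, here is a minimal sketch of how the initial regex behaves on ASCII and non-ASCII input. The sample strings are assumptions chosen for the demo, not taken from the original post; the non-ASCII characters are written as escapes so the precomposed forms are unambiguous.

```javascript
// The regex from the question, unchanged.
const original = /(^|\s|\b)(#(\w+))\b/;

// Plain ASCII tag: the whole word is captured in group 3.
const ascii = "#someword".match(original);
console.log(ascii[3]); // "someword"

// "#mañana" (ñ = U+00F1): \w stops at the first non-ASCII letter.
const spanish = "#ma\u00F1ana".match(original);
console.log(spanish[3]); // "ma"

// "#שלום": no character after # is matched by \w, so there is no match at all.
const hebrew = "#\u05E9\u05DC\u05D5\u05DD".match(original);
console.log(hebrew); // null
```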

Any suggestions for how to do it?

Answer 1:

Eventually I found this useful: twitter-text.js, which is basically how Twitter itself solves this problem.

Answer 2:

With native JS regexes, which don't support Unicode character classes, your only option is to explicitly enumerate the characters that can end the tag and match everything else, for example:

> s = "foo #הַתִּקְוָה. bar"
"foo #הַתִּקְוָה. bar"
> s.match(/#(.+?)(?=[\s.,:,]|$)/)
["#הַתִּקְוָה", "הַתִּקְוָה"]

The character class [\s.,:,] (the second comma is redundant) should list whitespace, punctuation, and whatever else you consider a terminating symbol.
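
If the runtime supports ES2018 Unicode property escapes (an assumption about your environment: \p{...} only works together with the u flag), the tag characters can be described directly instead of enumerating terminators. This is a minimal sketch rather than a replacement for the approach above; the name extractHashtags and the choice of letters, combining marks, numbers, and underscore as tag characters are illustrative assumptions.

```javascript
// Assumes an ES2018+ engine; \p{...} requires the `u` flag.
// \p{L} = letters, \p{M} = combining marks (e.g. Hebrew vowel points),
// \p{N} = numbers. The leading (?:^|\s) requires the # to start a word.
const HASHTAG = /(?:^|\s)#([\p{L}\p{M}\p{N}_]+)/gu;

// Hypothetical helper, for illustration only: returns tag names without the #.
function extractHashtags(text) {
  return Array.from(text.matchAll(HASHTAG), (m) => m[1]);
}

console.log(extractHashtags("foo #ma\u00F1ana. bar")); // ["mañana"]
```

Because the tag is defined by what it may contain, punctuation such as the trailing period ends the match without having to be listed.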

Answer 3:

#([^#]+)[\s,;]*

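Explanation: This regular expression matches a # followed by one or more non-# characters, followed by zero or more spaces, commas, or semicolons. Because [^#]+ is greedy, it already consumes any whitespace or punctuation after the tag, which is why the matches can include a trailing space.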

var input = "#hasta #mañana #babהַ";
var matches = input.match(/#([^#]+)[\s,;]*/g);

Result:

["#hasta ", "#mañana ", "#babהַ"]
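
The matches above keep their trailing delimiter. A minimal sketch of one way to get clean tag names is to read the capture group and strip trailing spaces, commas, and semicolons afterwards. The sample input and the variable names are assumptions for illustration. Note that [^#]+ runs up to the next #, so ordinary words following a tag would still be captured with it.

```javascript
// Sample input (assumed): tags separated by the delimiters the regex mentions.
const sample = "#hasta, #ma\u00F1ana; #caf\u00E9";

// Capture everything up to the next '#', then strip trailing delimiters.
const tags = Array.from(sample.matchAll(/#([^#]+)/g), (m) =>
  m[1].replace(/[\s,;]+$/, "")
);

console.log(tags); // ["hasta", "mañana", "café"]
```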

EDIT - Replaced \b for word boundary

Source: https://stackoverflow.com/questions/16941861/regex-for-a-twitter-like-hashtag-that-allows-non-ascii-characters

Tags

javascript

regex

twitter

hashtag

unicode-string
