What characters are allowed in twitter hashtags?

Submitted by 血红的双手。 on 2019-11-29 01:24:30

Karl, as you've rightly pointed out, any word in any language can be a valid Twitter hashtag (as long as it meets a number of basic criteria). As such, what you are asking for is a list of valid international word characters. I'm sure someone has compiled such a list somewhere, but using it would not be the most efficient way to reach what appears to be your initial goal: ensuring that a given hashtag is valid for Twitter.

I believe what you are looking for is a regular expression that can match all word characters within a Unicode range. Such an expression would not be dependent on your locale and would match all characters in modern typography that can appear as part of a word.

You didn't specify what language you are writing your app in, so I can't help you with a language-specific implementation. However, the basic approach would be as follows:

  1. Check if any of the bracket expressions or character classes already support Unicode character ranges in your language. If yes, then use them.

  2. Check if there is a regex modifier that can enable Unicode character range support for your language.

Most modern languages implement regular expressions in a fairly similar way, and a lot of them borrow heavily from Perl, so I hope the following two examples will put you on the right track:

Perl:

Use POSIX bracket expressions (e.g. [[:alpha:]], [[:alnum:]], [[:digit:]], etc.) as they give you greater control over the characters you want to match, compared to character classes (e.g. \w).

Use the /u modifier to enable Unicode support when pattern matching. Under this modifier, the ASCII platform effectively becomes a Unicode platform; hence, for example, \w will match any of the more than 100,000 word characters in Unicode.

See the Perl documentation for more info.

Ruby:

Use POSIX bracket expressions as they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
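To make the difference concrete, here is a small sketch in Ruby. The helper names are made up for illustration, and the non-ASCII digit is written as an escape so the snippet does not depend on the file's encoding.

```ruby
# \d is limited to ASCII 0-9, while [[:digit:]] covers Unicode category Nd.
# Both helper names below are hypothetical, chosen only for this example.
def matches_backslash_d?(char)
  char.match?(/\A\d\z/)
end

def matches_posix_digit?(char)
  char.match?(/\A[[:digit:]]\z/)
end

arabic_seven = "\u0667" # ARABIC-INDIC DIGIT SEVEN, Unicode category Nd

puts matches_backslash_d?("7")           # true
puts matches_posix_digit?("7")           # true
puts matches_backslash_d?(arabic_seven)  # false
puts matches_posix_digit?(arabic_seven)  # true
```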

See the Ruby documentation for more info.

Examples:

Given a list of hashtags, the following regex will match all hashtags that start with a letter (including international letters) and are followed by one or more letters, digits, or underscores:

    m/^#[[:alpha:]][[:alnum:]_]+$/u     # Perl

    /^#[[:alpha:]][[:alnum:]_]+$/       # Ruby
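As a rough usage sketch, the Ruby pattern above could be applied to a list of candidates like this. The helper name and the sample tags are my own, and I have swapped ^ and $ for \A and \z, because in Ruby ^ and $ match at every line break and would let a multi-line string through.

```ruby
# Sketch: validate candidate hashtags with the Ruby pattern shown above.
# \A and \z anchor to the whole string rather than to individual lines.
HASHTAG_PATTERN = /\A#[[:alpha:]][[:alnum:]_]+\z/

# Hypothetical helper; returns true when the whole string is a valid tag
# according to this pattern (not necessarily according to Twitter itself).
def valid_hashtag?(tag)
  tag.match?(HASHTAG_PATTERN)
end

# Non-ASCII samples are written as escapes: "caf\u00E9" is "cafe" with an
# acute accent, and the third tag is three CJK ideographs.
candidates = ["#ruby", "#caf\u00E9", "#\u6771\u4EAC\u90FD", "#2019", "#foo[bar", "##hash"]

candidates.each do |tag|
  puts "#{tag}: #{valid_hashtag?(tag) ? 'valid' : 'rejected'}"
end
```

The first three candidates pass; the last three are rejected because they start with a digit, contain a bracket, or contain a second #. Note that the pattern needs at least two characters after the #, so a one-letter tag such as #a is rejected as well.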

Twitter allows letters, numbers, and underscores.

I checked this by generating tweets via their API. For example, tweeting

Hash tag test #foo[bar

resulted in "#foo" being marked as a hash tag, and "[bar" being unformatted text.
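The observed behaviour can be approximated with a short Ruby sketch. This is only a guess at the rule based on the test above, not Twitter's actual parser, and extract_hashtags is a made-up name.

```ruby
# Hypothetical helper: pull hashtag-like tokens out of tweet text.
# A tag is taken to be "#" followed by letters, digits, or underscores,
# so it ends at the first character outside that set.
def extract_hashtags(text)
  text.scan(/#[[:alnum:]_]+/)
end

p extract_hashtags("Hash tag test #foo[bar")  # prints ["#foo"]
```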

StackExchange User

Well, for starters you can't use a # in the hashtag (##hash).

The guidelines below are quoted from Twitter's help center:

  • People use the hashtag symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize those Tweets and help them show more easily in Twitter Search.
  • Clicking on a hashtagged word in any message shows you all other Tweets marked with that keyword.
  • Hashtags can occur anywhere in the Tweet – at the beginning, middle, or end.
  • Hashtagged words that become very popular are often Trending Topics.
    Example: In the Tweet below, @eddie included the hashtag #FF. Users created this as shorthand for "Follow Friday," a weekly tradition where users recommend people that others should follow on Twitter. You'll see this on Fridays.

Using hashtags correctly:

  • If you Tweet with a hashtag on a public account, anyone who does a search for that hashtag may find your Tweet
  • Don't #spam #with #hashtags. Don't over-tag a single Tweet. (Best practices recommend using no more than 2 hashtags per Tweet.)
  • Use hashtags only on Tweets relevant to the topic.

I had to solve the same problem in Go (golang). It seems that [[:alpha:]] there matches only the English alphabet, so this syntax could not be used for characters from other languages. Instead, I used \p{L} for this purpose.

My test with \p{L} is here. (Arabic, Hebrew, Hindi, etc. have not been confirmed yet.)
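For comparison, here is the same \p{L} idea sketched in Ruby rather than Go, to stay in one language with the other examples on this page. It is not the Go test mentioned above; the pattern, the helper name, and the sample tags are my own assumptions.

```ruby
# Sketch: a hashtag pattern built from Unicode property classes.
# \p{L} is any letter and \p{Nd} any decimal digit; the trailing "u"
# fixes the pattern's encoding to UTF-8.
LETTER_HASHTAG = /\A#\p{L}[\p{L}\p{Nd}_]+\z/u

# Hypothetical helper name, used only for this example.
def letter_hashtag?(tag)
  tag.match?(LETTER_HASHTAG)
end

japanese = "#\u65E5\u672C\u8A9E"                    # three kanji
cyrillic = "#\u043F\u0440\u0438\u0432\u0435\u0442"  # six Cyrillic letters

puts letter_hashtag?(japanese)    # true
puts letter_hashtag?(cyrillic)    # true
puts letter_hashtag?("#foo-bar")  # false
```

One caveat: combining marks such as Devanagari vowel signs are in category M, not L, so tags in those scripts will probably need \p{M} added to the class. I have not checked that against Twitter itself.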

osakasaul

Only letters and numbers are allowed to be part of a hashtag. If a character other than these follows the leading # and a letter or number, the hashtag will be cut off at this point.

I would recommend that your user interface indicate this to the user by changing the text color of the input field if the user enters anything other than a letter or number.
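A minimal sketch of the check behind such an indicator, in Ruby. The function name is hypothetical, and it follows this answer's rule of letters and numbers only; add _ to the class if underscores should be accepted, as an earlier answer reports.

```ruby
# Hypothetical helper for input validation: returns the index of the first
# character that is neither a letter nor a digit, or nil if there is none.
# The argument is the tag text without the leading "#".
def first_invalid_index(tag_body)
  tag_body.index(/[^[:alnum:]]/)
end

input  = "foo[bar"
bad_at = first_invalid_index(input)
puts bad_at ? "invalid character at position #{bad_at}" : "ok"
```

A UI could switch the field's text color whenever the result is not nil, and use the index to point at the offending character.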
