I need to take a string, and shorten it to 140 characters.
Currently I am doing:
if len(tweet) > 140:
tweet = re.sub(r\"\\s+\", \" \", tweet)
What you're looking for is Chinese word segmentation tools. Word segmentation is not an easy task and is currently not perfectly solved. There are several tools:
CkipTagger
Developed by Academia Sinica, Taiwan.
jieba
Developed by Sun Junyi, a Baidu engineer.
pkuseg
Developed by Language Computing and Machine Learning Group, Peking University
If what you want is character segmentation, it can be done albeit not very useful.
>>> s = u"简讯:新華社報道,美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域,預計約30分鐘後抵達浦東國際機場,開展他上任後首次訪華之旅。"
>>> chars = list(s)
>>> chars
[u'\u7b80', u'\u8baf', u'\uff1a', u'\u65b0', u'\u83ef', u'\u793e', u'\u5831', u'\u9053', u'\uff0c', u'\u7f8e', u'\u570b', u'\u7e3d', u'\u7d71', u'\u5967', u'\u5df4', u'\u99ac', u'\u4e58', u'\u5750', u'\u7684', u'\u300c', u'\u7a7a', u'\u8ecd', u'\u4e00', u'\u865f', u'\u300d', u'\u5c08', u'\u6a5f', u'\u665a', u'\u4e0a', u'1', u'0', u'\u6642', u'4', u'2', u'\u5206', u'\u9032', u'\u5165', u'\u4e0a', u'\u6d77', u'\u7a7a', u'\u57df', u'\uff0c', u'\u9810', u'\u8a08', u'\u7d04', u'3', u'0', u'\u5206', u'\u9418', u'\u5f8c', u'\u62b5', u'\u9054', u'\u6d66', u'\u6771', u'\u570b', u'\u969b', u'\u6a5f', u'\u5834', u'\uff0c', u'\u958b', u'\u5c55', u'\u4ed6', u'\u4e0a', u'\u4efb', u'\u5f8c', u'\u9996', u'\u6b21', u'\u8a2a', u'\u83ef', u'\u4e4b', u'\u65c5', u'\u3002']
>>> print('/'.join(chars))
简/讯/:/新/華/社/報/道/,/美/國/總/統/奧/巴/馬/乘/坐/的/「/空/軍/一/號/」/專/機/晚/上/1/0/時/4/2/分/進/入/上/海/空/域/,/預/計/約/3/0/分/鐘/後/抵/達/浦/東/國/際/機/場/,/開/展/他/上/任/後/首/次/訪/華/之/旅/。