Python: Split unicode string on word boundaries

清酒与你 2020-12-31 13:16

I need to take a string, and shorten it to 140 characters.

Currently I am doing:

import re

if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet)  # collapse each run of whitespace into one space


        
9 Answers
  •  半阙折子戏
    2020-12-31 13:42

    What you're looking for is a Chinese word segmentation tool. Word segmentation is a hard problem and has no perfect solution yet. Several tools are available:

    1. CkipTagger

      Developed by Academia Sinica, Taiwan.

    2. jieba

      Developed by Sun Junyi, a Baidu engineer.

    3. pkuseg

      Developed by the Language Computing and Machine Learning Group, Peking University.
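
As a rough sketch of how any of these tools could serve the original goal (shortening to 140 characters without cutting a word), the hypothetical helper `truncate_on_tokens` below takes a `segment` callable that returns a list of tokens whose concatenation equals the input. `jieba.lcut` should fit that shape, but that pairing is an assumption and is not exercised here. The demo uses a simple regex splitter (`whitespace_tokens`, also a made-up name) so that it runs with the standard library only.

```python
import re


def truncate_on_tokens(text, limit, segment):
    """Return the longest prefix of `text` built from whole tokens, at most `limit` long.

    `segment` is any callable mapping a string to a list of tokens whose
    concatenation is the original string (hypothetical interface).
    """
    out = []
    used = 0
    for token in segment(text):
        if used + len(token) > limit:
            break  # adding this token would exceed the limit
        out.append(token)
        used += len(token)
    # Drop trailing whitespace left over from a separator token.
    return "".join(out).rstrip()


def whitespace_tokens(text):
    # Stand-in segmenter: alternating runs of non-space and space characters.
    return re.findall(r"\S+|\s+", text)


tweet = "the quick brown fox jumps over the lazy dog"
short = truncate_on_tokens(tweet, 12, whitespace_tokens)
print(short)
```

For Chinese text, the whitespace splitter would be swapped for a real word segmenter; passing `list` as `segment` degrades to plain character-level truncation.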

    If what you want is character segmentation, that can be done, although it is not very useful.

    >>> s = u"简讯:新華社報道,美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域,預計約30分鐘後抵達浦東國際機場,開展他上任後首次訪華之旅。"
    >>> chars = list(s)
    >>> chars
    [u'\u7b80', u'\u8baf', u'\uff1a', u'\u65b0', u'\u83ef', u'\u793e', u'\u5831', u'\u9053', u'\uff0c', u'\u7f8e', u'\u570b', u'\u7e3d', u'\u7d71', u'\u5967', u'\u5df4', u'\u99ac', u'\u4e58', u'\u5750', u'\u7684', u'\u300c', u'\u7a7a', u'\u8ecd', u'\u4e00', u'\u865f', u'\u300d', u'\u5c08', u'\u6a5f', u'\u665a', u'\u4e0a', u'1', u'0', u'\u6642', u'4', u'2', u'\u5206', u'\u9032', u'\u5165', u'\u4e0a', u'\u6d77', u'\u7a7a', u'\u57df', u'\uff0c', u'\u9810', u'\u8a08', u'\u7d04', u'3', u'0', u'\u5206', u'\u9418', u'\u5f8c', u'\u62b5', u'\u9054', u'\u6d66', u'\u6771', u'\u570b', u'\u969b', u'\u6a5f', u'\u5834', u'\uff0c', u'\u958b', u'\u5c55', u'\u4ed6', u'\u4e0a', u'\u4efb', u'\u5f8c', u'\u9996', u'\u6b21', u'\u8a2a', u'\u83ef', u'\u4e4b', u'\u65c5', u'\u3002']
    >>> print('/'.join(chars))
    简/讯/:/新/華/社/報/道/,/美/國/總/統/奧/巴/馬/乘/坐/的/「/空/軍/一/號/」/專/機/晚/上/1/0/時/4/2/分/進/入/上/海/空/域/,/預/計/約/3/0/分/鐘/後/抵/達/浦/東/國/際/機/場/,/開/展/他/上/任/後/首/次/訪/華/之/旅/。
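
The session above is Python 2 (note the `u"..."` prefix and the escaped repr). A minimal Python 3 sketch of the same idea follows, on a shortened sample string. It also shows why character-level truncation is rarely what you want: a plain slice can cut a word in half. The variable names and the limit of 6 are chosen only for illustration.

```python
# Python 3: str is already Unicode, so no u"" prefix is needed.
s = "美國總統奧巴馬乘坐的「空軍一號」專機"

chars = list(s)              # one element per code point
joined = "/".join(chars)
print(joined)

# Character-level truncation is just a slice, but it ignores word boundaries.
limit = 6
truncated = s[:limit]
print(truncated)             # ends in the middle of the name 奧巴馬
```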
    
