Python: any way to perform this “hybrid” split() on multi-lingual (e.g. Chinese & English) strings?

后悔当初 2021-01-31 23:06

I have multilingual strings that consist of both languages that use whitespace as a word separator (English, French, etc.) and languages that don't (Chinese, Japanese, Korean

5 answers
  •  情深已故
    2021-01-31 23:39

    This is Glenn's solution, modified to drop symbols and to work with the Russian, French, and other alphabets:

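    import re

    def rec_group_words():
        regex = []

        # Match a whole word (ASCII letters and digits, Latin-1 accented letters, Cyrillic):
        regex += [r'[A-Za-z0-9\xc0-\xff\u0400-\u04ff]+']

        # Match a single CJK character:
        regex += [r'[\u4e00-\ufaff]']

        # Join the alternatives into one pattern and compile it:
        regex = "|".join(regex)
        return re.compile(regex)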
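
A minimal usage sketch of how a pattern like this could be applied with `findall` to get the "hybrid" split. The helper name `hybrid_split` and the sample strings are illustrative and not part of the original answer. The character ranges are an assumption: `A-Za-z` for ASCII letters, `\xc0-\xff` for Latin-1 accented letters, and `\u0400-\u04ff` for Cyrillic.

```python
import re

# Assumed ranges: a run of Latin/Cyrillic letters or digits is one token,
# while every character in U+4E00..U+FAFF is a token of its own.
WORD_OR_CJK = re.compile(r'[A-Za-z0-9\xc0-\xff\u0400-\u04ff]+|[\u4e00-\ufaff]')

def hybrid_split(text):
    """Return whitespace-language words whole and CJK characters one by one."""
    # findall skips everything the pattern does not match (spaces, punctuation).
    return WORD_OR_CJK.findall(text)

print(hybrid_split("Hello мир 你好 café"))
# ['Hello', 'мир', '你', '好', 'café']
```

Because `findall` returns only the matched pieces, punctuation and whitespace are dropped without a separate cleanup step.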
    
