Python: any way to perform this “hybrid” split() on multi-lingual (e.g. Chinese & English) strings?

后悔当初 2021-01-31 23:06

I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean

5 answers
  •  暖寄归人
    2021-01-31 23:26

    This works in Python 3, and it also splits out numbers if you need that.

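    import re

    def spliteKeyWord(text):
        # Match a single character in \u4e00-\ufaff (covers CJK ideographs),
        # a run of digits, or an English word with an optional apostrophe suffix.
        regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+'*[a-z]*"
        return re.findall(regex, text)

    print(spliteKeyWord("Testing English text我爱Python123"))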
    

    => ['Testing', 'English', 'text', '我', '爱', 'Python', '123']
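
    The pattern above only recognizes ASCII letters, so accented words (French) and Japanese kana would be dropped or split oddly. A minimal sketch of one possible variant follows. The name hybrid_split and the character ranges chosen (Hiragana/Katakana, CJK Unified Ideographs, Hangul Syllables) are assumptions for illustration, not part of the original answer, and they do not cover every CJK block.

```python
import re

# Assumed ranges (hypothetical choice, not from the original answer):
# Hiragana + Katakana, CJK Unified Ideographs, Hangul Syllables.
CJK_CHAR = r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]"

# Any Unicode letter that is NOT in the ranges above (so accented Latin works).
LETTER = r"(?:(?!" + CJK_CHAR + r")[^\W\d_])"

# Alternatives are tried in order: one CJK character, a run of digits,
# or a run of other letters with optional apostrophe parts ("don't").
TOKEN_RE = re.compile(
    CJK_CHAR + r"|\d+|" + LETTER + r"+(?:'" + LETTER + r"+)*"
)

def hybrid_split(text):
    """Split text into per-character CJK tokens, words, and numbers."""
    return TOKEN_RE.findall(text)

print(hybrid_split("Testing English text我爱Python123"))
print(hybrid_split("caf\u00e9 don't 日本語かな 한국어 42"))
```

    Whitespace and punctuation are simply skipped, as in the original pattern. Splitting CJK text one character at a time is not real word segmentation; it only mirrors the behavior shown above.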
