How to do a Python split() on languages (like Chinese) that don't use whitespace as a word separator?

梦如初夏 2020-12-03 03:25

I want to split a sentence into a list of words.

For English and other European languages this is easy: just use split().

>>> "This is a sentence.".split()
['This', 'is', 'a', 'sentence.']

9 Answers
  •  星月不相逢
    2020-12-03 04:07

    list() is the answer for sentences that contain only Chinese characters. In most cases, though, the text is a mix of English and Chinese. That case is answered at hybrid-split; the code below is copied from Winter's answer there.

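    import re

    def spliteKeyWord(text):
        # Match one CJK character, a run of digits, or an English word (apostrophes allowed, e.g. "don't").
        regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+'*[a-z]*"
        matches = re.findall(regex, text, re.UNICODE)
        return matches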
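
    As a rough illustration of the difference between the two approaches, here is a minimal sketch. The helper name split_mixed and the sample strings are made up for this example; the pattern is the one quoted above. Note that both approaches split Chinese text into single characters, not into dictionary words.

```python
import re

# Same pattern as in the answer above; TOKEN_PATTERN and split_mixed are
# hypothetical names chosen for this sketch.
TOKEN_PATTERN = re.compile(r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+'*[a-z]*")

def split_mixed(text):
    """Return CJK characters one at a time; keep digit runs and English words whole."""
    return TOKEN_PATTERN.findall(text)

# Chinese-only text: list() yields one character per element.
chinese_only = list("我爱北京")
print(chinese_only)  # ['我', '爱', '北', '京']

# Mixed text: list() would also break "Python" into letters, so use the regex.
mixed = split_mixed("我爱Python 3和don't")
print(mixed)  # ['我', '爱', 'Python', '3', '和', "don't"]
```

    Whitespace and punctuation are simply skipped, because none of the three alternatives in the pattern matches them.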
    
