Python: any way to perform this “hybrid” split() on multi-lingual (e.g. Chinese & English) strings?

后悔当初 2021-01-31 23:06

I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean

5 answers

情深已故 (OP)

2021-01-31 23:39

I modified Glenn's solution to drop symbols and to work with the Russian, French, etc. alphabets:

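import re

def rec_group_words():
    """Return a compiled regex matching whole words or single CJK characters."""
    regex = []

    # Match a whole word: ASCII letters and digits, Latin-1 accented
    # letters (French, etc.), and the basic Cyrillic block (Russian).
    # Note: A-Za-z, not A-z, which would also match [ \ ] ^ _ and `.
    regex += [r'[A-Za-z0-9\xc0-\xff\u0400-\u04ff]+']

    # Match a single CJK character:
    regex += [r'[\u4e00-\ufaff]']

    # Symbols and punctuation match neither alternative, so findall() drops them.
    regex = "|".join(regex)
    return re.compile(regex)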
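
For illustration, here is a minimal usage sketch of the same idea on a mixed string. The names WORD_OR_CJK and hybrid_split are hypothetical, not from the original answer, and the Cyrillic range \u0400-\u04ff is an assumption added so that Russian words match. It relies on Python 3, where the re module understands \u escapes inside raw-string patterns.

```python
import re

# Hypothetical standalone pattern: a run of Latin/Cyrillic letters and
# digits counts as one word, while each CJK character is its own token.
WORD_OR_CJK = re.compile(
    r'[A-Za-z0-9\xc0-\xff\u0400-\u04ff]+'  # whole word (assumed ranges)
    r'|[\u4e00-\ufaff]'                    # single CJK character
)

def hybrid_split(text):
    """Split text into words and single CJK characters, dropping symbols."""
    return WORD_OR_CJK.findall(text)

print(hybrid_split("Testing English text我爱Python, café!"))
# ['Testing', 'English', 'text', '我', '爱', 'Python', 'café']
```

Because findall() returns only what the pattern matches, whitespace and punctuation disappear without a separate filtering step.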
