I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean).
I modified Glenn's solution to drop symbols and to work with Russian, French, and similar alphabets:
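    import re

    def rec_group_words():
        regex = []

        # Match a whole word. A-Za-z replaces the original A-za-z, which also
        # matched [ \ ] ^ _ and `. The Cyrillic range \u0400-\u04ff is added
        # so that Russian words match, as the description intends.
        regex += [r'[A-Za-z0-9\xc0-\xff\u0400-\u04ff]+']

        # Match a single CJK character:
        regex += [r'[\u4e00-\ufaff]']

        regex = "|".join(regex)
        return re.compile(regex)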