Python: any way to perform this “hybrid” split() on multi-lingual (e.g. Chinese & English) strings?

后悔当初 2021-01-31 23:06

I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean).

5 answers
  •  天命终不由人
    2021-01-31 23:33

    Formatting a list shows the repr of its elements. If you want to see the strings naturally rather than escaped, you'll need to format the output yourself. (repr should not be escaping these characters; repr(u'我') should return "u'我'", not "u'\\u6211'". Apparently Python 3 does behave this way; only 2.x is stuck with the English-centric escaping for Unicode strings.)

    A basic algorithm you can use is to assign a character class to each character, then group adjacent characters by class. Starter code is below.

    I didn't use a doctest for this because I hit some odd encoding issues that I don't want to look into (out of scope). You'll need to implement a correct grouping function.

    Note that if you're using this for word wrapping, there are other per-language considerations. For example, you don't want to break on non-breaking spaces; you do want to break on hyphens; for Japanese you don't want to split apart きゅ; and so on.

    # -*- coding: utf-8 -*-
    import itertools, unicodedata
    
    def group_words(s):
        # This is a closure for key(), encapsulated in an array to work around
        # 2.x's lack of the nonlocal keyword.
        sequence = [0x10000000]
    
        def key(part):
            if part.isspace():
                return 0
    
            # This is incorrect, but serves this example; finding a more
            # accurate categorization of characters is up to the user.
            asian = unicodedata.category(part) == "Lo"
            if asian:
                # Never group asian characters, by returning a unique value for each one.
                sequence[0] += 1
                return sequence[0]
    
            return 2
    
        result = []
        for group_key, group in itertools.groupby(s, key):
            # Discard groups of whitespace.
            if group_key == 0:
                continue
    
            # Avoid shadowing the key() function and the built-in str.
            result.append("".join(group))
    
        return result
    
    if __name__ == "__main__":
        # print(...) with a single argument behaves the same on 2.x and 3.x.
        print(group_words(u"Testing English text"))
        print(group_words(u"我爱蟒蛇"))
        print(group_words(u"Testing English text我爱蟒蛇"))
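
    For comparison, here is a rough Python 3 sketch of the same "one token per CJK character, one token per whitespace-delimited run otherwise" idea, written with a regular expression instead of groupby. The name hybrid_split and the choice of Unicode ranges are my own assumptions, not part of the starter code above: the ranges cover only the common Han, Hiragana and Katakana blocks and are deliberately not exhaustive, and whether Hangul belongs in the list is left to you.

```python
import re

# Assumed, non-exhaustive set of "split per character" ranges.
_CJK = (
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
)

# Either a single character from the ranges above, or a run of characters
# that are neither whitespace nor in those ranges.
_TOKEN = re.compile("[{0}]|[^\\s{0}]+".format(_CJK))


def hybrid_split(text):
    """Split on whitespace, but emit each CJK character as its own token."""
    return _TOKEN.findall(text)


if __name__ == "__main__":
    print(hybrid_split("Testing English text我爱蟒蛇"))
```

    This has the same limitations mentioned above: every kana is its own token, so きゅ is still split apart, and punctuation stays attached to the neighbouring Latin word.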
    
