Python: find a series of Chinese characters within a string and apply a function

遥遥无期 2021-01-14 06:55

I've got a series of text that is mostly English, but contains some phrases with Chinese characters. Here are two examples:

s1 = "You say: 你好. I say: 再見"
s         


        
4 answers
  •  醉酒成梦
    2021-01-14 07:36

    You can't get the indexes using re.findall(), because it returns only the matched text. You could use re.finditer() instead, which yields match objects, and refer to m.group(), m.start() and m.end() on each one.
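As a rough sketch of the re.finditer() route, assuming Python 3 and the first example string from the question (the names han_run and spans are illustrative, not from the original post):

```python
import re

s = "You say: 你好. I say: 再見"

# CJK Unified Ideographs block, the same range used in the answer's pattern
han_run = re.compile(r'[\u4e00-\u9fff]+')

# Each match object carries the matched text and its character offsets
spans = [(m.group(), m.start(), m.end()) for m in han_run.finditer(s)]

for text, start, end in spans:
    print(text, start, end)
```

The offsets are character positions in the str, so s[start:end] gives back the matched block.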

    However, for your particular case, it seems more practical to call a function using re.sub().

    From the re.sub() documentation: "If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string."

    Code:

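    # -*- coding: utf-8 -*-
    # Python 2 code
    import re

    s = "You say: 你好. I say: 再見. 答案, my friend, 在風在吹"
    utf_line = s.decode('utf-8')  # byte string -> unicode

    # Lookup table keyed by UTF-8 byte strings
    # (renamed from "dict" so it no longer shadows the built-in)
    translations = {"你好" : "hello",
                    "再見" : "goodbye",
                    "答案" : "The answer",
                    "在風在吹" : "is blowing in the wind",
                   }

    def translate(m):
        # Re-encode the matched unicode block so it matches the table's byte-string keys
        block = m.group().encode('utf-8')
        # Do your translation here

        # this is just an example
        if block in translations:
            return translations[block]
        else:
            return "{unknown}"


    # Pass flags by keyword: the fourth positional argument of re.sub() is count, not flags
    utf_translated = re.sub(ur'[\u4e00-\u9fff]+', translate, utf_line, flags=re.UNICODE)

    print utf_translated.encode('utf-8')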
    

    Output:

    You say: hello. I say: goodbye. The answer, my friend, is blowing in the wind
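The code above targets Python 2 (print statement, ur'' literal, str.decode). A minimal sketch of the same idea for Python 3 might look like the following; the table name translations is an assumption for illustration, and the lookup table is still only a stand-in for a real translation step:

```python
import re

s = "You say: 你好. I say: 再見. 答案, my friend, 在風在吹"

# Toy lookup table standing in for a real translation step
translations = {
    "你好": "hello",
    "再見": "goodbye",
    "答案": "The answer",
    "在風在吹": "is blowing in the wind",
}

def translate(m):
    # m.group() is already a str in Python 3, so no encode/decode is needed
    return translations.get(m.group(), "{unknown}")

# str patterns match Unicode by default in Python 3, so re.UNICODE is not needed
translated = re.sub(r'[\u4e00-\u9fff]+', translate, s)
print(translated)
```

Because Python 3 strings are Unicode throughout, the table keys and the matched blocks compare directly without the encode/decode round trip.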
    
    • Ideone demo
