I've got a series of texts that are mostly English, but contain some phrases with Chinese characters. Here are two examples:
s1 = "You say: 你好. I say: 再見"
s
You can't get the indexes using re.findall(), because it returns only the matched strings. You could use re.finditer() instead: it yields match objects, so you can refer to m.group(), m.start() and m.end().
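As a minimal sketch of the re.finditer() approach (written for Python 3, where str is already Unicode; the names text, cjk_run and spans are illustrative and not part of the question), collecting each Chinese run with its start and end index might look like this:

```python
import re

text = "You say: 你好. I say: 再見"

# One or more consecutive characters from the CJK Unified Ideographs block
cjk_run = re.compile(r'[\u4e00-\u9fff]+')

# Each match object exposes the matched text and its position in the string
spans = [(m.group(), m.start(), m.end()) for m in cjk_run.finditer(text)]

for phrase, start, end in spans:
    print(phrase, start, end)
```

The end index is exclusive, so text[start:end] gives back the matched phrase.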
However, for your particular case, it seems more practical to pass a function to re.sub() as the replacement. From the documentation:
"If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string."
Code:
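# -*- coding: utf-8 -*-
# Python 2 code: the source file must declare its encoding (line above)
import re

s = "You say: 你好. I say: 再見. 答案, my friend, 在風在吹"
utf_line = s.decode('utf-8')

# Lookup table keyed by UTF-8 byte strings (named so it does not shadow the built-in dict)
translations = {"你好" : "hello",
                "再見" : "goodbye",
                "答案" : "The answer",
                "在風在吹" : "is blowing in the wind",
               }

def translate(m):
    # m.group() is the matched unicode run; encode it to compare with the keys
    block = m.group().encode('utf-8')
    # Do your translation here
    # this is just an example
    if block in translations:
        return translations[ block ]
    else:
        return "{unknown}"

# Pass flags by keyword: the fourth positional argument of re.sub() is count
utf_translated = re.sub(ur'[\u4e00-\u9fff]+', translate, utf_line, flags=re.UNICODE)
print utf_translated.encode('utf-8')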
Output:
You say: hello. I say: goodbye. The answer, my friend, is blowing in the wind
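For readers on Python 3, the same idea can be sketched without the decode/encode steps, since str is already Unicode there. This is only an illustrative port under that assumption; the translations table is the same placeholder lookup as above, standing in for a real translation step:

```python
import re

# Placeholder lookup table; a real translator would replace this
translations = {
    "你好": "hello",
    "再見": "goodbye",
    "答案": "The answer",
    "在風在吹": "is blowing in the wind",
}

def translate(match):
    # match.group() is the whole run of Chinese characters that matched
    return translations.get(match.group(), "{unknown}")

s = "You say: 你好. I say: 再見. 答案, my friend, 在風在吹"

# re.sub() calls translate() once per non-overlapping match
translated = re.sub(r'[\u4e00-\u9fff]+', translate, s)
print(translated)
```

Any run of Chinese characters missing from the table is replaced with "{unknown}", matching the behaviour of the original example.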