Question
I'm reading a Unicode stream and would rather not have to pass the entire string through a regex. Is there a simple (reliable) character I can use to break words across languages?
My byte array will most likely be encoded as UTF-16 or UTF-8.
Answer 1:
If you are using Java, you can use java.text.BreakIterator; its word instance reports word boundaries without a regex.
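
As a rough illustration of that suggestion, here is a minimal sketch that walks the boundaries reported by BreakIterator.getWordInstance and keeps only the segments containing a letter or digit. The class name WordBoundaries, the helper words, and the letter-or-digit filter are assumptions made for this example, not part of the original answer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBoundaries {
    // Hypothetical helper: returns the word-like segments of text,
    // skipping segments that are only whitespace or punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        // Each (start, end) pair is one segment between two adjacent boundaries.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Assumption for this sketch: a "word" contains at least one letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH)); // prints [Hello, world]
    }
}
```

BreakIterator works on decoded text rather than raw bytes, so a UTF-8 or UTF-16 byte array would first need to be turned into a String, for example with new String(bytes, StandardCharsets.UTF_8). The boundaries it returns are char (UTF-16 code unit) offsets into that String.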
Source: https://stackoverflow.com/questions/4900408/how-do-i-determine-a-word-boundary-in-unicode-stream-in-c