I have a predefined vocab built from the 3,500 most commonly used Chinese characters. Now I want to tokenize the Dataset with this vocab so that each character maps to a fixed id. Is there any mature solution for this?
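For context, this is roughly the kind of character-level mapping I mean, sketched over a Hugging Face `datasets.Dataset` (the toy `chars` list, the `[PAD]`/`[UNK]` ids, and the `text` column are just placeholders standing in for my real 3,500-character vocab and data):

```python
from datasets import Dataset

# In practice the 3,500-character list would be loaded from a file;
# a tiny sample stands in here.
chars = list("今天气很好你世界")

# Reserve ids 0/1 for [PAD]/[UNK]; the characters themselves start at id 2.
vocab = {"[PAD]": 0, "[UNK]": 1}
vocab.update({ch: i + 2 for i, ch in enumerate(chars)})
unk_id = vocab["[UNK]"]

def encode(batch):
    # One token per character; anything outside the vocab maps to [UNK].
    batch["input_ids"] = [
        [vocab.get(ch, unk_id) for ch in text] for text in batch["text"]
    ]
    return batch

ds = Dataset.from_dict({"text": ["今天天气很好", "你好世界！"]})
ds = ds.map(encode, batched=True)
print(ds[0]["input_ids"])
```

This works, but I'm wondering whether an existing tokenizer library already handles this (fixed char-level vocab, unknown fallback, padding, etc.) so I don't have to maintain it by hand.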